Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Table Input + Text File Output - trailing spaces in output #4303

Open
dave-csc opened this issue Sep 9, 2024 · 4 comments
Open

[Bug]: Table Input + Text File Output - trailing spaces in output #4303

dave-csc opened this issue Sep 9, 2024 · 4 comments

Comments

@dave-csc
Copy link
Contributor

dave-csc commented Sep 9, 2024

Apache Hop version?

2.9.0, 2.10.0 SNAPSHOT

Java version?

openjdk 21.0.2 2024-01-16

Operating system

Linux

What happened?

Steps to reproduce:

  1. Create a template pipeline for Metadata Injection, with a Table Input and a Text File Output transforms
  2. Create a second pipeline, inject in the previous one a query on an Oracle connection and the file in which writing the output

The VARCHAR2/String values are always written in the file with the trailing spaces needed to match their maximum length, regardless you choose or not the flag "Right pad fields?".

A workaround would probably be to use a String operations to trim data manually, but to use that you need to specify the table/view schema (so you can't use it with Metadata Injection)

Issue Priority

Priority: 2

Issue Component

Component: Metadata, Component: Transforms

@dave-csc
Copy link
Contributor Author

dave-csc commented Sep 9, 2024

Update: the same behaviour occurs even without the use of Metadata Injection (i.e. the simplest pipeline Table Input -> Text File Output).

Update # 2: I managed to have a simpler file by selecting the option "Fast data dump", in this case the output has no trailing spaces as expected.

I suggest to revise the functionalities "Right pad fields" and "Fast data dump": I would expect to have just the first one in the UI and select it if I like to have a padded output (or keep it unselected if I don't want to). In the documentation it's not so clear what this sentence means: "Improves the performance when dumping large amounts of data to a text file by not including any formatting information", since a text file has no format information...

@hansva
Copy link
Contributor

hansva commented Sep 9, 2024

I suspect that you have specified a length on the text fields in the text file output, putting the length to 0 will limit the size to the exact length.

The main reason for this is that some people need fixed width fields (usually older systems)

There is a button in the bottom for "minimal width" that will change the settings accordingly

For the trim types the documentation states:

The trimming method to apply on the string. Note: Trimming only works when there is no field length given.

@dave-csc
Copy link
Contributor Author

I suspect that you have specified a length on the text fields in the text file output, putting the length to 0 will limit the size to the exact length.

Actually not: I just created a template pipeline with the two components and injected the SQL query and the file to write (I can't set a schema in Text File Output, since I'm using this template to dump several tables in text files).
When developing the same pipeline as a stand-alone one, you can use the Get Fields button and you get the schema from the database query, including the field lengths.
By the way, lengths are retrieved for Integer values too, but differently from Strings they're not applied in the files with the default options (they occupy the minimum length, even with the option "Right pad fields" enabled).

There is a button in the bottom for "minimal width" that will change the settings accordingly

It works if you set a stand-alone pipeline, but this function can't be set through metadata injection

The main reason for this is that some people need fixed width fields (usually older systems)
[...]
For the trim types the documentation states:

The trimming method to apply on the string. Note: Trimming only works when there is no field length given.

I was referring to the "Fast data dump" option, not the "Trim type" one. And indeed for that reason I think the "Fast data dump" and "Right pad fields" should be merged and corrected: you could only implement a single flag like "Write fixed length fields". When enabled it prints a file with fixed width fields (suitable for older systems), when disabled the file is written in CSV-style...

@hansva
Copy link
Contributor

hansva commented Sep 10, 2024

Fast data dump is not related to fixed width, it just means it will ignore the formatting options you specify on the fields. Eg we will dump the raw float when the field is a number and we don't format it to #.##. You are a bit in an edge case where the size of the strings is coming from the database but you want to ignore those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants