
Not getting all columns while streaming parquet file from S3 or locally downloaded file reading #60

Open
rajan596 opened this issue Dec 30, 2020 · 4 comments

rajan596 commented Dec 30, 2020

Hi,

I am using the standard code shown below, but I am not getting the desired columns in the cursor. Can anyone tell me where the issue is in the library?

The desired columns in the S3 parquet file are A, B, and C, but column A is missing for most of the records. When I validate the same parquet file by downloading it locally and converting it to CSV, the column A value is present for the whole dataset.
Where could this be going wrong?

Version: parquetjs-lite 0.8.0
Node.js version: v8.0.0

// Run inside an async function (top-level await is not available on Node v8):
const parquet = require('parquetjs-lite');

let reader = await parquet.ParquetReader.openS3(s3Client, params);
let cursor = reader.getCursor();
let record = null;
while ((record = await cursor.next())) {
  console.log(record);
}
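
In case it helps narrow this down, here is a minimal diagnostic sketch, assuming the file has been downloaded locally and the missing column is literally named 'A' (both the file path and the column name below are placeholders). It reports how many records lack the column and where the first gap appears:

const parquet = require('parquetjs-lite');

async function findMissingColumn(filePath, column) {
  const reader = await parquet.ParquetReader.openFile(filePath);
  const cursor = reader.getCursor();

  let row = 0;
  let firstMissing = -1;
  let missingCount = 0;
  let record;
  while ((record = await cursor.next())) {
    // Count records where the column is absent, and remember the first one.
    if (record[column] === undefined || record[column] === null) {
      missingCount += 1;
      if (firstMissing < 0) firstMissing = row;
    }
    row += 1;
  }
  await reader.close();
  console.log(`rows=${row} missing=${missingCount} firstMissingAt=${firstMissing}`);
}

// Hypothetical local copy of the S3 file and column name:
findMissingColumn('./data.parquet', 'A').catch(console.error);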

rajan596 changed the title from "Not getting all columns while streaming parquet file from S3" to "Not getting all columns while streaming parquet file from S3 or locally downloaded file reading" on Dec 30, 2020
rajan596 (Author) commented:

Update: the same file is completely readable using the Python pandas module.
The parquet file in S3 was originally generated by Spark/Python.

entitycs commented Jan 24, 2021 via email

rajan596 (Author) commented:

@entitycs I cannot share the parquet file, but there was one pattern common to all the parquet files I tried to read: the field went missing after about 70k rows of data had been read.
I am not sure of the exact data type, but the values were numbers.
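
A field that drops out after a fixed number of rows often lines up with a row-group boundary. As a quick sketch, the reader's internal thrift metadata can show whether the ~70k mark matches a row-group size; note that metadata.row_groups is an undocumented internal field, and './data.parquet' is a placeholder:

const parquet = require('parquetjs-lite');

async function dumpRowGroups(filePath) {
  const reader = await parquet.ParquetReader.openFile(filePath);
  // Raw thrift FileMetaData; num_rows may be an Int64 object, hence String().
  reader.metadata.row_groups.forEach((rg, i) => {
    console.log(`row group ${i}: rows=${String(rg.num_rows)} ` +
                `columnChunks=${rg.columns.length}`);
  });
  await reader.close();
}

dumpRowGroups('./data.parquet').catch(console.error);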

entitycs commented:

@rajan596 When you get a chance, can you try reverting to this commit and attempting to read past the 70k mark again?

5277eb8

Without being able to see the data, my hunch is that either the numbers in the field grow and a compression algorithm not in the lite version takes a different code path, or there is an issue in lib/shred.js, which uses different iteration methods between HEAD and the above commit.
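
To probe the compression/encoding hunch without sharing the data, a sketch along these lines could print the codec and encodings used by each column chunk of the suspect field. It leans on the reader's internal thrift metadata rather than a documented API, and the column name 'A' and file path are stand-ins:

const parquet = require('parquetjs-lite');

async function dumpColumnChunks(filePath, column) {
  const reader = await parquet.ParquetReader.openFile(filePath);
  reader.metadata.row_groups.forEach((rg, i) => {
    rg.columns.forEach((cc) => {
      const md = cc.meta_data;
      if (md.path_in_schema.join('.') !== column) return;
      // codec and encodings print as parquet-format enum numbers.
      console.log(`row group ${i}: codec=${md.codec} ` +
                  `encodings=${JSON.stringify(md.encodings)} ` +
                  `values=${String(md.num_values)}`);
    });
  });
  await reader.close();
}

dumpColumnChunks('./data.parquet', 'A').catch(console.error);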
