
Not getting all columns while streaming parquet file from S3 or locally downloaded file reading #60

Open
rajan596 opened this issue Dec 30, 2020 · 4 comments

rajan596 commented Dec 30, 2020

Hi,

I am using the standard code shown below, but I am not getting the desired columns in the cursor. Can anyone tell me where the issue is in the library?

The desired columns in the S3 parquet file are A, B, and C, but column A is missing for most of the records. When I validate the same parquet file by downloading it locally and converting it to CSV, the column A value is present for the whole dataset.
Where could this be going wrong?

Version: parquetjs-lite 0.8.0
Node.js version: v8.0.0

// Run inside an async function (top-level await is not available on Node v8):
const parquet = require('parquetjs-lite');

let reader = await parquet.ParquetReader.openS3(s3Client, params);
let cursor = reader.getCursor();
let record = null;
while ((record = await cursor.next())) {
  console.log(record);
}
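
In case it helps narrow this down, here is a minimal diagnostic sketch, assuming the file has been downloaded locally and the missing column is literally named 'A' (both the file path and the column name below are placeholders). It reports how many records lack the column and where the first gap appears:

const parquet = require('parquetjs-lite');

async function findMissingColumn(filePath, column) {
  const reader = await parquet.ParquetReader.openFile(filePath);
  const cursor = reader.getCursor();

  let row = 0;
  let firstMissing = -1;
  let missingCount = 0;
  let record;
  while ((record = await cursor.next())) {
    // Count records where the column is absent, and remember the first one.
    if (record[column] === undefined || record[column] === null) {
      missingCount += 1;
      if (firstMissing < 0) firstMissing = row;
    }
    row += 1;
  }
  await reader.close();
  console.log(`rows=${row} missing=${missingCount} firstMissingAt=${firstMissing}`);
}

// Hypothetical local copy of the S3 file and column name:
findMissingColumn('./data.parquet', 'A').catch(console.error);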

rajan596 changed the title from "Not getting all columns while streaming parquet file from S3" to "Not getting all columns while streaming parquet file from S3 or locally downloaded file reading" on Dec 30, 2020
rajan596 (Author) commented:

Update: the same file is completely readable using the Python pandas module.
The parquet file in S3 was originally generated by Spark/Python.

entitycs commented Jan 24, 2021 via email

rajan596 (Author) commented:

@entitycs I cannot share the parquet file, but there was one pattern common to all the parquet files I tried to read: the field went missing after about 70k rows of data had been read.
I am not sure of the exact data type, but the values were numbers.
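
A field that drops out after a fixed number of rows often lines up with a row-group boundary. As a quick sketch, the reader's internal thrift metadata can show whether the ~70k mark matches a row-group size; note that metadata.row_groups is an undocumented internal field, and './data.parquet' is a placeholder:

const parquet = require('parquetjs-lite');

async function dumpRowGroups(filePath) {
  const reader = await parquet.ParquetReader.openFile(filePath);
  // Raw thrift FileMetaData; num_rows may be an Int64 object, hence String().
  reader.metadata.row_groups.forEach((rg, i) => {
    console.log(`row group ${i}: rows=${String(rg.num_rows)} ` +
                `columnChunks=${rg.columns.length}`);
  });
  await reader.close();
}

dumpRowGroups('./data.parquet').catch(console.error);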

entitycs commented:

@rajan596 When you get a chance, can you try reverting to this commit and attempting to read past the 70k mark again?

5277eb8

Without being able to see the data, my hunch is that either the numbers in the field grow and a compression algorithm not in the lite version takes a different code path, or there is an issue in lib/shred.js, which uses different iteration methods between HEAD and the above commit.
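
To probe the compression/encoding hunch without sharing the data, a sketch along these lines could print the codec and encodings used by each column chunk of the suspect field. It leans on the reader's internal thrift metadata rather than a documented API, and the column name 'A' and file path are stand-ins:

const parquet = require('parquetjs-lite');

async function dumpColumnChunks(filePath, column) {
  const reader = await parquet.ParquetReader.openFile(filePath);
  reader.metadata.row_groups.forEach((rg, i) => {
    rg.columns.forEach((cc) => {
      const md = cc.meta_data;
      if (md.path_in_schema.join('.') !== column) return;
      // codec and encodings print as parquet-format enum numbers.
      console.log(`row group ${i}: codec=${md.codec} ` +
                  `encodings=${JSON.stringify(md.encodings)} ` +
                  `values=${String(md.num_values)}`);
    });
  });
  await reader.close();
}

dumpColumnChunks('./data.parquet', 'A').catch(console.error);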
