Unable to view new .parquet files using the new "Preview" function #6990
Comments
The library we use may have limitations and may not support all data types. Let me see if your file runs into a limitation or we have a bug in our code. Edit: I just checked. We are using a version of the library that doesn't support the DECIMAL data type. A newer version of the library adds support for it. I will update the library and test it in 1.31.0. Also, could you share the schema of the data that triggers the buffer.readUInt32LE error? We would like to know the structure of the data (e.g. field names, nesting structure and data types). You can redact the actual field names to remove any personal information. |
Table info attached... |
having a similar issue, here is the error info, if requested i can also share a file |
what about other data types? i get a similar issue, but for Double { |
For the |
I got the The parquet was created from Azure Data Factory (ADF). DuckDB Schema Information
Pyarrow Schema Information
ADF Schema of Copy Task
|
@mithom Would you like to share a file with us? Please make sure you don't accidentally share any sensitive information with us. If you don't want to post it here, you can email it to sehelp at microsoft dot com. Thanks. |
i will ask my manager, but it shouldn't be hard to generate a file with some noise data in it. We don't handle user data so gdpr wise everything is safe anyways |
Same issue, part of a delta lake created by Databricks using their default writer settings in Unity Catalog Managed Tables.
Attached the data file. There is nothing sensitive in it. part-00004-7690cf81-ffe2-49dc-aaa2-9e622a3b3f5a.c000.snappy.parquet.zip |
A fix has been made in the library. I made a few private builds using a private release including the fix. I tested with some shared files and they seem to solve most of the problems, except for yours @tobimax. The library doesn't support decimal data types with precision > 18. You may try to install the private build and see if it works in your use case. |
Didn't work for me due to fields defined as (19,4) |
Hi, @JasonYeMSFT , The Decimals were defined as decimal(38,14) - is it defined differently in the parquet file structure? |
I'm receiving this as well using 1.30.0 |
I have found another issue on 1.30.0 "o.buffer.readBigInt64LE is not a function" {
"name": "TypeError",
"message": "o.buffer.readBigInt64LE is not a function",
"stack": "TypeError: o.buffer.readBigInt64LE is not a function
at jR (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736182)
at Object.QR (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736230)
at Tn (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:764741)
at rD (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:766114)
at async sb (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:765449)
at async fb (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:767625)
at async Mu.readRowGroup (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:763489)
at async Object.next (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:756357)
at async ParquetParser.getRecords (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:5046)
at async getParquetData (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:6298)"
} |
I also get the same exception. In my opinion, it would be a good idea to support a "Schema" feature in addition to the "Preview" feature. |
@mithom @LoHertel @ddepante @manishj216 @daviewales @andrePKI @gpcottle @cf-dtrznadel @ghormann @royalh13 @tobimax |
I believe I'm using the official release, but I'm not sure how to be 100% confident. |
I experienced it with the official 1.30.0 version. |
Strange to announce such a feature and it doesn't even work with files generated by Azure Data Factory. The private build does work. |
The private build works, although it appears to want to download the whole Parquet file before rendering it, rather than just loading the top 100 rows. |
We always attempt to download the whole parquet file before rendering it. As far as I know, the data in parquet files is stored column-wise, which means you may need to read until the end of the file to get the first row. If you are running into performance issues because of that, please open a new issue and we can think about potential optimizations. |
Is it possible to read the last n bytes of a file in Azure storage? For example, here is the parquet metadata from the NYC Taxi Trips for Yellow Taxi Trips data January 2023:
You can see it includes the file offset information for each column and row group. (Only contains a single row group due to the small size of the file.) I accessed this metadata from the parquet file using duckdb: select * from parquet_metadata("yellow_tripdata_2023-01.parquet") |
No, that is not true. Fortunately, Parquet files are splittable (that's why we use them for distributed computing). As mentioned by https://github.com/daviewales, it would also be very useful to support, in addition to an extract of the data, the schema and the metadata (especially since each row group's metadata contains the "column metadata" used by predicate pushdown). See https://parquet.apache.org/docs/file-format/. You can get the offset of any row group and read each one fully independently of the others. |
Well, @sbarnoud and @daviewales "Metadata is written after the data to allow for single pass writing", so how do you get to the metadata without reading the last bit of the file? If you need to seek into the file (without reading it actually) you would need to know where to seek to, so still a search to find where the metadata starts. So I don't think it is strange to download the whole file before rendering it. |
Look at parquet-cli in the Parquet GitHub repo: this tool scans and prints all of that metadata. https://github.com/apache/parquet-format#file-format Look for example here
When using parquet-cli to read a file on ADLS (or HDFS), you will see that it doesn't download the whole file, just pieces of it. |
You will find all the needed info here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java This is how any distributed system is able to read slices of a Parquet file and avoid downloading the full file on each worker
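To make the point about the footer concrete, here is a minimal Python sketch of how a reader can locate the metadata with two small range reads instead of a full download. It relies only on the documented file layout: a plain Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic "PAR1". The `read_range(offset, length)` callable is a hypothetical stand-in for whatever ranged read the storage layer offers (for example an HTTP Range request); it is not part of Storage Explorer or of any Parquet library, and files with an encrypted footer are not handled here.

```python
import struct
from typing import Callable, Tuple

PARQUET_MAGIC = b"PAR1"  # 4-byte magic at both ends of a plain Parquet file
TAIL_SIZE = 8            # 4-byte footer length + 4-byte trailing magic


def footer_range(file_size: int, tail: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the footer metadata, given the last 8 bytes."""
    if len(tail) != TAIL_SIZE:
        raise ValueError("expected exactly the last 8 bytes of the file")
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[:4])  # little-endian uint32
    offset = file_size - TAIL_SIZE - footer_len
    if offset < len(PARQUET_MAGIC):  # the leading magic occupies bytes 0..3
        raise ValueError("footer length is larger than the file allows")
    return offset, footer_len


def read_footer(read_range: Callable[[int, int], bytes], file_size: int) -> bytes:
    """Fetch the raw footer bytes using two ranged reads.

    `read_range` is a hypothetical callable: (offset, length) -> bytes.
    """
    tail = read_range(file_size - TAIL_SIZE, TAIL_SIZE)
    offset, length = footer_range(file_size, tail)
    return read_range(offset, length)


def make_fake_file(footer: bytes, body_size: int = 1000) -> bytes:
    """Build an in-memory blob with the same framing as a Parquet file (demo only)."""
    body = b"\x00" * body_size
    return PARQUET_MAGIC + body + footer + struct.pack("<I", len(footer)) + PARQUET_MAGIC


if __name__ == "__main__":
    blob = make_fake_file(b"fake-thrift-metadata")
    bytes_fetched = []

    def ranged(offset: int, length: int) -> bytes:
        bytes_fetched.append(length)
        return blob[offset:offset + length]

    print(read_footer(ranged, len(blob)), "fetched", sum(bytes_fetched), "of", len(blob), "bytes")
```

In a real file the returned bytes are Thrift-encoded file metadata, which is where the row group and column chunk offsets live; decoding it is out of scope for this sketch.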
|
Ok, I get it. |
I will look into the possibility of reading partial data. Just to reemphasize, please let us know whether or not the private build resolves your "buffer.readXXX is not a function" type of error. |
We have published 1.30.1 which takes the library fix for the "buffer.readXXX is not a function" type of bug. |
Previewing parquet files is functioning as expected now, thank you sooooo much!!! |
Seems I'm late to this party, apologies! I too appreciate the idea of being able to read .parquet files in ASE, yet am not seeing the expected results, yet. |
Please open a new issue for this, and yes, being able to provide an example file is a HUGE help. :) |
@tobimax We integrated a library fix that would unblock your scenario by letting the decimal fields that are represented by byte arrays, which may have precision > 18, pass the parser validation. However, the library doesn't support parsing the byte arrays into human readable formats right now so those fields will be presented as raw buffer values. This fix will be shipped in 1.32.0. |
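For anyone who needs to interpret those raw buffer values in the meantime, here is a minimal Python sketch. It assumes the encoding described in the Parquet format documentation for DECIMAL stored in byte arrays: the unscaled value as a big-endian two's-complement integer, with the scale taken from the column's schema. The function name `decode_parquet_decimal` is made up for illustration; this is not the library's API or how Storage Explorer renders the field.

```python
from decimal import Decimal


def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a byte-array DECIMAL value into an exact Decimal.

    Assumes `raw` is the big-endian two's-complement unscaled integer and
    `scale` is the number of digits after the decimal point (from the schema).
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # The tuple constructor is exact, so 38-digit values are not rounded
    # to the default 28-digit arithmetic context.
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    # A decimal(38,14) column is typically stored in 16 bytes.
    raw = (123450000000000000).to_bytes(16, byteorder="big", signed=True)
    print(decode_parquet_decimal(raw, 14))
```

Building the Decimal from its sign, digits and exponent avoids arithmetic on the value, which matters for columns declared with precision above 28.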
Preflight Checklist
Storage Explorer Version
1.30.0
Regression From
No response
Architecture
x64
Storage Explorer Build Number
20230609.2
Platform
Windows
OS Version
Windows 10 Enterprise - Version 10.0.19045 Build 19045
Bug Description
I tried to view the contents of multiple .parquet files by using the new "Preview" feature, but Storage Explorer errored out on all the files I tested.
Below is the error text from the 2 types of errors I'm seeing:
{
"name": "TypeError",
"message": "o.buffer.readUInt32LE is not a function",
"stack": "TypeError: o.buffer.readUInt32LE is not a function\n at ZR (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736826)\n at Object.QR (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736928)\n at Tn (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:764741)\n at rD (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:766114)\n at async sb (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:765449)\n at async fb (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:767625)\n at async Mu.readRowGroup (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:763489)\n at async Object.next (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:756357)\n at async ParquetParser.getRecords 
(file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:5046)\n at async getParquetData (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:6298)"
}
{
"name": null,
"message": ""Invalid parquet type: DECIMAL, for Column: AjeraBilledConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraBilledLabor\nInvalid parquet type: DECIMAL, for Column: AjeraBilledReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraCostConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraCostLabor\nInvalid parquet type: DECIMAL, for Column: AjeraCostReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraReceivedConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraReceivedLabor\nInvalid parquet type: DECIMAL, for Column: AjeraReceivedReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraSpentConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraSpentLabor\nInvalid parquet type: DECIMAL, for Column: AjeraSpentReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraWIPConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraWIPLabor\nInvalid parquet type: DECIMAL, for Column: AjeraWIPReimbursable\nInvalid parquet type: DECIMAL, for Column: BillingExchangeRate\nInvalid parquet type: DECIMAL, for Column: BudOHRate\nInvalid parquet type: DECIMAL, for Column: ConsultFee\nInvalid parquet type: DECIMAL, for Column: ConsultFeeBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ConsultFeeFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: ExpPctComp\nInvalid parquet type: DECIMAL, for Column: FEAddlExpenses\nInvalid parquet type: DECIMAL, for Column: FEAddlExpensesPct\nInvalid parquet type: DECIMAL, for Column: Fee\nInvalid parquet type: DECIMAL, for Column: FeeBillingCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirExp\nInvalid parquet type: DECIMAL, for Column: FeeDirExpBillingCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirExpFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirLab\nInvalid parquet type: DECIMAL, for Column: FeeDirLabBillingCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirLabFunctionalCurrency\nInvalid parquet type: DECIMAL, 
for Column: FeeFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: FEOther\nInvalid parquet type: DECIMAL, for Column: FEOtherPct\nInvalid parquet type: DECIMAL, for Column: FESurcharge\nInvalid parquet type: DECIMAL, for Column: FESurchargePct\nInvalid parquet type: DECIMAL, for Column: FirmCost\nInvalid parquet type: DECIMAL, for Column: ICBillingExpMult\nInvalid parquet type: DECIMAL, for Column: ICBillingLabMult\nInvalid parquet type: DECIMAL, for Column: LabPctComp\nInvalid parquet type: DECIMAL, for Column: MultAmt\nInvalid parquet type: DECIMAL, for Column: PctComp\nInvalid parquet type: DECIMAL, for Column: POCNSRate\nInvalid parquet type: DECIMAL, for Column: PORMBRate\nInvalid parquet type: DECIMAL, for Column: ProjectExchangeRate\nInvalid parquet type: DECIMAL, for Column: ReimbAllow\nInvalid parquet type: DECIMAL, for Column: ReimbAllowBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowCons\nInvalid parquet type: DECIMAL, for Column: ReimbAllowConsBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowConsFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowExp\nInvalid parquet type: DECIMAL, for Column: ReimbAllowExpBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowExpFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: TotalProjectCost\nInvalid parquet type: DECIMAL, for Column: XChargeMult""
}
Steps to Reproduce
1. Launch Storage Explorer
2. Drill down to the appropriate folder containing the .parquet files
3. Select the .parquet file
4. Push the "Preview" button
Actual Experience
I launched Storage Explorer.
Then I drilled down to the appropriate folder containing the .parquet file I wanted to preview
Then I selected the .parquet file
Then I pushed the "Preview" button
Expected Experience
I expected a preview of the data within the .parquet file.
Additional Context
No response