
Unable to view new .parquet files using the new "Preview" function #6990

Closed
3 tasks done
royalh13 opened this issue Jun 15, 2023 · 33 comments
Labels: ❔ external Root cause of this issue is in another component, product, or service

@royalh13

Preflight Checklist

Storage Explorer Version

1.30.0

Regression From

No response

Architecture

x64

Storage Explorer Build Number

20230609.2

Platform

Windows

OS Version

Windows 10 Enterprise - Version 10.0.19045 Build 19045

Bug Description

I tried to view the contents of multiple .parquet files by using the new "Preview" feature, but Storage Explorer errored out on all the files I tested.

Below is the error text from the 2 types of errors I'm seeing:

{
"name": "TypeError",
"message": "o.buffer.readUInt32LE is not a function",
"stack": "TypeError: o.buffer.readUInt32LE is not a function\n at ZR (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736826)\n at Object.QR (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736928)\n at Tn (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:764741)\n at rD (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:766114)\n at async sb (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:765449)\n at async fb (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:767625)\n at async Mu.readRowGroup (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:763489)\n at async Object.next (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:756357)\n at async ParquetParser.getRecords 
(file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:5046)\n at async getParquetData (file:///C:/Users/My.Name/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:6298)"
}

{
"name": null,
"message": ""Invalid parquet type: DECIMAL, for Column: AjeraBilledConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraBilledLabor\nInvalid parquet type: DECIMAL, for Column: AjeraBilledReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraCostConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraCostLabor\nInvalid parquet type: DECIMAL, for Column: AjeraCostReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraReceivedConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraReceivedLabor\nInvalid parquet type: DECIMAL, for Column: AjeraReceivedReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraSpentConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraSpentLabor\nInvalid parquet type: DECIMAL, for Column: AjeraSpentReimbursable\nInvalid parquet type: DECIMAL, for Column: AjeraWIPConsultant\nInvalid parquet type: DECIMAL, for Column: AjeraWIPLabor\nInvalid parquet type: DECIMAL, for Column: AjeraWIPReimbursable\nInvalid parquet type: DECIMAL, for Column: BillingExchangeRate\nInvalid parquet type: DECIMAL, for Column: BudOHRate\nInvalid parquet type: DECIMAL, for Column: ConsultFee\nInvalid parquet type: DECIMAL, for Column: ConsultFeeBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ConsultFeeFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: ExpPctComp\nInvalid parquet type: DECIMAL, for Column: FEAddlExpenses\nInvalid parquet type: DECIMAL, for Column: FEAddlExpensesPct\nInvalid parquet type: DECIMAL, for Column: Fee\nInvalid parquet type: DECIMAL, for Column: FeeBillingCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirExp\nInvalid parquet type: DECIMAL, for Column: FeeDirExpBillingCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirExpFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirLab\nInvalid parquet type: DECIMAL, for Column: FeeDirLabBillingCurrency\nInvalid parquet type: DECIMAL, for Column: FeeDirLabFunctionalCurrency\nInvalid parquet type: DECIMAL, 
for Column: FeeFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: FEOther\nInvalid parquet type: DECIMAL, for Column: FEOtherPct\nInvalid parquet type: DECIMAL, for Column: FESurcharge\nInvalid parquet type: DECIMAL, for Column: FESurchargePct\nInvalid parquet type: DECIMAL, for Column: FirmCost\nInvalid parquet type: DECIMAL, for Column: ICBillingExpMult\nInvalid parquet type: DECIMAL, for Column: ICBillingLabMult\nInvalid parquet type: DECIMAL, for Column: LabPctComp\nInvalid parquet type: DECIMAL, for Column: MultAmt\nInvalid parquet type: DECIMAL, for Column: PctComp\nInvalid parquet type: DECIMAL, for Column: POCNSRate\nInvalid parquet type: DECIMAL, for Column: PORMBRate\nInvalid parquet type: DECIMAL, for Column: ProjectExchangeRate\nInvalid parquet type: DECIMAL, for Column: ReimbAllow\nInvalid parquet type: DECIMAL, for Column: ReimbAllowBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowCons\nInvalid parquet type: DECIMAL, for Column: ReimbAllowConsBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowConsFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowExp\nInvalid parquet type: DECIMAL, for Column: ReimbAllowExpBillingCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowExpFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: ReimbAllowFunctionalCurrency\nInvalid parquet type: DECIMAL, for Column: TotalProjectCost\nInvalid parquet type: DECIMAL, for Column: XChargeMult""
}

Steps to Reproduce

Launch Storage Explorer
Drill down to the appropriate folder containing the .parquet files
Select the .parquet file
Push the "Preview" button

Actual Experience

I launched Storage Explorer.
Then I drilled down to the appropriate folder containing the .parquet file I wanted to preview.
Then I selected the .parquet file.
Then I pushed the "Preview" button.

Expected Experience

I expected a preview of the data within the .parquet file.

Storage Explorer Errors

Additional Context

No response

@JasonYeMSFT JasonYeMSFT self-assigned this Jun 15, 2023
@JasonYeMSFT JasonYeMSFT added the ❔ investigate We need to look into this further label Jun 15, 2023
@JasonYeMSFT
Copy link
Contributor

JasonYeMSFT commented Jun 15, 2023

The library we use may have limitations and may not support all data types. Let me check whether your file runs into a limitation or whether we have a bug in our code.

Edit: I just checked. We are using a version of the library that doesn't support the DECIMAL data type. A newer version of the library added support for it. I will update the library and test it in 1.31.0.
https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.2.3

Also, could you share with us the schema of the data that triggers the buffer.readUInt32LE error? We would like to know the structure of the data (e.g. field names, nesting structure, and data types). You can redact the actual field names to protect your personal information.

@JasonYeMSFT JasonYeMSFT added this to the 1.31.0 milestone Jun 15, 2023
@JasonYeMSFT JasonYeMSFT added ❔ external Root cause of this issue is in another component, product, or service and removed ❔ investigate We need to look into this further labels Jun 15, 2023
@royalh13
Author

Table info attached...
Table Info.txt

@mithom

mithom commented Jun 16, 2023

I'm having a similar issue; here is the error info. If requested, I can also share a file.
perror.txt

@indexample

What about other data types? I get a similar issue, but for Double:

{
"name": "TypeError",
"message": "o.buffer.readDoubleLE is not a function",
"stack": "TypeError: o.buffer.readDoubleLE is not a function\n at GR

@JasonYeMSFT
Contributor

The readX is not a function errors are likely due to a bug in the library: it doesn't convert the uncompressed data to a Buffer correctly. I'll try to get someone to fix it in the library.
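The failure mode described above can be reproduced in a few lines of Node. This is a minimal sketch of the suspected cause, not the library's actual code: a plain Uint8Array (what a decompression step often yields) has none of Buffer's read helpers, and wrapping the same bytes with Buffer.from restores them.

```javascript
// A plain Uint8Array lacks Node's Buffer read helpers, which produces
// errors like "readUInt32LE is not a function".
const raw = new Uint8Array([1, 0, 0, 0]);
console.log(typeof raw.readUInt32LE); // "undefined"

// Wrapping the same underlying bytes in a Buffer (zero-copy) fixes it.
const buf = Buffer.from(raw.buffer, raw.byteOffset, raw.byteLength);
console.log(buf.readUInt32LE(0)); // 1
```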

@LoHertel

LoHertel commented Jun 19, 2023

I got the o.buffer.readUInt32LE is not a function error as well.

The parquet file was created by Azure Data Factory (ADF).
I believe the only numeric column is ExtractDate.
According to DuckDB it's of the INT96 type; according to Pyarrow, it's of type timestamp[ns].
The column was exported from ADF as type DateTime.
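For context on INT96: the Impala/Spark convention stores a timestamp as 8 bytes of nanoseconds-within-day followed by a 4-byte little-endian Julian day number, which matches the ExtractDate column in the schema dumps below. A hedged sketch of decoding it (the helper name is mine, not from any library):

```javascript
// Decode an Impala/Spark-style INT96 timestamp: 8 bytes nanoseconds-of-day
// (little-endian) followed by a 4-byte little-endian Julian day number.
// int96ToUnixMillis is a hypothetical helper, not part of any library here.
const JULIAN_UNIX_EPOCH = 2440588n; // Julian day number of 1970-01-01

function int96ToUnixMillis(buf) {
  const nanosOfDay = buf.readBigUInt64LE(0);
  const julianDay = BigInt(buf.readUInt32LE(8));
  const days = julianDay - JULIAN_UNIX_EPOCH;
  return Number(days * 86_400_000n + nanosOfDay / 1_000_000n);
}
```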

DuckDB Schema Information

     file_name      name                        type          type_length  repetition_type    num_children     converted_type    scale    precision    field_id    logical_type
0    abc.parquet    adms_schema                 BOOLEAN       0            REQUIRED           33               UTF8              0        0            0            None
1    abc.parquet    user_id                     BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
2    abc.parquet    displayName                 BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
3    abc.parquet    surname                     BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
4    abc.parquet    givenName                   BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
5    abc.parquet    employeeId                  BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
6    abc.parquet    mail                        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
7    abc.parquet    jobTitle                    BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
8    abc.parquet    department                  BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
9    abc.parquet    companyName                 BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
10   abc.parquet    onPremisesDomainName        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
11   abc.parquet    onPremisesSamAccountName    BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
12   abc.parquet    accountEnabled              BOOLEAN       0            OPTIONAL           0                UTF8              0        0            0            None
13   abc.parquet    userType                    BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
14   abc.parquet    extensionAttribute1         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
15   abc.parquet    extensionAttribute2         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
16   abc.parquet    extensionAttribute3         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
17   abc.parquet    extensionAttribute4         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
18   abc.parquet    extensionAttribute5         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
19   abc.parquet    extensionAttribute6         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
20   abc.parquet    extensionAttribute7         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
21   abc.parquet    extensionAttribute8         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
22   abc.parquet    extensionAttribute9         BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
23   abc.parquet    extensionAttribute10        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
24   abc.parquet    extensionAttribute11        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
25   abc.parquet    extensionAttribute12        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
26   abc.parquet    extensionAttribute13        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
27   abc.parquet    extensionAttribute14        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
28   abc.parquet    extensionAttribute15        BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
29   abc.parquet    costCenter                  BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
30   abc.parquet    division                    BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
31   abc.parquet    manager_id                  BYTE_ARRAY    0            OPTIONAL           0                UTF8              0        0            0            None
32   abc.parquet    ExtractDate                 INT96         0            OPTIONAL           0                UTF8              0        0            0            None

Pyarrow Schema Information

user_id: string
displayName: string
surname: string
givenName: string
employeeId: string
mail: string
jobTitle: string
department: string
companyName: string
onPremisesDomainName: string
onPremisesSamAccountName: string
accountEnabled: bool
userType: string
extensionAttribute1: string
extensionAttribute2: string
extensionAttribute3: string
extensionAttribute4: string
extensionAttribute5: string
extensionAttribute6: string
extensionAttribute7: string
extensionAttribute8: string
extensionAttribute9: string
extensionAttribute10: string
extensionAttribute11: string
extensionAttribute12: string
extensionAttribute13: string
extensionAttribute14: string
extensionAttribute15: string
costCenter: string
division: string
manager_id: string
ExtractDate: timestamp[ns]
-- schema metadata --
writer.model.name: 'example'

ADF Schema of Copy Task

        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                {
                    "source": {
                        "path": "['id']"
                    },
                    "sink": {
                        "name": "user_id",
                        "type": "Guid"
                    }
                },
                {
                    "source": {
                        "path": "['displayName']"
                    },
                    "sink": {
                        "name": "displayName",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['surname']"
                    },
                    "sink": {
                        "name": "surname",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['givenName']"
                    },
                    "sink": {
                        "name": "givenName",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['employeeId']"
                    },
                    "sink": {
                        "name": "employeeId",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['mail']"
                    },
                    "sink": {
                        "name": "mail",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['jobTitle']"
                    },
                    "sink": {
                        "name": "jobTitle",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['department']"
                    },
                    "sink": {
                        "name": "department",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['companyName']"
                    },
                    "sink": {
                        "name": "companyName",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesDomainName']"
                    },
                    "sink": {
                        "name": "onPremisesDomainName",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesSamAccountName']"
                    },
                    "sink": {
                        "name": "onPremisesSamAccountName",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['accountEnabled']"
                    },
                    "sink": {
                        "name": "accountEnabled",
                        "type": "Boolean"
                    }
                },
                {
                    "source": {
                        "path": "['userType']"
                    },
                    "sink": {
                        "name": "userType",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute1']"
                    },
                    "sink": {
                        "name": "extensionAttribute1",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute2']"
                    },
                    "sink": {
                        "name": "extensionAttribute2",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute3']"
                    },
                    "sink": {
                        "name": "extensionAttribute3",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute4']"
                    },
                    "sink": {
                        "name": "extensionAttribute4",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute5']"
                    },
                    "sink": {
                        "name": "extensionAttribute5",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute6']"
                    },
                    "sink": {
                        "name": "extensionAttribute6",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute7']"
                    },
                    "sink": {
                        "name": "extensionAttribute7",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute8']"
                    },
                    "sink": {
                        "name": "extensionAttribute8",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute9']"
                    },
                    "sink": {
                        "name": "extensionAttribute9",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute10']"
                    },
                    "sink": {
                        "name": "extensionAttribute10",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute11']"
                    },
                    "sink": {
                        "name": "extensionAttribute11",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute12']"
                    },
                    "sink": {
                        "name": "extensionAttribute12",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute13']"
                    },
                    "sink": {
                        "name": "extensionAttribute13",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute14']"
                    },
                    "sink": {
                        "name": "extensionAttribute14",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['onPremisesExtensionAttributes']['extensionAttribute15']"
                    },
                    "sink": {
                        "name": "extensionAttribute15",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['employeeOrgData']['costCenter']"
                    },
                    "sink": {
                        "name": "costCenter",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['employeeOrgData']['division']"
                    },
                    "sink": {
                        "name": "division",
                        "type": "String"
                    }
                },
                {
                    "source": {
                        "path": "['manager']['id']"
                    },
                    "sink": {
                        "name": "manager_id",
                        "type": "Guid"
                    }
                },
                {
                    "source": {
                        "path": "$['ExtractDate']"
                    },
                    "sink": {
                        "name": "ExtractDate",
                        "type": "DateTime"
                    }
                }
            ],
        }

@JasonYeMSFT
Contributor

@mithom Would you like to share a file with us? Please make sure you don't accidentally share any sensitive information with us. If you don't want to post it here, you can email it to sehelp at microsoft dot com. Thanks.

@mithom

mithom commented Jun 20, 2023

I will ask my manager, but it shouldn't be hard to generate a file with some noise data in it. We don't handle user data, so GDPR-wise everything is safe anyway.

@tobimax

tobimax commented Jun 22, 2023

Same issue, part of a delta lake created by Databricks using their default writer settings in Unity Catalog Managed Tables.

{
  "name": null,
  "message": "\"Invalid parquet type: DECIMAL, for Column: volume_sold_in_millilitres\\nInvalid parquet type: DECIMAL, for Column: weight_sold_in_grams\\nInvalid parquet type: DECIMAL, for Column: units_sold_by_volume\\nInvalid parquet type: DECIMAL, for Column: units_sold_by_weight\""
}

I've attached the data file; there's nothing sensitive in it.

part-00004-7690cf81-ffe2-49dc-aaa2-9e622a3b3f5a.c000.snappy.parquet.zip

@JasonYeMSFT
Contributor

A fix has been made in the library. I made a few private builds from a private release that includes the fix. I tested with some of the shared files, and they seem to solve most of the problems, except for yours, @tobimax. The library doesn't support decimal data types with precision > 18.
win x64
mac x64
mac arm64
linux

You may try to install the private build and see if it works in your use case.

@royalh13
Author

Didn't work for me due to fields defined as (19,4)

@tobimax

tobimax commented Jun 23, 2023

Hi,

@JasonYeMSFT, the decimals were defined as decimal(38,14). Are they defined differently in the parquet file structure?

(screenshot attached)
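On the question of how decimal(38,14) is laid out on disk: per the Parquet format spec, a DECIMAL whose precision exceeds what INT32/INT64 can hold is typically stored as a FIXED_LEN_BYTE_ARRAY (or BYTE_ARRAY) containing the unscaled integer in big-endian two's-complement form. A sketch of decoding that representation with BigInt (the helper name is hypothetical):

```javascript
// Decode a Parquet FIXED_LEN_BYTE_ARRAY decimal: the bytes hold the
// unscaled integer, big-endian, two's complement. The real value is
// unscaled * 10^-scale. decodeDecimal is a hypothetical helper.
function decodeDecimal(bytes, scale) {
  let unscaled = 0n;
  for (const b of bytes) unscaled = (unscaled << 8n) | BigInt(b);
  // Sign-extend: if the top bit is set, the value is negative.
  const bits = BigInt(bytes.length * 8);
  if (unscaled >= 1n << (bits - 1n)) unscaled -= 1n << bits;
  return { unscaled, scale }; // value = unscaled * 10^-scale
}
```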

@ddepante

I got the o.buffer.readUInt32LE is not a function error as well.

I'm receiving this as well using 1.30.0

@tobimax

tobimax commented Jun 28, 2023

I have found another issue on 1.30.0

"o.buffer.readBigInt64LE is not a function"

{
  "name": "TypeError",
  "message": "o.buffer.readBigInt64LE is not a function",
  "stack": "TypeError: o.buffer.readBigInt64LE is not a function
    at jR (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736182)
    at Object.QR (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:736230)
    at Tn (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:764741)
    at rD (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:766114)
    at async sb (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:765449)
    at async fb (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:767625)
    at async Mu.readRowGroup (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:763489)
    at async Object.next (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:756357)
    at async ParquetParser.getRecords (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:5046)
    at async getParquetData (file:///C:/Users/tobi/AppData/Local/Programs/Microsoft%20Azure%20Storage%20Explorer/resources/app/node_modules/@storage-explorer/file-preview/dist/src/Panels/ParquetPreviewPanel/index.js:2:6298)"
}

@sbarnoud

I also get the same exception. In my opinion, it would be a good idea to support a "Schema" feature in addition to the "Preview" one.

@JasonYeMSFT
Contributor

JasonYeMSFT commented Jun 30, 2023

A fix has been made in the library. I made a few private builds using a private release including the fix. I tested with some shared files and they seem to solve most of the problems, except for yours @tobimax. The library doesn't support decimal data types with precision > 18. win x64 mac x64 mac arm64 linux

You may try to install the private build and see if it works in your use case.

@mithom @LoHertel @ddepante @manishj216 @daviewales @andrePKI @gpcottle @cf-dtrznadel @ghormann
For those who are getting "buffer.readXXX is not a function" type errors: are you seeing the error in the 1.30.0 official release or in the private build? I shared a few private builds which contain a fix that might solve this problem. I would like you to try it and confirm whether the fix works.

@royalh13 @tobimax
As for your issue, the precision of your decimal fields is too large, which isn't supported right now. I will track a fix in the library.
LibertyDSNP/parquetjs#91
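The precision-18 cutoff follows from integer width: a signed 64-bit value tops out at 9,223,372,036,854,775,807 (19 digits), so every 18-digit unscaled decimal fits but precision 19 and above can overflow. A quick check with BigInt:

```javascript
// Why DECIMAL precision > 18 can't be backed by a signed 64-bit integer.
const INT64_MAX = 2n ** 63n - 1n; // 9223372036854775807n (19 digits)
console.log(INT64_MAX >= 10n ** 18n - 1n); // true: any 18-digit value fits
console.log(INT64_MAX >= 10n ** 19n - 1n); // false: 19-digit values can overflow
```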

@ddepante

are you seeing the error in the 1.30.0 official release or in the private build? I shared a few private builds which contain a fix that might solve this problem. I would like to have you try it and confirm whether the fix is working.

I believe I'm using the official release, but I'm not sure how to be 100% confident.

@andrePKI

I experienced it with the official 1.30.0 version.
I will try the newer private build asap

@mariussoutier

Strange to announce such a feature when it doesn't even work with files generated by Azure Data Factory. The private build does work.

@daviewales

The private build works, although it appears to want to download the whole Parquet file before rendering it, rather than just loading the top 100 rows.

@JasonYeMSFT
Contributor

The private build works, although it appears to want to download the whole Parquet file before rendering it, rather than just loading the top 100 rows.

We always attempt to download the whole Parquet file before rendering it. As far as I know, the data in a Parquet file are stored column-wise, which means you may need to read until the end of the file to get the first row. If you are running into a performance issue because of that, please open a new issue and we can think about potential optimizations.

@daviewales

Is it possible to read the last n bytes of a file in Azure storage?
Parquet files store metadata and statistics at the end of the file, which should allow you to find the byte offsets of each column and row group.

For example, here is the parquet metadata from the NYC Taxi Trips for Yellow Taxi Trips data January 2023:

file_name row_group_id row_group_num_rows row_group_num_columns row_group_bytes column_id file_offset num_values path_in_schema type stats_min stats_max stats_null_count stats_distinct_count stats_min_value stats_max_value compression encodings index_page_offset dictionary_page_offset data_page_offset total_compressed_size total_uncompressed_size
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 0 345651 3066766 VendorID INT64 1 2 0 1 2 GZIP PLAIN_DICTIONARY, PLAIN, RLE 4 44 345647 462414
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 1 11762846 3066766 tpep_pickup_datetime INT64 2008-12-31 23:01:42 2023-02-01 00:56:53 0 2008-12-31 23:01:42 2023-02-01 00:56:53 GZIP PLAIN_DICTIONARY, PLAIN, RLE, PLAIN 345755 923145 11417091 24337327
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 2 23501620 3066766 tpep_dropoff_datetime INT64 2009-01-01 14:29:11 2023-02-02 09:28:47 0 2009-01-01 14:29:11 2023-02-02 09:28:47 GZIP PLAIN_DICTIONARY, PLAIN, RLE, PLAIN 11762977 12349805 11738643 24340223
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 3 24103594 3066766 passenger_count DOUBLE -0.0 9.0 71743 -0.0 9.0 GZIP PLAIN_DICTIONARY, PLAIN, RLE 23501754 23501817 601840 1316224
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 4 28405186 3066766 trip_distance DOUBLE -0.0 258928.15 0 -0.0 258928.15 GZIP PLAIN_DICTIONARY, PLAIN, RLE 24103715 24115852 4301471 4782381
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 5 28623194 3066766 RatecodeID DOUBLE 1.0 99.0 71743 1.0 99.0 GZIP PLAIN_DICTIONARY, PLAIN, RLE 28405304 28405362 217890 544446
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 6 28654134 3066766 store_and_fwd_flag BYTE_ARRAY 71743 N Y GZIP PLAIN_DICTIONARY, PLAIN, RLE 28623309 28623351 30825 69833
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 7 30925059 3066766 PULocationID INT64 1 265 0 1 265 GZIP PLAIN_DICTIONARY, PLAIN, RLE 28654223 28654702 2270836 3109031
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 8 33993890 3066766 DOLocationID INT64 1 265 0 1 265 GZIP PLAIN_DICTIONARY, PLAIN, RLE 30925176 30925679 3068714 3458506
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 9 34401303 3066766 payment_type INT64 0 4 0 0 4 GZIP PLAIN_DICTIONARY, PLAIN, RLE 33994007 33994053 407296 754599
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 10 37493902 3066766 fare_amount DOUBLE -900.0 1160.1 0 -900.0 1160.1 GZIP PLAIN_DICTIONARY, PLAIN, RLE 34401418 34421503 3092484 4533871
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 11 38259058 3066766 extra DOUBLE -7.5 12.5 0 -7.5 12.5 GZIP PLAIN_DICTIONARY, PLAIN, RLE 37494018 37494307 765040 2035825
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 12 38344695 3066766 mta_tax DOUBLE -0.5 53.16 0 -0.5 53.16 GZIP PLAIN_DICTIONARY, PLAIN, RLE 38259167 38259254 85528 252230
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 13 41824535 3066766 tip_amount DOUBLE -96.22 380.8 0 -96.22 380.8 GZIP PLAIN_DICTIONARY, PLAIN, RLE 38344805 38356165 3479730 4638025
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 14 42187929 3066766 tolls_amount DOUBLE -65.0 196.99 0 -65.0 196.99 GZIP PLAIN_DICTIONARY, PLAIN, RLE 41824650 41826949 363279 1804932
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 15 42243482 3066766 improvement_surcharge DOUBLE -1.0 1.0 0 -1.0 1.0 GZIP PLAIN_DICTIONARY, PLAIN, RLE 42188045 42188098 55437 166262
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 16 47165717 3066766 total_amount DOUBLE -751.0 1169.4 0 -751.0 1169.4 GZIP PLAIN_DICTIONARY, PLAIN, RLE 42243606 42292442 4922111 5500323
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 17 47417171 3066766 congestion_surcharge DOUBLE -2.5 2.5 71743 -2.5 2.5 GZIP PLAIN_DICTIONARY, PLAIN, RLE 47165834 47165878 251337 551452
yellow_tripdata_2023-01.parquet 0 3066766 19 83210720 18 47662860 3066766 airport_fee DOUBLE -1.25 1.25 71743 -1.25 1.25 GZIP PLAIN_DICTIONARY, PLAIN, RLE 47417296 47417338 245564 552816

You can see it includes the file offset information for each column and row group. (Only contains a single row group due to the small size of the file.)

I accessed this metadata from the parquet file using duckdb:

select * from parquet_metadata("yellow_tripdata_2023-01.parquet")
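As an illustration of how those offsets could be used (a hedged sketch, not Storage Explorer's actual code): Azure Blob Storage supports ranged GETs via the standard HTTP `Range` header, so a client that already has the footer metadata could fetch a single column chunk. The numbers below are the `VendorID` offsets from the dump above; `range_header` is a hypothetical helper.

```python
# Hedged sketch: turn column-chunk metadata (as in the dump above) into an
# HTTP Range header so only that slice of the blob needs to be fetched.
# Offsets are the VendorID column chunk from the metadata dump:
# dictionary page at byte 4, total_compressed_size of 345647 bytes.

def range_header(start: int, compressed_size: int) -> dict:
    # The Range header is inclusive on both ends, hence the -1.
    return {"Range": f"bytes={start}-{start + compressed_size - 1}"}

print(range_header(4, 345647))  # {'Range': 'bytes=4-345650'}
```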

@sbarnoud

sbarnoud commented Jul 6, 2023

We always attempt to download the whole parquet file before rendering it. As far as I know, the data in parquet are stored column-wise which means you may need to read until the end of the file to get the first row. If you are running into performance issue because of that, please open a new issue and we can think about potential optimizations.

No, that is not true. Fortunately, Parquet files are splittable (that's why we use them for distributed computing). As mentioned by https://github.com/daviewales, it would also be very useful to support, in addition to an extract of the data, the schema and the metadata (especially since each row group's metadata contains the "column metadata" used for Predicate Push Down).

See https://parquet.apache.org/docs/file-format/. You can get the offset of any row group, and read any of them fully independently to each other.

@andrePKI

andrePKI commented Jul 6, 2023

Well, @sbarnoud and @daviewales: "Metadata is written after the data to allow for single pass writing", so how do you get to the metadata without reading the last part of the file? To seek into the file (without actually reading it) you would need to know where to seek to, so you still need to search for where the metadata starts. So I don't think it is strange to download the whole file before rendering it.

@sbarnoud

sbarnoud commented Jul 6, 2023

Look at parquet-cli in the Parquet GitHub repo => this tool scans and prints all that metadata.

https://github.com/apache/parquet-format#file-format

Look for example here
https://github.com/apache/parquet-mr/blob/3a4eb2aabae6d1d6e6971073eee318c72d1ca28d/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ParquetMetadataCommand.java#L68

    ParquetMetadata footer = ParquetFileReader.readFooter(
        getConf(), qualifiedPath(source), ParquetMetadataConverter.NO_FILTER);

    console.info("\nFile path:  {}", source);
    console.info("Created by: {}", footer.getFileMetaData().getCreatedBy());

    Map<String, String> kv = footer.getFileMetaData().getKeyValueMetaData();

When using parquet-cli to read a file on ADLS (or HDFS) you will see that it doesn't download the whole file, just pieces of it.

https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/rawpages/RawPagesReader.java

@sbarnoud

sbarnoud commented Jul 6, 2023

So you will find all the needed info here: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

This is how distributed systems are able to read slices of a Parquet file without downloading the full file at each worker node.

  private static final ParquetMetadata readFooter(InputFile file, ParquetReadOptions options,
      SeekableInputStream f, ParquetMetadataConverter converter) throws IOException {

    long fileLen = file.getLength();
    String filePath = file.toString();
    LOG.debug("File length {}", fileLen);

    int FOOTER_LENGTH_SIZE = 4;
    if (fileLen < MAGIC.length + FOOTER_LENGTH_SIZE + MAGIC.length) { // MAGIC + data + footer + footerIndex + MAGIC
      throw new RuntimeException(filePath + " is not a Parquet file (length is too low: " + fileLen + ")");
    }

    // Read footer length and magic string - with a single seek
    byte[] magic = new byte[MAGIC.length];
    long fileMetadataLengthIndex = fileLen - magic.length - FOOTER_LENGTH_SIZE;
    LOG.debug("reading footer index at {}", fileMetadataLengthIndex);
    f.seek(fileMetadataLengthIndex);
    int fileMetadataLength = readIntLittleEndian(f);
    f.readFully(magic);

    boolean encryptedFooterMode;
    if (Arrays.equals(MAGIC, magic)) {
      encryptedFooterMode = false;
    } else if (Arrays.equals(EFMAGIC, magic)) {
      encryptedFooterMode = true;
    } else {
      throw new RuntimeException(filePath + " is not a Parquet file. Expected magic number at tail, but found " + Arrays.toString(magic));
    }

    long fileMetadataIndex = fileMetadataLengthIndex - fileMetadataLength;
    LOG.debug("read footer length: {}, footer index: {}", fileMetadataLength, fileMetadataIndex);
    if (fileMetadataIndex < magic.length || fileMetadataIndex >= fileMetadataLengthIndex) {
      throw new RuntimeException("corrupted file: the footer index is not within the file: " + fileMetadataIndex);
    }
    f.seek(fileMetadataIndex);
...

@andrePKI

andrePKI commented Jul 6, 2023

Ok, I get it.
The file ends with a 4-byte integer footer length and "PAR1".
So seek to (file length - 8), read 4 bytes as an int, then seek to (file length - 8 - footer length).
Nice, I didn't know that.
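That seek arithmetic can be sketched in a few lines of Python. This is a minimal illustration against an in-memory stand-in for a Parquet file; the "footer" is placeholder bytes rather than real Thrift metadata, and the only assumption is the trailer layout described above (4-byte little-endian footer length followed by the `PAR1` magic).

```python
# Minimal sketch of the footer-locating logic, applied to an in-memory
# stand-in for a Parquet file. Assumed layout:
#   "PAR1" + data + footer + 4-byte LE footer length + "PAR1"
import io
import struct

MAGIC = b"PAR1"

def locate_footer(f):
    """Return (footer_offset, footer_length) without reading the body."""
    f.seek(0, io.SEEK_END)
    file_len = f.tell()
    # Last 8 bytes: 4-byte little-endian footer length, then trailing magic.
    f.seek(file_len - 8)
    (footer_len,) = struct.unpack("<I", f.read(4))
    if f.read(4) != MAGIC:
        raise ValueError("not a Parquet file (bad trailing magic)")
    return file_len - 8 - footer_len, footer_len

# Build a toy file with the same layout (the "footer" here is just bytes,
# not real Thrift metadata).
footer = b"fake-thrift-footer"
data = b"column-chunk-bytes"
blob = MAGIC + data + footer + struct.pack("<I", len(footer)) + MAGIC

f = io.BytesIO(blob)
off, length = locate_footer(f)
f.seek(off)
print(f.read(length))  # b'fake-thrift-footer'
```

Over a blob endpoint, the same two reads would map to two small ranged GETs: one for the last 8 bytes, then one for the footer itself.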

@JasonYeMSFT
Contributor

I will look into the possibility of reading partial data. Just to reemphasize, please let us know whether the private build does or doesn't resolve your "buffer.readXXX is not a function" type of error.

@JasonYeMSFT
Contributor

We have published 1.30.1, which includes the library fix for the "buffer.readXXX is not a function" type of bug.

@ddepante

We have published 1.30.1, which includes the library fix for the "buffer.readXXX is not a function" type of bug.

Previewing parquet files is functioning as expected now, thank you sooooo much!!!

@SqlAndWood

SqlAndWood commented Jul 21, 2023

Seems I'm late to this party, apologies! I too appreciate the idea of being able to read .parquet files in ASE, yet am not seeing the expected results yet.
Using ASE 1.30.1 (just tested 1.20.2 with the same result), and while it 'reads' the .parquet files (created by Python PyArrow Parquet), the preview doesn't seem to match the encoding (not that PyArrow provides a choice). Happy to provide a parquet file containing random data if this would help.

@MRayermannMSFT
Member

Using ASE 1.30.1 (just tested 1.20.2 with the same result), and while it 'reads' the .parquet files (created by Python PyArrow Parquet), the preview doesn't seem to match the encoding (not that PyArrow provides a choice). Happy to provide a parquet file containing random data if this would help.

Please open a new issue for this, and yes, being able to provide an example file is a HUGE help. :)

@JasonYeMSFT
Contributor

JasonYeMSFT commented Aug 31, 2023

@tobimax We integrated a library fix that unblocks your scenario by letting decimal fields represented as byte arrays (which may have precision > 18) pass the parser validation. However, the library doesn't support parsing those byte arrays into a human-readable format yet, so those fields will be presented as raw buffer values. This fix will be shipped in 1.32.0.

Labels
❔ external Root cause of this issue is in another component, product, or service