Commit

Add clearer description of overwrite behaviour
jackdelv committed Nov 2, 2023
1 parent edea269 commit adde424
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions plugins/parquet/README.md
@@ -38,7 +38,7 @@ dataset := ParquetIO.Read(layout, '/source/directory/data.parquet');

#### 2. Writing Parquet Files

The Write function empowers ECL programmers to write ECL datasets to Parquet files. By leveraging the Parquet format's columnar storage capabilities, this function provides efficient compression and optimized storage for data. There is an optional argument that sets the overwrite behavior of the plugin. The default value is false meaning it will throw an error if the target file already exists.
The Write function empowers ECL programmers to write ECL datasets to Parquet files. By leveraging the Parquet format's columnar storage capabilities, this function provides efficient compression and optimized storage for data. There is an optional argument that sets the overwrite behavior of the plugin. The default value is false, meaning the plugin will throw an error if the target file already exists. If overwrite is set to true, the plugin will first check for files matching the target path and delete them before writing the new files.

The Parquet Plugin supports all available Arrow compression types. Specifying the compression when writing is optional and defaults to Uncompressed. The options for compressing your files are Snappy, GZip, Brotli, LZ4, LZ4Frame, LZ4Hadoop, ZSTD, Uncompressed.
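The overwrite and compression options described above can be sketched as follows. This is a minimal example; the dataset, field names, and the exact argument order of `ParquetIO.Write` are illustrative assumptions based on the surrounding text, not verified against the plugin source:

```
IMPORT Parquet;

// Hypothetical layout and dataset for illustration only.
layout := RECORD
    STRING name;
    INTEGER id;
END;
outDataset := DATASET([{'example', 1}], layout);

// TRUE deletes files matching the target path before writing;
// FALSE (the default) throws an error if the target file exists.
overwriteOption := TRUE;

// Compression is optional and defaults to Uncompressed; any of the
// listed Arrow codecs (e.g. 'Snappy', 'ZSTD') may be specified.
ParquetIO.Write(outDataset, '/source/directory/data.parquet', overwriteOption, 'Snappy');
```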

@@ -65,7 +65,7 @@ github_dataset := ParquetIO.DirectoryPartition.Read(layout, 'source/directory/pa

#### 2. Writing Partitioned Files

To select the fields that you wish to partition your data on pass in a string of semicolon separated field names. If the fields you select create too many subdirectories you may need to partition your data on different fields. The rowSize field defaults to 100000 rows and determines how many rows to put in each part of the output files. When writing a partitioned file setting the overwrite option to true will erase the contents of the target directory before writing the output data. The default for this setting is false and will give an error message if you try to write to a directory that already contains data.
To select the fields on which to partition your data, pass in a string of semicolon-separated field names. If the fields you select create too many subdirectories, you may need to partition your data on different fields. The rowSize field defaults to 100000 rows and determines how many rows go into each part of the output files. Writing a partitioned file to a directory that already contains data will fail unless the overwrite option is set to true. If the overwrite option is set to true and the target directory is not empty, the plugin will first erase the contents of the target directory before writing the new files.

```
ParquetIO.HivePartition.Write(outDataset, rowSize, '/source/directory/partitioned_dataset', overwriteOption, 'year;month;day');
