I have a problem reading parquet files with codename/parquet/helper/ParquetDataIterator. When the files are small it reads them without any problem, but past roughly 500 records it saturates the memory and the whole process fails.
I have been reading the documentation and running thousands of tests with the dataIterator and with the parquetReader, but it always fails at those parquet sizes, even when I increase the machine's memory.
I also tried fetching only a fixed number of rows, but I can't get it to work and there is no documentation about it.
Is there any way to use ParquetDataIterator or ParquetReader while limiting the number of records, i.e. requesting them sequentially, N records at a time, without having to load everything into memory?
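To make the request concrete, this is the access pattern I'm after (a plain-PHP sketch using generators; `batches()` is not part of the library, and it only helps if the underlying iterator doesn't buffer the whole file internally):

```php
// Hypothetical helper: consume any iterable N rows at a time,
// without materializing more than one batch in memory.
function batches(iterable $rows, int $size): \Generator
{
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller batch
    }
}

// desired usage: process 500 records, free them, ask for the next 500
foreach (batches($someRowIterator, 500) as $chunk) {
    // work with $chunk, then let it go out of scope
}
```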
I am currently using these two snippets:
1) For ParquetReader

```php
use jocoon\parquet\ParquetReader;

// open file stream (in this example for reading only)
$fileStream = fopen(__DIR__.'/test.parquet', 'r');

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get the file schema (available straight after opening the parquet reader);
// however, get only the data fields, as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate the row groups in this file
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    // create a row group reader
    $groupReader = $parquetReader->OpenRowGroupReader($i);

    // read all columns inside each row group (you have the option to read
    // only the required columns if you need to)
    $columns = [];
    foreach ($dataFields as $field) {
        $columns[] = $groupReader->ReadColumn($field);
    }

    // get the first column, for instance
    $firstColumn = $columns[0];

    // getData() returns a typed array of column data you can cast to the column's type
    $data = $firstColumn->getData();

    // print the data or do other stuff with it
    print_r($data);
}
```
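If row groups are the natural chunking unit, I would expect something like the following to keep memory bounded. This is only a sketch using the calls shown above; `handleColumn()` is a hypothetical callback, and whether `unset()` actually releases the column buffers is exactly what I can't confirm:

```php
use jocoon\parquet\ParquetReader;

$fileStream = fopen(__DIR__.'/test.parquet', 'r');
$parquetReader = new ParquetReader($fileStream);
$dataFields = $parquetReader->schema->GetDataFields();

// process one row group at a time and drop the column buffers
// before opening the next group
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    $groupReader = $parquetReader->OpenRowGroupReader($i);
    foreach ($dataFields as $field) {
        $column = $groupReader->ReadColumn($field);
        handleColumn($column->getData()); // hypothetical per-column callback
        unset($column);                   // free this column before reading the next
    }
    unset($groupReader);
}
fclose($fileStream);
```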
2) For ParquetDataIterator

```php
use codename\parquet\helper\ParquetDataIterator;

$iterateMe = ParquetDataIterator::fromFile('your-parquet-file.parquet');

foreach ($iterateMe as $dataset) {
    // $dataset is an associative array and already combines
    // the data of all columns back into a row-like structure
}
```
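Since ParquetDataIterator works in foreach, I assume it implements PHP's Iterator interface, in which case SPL's LimitIterator should at least cap the number of records read. A sketch; it only helps if the iterator streams rows lazily instead of buffering the whole file up front:

```php
use codename\parquet\helper\ParquetDataIterator;

$iterateMe = ParquetDataIterator::fromFile('your-parquet-file.parquet');

// take only the first 500 rows (offset 0, limit 500);
// LimitIterator advances the inner iterator row by row,
// so later rows should never be touched
foreach (new \LimitIterator($iterateMe, 0, 500) as $dataset) {
    // $dataset is an associative array, one row
    print_r($dataset);
}
```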