GitHub - sebbegg/parquet-python: python implementation of the parquet columnar file format.

parquet-python

parquet-python is a pure-python implementation (currently with only read-support) of the parquet format. It comes with a script for reading parquet files and outputting the data to stdout as JSON or TSV (without the overhead of JVM startup). Performance has not yet been optimized, but it's useful for debugging and quick viewing of data in files.

Not all parts of the parquet-format have been implemented or tested yet (nested data, for example); see Todos below for a full list. That said, parquet-python is capable of reading all the data files from the parquet-compatibility project.

requirements

parquet-python has been tested on python 2.7, 3.4, and 3.5. It depends on thrift (0.9) and python-snappy (for snappy compressed files).

getting started

parquet-python is available via PyPI and can be installed using pip install parquet. The package includes the parquet command for reading parquet files, e.g. parquet test.parquet. See parquet --help for full usage.

Example

parquet-python currently has two programmatic interfaces with functionality similar to Python's csv reader. First, it supports a DictReader which returns a dictionary per row. Second, it has a reader which returns a list of values for each row. Both functions require a file-like object and accept an optional columns argument to read only the specified columns.

import parquet
import json

## assuming parquet file with two rows and three columns:
## foo bar baz
## 1   2   3
## 4   5   6

with open("test.parquet", "rb") as fo:
   # prints:
   # {"foo": 1, "bar": 2}
   # {"foo": 4, "bar": 5}
   for row in parquet.DictReader(fo, columns=['foo', 'bar']):
       print(json.dumps(row))


with open("test.parquet", "rb") as fo:
   # prints:
   # 1,2
   # 4,5
   for row in parquet.reader(fo, columns=['foo', 'bar']):
       print(",".join([str(r) for r in row]))
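
Because reader yields one list of values per row, much like csv.reader, its output can be handed to the standard library's csv.writer. The following is a minimal sketch of that idea, not part of parquet-python itself: write_rows_as_tsv is a hypothetical helper name, and the sample rows stand in for what parquet.reader would yield for the example file above, so the snippet runs without a parquet file on disk.

```python
import csv
import io


def write_rows_as_tsv(rows, out):
    """Write an iterable of row lists to `out` as tab-separated text.

    `rows` can be any iterable of lists, such as the rows yielded by
    parquet.reader(fo, columns=[...]). Returns the number of rows written.
    (Hypothetical helper for illustration; not part of parquet-python.)
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Stand-in for the rows of columns foo and bar in the example file above.
sample_rows = [[1, 2], [4, 5]]

buffer = io.StringIO()
written = write_rows_as_tsv(sample_rows, buffer)
print(buffer.getvalue(), end="")
```

With a real file, the same helper could be called as write_rows_as_tsv(parquet.reader(fo, columns=['foo', 'bar']), sys.stdout) inside a with open("test.parquet", "rb") as fo: block, after importing sys and parquet.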

Todos

Support the deprecated bitpacking
Fix handling of repetition-levels and definition-levels
Tests for nested schemas, null data
Support reading of data from HDFS via snakebite and/or webhdfs.
Implement writing
Performance evaluation and optimization (i.e. how does it compare to the C++ and Java implementations)

Contributing

Contributions are made via pull requests. Please include tests with your changes and follow PEP 8.
