This is the python binding for the Stryke build on top of Apache-Orc C++ library. All code is available on github. It's a template library but depend on Apache-Orc.
Apache-Orc is defined as :
the smallest, fastest columnar storage for Hadoop workloads.
The straight forward way to use Orc file in Python is through PySpark.
As Stryke is a template Library we have to defined Writer template specialization at compilation. For each C++ Writer, we have define some binding mainly for test purpose.
OrcWriterImpl
writers are insimple
namespaceOrcWriterDispatch
writers are indispatch
namespaceOrcWriterSequential
writers are insequential
namespaceOrcWriterThread
writers are inthread
namespace- Your writers will be definer in
custom
namespace
This file show some example to define new writer.
You can edit this file in place and recompile python binding. Your new binding will be available in the stryke.custom
namespace (with a Writer
prefix)
You can copy this file elsewhere. Then to tell compiler where your binding is located, you have to define the CUSTOM_BINDING=<path_to_your_file>
in the cmake command. After compilation your new binding will be available in the stryke.custom
namespace (with a Writer
prefix)
cmake .. -DCMAKE_TOOLCHAIN_FILE=${VCPKG_DIR}/scripts/buildsystems/vcpkg.cmake -DBUILD_PYTHON_BINDING=ON -DCUSTOM_BINDING=<path_to_your_file>
We use vcpkg to manage dependencies
Stryke depend on:
./vcpkg install catch2 orc pybind11 zstd
#from root folder
mkdir build
cd build
# configure make with vcpkg toolchain
cmake .. -DCMAKE_TOOLCHAIN_FILE=${VCPKG_DIR}/scripts/buildsystems/vcpkg.cmake -DBUILD_PYTHON_BINDING=ON
make install
The shared library (stryke.so
) with the binding will be store in python/stryke
folder.
# from python folder
python3 setup.py install
python3 -m pytest tests/python
The following python writer defined by :
declare_writer_impl<TimestampNumber, Long, Long, Long>(m_simple, "ForTest");
import stryke as sy
writer = sy.simple.WriterForTest(["date", "univers", "value1", "value2"], "example.orc", sy.WriterOptions())
writer.write(1544400000, 42, 10, 1) # Timestamp Type is the number of second since 1970.
writer.write(1544400000.123456789, 42, 20, 2) # Timestamp Type is the number of second since 1970.
# the file is close at writer destruction
resulting file read by orc-content
:
{"date": "2018-12-10 00:00:00.0", "value": 42}
{"date": "2018-12-10 00:00:00.123456716", "value": 42}
The following python writer defined by :
declare_writer_dispatch<FolderEncode<TimestampNumber, Long>, Long, Long>(m_dispatch, "ForTestWithFolder");
import stryke as sy
writer = sy.dispatch.WriterForTestWithFolder(["date", "univers", "value1", "value2"], "data", "example", sy.WriterOptions())
writer.write(1544400000, 42, 10, 1) # Timestamp Type is the number of second since 1970.
writer.write(1544400000.123456789, 42, 20, 2) # Timestamp Type is the number of second since 1970.
# the file is close at writer destruction
resulting files read by orc-content
:
{"value1": 10, "value2": 1}
{"value1": 20, "value2": 2}