Improve performance #7

tylerdq · 2019-08-13T15:21:15Z

For .parquet output at least, pyarrow's built-in threading and compression options could speed up reading and writing the dataframe binaries and reduce their size on disk.

There may also be opportunities to multi-thread the PDF extraction itself (using PyPDF2 or switching to an alternative library).
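One possible shape for parallel extraction, sketched with the standard library only. The function name `extract_many` and its parameters are hypothetical, and `extract_one` stands in for whatever per-file extraction function the project uses; nothing here is existing code. Because PyPDF2 is pure Python, threads would likely be limited by the GIL, so the sketch defaults to a process pool while leaving the executor swappable.

```python
from concurrent.futures import ProcessPoolExecutor


def extract_many(paths, extract_one, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply `extract_one` to every path in parallel; return {path: result}.

    `extract_one` is a caller-supplied function (hypothetical here) that
    takes one PDF path and returns its extracted text.
    """
    paths = list(paths)
    with executor_cls(max_workers=max_workers) as pool:
        # map() yields results in input order, so they line up with `paths`.
        results = list(pool.map(extract_one, paths))
    return dict(zip(paths, results))
```

With the default process pool, `extract_one` has to be a picklable module-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` is an easy way to compare the two approaches.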

tylerdq · 2019-08-13T23:04:47Z

The Parquet improvements might be unnecessary. It appears from pyarrow's documentation that the default behavior is already fairly optimized.

tylerdq added the enhancement (New feature or request) label Aug 13, 2019
