Elbow is a lightweight and scalable library for getting diverse data out of specialized formats and into common tabular data formats for downstream analytics.
Extract image metadata and pixel values from all JPEG image files under the current directory and save them as a Parquet dataset:
import numpy as np
import pandas as pd
from PIL import Image

from elbow.builders import build_parquet


def extract_image(path: str):
    # Return one record (a dict of column name -> value) per image file.
    img = Image.open(path)
    width, height = img.size
    pixel_values = np.asarray(img)
    return {
        "path": path,
        "width": width,
        "height": height,
        "pixel_values": pixel_values,
    }


# Apply extract_image to every matching file, using 8 worker processes.
build_parquet(
    source="**/*.jpg",
    extract=extract_image,
    output="images.pqds/",
    workers=8,
)

# Load the resulting Parquet dataset into a DataFrame.
df = pd.read_parquet("images.pqds")
For a complete example, see here.
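Before running a long extraction, it can help to check which files a pattern such as `"**/*.jpg"` would pick up. The sketch below uses only the standard library and assumes the pattern is expanded like Python's recursive `glob`; that is an assumption about Elbow's behavior, not something stated here. The helper name `match_sources` is hypothetical and is not part of Elbow.

```python
import glob
import os
import tempfile
from pathlib import Path


def match_sources(root: str, pattern: str = "**/*.jpg") -> list:
    """List files under `root` matching `pattern`, as sorted POSIX-style relative paths.

    Hypothetical helper for illustration; assumes recursive glob semantics.
    """
    matches = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(Path(os.path.relpath(m, root)).as_posix() for m in matches)


if __name__ == "__main__":
    # Build a small throwaway directory tree to try the pattern against.
    with tempfile.TemporaryDirectory() as root:
        for rel in ["a.jpg", "sub/b.jpg", "sub/deep/c.jpg", "d.jpeg", "e.png"]:
            target = Path(root, rel)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        for path in match_sources(root):
            print(path)
```

Under these glob semantics, `**` matches zero or more directory levels, so files in the top-level directory are included, while files with a `.jpeg` extension are not matched by `*.jpg`.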
pip install elbow
The current development version can be installed with
pip install git+https://github.com/childmindresearch/elbow.git
There are many other high-quality projects for extracting, loading, and transforming data. Some alternative projects focused on somewhat different use cases are:
We welcome contributions of any kind! If you'd like to contribute, please feel free to start a conversation in our issues.