diff --git a/docs/source/cudf.md b/docs/source/cudf.md
new file mode 100644
index 0000000000..578b992a68
--- /dev/null
+++ b/docs/source/cudf.md
@@ -0,0 +1,30 @@
+# cuDF
+
+[cuDF](https://docs.rapids.ai/api/cudf/stable/) is a Python GPU DataFrame library.
+
+To read from a single Parquet file, use the [`read_parquet`](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.read_parquet/) function to read it into a DataFrame:
+
+```py
+import cudf
+
+df = (
+    cudf.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
+    .groupby('horoscope')['text']
+    .apply(lambda x: x.str.len().mean())
+    .sort_values(ascending=False)
+    .head(5)
+)
+```
+
+To read multiple Parquet files - for example, if the dataset is sharded - you'll need to use [`dask-cudf`](https://docs.rapids.ai/api/dask-cudf/stable/):
+
+```py
+import dask
+import dask.dataframe as dd
+
+dask.config.set({"dataframe.backend": "cudf"})
+
+df = (
+    dd.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
+)
+```
\ No newline at end of file
diff --git a/docs/source/parquet_process.md b/docs/source/parquet_process.md
index f268a9dd1b..78683d9ea0 100644
--- a/docs/source/parquet_process.md
+++ b/docs/source/parquet_process.md
@@ -7,6 +7,7 @@ For private datasets, the feature is provided if the repository is owned by a [P
 There are several different libraries you can use to work with the published Parquet files:
 
 - [ClickHouse](https://clickhouse.com/docs/en/intro), a column-oriented database management system for online analytical processing
+- [cuDF](https://docs.rapids.ai/api/cudf/stable/), a Python GPU DataFrame library
 - [DuckDB](https://duckdb.org/docs/), a high-performance SQL database for analytical queries
 - [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
 - [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
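
Note on the URLs in the new `cudf.md` examples: both hard-code the revision `refs/convert/parquet` in its percent-encoded form (`refs%2Fconvert%2Fparquet`). The sketch below shows one way such a URL could be assembled with the standard library only. The helper `parquet_shard_url` is hypothetical and not part of cuDF or any Hugging Face library, and the path layout and four-digit zero-padded shard name are assumptions inferred from the single URL in the diff, so they may not hold for every dataset.

```python
from urllib.parse import quote

HUB_DATASETS = "https://huggingface.co/datasets"


def parquet_shard_url(dataset, config, split, shard, revision="refs/convert/parquet"):
    """Build the URL of one converted Parquet shard (hypothetical helper).

    Assumes the layout <dataset>/resolve/<revision>/<config>/<split>/<NNNN>.parquet,
    which is inferred from the example URL and not taken from an official spec.
    """
    # safe="" makes quote() escape "/" as well, so the revision stays one path segment.
    encoded_revision = quote(revision, safe="")
    return (
        f"{HUB_DATASETS}/{dataset}/resolve/{encoded_revision}"
        f"/{config}/{split}/{shard:04d}.parquet"
    )


url = parquet_shard_url("blog_authorship_corpus", "blog_authorship_corpus", "train", 0)
print(url)
```

The resulting string could then be passed to `cudf.read_parquet` as in the first example of the diff.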