add cudf example (#2941)
raybellwaves authored Jun 24, 2024
1 parent d222d52 commit c2f1467
Showing 2 changed files with 31 additions and 0 deletions.
30 changes: 30 additions & 0 deletions docs/source/cudf.md
@@ -0,0 +1,30 @@
# cuDF

[cuDF](https://docs.rapids.ai/api/cudf/stable/) is a Python GPU DataFrame library.

To read a single Parquet file into a DataFrame, use the [`read_parquet`](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.read_parquet/) function. The example then computes the mean text length per horoscope and keeps the five largest values:

```py
import cudf

df = (
    cudf.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
    .groupby('horoscope')['text']
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)
```
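
To make the chained call above easier to follow, here is a minimal pure-Python sketch of the same aggregation: group rows by `horoscope`, average the character length of `text` in each group, sort descending, and keep the top five. The rows are made up for illustration and the helper name `mean_text_length_by_group` is hypothetical; it is not part of cuDF and only mirrors the logic, not the GPU execution.

```py
from collections import defaultdict

# Made-up rows standing in for the dataset's 'horoscope' and 'text' columns.
rows = [
    {"horoscope": "Aries", "text": "hello"},
    {"horoscope": "Aries", "text": "hi there"},
    {"horoscope": "Leo", "text": "a much longer blog post"},
    {"horoscope": "Libra", "text": "ok"},
]


def mean_text_length_by_group(rows, key="horoscope", column="text"):
    """Return (group, mean character length) pairs, largest mean first."""
    lengths = defaultdict(list)
    for row in rows:
        # Equivalent of .str.len(): the character count of each text value.
        lengths[row[key]].append(len(row[column]))
    # Equivalent of .mean() applied per group.
    means = {group: sum(vals) / len(vals) for group, vals in lengths.items()}
    # Equivalent of .sort_values(ascending=False).
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Equivalent of .head(5): keep at most the first five groups.
top = mean_text_length_by_group(rows)[:5]
print(top)
```

With cuDF the same steps run on the GPU over whole columns rather than row by row.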

To read multiple Parquet files, for example when the dataset is sharded, use [`dask-cudf`](https://docs.rapids.ai/api/dask-cudf/stable/):

```py
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})

df = (
    dd.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/*.parquet")
)
```
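
The two URLs used on this page differ only in the final path segment (`0000.parquet` for one shard, `*.parquet` for all shards). As a rough sketch, a small helper can build either form. The function name `parquet_url` and its parameters are hypothetical, and the URL layout is inferred only from the two examples above, so it may not hold for every dataset.

```py
from urllib.parse import quote

HUB = "https://huggingface.co/datasets"
# Branch holding the converted Parquet files, as seen in the URLs above.
PARQUET_REF = "refs/convert/parquet"


def parquet_url(dataset, config, split, shard="*"):
    """Build a Parquet URL following the pattern of the examples above.

    shard: an int selects one file (zero-padded to four digits);
    the default "*" produces a glob matching every shard of the split.
    """
    name = f"{shard:04d}" if isinstance(shard, int) else shard
    # The ref must be percent-encoded so its slashes stay in one path segment.
    ref = quote(PARQUET_REF, safe="")
    return f"{HUB}/{dataset}/resolve/{ref}/{config}/{split}/{name}.parquet"


single = parquet_url("blog_authorship_corpus", "blog_authorship_corpus", "train", 0)
sharded = parquet_url("blog_authorship_corpus", "blog_authorship_corpus", "train")
print(single)
print(sharded)
```

The `single` string can be passed to `cudf.read_parquet` and the `sharded` string to `dd.read_parquet`, as in the examples above.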
1 change: 1 addition & 0 deletions docs/source/parquet_process.md
@@ -7,6 +7,7 @@ For private datasets, the feature is provided if the repository is owned by a [P
There are several different libraries you can use to work with the published Parquet files:

- [ClickHouse](https://clickhouse.com/docs/en/intro), a column-oriented database management system for online analytical processing
- [cuDF](https://docs.rapids.ai/api/cudf/stable/), a Python GPU DataFrame library
- [DuckDB](https://duckdb.org/docs/), a high-performance SQL database for analytical queries
- [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
- [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library