Add videos to README (#194)
* update

* update

* Update README.md

* update

* update

* update

* update

* update

* update

* Update README.md

* update
nanne-aben authored Oct 4, 2023
1 parent 73c0513 commit a97036e
Showing 8 changed files with 767 additions and 73 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
Expand Up @@ -48,4 +48,4 @@ jobs:
coverage report -m --fail-under 100
- name: Run notebooks
run: |
for FILE in docs/source/*.ipynb; do papermill $FILE output.json -k python3; done
for FILE in docs/*/*.ipynb; do papermill $FILE output.json -k python3; done
75 changes: 75 additions & 0 deletions README.md
@@ -0,0 +1,75 @@
# Typedspark: column-wise type annotations for pyspark DataFrames

We love Spark! But in production code we're wary when we see:

```python
from pyspark.sql import DataFrame

def foo(df: DataFrame) -> DataFrame:
    # do stuff
    return df
```

Because… How do we know which columns are supposed to be in ``df``?

Using ``typedspark``, we can be more explicit about what these data should look like.

```python
from typedspark import Column, DataSet, Schema
from pyspark.sql.types import LongType, StringType

class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]

def foo(df: DataSet[Person]) -> DataSet[Person]:
    # do stuff
    return df
```
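
A plain ``DataFrame`` can be cast to a typed ``DataSet`` at runtime. The sketch below is illustrative rather than canonical: it continues from the snippet above, assumes a running ``SparkSession``, and uses the ``DataSet[Person](df)`` cast from typedspark's documented examples, which validates the columns against the schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An untyped DataFrame with the columns that Person expects.
df = spark.createDataFrame(
    [(1, "Alice", 30)], schema="id long, name string, age long"
)

# Cast to DataSet[Person]; if a column is missing or has the wrong
# type, typedspark raises an error here rather than deep inside foo().
persons = DataSet[Person](df)
foo(persons)
```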
The advantages include:

* Improved readability of the code
* Typechecking, both during runtime and linting
* Auto-complete of column names
* Easy refactoring of column names
* Easier unit testing through the generation of empty ``DataSets`` based on their schemas (see the sketch after this list)
* Improved documentation of tables
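
As a sketch of the unit-testing point above: typedspark ships a ``create_empty_dataset`` helper that builds a ``DataSet`` straight from a schema. The exact signature is an assumption here (check the docs); the idea is:

```python
from pyspark.sql import SparkSession
from typedspark import create_empty_dataset

spark = SparkSession.builder.getOrCreate()

# Build a DataSet[Person] fixture directly from the schema --
# no hand-written test data required. (Signature assumed; see docs.)
empty_persons = create_empty_dataset(spark, Person)
empty_persons.show()
```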

## Demo videos

### IDE demo

https://github.com/kaiko-ai/typedspark/assets/47976799/e6f7fa9c-6d14-4f68-baba-fe3c22f75b67

You can find the corresponding code [here](docs/videos/ide.ipynb).

### Jupyter / Databricks notebooks demo

https://github.com/kaiko-ai/typedspark/assets/47976799/39e157c3-6db0-436a-9e72-44b2062df808

You can find the corresponding code [here](docs/videos/notebook.ipynb).

## Installation

You can install ``typedspark`` from [pypi](https://pypi.org/project/typedspark/) by running:

```bash
pip install typedspark
```
By default, ``typedspark`` does not list ``pyspark`` as a dependency, since many platforms (e.g. Databricks) come with ``pyspark`` preinstalled. If you want to install ``typedspark`` with ``pyspark``, you can run:

```bash
pip install "typedspark[pyspark]"
```

## Documentation
Please see our documentation on [readthedocs](https://typedspark.readthedocs.io/en/latest/index.html).

## FAQ

**I found a bug! What should I do?**<br/>
Great! Please make an issue and we'll look into it.

**I have a great idea to improve typedspark! How can we make this work?**<br/>
Awesome, please make an issue and let us know!
69 changes: 0 additions & 69 deletions README.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/run_notebooks.sh
@@ -1,4 +1,4 @@
for FILE in docs/source/*.ipynb; do
for FILE in docs/*/*.ipynb; do
papermill $FILE $FILE;
python docs/remove_metadata.py $FILE;
done
70 changes: 69 additions & 1 deletion docs/source/README.rst
@@ -1 +1,69 @@
.. include:: ../../README.rst
===============================================================
Typedspark: column-wise type annotations for pyspark DataFrames
===============================================================

We love Spark! But in production code we're wary when we see:

.. code-block:: python

    from pyspark.sql import DataFrame

    def foo(df: DataFrame) -> DataFrame:
        # do stuff
        return df

Because… How do we know which columns are supposed to be in ``df``?

Using ``typedspark``, we can be more explicit about what these data should look like.

.. code-block:: python

    from typedspark import Column, DataSet, Schema
    from pyspark.sql.types import LongType, StringType

    class Person(Schema):
        id: Column[LongType]
        name: Column[StringType]
        age: Column[LongType]

    def foo(df: DataSet[Person]) -> DataSet[Person]:
        # do stuff
        return df

The advantages include:

* Improved readability of the code
* Typechecking, both during runtime and linting
* Auto-complete of column names
* Easy refactoring of column names
* Easier unit testing through the generation of empty ``DataSets`` based on their schemas
* Improved documentation of tables

Installation
============

You can install ``typedspark`` from `pypi <https://pypi.org/project/typedspark/>`_ by running:

.. code-block:: bash

    pip install typedspark

By default, ``typedspark`` does not list ``pyspark`` as a dependency, since many platforms (e.g. Databricks) come with ``pyspark`` preinstalled. If you want to install ``typedspark`` with ``pyspark``, you can run:

.. code-block:: bash

    pip install "typedspark[pyspark]"

Documentation
=============

Please see our documentation on `readthedocs <https://typedspark.readthedocs.io/en/latest/index.html>`_.

FAQ
===

| **I found a bug! What should I do?**
| Great! Please make an issue and we'll look into it.
|
| **I have a great idea to improve typedspark! How can we make this work?**
| Awesome, please make an issue and let us know!