Merge pull request #1 from DerwenAI/propose-open-standard
proposed standard
ceteri authored Oct 2, 2022
2 parents bf13042 + 8757eda commit f836bd7
Showing 5 changed files with 96 additions and 21 deletions.
80 changes: 67 additions & 13 deletions README.md
@@ -1,22 +1,33 @@
# pynock

+The following describes a proposed standard `NOCK` for a Parquet
+format that supports efficient distributed serialization of multiple
+kinds of graph technologies.

This library `pynock` provides Examples for working with low-level
Parquet read/write efficiently in Python.

-Our intent is to serialize graphs which align the data representations
-required for multiple areas of popular graph technologies:
+Our intent is to serialize graphs in a way which aligns the data
+representations required for popular graph technologies and related
+data sources:

-* semantic graphs (e.g., W3C)
+* semantic graphs (e.g., W3C formats RDF, TTL, JSON-LD, etc.)
* labeled property graphs (e.g., openCypher)
* probabilistic graphs (e.g., PSL)
-* edge lists (e.g., NetworkX)
+* spreadsheet import/export (e.g., CSV)
+* dataframes (e.g., Pandas, Dask, Spark, etc.)
+* edge lists (e.g., NetworkX, cuGraph, etc.)
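
To make the alignment concrete, here is a minimal sketch of one tiny
graph moving between three of these representations. The node names and
column labels are illustrative only; see `FORMAT.md` for the actual
proposed schema:

```python
import networkx as nx
import pandas as pd

# a tiny two-edge graph, expressed three ways
edges = [("dat:ex0", "dat:ex1"), ("dat:ex1", "dat:ex2")]

graph = nx.DiGraph(edges)  # edge list, as a NetworkX directed graph
df = pd.DataFrame(edges, columns=["src_name", "dst_name"])  # dataframe
df.to_csv("edges.csv", index=False)  # spreadsheet import/export
```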

-This approach also supports distributed partitions based on Parquet
-which can scale to very large (+1 T node) graphs.
+This approach also supports efficient distributed partitions based on
+Parquet, which can scale on a cluster to very large (+1 T node) graphs.
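
For a rough sense of what this looks like at the Parquet layer, here is
a sketch using `pyarrow` directly; the file name and columns are
illustrative, not the `pynock` API:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# one partition of a larger edge list; each partition is an
# independent Parquet file, so cluster workers can read and
# write them in parallel
table = pa.table({
    "src_name": ["dat:ex0", "dat:ex1"],
    "edge_id": [0, 1],
    "dst_name": ["dat:ex1", "dat:ex2"],
})

pq.write_table(table, "part-00000.parq")
assert pq.read_table("part-00000.parq").num_rows == 2
```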

-For details about the formatting required in Parquet files, see the
+For details about the proposed format in Parquet files, see the
[`FORMAT.md`](https://github.com/DerwenAI/pynock/blob/main/FORMAT.md)
-page.
+file.

+If you have questions, suggestions, or bug reports, please open
+[an issue](https://github.com/DerwenAI/pynock/issues)
+on our public GitHub repo.


## Caveats
@@ -37,7 +48,9 @@ no guarantees regarding correct behaviors on other versions.

The Parquet file formats depend on Arrow 5.0.x or later.

-For the Python dependencies, see the `requirements.txt` file.
+For the Python dependencies, the library versioning info is listed in the
+[`requirements.txt`](https://github.com/DerwenAI/pynock/blob/main/requirements.txt)
+file.


## Set up
@@ -63,17 +76,17 @@ python3 -m pip install -r requirements.txt
To run examples from CLI:

```
-python3 example.py load-parq --file dat/recipes.parq --debug
+python3 cli.py load-parq --file dat/recipes.parq --debug
```

```
-python3 example.py load-rdf --file dat/tiny.ttl --save-csv foo.csv
+python3 cli.py load-rdf --file dat/tiny.ttl --save-csv foo.csv
```

For further information:

```
-python3 example.py --help
+python3 cli.py --help
```

## Usage programmatically in Python
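
As a minimal sketch of the workflow, assuming a `Partition` can be
constructed with just a partition id (see `tiny.py` for the full,
canonical construction of a small graph):

```python
import cloudpathlib
from pynock import Partition, Node, Edge

# assumption: a default-constructible Partition with a `part_id` field
part = Partition(part_id=0)

# ... populate the partition with Node and Edge objects ...

# render the partition as a dataframe, then serialize it
df = part.to_df()
print(df.head())
part.save_file_csv(cloudpathlib.AnyPath("foo.csv"))
```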
@@ -100,8 +113,31 @@ _Towards Data Science_ (2020-06-25)

A `nock` is the English word for the end of an arrow opposite its point.

+If you must have an acronym, the proposed standard `NOCK` stands for
+**N**etwork **O**bjects for **C**onsistent **K**nowledge.
+
+Also, the library name had minimal namespace collisions on GitHub and
+PyPI :)


+## Developer updates
+
+To set up the build environment locally, also run:
+```
+python3 -m pip install -U pip setuptools wheel
+python3 -m pip install -r requirements-dev.txt
+```
+
+Note that we require the use of [`pre-commit` hooks](https://pre-commit.com/);
+to configure them locally:
+
+```
+pre-commit install
+git config --local core.hooksPath .git/hooks/
+```


-## Package Release
+## Package releases

First, verify that `setup.py` will run correctly for the package
release process:
@@ -111,3 +147,21 @@ python3 -m pip install -e .
python3 -m pytest tests/
python3 -m pip uninstall pynock
```

+Next, update the semantic version number in `setup.py`, create a
+release on GitHub, and make sure to update the local repo:
+
+```
+git stash
+git checkout main
+git pull
+```
+
+Make sure that you have set up 2FA for generating an API token on
+PyPI: <https://pypi.org/manage/account/token/>
+
+Then run our PyPI push script:
+
+```
+./bin/push_pypi.sh
+```
example.py → cli.py
File renamed without changes.
15 changes: 11 additions & 4 deletions pynock/pynock.py
@@ -454,6 +454,15 @@ def iter_gen_rows (
edge_id += 1


+    def to_df (
+        self,
+        ) -> pd.DataFrame:
+        """
+Represent the partition as a DataFrame.
+        """
+        return pd.DataFrame([row for row in self.iter_gen_rows()])
+
+
    def save_file_parquet (
        self,
        save_parq: cloudpathlib.AnyPath,
@@ -463,8 +472,7 @@ def save_file_parquet (
        """
Save a partition to a Parquet file.
        """
-        df = pd.DataFrame([row for row in self.iter_gen_rows()])
-        table = pa.Table.from_pandas(df)
+        table = pa.Table.from_pandas(self.to_df())
        writer = pq.ParquetWriter(save_parq.as_posix(), table.schema)
        writer.write_table(table)
        writer.close()
@@ -479,8 +487,7 @@ def save_file_csv (
        """
Save a partition to a CSV file.
        """
-        df = pd.DataFrame([row for row in self.iter_gen_rows()])
-        df.to_csv(save_csv.as_posix(), index=False)
+        self.to_df().to_csv(save_csv.as_posix(), index=False)


    def save_file_rdf (
17 changes: 13 additions & 4 deletions setup.py
@@ -1,5 +1,5 @@
"""
-Package set up, used for CI testing.
+Package set up.
"""

import pathlib
@@ -9,13 +9,21 @@


DESCRIP = """
-Examples for low-level Parquet read/write in Python
+A proposed standard `NOCK` for a Parquet format that supports efficient
+distributed serialization of multiple kinds of graph technologies.
""".strip()

KEYWORDS = [
"CSV",
"Parquet",
"RDF",
"dataframe",
"graph data science",
"knowledge graph",
"parquet",
"openCypher",
"serialization",
"spreadsheet",
"open standard",
]


@@ -40,12 +48,13 @@ def parse_requirements_file (filename: str) -> typing.List[ str ]:
if __name__ == "__main__":
    setuptools.setup(
        name = "pynock",
-        version = "1.0.0",
+        version = "1.0.1",
        license = "MIT",

        python_requires = ">=3.8",
        install_requires = parse_requirements_file("requirements.txt"),
        packages = setuptools.find_packages(exclude=[
+            "bin",
            "dat",
            "tests",
            "venv",
5 changes: 5 additions & 0 deletions tiny.py
@@ -6,6 +6,7 @@
programmatically, based on the graph described in `dat/tiny.rdf`
"""

+from icecream import ic
import cloudpathlib

from pynock import Partition, Node, Edge
@@ -124,3 +125,7 @@
part.save_file_rdf(cloudpathlib.AnyPath("foo.rdf"), "ttl")

# check the files "foo.*" to see what was constructed programmatically
+# also, here's a dataframe representation
+df = part.to_df()
+ic(df.head())
