Update docs on dataset formats and contributing
Ostrzyciel committed Jul 12, 2024
1 parent 5fc1fa7 commit ded55fd
Showing 2 changed files with 5 additions and 5 deletions.
2 changes: 1 addition & 1 deletion docs/documentation/contribute.md
@@ -16,7 +16,7 @@ What datasets are we looking for?
- **Interesting thematically and relevant** – does your dataset bring an interesting new use case? Perfect! Even better if the use case has seen real-world usage.
- **Large** – the stream must be at least 10 thousand elements long, preferably more. The larger the better.

If you think your dataset fits the bill, have a look at the [dedicated guide on creating new datasets](creating-new-dataset.md). If you still have questions, don't hesitate to contact RiverBench's maintainer by [opening an issue on GitHub](https://github.com/RiverBench/RiverBench/issues/new/choose).
If you think your dataset fits the bill, have a look at the **[dedicated guide on creating new datasets](creating-new-dataset.md)**. If you still have questions, don't hesitate to contact RiverBench's maintainer by [opening an issue on GitHub](https://github.com/RiverBench/RiverBench/issues/new/choose).

## Contributing benchmark tasks

8 changes: 4 additions & 4 deletions docs/documentation/dataset-release-format.md
@@ -26,9 +26,9 @@ The flat distribution files are named `flat_{size}.nt.gz` or `flat_{size}.nq.gz`

## Stream distributions

In streaming distributions each stream element is represented by a separate file. The files are compressed in a `.tar.gz` archive. The files are sequentially named starting from `0000000000.Y`, and sequentially up to `X.Y`, where `X + 1` is the number of stream elements in the dataset, and Y is the file extension. All numbers are zero-padded to exactly ten digits. The files are in nested directories with at most 1000 files per directory, to avoid issues with some file systems and file browsers. The number of levels of directories depends on the size of the distribution, with 10K–1M distributions having one level of directories and larger distributions having two.
In streaming distributions each stream element is represented by a separate file. The files are compressed in a `.tar.gz` archive. The files are sequentially named starting from `0000000000.Y` up to `X.Y`, where `X + 1` is the number of stream elements in the dataset, and Y is the file extension. All numbers are zero-padded to exactly ten digits. The files are in nested directories with at most 1000 files per directory, to avoid issues with some file systems and file browsers. The number of levels of directories depends on the size of the distribution, with 10K–1M distributions having one level of directories and larger distributions having two.
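The file-naming rule above can be sketched as follows (assuming `.ttl` files here; the exact naming of the nested directories is not specified in this excerpt, so only the element file names are shown):

```python
def element_filename(index: int, ext: str = "ttl") -> str:
    """Name of the file holding stream element `index`, zero-padded to exactly ten digits."""
    return f"{index:010d}.{ext}"

# The last file is X.Y where X + 1 is the number of stream elements,
# so a 10K-element stream ends at element index 9999.
first = element_filename(0)       # "0000000000.ttl"
last = element_filename(9_999)    # "0000009999.ttl"
```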

The file names are sequentially numbered, and the files are stored in nested directories with at most 1000 files per directory, to avoid issues with some file systems and file browsers. The files are laid out in the archive sequentially, that is, physically the bytes files `X` and `X+1` are next to each other. This allows the package to be processed one element at a time without decompressing the entire archive. To do that, you will need a streaming decompressor / untar utility like the one in Pekko Streams ([tarReader](https://pekko.apache.org/docs/pekko-connectors/current/file.html#tar-archive), [gunzip](https://pekko.staged.apache.org/docs/pekko/current/stream/operators/Compression/gunzip.html)) or [Apache Commons Compress](https://commons.apache.org/proper/commons-compress/).
The files are laid out in the archive sequentially, that is, physically the bytes of files `X` and `X+1` are next to each other. This allows the package to be processed one element at a time without decompressing the entire archive. To do that, you will need a streaming decompressor / untar utility like the one in Pekko Streams ([tarReader](https://pekko.apache.org/docs/pekko-connectors/current/file.html#tar-archive) and [gunzip](https://pekko.staged.apache.org/docs/pekko/current/stream/operators/Compression/gunzip.html)), or [Apache Commons Compress](https://commons.apache.org/proper/commons-compress/).
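The same one-element-at-a-time processing is possible with any streaming tar reader, not just the JVM libraries mentioned above. A minimal self-contained sketch using Python's standard-library `tarfile` in streaming mode (the in-memory archive built here is a toy stand-in for a real distribution, without the nested directories):

```python
import io
import tarfile

# Build a tiny in-memory archive shaped like a stream distribution:
# ten-digit, zero-padded element files (real archives also nest
# elements in subdirectories, omitted here for brevity).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as archive:
    for i in range(3):
        payload = f"# element {i}\n".encode()
        info = tarfile.TarInfo(name=f"{i:010d}.ttl")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
buf.seek(0)

# Streaming mode ("r|gz") decompresses and yields entries strictly in
# archive order, so each element can be handled without unpacking or
# seeking through the whole archive.
elements = []
with tarfile.open(fileobj=buf, mode="r|gz") as archive:
    for entry in archive:
        if entry.isfile():
            elements.append(archive.extractfile(entry).read().decode())
# elements now holds the three serialized stream elements, in order.
```

This works precisely because of the sequential physical layout described above: the reader never needs to jump backwards in the compressed stream.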

Each element is serialized in either the [Turtle](https://www.w3.org/TR/turtle/) or the [TriG](https://www.w3.org/TR/trig/) format, depending on the stream type. In case RDF-star is used in the dataset, the used formats are [Turtle-star](https://www.w3.org/2021/12/rdf-star.html#turtle-star) or [TriG-star](https://www.w3.org/2021/12/rdf-star.html#trig-star).

@@ -89,9 +89,9 @@ The streaming distribution files are named `stream_{size}.tar.gz`, where `{size}

Jelly distributions simply use delimited `RdfStreamFrame`s to denote the individual elements in the stream. The streams are either of `TRIPLES` type (for [RDF graph streams](https://w3id.org/stax/dev/taxonomy#rdf-graph-stream)) or `QUADS` for [RDF dataset streams](https://w3id.org/stax/dev/taxonomy#rdf-dataset-stream). The resulting file is gzip-compressed.
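The text does not spell out how the delimiting works; assuming the standard protobuf convention of a base-128 varint length prefix before each message (how delimited protobuf streams are commonly framed), reading the frames can be sketched like this. Decoding the `RdfStreamFrame` payload itself would additionally need the Jelly protobuf schema:

```python
import io

def read_varint(stream):
    """Read one base-128 varint (the standard protobuf length prefix)."""
    result, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            return None  # clean end of stream
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return result
        shift += 7

def read_delimited_frames(stream):
    """Yield each length-prefixed frame as raw (undecoded) bytes."""
    while (length := read_varint(stream)) is not None:
        yield stream.read(length)

# Two toy frames: 3-byte "abc", then 200 bytes of "x" (length 200
# encodes as the two-byte varint 0xC8 0x01).
demo = io.BytesIO(b"\x03abc" + b"\xc8\x01" + b"x" * 200)
frames = list(read_delimited_frames(demo))
```

Since the distribution file is gzip-compressed, the file object would be wrapped in `gzip.open(...)` before the framing is read.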

Parsing Jelly files should be [**about 5 times faster**](https://arxiv.org/pdf/2207.04439.pdf) than the other distribution types, depending on the dataset and your hardware. Dataset sizes should be more-or-less the same when compressed, but **when uncompressed Jelly will be 3–4 times smaller**.
Parsing Jelly files should be [**about 5 times faster**](https://arxiv.org/pdf/2207.04439.pdf) than the other distribution types, depending on the dataset and your hardware. Dataset sizes should be more-or-less the same when compressed, but **when uncompressed, Jelly will be 3–4 times smaller**.

Reading Jelly files is currently supported in Apache Jena and RDF4J, using the [`jelly-jvm`](https://github.com/Jelly-RDF/jelly-jvm) library. Please refer to [Jelly's website](https://jelly-rdf.github.io/latest/jvm/) for usage examples and documentation.
Reading Jelly files is currently supported in Apache Jena and RDF4J, using the [`jelly-jvm`](https://github.com/Jelly-RDF/jelly-jvm) library. Please refer to [Jelly's website](https://jelly-rdf.github.io/jelly-jvm/) for usage examples and documentation.

## See also

