diff --git a/docs/documentation/contribute.md b/docs/documentation/contribute.md
index d4d2badc4..4d09a76d4 100644
--- a/docs/documentation/contribute.md
+++ b/docs/documentation/contribute.md
@@ -16,7 +16,7 @@ What datasets are we looking for?
 - **Interesting thematically and relevant** – does your dataset bring an interesting new use case? Perfect! Even better if the use case has seen real-world usage.
 - **Large** – the stream must be at least 10 thousand elements long, preferably more. The larger the better.
 
-If you think your dataset fits the bill, have a look at the [dedicated guide on creating new datasets](creating-new-dataset.md). If you still have questions, don't hesitate to contact RiverBench's maintainer by [opening an issue on GitHub](https://github.com/RiverBench/RiverBench/issues/new/choose).
+If you think your dataset fits the bill, have a look at the **[dedicated guide on creating new datasets](creating-new-dataset.md)**. If you still have questions, don't hesitate to contact RiverBench's maintainer by [opening an issue on GitHub](https://github.com/RiverBench/RiverBench/issues/new/choose).
 
 ## Contributing benchmark tasks
 
diff --git a/docs/documentation/dataset-release-format.md b/docs/documentation/dataset-release-format.md
index d66e361ea..0f9d75a76 100644
--- a/docs/documentation/dataset-release-format.md
+++ b/docs/documentation/dataset-release-format.md
@@ -26,9 +26,9 @@ The flat distribution files are named `flat_{size}.nt.gz` or `flat_{size}.nq.gz`
 
 ## Stream distributions
 
-In streaming distributions each stream element is represented by a separate file. The files are compressed in a `.tar.gz` archive. The files are sequentially named starting from `0000000000.Y`, and sequentially up to `X.Y`, where `X + 1` is the number of stream elements in the dataset, and Y is the file extension. All numbers are zero-padded to exactly ten digits. The files are in nested directories with at most 1000 files per directory, to avoid issues with some file systems and file browsers. The number of levels of directories depends on the size of the distribution, with 10K–1M distributions having one level of directories and larger distributions having two.
+In streaming distributions, each stream element is represented by a separate file. The files are compressed in a `.tar.gz` archive. The files are sequentially named starting from `0000000000.Y` up to `X.Y`, where `X + 1` is the number of stream elements in the dataset, and `Y` is the file extension. All numbers are zero-padded to exactly ten digits. The files are in nested directories with at most 1000 files per directory, to avoid issues with some file systems and file browsers. The number of levels of directories depends on the size of the distribution, with 10K–1M distributions having one level of directories and larger distributions having two.
 
-The file names are sequentially numbered, and the files are stored in nested directories with at most 1000 files per directory, to avoid issues with some file systems and file browsers. The files are laid out in the archive sequentially, that is, physically the bytes files `X` and `X+1` are next to each other. This allows the package to be processed one element at a time without decompressing the entire archive. To do that, you will need a streaming decompressor / untar utility like the one in Pekko Streams ([tarReader](https://pekko.apache.org/docs/pekko-connectors/current/file.html#tar-archive), [gunzip](https://pekko.staged.apache.org/docs/pekko/current/stream/operators/Compression/gunzip.html)) or [Apache Commons Compress](https://commons.apache.org/proper/commons-compress/).
+The files are laid out in the archive sequentially, that is, physically the bytes of files `X` and `X+1` are next to each other. This allows the package to be processed one element at a time without decompressing the entire archive. To do that, you will need a streaming decompressor / untar utility like the ones in Pekko Streams ([tarReader](https://pekko.apache.org/docs/pekko-connectors/current/file.html#tar-archive) and [gunzip](https://pekko.staged.apache.org/docs/pekko/current/stream/operators/Compression/gunzip.html)), or [Apache Commons Compress](https://commons.apache.org/proper/commons-compress/).
 
 Each element is serialized in either the [Turtle](https://www.w3.org/TR/turtle/) or the [TriG](https://www.w3.org/TR/trig/) format, depending on the stream type. In case RDF-star is used in the dataset, the used formats are [Turtle-star](https://www.w3.org/2021/12/rdf-star.html#turtle-star) or [TriG-star](https://www.w3.org/2021/12/rdf-star.html#trig-star).
 
@@ -89,9 +89,9 @@ The streaming distribution files are named `stream_{size}.tar.gz`, where `{size}
 
 Jelly distributions simply use delimited `RdfStreamFrame`s to denote the individual elements in the stream. The streams are either of `TRIPLES` type (for [RDF graph streams](https://w3id.org/stax/dev/taxonomy#rdf-graph-stream)) or `QUADS` for [RDF dataset streams](https://w3id.org/stax/dev/taxonomy#rdf-dataset-stream). The resulting file is gzip-compressed.
 
-Parsing Jelly files should be [**about 5 times faster**](https://arxiv.org/pdf/2207.04439.pdf) than the other distribution types, depending on the dataset and your hardware. Dataset sizes should be more-or-less the same when compressed, but **when uncompressed Jelly will be 3–4 times smaller**.
+Parsing Jelly files should be [**about 5 times faster**](https://arxiv.org/pdf/2207.04439.pdf) than the other distribution types, depending on the dataset and your hardware. Dataset sizes should be more-or-less the same when compressed, but **when uncompressed, Jelly will be 3–4 times smaller**.
 
-Reading Jelly files is currently supported in Apache Jena and RDF4J, using the [`jelly-jvm`](https://github.com/Jelly-RDF/jelly-jvm) library. Please refer to [Jelly's website](https://jelly-rdf.github.io/latest/jvm/) for usage examples and documentation.
+Reading Jelly files is currently supported in Apache Jena and RDF4J, using the [`jelly-jvm`](https://github.com/Jelly-RDF/jelly-jvm) library. Please refer to [Jelly's website](https://jelly-rdf.github.io/jelly-jvm/) for usage examples and documentation.
 
 ## See also
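To make the element-at-a-time processing described in the stream distributions hunk concrete, here is a minimal sketch using Apache Commons Compress and Apache Jena, both mentioned in the text. The file name `stream_10K.tar.gz` follows the `stream_{size}.tar.gz` convention quoted above; `Lang.TURTLE` assumes a graph stream without RDF-star (a dataset stream would use `Lang.TRIG` instead):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class StreamDistributionReader {
    public static void main(String[] args) throws Exception {
        // Gunzip and untar on the fly: the archive is never fully decompressed,
        // so memory use stays constant regardless of the distribution size.
        try (InputStream file = new BufferedInputStream(new FileInputStream("stream_10K.tar.gz"));
             GzipCompressorInputStream gzip = new GzipCompressorInputStream(file);
             TarArchiveInputStream tar = new TarArchiveInputStream(gzip)) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) continue; // skip the nested directory entries
                // The tar stream signals end-of-stream at the end of the current
                // entry, so this reads exactly one stream element (one Turtle file).
                byte[] element = tar.readAllBytes();
                Model model = ModelFactory.createDefaultModel();
                RDFDataMgr.read(model, new ByteArrayInputStream(element), Lang.TURTLE);
                System.out.println(entry.getName() + ": " + model.size() + " triples");
            }
        }
    }
}
```

Because the entries are laid out sequentially, this loop visits the elements in stream order without any seeking, which is exactly what the archive layout is designed for.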
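A corresponding sketch for the Jelly distribution, assuming jelly-jvm's Jena integration exposes the `JellyLanguage.JELLY` RIOT language as described on Jelly's website; the file name `jelly_10K.jelly.gz` is illustrative:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFDataMgr;

import eu.ostrzyciel.jelly.convert.jena.riot.JellyLanguage;

public class JellyDistributionReader {
    public static void main(String[] args) throws Exception {
        // The Jelly distribution is a gzipped sequence of delimited
        // RdfStreamFrames; the Jena integration consumes the frames transparently.
        try (InputStream in = new GZIPInputStream(new BufferedInputStream(
                new FileInputStream("jelly_10K.jelly.gz")))) {
            Model model = ModelFactory.createDefaultModel();
            // Parses the whole stream into one model; see Jelly's website for
            // frame-by-frame (element-by-element) processing.
            RDFDataMgr.read(model, in, JellyLanguage.JELLY);
            System.out.println("Parsed " + model.size() + " triples");
        }
    }
}
```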