Releases: ldbc/ldbc_snb_datagen_spark
v0.5.1
What's Changed
- Factorgen: Make factor tables deterministic regardless of the degree of parallelism by @szarnyasg in #423
- Factorgen: Add temporal factor tables (PersonDay, PersonKnowsPersonDay, PersonStudyAtUniversityDay, PersonWorkAtCompany)
Full Changelog: v0.5.0...v0.5.1
v0.5.0
- Scalability improvements. SF30k works, SF100k likely works (untested): #382 #411
- Factor generation for both Interactive and BI: #386 #391 #392 #397 #400 #415 #416
- Build moved to SBT: #409
- Add option to use epoch millis for datetime values: #401
- Bugfixes, e.g. #388 #406
See our recent blogpost for more details: https://ldbcouncil.org/post/ldbc-snb-datagen-the-winding-path-to-sf100k/
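As an illustration of the epoch-millis option listed above, here is a minimal sketch of converting between a UTC datetime and milliseconds since the Unix epoch. The function names are hypothetical and are not part of Datagen; the sketch only shows what the two representations of a value such as creationDate look like.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding of the millisecond part.
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000

def from_epoch_millis(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=millis * 1000)

# Hypothetical creationDate value with millisecond precision.
creation_date = datetime(2012, 9, 13, 10, 30, 0, 123000, tzinfo=timezone.utc)
millis = to_epoch_millis(creation_date)
print(creation_date.isoformat(timespec="milliseconds"), "<->", millis)
```

Epoch millis are compact and sort numerically, while formatted datetimes are easier to read when inspecting the generated files.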
v0.4.0
This is the first Datagen release with Spark.
Execution environments
- Both Spark 2 and 3 are supported.
- The generator can be run in a Docker container (for tests and small data sets), on a Spark cluster, and in cloud-based Spark implementations.
- We provide scripts for AWS EMR. We used these to generate data sets up to scale factor 30,000.
Data and parameter generation
- The generator produces a temporal graph where entities can be both inserted (creationDate) and deleted (deletionDate). It supports three serialization modes:
  - Raw mode: generates the entire temporal graph with the creationDate and deletionDate properties included for each dynamic entity. (Not intended for a benchmark but to be used for experiments where custom data sets are required.)
  - BI mode: generates an initial data set and daily batches of deletions and insertions. To be used with the LDBC SNB Business Intelligence workload.
  - Interactive mode (incomplete): does not take deletions into account. Generates an initial data set. Does not yet generate update streams. See ldbc/ldbc_snb_interactive_v1_impls#173 for the plans to use the new Datagen for SNB Interactive.
- Supports producing factor tables.
- This release does not yet have a parameter generator. It will be added in later releases.
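To make the relationship between raw mode and BI mode more concrete, the following is a rough sketch of how a raw temporal graph could be split into an initial data set plus daily insert and delete batches. It is an illustration under stated assumptions, not Datagen's implementation: the cutoff date, the `Entity` class, and the `split_bi` function are hypothetical, and these release notes do not specify how the generator chooses the boundary or lays out its batches.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Entity:
    id: int
    creation_date: date
    deletion_date: Optional[date]  # None means the entity is never deleted

def split_bi(entities, cutoff):
    """Split raw temporal entities into an initial snapshot and daily batches.

    Assumption: everything created before `cutoff` and still alive at `cutoff`
    belongs to the initial data set; later events are grouped by day.
    """
    initial = []
    inserts = defaultdict(list)  # day -> ids inserted on that day
    deletes = defaultdict(list)  # day -> ids deleted on that day
    for e in entities:
        deleted_before_cutoff = e.deletion_date is not None and e.deletion_date < cutoff
        if e.creation_date < cutoff:
            if deleted_before_cutoff:
                continue  # created and deleted before the cutoff: never visible
            initial.append(e.id)
        else:
            inserts[e.creation_date].append(e.id)
        if e.deletion_date is not None and e.deletion_date >= cutoff:
            deletes[e.deletion_date].append(e.id)
    return initial, dict(inserts), dict(deletes)

raw = [
    Entity(1, date(2012, 1, 1), None),
    Entity(2, date(2012, 1, 5), date(2012, 2, 1)),
    Entity(3, date(2012, 3, 2), date(2012, 3, 4)),
    Entity(4, date(2012, 1, 2), date(2012, 3, 2)),
]
initial, inserts, deletes = split_bi(raw, cutoff=date(2012, 3, 1))
print(initial, inserts, deletes)
```

In this sketch, raw mode corresponds to the `raw` list itself, with both dates kept on every entity, while the three return values approximate the shape of BI mode output.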
SIGMOD 2014 Programming Contest
This version is the closest to the one used in the 2014 SIGMOD Programming Contest.
v0.3.3
Early 2014 Datagen
This version is close to the one used in the 2014 SIGMOD Programming Contest.
v0.3.2
Datagen version that conforms to the LDBC SNB specification v0.3.2 released on arXiv.
v0.2.8
v0.2.7
- Added gscale option which allows specifying the size of the generated dataset based on the graphalytics scaling metric.
- Fixed a bug which caused wrong serialization of data once the numPartitions parameter was larger than 1.
- Integrated Codacy and fixed numerous coding style issues
- Updated license header and added it to those files where it was missing
- Fixed a bug causing the distribution of population to be wrong. Added a test for this
- Fixed a bug causing the distribution of posts per country to be wrong. Added a test for this
- Fixed several bugs related to date generation of dependent events. Added a test for this
v0.2.6
- Added draft version of BI parameter generation
- Added tests for data integrity (ids are valid, unique, etc.)
- Added tests for interactive workload parameter bindings
- Added tests for proper sorting of update streams
- Added a test script that automatically tests determinism in the pseudo-distributed and distributed execution modes. This script must be executed manually, outside the testing framework
- Improved performance of activity generation.
- Restructured the execution flow within activity generation to better suit the Java compiler
- Improved factor storage, releasing used memory when factors are no longer needed. This makes activity generation scale close to linearly with the network's size
- Improved performance of sorting update streams and serialization
- Added the option to turn on/off printing the endline separator in CSV serializers
- Added the option to override the way weights of edges are computed
- Added the option to override the way dates and datetimes are formatted
- Added the graphalytics extended serializer including weights and timestamps
- Added the option to override the way text of posts and comments is generated
- Improved tunable clustering coefficient edge generation
- Integrated parameter generation within Java execution
- Added the option to enable/disable the sorting of persons prior to serialization (Enabled by default)
- Improved the way exceptions are handled
- Critical bug fixes
- Fixed a bug that caused corrupted data when a reducer failed to execute and Hadoop retried its execution on another node
- Fixed a bug with factors generation, causing some queries not to produce valid parameters
- Fixed a bug causing message lengths to go beyond the maximum size of 2000 characters
- Fixed bugs in the TTL serializer