Releases: ldbc/ldbc_snb_datagen_spark

v0.5.1

03 Nov 20:01
2459f4e

What's Changed

  • Factorgen: Make factor tables deterministic regardless of the degree of parallelism by @szarnyasg in #423
  • Factorgen: Add temporal factor tables (PersonDay, PersonKnowsPersonDay, PersonStudyAtUniversityDay, PersonWorkAtCompany)
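
To illustrate what "deterministic regardless of the degree of parallelism" can mean for a factor table, here is a minimal Python sketch. It is not Datagen's implementation: the function name `person_day_counts`, the round-robin partitioning, and the sample events are all assumptions made for illustration. The idea shown is that per-partition partial counts are merged and the result is sorted by key, so the output does not depend on how many partitions were used.

```python
from collections import Counter
from datetime import date


def person_day_counts(events, num_partitions):
    """Count events per (person_id, day) using a hypothetical partitioned layout."""
    # Round-robin split stands in for however a cluster would distribute the rows.
    partitions = [events[i::num_partitions] for i in range(num_partitions)]

    # Each partition is counted independently, then the partial counts are merged.
    total = Counter()
    for partition in partitions:
        total.update(Counter(partition))

    # Sorting by key fixes the row order, so the table is identical
    # for any number of partitions.
    return sorted(total.items())


# Hypothetical sample data: (person_id, day) pairs.
events = [
    (1, date(2012, 1, 1)),
    (2, date(2012, 1, 1)),
    (1, date(2012, 1, 1)),
    (1, date(2012, 1, 2)),
]

table_serial = person_day_counts(events, num_partitions=1)
table_parallel = person_day_counts(events, num_partitions=3)
```

In this sketch, `table_serial` and `table_parallel` hold the same rows in the same order, which is the property the release note describes.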

Full Changelog: v0.5.0...v0.5.1

v0.5.0

16 Sep 16:42
  • Scalability improvements. SF30k works, SF100k likely works (untested): #382 #411
  • Factor generation for both Interactive and BI: #386 #391 #392 #397 #400 #415 #416
  • Build moved to SBT: #409
  • Add option to use epoch millis for datetime values: #401
  • Bugfixes, e.g. #388 #406

See our recent blogpost for more details: https://ldbcouncil.org/post/ldbc-snb-datagen-the-winding-path-to-sf100k/
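
As a rough illustration of the epoch-millis option mentioned above, the following Python sketch converts between a datetime and milliseconds since the Unix epoch. The helper names `to_epoch_millis` and `from_epoch_millis` are hypothetical and do not come from Datagen, and treating naive datetimes as UTC is an assumption of this sketch, not a documented Datagen behavior.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Return whole milliseconds since the Unix epoch for a datetime."""
    if dt.tzinfo is None:
        # Assumption for this sketch: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(millis):
    """Return the UTC datetime for a millisecond epoch timestamp."""
    return EPOCH + timedelta(milliseconds=millis)


created = datetime(2012, 9, 16, 16, 42, 0, 123000, tzinfo=timezone.utc)
millis = to_epoch_millis(created)
restored = from_epoch_millis(millis)
```

Storing datetimes as integers like this keeps millisecond precision and sidesteps format and time zone parsing on the consumer side, at the cost of human readability.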

v0.4.0

04 Dec 22:56

This is the first Datagen release with Spark.

Execution environments

  • Both Spark 2 and 3 are supported.
  • The generator can be run in a Docker container (for tests and small data sets), on a Spark cluster, and in cloud-based Spark implementations.
  • We provide scripts for AWS EMR. We used these to generate data sets up to scale factor 30,000.

Data and parameter generation

  • The generator produces a temporal graph where entities can be both inserted (creationDate) and deleted (deletionDate). It supports three serialization modes:
    • Raw mode: generates the entire temporal graph with the creationDate and deletionDate properties included for each dynamic entity. (Not intended for a benchmark but to be used for experiments where custom data sets are required.)
    • BI mode: generates an initial data set and daily batches of deletions and insertions. To be used with the LDBC SNB Business Intelligence workload.
    • Interactive mode (incomplete): does not take deletions into account. Generates an initial data set. Does not yet generate update streams. See ldbc/ldbc_snb_interactive_v1_impls#173 for the plans to use the new Datagen for SNB Interactive.
  • Supports producing factor tables.
  • This release does not yet have a parameter generator. It will be added in later releases.
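
The relationship between the raw temporal graph and the BI mode's "initial data set plus daily batches" can be sketched as follows. This is a simplified illustration under stated assumptions, not Datagen's actual logic: the function `split_bi`, the dictionary layout, the `cutoff` parameter, and the sample entities are all hypothetical, and the real generator may apply different rules.

```python
from collections import defaultdict
from datetime import datetime


def split_bi(entities, cutoff):
    """Split raw temporal entities into an initial snapshot and daily batches.

    Each entity is a dict with "id", "creationDate", and "deletionDate"
    (None if the entity is never deleted). Hypothetical layout for this sketch.
    """
    initial = []
    inserts = defaultdict(list)  # day -> ids created on that day
    deletes = defaultdict(list)  # day -> ids deleted on that day

    for entity in entities:
        created = entity["creationDate"]
        deleted = entity["deletionDate"]

        if created < cutoff:
            # Part of the snapshot only if it still exists at the cutoff.
            if deleted is None or deleted >= cutoff:
                initial.append(entity["id"])
        else:
            inserts[created.date()].append(entity["id"])

        # Deletions at or after the cutoff go into the batch for their day.
        if deleted is not None and deleted >= cutoff:
            deletes[deleted.date()].append(entity["id"])

    return initial, dict(inserts), dict(deletes)


cutoff = datetime(2012, 11, 29)
raw = [
    {"id": "A", "creationDate": datetime(2010, 1, 1), "deletionDate": None},
    {"id": "B", "creationDate": datetime(2010, 1, 1), "deletionDate": datetime(2011, 1, 1)},
    {"id": "C", "creationDate": datetime(2010, 6, 1), "deletionDate": datetime(2012, 12, 1, 10)},
    {"id": "D", "creationDate": datetime(2012, 11, 30, 8), "deletionDate": None},
    {"id": "E", "creationDate": datetime(2012, 11, 30, 9), "deletionDate": datetime(2012, 12, 1)},
]

initial, inserts, deletes = split_bi(raw, cutoff)
```

In this sketch, entity B never appears in the output because it is both created and deleted before the cutoff, while E shows up in an insert batch and later in a delete batch.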

SIGMOD 2014 Programming Contest

15 Oct 10:07
Pre-release

This version is the closest to the one used in the 2014 SIGMOD Programming Contest.

v0.3.3

23 Jul 22:44

Follow-up to the v0.3.2 release, with identical functionality but correct version numbers in the Maven artifacts.

Early 2014 Datagen

25 Apr 14:54
Pre-release

This version is close to the one used in the 2014 SIGMOD Programming Contest.

v0.3.2

28 Feb 22:58
e6213b5

Datagen version that conforms to the LDBC SNB specification v0.3.2 released on arXiv.

v0.2.8

18 Jan 17:03
db5e4f3
  • Added CompositeSerializer
  • Renamed query parameter files to make it easy to tell the Interactive and BI ones apart
  • Various bug fixes

v0.2.7

17 Oct 06:52
  • Added the gscale option, which allows specifying the size of the generated dataset
    based on the Graphalytics scaling metric.

  • Fixed a bug that caused data to be serialized incorrectly when the numPartitions
    parameter was larger than 1.

  • Integrated Codacy and fixed numerous coding style issues

  • Updated license header and added it to those files where it was missing

  • Fixed a bug causing the distribution of population to be wrong. Added a test
    for this.

  • Fixed a bug causing the distribution of posts per country to be wrong. Added a
    test for this.

  • Fixed several bugs related to date generation of dependent events. Added a
    test for this.

v0.2.6

20 Jun 12:25
  • Added a draft version of BI parameter generation
  • Added tests for data integrity (ids are valid, unique, etc.)
  • Added tests for interactive workload parameter bindings
  • Added tests for the proper sorting of update streams
  • Added a test script for automatically testing determinism in the pseudo-distributed and distributed execution modes. This script must be executed manually, outside the testing framework
  • Improved performance of activity generation.
  • Changed the execution flow within activity generation to better suit the Java compiler
  • Improved factor storage, releasing used memory when factors are no longer needed. This makes activity generation scale close to linearly with respect to the network's size
  • Improved performance of sorting update streams and serialization
  • Added the option to turn on/off printing the end-of-line separator in CSV serializers
  • Added the option to override the way weights of edges are computed
  • Added the option to override the way dates and datetimes are formatted
  • Added the Graphalytics extended serializer including weights and timestamps
  • Added the option to override the way text of posts and comments is generated
  • Improved tunable clustering coefficient edge generation
  • Integrated parameter generation within Java execution
  • Added the option to enable/disable the sorting of persons prior to serialization (enabled by default)
  • Improved the way exceptions are handled
  • Critical bug fixes:
    • Fixed a bug that caused corrupted data when a reducer failed to execute and Hadoop retried its execution on another node
    • Fixed a bug in factor generation that caused some queries not to produce valid parameters
    • Fixed a bug causing message lengths to exceed the maximum size of 2000 characters
    • Fixed bugs in the ttl serializer