Releases · ldbc/ldbc_snb_datagen_spark

03 Nov 20:01

szarnyasg

v0.5.1

2459f4e

v0.5.1 Latest

What's Changed

Factorgen: Make factor tables deterministic regardless of the degree of parallelism by @szarnyasg in #423
Factorgen: Add temporal factor tables (PersonDay, PersonKnowsPersonDay, PersonStudyAtUniversityDay, PersonWorkAtCompany)

Full Changelog: v0.5.0...v0.5.1

Contributors

szarnyasg

16 Sep 16:42

szarnyasg

v0.5.0

250da28

v0.5.0

Scalability improvements. SF30k works, SF100k likely works (untested): #382 #411
Factor generation for both Interactive and BI: #386 #391 #392 #397 #400 #415 #416
Build moved to SBT: #409
Add option to use epoch millis for datetime values: #401
Bugfixes, e.g. #388 #406
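
To illustrate what "epoch millis for datetime values" means in practice, here is a minimal sketch in Python of converting between a UTC timestamp and milliseconds since the Unix epoch. The function names `to_epoch_millis` and `from_epoch_millis` are hypothetical helpers for illustration only; they are not part of Datagen, and the release notes do not specify how the option is named or implemented.

```python
from datetime import datetime, timedelta, timezone

# Reference point: the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding on large timestamps.
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


# Hypothetical creationDate value, used only to demonstrate the round trip.
creation_date = datetime(2012, 9, 16, 16, 42, 0, 123000, tzinfo=timezone.utc)
millis = to_epoch_millis(creation_date)
round_trip = from_epoch_millis(millis)
```

An integer representation like this is typically cheaper to parse and compare than a formatted datetime string, at the cost of being less readable in the raw files.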

See our recent blogpost for more details: https://ldbcouncil.org/post/ldbc-snb-datagen-the-winding-path-to-sf100k/

04 Dec 22:56

szarnyasg

v0.4.0

aacc4af

v0.4.0

This is the first Datagen release with Spark.

Execution environments

Both Spark 2 and 3 are supported.
The generator can be run in a Docker container (for tests and small data sets), on a Spark cluster, and in cloud-based Spark implementations.
We provide scripts for AWS EMR. We used these to generate data sets up to scale factor 30,000.

Data and parameter generation

The generator produces a temporal graph where entities can be both inserted (creationDate) and deleted (deletionDate). It supports three serialization modes:
- Raw mode: generates the entire temporal graph with the creationDate and deletionDate properties included for each dynamic entity. (Not intended for a benchmark but to be used for experiments where custom data sets are required.)
- BI mode: generates an initial data set and daily batches of deletions and insertions. To be used with the LDBC SNB Business Intelligence workload.
- Interactive mode (incomplete): does not take deletions into account. Generates an initial data set. Does not yet generate update streams. See ldbc/ldbc_snb_interactive_v1_impls#173 for the plans to use the new Datagen for SNB Interactive.
The generator also supports producing factor tables.
This release does not yet have a parameter generator. It will be added in later releases.
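
The relationship between the raw temporal graph and the BI mode output can be sketched roughly as follows. This is an illustrative Python model under simplifying assumptions, not Datagen's actual implementation: the `Entity` class, the `split_for_bi` function, and the `cutoff` parameter are all hypothetical, and the real generator may treat edge cases (for example, entities created and deleted within the same batch) differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical dynamic entity carrying the two temporal properties."""
    id: int
    creation_date: datetime
    deletion_date: Optional[datetime] = None


def split_for_bi(entities, cutoff):
    """Split raw temporal entities into an initial snapshot and daily batches.

    Assumption: everything that exists at `cutoff` goes into the initial
    data set; later creations and deletions are grouped by calendar day.
    """
    initial = []
    inserts = defaultdict(list)  # day -> ids inserted on that day
    deletes = defaultdict(list)  # day -> ids deleted on that day
    for e in entities:
        if e.creation_date < cutoff:
            if e.deletion_date is not None and e.deletion_date < cutoff:
                continue  # created and deleted before the snapshot: never visible
            initial.append(e.id)
        else:
            inserts[e.creation_date.date()].append(e.id)
        if e.deletion_date is not None and e.deletion_date >= cutoff:
            deletes[e.deletion_date.date()].append(e.id)
    return initial, dict(inserts), dict(deletes)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


raw = [
    Entity(1, utc(2011, 6, 1)),                   # alive at cutoff, never deleted
    Entity(2, utc(2011, 7, 1), utc(2011, 12, 1)),  # gone before cutoff
    Entity(3, utc(2011, 8, 1), utc(2012, 1, 2)),   # in snapshot, deleted later
    Entity(4, utc(2012, 1, 2)),                   # inserted after cutoff
]
initial, inserts, deletes = split_for_bi(raw, cutoff=utc(2012, 1, 1))
```

In this model, raw mode corresponds to keeping `raw` as-is with both date properties, while BI mode corresponds to the `initial` list plus the per-day `inserts` and `deletes` batches.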

15 Oct 10:07

szarnyasg

sigmod2014contest

17c39d4

SIGMOD 2014 Programming Contest Pre-release

This version is the closest to the one used in the 2014 SIGMOD Programming Contest.

23 Jul 22:44

szarnyasg

v0.3.3

334e9fe

v0.3.3

Follow-up to the v0.3.2 release with identical functionality but with correct version numbers in the Maven artifacts.

25 Apr 14:54

szarnyasg

early2014

6edcb22

Early 2014 Datagen Pre-release

This version is close to the one used in the 2014 SIGMOD Programming Contest.

28 Feb 22:58

szarnyasg

v0.3.2

e6213b5

v0.3.2

Datagen version that conforms to the LDBC SNB specification v0.3.2 released on arXiv.

18 Jan 17:03

ArnauPrat

v0.2.8

db5e4f3

v0.2.8

Added CompositeSerializer
Renamed query parameter files to easily differentiate between the Interactive and BI ones
Various bug fixes

17 Oct 06:52

ArnauPrat

v0.2.7

d3bd63b

v0.2.7

Added the gscale option, which allows specifying the size of the generated dataset based on the Graphalytics scaling metric
Fixed a bug that caused incorrect serialization of data when the numPartitions parameter was larger than 1
Integrated Codacy and fixed numerous coding style issues
Updated the license header and added it to files where it was missing
Fixed a bug causing the distribution of population to be wrong. Added a test for this
Fixed a bug causing the distribution of posts per country to be wrong. Added a test for this
Fixed several bugs related to date generation of dependent events. Added a test for this

20 Jun 12:25

ArnauPrat

v0.2.6

c34bf5c

v0.2.6

Added a draft version of BI parameter generation
Added tests for data integrity (ids are valid, unique, etc.)
Added tests for interactive workload parameter bindings
Added tests for proper sorting of update streams
Added a test script for automatically testing determinism in pseudo-distributed and distributed execution modes. This script must be executed manually, outside the testing framework
Improved performance of activity generation.
Changed the execution flow within activity generation to make it easier for the Java compiler to optimize
Improved factor storage, releasing used memory when factors are no longer needed. This makes activity generation scale close to linearly with the network's size
Improved performance of sorting update streams and serialization
Added the option to turn on/off printing the endline separator on csv serializers
Added the option to override the way weights of edges are computed
Added the option to override the way dates and datetimes are formatted
Added the graphalytics extended serializer including weights and timestamps
Added the option to override the way text of posts and comments is generated
Improved tunable clustering coefficient edge generation
Integrated parameter generation within java execution
Added the option to enable/disable the sorting of persons prior to serialization (Enabled by default)
Improved the way Exceptions are handled
Critical bug fixes:
- Fixed a bug that caused corrupted data when a reducer failed to execute and Hadoop retried its execution on another node
- Fixed a bug in factor generation that caused some queries not to produce valid parameters
- Fixed a bug causing message lengths to exceed the maximum size of 2000 characters
- Fixed bugs in the TTL serializer
