NoSQL and multi-model systems address the so-called variety of big data, i.e., they enable data to be represented in different (or combined) logical models and formats. In addition to the relational model, we also distinguish, e.g., hierarchical (document) and graph data; hence, the scope and efficiency of querying over these representations differ.
The goal of this thesis is to focus on various data models (e.g., document, column, graph, relational) and to compare typical query languages for these data models in terms of expressive power and run-time efficiency, taking into account scalability with varying volumes of stored data.
We will analyze the current state of knowledge. Then, we will select six candidates for comparison and statically compare the expressive power of the supported query languages. Next, based on the static comparison, we will propose query scenarios and perform a dynamic query comparison over the selected database systems using the proposed queries. Finally, we will suggest appropriate recommendations or discuss open questions and challenges in querying over various data models.
The top-level repository structure contains folders for all databases used in the thesis. Since we use Docker to set up all DBMS containers, `docker-compose.yml` defines the services, their volumes, and the configurations used for testing.
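To get a quick overview of what the compose file defines, you can ask Docker Compose to print the merged configuration (a small convenience sketch; it assumes you run it from the repository root, and `docker-compose` can be used instead of the `docker compose` plugin, as elsewhere in this README):

```bash
# List the service names defined in docker-compose.yml
docker compose config --services

# Print the fully resolved configuration (volumes, ports, environment, ...)
docker compose config
```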
The repository is structured as follows (with some important files and folders highlighted):
- `common/` - common files and scripts for all databases
  - `docs/` - extra documentation files (e.g. schema diagrams)
  - `thesis-dummy-data-generator/` - TypeScript NPM project for generating pseudo-random relational data using Faker.js and the Scramjet framework
  - `count_table_rows.sql` - for MySQL and SQLite
  - `filter_results.ipynb` - Jupyter Notebook for filtering and processing experiment results
  - `results.csv` - a combined CSV file containing all results from the experiments, with the header `db,record_volume,query,iteration,time_in_seconds`
  - `db_sizes.csv` - a CSV file containing database sizes in MB for each database system and record volume
  - `filtered_results.csv` - post-processed CSV file
- `mysql/` - MySQL database files and scripts
  - `data/` - configuration files mounted to the MySQL container
  - `dumps/` - MySQL SQL and/or CSV dumps copied from the generated `common/thesis-dummy-data-generator/data_<entity_count>*` folder
  - `exports/` - contains (denormalized) data exported from the MySQL database (later imported into Cassandra)
  - `queries/` - SQL queries for MySQL
    - `testing/` - individual files with individual queries used for testing
    - `query.sql` - complete list of queries and their descriptions (check comments for more details)
  - `schema.sql` - complete schema of the MySQL database (check comments for more details)
- `sqlite/` - SQLite database files and scripts (similar to the MySQL structure, so only differences are mentioned)
  - `data/` - database file (`ecommerce.db`) mounted to the SQLite container
  - `sqlite3-analyzer` - binary for analyzing the SQLite database and generating statistics about its tables; called with `sqlite3-analyzer data/ecommerce.db` outside the container, from inside the `sqlite/` folder
- `neo4j/` - Neo4j database files and scripts
  - `arangodb-json-transform/` - NPM package for transforming exported JSON data from Neo4j to ArangoDB JSON format
  - `dumps/` - Neo4j database dumps (entire database files in compressed binary format)
  - `exports/` - exported JSON documents from Neo4j (later transformed to ArangoDB JSON format)
    - `edges/` - each file inside this folder contains JSON Lines of edges (relationships) generated by `export_to_json.sh`
    - `nodes/` - each file inside this folder contains JSON Lines of nodes (vertices) generated by `export_to_json.sh`
  - `neo4j-etl-tool/` - Neo4j ETL tool files for importing data from CSV files (tested with version 1.6.0)
    - `neo4-etl-cli-1.6.0/` - release files downloaded from the Neo4j ETL GitHub repository (see the download links below)
      - `bin/neo4j-etl` - binary for running the ETL tool
  - `queries/` - Cypher queries for Neo4j
    - `query.cypher` - complete list of queries and their descriptions (check comments for more details)
- `arangodb/` - ArangoDB database files and scripts
  - `dumps/` - ArangoDB database dumps (copied from the exported `neo4j/exports` folder)
  - `queries/` - AQL queries for ArangoDB
    - `query.js` - definition of the AQL queries and their descriptions (check comments for more details)
    - `query_testing.js` - entire testing script with error handling for running AQL queries and generating logs
- `cassandra/` - Cassandra database files and scripts
  - `data/` - configuration file `cassandra.yaml` mounted to the Cassandra container
  - `dsbulk/` - DataStax Bulk Loader (dsbulk) JAR file for loading and counting data in Cassandra (tested with version 1.11.0)
  - `dumps/` - Cassandra CQL dumps or CSV files copied from the exported `mysql/exports` folder
  - `queries/` - CQL queries for Cassandra
    - `query.cql` - complete list of queries and their descriptions (check comments for more details)
  - `schema.cql` - complete schema of the Cassandra database (check comments for more details)
- `mongodb/` - MongoDB database files and scripts
  - `dumps/` - MongoDB database dumps (generated with `dumps.sh`)
  - `queries/` - MongoDB queries
    - `query.js` - complete list of queries and their descriptions (check comments for more details)
    - `query_testing.js` - entire testing script with error handling, similar to ArangoDB
  - `relational-migrator/` - installation and configuration files for MongoDB Relational Migrator
    - `mongodb-relational-migrator_1.5.0_amd64.deb` - installation file for Debian-based systems (NOTE: Git LFS is required to download this file!)
    - `ecommerce-mapping*.relmig` - configuration files used to set up the MongoDB Relational Migrator tool (tested with version 1.5.0)
Folders and files common to most of the database directories:
- `logs/` - all logs generated from scripts
  - `queries/` - logs from query testing invoked by `run_queries.sh`
- `queries/` - queries for each database system
  - `testing/` - individual files with individual queries used for testing
- `stats/` - final statistics generated from queries, imports, and exports (later used for analysis)
- `dumps/` - database dumps or exported data from other databases (used for importing data)
Some scripts are common to all databases, while others are specific to a particular DBMS. Some, such as `init.sh` and `run_queries.sh`, require manual editing of variables to select the required data directories (see the NOTES inside the scripts before execution!). Unless explicitly stated otherwise (see script comments), scripts must be run inside the respective Docker containers, i.e. with `docker exec -it query-languages-analysis-thesis-<service_name>-1 bash` or `docker exec -it query-languages-analysis-thesis-<service_name>-1 sh`.
Common scripts:
- `/run_queries.sh` - convenience script for running all queries for the selected databases (in parallel)
- `/init.sh` - convenience script for initializing all selected databases (creating schemas, importing data, etc., in parallel)
- `/<dbms>/run_queries.sh`, `/<dbms>/init.sh` - individual scripts for running queries or initializing a single database (see comments for more details)
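A typical invocation might look like this (the `mysql` service name is just an illustrative choice; check the NOTES inside the scripts first):

```bash
# From the repository root: run all queries for a selected database system
./run_queries.sh mysql

# Or run the per-database script directly inside its container
docker exec -it query-languages-analysis-thesis-mysql-1 bash ./run_queries.sh
```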
Notable scripts and flows:
- `common/thesis-dummy-data-generator/run.sh` - run with `./run.sh <entity_count>` to generate CSV files, with `<entity_count>` as the reference count for top-level entities
- `mysql`
  - Initialization scripts:
    - `csv_init.sh` - initialize with CSV files as the source of data
    - `init.sh` - initialize with an SQL dump as the source of data (older method used with small datasets - see Legacy generator)
  - `cassandra_export_to_csv.sh` - transform MySQL tables into denormalized CSV files for importing into Cassandra
- `neo4j`
  - `neo4j-etl-tool/import.sh` - automatically run the Neo4j ETL tool CLI and subsequently import the data using the `neo4j-admin database import full` command
  - `export_to_json.sh` - export Neo4j vertices and relationships to JSON Lines format
  - `arangodb_transform_json.sh` - transform the exported JSON Lines to ArangoDB JSON format
- `cassandra`
  - Initialization scripts:
    - `dsbulk_init.sh` - initialize Cassandra with CSV files as the source of data
    - `init.sh` - initialize with a CQL dump as the source of data (older method used with small datasets - see Legacy generator) - NOTE: very slow for large datasets
  - `count_table_rows.sh` - count rows in each table in Cassandra using `dsbulk count` (see the sketch after this list)
  - `table_stats.sh` - generate statistics for each table in Cassandra using `nodetool tablestats`
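For illustration, the Cassandra helper scripts boil down to invocations roughly like the following (a minimal sketch; the keyspace and table names are assumptions, and the exact `dsbulk` wrapper invocation may differ - the authoritative commands are in `count_table_rows.sh` and `table_stats.sh`):

```bash
# Count rows of a single table with the DataStax Bulk Loader
# (keyspace "ecommerce" and table "orders" are illustrative names)
dsbulk count -k ecommerce -t orders -h localhost

# Per-table statistics (disk usage, partition counts, latencies, ...)
nodetool tablestats ecommerce
```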
Tools and software required for running the experiments:
- Docker (tested with v25.0.3)
- Docker Compose (tested with v2.24.5)
- Node.js (v20.8+)
- NPM
- Java (tested with OpenJDK 11.0.22)
Tools that need to be downloaded and installed manually (for reproducibility, the binaries and installation files are already included in the respective folders):
- Neo4j ETL Tool (v1.6.0) - download link: https://github.com/neo4j-contrib/neo4j-etl/releases/tag/1.6.0
- DataStax Bulk Loader (dsbulk) (v1.11.0) - download link: https://github.com/datastax/dsbulk/releases/tag/1.11.0
- MongoDB Relational Migrator (v1.5.0) - download link: https://migrator-installer-repository.s3.ap-southeast-2.amazonaws.com/index.html#1.5.0/
Tools and software required for filtering the results:
- Python 3
- Pandas, NumPy
- Jupyter Notebook
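One possible way to set up the result-filtering environment (an illustrative sketch; any Python 3 environment with these packages works, and the notebook path is taken from the repository structure above):

```bash
# Install the analysis dependencies (a virtualenv or conda env works just as well)
pip install pandas numpy jupyter

# Open the filtering notebook
jupyter notebook common/filter_results.ipynb
```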
- Clone the repository:
  ```bash
  git clone https://github.com/corovcam/Query-Languages-Analysis-Thesis.git
  cd Query-Languages-Analysis-Thesis
  ```
- If not already done, download and install the required tools and software and place them in the respective folders. `mongodb/relational-migrator/mongodb-relational-migrator_1.5.0_amd64.deb` should be installed on the host machine using any Debian package manager (e.g. `sudo apt install ./mongodb-relational-migrator_1.5.0_amd64.deb`).
- Run the generator script to generate dummy data for the MySQL and SQLite databases:
  ```bash
  cd common/thesis-dummy-data-generator
  ./run.sh <entity_count>
  ```
  where `<entity_count>` is the number of entities to generate (e.g. 1000, 4000, 128000, etc.)
- Copy the generated data to the MySQL and SQLite directories:
  ```bash
  cp -r data_<entity_count>_<timestamp> ../mysql/dumps/data_<entity_count>
  cp -r data_<entity_count>_<timestamp> ../sqlite/dumps/data_<entity_count>
  ```
- Run the following command in the root project directory to build and start the Docker container(s):
  ```bash
  docker-compose up -d --build <service_name>
  ```
- For most databases (starting with SQLite and MySQL - see ETL Flow; a condensed sketch of the whole flow follows these setup steps), check the `init.sh` script for initializing the database, change the required variable parameters, and then run the script inside the respective Docker container:
  ```bash
  docker exec -it query-languages-analysis-thesis-<service_name>-1 bash ./init.sh
  ```
  or, if bash is not available:
  ```bash
  docker exec -it query-languages-analysis-thesis-<service_name>-1 sh ./init.sh
  ```
  - NOTE: For MySQL and SQLite, you can use the `csv_init.sh` script to initialize the database with CSV files as the source of data (CSV is the default output of the generator - see `common/thesis-dummy-data-generator/data-generator-old.js` for the legacy import that generated SQL dumps, used up until the 256k record volume experiments - see Legacy generator).
  - NOTE: For Cassandra, you can use the `dsbulk_init.sh` script to initialize the database with CSV files as the source of data.
- For Cassandra, you must generate the CSV files from the MySQL data using the `cassandra_export_to_csv.sh` script:
  - Export the MySQL query outputs to CSV files (as denormalized tables for the Cassandra import):
    ```bash
    docker exec -it query-languages-analysis-thesis-mysql-1 bash ./cassandra_export_to_csv.sh
    ```
    - NOTE: The script will generate CSV files in the `exports/` folder, where each CSV represents one table in the Cassandra database.
  - Copy the exported CSV files to the Cassandra container:
    ```bash
    cp -r exports/ ../cassandra/dumps/data_<entity_count>
    ```
  - Initialize the Cassandra database using the `dsbulk_init.sh` script (inside the Cassandra container):
    ```bash
    docker exec -it query-languages-analysis-thesis-cassandra-1 bash ./dsbulk_init.sh
    ```
    - NOTE: Change the required variable parameters in the script before running it!
    - NOTE: You can also use the `init.sh` script to initialize the database with a CQL dump as the source of data (generated by the legacy generator script - see Legacy generator).
- For MongoDB, you must proceed according to the official documentation (https://www.mongodb.com/docs/relational-migrator/installation/install-on-a-local-machine/install-ubuntu/).
  - After installation, run the binary:
    ```bash
    cd /opt/mongodb-relational-migrator/bin
    ./mongodb-relational-migrator
    ```
    - NOTE: The tool starts a GUI accessible at http://localhost:8278/
  - You must configure the tool using the provided configuration files, i.e. `mongodb/relational-migrator/ecommerce-mapping*.relmig`.
    - NOTE: `ecommerce-mapping.relmig` was used for the < 256k record volume experiments and `ecommerce-mapping-modified-orders-types-persons.relmig` for the 256k+ record volume experiments.
  - Import the config file by following the official guide (https://www.mongodb.com/docs/relational-migrator/projects/import-project/).
  - Run the Sync Job by following the official guide (https://www.mongodb.com/docs/relational-migrator/jobs/creating-jobs/).
    - NOTE: Both the MySQL and MongoDB containers must be running before running the sync job.
    - NOTE: Use the default settings for the sync job. Use `root` as the username and password for MySQL and leave the defaults for MongoDB.
    - NOTE: "Drop destination collections before migration" must be checked to avoid conflicts with existing data!
  - Check the import stats and logs for any errors or warnings.
- For Neo4j, you can use the `neo4j-etl-tool/import.sh` script to automatically run the ETL tool and import data into Neo4j.
  - Run the script inside the Neo4j container:
    ```bash
    docker exec -it query-languages-analysis-thesis-neo4j-1 bash
    cd neo4j-etl-tool
    ./import.sh
    ```
    - NOTE: The MySQL and Neo4j containers must be running before running the script. (Conversely, the Neo4j database instance must be stopped before running `neo4j-admin` commands - use `neo4j stop` inside the Neo4j container.)
    - NOTE: After running the script, you must manually start the Neo4j database instance using `neo4j start` inside the Neo4j container, or restart the container, to proceed.
- For ArangoDB, you must do the following:
  - Export the Neo4j data to JSON format using the `export_to_json.sh` script (inside the Neo4j container):
    ```bash
    docker exec -it query-languages-analysis-thesis-neo4j-1 bash ./export_to_json.sh
    ```
    - NOTE: The script will generate two directories, `nodes/` and `edges/`, in the `exports` folder, containing JSON Lines files for each node and edge type.
  - Transform the exported JSON data to ArangoDB JSON format using `arangodb_transform_json.sh` (outside the Neo4j container):
    ```bash
    cd neo4j
    ./arangodb_transform_json.sh
    ```
    - NOTE: The script will generate the `exports/transformed_<timestamp>/` directory containing JSON Lines files for each node and edge type in ArangoDB format.
  - Copy the transformed data to the ArangoDB container:
    ```bash
    cp -r exports/transformed_<timestamp> ../arangodb/dumps/data_<entity_count>
    ```
  - Initialize the ArangoDB database using the `init.sh` script (inside the ArangoDB container):
    ```bash
    docker exec -it query-languages-analysis-thesis-arangodb-1 sh ./init.sh
    ```
    - NOTE: Change the required variable parameters in the script before running it!
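For orientation, the setup steps above form the following ETL flow (a condensed, commented sketch only - paths, service names, and the MongoDB GUI step are simplified, so follow the detailed steps above when actually running it):

```bash
# 1. Generate the relational CSV data used as the source for all systems
cd common/thesis-dummy-data-generator && ./run.sh <entity_count>

# 2. Initialize MySQL and SQLite from the generated CSVs (csv_init.sh inside each container)

# 3. MySQL -> Cassandra: export denormalized CSVs, then bulk-load them
#    mysql container:     ./cassandra_export_to_csv.sh
#    cassandra container: ./dsbulk_init.sh

# 4. MySQL -> Neo4j: ETL tool + neo4j-admin import
#    neo4j container: cd neo4j-etl-tool && ./import.sh

# 5. Neo4j -> ArangoDB: export JSON Lines, transform, then import
#    neo4j container:    ./export_to_json.sh
#    host (neo4j/):      ./arangodb_transform_json.sh
#    arangodb container: ./init.sh

# 6. MySQL -> MongoDB: Relational Migrator sync job (GUI, see the MongoDB steps above)
```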
The legacy generator script is used to generate SQL dumps for the MySQL and SQLite databases and CQL dumps for the Cassandra database. It was used to generate data for the experiments with record volumes up to and including 256k. The script is located in the `common/thesis-dummy-data-generator` folder and is named `data-generator-old.js`. To run the script, use the following command:

```bash
cd common/thesis-dummy-data-generator
./run-old.sh <entity_count> <sql_dump_dir> <cql_dump_dir>
```

where `<entity_count>` is the number of entities to generate (e.g. 1000, 4000, 128000, 256000), and `<sql_dump_dir>` and `<cql_dump_dir>` are the directories where the generated SQL and CQL dumps will be saved.
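A purely hypothetical invocation (the dump directories below are illustrative placeholders, not paths mandated by the scripts):

```bash
# Generate 256k top-level entities and write the SQL/CQL dumps to temporary folders
./run-old.sh 256000 /tmp/sql_dumps /tmp/cql_dumps
```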
- Run the following command in the root project directory to start the Docker container(s):
  ```bash
  docker-compose up -d <service_name>
  ```
- Either run `run_queries.sh <service_name>` in the root project directory to run all queries for the selected database systems, or run the individual scripts for each database system inside the respective Docker container:
  ```bash
  docker exec -it query-languages-analysis-thesis-<service_name>-1 bash ./run_queries.sh
  ```
  or, if bash is not available:
  ```bash
  docker exec -it query-languages-analysis-thesis-<service_name>-1 sh ./run_queries.sh
  ```
  - NOTE: Change the required variable parameters in the script before running it!
- After running the queries, you can find the results in the `logs/queries/` folder for each database system.
This project is licensed under the MIT License - see the LICENSE file for details.