Skip to content

Latest commit

 

History

History
180 lines (124 loc) · 10 KB

README.md

File metadata and controls

180 lines (124 loc) · 10 KB

Data Science Tool for Heterogeneous Network Mining (SciNeM)


SciNeM is an open-source tool that offers a wide range of functionalities for exploring and analysing HINs and utilises Apache Spark for scaling out through parallel and distributed computation. SciNeM provides an intuitive, Web-based user interface to build and execute complex constrained metapath-based queries and to explore and visualise the corresponding results. Under the hood, all the supported state-ofthe-art HIN analysis types have been implemented in a scalable manner supporting the distributed execution of analysis tasks on computational clusters. SciNeM has a modular architecture making it easy to extend it with additional algorithms and functionalities. Currently, it supports the following operations, given a user-specified metapath: ranking entities using a random walk model, retrieving the top-𝑘 most similar pairs of entities, finding the most similar entities to a query entity, and discovering entity communities.

How to cite

@inproceedings{chatzopoulos2021scinem,
  title={SciNeM: A Scalable Data Science Tool for Heterogeneous Network Mining.},
  author={Chatzopoulos, Serafeim and Vergoulis, Thanasis and Deligiannis, Panagiotis and Skoutas, Dimitrios and Dalamagas, Theodore and Tryfonopoulos, Christos},
  booktitle={EDBT},
  pages={654--657},
  year={2021}
}

Installation on Ubuntu

  1. Determine current Ubuntu version.
    1. Use cat /etc/os-release to find current Ubuntu version name (e.g. Xenial, Bionic, Focal).
  2. Install mongodb for your current version of Ubuntu (follow instructions on this link).
  3. Ensure you have a Python 3 environment.
    • It is advised to use a virtualenv for the interface, which isolates Python package installations to a seperate environment without affecting the system's global Python environment (more information about creating and using virtual environments here).
    • To retrieve the current installed Python 3 binary's path, you can use the command which python3.
  4. Install pandas Python package, using the following command: pip install -U pandas.
  5. Install and configure Apache Spark for Hadoop 3.2+ (follow instructions on this link). Note down the directory's path where spark is installed (for the rest of the guide this directory will be referenced as [SPARK_DIR]).
  6. Create and modify a file named spark-defaults.conf under [SPARK_DIR]/conf.
    • Spark ships with a template file for spark-defaults.conf, named as spark-defaults.conf.template, which you can use to copy, with the following command cp [SPARK_DIR]/conf/spark-defaults.conf.template [SPARK_DIR]/conf/spark-defaults.conf.
    • Append/modify the following fields according to your needs/case: - spark.local.dir /path/to/a/directory/with/ample/storage/capacity - a file system path pointing to a directory where spark will save temporary files. - spark.driver.memory 4g - max memory that the cluster driver will use. - spark.memory.fraction 0.6 - fraction of heap space used for execution and storage. - spark.executor.memory 12g - heap size alloted to each cluster's executor.
  7. Install jq JSON processor using the following command: sudo apt install jq.
  8. Create two directories, on a drive with ample storage capacity, which will contain the available datasets and analysis results. Note down the file system paths for these directories (for the rest of this guide these directory paths will be referenced as [DATASETS_DIR] and [RESULTS_DIR], respectively).
  9. Place at least ONE compatible dataset to be available with SciNeM. A sample dataset can be found here.
  10. Select a directory and clone into that directory the contents of the SciNeM repository using the following command git clone --recursive https://github.com/schatzopoulos/SciNeM.git. Note down the file system path for this directory as well (for the rest of this guide this directory path will be referenced as [CODE_DIR]).
  11. Modify the contents of the Java file Constants.java under [CODE_DIR]/src/main/java/athenarc/imsi/sdl/config/.
    • Set BASE_PATH so that it points to [RESULTS_DIR]. Make sure that the string does NOT end with a slash ('/')!.
    • Set DATA_DIR so that it points to [DATASETS_DIR]. Make sure that the string DOES end with a slash ('/')!.
    • Set WORKFLOW_DIR so that it points to [CODE_DIR]/libs/SciNeM-workflows/. Make sure that the string DOES end with a slash ('/')!.
  12. To enable the changes of environment variables, made in step 5, use the following command: source ~/.profile.
  13. Redirect port 80 to 8181: sudo iptables -t nat -I PREROUTING --src 0/0 --dst 0/0 -p tcp --dport 80 -j REDIRECT --to-ports 8181

Development

Clone

git clone --recursive https://github.com/schatzopoulos/SciNeM.git

Dependencies

Before you can build this project, you must install and configure the following dependencies on your machine:

  1. Node.js: We use Node to run a development web server and build the project. Depending on your system, you can install Node either from source or as a pre-packaged bundle.

After installing Node, you should be able to run the following command to install development tools. You will only need to run this command when dependencies change in package.json.

npm install

We use npm scripts and Webpack as our build system.

Run the following commands in two separate terminals to create a blissful development experience where your browser auto-refreshes when files change on your hard drive.

./mvnw
npm start

Npm is also used to manage CSS and JavaScript dependencies used in this application. You can upgrade dependencies by specifying a newer version in package.json. You can also run npm update and npm install to manage dependencies. Add the help flag on any command to see how you can use it. For example, npm help update.

The npm run command will list all of the scripts available to run for this project.

Managing dependencies

For example, to add Leaflet library as a runtime dependency of your application, you would run following command:

npm install --save --save-exact leaflet

To benefit from TypeScript type definitions from DefinitelyTyped repository in development, you would run following command:

npm install --save-dev --save-exact @types/leaflet

Then you would import the JS and CSS files specified in library's installation instructions so that Webpack knows about them: Note: There are still a few other things remaining to do for Leaflet that we won't detail here.

For further instructions on how to develop with JHipster, have a look at Using JHipster in development.

Building for production

Packaging as jar

To build the final jar and optimize the SpOT application for production, run:

./mvnw -Pprod clean verify

This will concatenate and minify the client CSS and JavaScript files. It will also modify index.html so it references these new files. To ensure everything worked, run:

java -jar target/*.jar

Then navigate to http://localhost:8080 in your browser.

Refer to Using JHipster in production for more details.

Packaging as war

To package your application as a war in order to deploy it to an application server, run:

./mvnw -Pprod,war clean verify

Using Docker

You can use Docker to simplify deployment. For a minimal wrapper to set up SciNeM in a single Docker stack deployment follow the instructions here