Skip to content
Scott Kirkland edited this page Sep 22, 2020 · 9 revisions

Welcome to the cec-dataprep wiki!

Running on Farm

Setup your environment

While git is already installed, you need to install node in a conda environment.

First, load the conda module in farm module load conda3.

One time only, create the environment: conda create -yn cec nodejs.

From then on, every time you login you need to load the conda3 module and do source activate cec

Get ORSM data

You need to download the OSRM data onto farm and put it in your home directory before the project will run. Use wget to download.

Run the project

If you haven't already, git clone https://github.com/ucdavis/cec-dataprep to get this project and then npm install your dependencies. Run npm run build to build the javascript files from the typescript files in the repo. You'll have to rebuild every time you update from source.

Now create a batch file to run the main job over NUM_CLUSTERS clusters.

dataprep.sh

#!/bin/bash -l

# Name of the job - You'll probably want to customize this.
#SBATCH -J jdataprep

# Print the hostname for debugging purposes
hostname

# Set your variables
export DB_HOST=""
export DB_USER=""
export DB_PASS=""
export DB_NAME="cecdss"
export OSRM_FILE="/home/postit/osrm-data/california-latest.osrm"
# export DEBUG=knex:tx
export NUM_CLUSTERS=1250

# Run the actual work you want to do
srun node ./cec-dataprep/index.js

Once you have a script, the real work happens when it is batched to be run by the farm slurm system

Example runs:

sbatch -t 30 dataprep.sh -- submit the job to be run on one node for up to 30 minutes sbatch -N 1 -n 2 -t 30 dataprep.sh use one node with 2 processes sbatch --array=[1-5] -t 30 dataprep.sh run it 5 times, once for each array value

You can monitor your submitted job with squeue -u $USER

Pre-processing the data

Get the CSV file(s)

Make the file accessible via link, possibly using Box. If you use box, you must copy the "direct download link" and not preview share link.

Download it into a subfolder of your home directory: wget -O file.csv https://url.com/

Install postgres tools

In order to copy CSV data into postgres, you'll need to run conda install -c anaconda postgresql once. After that it's part of your environment.

Create an import script

Next create an import script like the following:

importpixels.sh

#!/bin/bash -l

# Name of the job - You'll probably want to customize this.
#SBATCH -J csvimport

# Standard out and Standard Error output files with the job number in the name.
#SBATCH -o csvimport-%j.output
#SBATCH -e csvimport-%j.output

hostname

srun psql postgresql://user:pass@server -c "\copy desttable FROM '~/pixels/file.csv' delimiter ',' csv header"

Import some CSV data

Then you can process the import by running

sbatch -t 30 importpixels.sh

It takes about 90 seconds to run 1 million rows between farm and a campus test server.

Process the imported data

The https://github.com/ucdavis/cec-dataprep/blob/master/sql/db_tables.sql file has all the scripts needed to create the table structures. On each run of a new CSV you'll want to clear out the csvpixels table.

Once the data is in a raw import table, move it to the real table with the INSERT INTO PIXELS script found at https://github.com/ucdavis/cec-dataprep/blob/master/sql/db_scripts.sql. You'll also want to run the INSERT INTO CLUSTERS script if this CSV includes new clusters.

Once the data is imported, you can truncate the rawpixels import table, or leave it around. Running the insert script again won't duplicate the data because it checks for the existence of a cluster before importing. (note: we will probably want to check for more than cluster_no, including year and maybe county).

Clone this wiki locally