-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Welcome to the cec-dataprep wiki!
While git
is already installed, you need to install node
in a conda environment.
First, load the conda module in farm module load conda3
.
One time only, create the environment: conda create -yn cec nodejs
.
From then on, every time you login you need to load the conda3 module and do source activate cec
If you haven't already, git clone https://github.com/ucdavis/cec-dataprep
to get this project and then npm install
your dependencies.
Now create a batch file to run the npm build
and npm start
commands as well as set your env variables.
dataprep.sh
(TODO: this is an example script, real one to come soon)
#!/bin/bash -l
# Name of the job - You'll probably want to customize this.
#SBATCH -J csvimport
# Standard out and Standard Error output files with the job number in the name.
#SBATCH -o csvimport-%j.output
#SBATCH -e csvimport-%j.output
# Print the hostname for debugging purposes
hostname
# Set your variables
export DB_HOST="test.host.name"
# Run the actual work you want to do
srun echo 'hello world'
Once you have a script, the real work happens when it is batched to be run by the farm slurm system
Example runs:
sbatch -t 30 dataprep.sh
-- submit the job to be run on one node for up to 30 minutes
sbatch -N 1 -n 2 -t 30 dataprep.sh
use one node with 2 processes
sbatch --array=[1-5] -t 30 dataprep.sh
run it 5 times, once for each array value
You can monitor your submitted job with squeue -u $USER
Make the file accessible via link, possibly using Box. If you use box, you must copy the "direct download link" and not preview share link.
Download it into a subfolder of your home directory: wget -O file.csv https://url.com/
In order to copy CSV data into postgres, you'll need to run conda install -c anaconda postgresql
once. After that it's part of your environment.
Next create an import script like the following:
importpixels.sh
#!/bin/bash -l
# Name of the job - You'll probably want to customize this.
#SBATCH -J csvimport
# Standard out and Standard Error output files with the job number in the name.
#SBATCH -o csvimport-%j.output
#SBATCH -e csvimport-%j.output
hostname
srun psql postgresql://user:pass@server -c "\copy desttable FROM '~/pixels/file.csv' delimiter ',' csv header"
Then you can process the import by running
sbatch -t 30 importpixels.sh
It takes about 90 seconds to run 1 million rows between farm and a campus test server.
Once the data is in a raw import table, move it to the real table with something like this
INSERT INTO public.pixels (county, ba_0, ba_15, ba_2, ba_25, ba_35, ba_40, ba_7, basa_as, basa_ra, basa_wi, bmcwn_0, bmcwn_15, bmcwn_2, bmcwn_25, bmcwn_35, bmcwn_40, bmcwn_7, bmfol_0, bmfol_15, bmfol_2, bmfol_25, bmfol_35, bmfol_40, bmfol_7, bmstm_0, bmstm_15, bmstm_2, bmstm_25, bmstm_35, bmstm_40, bmstm_7, dbmcn_0, dbmcn_15, dbmcn_2, dbmcn_25, dbmcn_35, dbmcn_40, dbmcn_7, dbmsm_0, dbmsm_15, dbmsm_2, dbmsm_25, dbmsm_35, dbmsm_40, dbmsm_7, sng_0, sng_15, sng_2, sng_25, sng_35, sng_40, sng_7, tpa_0, tpa_15, tpa_2, tpa_25, tpa_35, tpa_40, tpa_7, vmsg_0, vmsg_15, vmsg_2, vmsg_25, vmsg_35, vmsg_40, vmsg_7, vol_15, vol_2, vol_25, vol_35, vol_40, vol_7, sit_raster, cluster1, cluster2, land_use, x, y, cluster_no, forest_type)
select county_name, ba_0, ba_15, ba_2, ba_25, ba_35, ba_40, ba_7, basa_as, basa_ra, basa_wi, bmcwn_0, bmcwn_15, bmcwn_2, bmcwn_25,
bmcwn_35, bmcwn_40, bmcwn_7, bmfol_0, bmfol_15, bmfol_2, bmfol_25, bmfol_35, bmfol_40, bmfol_7, bmstm_0, bmstm_15,
bmstm_2, bmstm_25, bmstm_35, bmstm_40, bmstm_7, dbmcn_0, dbmcn_15, dbmcn_2, dbmcn_25, dbmcn_35, dbmcn_40, dbmcn_7, dbmsm_0,
dbmsm_15, dbmsm_2, dbmsm_25, dbmsm_35, dbmsm_40, dbmsm_7, sng_0, sng_15, sng_2, sng_25, sng_35, sng_40, sng_7, tpa_0, tpa_15,
tpa_2, tpa_25, tpa_35, tpa_40, tpa_7, vmsg_0, vmsg_15, vmsg_2, vmsg_25, vmsg_35, vmsg_40, vmsg_7, vol_15, vol_2,
vol_25, vol_35, vol_40, vol_7, sit_raster, cluster1, cluster2, land_use, x, y, cluster_no, forest_type
from rawpixels
where not exists ( select * from pixels where pixels.cluster_no = rawpixels.cluster_no );
Note that column names/order/number all have to match. If tables exactly match, you don't need to specify columns. Actual format and columns may change.
Once the data is imported, you can truncate
the rawpixels import table, or leave it around. Running the insert script again won't duplicate the data because it checks for the existence of a cluster before importing. (note: we will probably want to check for more than cluster_no, including year and maybe county).