
AWS Records

Ryan Brott edited this page Jul 7, 2016 · 16 revisions

Records

Group   Downloaded   Extracted   Processed
1-20    yes          yes         yes
21-40   yes          yes         yes
41-60   yes          yes         yes

Key

  • Group = record numbers
  • Downloaded = downloaded .sra files
  • Extracted = extracted .fasta files from the .sra files
  • Processed = created signatures from the .fasta files

Scripts

  1. install.sh: designed to download the SRA toolkit and ascp. (note: you may need to manually add the SRA toolkit to the path)
  2. download.sh: designed for parallel downloading from the SRA. Usage: reads identifiers line by line from standard input and writes the downloaded .fasta files to the directory given as the first argument. (note: set the environment variable TENAYA_HOME to choose the directory that holds cached .sra files from previous downloads)
  3. process.sh: designed for parallel processing of .fasta format data. Usage: process.sh takes three values: files, a comma-separated list of .fasta file names; groups, the number of parallel processes to run; and threads, the number of threads to use. (note: the thread count and the file count should both be divisible by groups to allow for even segmentation; tenaya.jar must also be present in the current working directory)
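
The divisibility note for process.sh can be illustrated with a small sketch. This is not the code from process.sh; the function name split_into_groups is hypothetical, and it only shows one way a comma-separated file list could be cut into equal chunks, one chunk per parallel process.

```shell
#!/bin/sh
# Hypothetical helper (not taken from process.sh): split a comma-separated
# file list into <groups> equal chunks and print one chunk per line.
split_into_groups() {
  # $1 = comma-separated file names, $2 = number of groups
  printf '%s\n' "$1" | awk -F',' -v groups="$2" '
    {
      # Refuse uneven splits, mirroring the divisibility note above.
      if (groups <= 0 || NF % groups != 0) {
        print "file count " NF " is not divisible by groups " groups > "/dev/stderr"
        exit 1
      }
      per_group = NF / groups
      # Join files with commas and end the line after every per_group files.
      for (i = 1; i <= NF; i++)
        printf "%s%s", $i, (i % per_group == 0 ? "\n" : ",")
    }'
}
```

For example, split_into_groups "a.fasta,b.fasta,c.fasta,d.fasta" 2 prints two lines with two files each, while a list of three files with 2 groups is rejected with a non-zero exit status.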

Generation args

-M 10000000000 -k 20 -c 1 -m partition -b 1048576 -q 10000 -t <threads>

Commands

  1. aws configure
  2. aws s3 cp necessary files
  3. scp scripts and JARs
  4. Run scripts/install.sh
  5. Download using scripts/download.sh
  6. Process using java -Xmx20g -jar <tenaya.jar location> generate [args]
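
Step 6 and the generation args can be combined as in the sketch below. The function names generation_args and generate_command are hypothetical, and the sketch only prints the command rather than running it, since how input files are passed to generate is not stated here.

```shell
#!/bin/sh
# Hypothetical helpers: assemble the step 6 command from the generation args.
generation_args() {
  # $1 = number of threads, substituted for <threads>
  printf '%s' "-M 10000000000 -k 20 -c 1 -m partition -b 1048576 -q 10000 -t $1"
}

generate_command() {
  # $1 = tenaya.jar location, $2 = number of threads
  # Prints the command only (dry run); nothing is executed.
  printf 'java -Xmx20g -jar %s generate %s\n' "$1" "$(generation_args "$2")"
}
```

Calling generate_command ./tenaya.jar 8 prints the full java invocation with -t 8, which can be reviewed before it is run by hand.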

Get file list: cat records.txt | head -n 10 | sed 's/^/\/media\/ephemeral0\/tenaya\/data\//g' | sed 's/$/.fasta/g' | sed ':a;N;$!ba;s/\n/,/g'
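
The pipeline above takes the first 10 record identifiers, prefixes each with the data directory, appends .fasta, and joins the lines with commas. A roughly equivalent sketch follows, assuming a hypothetical function name file_list and using paste for the join instead of the sed loop (the one-line `:a;N;$!ba` form is, as far as I know, a GNU sed idiom that may not work with other sed implementations).

```shell
#!/bin/sh
# Hypothetical helper: build the comma-separated .fasta list for process.sh.
file_list() {
  # $1 = records file, $2 = number of records to take, $3 = data directory
  # Assumes the directory name contains no "|" character (used as sed delimiter).
  head -n "$2" "$1" |
    sed "s|^|$3/|; s|\$|.fasta|" |  # prefix the directory, append the extension
    paste -sd, -                    # join all lines with commas
}
```

Usage, with the paths from the command above: file_list records.txt 10 /media/ephemeral0/tenaya/data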
