Tenaya Project Results

End Results

  1. Created a software toolkit that allows for the analysis of DNA/RNA samples from the SRA. It supports the generation of compact representations of the samples (“signatures”) that are much smaller than the original sample size yet still allow for accurate approximate comparisons. In practice, samples containing anywhere from 0.1 GB to 20 GB of data can be compressed to lossy signatures of only a few KB. Using these signatures, the tool supports clustering a collection of samples with various methods based on their pairwise similarities. Beyond this, the software facilitates simple organism searches on the SRA by scientific name. Finally, there is a primitive set of shell scripts that allow for fast, streamlined downloading of samples and for scaling the signature generation process through both multithreading and multiprocessing, maximizing compute utilization while minimizing time spent waiting on I/O.
  2. Through working on this project, I’ve learned a lot about writing efficient Java programs and actually deploying them in the AWS environment. Especially in the beginning, I had to think carefully about how to implement data structures like the Count-Min Sketch in a way that is both performant and reasonably thread-safe (a minimal sketch of such a structure follows after this list). While optimizing, I learned how to use profiling tools to monitor the application and pinpoint the slow spots. My working knowledge of the Unix command line also grew considerably, since the command line was my only way of interacting with the EC2 instances.
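
To illustrate the thread-safety concern mentioned above, here is a minimal sketch of a Count-Min Sketch for counting long-encoded k-mers. It is not the project’s actual implementation; the class name, the use of AtomicLongArray for lock-free updates, and the hash-mixing constant are all illustrative assumptions.

    import java.util.concurrent.atomic.AtomicLongArray;

    // Illustrative Count-Min Sketch over long-encoded k-mers. AtomicLongArray
    // gives lock-free increments, which is one way to keep updates reasonably
    // thread-safe without a global lock; this is not the project's actual code.
    public class KmerCountMinSketch {

        private final int depth;              // number of hash rows
        private final int width;              // counters per row
        private final AtomicLongArray counts; // depth * width counters

        public KmerCountMinSketch(int depth, int width) {
            this.depth = depth;
            this.width = width;
            this.counts = new AtomicLongArray(depth * width);
        }

        // Row-specific bucket: mix the encoded k-mer with a per-row offset.
        private int bucket(long encodedKmer, int row) {
            long h = (encodedKmer + row) * 0x9E3779B97F4A7C15L;
            h ^= (h >>> 31);
            return (int) ((h & Long.MAX_VALUE) % width);
        }

        // Increment this k-mer's counter in every row.
        public void add(long encodedKmer) {
            for (int row = 0; row < depth; row++) {
                counts.incrementAndGet(row * width + bucket(encodedKmer, row));
            }
        }

        // The estimate is the minimum over all rows: collisions can only
        // inflate a counter, so this never undercounts the true frequency.
        public long estimate(long encodedKmer) {
            long min = Long.MAX_VALUE;
            for (int row = 0; row < depth; row++) {
                min = Math.min(min, counts.get(row * width + bucket(encodedKmer, row)));
            }
            return min;
        }
    }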

Next Steps

  1. Enhance I/O functionality – Currently, the stream-based BufferedReader I/O supports reading both FASTA and FASTQ files (including gzip decompression when necessary) through a convenient, polymorphic interface (see the reader sketch after this list). The faster NIO path, on the other hand, only supports uncompressed FASTA files and uses a separate interface that integrates directly with a byte-based k-mer encoder, distinct from the char-based EncodedKmerGenerator. To support more file types, the NIO side should be expanded to handle FASTQ files and, potentially, gzipped data. This functionality should also be exposed through the same API as the BufferedReader I/O, so the latter can serve as a fallback if compression proves infeasible to implement with NIO.
  2. Subdivide deployment scripts – The deployment scripts need to be split into more independent utilities to improve usability. For example, the download.sh script both downloads the .sra files from the SRA and unpacks them into .fasta or .fastq files; the download operation should arguably be separated from the extraction operation.
  3. Increase I/O throughput with multiple reads per file – Since file I/O is usually the bottleneck in signature generation (depending on the number of threads), it could be advantageous to have multiple threads reading the same file at different offsets. These producer threads would then feed the consumer threads in a model similar to the existing partition strategy, in which the main thread currently does all of the file reading and k-mer generation and then adds the encoded k-mers to the consumer thread queues (see the pipeline sketch after this list).
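
As a concrete reference for the first item, here is a minimal sketch of the kind of stream-based fallback reader described there: a factory that auto-detects gzip by its magic bytes and always hands back a BufferedReader. The class and method names are hypothetical, not the toolkit’s real API.

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPInputStream;

    // Hypothetical reader factory: detects gzip by its two magic bytes and
    // returns a BufferedReader either way, so FASTA/FASTQ parsing code does
    // not need to care whether the input was compressed.
    public class SequenceReaders {

        public static BufferedReader open(Path path) throws IOException {
            InputStream in = new BufferedInputStream(Files.newInputStream(path));
            in.mark(2);
            int b1 = in.read();
            int b2 = in.read();
            in.reset();
            // gzip streams always start with the magic bytes 0x1F 0x8B.
            if (b1 == 0x1F && b2 == 0x8B) {
                in = new GZIPInputStream(in);
            }
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII));
        }
    }

FASTA and FASTQ record parsing could then sit on top of this, sharing one interface whether or not the NIO path is available for a given file.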
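
For the third item, the following is one possible shape for the partitioned producer/consumer pipeline: several producer threads route each encoded k-mer, by hash, to a fixed consumer queue so a given k-mer is always handled by the same consumer. KmerSource and SignatureBuilder are hypothetical placeholders standing in for the real reader and signature code; this is a sketch, not the project’s design.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical partitioned pipeline with multiple producers per file.
    public class PartitionedPipeline {

        public interface KmerSource { boolean hasNext(); long next(); }
        public interface SignatureBuilder { void add(long encodedKmer); }

        // Sentinel telling a consumer that no more k-mers are coming.
        private static final long POISON_PILL = Long.MIN_VALUE;

        public static void run(List<KmerSource> sources, List<SignatureBuilder> builders)
                throws InterruptedException, ExecutionException {
            int numConsumers = builders.size();
            List<BlockingQueue<Long>> queues = new ArrayList<>();
            for (int i = 0; i < numConsumers; i++) {
                queues.add(new ArrayBlockingQueue<>(64 * 1024));
            }
            ExecutorService pool = Executors.newFixedThreadPool(sources.size() + numConsumers);

            // Consumers: drain their own queue until the poison pill arrives.
            List<Future<?>> consumers = new ArrayList<>();
            for (int i = 0; i < numConsumers; i++) {
                BlockingQueue<Long> queue = queues.get(i);
                SignatureBuilder builder = builders.get(i);
                consumers.add(pool.submit(() -> {
                    try {
                        long kmer;
                        while ((kmer = queue.take()) != POISON_PILL) {
                            builder.add(kmer);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }));
            }

            // Producers: each reads its own slice of the input and routes by hash.
            List<Future<?>> producers = new ArrayList<>();
            for (KmerSource source : sources) {
                producers.add(pool.submit(() -> {
                    try {
                        while (source.hasNext()) {
                            long kmer = source.next();
                            int target = Math.floorMod(Long.hashCode(kmer), numConsumers);
                            queues.get(target).put(kmer);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }));
            }

            // Wait for the producers, then shut the consumers down cleanly.
            for (Future<?> f : producers) { f.get(); }
            for (BlockingQueue<Long> queue : queues) { queue.put(POISON_PILL); }
            for (Future<?> f : consumers) { f.get(); }
            pool.shutdown();
        }
    }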

Most valuable things I learned about software development

  • Unit tests are your friend
  • Premature optimization is the root of all evil
  • Deployment is difficult

Further Input

The one thing that I would have liked is some “reading material” to go through before I actually started working. I think it would have been helpful to come in with some domain-specific knowledge and a general, abstract idea of what the project actually is and how it is meant to be accomplished.
