This project is based on the results published in the article "Photosystem I gene cassettes are present in marine virus genomes" by Sharon et al. The authors identified a number of photosystem I genes in marine virus genomes by running a tBLASTx search for photosynthesis genes against the contigs of the Global Ocean Sampling (GOS) project.
The analysis in the article used publicly available data from the Global Ocean Sampling project. The sequences used were contigs, so the assembly had already been done by the GOS researchers. The data can be downloaded from the CAMERA portal. You can of course use your own data instead.
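As a rough illustration of the kind of search described above, here is a minimal Python sketch. It is not the authors' pipeline and not necessarily what the Makefile in this repo does. It builds a `tblastx` command line that writes tabular output (`-outfmt 6`), and groups the resulting hits by contig so that contigs hit by several photosynthesis genes stand out. The function names, the e-value cutoff, and any file names you pass in are assumptions made for this example.

```python
def tblastx_command(query_fasta, database, output_tsv, evalue=1e-5):
    """Build a tblastx call that writes tabular (outfmt 6) results.

    The e-value cutoff is an illustrative default, not a value from the article.
    """
    return [
        "tblastx",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", output_tsv,
    ]


def queries_per_contig(tabular_lines, max_evalue=1e-5):
    """Map each subject contig to the set of query genes that hit it.

    Expects the default 12-column outfmt 6 layout, where column 1 is the
    query id, column 2 the subject id and column 11 the e-value.
    """
    hits = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query_id, contig_id, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.setdefault(contig_id, set()).add(query_id)
    return hits
```

The command list can be passed to `subprocess.call`, and the output file can then be opened and handed to `queries_per_contig` line by line.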
Here we will try to reproduce the results from the published article.
If you want to run the analysis yourself, on the same data or on your own, you can clone this repository. You will need the following:
- Python (I use 2.7)
- A local version of the NCBI BLAST+ tools with the executables in your local PATH variable
- GNU Make
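Before running anything, it can help to confirm that the required executables are actually on your PATH. The following is a small sketch that should work on both Python 2.7 and Python 3. The list of tool names is an assumption based on the BLAST+ programs this kind of analysis typically needs (`makeblastdb`, `tblastx`) plus `make`; the Makefile may call others.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # ignore empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names, path=None):
    """Return the tools from `names` that cannot be found on the PATH."""
    return [name for name in names if find_on_path(name, path) is None]


# Assumed tool list for this example; adjust to what the Makefile uses.
REQUIRED = ["makeblastdb", "tblastx", "make"]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```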
The repository contains:

- results/: directory containing the results of experiments. You'll find experiment-specific code there.
- README file: this file
- notes: notes taken while developing this project
- Makefile: contains all commands needed to run this analysis. To run all commands, do `make all`.
- data/: directory containing data. This mainly includes a collection of photosynthesis genes that are used as the query for BLAST searches.
- Not included: the Global Ocean Sampling dataset, as these files are too large to put here. You can download them yourself from the CAMERA portal and create a symlink called `input.fasta` in the data/ folder of this repo, pointing to the Global Ocean Sampling FASTA file. The one I use contains the assembled sequences. There is a rule in the Makefile that will create a BLAST database from that file.
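The symlink and database steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the Makefile rule in this repo may use different options, and the database name you pass to `makeblastdb_command` is up to you. The `-in`, `-dbtype` and `-out` options are standard `makeblastdb` options; `nucl` is used because the GOS contigs are nucleotide sequences.

```python
import os


def link_input(fasta_path, data_dir="data", link_name="input.fasta"):
    """Create (or refresh) data/input.fasta as a symlink to the real FASTA file."""
    link_path = os.path.join(data_dir, link_name)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a stale link, never a regular file
    os.symlink(os.path.abspath(fasta_path), link_path)
    return link_path


def makeblastdb_command(fasta_path, db_name):
    """Build a makeblastdb call for a nucleotide database.

    `db_name` is a hypothetical output prefix chosen by the caller.
    """
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]
```

After calling `link_input` with the path to your downloaded file, the list returned by `makeblastdb_command` can be run with `subprocess.call`.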
git clone thisrepo
make all
View the Makefile for more documentation.
Please feel free to contribute to this project. You can track my progress, fork this repository, or propose changes if you like.