GitHub - pierrepeterlongo/commet: Comparing and combining multiple metagenomic datasets

pierrepeterlongo / commet Public

Notifications You must be signed in to change notification settings
Fork 7
Star 15

Comparing and combining multiple metagenomic datasets

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
.github/workflows		.github/workflows
ABCDE_bench		ABCDE_bench
bin		bin
doc		doc
include		include
src		src
test_dissymmetry		test_dissymmetry
.gitignore		.gitignore
Commet.py		Commet.py
Commet_analysis.py		Commet_analysis.py
Makefile		Makefile
README		README
dendro.R		dendro.R
heatmap.r		heatmap.r
index.html		index.html

Repository files navigation

Contributors : 
  Pierre PETERLONGO, [email protected] [24/07/14]
  Nicolas MAILLET, [email protected]     [24/07/14]
  Guillaume Collet, [email protected]        [24/07/14]
  
————————————————————————————————————————————————————————————————————————————————
At a glance:
Commet is used for de novo comparisons of raw read sets. It is the successor of 
the compareads tool. 
————————————————————————————————————————————————————————————————————————————————

This software is a computer program whose purpose is to find all the similar
reads between two set of NGS reads. It also provide a similarity score between
the two samples.

Copyright (C) 2014  INRIA

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU Affero General Public License as published by the Free
Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along
with this program.  If not, see <http://www.gnu.org/licenses/>.

————————————————————————————————————————————————————————————————————————————————
REQUIREMENTS

 - C++ compiler (g++-4.9, clang-3.2…)
 - python 2.7
 - R

————————————————————————————————————————————————————————————————————————————————
INSTALLATION

Commet's applications have been written in C++. 
To compile those programs, a makefile is provided using g++:

make

If you want to install in /usr/local/bin, run with root permissions (use sudo):

make
sudo make install

Then, you can test Commet on the ABCDE_bench data by running the following
command:

python Commet.py ABCDE_bench/sets_config.txt -k 32

————————————————————————————————————————————————————————————————————————————————
USER GUIDE

For complete descriptions of modules and usage, a user guide is provided in the 
doc directory.

————————————————————————————————————————————————————————————————————————————————
SCRIPTS

Commet.py:
	- input:
		- A file containing a list of read sets (see below)
	- output:
		- a bit vector for each intersection of pairs of read sets
		- 3 matrices in CVS format
			- plain: line i, column j contains the number of reads 
			  from set i similar to at least one read of set j
			- percentage: line i, column j contains the percentage
			  of reads from set i similar to at least one read of
			  set j
			- normalized: line i, column j contains the percentage
			  of reads similar between sets i and j with respect to
			  the total number of reads in sets i and j.
		- 3 heatmaps (png format), computed from the three previously
		  mentioned matrices
		- A dendrogram (png format) of input reads obtained from the
		  normalized matrix, by hierarchical complete clusterization of
		  all input reads.
	- misc.:
		- This script needs python 2.7
		- This script can parallelize computations on SGE clusters using
		  the --sge option. In this case the analysis of the CVS
		  matrices generating the four png figures (three heatmaps and
		  one dendrogram) is performed once all jobs are finished using
		  the Commet_analysis.py script.


————————————————————————————————————————————————————————————————————————————————
LISTS OF READ SETS

Read sets may be composed of several files in different format (fasta, fastq,
gzip). To give these read sets to Commet, we use the following format:

	- Each line corresponds to a read set.
	- Each line begins with an identifier for the set directly followed by a
	  colon (:).
	- Then, each file of a set is separated by a semicolon (;).
	- A bit vector may be associated to a file by adding its name after a
	  comma (,).
	
Example:
	Name1:path_set1.1.fq.gz;path_set_1.2.fq.gz,bv1.2.bv;path_set1.3.fq;...
	Name2:path_set2.1.fq.gz;path_set_2.2.fq.gz
	Name3:path_set3.1.fq,bv3.1.bv;path_set_3.2.fq,bv3.2.bv;path_set3.3.fq.gz
	...
————————————————————————————————————————————————————————————————————————————————
APPLICATIONS SUMMARY (compiled in the 'bin' directory)

filter_reads
index_and_search
extract_reads
bvop

————————————————————————————————————————————————————————————————————————————————
FILTER_READS

  - input: 
    - A file containing reads (nucleotide sequences) in fasta or fastq format. 
      The input file may be compressed with gzip algorithm.
    - Four parameters for selection : - minimum size           [default=0]
                                      - maximum number of N    [default=any]
                                      - minimum Shannon index  [default=0]
                                      - maximum selected reads [default=all].

  - output: 
    - A file containing a vector of bits. 
      Each bit of the vector corresponds to a read in the input file. 
      If the bit is 1 then the read is selected, if 0 the read was filtered out.
	  
————————————————————————————————————————————————————————————————————————————————
INDEX_AND_SEARCH 

  - input:
    - A file read sets representing the reference (only the first set is used)
    - A file read sets representing the queries (-s) (all sets are searched)

  - output:
    - for each read file in the queries, a bit vector stores reads that share 
      enough similarity with reads from the reference.
	  
————————————————————————————————————————————————————————————————————————————————
BVOP

  - input:
    - A file containing a vector of bits.
    - A logical operation to perform (with another bit vector optionaly)

  - output:
    - A file containing the resulting bit vector.
	
————————————————————————————————————————————————————————————————————————————————
EXTRACT_READS

  - input:
      - A file containing reads (nucleotide sequences) in fasta or fastq format. 
        The input file may be compressed with gzip algorithm.
      - A file containing a vector of bits of the same size.

    - output:
      - A file containing only reads that correspond to a 1 in the bit vector