MetaFX (METAgenomic Feature eXtraction) is a toolkit for feature construction and classification of metagenomic samples.
The idea behind MetaFX is to introduce a feature extraction algorithm specific to metagenomic short-read data. It is capable of processing hundreds of samples of 1-10 Gb each. The distinctive property of the suggested approach is the construction of meaningful features, which can not only be used to train a classification model but can also be further annotated and biologically interpreted.
This version of MetaFX has been deprecated since 23.05.2023 and was moved to https://github.com/ctlab/metafx_old.
A new version with many more supported features and improvements is available. Consider trying it!
MetaFX documentation is available on the GitHub wiki page.
Here is a short version of it.
- JRE 1.8 or higher
- python3
- python libraries for classification problem:
Should you choose to build contigs via third-party SPAdes software, please follow their installation instructions (not recommended for first-time use).
To run MetaFX you need the scripts from the bin/ folder. Consider adding it to the PATH variable. The main script to run is metafx.sh.
Scripts have been tested under Ubuntu 18.04 LTS, but should generally work on Linux/MacOS.
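To make the scripts in bin/ callable by name, the folder can be prepended to PATH for the current shell session. The following is a minimal sketch; METAFX_HOME is a hypothetical variable introduced here for illustration (it is not used by MetaFX itself) and should point to the directory that contains bin/.

```shell
# METAFX_HOME is a hypothetical variable: the directory that contains bin/.
# It defaults to the current directory if it is not set.
METAFX_HOME="${METAFX_HOME:-$PWD}"

# Prepend bin/ to PATH for the current shell session only.
export PATH="$METAFX_HOME/bin:$PATH"

# Check whether the main script can now be found by name.
if command -v metafx.sh >/dev/null 2>&1; then
    echo "metafx.sh found"
else
    echo "metafx.sh not found; check METAFX_HOME"
fi
```

To keep the setting across sessions, the export line can be added to your shell start-up file.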
To run MetaFX use the following syntax:
metafx.sh [<Launch options>] [<Input parameters>]
Full description of launch options and input parameters can be found below in section Options and Parameters.
For detailed step-by-step running instructions, please refer to the Wiki page. The walkthrough there analyzes 54 gut samples from the Inflammatory Bowel Disease dataset of the iHMP project. The same analysis with the same results can be reproduced with one command.
From now on we assume that the metafx.sh script is in the current directory and the bin/ folder has been added to the system PATH. To download the samples, run the following commands:
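cd data/
# Sample IDs are read from the second tab-separated column of Class_labels.txt;
# the first (header) line is skipped.
for i in $(tail -n +2 Class_labels.txt | cut -f2); do
    wget "https://ibdmdb.org/tunnel/static/HMP2/WGS/1818/${i}.tar"
    tar -xf "${i}.tar"
done
cd ../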
As a result, the data/ folder will contain 54 samples with paired-end reads in the format <sample>_[R1|R2].fastq.gz.
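Before starting the pipeline it can be useful to check that the download is complete. The sketch below counts distinct sample prefixes among files named <sample>_R1.fastq.gz or <sample>_R2.fastq.gz; count_samples is a hypothetical helper written for this illustration and is not part of MetaFX.

```shell
# count_samples DIR
# Print the number of distinct <sample> prefixes among the
# <sample>_R1.fastq.gz and <sample>_R2.fastq.gz files directly inside DIR.
count_samples() {
    find "$1" -maxdepth 1 -name '*_R[12].fastq.gz' \
        | sed -E 's|.*/||; s/_R[12]\.fastq\.gz$//' \
        | sort -u | wc -l | tr -d ' '
}
```

Running count_samples data/ after the download should report 54 if every archive was fetched and unpacked.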
The following command will run the pipeline on the test dataset and produce the same outputs as the step-by-step solution.
./metafx.sh -p 32 -m 128G -w workDir \
-k 31 \
-i data/*.fastq.gz \
-b 4 \
--min-samples 1 --max-samples 10 \
--class1 data/cd_filelist.txt \
--class2 data/uc_filelist.txt \
--class3 data/nonibd_filelist.txt
MetaFX accepts input sequence files in FASTQ and FASTA formats. Input files can also be compressed with gzip or bzip2.
Metadata files with samples split by categories should contain one sample name per line without path and extensions.
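A metadata file of this kind can be generated from the reads files themselves. The sketch below strips the path and the extensions from a file name; sample_name is a hypothetical helper, and the list of recognized extensions (.fastq, .fq, .fasta, .fa, optionally followed by .gz or .bz2) is an assumption. It removes nothing else, so a paired-end suffix such as _R1 is kept; check the wiki for the exact naming MetaFX expects.

```shell
# sample_name FILE
# Print FILE without its directory and without compression and
# sequence-format extensions (assumed set: gz, bz2, fastq, fq, fasta, fa).
sample_name() {
    basename "$1" | sed -E 's/\.(gz|bz2)$//; s/\.(fastq|fq|fasta|fa)$//'
}

# Example: prints A_R1
sample_name data/A_R1.fastq.gz
```

Calling the helper in a loop over the files of one category and redirecting the output to a text file produces one name per line.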
When MetaFX finishes, the working directory will contain the following results:
- kmer-counter-many/kmers/<sample>.kmers.bin – files with k-mers from each sample in binary format.
- unique_kmers_class[1|2|3]/kmers/filtered_G.kmers.bin – files with group-specific k-mers in binary format for each class and different minimal number of samples containing such k-mers (G values).
- components_class[1|2|3]/components.bin – file with graph components for each class selected as features in binary format. G value for extracting components is selected automatically for each class. For user-defined fine tuning please refer to step-by-step pipeline.
- features_class[1|2|3]/vectors/<sample>.breadth – numeric feature vectors for each class in each sample. Value for feature is calculated as mean breadth coverage of component by samples' k-mers.
- contigs_class[1|2|3]/seq-builder-many/sequences/component.seq.fasta – contigs in FASTA format forming features' components for each class. Suitable for annotation and biological interpretation.
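To give an intuition for the feature values, the sketch below computes a single breadth value under one possible reading: the fraction of a component's distinct k-mers that also occur in a sample. This reading, the breadth helper, and the plain-text input files (one k-mer per line) are assumptions made for illustration. MetaFX stores k-mers and components in binary files, and its exact calculation may differ.

```shell
# breadth COMPONENT_KMERS SAMPLE_KMERS
# Both arguments are hypothetical text files with one k-mer per line.
# Prints the fraction of distinct component k-mers found in the sample.
breadth() {
    awk 'NR == FNR { sample[$1] = 1; next }      # first file: sample k-mers
         !($1 in seen) {                         # second file: component k-mers
             seen[$1] = 1; total++
             if ($1 in sample) shared++
         }
         END {
             if (total > 0) printf "%.2f\n", shared / total
             else print "0.00"
         }' "$2" "$1"
}
```

For a component with k-mers AAA, CCC, GGG, TTT and a sample containing AAA, GGG, ACG, two of the four component k-mers are covered, so the function prints 0.50.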
Input parameters for MetaFX:
- -k, --k <N>
  K-mer size (in nucleotides, maximum value is 31). (Mandatory)
- -i, --reads <files>
  List of reads files from a single environment. FASTQ and FASTA files are accepted, optionally compressed with gzip or bzip2. Files can be specified with a shell glob pattern, for example -i dir/*.fastq. (Mandatory)
- -b, --maximal-bad-frequence <N>
  Maximal frequency for a k-mer to be assumed erroneous. (Optional, default value 1)
- --min-samples <N>
  A k-mer is considered group-specific if it is present in at least G samples of that group. G iterates over the range [--min-samples; --max-samples]. (Optional, default value 1)
- --max-samples <N>
  A k-mer is considered group-specific if it is present in at least G samples of that group. G iterates over the range [--min-samples; --max-samples]. (Optional, default value 1)
- --class1 <file>
  Text file with names from the first group of samples. Only file names without path and extensions. (Mandatory)
- --class2 <file>
  Text file with names from the second group of samples. Only file names without path and extensions. (Mandatory)
- --class3 <file>
  Text file with names from the third group of samples. Only file names without path and extensions. (Optional, if absent the program runs in 2-class mode)
Launch options:
- -p, --available-processors <N>
  Available processors. By default MetaFX uses all available processors.
- -m, --memory <MEM>
  Memory to use (values with suffix, for example: 1500M, 4G, etc.). By default MetaFX uses 90% of free memory.
- -w, --work-dir <DIR>
  Working directory. The default working directory is workDir/ in the current directory.
- -h, --help
  Print help message.
Please report any problems directly to the GitHub issue tracker.
Also, you can send your feedback to [email protected].
Authors:
- Software: Artem Ivanov (ITMO University)
- Supervisor: Vladimir Ulyantsev (ITMO University)
The MIT License (MIT)
- MetaFast – a toolkit for comparison of metagenomic samples.
- MetaCherchant – a tool for analysing genomic environment within a metagenome.
- RECAST – a tool for sorting reads per their origin in metagenomic time series.