Skip to content

Commit

Permalink
Remove splitting
Browse files Browse the repository at this point in the history
  • Loading branch information
blue-moon22 committed Jan 31, 2022
1 parent 2a72b34 commit 5142086
Show file tree
Hide file tree
Showing 5 changed files with 18 additions and 44 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
.DS_Store
build
19 changes: 8 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,18 +66,17 @@ Type *pal-mem -h* for a list of options.

### OUTPUT

pal-MEM outputs a tab-delimited file and two fasta files (if a single fasta file is specified) or four fasta files (if paired fasta files are specified). The tab-delimited file with suffix `_IR.tab` contains the original sequence names of reads containing the inverted repeats (IRs) with the first and second columns representing the first and second read of the pair, respectively.
pal-MEM outputs a tab-delimited file and two fasta files (if a single fasta file is specified) or four fasta files (if paired fasta files are specified). The tab-delimited file with suffix `_IR.tab` contains the original sequence names of reads containing the inverted repeats (IRs) with the first and second columns representing the first and second read of the pair, respectively. The sequence name is reported with "LCoord" and "RCoord" values representing the coordinates of the ITR.

Seq2691_ERR589346.2764_FCC4C01ACXX:6:1101:14662:3637#AAGTCTCT/1_f1 Seq422_ERR589346.468_FCC4C01ACXX:6:1101:11875:2371#AAGTCTCT/1_f1
Seq4972_ERR589346.5141_FCC4C01ACXX:6:1101:21190:4913#AAGTCTCT/1_f1 Seq3747_ERR589346.3872_FCC4C01ACXX:6:1101:3886:4252#AAGTCTCT/1_f1
Seq7491_ERR589346.7724_FCC4C01ACXX:6:1101:11071:6396#AAGTCTCT/1_f1 Seq1796_ERR589346.1851_FCC4C01ACXX:6:1101:14914:3103#AAGTCTCT/1_f1
Seq2691_ERR589346.2764_FCC4C01ACXX:6:1101:14662:3637#AAGTCTCT/1_f1_LCoord_38_RCoord_64 Seq422_ERR589346.468_FCC4C01ACXX:6:1101:11875:2371#AAGTCTCT/1_f1_LCoord_30_RCoord_56
Seq4972_ERR589346.5141_FCC4C01ACXX:6:1101:21190:4913#AAGTCTCT/1_f1_LCoord_45_RCoord_60 Seq3747_ERR589346.3872_FCC4C01ACXX:6:1101:3886:4252#AAGTCTCT/1_f1_LCoord_51_RCoord_66
...

The fasta file(s) with suffix `_IR.fasta` (if single file) or `_IR_1.fasta` and `_IR_2.fasta` (if paired files) contains reads with inverted repeats (IRs). The sequence name is reported with "LCoord" and "RCoord" values representing the coordinates of the ITR. For example, in the first file `_IR_1.fasta`, the IR pair of length 41 is situated in the first and second read from 28th to 68th and from 31st to 71st nucleotide, respectively.
The fasta file(s) with suffix `_IR.fasta` (if single file) or `_IR_1.fasta` and `_IR_2.fasta` (if paired files) contains reads with inverted repeats (IRs).

>Seq2691_ERR589346.2764_FCC4C01ACXX:6:1101:14662:3637#AAGTCTCT/1_f1_LCoord_38_RCoord_64
>Seq2691_ERR589346.2764_FCC4C01ACXX:6:1101:14662:3637#AAGTCTCT/1
AGCCTCTGGTAGGGCTAATTGCACCAAATGTCCCAATGCCCAAGTAACGGCATAATCATTACCTTGCAAATACCCATCTTGTTTTTCGGTTGCCCCCAC
>Seq422_ERR589346.468_FCC4C01ACXX:6:1101:11875:2371#AAGTCTCT/1_f1_LCoord_30_RCoord_56
>Seq422_ERR589346.468_FCC4C01ACXX:6:1101:11875:2371#AAGTCTCT/1
GCAAGTGAAAAACAAGATGGATATTTGTTAGGTAATGATTATGCCGTTACTTGGGCTCTAGGGCATTTGGTGCAATTAGCCCTCCCAGAGGCTTATGGTT

The fasta file(s) with suffix `_discord_non_IR.fasta` (if single file) or `_discord_non_IR_1.fasta` and `_discord_non_IR_2.fasta` (if paired files) contains reads that do not contain IRs but are paired to reads that do contain IRs. For example, if the reads paired to the above do not contain IRs, then the second file `_discord_non_IR_2.fasta` would contain:
Expand All @@ -95,14 +94,12 @@ The program can be run in both serial and parallel mode. The parallel mode has a

-k set the k-mer length. Default: 15

-d set the split size. Default: 1. The option -d is used for splitting the sequences into two or more parts. By default this value is set to 1, which means no splitting. This option with value >1 will reduce the overall memory requirement of the program. For large genomic files (such as metagenomic files over 4 Gigabytes), it is recommended to increase d when RAM is limited, otherwise the program will fail (see example below).

-t number of threads. Default: 1. The option -t is used for running the program in parallel mode. The default value is set to 1, which means serial mode. This option with value > 1 will reduce overall running time of the program.

-h show possible options

## EXAMPLE
To get ITRs with a minimum length of 30 from paired-end metagenomic fasta files with a k-mer length of 18, a split size of 20 on a machine with 8 threads:
To get ITRs with a minimum length of 30 from paired-end metagenomic fasta files with a k-mer length of 18 on a machine with 8 threads:
```
pal-mem -f1 example_1.fasta -f2 example_1.fasta -o example -l 30 -k 18 -d 20 -t 8
pal-mem -f1 example_1.fasta -f2 example_1.fasta -o example -l 30 -k 18 -t 8
```
8 changes: 3 additions & 5 deletions file.h
Original file line number Diff line number Diff line change
Expand Up @@ -1207,17 +1207,15 @@ class tmpFilesInfo {
r.end = lRef;
seqitR = lower_bound(vecSeqInfo.begin(), vecSeqInfo.end(), r, seqData());

tabFile << (*seqitQ).seq << "\t" << (*seqitR).seq << "\n";
QueryFile.getKmerLeftnRightBoundForNs(lQue, QueryNpos);
RefFile.getKmerLeftnRightBoundForNs(lRef, RefNpos);
tabFile << (*seqitQ).seq + "_LCoord_" + to_string(((lQue - (QueryNpos.left==1?QueryNpos.left=0:QueryNpos.left)) + 2)/2) + "_RCoord_" + to_string((rQue - (QueryNpos.left==1?QueryNpos.left=0:QueryNpos.left) + 2)/2) << "\t" << (*seqitR).seq + "_LCoord_" + to_string(((lRef - (RefNpos.left==1?RefNpos.left=0:RefNpos.left)) + 2)/2) + "_RCoord_" + to_string((rRef - (RefNpos.left==1?RefNpos.left=0:RefNpos.left) + 2)/2) << "\n";

if ((*seqitQ).keep) {
QueryFile.getKmerLeftnRightBoundForNs(lQue, QueryNpos);
(*seqitQ).header = '>' + (*seqitQ).seq + "_LCoord_" + to_string(((lQue - (QueryNpos.left==1?QueryNpos.left=0:QueryNpos.left)) + 2)/2) + "_RCoord_" + to_string((rQue - (QueryNpos.left==1?QueryNpos.left=0:QueryNpos.left) + 2)/2);
(*seqitQ).keep = 0;
}

if ((*seqitR).keep) {
RefFile.getKmerLeftnRightBoundForNs(lRef, RefNpos);
(*seqitR).header = '>' + (*seqitR).seq + "_LCoord_" + to_string(((lRef - (RefNpos.left==1?RefNpos.left=0:RefNpos.left)) + 2)/2) + "_RCoord_" + to_string((rRef - (RefNpos.left==1?RefNpos.left=0:RefNpos.left) + 2)/2);
(*seqitR).keep = 0;
}
}
Expand Down
20 changes: 0 additions & 20 deletions pal-mem.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -463,13 +463,6 @@ void checkCommandLineOptions(uint32_t &options)
exit(EXIT_FAILURE);
}

if (IS_SPLIT_SIZE_DEF(options)){
if (commonData::d <= 0){
cout << "ERROR: -d cannot be less than or equal to zero!" << endl;
exit(EXIT_FAILURE);
}
}

if (IS_NUM_THREADS_DEF(options)){
if (commonData::numThreads <= 0){
cout << "ERROR: -t cannot be less than or equal to zero!" << endl;
Expand Down Expand Up @@ -520,7 +513,6 @@ void print_help_msg()
cout << "Optional:" << endl;
cout << "-l\t" << "set the minimum length of a match. Default: 24" << endl;
cout << "-k\t" << "set the k-mer length. Default: 15" << endl;
cout << "-d\t" << "set the split size. Default: 1" << endl;
cout << "-t\t" << "number of threads. Default: 1" << endl;
cout << "-h\t" << "show possible options" << endl;
}
Expand Down Expand Up @@ -598,18 +590,6 @@ int main (int argc, char *argv[])
commonData::minMemLen = 2*std::stoi(argv[n+1]);
commonData::lenBuffer = commonData::minMemLen - 2;
n+=2;
}else if (boost::equals(argv[n],"-d")){
if (IS_SPLIT_SIZE_DEF(options)) {
cout << "ERROR: Split size argument passed multiple times!" << endl;
exit(EXIT_FAILURE);
}
if (!argv[n+1] || !is_numeric(argv[n+1])){
cout << "ERROR: Invalid value for -d option!" << endl;
exit(EXIT_FAILURE);
}
SET_SPLIT_SIZE(options);
commonData::d = std::stoi(argv[n+1]);
n+=2;
}else if (boost::equals(argv[n],"-t")){
if (IS_NUM_THREADS_DEF(options)) {
cout << "ERROR: Number of threads argument passed multiple times!" << endl;
Expand Down
13 changes: 5 additions & 8 deletions pal-mem.h
Original file line number Diff line number Diff line change
Expand Up @@ -37,20 +37,18 @@
#define LENGTH 0x00000010
#define REF_FILE 0x00000020
#define QUERY_FILE 0x00000040
#define SPLIT_SIZE 0x00000080
#define KMER_SIZE 0x00000100
#define NUM_THREADS 0x00000200
#define REL_REV_QUEPOS 0x00000400
#define FOUR_COL_OUTPUT 0x00000800
#define LEN_IN_HEADER 0x00001000
#define KMER_SIZE 0x0000080
#define NUM_THREADS 0x00000100
#define REL_REV_QUEPOS 0x00000200
#define FOUR_COL_OUTPUT 0x00000400
#define LEN_IN_HEADER 0x00000800

#define IS_FASTA1_DEF(x) (x & FASTA1)
#define IS_FASTA2_DEF(x) (x & FASTA2)
#define IS_FASTAU_DEF(x) (x & FASTAU)
#define IS_OUT_FILE_DEF(x) (x & OUT_FILE)
#define IS_LENGTH_DEF(x) (x & LENGTH)
#define IS_KMER_SIZE_DEF(x) (x & KMER_SIZE)
#define IS_SPLIT_SIZE_DEF(x) (x & SPLIT_SIZE)
#define IS_NUM_THREADS_DEF(x) (x & NUM_THREADS)

#define SET_FASTA1(x) (x |= FASTA1)
Expand All @@ -59,7 +57,6 @@
#define SET_OUT_FILE(x) (x |= OUT_FILE)
#define SET_LENGTH(x) (x |= LENGTH)
#define SET_KMER_SIZE(x) (x |= KMER_SIZE)
#define SET_SPLIT_SIZE(x) (x |= SPLIT_SIZE)
#define SET_NUM_THREADS(x) (x |= NUM_THREADS)

class Knode {
Expand Down

0 comments on commit 5142086

Please sign in to comment.