diff --git a/README.md b/README.md
index a316ea9..1740efc 100755
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ ANNEXA works by using only three parameter files (a reference genome, a referenc
 2. Transcriptome reconstruction and quantification with [bambu](https://github.com/GoekeLab/bambu).
 3. Novel classification with [FEELnc](https://github.com/tderrien/FEELnc).
 4. Retrieve information from input annotation and format final gtf with 3 level structure: gene -> transcript -> exon.
-5. Filter novel transcripts based on [bambu](https://github.com/GoekeLab/bambu) and/or [TransforKmers](https://github.com/mlorthiois/transforkmers) Novel Discovery Rates.
+5. Filter novel transcripts based on [bambu NDR (Novel Discovery Rates)](https://github.com/GoekeLab/bambu) and/or [TransforKmers TSS validation](https://github.com/IGDRion/transforkmers) to assess fulllength transcripts.
 6. Perform a quality control of both the full and filtered extended annotations (see [example](https://github.com/igdrion/ANNEXA/blob/master/examples/results/qc_gtf.pdf)).
 7. Optional: Check gene body coverage with [RSeQC](http://rseqc.sourceforge.net/#genebody-coverage-py).

@@ -28,7 +28,7 @@ This pipeline has been tested with reference annotation from Ensembl and NCBI-Re

 ```sh
 nextflow run IGDRion/ANNEXA \
- -profile test,conda
+ -profile test,singularity
 ```

 3. Run ANNEXA on your own data (change input, gtf, fa with path of your files).
@@ -69,7 +69,7 @@ Optional:
 --tfkmers_tokenizer : Path to TransforKmers tokenizer. Required if filter activated.
 --tfkmers_model : Path to TransforKmers model. Required if filter activated.
 --bambu_threshold : bambu NDR threshold below which new transcripts are retained.
---tfkmers_threshold : TransforKmers NDR threshold below which new transcripts are retained.
+--tfkmers_threshold : TransforKmers prediction threshold below which new transcripts are retained.
 --operation : Operation to retained novel transcripts.
 "union" retain tx validated by either bambu or transforkmers, "intersection" retain tx validated by both.
 --withGeneCoverage : Run RSeQC (can be long depending on annotation and bam sizes). False by default.
@@ -78,15 +78,15 @@ Optional:
 --maxMemory : max memory used by ANNEXA. 40GB by default.
 ```

-> If the filter argument is set to `true`, TransforKmers model and tokenizer paths have to be given. They can be either downloaded from the [TransforKmers official repository](https://github.com/mlorthiois/TransforKmers) or trained in advance by yourself on your own data.
+> If the filter argument is set to `true`, TransforKmers model and tokenizer paths have to be given. They can be either downloaded from the [TransforKmers official repository](https://github.com/IGDRion/TransforKmers) or trained in advance by yourself on your own data.

 ### Filtering step

 By activating the filtering step (`--filter`), ANNEXA proposes to filter the generated extended annotation according to 2 methods:

-1. By using the NDR proposed by [bambu](https://github.com/GoekeLab/bambu). This threshold includes several information such as sequence profile, structure (mono-exonic, etc) and quantification (number of samples, expression). Each transcript with an NDR below the classification threshold will be retained by ANNEXA.
+1. By using the NDR proposed by [bambu](https://github.com/GoekeLab/bambu). This threshold includes several information such as sequence profile, structure (mono-exonic, etc) and quantification (number of samples, expression). Each transcript with an NDR below the classification threshold will be retained by ANNEXA (default: 0.2).

-2. By analysing the TSS of each new transcript using the [TransforKmers](https://github.com/mlorthiois/TransforKmers) (deep-learning) tool. Each TSS validated below a certain threshold will be retained. We already provide 2 trained models for filtering TSS with TransforKmers.
+2. By analysing the Transcription Start Sites (TSS) of each new transcripts using the [TransforKmers](https://github.com/IGDRion/TransforKmers) deep-learning based tool. Each TSS validated below a certain threshold will be retained (default: 0.2). We already provide 2 trained models for filtering TSS with TransforKmers.

    - A [human specific model](https://genostack-api-swift.genouest.org/v1/AUTH_07c8a078861e436ba41c4416a821e5d0/transforkmers/hsa_5prime_bert_6-512.zip?temp_url_sig=59e4bd439f42fc2bb8953e78eae82306466917d2&temp_url_expires=2661501621)
@@ -94,6 +94,6 @@ By activating the filtering step (`--filter`), ANNEXA proposes to filter the gen

 To use them, extract the zip, and point `--tfkmers_model` and `--tfkmers_tokenizer` to the subdirectories.

-The filtered annotation can be the `union` of these 2 tools, _i.e._ all the transcripts validated by one or both of these tools; or the `intersection`, _i.e._ the transcripts validated by both tools.
+The filtered annotation can be the `union` of these 2 tools, _i.e._ all the transcripts validated by one or both of these tools; or the `intersection`, _i.e._ the transcripts validated by both tools (the latter being the default).

 At the end, the QC steps are performed both on the full and filtered extended annotations.
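
Note on the filtering semantics documented in this diff: the README text says a novel transcript is retained when its bambu NDR and/or its TransforKmers prediction is below a threshold (default 0.2 for both), combined by `union` or `intersection` (the default). The sketch below only illustrates that decision rule; it is not ANNEXA's implementation. The `NovelTranscript` class, the function names, and the example scores are hypothetical, and the strict `<` comparison is an assumption drawn from the word "below".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NovelTranscript:
    """Hypothetical record; ANNEXA's real data structures may differ."""
    transcript_id: str
    bambu_ndr: float      # NDR reported by bambu
    tfkmers_score: float  # TransforKmers TSS prediction score


def keep_transcript(tx, bambu_threshold=0.2, tfkmers_threshold=0.2,
                    operation="intersection"):
    """Return True if the transcript would be retained under the documented rule.

    Assumption: "below the threshold" is read as a strict comparison.
    """
    passes_bambu = tx.bambu_ndr < bambu_threshold
    passes_tfkmers = tx.tfkmers_score < tfkmers_threshold
    if operation == "union":
        # Validated by either tool.
        return passes_bambu or passes_tfkmers
    if operation == "intersection":
        # Validated by both tools (documented default).
        return passes_bambu and passes_tfkmers
    raise ValueError(f"unknown operation: {operation!r}")


def filter_transcripts(transcripts, **options):
    """Keep only the transcripts that pass keep_transcript, preserving order."""
    return [tx for tx in transcripts if keep_transcript(tx, **options)]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    demo = [
        NovelTranscript("tx1", bambu_ndr=0.05, tfkmers_score=0.10),
        NovelTranscript("tx2", bambu_ndr=0.05, tfkmers_score=0.90),
        NovelTranscript("tx3", bambu_ndr=0.80, tfkmers_score=0.90),
    ]
    for op in ("intersection", "union"):
        kept = [tx.transcript_id for tx in filter_transcripts(demo, operation=op)]
        print(op, kept)
```

With these made-up scores, `intersection` is the stricter choice: a transcript with a low bambu NDR but a high TransforKmers score is dropped, whereas `union` keeps it.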