Skip to content

Commit

Permalink
Overhauling compare pipeline. Broken up the pipeline into more stages
Browse files Browse the repository at this point in the history
and created tools to which can run the stages independently.  This
should make debugging much easier.  Also Updating readme and diagrams for pipelines
  • Loading branch information
sbastkowski committed Oct 15, 2023
1 parent 9dfa90e commit 1ec74f4
Show file tree
Hide file tree
Showing 111 changed files with 15,637 additions and 681 deletions.
20 changes: 14 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -212,7 +212,9 @@ tradis pipeline create_plots --output_dir plots_output my_fastq_list.txt my_refe

The output will be a series of directories for each fastq, containing the plot file for each sequence in the given reference,
along with the aligned reads, and some statistics regarding transposon insertion sites both for each sample, and across
all samples.
all samples. The workflow for `tradis pipeline create_plots` is shown in this diagram:

![create_plots pipeline](docs/pipeline_create_plots.png)

A set of plot files can then be analysed in relation to each other using the `tradis pipeline compare` command. For example:

Expand All @@ -222,6 +224,10 @@ tradis pipeline compare --output_dir analysis_output --annotations my_annotation
--control_files cont1.fq.gz cont2.fq.gz
```

The workflow for `tradis pipeline compare` is shown in this diagram:

![compare pipeline](docs/pipeline_combine.png)

The `tradis` tool has multiple functions, as shown below, and can also be found using the tools help message `tradis --help`:

* `pipeline`
Expand All @@ -235,17 +241,19 @@ The `tradis` tool has multiple functions, as shown below, and can also be found
* `count` - Takes genome annotation in embl format along with plot files produced by "tradis pipeline" and generates tab-delimited files containing gene-wise annotations of insert sites and read counts.
* `normalise` - Scales the insertion site counts of plot files based on the sample with the maximum number of reads.
* `comparison`
* `analyse` - Comparative analysis of TraDIS experiments whilst also predicting the impact of inserts on nearby genes
* `logfc_plot` - Run logfc analysis for a specific plot file direction: forward, reverse, combined, original over all experiments
* `presence_absence` - Take in gene report files and produce a heatmap
* `scatterplot` - Create scatter plot of controls vs conditions
* `figures` - Create graphics of controls vs conditions
* `gene_report` - Generate gene report from logfc_plot analysis data
* `split` - Splits a plot file into forward only, reverse only and combined plot files.
* `essentiality` - Determines how essential each gene is based on the transposon insertion sites
* `essentiality_analysis` - Compares essentiality across condition and control samples
* `prepare_embl` - Prepares an embl annotations file for comparative analysis from a plotfile. If an existing embl file is supplied then genes in that file are expanded based on data from the plot file
* `utils`
* `index` - Indexes a reference using the specified alignment tool
* `extract_names` - Creates a file containing the sequence names in a fasta file
* `annotation` - Take in an EMBL file and add flanking 3 prime and 5 prime annotation.
* `artemis_project` - Create an artemis project file. Sometimes you want to view the insert site plots in Artemis. It can be quite a manual task to open up different replicates and combinations. This script will generate a project.properties file from a spreadsheet which gets automatically loaded by Artemis (from the current working directory). This then makes it quicker to view multiple different insert site plots.
* `gene_reports` - Manipulate gene_report.csv files, such as performing set operations
* `prepare_embl` - Prepares an embl annotations file for comparative analysis from a plotfile. If an existing embl file is supplied then genes in that file are expanded based on data from the plot file
* `essentiality` - Determines how essential each gene is based on the transposon insertion sites
* `tags`
* `add` - Generates a BAM file with tags added to read strings.
* `check`- Prints 1 if tags are present in alignment file, prints 0 if not.
Expand Down
Binary file added docs/pipeline_combine.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
57 changes: 57 additions & 0 deletions docs/pipeline_combine.pu
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
@startuml

split
-[hidden]->
#palegreen: embl;
split again
-[hidden]->
#palegreen: control plot files;
split again
-[hidden]->
#palegreen: condition plot files;
end split

#lightblue:start tradis pipeline compare;

split
: create count file for
original plot and original embl;
split again
split
:prepare embl file;
split again
: split plot file;
split
#lightyellow:forward;
split again
#lightyellow:reverse;
split again
#lightyellow:combined;
end split
end split
: create count files for split plots
and prepared embl;
end split
split
:run essentiality for
each plot file;
:essentiality analysis;
split again
:compare plot files
create figures;
split again
: logfc comparison;
: create gene report;
end split

#lightblue: end tradis pipeline compare;

split
#palegreen: gene report;
kill
split again
#palegreen: figures;
kill
end split

@enduml
Binary file added docs/pipeline_create_plots.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
40 changes: 40 additions & 0 deletions docs/pipeline_create_plots.pu
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
@startuml

split
-[hidden]->
#palegreen: reference genome;
split again
-[hidden]->
#palegreen: fastq files;
end split

#lightblue:start tradis pipeline create_plots;

split
:samtools index reference (faidx);
split again
:index reference (selected mapping tool);
split
:create plot file for fastq 1;
split again
:create plot file for fastq 2;
split again
:create plot file for fastq n;
end split
:combine stats from plot files;
end split

#lightblue: end tradis pipeline create_plots;

split
#palegreen: indexed reference;
kill
split again
#palegreen: plot files;
kill
split again
#palegreen: statistics;
kill
end split

@enduml
162 changes: 106 additions & 56 deletions quatradis/comparison/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,39 +2,54 @@

from quatradis.comparison.scatterplot import scatterplot_run
from quatradis.comparison.presence_absence import presence_absence_run
from quatradis.comparison.comparison import prep_essentiality_pipeline_output_for_comparison, run_comparisons, generate_logfc_plot
from quatradis.comparison.comparison import prep_essentiality_pipeline_output_for_comparison, run_comparisons, \
generate_logfc_plot
from quatradis.comparison.gene_stats import gene_statistics
from quatradis.comparison.plot_logfc import PlotLogOptions
from quatradis.comparison.split import split_plot as sp
from quatradis.comparison.essentiality import essentiality as ess
from quatradis.embl.prepare import PrepareEMBLFile
from quatradis.comparison.essentiality_analysis import essentiality_analysis as ess_an, EssentialityInput


def add_subparser(subparsers):
compare_parser_desc = "Comparative analysis and visualisation of TraDIS experiments (albatradis)"
compare_parser = subparsers.add_parser("compare", help=compare_parser_desc)
compare_subparsers = compare_parser.add_subparsers(title=compare_parser_desc)
create_parser("analyse", compare_subparsers, comparative_analysis, add_compare_options,
"Comparative analysis of TraDIS experiments whilst also predicting the impact of inserts on nearby genes",
usage="tradis compare analyse [options] --condition_dirs <condition_plotfiles>+ --control_dirs <control_plotfiles>+ <EMBL_file>")
create_parser("logfc_plot", compare_subparsers, logfc_plot, add_logfc_options,
"Run logfc analysis for a specific plot file direction: forward, reverse, combined, original over all experiments.",
usage="tradis compare logfc_plot [options] --condition_dirs <condition_plotfiles>+ --control_dirs <control_plotfiles>+ <EMBL_file> <analysis_type>")
create_parser("presence_absence", compare_subparsers, presence_absence, add_pa_options,
"Take in gene report files and produce a heatmap",
usage="tradis compare presence_absence [options] <EMBLfile> <gene_reports>+")
create_parser("figures", compare_subparsers, scatterplot, add_scatterplot_options,
create_parser("figures", compare_subparsers, figures, figures_options,
"Create graphics of controls vs conditions",
usage="tradis compare figures [options] --controls control1.plot control2.plot --conditions condition1.plot condition2.plot")
create_parser("gene_report", compare_subparsers, gene_report, add_genereport_options,
"Generate gene report from comparison analysis data",
usage="tradis compare gene_report [options] <comparison_dir> <EMBL_file>")
create_parser("split", compare_subparsers, split_plot, split_plot_options,
"Splits a plot file into forward, reverse and combined files",
usage="tradis compare split [options] <EMBL_file> <plot_file>")
create_parser("essentiality", compare_subparsers, essentiality, essentiality_utils_options,
"Determines how essential each gene is based on the transposon insertion site plot counts",
usage="tradis compare essentiality [options] <plot_count_file>")
create_parser("prepare_embl", compare_subparsers, prepare_embl, prepare_embl_utils_options,
"Prepares an embl annotations file for comparative analysis from a plotfile. If an existing embl file is supplied then genes in that file are expanded based on data from the plot file",
usage="tradis compare prepare_embl [options] <plot>")
create_parser("essentiality_analysis", compare_subparsers, essentiality_analysis, essentiality_analysis_options,
"Compares essentiality across condition and control samples",
usage="tradis compare essentiality_analysis [options] --controls <control_gene_files> --conditions <condition_gene_files> --ess_controls <control_essential_gene_files> --ess_conditions <condition_essential_gene_files>")


def add_compare_options(parser):
def add_logfc_options(parser):
parser.add_argument('emblfile', help='Annotation file in EMBL format', type=str)
parser.add_argument('--control_dirs',
help='Directories from essentiality processed control plot files (optionally gzipped). There must be an equal number of condition and control directories',
parser.add_argument('analysis_type', help='Analysis type: forward, reverse, combined, original', type=str)
parser.add_argument('--controls',
help='Control plot files (optionally gzipped). There must be an equal number of condition and control files.',
nargs='+', type=str)
parser.add_argument('--condition_dirs',
help='Directories from essentiality processed condition plot files (optionally gzipped). There must be an equal number of condition and control directories',
parser.add_argument('--conditions',
help='Condition plot files (optionally gzipped). There must be an equal number of condition and control files.',
nargs='+', type=str)

parser.add_argument('--span_gaps', '-s', help='Span a gap if it is this multiple of a window size', type=int,
Expand All @@ -50,48 +65,21 @@ def add_compare_options(parser):
parser.add_argument('--pvalue', '-p', help='Do not report anything above this p-value', type=float, default=0.05)
parser.add_argument('--qvalue', '-q', help='Do not report anything above this q-value', type=float, default=0.05)
parser.add_argument('--window_size', '-w', help='Window size', type=int, default=100)
parser.add_argument('--dont_report_decreased_insertions', '-r', help="Whether or not to report decreased insertions", action='store_true', default=False)
parser.add_argument('--genome_length', '-g', help="The length of the genome or sequence being analysed", type=int, required=True)

def comparative_analysis(args, help='', type=str):

if len(args.control_dirs) != len(args.condition_dirs):
raise "Must have equal number of control and condition essentiality directories to process"

if len(args.control_dirs) == 0:
raise "Must have at least one control and condition essentiality directory to process"

controls_all, controls_only_ess, conditions_all, conditions_only_ess = prep_essentiality_pipeline_output_for_comparison(args.control_dirs, args.condition_dirs, 'combined')

options = PlotLogOptions(
genome_length=args.genome_length,
minimum_logfc=args.minimum_logfc,
maximum_pvalue=args.pvalue,
maximum_qvalue=args.qvalue,
minimum_logcpm=args.minimum_logcpm,
window_size=args.window_size,
span_gaps=args.span_gaps,
report_decreased_insertions=not args.dont_report_decreased_insertions,
minimum_block=args.minimum_block)

run_comparisons(controls_all, conditions_all, controls_only_ess, conditions_only_ess, args.emblfile, args.prefix, options, args.verbose)
parser.add_argument('--dont_report_decreased_insertions', '-r',
help="Whether or not to report decreased insertions", action='store_true', default=False)
parser.add_argument('--genome_length', '-g', help="The length of the genome or sequence being analysed", type=int,
required=True)

gene_statistics(args.prefix, args.window_size, args.emblfile, args.prefix)



def add_logfc_options(parser):
add_compare_options(parser)
parser.add_argument("analysis_type", help='The type of plot file to analyse: forward, reverse, combined, original', type=str)

def logfc_plot(args):
if len(args.control_dirs) != len(args.condition_dirs):
raise "Must have equal number of control and condition essentiality directories to process"
if len(args.controls) != len(args.conditions):
raise "Must have equal number of control and condition files to process"

if len(args.control_dirs) == 0:
raise "Must have at least one control and condition essentiality directory to process"
if len(args.controls) == 0:
raise "Must have at least one control and condition file to process"

controls_all, controls_only_ess, conditions_all, conditions_only_ess = prep_essentiality_pipeline_output_for_comparison(args.control_dirs, args.condition_dirs, args.analysis_type)
#controls_all, controls_only_ess, conditions_all, conditions_only_ess = prep_essentiality_pipeline_output_for_comparison(
# args.control_dirs, args.condition_dirs, args.analysis_type)

options = PlotLogOptions(
genome_length=args.genome_length,
Expand All @@ -104,8 +92,7 @@ def logfc_plot(args):
report_decreased_insertions=not args.dont_report_decreased_insertions,
minimum_block=args.minimum_block)

plotfile, scoresfile = generate_logfc_plot(args.analysis_type, controls_all, conditions_all,
controls_only_ess, conditions_only_ess,
plotfile, scoresfile = generate_logfc_plot(args.analysis_type, args.controls, args.conditions,
args.emblfile, args.prefix, options, args.verbose)

return plotfile, scoresfile
Expand All @@ -121,25 +108,88 @@ def presence_absence(args):
presence_absence_run(args)


def add_scatterplot_options(parser):
def figures_options(parser):
parser.add_argument('--controls', '-c', help='control files (use 2 or more)', type=str, nargs='+')
parser.add_argument('--conditions', '-d', help='condition files (use 2 or more)', type=str, nargs='+')
parser.add_argument('--window_size', '-w', help='Window size', type=int, default=50)
parser.add_argument('--prefix', '-o', help='Output filename prefix', type=str, default='scatter')

parser.add_argument('--prefix', '-p', help='Output prefix', type=str, default='figures')
parser.add_argument('--normalise', '-n', action='store_true', help='normalise the files', default=False)


def scatterplot(args):
def figures(args):
scatterplot_run(args)


def add_genereport_options(parser):
parser.add_argument('comparison_dir', help='Output directory from tradis compare analyse', type=str)
parser.add_argument('embl', help='Original EMBL file used for analysis', type=str)
parser.add_argument('--combined', help='Combined plotfile', type=str)
parser.add_argument('--forward', help='Forward plotfile', type=str)
parser.add_argument('--reverse', help='Reverse plotfile', type=str)
parser.add_argument('--scores', help='Combined scores', type=str)
parser.add_argument('--window_size', '-w', help='Window size', type=int, default=50)
parser.add_argument('--output_dir', '-o', help='Output filename prefix', type=str, default='gene_report.csv')
parser.add_argument('--output_dir', '-o', help='Output filename prefix', type=str, default='.')
parser.add_argument('--annotations', '-a', help='EMBL file used for annotations', type=str)


def gene_report(args):
gene_statistics(args.comparison_dir, window_size=args.window_size, embl_file=args.embl, output_dir=args.output_dir, annotation_file=args.annotations)
gene_statistics(combined_plotfile=args.combined, forward_plotfile=args.forward, reverse_plotfile=args.reverse,
combined_scorefile=args.scores, window_size=args.window_size, embl_file=args.embl, output_dir=args.output_dir,
annotation_file=args.annotations)


def split_plot_options(parser):
parser.add_argument('plot_file', help='Plot file to split', type=str)
parser.add_argument('--minimum_threshold', '-m', help='Minimum number of insertions required to pass filter',
type=int, default=5)
parser.add_argument('--output_dir', '-o', help='Output directory', type=str, required=True)
parser.add_argument('--gzipped', '-g', help='Whether to output gzipped plot files', action='store_true')


def split_plot(args):
sp(args.plot_file, output_dir=args.output_dir, minimum_threshold=args.minimum_threshold, gzipped=args.gzipped)


def essentiality_utils_options(parser):
parser.add_argument('count_file',
help='Counts from a transposon insertion site plot file to be used. i.e. generated from "tradis plot count"',
type=str)


def essentiality(args):
ess(args.count_file, verbose=args.verbose)


def prepare_embl_utils_options(parser):
parser.add_argument('plotfile', help='Transposon insertion site plot file to be used', type=str)
parser.add_argument('--output', '-o', help='Output file path to store the prepared EMBL file', type=str,
default='prepared.embl')
parser.add_argument('--emblfile', '-e',
help='If provided genes in this EMBL annotations file will expanded based on data in the plotfile.',
type=str, default=None)
parser.add_argument('--minimum_threshold', '-m',
help='Only include insert sites with this number or greater insertions', type=int, default=5)
parser.add_argument('--prime_feature_size', '-z',
help='Feature size when adding 5/3 prime block when --use_annotation', type=int, default=198)
parser.add_argument('--window_interval', '-l', help='Window interval', type=int, default=25)
parser.add_argument('--window_size', '-w', help='Window size', type=int, default=100)


def prepare_embl(args):
pef = PrepareEMBLFile(args.plotfile, args.minimum_threshold, args.window_size, args.window_interval,
args.prime_feature_size, args.emblfile)
pef.create_file(args.output)


def essentiality_analysis(args):

i = EssentialityInput(args.controls, args.conditions, args.ess_controls, args.ess_conditions)
ess_an(i, args.output_dir, args.type)


def essentiality_analysis_options(parser):
parser.add_argument('--controls', '-c', help='control files (use 2 or more)', type=str, nargs='+')
parser.add_argument('--conditions', '-d', help='condition files (use 2 or more)', type=str, nargs='+')
parser.add_argument('--ess_controls', '-e', help='control files (use 2 or more)', type=str, nargs='+')
parser.add_argument('--ess_conditions', '-f', help='condition files (use 2 or more)', type=str, nargs='+')
parser.add_argument('--output_dir', '-o', help='Output directory', type=str, default=".")
parser.add_argument('--type', '-t', help='Analysis type: original, combined, forward, reverse', type=str, required=True)
Loading

0 comments on commit 1ec74f4

Please sign in to comment.