Update Spectre to version 0.2.0 Alpha
Philippe Sanio committed Mar 18, 2024
1 parent 01b8ed5 commit de1740c
Showing 55 changed files with 5,309 additions and 1,416 deletions.
Binary file removed .DS_Store
Binary file not shown.
File renamed without changes.
60 changes: 60 additions & 0 deletions .gitignore
@@ -0,0 +1,60 @@
# Compiled class file
*.class

# Log file
*.log

# BlueJ files
*.ctxt


# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Mobile Tools for Java (J2ME)
.mtj.tmp/

# Package Files #
*.jar
*.war
*.nar
*.ear
*.zip
*.tar.gz
*.rar

# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*

# Jetbrains
.idea/
test/

# local
_test_luis
_test_luis/*
__pycache__/*
*pyc

# deploy
deploy_baylor.sh
deployScriptsBaylor.sh
.rsync-ignore
resultadoRsync.txt
33 changes: 33 additions & 0 deletions Dockerfile
@@ -0,0 +1,33 @@
FROM python:3.11.4-slim-buster

# Metadata
LABEL VERSION="0.2.0.alpha"
LABEL NAME="Spectre:0.2.0.alpha"
LABEL AUTHOR="Philippe Sanio"



# Path: /app
WORKDIR /app/Spectre

# pip install requirements.txt
COPY requirements.txt .

# pip install
RUN pip install --no-cache-dir --upgrade pip \
&& pip install --no-cache-dir -r requirements.txt

RUN pip install -U setuptools wheel build

# copy Spectre directory
COPY . .

RUN python -m build

# Install whatever source distribution (.tar.gz) the dist folder contains
RUN pip install --no-cache-dir dist/*.tar.gz
RUN pip install --no-cache-dir .

# Start the application
ENTRYPOINT ["spectre"]
CMD ["Spectre"]
5 changes: 3 additions & 2 deletions LICENSE
@@ -1,6 +1,7 @@
MIT License

Copyright (c) 2023 Fritz Sedlazeck
The MIT License (MIT)

Copyright (c) 2023 Philippe Sanio

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
150 changes: 130 additions & 20 deletions README.md
@@ -1,66 +1,112 @@

![Spectre](./logo.png)
# Spectre - Long read CNV caller
Spectre is a long read copy number variation (CNV) caller.
Spectre is designed to detect large CNVs (>100kb) in a couple of minutes depending on your hardware.

To call CNVs, Spectre primarily uses coverage (read depth) data.
However, it can also use SNV data to detect loss of heterozygosity (LoH) regions.
Additionally, Spectre can use breakpoint (SNF) data from Sniffles to improve CNV calling. The SNF file must first be converted to the SNFJ format using [snf2json](https://github.com/philippesanio/snf2json).


The CNV output of Spectre is stored in three files: VCF, BED, and .SPC. The .SPC file can be used in the population mode.

Furthermore, Spectre offers a population mode, which can be used to search for CNV support in multiple samples.
Unlike other tools, Spectre searches not only the final CNVs but also CNV candidates that did not qualify for its final output.
## Required programs (conda)

Install Spectre with Pip:
```bash
pip install spectre-cnv
```

Set up a conda environment for Spectre (copy and paste the following commands):
```bash
conda create -n spectre python=3.8.5 pysam==0.21.0 numpy==1.24.3 pandas==2.0.1 matplotlib==3.7.1 scipy==1.10.1 -y
conda create -n spectre python=3.10 pysam==0.22.0 numpy==1.24.3 pandas==2.0.1 matplotlib==3.7.1 scipy==1.10.1 -y
conda activate spectre
```
Alternatively, you can use pip to install the packages listed in requirements.txt:

```bash
conda create -n spectre python=3.8.5 pip -y
conda create -n spectre python=3.10 pip -y
conda activate spectre
pip install -r requirements.txt
```
or install everything manually (check the package versions in the requirements.txt file)

|Program| Conda |
|-------|-------------------------------------|
| python3 | conda install python=3.8.5 |
| pysam | conda install -c bioconda pysam=0.21.0 |
| pandas| conda install -c anaconda pandas==2.0.1 |
|Program| Conda |
|-------|---------------------------------------------|
| python3 | conda install python=3.10 |
| pysam | conda install -c bioconda pysam=0.22.0 |
| pandas| conda install -c anaconda pandas==2.0.1 |
| numpy| conda install -c anaconda numpy==1.24.3 |
| scipy| conda install -c anaconda scipy==1.10.1 |
| matplotlib| conda install -c anaconda matplotlib==3.7.1 |


## How to run
Spectre needs as input:
- The result of Mosdepth (directory)

Prerequisites:
Extract the coverage data from a BAM using [Mosdepth](https://github.com/brentp/mosdepth).
Example command:
```bash
mosdepth -t 8 -x -b 1000 -Q 20 -c X "${out_path}/${sample_id}" "${bam_path}"
```

>IMPORTANT: We recommend running **Mosdepth** with a **bin size of 1kb** and a **mapping quality of at least 20** (-Q 20), as Spectre is optimized for these settings.
- The region coverage file (mosdepth)
- SampleID e.g.
- Output directory
- Reference genome (can be bgzip compressed)
- Window size used in Mosdepth (make sure the bin sizes used for Mosdepth and Spectre match; we suggest a bin size of 1000 base pairs)

Optional
- **MDR** file (if not already generated, Spectre will generate it for you; the same MDR file can be reused for every sample aligned to the same reference genome)
- VCF file containing SNV
- SNF data from Sniffles (if parsed through [snf2json](https://github.com/philippesanio/snf2json))
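
Because the window size of the Mosdepth output has to match what Spectre expects, it can help to verify the region file before starting a run. The following is only a minimal sketch, not part of Spectre: the function name `mismatched_windows` is hypothetical, and it assumes the usual Mosdepth `regions.bed` layout of tab-separated `chrom`, `start`, `end`, `mean coverage` columns.

```python
def mismatched_windows(lines, expected_bin_size=1000):
    """Return (chrom, start, end) windows whose length differs from expected_bin_size.

    The last window on each chromosome is skipped, because it is usually
    shorter than the bin size.
    """
    mismatches = []
    previous = None  # window read just before the current one
    for line in lines:
        if not line.strip():
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        current = (chrom, int(start), int(end))
        # `previous` is only checked once another window on the same
        # chromosome follows it, so it cannot be the final window.
        if previous is not None and previous[0] == chrom:
            if previous[2] - previous[1] != expected_bin_size:
                mismatches.append(previous)
        previous = current
    return mismatches


# Hypothetical in-memory example standing in for a regions.bed file.
example = [
    "chr1\t0\t1000\t30.1\n",
    "chr1\t1000\t2000\t29.5\n",
    "chr1\t2000\t2500\t31.0\n",
    "chr2\t0\t500\t28.0\n",
    "chr2\t500\t1000\t27.4\n",
]
print(mismatched_windows(example))
```

For a real file, the same function can be fed the lines of the compressed output, for example via `gzip.open("mosdepth.regions.bed.gz", "rt")` from the standard library.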

>INFO: Make sure to include "/" at the end if you are adding directory paths.
## Run Spectre
### MDR file
MDR files hold the positions of N regions in the reference genome and prevent Spectre from using data from those regions.
We provide sample MDR files for the reference genomes GRCh37 and GRCh38.

If no MDR file is provided, Spectre will generate one for you, which can take some time.
Thus, we highly recommend generating an MDR file for your reference genome before running Spectre on multiple samples that have been aligned to the same reference.


Providing an MDR file will save you a substantial amount of time, as Spectre will not have to calculate the N regions for every sample.

The MDR file can be generated with either the `RemoveNs` or the `CNVCaller` command. In the latter case, the MDR file (metadata.mdr) will be saved in the output directory of the sample.
```bash
spectre.py CNVCaller \
--bin-size 1000 \
--coverage mosdepth/sampleid/ \
spectre RemoveNs \
--reference reference.fasta.gz \
--output-dir output_directory_path/
```
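
To illustrate what an "N region" is, the sketch below scans a sequence for runs of consecutive `N` bases. It is not Spectre's implementation: the function name `find_n_runs` is hypothetical, and treating the `--n-size` value (default 5) as the minimum run length is an assumption.

```python
import re


def find_n_runs(sequence, min_length=5):
    """Return half-open (start, end) positions of runs of at least min_length Ns."""
    # Match upper- and lower-case N so soft-masked references are covered too.
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(match.start(), match.end()) for match in pattern.finditer(sequence)]


# Hypothetical toy sequence: one run of six Ns and one run of two Ns.
toy_sequence = "ACGTNNNNNNACGNNACGT"
print(find_n_runs(toy_sequence))                # only the long run qualifies
print(find_n_runs(toy_sequence, min_length=2))  # both runs qualify
```

A whole reference genome would be processed chromosome by chromosome; the regions found this way only need to be computed once per reference, which is why reusing an MDR file saves time.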
### Blacklists
The blacklist is a supplementary file to the MDR file. It contains regions that should be ignored by Spectre.
These regions are based on gap data from UCSC.
During testing, we found that the gap data alone does not sufficiently mask high frequency coverage regions such as telomeric and centromeric regions.
Thus, we have extended those problematic regions in particular in the blacklist files (grch37_blacklist_spectre_refined.bed and grch38_blacklist_spectre.bed).
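
The effect of a blacklist can be pictured as dropping every coverage bin that overlaps a blacklisted interval. The sketch below shows that idea only; it is not taken from Spectre, the function name `remove_blacklisted_bins` is hypothetical, and it assumes BED-style half-open intervals given as `(chrom, start, end)` tuples.

```python
def remove_blacklisted_bins(bins, blacklist):
    """Keep only the bins that do not overlap any blacklisted interval."""
    kept = []
    for chrom, start, end in bins:
        # Two half-open intervals overlap if each one starts before the other ends.
        masked = any(
            chrom == b_chrom and start < b_end and b_start < end
            for b_chrom, b_start, b_end in blacklist
        )
        if not masked:
            kept.append((chrom, start, end))
    return kept


# Hypothetical bins and blacklist entries for illustration.
bins = [("chr1", 0, 1000), ("chr1", 1000, 2000), ("chr2", 0, 1000)]
blacklist = [("chr1", 500, 1000)]
print(remove_blacklisted_bins(bins, blacklist))
```

This linear scan is fine for a toy example; with genome-scale inputs, sorted intervals or an interval tree would be the more practical choice.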

### Run Spectre with a single sample
```bash
spectre CNVCaller \
--coverage mosdepth/sampleid/mosdepth.regions.bed.gz \
--sample-id sampleid \
--output-dir sampleid_output_directory_path/ \
--reference reference.fasta.gz \
--snv sampleid.vcf.gz
--reference reference.fasta.gz
```
### Run Spectre with multiple samples
Run Spectre with multiple samples:
>INFO: This will start the population mode automatically.
```bash
spectre.py CNVCaller \
--bin-size 1000 \
--coverage mosdepth/sampleid-1/ mosdepth/sampleid-1/ \
--coverage mosdepth/sampleid-1/mosdepth.regions.bed.gz mosdepth/sampleid-2/mosdepth.regions.bed.gz \
--sample-id sampleid-1 sampleid-2 \
--output-dir sampleid_output_directory_path/ \
--reference reference.fasta.gz \
--snv sampleid.vcf.gz
--reference reference.fasta.gz
```

### Population mode
@@ -69,8 +115,72 @@ Run Spectre in population mode with two or more samples:
> located in the output folder of given sample.
```bash
spectre.py population \
spectre population \
--candidates /path/to/sample1.spc /path/to/sample2.spc \
--sample-id output_name \
--output-dir sampleid_output_directory_path/
```


### Help
```
Spectre:
    CNVCaller:
        Required
            --coverage       Path to the coverage file from Mosdepth output. Expects the following files:
                                 <prefix>.regions.bed.gz
                                 <prefix>.regions.bed.gz.csi
                             Can be one or more directories. Example:
                                 --coverage /path/md1.regions.gz /path/md2.regions.gz
            --sample-id      Sample name/ID. Can be one or more ID. Example:
                                 --sample-id id1 id2
            --output-dir     Output directory
            --reference      Reference sequence used for mapping (for N removal)
        Optional, if missing it will be created
            --metadata       Metadata file for Ns removal
        Optional
            --blacklist      Blacklist in bed format for sites that will be ignored (Default = "")
            --only-chr       Comma separated list of chromosomes to use
            --ploidy         Set the ploidy for the analysis, useful for sex chromosomes (Default = 2)
            --ploidy-chr     Comma separated list of key:value-pairs for individual chromosome ploidy control
                             (e.g. chrX:2,chrY:1) If chromosome is not specified, the default ploidy will be used.
            --snv            VCF file containing the SNVs of the same sample for which CNVs are to be called
            --snfj           Breakpoints from Sniffles which have been converted from the SNF to the SNFJ format.
            --n-size         Length of consecutive Ns (Default = 5)
            --min-cnv-len    Minimum length of CNV (Default 100kb)
            --cancer         Set this flag if the sample is cancer (Default = False)
            --population     Runs the population mode on all provided samples
            --threads        Amount of threads (This will boost performance if multiple samples are provided)
        Coverage
            --sample-coverage-overwrite    Overwrites the calculated sample coverage, which is used to normalize
                                           the coverage. e.g. a value of 30 equals to 30X coverage.
            --disable-max-coverage         Disables the maximum coverage check. This will allow to call CNVs
        LoH (requires --snv)
            --loh-min-snv-perkb      Minimum number of SNVs per kilobase for an LoH region (default=5)
            --loh-min-snv-total      Minimum number of SNVs total for an LoH region (default=100)
            --loh-min-region-size    Minimum size of a region for a LoH region (default=100000)
    RemoveNs:
        Required
            --reference      Reference genome used for mapping
            --output-dir     Output dir
            --output-file    Output file for results
            --bin-size       Bin/Window size (same as Mosdepth)
        Optional
            --blacklist      Blacklist in bed format for sites that will be ignored (Default = "")
            --n-size         Length of consecutive Ns (Default = 5)
            --save-only      Will only save the metadata file and not show the results in screen (Default = False)
    Population:
        Required
            --candidates     At least 2 candidate files (.spc or .vcf) which should be taken into consideration for the population mode.
            --sample-id      Name of the output file
            --output-dir     Output directory
        Optional
            --reference      Reference sequence (Required if VCF files are used!)
    Version:
        version          Shows current version/build
```
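
The `--ploidy-chr` value is described above as a comma separated list of `chromosome:ploidy` pairs, with the default ploidy applying to every chromosome that is not listed. The following sketch shows one way such a string could be interpreted. It is an illustration under that reading of the help text, not Spectre's own parser, and the function names `parse_ploidy_chr` and `ploidy_for` are hypothetical.

```python
def parse_ploidy_chr(value):
    """Parse a string such as 'chrX:2,chrY:1' into {'chrX': 2, 'chrY': 1}."""
    overrides = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the last colon so contig names containing ':' still work.
        chrom, _, ploidy = pair.rpartition(":")
        overrides[chrom] = int(ploidy)
    return overrides


def ploidy_for(chrom, overrides, default_ploidy=2):
    """Return the chromosome-specific ploidy, or the default if none was given."""
    return overrides.get(chrom, default_ploidy)


overrides = parse_ploidy_chr("chrX:2,chrY:1")
print(ploidy_for("chrY", overrides))  # listed chromosome
print(ploidy_for("chr1", overrides))  # falls back to the default ploidy
```

A malformed pair without a numeric ploidy raises a `ValueError` in this sketch, which makes a typo on the command line visible early.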