The config.yml file configures values that should stay constant between samples.
samples: "config/samples.tsv" # .tsv file containing sample names and locations
regions: "config/regions.tsv" # .tsv file containing bed files with regions of interest
## snakemake_GE_analysis
proteinAtlas: "Blood" #RNAtable name ["Blood", "Tissue", "Extended"]
# tissues for generating plots; valid tissue labels can be found in the respective label files for the used protein atlas
tissue: ["NK_cell", "memory_B_cell", "classical_monocyte", "basophil", "memory_CD4_T_cell", "memory_CD8_T_cell"]
refSample: "BH01" # reference sample for rank correlation comparison
minRL: 120 # minimum read length for calculating WPS
maxRL: 180 # maximum read length for calculating WPS
bpProtection: 120
### genome build specific options ##
GRCh37:
    genome: "resources/genome/hg19.fa.genome" # full .genome file
    genome_autosomes: "resources/genome/hg19.fa.genome.regular_autosomes" # .genome file reduced to regular autosomes
    UCSC_gap: "resources/blacklists/UCSC/UCSC_gap.hg19.bed" # UCSC_gap file in .bed format
    universal_blacklist: "resources/blacklists/universal_blacklist.hg19.bed" # UCSC_gap + ENCODE blacklist combined file in .bed format
    transcriptAnno: "resources/annotations/transcriptAnno-GRCh37.103.tsv.gz" # file containing TSSs
GRCh38:
    genome: "resources/genome/hg38.fa.genome" # full .genome file
    genome_autosomes: "resources/genome/hg38.fa.genome.regular_autosomes" # .genome file reduced to regular autosomes
    UCSC_gap: "resources/blacklists/UCSC/UCSC_gap.hg38.bed" # UCSC_gap file in .bed format
    universal_blacklist: "resources/blacklists/universal_blacklist.hg38.bed" # UCSC_gap + ENCODE blacklist combined file in .bed format
    transcriptAnno: "resources/annotations/transcriptAnno-GRCh38.103.tsv.gz" # file containing TSSs
## unsupervised
unsupervised:
    frequencies: [[120,280],[160,200],[190,200]] # defines FFT frequencies used for unsupervised methods
    kmeans:
        n_clusters: [2,3,4] # number of clusters to try
    UMAP:
        n_components: [10,15,20,25,30] # number of components to reduce to
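As a rough illustration of how the genome-build-specific block might be consumed, the sketch below looks up the resource paths for one build from an already-parsed config. The `config` dict only mirrors a subset of the keys shown above, and the helper name `build_resources` is hypothetical, not part of the workflow.

```python
# Minimal sketch: select the genome-build-specific resources for a sample.
# `config` stands in for the parsed config.yml (Snakemake exposes it as a dict);
# only a subset of the keys from the example above is reproduced here.
config = {
    "GRCh37": {
        "genome": "resources/genome/hg19.fa.genome",
        "transcriptAnno": "resources/annotations/transcriptAnno-GRCh37.103.tsv.gz",
    },
    "GRCh38": {
        "genome": "resources/genome/hg38.fa.genome",
        "transcriptAnno": "resources/annotations/transcriptAnno-GRCh38.103.tsv.gz",
    },
}

VALID_BUILDS = ("GRCh37", "GRCh38")


def build_resources(config, genome_build):
    """Return the resource paths configured for one genome build (hypothetical helper)."""
    if genome_build not in VALID_BUILDS:
        raise ValueError(f"unknown genome build: {genome_build!r}")
    return config[genome_build]


# Look up the .genome file for a sample annotated with GRCh38.
print(build_resources(config, "GRCh38")["genome"])
```

Keeping both builds under separate top-level keys lets samples with different `genome_build` values be processed in the same run.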
The samples.tsv contains a header with five columns:
ID sample path ref_samples genome_build
experimentID testsample1 "/path/to/testsample1.bam" testsample2,testsample3 GRCh37
experimentID testsample2 "/path/to/testsample2.bam" testsample1,testsample3 GRCh37
experimentID testsample3 "/path/to/testsample3.bam" testsample1,testsample2 GRCh38
- ID - ID for a certain analysis to create identifiable directories and/or filenames
- sample - sample name used to identify files
- path - path to input file
- ref_samples - Reference samples for some visualizations/calculations. ref_samples are comma-separated, must be present in the sample column, and every sample needs at least one ref_sample (e.g. itself).
- genome_build - Defines the genome build to be used for a specific sample. Valid options are ["GRCh37","GRCh38"].
Note: Input files should match the specified genome build.
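The constraints listed above can be checked before starting a run. The following is a minimal sketch using only the standard library; the function name `check_samples` and the error messages are assumptions for illustration and do not exist in the workflow itself.

```python
import csv
import io

VALID_BUILDS = {"GRCh37", "GRCh38"}


def check_samples(handle):
    """Return a list of problems found in a samples.tsv-style table (hypothetical helper)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    known = {row["sample"] for row in rows}
    errors = []
    for row in rows:
        # ref_samples is a comma-separated list; every entry must be a known sample.
        refs = [ref for ref in row["ref_samples"].split(",") if ref]
        if not refs:
            errors.append(f"{row['sample']}: no ref_samples given")
        for ref in refs:
            if ref not in known:
                errors.append(f"{row['sample']}: unknown ref_sample {ref!r}")
        if row["genome_build"] not in VALID_BUILDS:
            errors.append(f"{row['sample']}: invalid genome_build {row['genome_build']!r}")
    return errors


# Example table, tab-separated like the samples.tsv shown above.
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ["ID", "sample", "path", "ref_samples", "genome_build"],
        ["experimentID", "testsample1", "/path/to/testsample1.bam", "testsample2", "GRCh37"],
        ["experimentID", "testsample2", "/path/to/testsample2.bam", "testsample1", "GRCh37"],
    ]
)
print(check_samples(io.StringIO(example)))
```

The check does not verify that the input files actually match the declared genome build; that still has to be ensured by the user.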
The regions.tsv contains a header with two columns:
target path
gene1 /path/to/gene1.bed
TF1 /path/to/TF1BS.bed
- target - describes the targets defined in the corresponding .bed file
- path - path to input .bed file containing coordinates of interest (all coordinates should be centered around a specific feature and of the same length)
Note: Each .bed file has to contain the first 6 fields (chrom, chromStart, chromEnd, name, value, strand), even though name and value are not actively used.
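The two requirements on the region files (at least 6 fields, all regions of the same length) can be verified with a short script. This is a sketch under the assumption that the .bed files are plain tab-separated text; `check_bed` is a hypothetical helper, not a function provided by the workflow.

```python
import io


def check_bed(handle):
    """Return the common region length, or raise ValueError on malformed input."""
    lengths = set()
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip empty lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"line {line_no}: expected at least 6 fields, got {len(fields)}")
        start, end = int(fields[1]), int(fields[2])
        lengths.add(end - start)
    if len(lengths) > 1:
        raise ValueError(f"regions differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else None


# Two example regions of equal length; name and value are placeholders.
example = "chr1\t1000\t3000\tgene1\t0\t+\nchr2\t5000\t7000\tgene2\t0\t-\n"
print(check_bed(io.StringIO(example)))
```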