Documentation for config files

config.yml

The config.yml file configures values that should stay constant between samples.

samples: "config/samples.tsv" # .tsv file containing sample names and locations
regions: "config/regions.tsv" # .tsv file containing paths to .bed files with regions of interest

## snakemake_GE_analysis

proteinAtlas: "Blood" #RNAtable name ["Blood", "Tissue", "Extended"]
# tissues for generating plots; valid tissue labels can be found in the respective label files for the protein atlas in use
tissue: ["NK_cell", "memory_B_cell", "classical_monocyte", "basophil", "memory_CD4_T_cell", "memory_CD8_T_cell"]
refSample: "BH01" # reference sample for rank correlation comparison
minRL: 120 # minimum read length for calculating WPS
maxRL: 180 # maximum read length for calculating WPS
bpProtection: 120 

### genome build specific options ##

GRCh37:
  genome: "resources/genome/hg19.fa.genome" #full .genome file
  genome_autosomes: "resources/genome/hg19.fa.genome.regular_autosomes" # .genome file reduced to regular autosomes
  UCSC_gap: "resources/blacklists/UCSC/UCSC_gap.hg19.bed" # UCSC_gap file in .bed format
  universal_blacklist: "resources/blacklists/universal_blacklist.hg19.bed" # UCSC_gap + ENCODE blacklist combined file in .bed format
  transcriptAnno: "resources/annotations/transcriptAnno-GRCh37.103.tsv.gz" # file containing TSSs

GRCh38:
  genome: "resources/genome/hg38.fa.genome" #full .genome file
  genome_autosomes: "resources/genome/hg38.fa.genome.regular_autosomes" # .genome file reduced to regular autosomes
  UCSC_gap: "resources/blacklists/UCSC/UCSC_gap.hg38.bed" # UCSC_gap file in .bed format
  universal_blacklist: "resources/blacklists/universal_blacklist.hg38.bed" # UCSC_gap + ENCODE blacklist combined file in .bed format
  transcriptAnno: "resources/annotations/transcriptAnno-GRCh38.103.tsv.gz" # file containing TSSs

## unsupervised 

unsupervised:
  frequencies: [[120,280],[160,200],[190,200]] # defines FFT frequencies used for unsupervised methods
  kmeans:
    n_clusters: [2,3,4] # number of clusters to try
  UMAP:
    n_components: [10,15,20,25,30] # number of components to reduce to
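
The genome-build-specific blocks are keyed by build name, so a per-sample option can be looked up with the value of a sample's genome_build column. The following is a minimal sketch of that lookup, not the workflow's actual code: the `config` dict is a hand-copied excerpt of the example above, and `build_option` and `check_read_lengths` are hypothetical helper names.

```python
# Hand-copied excerpt of the example config.yml, as it would look once parsed.
config = {
    "minRL": 120,
    "maxRL": 180,
    "GRCh37": {
        "genome": "resources/genome/hg19.fa.genome",
        "universal_blacklist": "resources/blacklists/universal_blacklist.hg19.bed",
    },
    "GRCh38": {
        "genome": "resources/genome/hg38.fa.genome",
        "universal_blacklist": "resources/blacklists/universal_blacklist.hg38.bed",
    },
}


def build_option(config, genome_build, key):
    """Return one genome-build-specific option (hypothetical helper)."""
    if genome_build not in ("GRCh37", "GRCh38"):
        raise KeyError(f"unknown genome build: {genome_build!r}")
    return config[genome_build][key]


def check_read_lengths(config):
    """Fail early if the read-length window is inverted (hypothetical helper)."""
    if config["minRL"] > config["maxRL"]:
        raise ValueError("minRL must not exceed maxRL")


check_read_lengths(config)
blacklist = build_option(config, "GRCh38", "universal_blacklist")
```
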

samples.tsv

The samples.tsv contains a header with five columns:

ID	sample	path	ref_samples	genome_build
experimentID	testsample1	"/path/to/testsample1.bam"	testsample2,testsample3	GRCh37
experimentID	testsample2	"/path/to/testsample2.bam"	testsample1,testsample3	GRCh37
experimentID	testsample3	"/path/to/testsample3.bam"	testsample1,testsample2	GRCh38
  • ID - ID for a certain analysis to create identifiable directories and/or filenames
  • sample - sample name used to identify files
  • path - path to input file
  • ref_samples - Reference samples for some visualizations/calculations. Entries are comma-separated and must be present in the sample column; every sample needs at least one reference sample (e.g. itself).
  • genome_build - Defines the genome build to be used for a specific sample. Valid options are ["GRCh37","GRCh38"].

Note: Input files should match the specified genome build.
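
The column rules above can be checked before a run. The following is a minimal sketch using only the Python standard library; `validate_samples` and `VALID_BUILDS` are hypothetical names and not part of the workflow, and the check does not verify that the input files themselves match the declared genome build.

```python
import csv
import io

VALID_BUILDS = {"GRCh37", "GRCh38"}


def validate_samples(handle):
    """Return a list of problems found in a samples.tsv-style table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    known = {row["sample"] for row in rows}
    problems = []
    for row in rows:
        # ref_samples is a comma-separated list; every entry must be a known sample.
        refs = [r.strip() for r in (row["ref_samples"] or "").split(",") if r.strip()]
        if not refs:
            problems.append(f"{row['sample']}: no ref_samples given")
        for ref in refs:
            if ref not in known:
                problems.append(f"{row['sample']}: unknown ref sample {ref!r}")
        if row["genome_build"] not in VALID_BUILDS:
            problems.append(
                f"{row['sample']}: invalid genome_build {row['genome_build']!r}"
            )
    return problems


# Illustrative table with two deliberate mistakes in the second row.
example = (
    "ID\tsample\tpath\tref_samples\tgenome_build\n"
    "experimentID\ttestsample1\t/path/to/testsample1.bam\ttestsample2\tGRCh37\n"
    "experimentID\ttestsample2\t/path/to/testsample2.bam\ttestsample9\thg19\n"
)
problems = validate_samples(io.StringIO(example))
```

Here `problems` holds one message for the unknown reference sample and one for the invalid genome build; an empty list means the table passed these checks.
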

regions.tsv

The regions.tsv contains a header with two columns:

target  path
gene1   /path/to/gene1.bed
TF1     /path/to/TF1BS.bed
  • target - describes the targets defined in the corresponding .bed file
  • path - path to input .bed file containing coordinates of interest (all coordinates should be centered around a specific feature and of the same length)

Note: Each .bed file has to contain the first 6 fields (chrom, chromStart, chromEnd, name, score, strand), even though name and score are not actively used.
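
Both requirements on the .bed files (at least 6 fields, all regions of the same length) are easy to verify up front. The following is a minimal sketch, assuming tab-separated records; `check_bed_lines` is a hypothetical helper and not part of the workflow, and it does not check that regions are centered on a feature.

```python
def check_bed_lines(lines):
    """Check that every record has >= 6 fields and all regions share one length.

    Returns the common region length, or None if there are no records.
    """
    lengths = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(
                f"line {number}: expected at least 6 fields, found {len(fields)}"
            )
        start, end = int(fields[1]), int(fields[2])
        lengths.add(end - start)
    if len(lengths) > 1:
        raise ValueError(f"regions differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else None


# Illustrative records; coordinates and names are made up.
records = [
    "chr1\t100\t300\tgene1\t0\t+\n",
    "chr2\t500\t700\tgene2\t0\t-\n",
]
region_length = check_bed_lines(records)
```

For a file on disk, the open file handle can be passed directly, since it iterates line by line.
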