This repository has been archived by the owner on Aug 23, 2024. It is now read-only.
Victoria Offord edited this page Sep 2, 2021 · 2 revisions

Input files

The formats of the three required inputs are described below.

An input file can contain the required columns in any order, with any column delimiter, with or without a column header, as long as the non-default column delimiters and column indices are specified using the appropriate input file configuration parameters.

sgRNA library file

The library file is a text file containing the guide name and associated gene name. The genomic coordinates of each guide must also be provided if read counts are to be corrected for gene-independent cell responses using CRISPRcleanR. Here is a snippet of the Yusa_v1.1 library file:

sgRNA        gene    chr     start   end
A1BG_CCDS12976.1_ex4_19:58863655-58863678:+_5-2 A1BG    19      58863655        58863678
A1BG_CCDS12976.1_ex4_19:58863697-58863720:-_5-3 A1BG    19      58863697        58863720
A1BG_CCDS12976.1_ex3_19:58862927-58862950:-_5-1 A1BG    19      58862927        58862950
A1BG_CCDS12976.1_ex4_19:58863866-58863889:+_5-4 A1BG    19      58863866        58863889
A1BG_CCDS12976.1_ex5_19:58864367-58864390:-_5-5 A1BG    19      58864367        58864390
A1CF_CCDS7241.1_ex6_10:52588014-52588037:-_5-1  A1CF    10      52588014        52588037
A1CF_CCDS7241.1_ex7_10:52595962-52595985:-_5-2  A1CF    10      52595962        52595985
A1CF_CCDS7241.1_ex9_10:52603844-52603867:-_5-5  A1CF    10      52603844        52603867
A1CF_CCDS7241.1_ex7_10:52596023-52596046:+_5-3  A1CF    10      52596023        52596046
A1CF_CCDS7241.1_ex9_10:52603761-52603784:+_5-4  A1CF    10      52603761        52603784

By default, the library file is expected to have a header, tab-delimited columns and the sgRNA ID and gene name in columns 1 and 2, respectively. All other columns are ignored. Guide IDs must be unique.
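The default layout described above can be checked before running the pipeline. The following is a minimal sketch, not part of the pipeline itself: the helper name `read_library` and its keyword arguments are hypothetical, and it assumes 1-based column indices as used in the text.

```python
import csv
import io


def read_library(handle, delimiter="\t", has_header=True, id_col=1, gene_col=2):
    """Return {sgRNA ID: gene name}; raise if a guide ID is repeated.

    Hypothetical helper: column indices are 1-based, matching the text.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header line
    guides = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        guide_id, gene = row[id_col - 1], row[gene_col - 1]
        if guide_id in guides:
            raise ValueError(f"duplicate guide ID: {guide_id}")
        guides[guide_id] = gene
    return guides


# Two lines taken from the Yusa_v1.1 snippet above, tab-delimited.
example = io.StringIO(
    "sgRNA\tgene\tchr\tstart\tend\n"
    "A1BG_CCDS12976.1_ex4_19:58863655-58863678:+_5-2\tA1BG\t19\t58863655\t58863678\n"
    "A1CF_CCDS7241.1_ex6_10:52588014-52588037:-_5-1\tA1CF\t10\t52588014\t52588037\n"
)
library = read_library(example)
```

Columns beyond the ID and gene name are simply not read here, which mirrors the statement that all other columns are ignored.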

If genomic coordinates are missing and the correction of read counts is performed using the CRISPRcleanR step (no_crisprcleanr = false), the guides with no genomic coordinates will be excluded from the resulting count matrices.
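To see in advance which guides would be dropped, the library file can be scanned for rows with incomplete coordinates. This is a sketch under stated assumptions: the helper name `guides_without_coordinates` is hypothetical, the coordinates are assumed to sit in columns 3 to 5 (chr, start, end) as in the snippet above, and "missing" is taken to mean an empty or absent field. How the pipeline itself detects missing coordinates is not described here.

```python
import csv
import io


def guides_without_coordinates(handle, coord_cols=(3, 4, 5)):
    """List sgRNA IDs whose chr/start/end fields are empty or absent.

    Assumes a tab-delimited file with a header and the sgRNA ID in column 1.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    missing = []
    for row in reader:
        if not row:
            continue
        # Treat a field as missing if the row is too short or the value is blank.
        values = [row[i - 1].strip() if i - 1 < len(row) else "" for i in coord_cols]
        if not all(values):
            missing.append(row[0])
    return missing


# Illustrative input: the second and third guides lack coordinates.
example = io.StringIO(
    "sgRNA\tgene\tchr\tstart\tend\n"
    "guide_1\tGENE_A\t19\t58863655\t58863678\n"
    "guide_2\tGENE_A\t\t\t\n"
    "guide_3\tGENE_B\n"
)
excluded = guides_without_coordinates(example)
```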

sgRNA read count files

Read count files must be placed in a single directory. Counts are assumed to be counts at the sample level (i.e. one file per sample). Replicates should be provided in separate read count files with the groupings indicated in the sample mapping file. Read count files may be gzipped.

By default, read count files are expected to be tab-delimited text files with a header and the sgRNA ID, gene name, and count in columns 1-3, respectively. All other columns are ignored. Here is an example of the first few lines in a count file:

sgRNA                                               gene     Example.sample  Plasmid_v1.1
CNST_CCDS1628.1_ex3_1:246797289-246797312:+_5-3     CNST     163                360
GJC3_CCDS34697.1_ex1_7:99526769-99526792:+_5-2      GJC3     271                634
RASSF1_CCDS2820.1_ex2_3:50369085-50369108:-_5-2     RASSF1   465                627
ANKRD36_CCDS54379.1_ex19_2:97830184-97830207:+_5-5  ANKRD36  16                 281
EIF5AL1_CCDS53546.1_ex0_10:81272561-81272584:+_5-3  EIF5AL1  584                728

In this example, the plasmid count is also included in column 4; by default, however, the count for the sample of interest is expected in column 3.
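As an illustration of the default count file layout, the sketch below reads a plain or gzipped count file into a dictionary. The helper names `open_counts` and `read_counts` are hypothetical, and detecting gzip by a `.gz` suffix is an assumption rather than documented pipeline behaviour.

```python
import csv
import gzip


def open_counts(path):
    """Open a read count file as text; a '.gz' suffix is assumed to mean gzip."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_counts(path, delimiter="\t", has_header=True, id_col=1, count_col=3):
    """Return {sgRNA ID: count} using 1-based column indices."""
    counts = {}
    with open_counts(path) as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        if has_header:
            next(reader, None)  # skip the header line
        for row in reader:
            if row:
                counts[row[id_col - 1]] = int(row[count_col - 1])
    return counts
```

With a file laid out like the example above, `read_counts(path)` would pick up the sample counts from column 3, while `read_counts(path, count_col=4)` would pick up the plasmid counts instead.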

Note that the sgRNA IDs in the read count file must match those in the library file.
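A quick consistency check along these lines can catch mismatched IDs early. This is a minimal sketch with a hypothetical helper name, `unmatched_guides`, and made-up guide IDs. It only reports count file IDs that do not appear in the library.

```python
def unmatched_guides(count_ids, library_ids):
    """Return the sorted sgRNA IDs present in a count file but not in the library."""
    return sorted(set(count_ids) - set(library_ids))


# Illustrative IDs only; real IDs would come from the library and count files.
library_ids = ["guide_1", "guide_2", "guide_3"]
count_ids = ["guide_1", "guide_2", "guide_X"]

missing = unmatched_guides(count_ids, library_ids)
if missing:
    print("sgRNA IDs not found in the library:", ", ".join(missing))
```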

Sample mapping file

The sample mapping file specifies which files contain the read count data for the plasmid, control, and treatment samples, which group each file belongs to, and which samples are replicates.

Below is an example sample mapping file for a study with read counts for treatment and control samples (each with 2 replicates) and a plasmid sample, shown with the default expected column order. The first 7 columns are required. The reads column, which is used for the generation of quality control plots, gives the total number of raw reads generated from sequencing, including unmapped reads. If the reads column is absent and not specified by the info_reads_column_index parameter, not all quality control plots will be generated.

filename                        label           plasmid control treatment  group   replicate reads
Day-10-A-Cont.read_count.tsv    Day-10-A-Cont   0       1       0          d10C    1         50738612
Day-10-B-Cont.read_count.tsv    Day-10-B-Cont   0       1       0          d10C    2         50010807
Day-10-A-Treat.read_count.tsv   Day-10-A-Treat  0       0       1          d10     1         49104113
Day-10-B-Treat.read_count.tsv   Day-10-B-Treat  0       0       1          d10     2         58884798
Plasmid_v1.1.tsv                Plasmid_v1.1    1       0       0          Plasmid 1         121086275

The filename column lists the read count files by file name only (without the full path). These files must be located in the directory specified by the --counts parameter.

The label column lists the label you wish to use for your samples in all QC and downstream analyses. The label can be the sample name found in your read count file, or an alternative name. For example, if your count file sample name is an accession, you may wish to use a more descriptive label for your samples within the pipeline. The sample type (plasmid, control or treatment) is indicated by 1 in the appropriate column.
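The mapping file layout can be sanity-checked with a short script. The sketch below is hypothetical and not part of the pipeline: the helper name `read_sample_mapping` is made up, the file is assumed to be tab-delimited with the header names shown in the example, and the rule that each sample has exactly one type column set to 1 is an assumption inferred from the description above.

```python
import csv
import io
from collections import defaultdict

TYPE_COLUMNS = ("plasmid", "control", "treatment")


def read_sample_mapping(handle):
    """Group sample labels by (sample type, group) from a mapping file.

    Assumes tab-delimited input with the header names from the example.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # Collect the type columns flagged with 1 for this sample.
        flagged = [t for t in TYPE_COLUMNS if (row.get(t) or "").strip() == "1"]
        if len(flagged) != 1:
            raise ValueError(f"{row['filename']}: expected exactly one sample type")
        groups[(flagged[0], row["group"])].append(row["label"])
    return dict(groups)


# Rows from the example mapping file above, written with tab delimiters.
example = io.StringIO(
    "filename\tlabel\tplasmid\tcontrol\ttreatment\tgroup\treplicate\treads\n"
    "Day-10-A-Cont.read_count.tsv\tDay-10-A-Cont\t0\t1\t0\td10C\t1\t50738612\n"
    "Day-10-B-Cont.read_count.tsv\tDay-10-B-Cont\t0\t1\t0\td10C\t2\t50010807\n"
    "Day-10-A-Treat.read_count.tsv\tDay-10-A-Treat\t0\t0\t1\td10\t1\t49104113\n"
    "Day-10-B-Treat.read_count.tsv\tDay-10-B-Treat\t0\t0\t1\td10\t2\t58884798\n"
    "Plasmid_v1.1.tsv\tPlasmid_v1.1\t1\t0\t0\tPlasmid\t1\t121086275\n"
)
groups = read_sample_mapping(example)
```

Here `groups` maps each (type, group) pair to its replicate labels, which makes it easy to confirm that replicates ended up in the intended group.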

You may provide plasmid samples, control samples, or both. If you have no control sample, for example, your control column will consist of all zeros. If both plasmid and control samples are provided, comparisons of treatment vs. plasmid, treatment vs. control and control vs. plasmid will be performed at the appropriate stages.
