deyanyosifov/swap_checker

Swap_checker

An R script that can detect sample swaps or sample mislabelling from tNGS (targeted next generation sequencing) data

Installation

Prerequisites

  • A working installation of R (version 4.2.0 or more recent) on a computer with a Linux or Windows operating system. (Theoretically macOS should work too, but I haven't had the chance to test it.) The RStudio integrated development environment is recommended for convenient use of the script but is not required.
  • The following R packages have to be installed: maftools (obligatory) and pheatmap (optional, for generating graphs).
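
The prerequisites can be checked from within R with a minimal sketch like the one below. It assumes that maftools is installed from Bioconductor through the BiocManager package and pheatmap from CRAN. The helper names missing_packages and install_prerequisites are made up for this illustration and are not part of Swap_checker.

```r
# Check the R version required by Swap_checker (4.2.0 or more recent).
r_version_ok <- getRversion() >= "4.2.0"
if (!r_version_ok) {
  warning("R ", getRversion(), " is older than the required version 4.2.0")
}

# Hypothetical helper: return those packages from 'pkgs' that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing (needs internet access).
install_prerequisites <- function() {
  if (length(missing_packages("pheatmap")) > 0) {
    install.packages("pheatmap")        # optional, from CRAN
  }
  if (length(missing_packages("maftools")) > 0) {
    if (length(missing_packages("BiocManager")) > 0) {
      install.packages("BiocManager")
    }
    BiocManager::install("maftools")    # obligatory, from Bioconductor
  }
}

# Lists the prerequisites that still have to be installed (empty if none).
missing_packages(c("maftools", "pheatmap"))
```

Calling install_prerequisites() afterwards would install only what the last line reports as missing.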

Installation

Download the zip archive of all files in the repository by clicking on "Code" and then on "Download ZIP" on the GitHub page of the Swap_checker project, or by following this direct download link: https://github.com/deyanyosifov/Swap_checker/archive/refs/heads/master.zip. Unzip the archive; this will create a new directory named swap_checker-master in the current directory. You can rename the new directory to Swap_checker or any other name you choose and move it to a convenient place on your computer. For the purpose of this manual, we will assume that your installation is located in the directory Swap_checker in your home folder on a Linux machine, i.e. ~/Swap_checker. If your situation is different, just replace the ~/Swap_checker part in the further instructions with the actual path to your installation.
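
If you prefer to stay inside R, the download and unpacking can be sketched as follows. This is only an illustration: the helper name fetch_swap_checker is hypothetical, the function needs internet access, and the name of the extracted directory (swap_checker-master) is taken from the description above.

```r
# Hypothetical helper: download and unpack the repository archive into 'dest_dir'.
# Needs internet access; returns the paths of the extracted files.
fetch_swap_checker <- function(dest_dir = path.expand("~")) {
  url <- "https://github.com/deyanyosifov/Swap_checker/archive/refs/heads/master.zip"
  zip_file <- file.path(tempdir(), "Swap_checker-master.zip")
  # mode = "wb" keeps the binary zip file intact on Windows
  download.file(url, destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = dest_dir)
}

# Not run automatically:
# extracted <- fetch_swap_checker()
# file.rename(file.path(path.expand("~"), "swap_checker-master"),
#             file.path(path.expand("~"), "Swap_checker"))
```

The commented-out file.rename() call corresponds to the optional renaming step. Check the paths returned in extracted first, in case the directory name differs on your system.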

Usage

Preparation of the data

Your data (coordinate-sorted and indexed bam files) should be located in one or several directories.
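
A quick way to verify that every bam file has an index is sketched below. The helper bams_without_index is hypothetical, and it assumes that an index is stored next to its bam file under the name sample.bam.bai or sample.bai. It does not check whether the files are coordinate-sorted.

```r
# Hypothetical helper: list BAM files in 'dirs' that have no index next to them.
# An index is assumed to be named either sample.bam.bai or sample.bai.
bams_without_index <- function(dirs) {
  bams <- list.files(dirs, pattern = "\\.bam$", full.names = TRUE)
  has_index <- file.exists(paste0(bams, ".bai")) |
    file.exists(sub("\\.bam$", ".bai", bams))
  bams[!has_index]
}

# Small demonstration with empty placeholder files in a temporary directory.
demo_dir <- file.path(tempdir(), "bam_index_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.bam", "a.bam.bai", "b.bam", "c.bam", "c.bai")))
basename(bams_without_index(demo_dir))  # "b.bam"
```

With your own data, pass a character vector of the director(y/ies) that contain your bam files. An empty result means that no index is missing.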

Preparation of the script

Some variables in the Swap_checker.R file will have to be adjusted to suit your data before you run the script. This can be done most conveniently in RStudio.

  • Line 16 – this is the place to specify the list of SNPs that will be used for the analysis. If left unchanged, the built-in list will be used, which is specific to our custom NGS panel covering various genes of relevance for chronic lymphocytic leukaemia. To use your own SNPs, prepare a list following the format of the SNPs_selected.bed file and save it in the same directory as the other files of the program. Then modify line 16 of the Swap_checker.R file to point to the file with your own list of SNPs. Alternatively, you can directly edit the SNPs_selected.bed file and leave line 16 of the Swap_checker.R file unmodified.
  • Line 19 – Replace the placeholders “/path/to/folder/with/BAMs/1” and “/path/to/folder/with/BAMs/2” with the actual paths to the directories containing your bam files, keeping the quotation marks. You can add as many directories as you want inside the brackets, separating them with commas. If all your bam files are in one directory, remove the comma after the path to it, as well as the second placeholder (“/path/to/folder/with/BAMs/2”). The pattern="*.bam$" part can be used to select only a particular type of bam files from the data director(y/ies). For example, if you have both "normal" and realigned bam files in the same directory and the realigned files share a uniform file name ending, you can restrict the analysis to them with a pattern such as *.realigned.bam$ instead of the default *.bam$, which selects all bam files.
  • Line 21 – Here you have the option to limit the analysis to a subset of the bam files in your director(y/ies), based on their file names, i.e. only bam files that contain a specified text string in their names will be considered. To make use of this possibility, uncomment the line (remove the # at its beginning) and replace CLL with the actual text that has to be present in the file names. You can use more complicated rules for selection of files if you are familiar with regular expressions.
  • Line 24 – Replace getwd() with the path (absolute or relative) to the directory where the results should be saved. Otherwise the results will be saved in the current working directory (the directory where the script file is located, if you start the script from there), potentially overwriting older results without warning. The chosen directory must already exist.
  • Line 27 – Set the ncores parameter to the number of CPU cores that are available on your system.
  • Line 30 – the pairwise concordance table will be saved under the name Pairwise_concordance.csv. It is a good idea to modify this generic name by adding to it a descriptor of the analyzed study/samples, thus avoiding the confusion that might ensue when you analyze several groups of samples and get several different results files with the same names in different directories.
  • Line 31 – the SNP readcounts for reference and alternative alleles for all samples will be saved in the file SNP_readcounts.csv. It is a good idea to modify this generic name by adding to it a descriptor of the analyzed study/samples, thus avoiding the confusion that might ensue when you analyze several groups of samples and get several different results files with the same names in different directories.
  • Line 38 – relevant if you want to execute the optional part of the script that checks whether samples that should be concordant are really concordant (e.g. multiple samples from the same individual). This can be tested automatically only if file names contain identifiers for the subjects. File names should follow a specific and uniform pattern, i.e. the subject identifier should be present in a definable part of each file name. In this way, the subject identifier can be extracted using a regular expression so that samples with the same identifier can be compared to each other. The example regular expression at line 38 ^.*-(\\d+)-.*$ assumes that the file names follow the pattern "something-number-something" where the number uniquely identifies each subject/patient. Modify the regular expression to fit your own naming convention, which is most probably different. A good quick guide to regular expressions: https://www.rexegg.com/regex-quickstart.html
  • Line 47 – relevant if you want to execute the optional part of the script that checks whether samples that should be concordant are really concordant (see above). The correlation table of all pairs that should be concordant will be saved in the file Expected_concordant_pairs.csv. It is a good idea to modify this generic name by adding to it a descriptor of the analyzed study/samples, thus avoiding the confusion that might ensue when you analyze several groups of samples and get several different results files with the same names in different directories.
  • Line 52 – relevant if you want to execute the optional part of the script that plots a graphical depiction of the correlations between any two samples of the cohort (this can be useful only for relatively small cohorts, otherwise the plot will be too crowded and unreadable if printed on one page). At this line, you can enter a regular expression between the quotation marks to find and remove repetitive parts of sample names. For example, all of our file names end in _L001_R1_001.cutadapt.alignMEM.dedupOptPIC.realignGATK.reads.bam, reflecting all pipeline stages that led to the generation of the final bam file, and entering _L.* as a regular expression will remove this whole unnecessary part of the names so that labels can fit on the graph. Modify according to your needs. A good quick guide to regular expressions: https://www.rexegg.com/regex-quickstart.html
  • Line 73 – relevant if you want to execute the optional part of the script that plots a graphical depiction of the correlations between any two samples of the cohort (see above). The graph will be saved under the file name Pairwise_concordance.png. It is a good idea to modify this generic name by adding to it a descriptor of the analyzed study/samples, thus avoiding the confusion that might ensue when you analyze several groups of samples and get several different results files with the same names in different directories.
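
The effect of the patterns and regular expressions mentioned for lines 19, 21, 38 and 52 can be tried out on a few file names before the script is changed. The sketch below uses made-up file names and the base R functions grep, sub and split as stand-ins; how Swap_checker.R applies these expressions internally may differ. The last statement shows one possible way to pick a value for ncores (line 27) with parallel::detectCores(), which is an assumption and not taken from the script.

```r
# Made-up file names following a "something-number-something" convention.
bam_names <- c(
  "CLL-001-diag_L001_R1_001.realigned.bam",
  "CLL-001-relapse_L001_R1_001.realigned.bam",
  "CLL-002-diag_L001_R1_001.realigned.bam",
  "MM-003-diag_L001_R1_001.bam"
)

# Line 19: a stricter pattern keeps only the realigned files
# (the dots are escaped here so that they match literal dots only).
realigned <- grep("\\.realigned\\.bam$", bam_names, value = TRUE)

# Line 21: keep only files whose names contain a given text string.
cll_only <- grep("CLL", bam_names, value = TRUE)

# Line 38: extract the subject identifier with the example regular expression.
subject_id <- sub("^.*-(\\d+)-.*$", "\\1", cll_only)

# Samples sharing an identifier are the ones expected to be concordant.
expected_groups <- split(cll_only, subject_id)

# Line 52: strip the repetitive tail of the names to get short plot labels.
labels <- sub("_L.*", "", cll_only)

# Line 27: leave one core free; fall back to 1 if the core count is unknown.
ncores <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

print(subject_id)  # "001" "001" "002"
print(labels)      # "CLL-001-diag" "CLL-001-relapse" "CLL-002-diag"
```

If subject_id comes back identical to the input names, the regular expression did not match and has to be adapted to your naming convention.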

Execution of the script

After changing the parameters as necessary, start the script. There are different ways to do this but the most convenient way is to do it from within RStudio. Select the part of the script that you want to execute (from the beginning to line 32 if you only want to check for unexpected pairing of samples, from the beginning to line 48 if you also want to check whether expected pairs are paired as they should be, or all of the text if in addition you also want to get a graph of the correlation coefficients) and then click on Run. Alternatives without RStudio:

  • You can navigate in the terminal to the directory where Swap_checker has been installed and then issue the command Rscript Swap_checker.R. This will execute the whole script.
  • You can start R in a terminal and then enter the command source("~/Swap_checker/Swap_checker.R") (if Swap_checker is installed in a different directory on your computer, you will have to modify the path accordingly).

Depending on the speed of your computer, the sizes of the dataset and of the SNP list, and the extent of the analysis (only matching samples, or also checking whether expected pairs are really concordant, with or without producing a graph of the correlation coefficients), the script will take from several minutes (tens of samples) to a few hours (thousands of samples) to execute.
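
Both alternatives run the whole script. To run only its first part without RStudio, something like the following sketch could be used. The helper run_script_up_to is hypothetical, and it assumes that the chosen line is the end of a complete R expression.

```r
# Hypothetical helper: evaluate only the first 'last_line' lines of an R script.
# 'last_line' must fall at the end of a complete expression, otherwise parse() fails.
run_script_up_to <- function(script, last_line, envir = globalenv()) {
  code <- readLines(script, n = last_line)
  eval(parse(text = code), envir = envir)
}

# Demonstration on a throwaway three-line script.
demo_script <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "y <- x + 1", "z <- y * 10"), demo_script)
demo_env <- new.env()
run_script_up_to(demo_script, last_line = 2, envir = demo_env)
ls(demo_env)  # "x" "y"
```

For example, run_script_up_to("~/Swap_checker/Swap_checker.R", 32) would correspond to checking only for unexpected pairing of samples, and 48 to additionally checking the expected pairs. If the script refers to any files by relative path, set the working directory to the installation directory first.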
