Skip to content
ManuelTgn edited this page Dec 4, 2020 · 30 revisions

Welcome to the GRAFIMO wiki!

In a hurry? Check out our Quickstart guide

Regulatory proteins, such as Transcription Factors (TFs), are key genomic elements which promote or reduce the expression of genes by binding short, evolutionary conserved DNA sequences, often referred to as motifs. Mutations occurring in DNA motifs have been shown to have deleterious effects on the transcriptional landscape of the cell (Li & Ovcharenko, 2015; Guo et al., 2018). The recent introduction of Genome Variation Graphs (VG) (Garrison et al., 2018) allowed to represent in a single and efficient data-structure the genomic variation present within a population of individuals.

GRAFIMO (GRAph-based Finding of Individual Motif Occurrences) is a command-line tool that extends the traditional Position Weight Matrix (PWM) scanning procedure to VGs. GRAFIMO can search the occurrences of a given PWM in many genomes in a single run, accounting for the effects that SNPs, indels and potentially any structural variation (handled by VG) have on found potential motif occurrences. As result, GRAFIMO produces a report containing the statistically significant motif candidates found, reporting their frequency within the haplotypes embedded in the scanned VG and if they contain genomic variants or belong to the reference genome sequence.

For any bug report, doubts or improvement suggestions, do not hesitate to open an issue!

Before you start

Running GRAFIMO

Looking for an hands-on tutorial? Check out our practical tutorials.

In this section will be presented how to run GRAFIMO and some advanced options. Here we assume that the genome variation graph (VG) has been built constructing a VG for each chromosome. If you have a single whole genome variation graph, substitute -d argument with -g followed by the path to your whole genome VG. If you do not have a genome variation graph, here is described how to construct it with GRAFIMO.

Note that in both cases the XG and GBWT indexes must be stored in the same location.

Doubts on building genomes as single whole genome variation graph or constructing a VG for each chromosome? Check out our discussion.

Input

GRAFIMO requires three mandatory arguments:

  • path to a directory containing the chromosomes VGs (XG and GBWT indexes) or path to the whole genome variation graph (XG and GBWT indexes). See VG's wiki for further details on XG and GBWT indexes.
  • path to PWM motif file in MEME or JASPAR format
  • BED file containing a set of genomic regions, where GRAFIMO will search the potential motif occurrences

Run

To run GRAFIMO from command-line

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file

By default GRAFIMO will create a directory in the current location called grafimo_out_PID_MOTIFID, containing the results. For further details on result files see Results description section.

Advanced options

Using background distributions

For each potential motif occurrence GRAFIMO computes a log-likelihood score, a P-value and a q-value. Such measures are weighted by a background probability distribution. By default, GRAFIMO assumes a uniform background distribution for nucleotides. The user can specify a different background distribution in a text file and give it to GRAFIMO using -k option. An example of background file is

A	0.2951
C	0.2047
T	0.2955
G	0.2048

Assume that our background distribution has been written in a file named bg_nt, then we call GRAFIMO with

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file -k path/to/bg_nt

For an example of background files accepted by GRAFIMO, take a look at bg_nt in tutorials/findmotif_tutorial/data directory.

Setting thresholds on motif occurrences statistical significance

By default GRAFIMO applies a threshold of 1e-4 on the P-value of each retrieved potential motif occurrence. So, will be reported the motif candidates with an associated P-value smaller than 1e-4. The threshold can be changed by using the -t option. For example, let us set a threshold of 1e-3 on the P-values

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file -t 1e-3

GRAFIMO, besides P-values, computes q-values for the motif occurrences candidates. The user can apply a threshold on q-values, rather than on P-values, by using the --qvalueT option. --qvalue option can be used in combination with -t to define a threshold value different from 1e-4. Let us apply a threshold of 1e-5 on q-values

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file --qvalueT -t 1e-5

Exploring all possible recombinants

GRAFIMO by default reports only those potential motif occurrences which are found in at least one of the haplotypes encoded in the VG. GRAFIMO allows to include in the final report also those motif occurrences (surviving the threshold on statistical significance) which are not observed in the haplotypes encoded in the VG, but which can be obtained from the set of genomic variants, used to construct the genome variation graph. In order to do this, we use the --recomb option

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file --recomb

Searching motif only on certain chromosomes

It is possible to limit GRAFIMO search for potential motif occurrences to a subset of chromosomes by using the -c option. Let us assume we are interested only in what is happening on chr2, chr10 and chrX. In order to do this, we call GRAFIMO from command-line with

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file -c 2 10 X

Setting output options

As mentioned in "How to run GRAFIMO" section, by default, GRAFIMO creates a directory named grafimo_out_PID_MOTIFID, which contains the results. To store GRAFIMO results in a different directory (can be both a new or an already existing directory), we call GRAFIMO with -o option

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file -o /path/to/my/output/directory

GRAFIMO allows also to explore the VG structure for a user-defined number of regions (the top n regions, starting from the one containing the potential binding site with most significant P-value). GRAFIMO will create a PNG image for each one of the best n regions. The images will be stored in results directory, inside top_graphs folder. Let us assume we want to inspect the structure of the best 10 regions. In order to do this, we call GRAFIMO with --top-graph option

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file --top-graph 10

It is also possible to visualize results directly on stdout, without writing report files. Note that on the stdout will be printed only the content of the TSV report. To do this, we call GRAFIMo from command-line with -f option

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file -f

Setting the number of cores

By default, GRAFIMO will use all available cores. As discussed here, the user can set a smaller number of used cores, in order to limit the amount of required resources. For example, to run GRAFIMO using just four cores, we type

grafimo findmotif -d /path/to/my/xg/and/gbwt/directory -m /path/to/my/motif -b /path/to/my/bed/file --cores 4

For more options

For more options available when running GRAFIMO, type from command-line

grafimo -h 

Results description

GRAFIMO results are reported in three files (stored in output directory):

  • tab-delimited report (TSV report)
  • HTML report
  • GFF3 report

The TSV report contains all the statistically significant potential motif occurrence found by GRAFIMO (according to the applied threshold). Each retrieved motif occurrence has a log-likelihood score, a P-value, a q-value, its DNA sequence, a flag value stating if a sequence is part of the reference or has been found in the haplotypes and the number of haplotype sequences where the motif candidate sequence occurs. An example of TSV report is the following

	motif_id	motif_alt_id	sequence_name	start	stop	strand	score	p-value	q-value	matched_sequence	haplotype_frequency	reference
1	MA0139.1	CTCF	chr22:43481590-43481860	43481733	43481714	-	21.26229508196724	4.403657357543095e-08	0.004175283686980911	AAGCCAGCAGGGGGCACAG	5096	ref
2	MA0139.1	CTCF	chr22:19038291-19038561	19038422	19038441	+	19.245901639344254	1.9442011615088443e-07	0.005962538354344465	TGGCCAGCAAGGGGCACTG	4	non.ref
3	MA0139.1	CTCF	chr22:19038291-19038561	19038422	19038441	+	19.114754098360663	2.1268826066771178e-07	0.005962538354344465	CGGCCAGCAAGGGGCACTG	5092	ref
4	MA0139.1	CTCF	chr22:40856678-40856948	40856891	40856910	+	18.295081967213093	3.6764803446618004e-07	0.005962538354344465	TCCCCTCCAGGGGGCGACG	5096	ref
5	MA0139.1	CTCF	chr22:11285607-11285877	11285804	11285785	-	18.213114754098342	3.8774723287177635e-07	0.005962538354344465	ATACCGCCAGGTGGCAGCA	5096	ref
6	MA0139.1	CTCF	chr22:22125904-22126174	22126044	22126063	+	18.13114754098359	4.088625891963074e-07	0.005962538354344465	CAGCCTGCAGATGGCACAG	5096	ref
7	MA0139.1	CTCF	chr22:20146797-20147067	20147010	20147029	+	17.688524590163922	5.4295945317287e-07	0.005962538354344465	CGGCCCGCAGGGGGCGGAT	5092	ref
8	MA0139.1	CTCF	chr22:34842682-34842952	34842827	34842846	+	17.672131147540995	5.486120126825257e-07	0.005962538354344465	GAGCCAGTAGGGGACAGCG	146	non.ref
9	MA0139.1	CTCF	chr22:42532903-42533173	42533062	42533081	+	17.622950819672155	5.659801842459994e-07	0.005962538354344465	GGGCCACCAGAGGGCTCCT	5096	ref
10	MA0139.1	CTCF	chr22:34842682-34842952	34842827	34842846	+	17.44262295081967	6.331282484526275e-07	0.006002942174878742	GAGCCAGTAGGGGACAGTG	4950    ref

This report can be easily processed for a downstream analysis by using Python or R programming languages, for example.

The HTML report has the same content of the TSV, but it can be loaded and viewed on the most commonly used web browsers.

The GFF3 report can be loaded on the UCSC Genome Browser as a custom track. For example, this allows a fast linking between the genomic variants used to build the VG and those present in annotated databases like dbSNP or ClinVar.

References

Li, Shan, and Ivan Ovcharenko. "Human enhancers are fragile and prone to deactivating mutations." Molecular biology and evolution 32.8 (2015): 2161-2180.

Guo, Yu Amanda, et al. "Mutation hotspots at CTCF binding sites coupled to chromosomal instability in gastrointestinal cancers." Nature communications 9.1 (2018): 1-14.

Garrison, Erik, et al. "Variation graph toolkit improves read mapping by representing genetic variation in the reference." Nature biotechnology 36.9 (2018): 875-879.