Copy paste old outputs content for pangenome analyses
jpjarnoux committed Nov 27, 2023
1 parent bb7384d commit 8026a2b
Showing 3 changed files with 176 additions and 9 deletions.
43 changes: 42 additions & 1 deletion docs/user/PangenomeAnalyses/pangenomeFigures.md
@@ -1,9 +1,50 @@
### Pangenome figures output

#### U-shape plot
A U-shaped plot shows the number of gene families (y axis) against the number of organisms in which they are present (x axis).
It is an .html file that can be opened in any browser. You can interact with it (zoom, pan, and hover to see exact values) and save the current view as a .png image file.

It can be generated using the 'draw' subcommand as follows:

`ppanggolin draw -p pangenome.h5 --ucurve`
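
To make the axes of the U-shaped plot concrete, here is a minimal Python sketch of the underlying count. It is an illustration only, not PPanGGOLiN's implementation: the `family_to_organisms` dictionary and the `ushape_counts` helper are hypothetical names, and the data are made up.

```python
from collections import Counter

# Hypothetical presence/absence data: gene family -> organisms that carry it.
family_to_organisms = {
    "famA": {"org1", "org2", "org3"},
    "famB": {"org1", "org2", "org3"},
    "famC": {"org1", "org2"},
    "famD": {"org3"},
    "famE": {"org1"},
}


def ushape_counts(presence):
    """Map each number of organisms (x axis) to the number of
    gene families found in exactly that many organisms (y axis)."""
    return Counter(len(organisms) for organisms in presence.values())


counts = ushape_counts(family_to_organisms)
for n_organisms in sorted(counts):
    print(n_organisms, counts[n_organisms])
```

With real data, the "U" shape comes from the many families found in very few organisms on the left and the many families found in almost all organisms on the right.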

#### Tile plot

A tile plot is a heatmap representing the gene families (y axis) in the organisms (x axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in an organism and uncolored if absent. The gene families are ordered by partition, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).

This plot helps reveal potential structure in your pangenome and can help you identify possible outliers. You can interact with it: hovering over a tile shows the gene identifier(s), the gene family and the organism that correspond to that tile.

If you build your pangenome using the 'workflow' subcommand and you have more than 500 organisms, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.

It can be generated using the 'draw' subcommand as follows:

`ppanggolin draw -p pangenome.h5 --tile_plot`

If you do not want to draw the 'cloud' gene families, which add a lot of data and can make the figure hard to open in a browser, use the following option:

`ppanggolin draw -p pangenome.h5 --tile_plot --nocloud`
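
The idea that genomes with similar gene family content end up next to each other in the tile plot can be illustrated with a pairwise distance. The sketch below uses the Jaccard distance on made-up data; this choice of distance is an assumption made for illustration, since the text above only says that the ordering comes from a hierarchical clustering on shared gene families. The names `genome_to_families` and `jaccard_distance` are hypothetical.

```python
from itertools import combinations

# Hypothetical data: organism -> set of gene families it contains.
genome_to_families = {
    "org1": {"famA", "famB", "famC"},
    "org2": {"famA", "famB", "famD"},
    "org3": {"famA", "famE", "famF"},
}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0 means identical family content."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Distance for every pair of organisms.
distances = {
    (g1, g2): jaccard_distance(genome_to_families[g1], genome_to_families[g2])
    for g1, g2 in combinations(sorted(genome_to_families), 2)
}

# The pair with the smallest distance would be placed closest together.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

A hierarchical clustering would then repeatedly merge the closest genomes or groups of genomes, which yields the left-to-right order of the columns.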

#### Rarefaction curve
This figure is not drawn by default by the 'workflow' subcommand because it is computationally expensive. It shows how the number of gene families in each partition changes as you add more genomes to the pangenome. It has been widely used in the literature as an indicator of how much of the diversity of your taxonomic group is missing from your dataset. The idea is that if adding genomes to your pangenome no longer adds gene families, your dataset may already capture the entire diversity of the taxonomic group. Conversely, if each new genome still adds many gene families, you are probably still missing a lot of them.

There are 8 partitions represented. For each partition, the observed data are shown in several ways: the mean, the median, and the 1st and 3rd quartiles of the number of gene families per number of genomes used. The fit of the data to Heaps' law is also shown; this law is commonly used to describe how gene family diversity evolves in each of the partitions.
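
As an illustration of what fitting Heaps' law means, the sketch below assumes the common power-law form F(N) = kappa * N^gamma, where N is the number of genomes and F(N) the number of gene families. Both this parametrisation and the fitting method (ordinary least squares on log-transformed values) are assumptions chosen for simplicity; they are not necessarily what PPanGGOLiN does. The `fit_heaps` function and the data are hypothetical.

```python
import math


def fit_heaps(n_genomes, n_families):
    """Estimate (kappa, gamma) in F(N) = kappa * N**gamma by a
    least-squares line fit of log(F) against log(N)."""
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(f) for f in n_families]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope in log-log space
    kappa = math.exp(mean_y - gamma * mean_x)  # intercept, back-transformed
    return kappa, gamma


# Synthetic, noise-free data generated with kappa = 1000 and gamma = 0.5.
genomes = [1, 2, 5, 10, 20, 50]
families = [1000 * n ** 0.5 for n in genomes]

kappa, gamma = fit_heaps(genomes, families)
print(round(kappa, 3), round(gamma, 3))
```

Under this form, a gamma close to 0 means that new genomes add almost no new families, while a larger gamma means that the number of families keeps growing as genomes are added.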

It can be generated using the 'rarefaction' subcommand, which is dedicated to drawing this graph, as follows:

`ppanggolin rarefaction -p pangenome.h5`

Many options can be used with this subcommand to tune your rarefaction curves; most of them are the same as with the `partition` workflow.
The following 3 are specific to the rarefaction:

- `--depth` defines the number of samples drawn for each number of organisms (default 30)
- `--min` defines the minimal number of organisms in a sample (default 1)
- `--max` defines the maximal number of organisms in a sample (default 100)

For example, the following command:
`ppanggolin rarefaction -p pangenome.h5 --min 5 --max 50 --depth 30`

will draw a rarefaction curve with sample sizes from 5 to 50 genomes, with 30 samples at each point (30 samples of 5 genomes, 30 samples of 6 genomes, and so on up to 50 genomes).
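
The roles of `--min`, `--max` and `--depth` can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual implementation: it counts all gene families together instead of partitioning each sample, and the `rarefaction` function, its parameters and the toy data are hypothetical.

```python
import random
import statistics


def rarefaction(genome_to_families, min_size, max_size, depth, seed=0):
    """For each sample size between min_size and max_size, draw `depth`
    random subsets of genomes and summarise the number of gene families
    found in each subset as (mean, median)."""
    rng = random.Random(seed)
    genomes = sorted(genome_to_families)
    max_size = min(max_size, len(genomes))  # cannot sample more than we have
    curve = {}
    for size in range(min_size, max_size + 1):
        counts = []
        for _ in range(depth):
            sample = rng.sample(genomes, size)
            families = set().union(*(genome_to_families[g] for g in sample))
            counts.append(len(families))
        curve[size] = (statistics.mean(counts), statistics.median(counts))
    return curve


# Toy data: every genome has one shared family and one unique family.
toy = {f"org{i}": {"core", f"unique{i}"} for i in range(1, 5)}

curve = rarefaction(toy, min_size=1, max_size=3, depth=30)
for size, (mean, median) in curve.items():
    print(size, mean, median)
```

In this toy dataset every sample of n genomes contains exactly n + 1 families, so the curve never flattens; real data would show variation between samples of the same size, which is what the quartiles on the figure summarise.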

#### ProkSee
[//]: # (TODO after merge with split command)
32 changes: 30 additions & 2 deletions docs/user/PangenomeAnalyses/pangenomeGraphOut.md
@@ -1,6 +1,34 @@
### Pangenome graph output

#### Gephi
The graph is provided in the .gexf and _light.gexf files. The _light.gexf file contains the gene families as nodes and the edges between gene families describing their relationships. The .gexf file contains the same thing, plus more information about each gene and each relation between gene families.
We provide two files representing the same graph because, while the non-light file is exhaustive, it can be very heavy to manipulate, and most of the information in it is not of interest to everyone. The _light.gexf file should be the one you use to manipulate the pangenome graph most of the time.

They can be manipulated and visualised with [Gephi](https://gephi.org/), which we have tested extensively, or potentially with any other software or library that can read gexf files, such as [networkx](https://networkx.github.io/documentation/stable/index.html) or [gexf-js](https://github.com/raphv/gexf-js), among others.

Using Gephi, the layout can be tuned as illustrated below:

![Gephi layout](../../_static/gephi.gif)

We recommend Gephi's "Force Atlas 2" algorithm to compute the graph layout, with "Stronger Gravity: on" and "scaling: 4000", but don't hesitate to tinker with the layout parameters.

In the _light.gexf file:
The nodes contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family.

The edges contain the number of times they are present in the pangenome.

In addition to this, the non-light .gexf file contains all the information about the genes belonging to each gene family (their names, their product strings and their sizes) and all the information about the neighborhood relationships of each pair of genes described by the edges.

The light gexf can be generated using the 'write' subcommand as follows:

`ppanggolin write -p pangenome.h5 --light_gexf`

while the gexf file can be generated as follows:

`ppanggolin write -p pangenome.h5 --gexf`
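
Because GEXF is an XML format, the files can also be inspected without Gephi. The sketch below lists nodes and edges using only the Python standard library. It runs on a tiny hand-written GEXF document, not on real PPanGGOLiN output; the `summarize_gexf` helper is hypothetical, and the only attribute names used (`id`, `source`, `target`) are the standard GEXF ones. Per-node attributes such as partition or gene counts are not parsed here because their exact names in the output are not given above.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a real file; the namespace version is an assumption.
GEXF_EXAMPLE = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="famA" label="famA"/>
      <node id="famB" label="famB"/>
      <node id="famC" label="famC"/>
    </nodes>
    <edges>
      <edge id="0" source="famA" target="famB"/>
      <edge id="1" source="famB" target="famC"/>
    </edges>
  </graph>
</gexf>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_gexf(root):
    """Return (node ids, (source, target) pairs) found in a GEXF tree."""
    nodes = [el.get("id") for el in root.iter() if local_name(el.tag) == "node"]
    edges = [
        (el.get("source"), el.get("target"))
        for el in root.iter()
        if local_name(el.tag) == "edge"
    ]
    return nodes, edges


root = ET.fromstring(GEXF_EXAMPLE)
nodes, edges = summarize_gexf(root)
print(len(nodes), "gene families,", len(edges), "edges")
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(...)`, where `path` is a placeholder for the location of your .gexf or _light.gexf file.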

#### JSON
The content of the JSON file corresponds to that of the .gexf file, but in JSON rather than GEXF format. It follows the 'node-link' format, as shown in [this example](https://observablehq.com/@d3/force-directed-graph) in JavaScript or as used in the [networkx](https://networkx.github.io/documentation/stable/reference/readwrite/json_graph.html) Python library, and it should be usable with both [D3js](https://d3js.org/) and [networkx](https://networkx.github.io/documentation/stable/index.html), or with any other software or library that supports this format.

The JSON file can be generated using the 'write' subcommand as follows:

`ppanggolin write -p pangenome.h5 --json`
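
As a rough illustration of the 'node-link' layout, the sketch below builds an adjacency map from such a document using only the Python standard library. The key names (`nodes`, `links`, `id`, `source`, `target`) follow the usual node-link convention and are assumptions here: check them against your own file, which will also carry additional attributes. The example document and the `adjacency_from_node_link` helper are hypothetical.

```python
import json

# Tiny hand-written node-link document standing in for a real file.
NODE_LINK_EXAMPLE = """
{
  "nodes": [{"id": "famA"}, {"id": "famB"}, {"id": "famC"}],
  "links": [
    {"source": "famA", "target": "famB"},
    {"source": "famB", "target": "famC"}
  ]
}
"""


def adjacency_from_node_link(data):
    """Build {node id: set of neighbouring node ids} for an undirected graph."""
    adjacency = {node["id"]: set() for node in data["nodes"]}
    for link in data["links"]:
        adjacency[link["source"]].add(link["target"])
        adjacency[link["target"]].add(link["source"])
    return adjacency


data = json.loads(NODE_LINK_EXAMPLE)
adjacency = adjacency_from_node_link(data)
for family in sorted(adjacency):
    print(family, sorted(adjacency[family]))
```

For a file on disk, `json.load(open_file)` would replace `json.loads(...)`; the same dictionary could then be passed to a graph library that understands the node-link format.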