Skip to content

Commit

Permalink
create r doc and few index changes
Browse files Browse the repository at this point in the history
  • Loading branch information
connor-french committed Aug 8, 2024
1 parent 44dd925 commit 048274d
Show file tree
Hide file tree
Showing 2 changed files with 359 additions and 11 deletions.
27 changes: 16 additions & 11 deletions docs/index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ format:

::: {.callout-note}
## Note for R users
*spaceprime* is implemented in Python, yet many interested users may come from an R background. I have a [*spaceprime* for R users]() vignette (COMING SOON) that provides a brief introduction to the Python concepts necessary to use *spaceprime* through a practical walk-through of an example analysis. If you still want to use R, it is possible to use Python code in R using the *reticulate* package. For more information on how to use Python code in R, see the [reticulate documentation](https://rstudio.github.io/reticulate/).
*spaceprime* is implemented in Python, yet many interested users may come from an R background. I have a [*spaceprime* for R users](spaceprime-for-r-users.qmd) vignette that provides a brief introduction to the Python concepts necessary to use *spaceprime* through a practical walk-through of an example analysis. If you still want to use R, it is possible to use Python code in R using the *reticulate* package. For more information on how to use Python code in R, see the [reticulate documentation](https://rstudio.github.io/reticulate/).
:::

## Main features
Expand Down Expand Up @@ -102,7 +102,7 @@ Make sure to install the relevant analysis dependencies with `pip install spacep
## Quickstart

::: {.callout-tip}
This quickstart guide assumes you have a basic understanding of Python. If you are an R user, please refer to the [spaceprime for R Users vignette](https://connor-french.github.io/spaceprime) for an overview of *spaceprime* with the necessary Python concepts explained.
This quickstart guide assumes you have a basic understanding of Python. If you are an R user, please refer to the [spaceprime for R users vignette](spaceprime-for-r-users.qmd) for an overview of *spaceprime* with the necessary Python concepts explained.
:::

### 1. Download data
Expand Down Expand Up @@ -137,7 +137,7 @@ plt.tight_layout()
plt.show()
```

The GeoJSON file contains geographic localities of *P. distincta* in the Brazilian Atlantic Forest, along with metadata about each locality. Here are the localities plotted on top of the present-day habitat suitability model:
The GeoJSON file contains geographic sampling localities of *P. distincta* in the Brazilian Atlantic Forest, along with metadata about each locality. Each row is a single individual/observation. *spaceprime* counts the number of observations with coordinates that overlap with a raster cell/deme and samples the calculated number for simulations and summary statistics. Here are the localities plotted on top of the present-day habitat suitability model:

```{python}
#| label: localities
Expand All @@ -161,22 +161,28 @@ plt.show()

### 2. Read in packages and data

Now that we have our data, let's read in the packages and data we'll be using. The [GeoPandas](https://geopandas.org/en/stable/) and [Rasterio](https://rasterio.readthedocs.io/en/stable/index.html) packages are dependencies of *spaceprime*, so you shouldn't need to install them separately. They are needed for reading in locality data and habitat suitability rasters, respectively. Make sure to replace the `projections.tif` file path with the path to the file on your system:
Now that we have our data, let's read in the packages and data we'll be using. The [GeoPandas](https://geopandas.org/en/stable/) and [Rasterio](https://rasterio.readthedocs.io/en/stable/index.html) packages are dependencies of *spaceprime*, so you shouldn't need to install them separately. They are needed for reading in locality data and habitat suitability rasters, respectively.

```{python}
#| label: import-data
#| label: import-packages
#| eval: false
import spaceprime as sp
import geopandas as gpd
import rasterio
```

Make sure to replace the `projections.tif` file path with the path to the file on your system. The GeoJSON file is read in from the web, so you don't need to download it.

```{python}
#| label: read-data
#| eval: false
r = rasterio.open("projections.tif")
locs = gpd.read_file("https://raw.githubusercontent.com/connor-french/spaceprime/main/spaceprime/data/localities.geojson")
```


### 3. Set up the demographic model

Next, we'll convert the habitat suitability values into deme sizes, so each cell in the raster will represent a deme in our model. We'll use a linear transformation to convert the suitability values to deme sizes, where the suitability value is multiplied by a constant to get the deme size. The constant is the maximum local deme size, which we set to 1000. For more on transformations, see the [suitability to deme size transformation functions vignette](transformation-functions.qmd).
Expand Down Expand Up @@ -239,7 +245,7 @@ sp.plot_model(demo, r, 0)
```


### 4. Simulate genetic data
### 5. Simulate genetic data

Before simulating this demography, we need to create a sample dictionary that translates the empirical sampling localities to the model's deme indices and maps those to the number of samples to take from each deme. By default, this function sets the number of individuals to sample from each deme to the number of empirical localities in that deme. The `coords_to_sample_dict()` function also returns two other dictionaries that are not used in this example, so we'll ignore them.

Expand All @@ -250,7 +256,7 @@ Before simulating this demography, we need to create a sample dictionary that tr
sample_dict, _, _ = sp.coords_to_sample_dict(r, locs)
```

Now we get to simulate! The first task is to simulate the ancestry of the samples using the coalescent. All of the hard work is done through `msprime`'s `sim_ancestry()` function, for which `spaceprime` provides a convenience wrapper. This function returns a [tskit TreeSequence](https://tskit.dev/tskit/docs/stable/python-api.html#trees-and-tree-sequences), which "represents a sequence of correlated evolutionary trees along a genome" and is and incredibly powerful and compact data representation for population genomic analyses. The minimum number of arguments required for this function are the sample dictionary and the demographic model. The sequence length is also necessary for overlaying mutations in the next step. Notice the lack of mutations in the table. We'll set `record_provenance` to False to decrease the memory overhead of storing a bunch of metadata about the simulation.
Now we get to simulate! The first task is to simulate the ancestry of the samples using the coalescent. All of the hard work is done through `msprime`'s `sim_ancestry()` function, for which `spaceprime` provides a convenience wrapper. This function returns a [tskit TreeSequence](https://tskit.dev/tskit/docs/stable/python-api.html#trees-and-tree-sequences), which "represents a sequence of correlated evolutionary trees along a genome" and is an incredibly powerful and compact data representation for population genomic analyses. The minimum number of arguments required for this function are the sample dictionary and the demographic model. If you need to overlay mutations, you need to supply the sequence length. Notice the lack of mutations in the table. We'll set `record_provenance` to False to decrease the memory overhead of storing a bunch of metadata about the simulation.

This step may take a minute or so to run.

Expand All @@ -264,7 +270,7 @@ print(sim)
```


We'll take a peak at a single tree from the TreeSequence object to see what it looks like. The `draw_tree()` function is a convenience function that uses `msprime`'s `draw_text()` function to plot a single tree from the TreeSequence object. Here, I removed the node labels because there are tons of nodes that crowd the plot and we're only interested in the tree structure.
We'll take a peak at a single tree from the TreeSequence object to see what it looks like. The `draw_svg()` method plots trees from the TreeSequence object. Here, I selected a single tree and removed the node labels because there are tons of nodes that crowd the plot and we're only interested in the tree structure.

```{python}
#| label: draw-tree
Expand All @@ -282,7 +288,6 @@ Overlaying mutations after simulating ancestry isn't necessary for calculating g
#| label: mutations
#| eval: true
sim = sp.sim_mutations(sim, rate=1e-10, random_seed=490)
print(sim)
Expand All @@ -305,7 +310,7 @@ first_tree_mut.draw_svg(y_axis=True, size=(600, 400), node_labels=node_labels)
From here, you have a few options. You can:

- Use the `analysis` module to calculate genetic summary statistics on the TreeSequence object. For more on the `analysis` module, see the [analysis module documentation]().
- Save the TreeSequence to use later or analyze on a platform like [tskit](https://tskit.dev/tskit/docs/stable/introduction.html) with `sim.dump(file/path/to/write/to)`.
- Save the TreeSequence to use later or analyze on a platform like [tskit](https://tskit.dev/tskit/docs/stable/introduction.html) with `sim.dump(file/path/to/write/to.trees)`.
- Convert the TreeSequence with mutations to a genotype matrix for use in a program like [scikit-allel](https://scikit-allel.readthedocs.io/en/stable/) with `sim.genotype_matrix()`. For more information on this function, see the [tskit documentation](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.genotype_matrix).
- Export the TreeSequence with mutations to a VCF file using `sim.write_vcf`. For more information on how to use this function, see the [tskit documentation](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.write_vcf).

Expand Down
Loading

0 comments on commit 048274d

Please sign in to comment.