At Recursion, we build maps of biology and chemistry to explore uncharted areas of disease biology, unravel its complexity, and industrialize drug discovery. Just as a map helps to navigate the physical world, our maps are designed to help us understand as much as we can about the connectedness of human biology so we can navigate the path to new medicines more efficiently.
Our maps are built using image-based high-dimensional data generated in-house. We conduct up to 2.2 million experiments every week in our highly automated labs, where we use deep learning models to embed high dimensional representations of billions of images of human cells that have been manipulated by CRISPR/Cas9-mediated gene knockouts, compounds, or other reagents. This allows us to create representations that can be compared and contrasted to predict trillions of relationships across biology and chemistry — even without physically testing all of the possible combinations. Recursion's Maps and associated applications help navigate complex biology and chemistry by revealing relationships across genes and chemical compounds.
RxRx3 is a publicly available map of biology that represents a small subset – less than 1% – of Recursion’s total dataset. MolRec™️ is a simple demo example of such an application that can be built on this type of map.
Please use the following format to cite this dataset as a whole:
We used the RxRx3 dataset (Fay et al. (2023). RxRx3: Phenomics Map of Biology. bioRxiv 2023.02.07.527350), available from RxRx.ai.
For more information about RxRx3 please visit RxRx.ai/rxrx3
RxRx3 is part of a larger set of Recursion datasets that can be found at RxRx.ai and on GitHub. For questions about this dataset and others please email [email protected].
The RxRx3 release contains 17,063 genes, as well as 1,674 known chemical entities at 8 doses each. 16,328 of these genes are anonymized in the dataset, enabling people to explore and learn from this massive dataset while protecting Recursion’s business interests. Recursion may de-anonymize genes in this dataset in the future. If you'd like to understand more about how to get access to unblinded genes please email [email protected].
The metadata can be found in metadata.csv
and downloaded from here. The schema of the metadata is as follows:
Attribute | Description |
---|---|
well_id | Experiment Name - Plate - Well (compound-004_1_AA04 or gene-088_9_Z43) |
experiment_name | Experiment Name: Experiment number (compound-004 or gene-088) |
plate | Plate number in the experiment (1-48) |
address | Well location on the plate - "A01" to "AF48". |
gene | Unblinded or anonymized gene name, or a control |
treatment | Compound synonym or gene-name - guide-number (Narlaprevir or <gene_name>_guide_1) |
SMILES | Canonical SMILES or blank for non-compounds |
concentration | Compound concentration tested (in uM) |
perturbation_type | CRISPR or COMPOUND |
cell_type | HUVEC |
To help understand the metadata, we have included some samples that some some of the more complex parts of the format to allow parser testing and validation
well_id,experiment_name,plate,address,gene,treatment,SMILES,concentration,perturbation_type,cell_type
gene-079_8_H29,gene-079,8,H29,RPLP2,RPLP2_guide_4,,,CRISPR,HUVEC
gene-045_4_AD27,gene-045,4,AD27,RXRX3-43938,RXRX3-43938_guide_6,,,CRISPR,HUVEC
gene-060_9_P28,gene-060,9,P28,EMPTY_control,EMPTY_control,,,CRISPR,HUVEC
compound-001_19_D20,compound-001,19,D20,,Dequalinium,"CC1=[N+](CCCCCCCCCC[N+]2=C(C)C=C(N)C3=CC=CC=C23)C2=CC=CC=C2C(N)=C1 |c:1,13,21,29,31,35,t:16,19,23,27|",0.25,COMPOUND,HUVEC
compound-001_11_U08,compound-001,11,U08,,EMPTY_control,,,COMPOUND,HUVEC
compound-004_43_B08,compound-004,43,B08,,CRISPR_control,,,COMPOUND,HUVEC
The images are found in images/*
and can be downloaded from here (this is ~ 83 TB).
The image data are 2048x2048 16-bit png
files. These can be downloaded by experiment plate. We provide a tar file for each experiment and plate, with each image from the experiment plate in the tar file. The image file names, such as AA02_s1_w3.png
, can be read as:
Well location on plate (column AA, row 2) Site (1) Channel (3)
All six channels (w1
- w6
) make up an single image of a given site
. Note there is one site only for every well address.
Physical resolution: 0.65 micron/pixel.
The deep learning embeddings are provided as embeddings.tar
and can be downloaded from here (this is ~ 1.82 GB).
Embeddings path is similar to the images path structure. Ex: gene-001/Plate1/embeddings.parquet
Each row in the parquet file has a well_id
as described in the metadata schema. The remaining 128 columns are the embedding for that respective well
- January 2023: initial release
This work is licensed under Recursion Non-Commercial End User License Agreement