
Gene regulatory network inference (with prior knowledge) #900

Open
stkmrc opened this issue Jun 2, 2024 · 10 comments
Labels: task (Add a new task)


stkmrc commented Jun 2, 2024

Task motivation

Gene Regulatory Network (GRN) inference is pivotal in systems biology, offering profound insights into the complex mechanisms that govern gene expression and cellular behavior. These insights are crucial for advancing our understanding of biological processes and have significant implications in medical research, particularly in developing targeted therapies and understanding disease mechanisms.

Computational Challenges

Despite its importance, GRN inference from single-cell RNA-Seq data is challenged by the high dimensionality of the data, inherent data noise, sparsity of the data, sparsity of the networks to be inferred, the lack of known negative edges in the GRN (positive unlabeled setting) and the ambiguity of possible causal explanations for the data. Available computational approaches often struggle with these issues, leading to inaccurate or overfitted models.

Research Gap

Current methods range from statistical correlations to advanced machine learning, each with limitations in terms of accuracy, data requirements, and interpretability. Multiple benchmarking studies exist, differing in their evaluation choices, such as the negative-sampling strategy, the metrics used, and the reliance on synthetic vs. experimental data. What is missing is a more standardized way of benchmarking using biologically meaningful metrics.

Task description

The task focuses on the inference of GRNs from scRNA-Seq data. It is divided into two subtasks based on the availability of prior knowledge:

  1. GRN Inference without prior knowledge: Inferring the GRN solely from scRNA-Seq data.
  2. GRN Inference with prior knowledge: Inferring the GRN from scRNA-Seq data using an additional prior knowledge graph (a subset of edges from the ground-truth GRN).

Input Data

  • For Subtask 1: Normalized and preprocessed scRNA-Seq data
  • For Subtask 2: In addition to the scRNA-Seq data, a subset of given edges of the GRN as prior knowledge

Expected Output

The output for both subtasks is a predicted GRN, represented as a graph where nodes are genes and edges indicate regulatory interactions. The quality of the predicted networks can be evaluated in two main ways:

  1. Binary Classification: Each potential interaction (edge) is classified as either present or absent (like this)
  2. Topological Evaluation: The overall structure and properties of the predicted network are assessed (like this)

Proposed ground-truth in datasets

  1. Synthetic, Curated and Experimental datasets from (BEELINE)
  2. Experimental datasets from (this paper)

Initial set of methods to implement

  1. MLPs
  2. Graph Neural Network-based diffusion models (GCN / GAT)

Proposed control methods

  1. Pearson / Spearman correlation
  2. Random predictor
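As a concrete illustration of these controls, here is a minimal sketch in Python, assuming a normalized genes × cells expression matrix; the function and variable names are illustrative, not taken from an existing implementation:

```python
import numpy as np
from scipy import stats

def correlation_grn(expr: np.ndarray, method: str = "spearman") -> np.ndarray:
    """Baseline GRN: score every gene pair by expression correlation.

    expr: genes x cells matrix of normalized expression.
    Returns a genes x genes matrix of edge scores in [0, 1].
    """
    if method == "spearman":
        corr, _ = stats.spearmanr(expr, axis=1)  # rows (genes) are the variables
    else:
        corr = np.corrcoef(expr)                 # Pearson, also row-wise
    scores = np.abs(corr)          # strong correlation of either sign counts as evidence
    np.fill_diagonal(scores, 0.0)  # exclude self-loops
    return scores

def random_grn(n_genes: int, seed: int = 0) -> np.ndarray:
    """Random predictor: uniform edge scores, self-loops excluded."""
    scores = np.random.default_rng(seed).random((n_genes, n_genes))
    np.fill_diagonal(scores, 0.0)
    return scores
```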

Proposed Metrics

Binary classification:

  1. Link-equality metrics (AUROC / AUPRC)
  2. Node-equality metrics (Mean Average Precision)
  3. Precision@Top k
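A minimal sketch of the link-equality metrics and Precision@Top k, assuming dense score and binary ground-truth adjacency matrices and using scikit-learn (names are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def link_metrics(scores: np.ndarray, truth: np.ndarray, k: int = 100) -> dict:
    """AUROC, AUPRC and precision@k over all candidate (non-self) edges."""
    mask = ~np.eye(truth.shape[0], dtype=bool)  # drop the diagonal (self-edges)
    y_score, y_true = scores[mask], truth[mask].astype(int)
    top_k = np.argsort(y_score)[::-1][:k]       # k highest-scoring edges
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),
        f"precision@{k}": float(y_true[top_k].mean()),
    }
```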

Topological evaluation:

  1. Information Exchange (Average Shortest Path Length, Global and Local Efficiency)
  2. Hub Topology (Assortativity, Clustering Coefficient, Centralization)
  3. Hub Identification (PageRank, Betweenness, Radiality, Centrality)
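For the topological side, most of these descriptors are available in networkx; a rough sketch for a single network, computed on an undirected view (the exact STREAMLINE definitions may differ in detail):

```python
import networkx as nx

def topology_summary(g: nx.Graph) -> dict:
    """A few of the proposed topology descriptors for one network."""
    return {
        "avg_shortest_path": nx.average_shortest_path_length(g),  # requires a connected graph
        "global_efficiency": nx.global_efficiency(g),
        "local_efficiency": nx.local_efficiency(g),
        "assortativity": nx.degree_assortativity_coefficient(g),
        "avg_clustering": nx.average_clustering(g),
        "pagerank": nx.pagerank(g),                  # per-node hub scores
        "betweenness": nx.betweenness_centrality(g),
    }
```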
@stkmrc stkmrc added the task Add a new task label Jun 2, 2024
rcannood (Member) commented:

Hi @stkmrc !

Thanks for creating this issue! I heard from @janursa that he's also involved in benchmarking gene regulatory network inference methods, but with a different angle on a few things -- mainly concerning what the ground truth information and metrics are. However, the methods will probably be quite similar.

@janursa would you be willing to share your proposal for how to benchmark GRN inference methods for single-cell applications?

Would be great if we could combine our efforts to get the best of both benchmarking experimental designs.


janursa commented Jun 21, 2024

@stkmrc thanks for creating this issue and @rcannood thanks for involving me.
Yes, I would be happy to share our approach and merge the ideas and efforts. I have already contacted Marco to set up a talk. I also created a Slack group and added you; we can discuss there and summarize the outcomes here.

LuckyMD (Collaborator) commented Jun 24, 2024

Hi @stkmrc,
A few comments also from my side:

Despite its importance, GRN inference from single-cell RNA-Seq data is challenged by the high dimensionality of the data, inherent data noise, sparsity of the data, sparsity of the networks to be inferred, the lack of known negative edges in the GRN (positive unlabeled setting) and the ambiguity of possible causal explanations for the data. Available computational approaches often struggle with these issues, leading to inaccurate or overfitted models.

Wouldn't ground truth generally be an issue as well and not only known negative edges? Even in the examples you suggest, I imagine there are quite a few caveats on the ground truth network structure, no?

Expected Output
The output for both subtasks is a predicted GRN, represented as a graph where nodes are genes and edges indicate regulatory interactions.

Do you propose to output weighted edges or just showing direction? If weights are used, what should this signify?

General comments:

  • Control methods: you may benefit from a positive control being the underlying ground-truth network you compare against. This would help to calibrate your metrics as well, in case they are not 1 for the ground truth (e.g., the shortest path metrics; see the later point).
  • Generalization of GRN structures: why do you assume that some GRNs exhibit certain properties that should be applicable to other GRNs? We assume that any GRN or other biological network we currently have data for is a subset of an underlying true network. However, properties of that underlying true network do not have to hold for the subset (e.g., subnets of scale-free networks are not scale-free: https://www.pnas.org/doi/10.1073/pnas.0501179102)
  • Shortest path metrics: Metrics should be set up to be optimized by the ground truth and a toy example. However, it seems to me that shortest path metrics are optimized by fully connected networks and not the assumed underlying sparse network. Are these really then good metrics?
  • Metrics that don't rely on (exact) ground truth: Have you considered metrics that focus on downstream use cases of GRNs, such as predicting responses to perturbations or TF-target prediction? That way you wouldn't rely on a specific ground truth and would be able to show utility of the GRNs to users. This may also alleviate the issue about indirect vs direct interactions (as this may not matter for signal transduction and thus usability for prediction). These metrics would be really important to communicate the value of your benchmark to biologists as well IMO.

LuckyMD (Collaborator) commented Jun 24, 2024

Also, @janursa,

What do you think about keeping the discussions on this on github, so we have documentation for future community involvement?

stkmrc (Author) commented Jun 24, 2024

@LuckyMD thanks for your comments! Check my answers below:

Wouldn't ground truth generally be an issue as well and not only known negative edges? Even in the examples you suggest, I imagine there are quite a few caveats on the ground truth network structure, no?

Certainly, though it's more a "Computational Challenge" of the evaluation than of the GRN inference task itself.

Do you propose to output weighted edges or just showing direction? If weights are used, what should this signify?

That's a good point - for the AUC metrics we would need weighted edges in the outputs, but not in the ground truth (since it's a binary classification task). There's also the option to evaluate against a weighted ground truth, or even signed (activation/repression) edges to add more detail - but since most available algorithms already perform poorly on the "easier" task of binary classification, I wouldn't add more complexity in the first version.

Control methods / shortest path metrics

The topological metrics used in the STREAMLINE paper are computed as the difference between the predicted and ground-truth network values, so we need the ground-truth values as a reference. These metrics are then optimal (= 0) when, e.g., the average shortest path length in the predicted network matches that of the ground-truth network (rather than being optimized simply by making shortest paths as small as possible). Of course, we could also construct a metric from the difference that lies in the range [0, 1], if required - but I like looking at the signed difference because it not only tells you how close the topology value is to the ground truth, but also whether it is over- or underestimated in the prediction.
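In code, that convention amounts to something like the following sketch (any topology statistic can be plugged in for the average shortest path length):

```python
import networkx as nx

def signed_topology_error(pred: nx.Graph, truth: nx.Graph) -> float:
    """0 is optimal; the sign shows whether the prediction over- or underestimates."""
    return (nx.average_shortest_path_length(pred)
            - nx.average_shortest_path_length(truth))
```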

Generalization of GRN structures

There exist separate GRNs (e.g. a GRN per species) that I wouldn't consider directly linked as parts of "one underlying true network". For these, a "scale-free" topology is often reported. But I agree, smaller subnetworks can have other topologies, even if the underlying larger network is scale-free. That's why in our topological benchmarking paper we didn't focus only on scale-free graphs, and the metrics do not evaluate, for example, "how scale-free" a predicted graph is, but instead evaluate how close its topology is to that of the ground truth in a more unbiased way (even if the ground truth is, for example, closer to a small-world network).

Metrics that don't rely on (exact) ground truth

For the experimental datasets this is planned anyway, since we don't have an exact ground truth available. The McCalla datasets, for example, provide both TF-perturbation and TF ChIP-seq based ground-truth networks we can evaluate against. If you have ideas for other resources, or for how to construct something similar for the simulated datasets, I would be happy to include those as well!

@rcannood rcannood transferred this issue from openproblems-bio/openproblems-v2 Sep 8, 2024
rcannood (Member) commented Sep 8, 2024

Work on this task has been started at task_grn_inference

ekernf01 commented:

Hello everyone,

I recently announced the PEREGGRN benchmarks, which are very similar in spirit to this effort. @LuckyMD graciously reached out with an invitation to work together. Today I got a chance to read through a lot of the OpenProblems documentation -- and wow, it is beautiful to see all of this. Really clean and thorough layout, and a huge asset to the field. Here's a short recap of PEREGGRN, similarities/differences to the OpenProblems GRN inference task, and possible ways our work could be made to inter-operate. I don't think I will be able to follow through on all the possibilities that exist here, but OpenProblems looks like a DREAM-like force to shape whole subfields in the years to come, so I am certainly excited to lay out what PEREGGRN might be able to contribute.

The core of PEREGGRN is a genetic perturbation prediction task: split a Perturb-seq dataset, training on some perturbations and testing on the rest. Some of the main reusable components that would probably be easy to cannibalize are:

  • A collection of uniformly formatted, previously published gene networks
  • A collection of uniformly formatted, QC'd perturb-seq and similar data (AnnData objects)
  • Code to fit various regression models using arbitrary networks to define the X's and Y's
  • Code to evaluate predicted versus observed log fold change in gene expression

The main differences compared to the OpenProblems GRN inference task:

  • Not all the datasets are scRNA-seq. Some are bulk, pseudo-bulk, or microarray.
  • PEREGGRN assumes genetic, not chemical, perturbations. This affects the data split. We reveal held-out expression not for all regulators but rather only for the perturbed gene, and only genes that are both perturbed and measured can go in the test set (see the sketch after this list). It is a little more restrictive, and probably a harder task, which is a downside in such a difficult domain. But the benefit is that the causal arrows have to point the right way for the models to beat negative controls or non-mechanistic models.
  • The PEREGGRN benchmarks include methods like GEARS that are not based on network structures. Network structure is really almost incidental to PEREGGRN. It's conceptually important for sure, but the individual methods are only required to predict expression, and they may or may not use interpretable causal networks. Doing this exactly would probably require a new task rather than fitting within the OpenProblems GRN inference task.
  • We are working on additional experiments that each match a timeseries of differentiation or immune activation with genetic perturbations in the same experimental system, and sometimes with low-throughput CRISPR screens. This temporal aspect comes with many unpleasant complications, which is why I am currently procrastinating by reaching out to you XD.
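To make the test-set eligibility rule above concrete, a sketch (not PEREGGRN's actual code; the "perturbation" obs column is an assumed name):

```python
import anndata as ad

def eligible_test_genes(adata: ad.AnnData) -> set:
    """Genes that may enter the test set: both perturbed and measured."""
    measured = set(adata.var_names)                      # genes with an expression readout
    perturbed = set(adata.obs["perturbation"].unique())  # assumed column of perturbed genes
    return perturbed & measured
```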

I guess my assessment right now is that it would be tricky, partially redundant, and probably not worthwhile to completely re-implement PEREGGRN within OpenProblems. Whether or not this happens, end users would have a similar experience: making a docker image that reads an AnnData and runs their method. But, it would be easy to provide the network collection as positive controls or to provide the perturb-seq datasets for evaluation in additional contexts. What do you think?

Best regards from Baltimore, MD, USA.
Eric Kernfeld


janursa commented Oct 30, 2024

@ekernf01, thanks for reaching out and for the nice description. @LuckyMD also referred me to your work, so I had the chance to read your recent paper—great work! From what I gathered, your approach in PEREGGRN combines elements from both task_grn_inference and task_perturbation_prediction. It resembles task_grn_inference in that it leverages a prior GRN model to inform feature construction. However, a key difference is that the GRNs used in your model are not context-specific; they don’t directly relate to the expression data being predicted. In our work, we infer a context-specific GRN from multiomics data and use it to construct features for regression models that predict perturbation data from the same experiment. We show that this context specificity makes a significant difference, as features built with a non-context-specific model, like collectRI, perform as poorly as a random network. I’d be interested to test your GRNs on our evaluation to see how they perform.

Regarding OpenProblems integration, Malte will certainly have a better sense of this, but it seems like PEREGGRN would fit well within task_perturbation_prediction, as it offers additional datasets and metrics. For task_grn_inference, in addition to many other things (such as your experience), your datasets could be valuable; I'd still need to explore the perturbation sizes and whether we can integrate unpaired ATAC data from other sources. We've just finished the first phase of the benchmark using a single set of inference and perturbation data, and the next steps involve extending this to more datasets. Our perturbation is chemical, so I'm particularly interested to see how the model performs with KO/KD cases.

ekernf01 commented:

Some brief follow-through on this: I have written a first draft of a data exporter that returns AnnData objects in something close to the format used for the OpenProblems perturbation prediction task. It is here. Hopefully this is a useful start; here is some of what remains to be done:

  • The exporter code will throw errors if I never generated PCA or UMAP on that dataset. Usually I did both but I don't remember always inspecting UMAPs. Ideally the code should generate a UMAP or give a warning instead of an error, so that's a work in progress.
  • Dosage is given in raw counts, not micromolar. Some gene dosages are not known.
  • The exporter code does not yet distinguish between knockouts and other perturbations. It should probably list knockouts as having a dosage of 0 even if some transcript is detected, because the extant transcripts won't be functional.
  • Multi-gene perturbations are given as comma-separated strings: "FOXN1,GCM2" (a parsing sketch follows this list)
  • Dosage for multi-gene perturbations is given as comma-separated strings: "1,3"
  • For controls, split is "control" but "sm_name" could be anything. Often it gives details on the type of control, for instance "non-targeting gRNA 4". Dosage will be listed as np.nan for all controls.
  • Cell type info is not available from the current pereggrn zenodo archives. For the exporter code to work, you will need to go into the data collection and manually replace perturbations.csv.
  • Lots of required fields like plate number, row, col, well are not applicable.
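For anyone consuming these exports, parsing the comma-separated fields might look like this (a sketch; "sm_name" and "split" follow the bullets above, while the "dose" column name is an assumption):

```python
import pandas as pd

def parse_perturbation(row: pd.Series) -> list:
    """Turn sm_name 'FOXN1,GCM2' and dose '1,3' into [('FOXN1', 1.0), ('GCM2', 3.0)]."""
    if row["split"] == "control":
        return []  # controls: dose is np.nan, sm_name is free text, e.g. "non-targeting gRNA 4"
    genes = str(row["sm_name"]).split(",")
    doses = [float(d) for d in str(row["dose"]).split(",")]
    return list(zip(genes, doses))
```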


janursa commented Nov 17, 2024

@ekernf01 thank you very much for the efforts. Indeed, the current data format requirements are not generic enough to integrate new datasets. Will fix this soon.
