Gene regulatory network inference (with prior knowledge) #900
Hi @stkmrc! Thanks for creating this issue! I heard from @janursa that he's also involved in benchmarking gene regulatory network inference methods, but with a different angle on a few things -- mainly concerning what the ground-truth information and metrics are. However, the methods will probably be quite similar. @janursa, would you be willing to share your proposal for how to benchmark GRN inference methods for single-cell applications? It would be great if we could combine our efforts to get the best of both benchmarking experimental designs.
Hi @stkmrc,
Wouldn't ground truth generally be an issue as well and not only known negative edges? Even in the examples you suggest, I imagine there are quite a few caveats on the ground truth network structure, no?
Do you propose to output weighted edges, or just direction? If weights are used, what should they signify? General comments:
Also, @janursa, what do you think about keeping the discussion on this on GitHub, so we have documentation for future community involvement?
@LuckyMD thanks for your comments! Check my answers below:
Certainly, though it's more a "Computational Challenge" of the evaluation than of the GRN inference task itself.
That's a good point: for the AUC metrics we would need weighted edges in the outputs, but not in the ground truth (since it's a binary classification task). There's also the option to evaluate against weighted, or even signed (activation/repression), ground-truth edges to add more detail. But since most available algorithms already perform poorly on the "easier" task of binary classification, I wouldn't add more complexity in the first version.
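For concreteness, here is a minimal sketch of that binary-classification evaluation, assuming the prediction arrives as a weighted (possibly signed) gene-by-gene adjacency matrix and the ground truth as a binary one over the same genes; names are illustrative, not the task's actual implementation:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def edge_auc(pred_weights: np.ndarray, true_edges: np.ndarray) -> dict:
    """Score a weighted predicted GRN against a binary ground-truth GRN.

    pred_weights: (n_genes, n_genes) predicted edge weights.
    true_edges:   (n_genes, n_genes) binary matrix; 1 = known edge.
    """
    # Exclude self-loops, which most methods do not predict.
    mask = ~np.eye(true_edges.shape[0], dtype=bool)
    y_true = true_edges[mask]
    # Rank by |weight| so signed (activation/repression) predictions
    # are still treated as confidence scores.
    y_score = np.abs(pred_weights[mask])
    return {
        "auroc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),
    }
```

Given the strong class imbalance of sparse ground-truth networks, the AUPRC is typically the more informative of the two scores.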
The topological metrics used in the STREAMLINE paper are computed as the difference between the predicted and the ground-truth network, so to compute them we need the ground-truth values. These metrics are optimal (= 0) when, for example, the average shortest path in the predicted network matches that of the ground-truth network (rather than rewarding ever shorter paths). Of course, we could also construct a metric from the difference that lies in the range [0, 1] if required, but I like looking at the signed difference because it tells you not only how close a topology value is to the ground truth, but also whether the prediction over- or underestimates it.
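As an illustration of that signed-difference idea (not the STREAMLINE implementation), a minimal NetworkX sketch with two example statistics:

```python
import networkx as nx

def topology_delta(pred: nx.DiGraph, truth: nx.DiGraph) -> dict:
    """Signed topology differences (predicted minus ground truth).

    A value of 0 means the predicted network matches the ground truth
    on that statistic; the sign shows over- vs. underestimation.
    """
    def avg_shortest_path(g: nx.DiGraph) -> float:
        # Average over all reachable ordered pairs, so the statistic
        # is defined even if the graph is not strongly connected.
        lengths = [
            d
            for _, targets in nx.all_pairs_shortest_path_length(g)
            for d in targets.values()
            if d > 0
        ]
        return sum(lengths) / len(lengths) if lengths else float("nan")

    return {
        "delta_avg_shortest_path": avg_shortest_path(pred) - avg_shortest_path(truth),
        "delta_density": nx.density(pred) - nx.density(truth),
    }
```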
There exist separate GRNs (e.g., one GRN per species) that I wouldn't consider directly linked as parts of one underlying true network; for these, a "scale-free" topology is often cited. But I agree, smaller subnetworks can have other topologies, even if the underlying larger network is scale-free. That's why in our topological benchmarking paper we didn't focus only on scale-free graphs, and the metrics don't evaluate, for example, "how scale-free" a predicted graph is; instead, they evaluate how close its topology is to that of the ground truth in a more unbiased way (even if the ground truth is, for example, closer to a small-world network).
For the experimental datasets this is planned anyway, since we don't have an exact ground truth available. The McCalla datasets, for example, provide both TF-perturbation-based and TF-ChIP-seq-based ground-truth networks we can evaluate against. If you have ideas for other resources, or for how to construct something similar for the simulated datasets, I'd be happy to include those as well!
Work on this task has started at task_grn_inference.
Hello everyone, I recently announced the PEREGGRN benchmarks, which are very similar in spirit to this effort. @LuckyMD graciously reached out with an invitation to work together. Today I got a chance to read through a lot of the OpenProblems documentation -- and wow, it is beautiful to see all of this. Really clean and thorough layout, and a huge asset to the field. Here's a short recap of PEREGGRN, similarities/differences to the OpenProblems GRN inference task, and possible ways our work could be made to inter-operate. I don't think I will be able to follow through on all the possibilities that exist here, but OpenProblems looks like a DREAM-like force to shape whole subfields in the years to come, so I am certainly excited to lay out what PEREGGRN might be able to contribute. The core of PEREGGRN is a genetic perturbation prediction task: split a Perturb-seq dataset, train on some perturbations, and test on the rest (see the sketch after this comment). Some of the main reusable components that would probably be easy to cannibalize are:
The main differences compared to the OpenProblems GRN inference task:
I guess my assessment right now is that it would be tricky, partially redundant, and probably not worthwhile to completely re-implement PEREGGRN within OpenProblems. Whether or not this happens, end users would have a similar experience: making a Docker image that reads an AnnData and runs their method. But it would be easy to provide the network collection as positive controls, or to provide the Perturb-seq datasets for evaluation in additional contexts. What do you think? Best regards from Baltimore, MD, USA.
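To make the core PEREGGRN split described above concrete, here is a minimal sketch of holding out whole perturbations rather than individual cells; the `perturbation` obs column is an assumption, not the actual PEREGGRN schema:

```python
import anndata as ad
import numpy as np

def split_by_perturbation(adata: ad.AnnData, test_fraction: float = 0.2,
                          key: str = "perturbation", seed: int = 0):
    """Hold out entire perturbations so the test set contains only
    perturbations never seen during training."""
    rng = np.random.default_rng(seed)
    perts = np.asarray(adata.obs[key].unique())
    n_test = max(1, int(len(perts) * test_fraction))
    test_perts = rng.choice(perts, size=n_test, replace=False)
    test_mask = adata.obs[key].isin(test_perts).to_numpy()
    return adata[~test_mask].copy(), adata[test_mask].copy()
```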
@ekernf01, thanks for reaching out and for the nice description. @LuckyMD also referred me to your work, so I had the chance to read your recent paper: great work! From what I gathered, your approach in PEREGGRN combines elements from both task_grn_inference and task_perturbation_prediction. It resembles task_grn_inference in that it leverages a prior GRN model to inform feature construction. However, a key difference is that the GRNs used in your model are not context-specific; they don't directly relate to the expression data being predicted. In our work, we infer a context-specific GRN from multiomics data and use it to construct features for regression models that predict perturbation data from the same experiment. We show that this context specificity makes a significant difference: features built with a non-context-specific model, like CollecTRI, perform as poorly as a random network. I'd be interested to test your GRNs in our evaluation to see how they perform. Regarding OpenProblems integration, Malte will certainly have a better sense of this, but it seems like PEREGGRN would fit well within task_perturbation_prediction, as it offers additional datasets and metrics. For task_grn_inference, beyond many other things, such as your experience, your datasets could be valuable; I'd still need to explore the perturbation sizes and whether we can integrate unpaired ATAC data from other sources. We've just finished the first phase of the benchmark using a single set of inference and perturbation data, and the next steps involve extending this to more datasets. Our perturbation is chemical, so I'm particularly interested to see how the model performs in KO/KD cases.
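To sketch the feature-construction step in that description: one simple way to turn a GRN into inputs for such a regression model is to summarize each regulator's targets. This is only an illustration of the general idea, with assumed shapes and names, not the task_grn_inference implementation:

```python
import numpy as np

def grn_features(expr: np.ndarray, grn: np.ndarray) -> np.ndarray:
    """Per-sample regulator-activity features from a prior GRN.

    expr: (n_samples, n_genes) expression matrix.
    grn:  (n_tfs, n_genes) weighted adjacency; grn[i, j] is the
          inferred effect of TF i on gene j (0 = no edge).

    Each feature is the GRN-weighted average expression of a TF's
    targets -- a crude proxy for regulator activity that can feed a
    regression model predicting perturbation responses.
    """
    norm = np.abs(grn).sum(axis=1, keepdims=True)
    norm[norm == 0] = 1.0  # TFs without targets yield zero features
    return expr @ (grn / norm).T  # (n_samples, n_tfs)
```

A context-specific GRN changes only the `grn` matrix here, which is why the quality of its edges directly determines the quality of the features.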
Some brief follow-through on this: I have written a first draft of a data exporter that returns AnnData objects in something close to the format used for the OpenProblems perturbation prediction task. It is here. Hopefully this is a useful start, and here is some of what remains to be done.
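For readers who want to picture the target format, here is a minimal sketch of assembling an AnnData object roughly along those lines; every field name below is an assumption, and the actual OpenProblems perturbation-prediction schema may differ:

```python
import anndata as ad
import numpy as np
import pandas as pd

n_cells, n_genes = 100, 50
adata = ad.AnnData(
    # Toy counts in place of a real Perturb-seq matrix.
    X=np.random.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32),
    obs=pd.DataFrame(
        {
            "perturbation": ["KLF4"] * 50 + ["control"] * 50,
            "is_control": [False] * 50 + [True] * 50,
        },
        index=[f"cell_{i}" for i in range(n_cells)],
    ),
    var=pd.DataFrame(index=[f"gene_{i}" for i in range(n_genes)]),
)
adata.write_h5ad("exported_dataset.h5ad")
```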
@ekernf01 thank you very much for the efforts. Indeed, the current data format requirements are not generic enough to integrate new datasets. Will fix this soon.
Task motivation
Gene Regulatory Network (GRN) inference is pivotal in systems biology, offering profound insights into the complex mechanisms that govern gene expression and cellular behavior. These insights are crucial for advancing our understanding of biological processes and have significant implications in medical research, particularly in developing targeted therapies and understanding disease mechanisms.
Computational Challenges
Despite its importance, GRN inference from single-cell RNA-Seq data is challenged by the high dimensionality and sparsity of the data, inherent noise, the sparsity of the networks to be inferred, the lack of known negative edges in the GRN (a positive-unlabeled setting), and the ambiguity of possible causal explanations for the data. Available computational approaches often struggle with these issues, leading to inaccurate or overfitted models.
Research Gap
Current methods range from statistical correlations to advanced machine learning, each with limitations in terms of accuracy, data requirements, and interpretability. Multiple benchmarking studies exist, differing in their evaluation choices, such as the negative-sampling strategy, the metrics used, and the choice of synthetic vs. experimental data. What is missing is a more standardized way of benchmarking using biologically meaningful metrics.
Task description
The task focuses on the inference of GRNs from scRNA-Seq data. It is divided into two subtasks based on the availability of prior knowledge: inference without prior knowledge (de novo) and inference with prior knowledge.
Input Data
Expected Output
The output for both subtasks is a predicted GRN, represented as a graph where nodes are genes and edges indicate regulatory interactions. The quality of the predicted networks can be evaluated in two main ways: binary classification against a ground-truth network, and topological comparison with it (see Proposed Metrics below).
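As one concrete (assumed, tool-agnostic) way to exchange such an output, the predicted GRN can be stored as a long-format edge list:

```python
import pandas as pd

# One row per directed, weighted regulatory interaction.
# Column names are illustrative, not a prescribed schema.
predicted_grn = pd.DataFrame({
    "source": ["TF1", "TF1", "TF2"],     # regulator (e.g., a TF)
    "target": ["geneA", "geneB", "geneA"],
    "weight": [0.8, -0.3, 0.5],          # signed: + activation, - repression
})
predicted_grn.to_csv("predicted_grn.csv", index=False)
```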
Proposed ground-truth in datasets
Initial set of methods to implement
Proposed control methods
Proposed Metrics
Binary classification:
Topological evaluation: