Fine-grained T cell typing #2
Comments
These are really interesting categories! To better understand what's going on, is there a notebook that constructs Also you got your word vectors from http://bio.nlplab.org, yes?

Overall I think the hard part for us won't be mapping to Cell Ontology but rather determining which of these modifiers deserve to be included in an extension of Cell Ontology that we construct ourselves. I need to do some reading on how Cell Ontology wants to be extended. For mapping to Cell Ontology it seems like it may be useful to distinguish between modifiers that don't alter the underlying cell type (e.g. tissue, protocol) and those that do (e.g. expression markers).

I also need to think more about the simplistic vector addition strategy used by NormCo. The geometry of embedding space is pretty tricky. It would be fun to try some dumb vector math though, like "human CD8+ T cell" - "human" + "mouse" or something.

Another thought: disease normalization is maybe not the best analogy for what we're trying to do. We have a pretty small existing ontology and a huge, indeterminate set of strings that we want to distill down to terms and then create a hierarchy among those terms. In disease normalization the existing ontology is enormous and you don't want to extend it; you just want to map to the term that's closest to the string you have in hand.

One last thought: do any ontology specification languages accommodate traits rather than the "is-a" hierarchical relation? Will look into it. Seems like a better fit for this particular domain. |
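To make the "dumb vector math" idea concrete, here is a minimal sketch of summing token vectors into a phrase vector and then running the "human" to "mouse" swap. The 3-d vectors are invented for illustration only (real ones would come from the PMC/PubMed word2vec model), and `phrase_vector`, `cosine`, and `analogy` are hypothetical helper names, not anything from NormCo.

```python
import numpy as np

# Toy vectors, made up for this sketch; not taken from any trained model.
TOY_VECTORS = {
    "human": np.array([1.0, 0.0, 0.0]),
    "mouse": np.array([0.0, 1.0, 0.0]),
    "cd8+": np.array([0.0, 0.0, 1.0]),
    "t": np.array([0.5, 0.5, 0.0]),
    "cell": np.array([0.2, 0.2, 0.2]),
}


def phrase_vector(phrase, vectors, dim=3):
    """Sum the vectors of whitespace-split tokens; unknown tokens are skipped."""
    total = np.zeros(dim)
    for token in phrase.lower().split():
        if token in vectors:
            total += vectors[token]
    return total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def analogy(phrase, minus, plus, candidates, vectors):
    """Return the candidate phrase closest to phrase - minus + plus."""
    query = (
        phrase_vector(phrase, vectors)
        - phrase_vector(minus, vectors)
        + phrase_vector(plus, vectors)
    )
    return max(candidates, key=lambda c: cosine(query, phrase_vector(c, vectors)))


candidates = ["human CD8+ T cell", "mouse CD8+ T cell", "mouse T cell"]
print(analogy("human CD8+ T cell", "human", "mouse", candidates, TOY_VECTORS))
```

One caveat with this sketch: when phrase vectors are plain sums, swapping one token for another reproduces the target phrase vector exactly, so the result is true by construction. The more informative experiment would be to look at which real noun phrases from the corpus end up nearest to the query.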
Perhaps a more useful analogy for our task: HiExpan: Task-Guided Taxonomy Construction by Hierarchical Tree Expansion (2018). I haven't read it yet but it sounds promising from the abstract and Jiawei Han is a leading thinker in this space. |
Another paper with some interesting related work: User-Centric Ontology Population (2018). |
Last thought for the night: this giant data frame of strings looks like a great use case for Vaex. Have you ever tried it out? It uses Apache Arrow under the hood, which I'm excited about. |
Touching on a few of those:
|
Sorry, I didn't mean to assert that we should not try to map mentions to CL types. We should definitely do that, and build synonym lists for the various CL types that already exist. At some point, however, we need to determine when we don't believe a mapping is possible. At that point, we can extend CL to include unmapped mentions that we believe correspond to a fine-grained T cell type. I read through the HiExpan paper last night and their approach looks reasonable and they have code available at https://github.com/mickeystroller/HiExpan. One note on the code: they make use of a bunch of their own projects, which is always a little suspicious. I suspect we could implement their strategy with more widely adopted tools if we really like their approach. Anyways, they take as input a collection of documents and a "seed taxonomy" (CL in our case), then they do:
|
One note on terminology: I prefer to use the word "term" when talking about a single node in an ontology. In the NormCo paper they say "concept", which I don't like for reasons best explained in the BFO book. Also in this literature the term "entity type" is used, e.g. https://github.com/shimaokasonse/NFGEC. So, in the context of this discussion, "term" == "concept" == "entity type", and I prefer "term". |
One last comment for the night: the distantly supervised training data extracted from BioASQ fits the format of the bag-of-sentences w/ bag-level labels model from the AI2 paper https://github.com/allenai/comb_dist_direct_relex. I recall us finding that formulation somewhat strange in the AI2 paper so it's amusing to see it in the NormCo paper in a slightly different context. |
Alright then, I'm imagining a process like this to try out HiExpan:
The biggest hole I can see in that plan is that there will be a ton of synonymous terms on the same level, like "T-helper (Th)17" == "T helper 17" == "CD4+CD161+CD196+ T cells", since HiExpan doesn't make any attempt to resolve them. I could resolve them easily if the whole process starts only with the white list terms, but I don't see a way to make it work with the JNLPBA terms without first doing the kind of term collation I was alluding to before. And by that I don't mean entity linking; I'm thinking more like an unsupervised clustering. Is there a word for that in NLP?

If nothing else, tokenizing those protein expression strings seems pretty critical for this domain, since I can see now that neither ScispaCy nor the PMC word2vec tokenizer really does it at all, and it'd have to be done if we wanted to match to CL using anything beyond the cell type string names/aliases. Perhaps that would be a good place to start as a standalone project compatible with any kind of spaCy pipeline?

One last thought on tokenization: I did notice that the WordPiece tokenization used in sciBERT actually does a good job of chunking up those no-whitespace strings, where CD4+CD8- becomes ['CD', '##4', '+', 'CD', '##8', '-'] rather than one single token. Do any thoughts come to mind as to how we could exploit that? |
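As a starting point for the marker tokenization idea, a rough sketch using only the standard library might look like the following. The regular expression is an assumption about what these strings look like (letters, digits, optional trailing letters, then `+` or `-`), and `split_markers` is a hypothetical name; it is not how ScispaCy or sciBERT tokenize.

```python
import re

# Assumed marker shape: letters, digits, optional trailing letters (e.g. CD45RA),
# then a + or - sign. The negative lookahead keeps hyphenated words such as
# "IL17-producing" or "CD4-positive" from being read as negative markers.
MARKER_RE = re.compile(r"([A-Za-z]+\d+[A-Za-z]*)([+-])(?![a-z])")


def split_markers(text):
    """Return (marker, sign) pairs for each signed marker found in text."""
    return [(m.group(1), m.group(2)) for m in MARKER_RE.finditer(text)]


print(split_markers("CD4+CD8-"))
# [('CD4', '+'), ('CD8', '-')]
print(split_markers("CD4+CD161+CD196+ T cells"))
# [('CD4', '+'), ('CD161', '+'), ('CD196', '+')]
print(split_markers("IL17-producing CD4+ T cells"))
# [('CD4', '+')]
```

This deliberately ignores names that start with a digit (such as 4-1BB) and qualifiers like "hi" or "lo", so it would need extending before being used on real mentions.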
It looks like in the notebook you're trying
One thought is that there are containment relationships implied by the combination of markers which could help place terms in the hierarchy. I also wonder if we can make use of gene name synonym lists to canonicalize these marker lists? One synonym I see a lot: CD137 <--> 4-1BB. We can then use mappings back to the protein ontology and hierarchical structures that have been built on that (e.g. families of cytokine receptors) to have additional structure to use to compare to the embedding results.
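A small sketch of both ideas, assuming marker lists have already been split into (marker, sign) pairs: synonyms are mapped to one canonical name, and a term whose marker set strictly contains another's is treated as a candidate child. The synonym table holds only the one pair mentioned above; `canonical`, `marker_set`, and `is_more_specific` are hypothetical names, and a real table would come from a gene or protein synonym resource.

```python
# Only the synonym pair mentioned in this thread; a real list would be far larger.
SYNONYMS = {"4-1BB": "CD137"}


def canonical(marker):
    """Map a marker name to its canonical form, leaving unknown names as-is."""
    return SYNONYMS.get(marker, marker)


def marker_set(markers):
    """Build an order-independent, canonicalized set of (marker, sign) pairs."""
    return frozenset((canonical(name), sign) for name, sign in markers)


def is_more_specific(child, parent):
    """True when child carries every marker of parent plus at least one more."""
    return parent < child  # proper-subset test on frozensets


cd4 = marker_set([("CD4", "+")])
th17_like = marker_set([("CD4", "+"), ("CD161", "+"), ("CD196", "+")])

print(is_more_specific(th17_like, cd4))  # True
print(marker_set([("4-1BB", "+")]) == marker_set([("CD137", "+")]))  # True
```

Set containment only suggests a placement in the hierarchy; whether an extra marker defines a new type or just an attribute is still the open question in this thread.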
I wonder if there is a way to do this pruning in a way that's not totally manual. We're also getting into "noun phrase internal" relation extraction territory a bit when we consider complex noun phrases like "IL17-producing CD4+ T cells". There's also a lot of structured metadata we could collect from these phrases, as you have outlined in your first comment on this thread. It may be that we are collecting attributes of an entity as well as inferring the entity type.

One thing I'm struggling with for our problem: the distinction between entities and entity types. These papers rarely if ever discuss individual entities (i.e. cells); they're almost always talking about properties of a collection of cells distinguished by their entity type. I found the distinction between "universals" and "defined classes" in the BFO book to be useful in this context. I think an "entity type" corresponds to a "universal", while entity attributes can be used to name "defined classes". What we're trying to figure out is when a token or phrase distinguishes a novel universal versus when it's just an attribute that can be used to organize "defined classes".
That's a great observation. It does seem like their method needs a second form of "conflict resolution" which involves finding terms on the same level that are actually just synonyms; they can then remove all but one of the terms in that round. Does the CL provide synonym lists for their terms? Seems like they should. |
@hammer, I put together a notebook to start exploring how well embeddings might work to infer dimensions of T cell typing, beyond protein expression and general phenotypic qualifiers (exhausted, activated, antigen-specific, etc.).
To get a basic sense of that variety, I did what they did in NormCo using the summation of token embedding vectors for noun phrases from word2vec trained on PMC/PubMed. The embedding projection here gives some interesting clustering:
PMC/PubMed T Cell Embedding Projection
Zooming in on the part I mapped out a bit with the annotations shows fairly broad categorizations like:
My take after hovering over a bunch of those groups is that these seem to be common dimensions for the descriptions:
Do you think any of those make for useful characterizations we should keep in mind before trying to map the types to Cell Ontology or something like it?