Pipelines with duplicate models #2920
Hmm. You want to have overlapping or nested entities, right? At the end of the day, `Doc.ents` can only hold a single, non-overlapping set of entities. What if, instead of your solution, you made a new pipeline component that acted as a container, holding models for the different NER components? In pseudo-code it would look like this:

```python
from spacy.tokens import Doc

Doc.set_extension("entities", default=None)

def multi_ner(ner_models):
    def predict_entities(doc):
        entities = []
        for model in ner_models:
            # Each NER model writes to doc.ents in place, so collect its
            # predictions before the next model overwrites them.
            entities.extend(model(doc).ents)
        # Clear the shared doc.ents and keep everything on a custom
        # attribute, where overlapping spans are fine.
        doc.ents = []
        doc._.entities = entities
        return doc
    return predict_entities
```
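Hypothetical usage, assuming `disease_ner` and `chemical_ner` are two separately trained NER components (names invented for illustration; spaCy v2 accepts any callable as a pipe):

```python
nlp.add_pipe(multi_ner([disease_ner, chemical_ner]), name="multi_ner", last=True)
doc = nlp("Famotidine has been linked to Stevens-Johnson syndrome.")
print(doc._.entities)  # entities from every model, possibly overlapping
```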
If you want more than just a quick fix, you could also make this much better by creating a custom subclass. You'd want to base this on the version up on develop, as it's refactored much more nicely: https://github.com/explosion/spaCy/blob/develop/spacy/syntax/nn_parser.pyx

The refactored version splits up the model, which makes the pre-computation in the implementation much clearer. When we get the whole batch of documents, we first run the CNN to get the token vectors, and then we pre-compute all the features for the hidden layers. Once the pre-computation is done, we only need to sum the state features. You could make your model really efficient by predicting all of the feature values at once: then you would only have to sum and softmax the features during the parsing loops (there's a toy sketch of this idea at the end of this comment). Have a look at the pre-computation code --- I found it a bit of a mind-warp when I was working through it, so feel free to ask questions if it's not immediately obvious!

Making a subclass that does this shouldn't be too bad, and it'll run really fast --- you'll be able to add more entity schemes without changing the runtime much, as you're reusing most of the computation. You'll probably need to cut and paste some of the code and make small modifications, which is fine imo.

It'd be cool to have a version of the parser which did this for parser+NER, although possibly more trouble than it's worth. Not everyone would want to use this, as it'd be hard to do post-training without hitting catastrophic forgetting problems. But a lot of the time people don't need to post-train, and this would make things faster.
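To make the pre-computation trick concrete, here is a toy NumPy sketch. The shapes, names, and plain-NumPy style are illustrative assumptions, not spaCy's actual internals: the hidden-layer contribution of every (token, feature slot) pair is computed once per document, so each parser state only sums a few pre-computed rows instead of running a matrix multiply.

```python
import numpy as np

# Illustrative sizes (not spaCy's real hyperparameters).
n_tokens, n_feats, width, n_hidden = 50, 13, 128, 64

token_vectors = np.random.randn(n_tokens, width)  # output of the CNN
W = np.random.randn(n_feats, width, n_hidden)     # one weight slab per feature slot
b = np.zeros(n_hidden)

# Pre-compute once per document: the hidden-layer contribution of every
# (token, feature slot) pair.
precomputed = np.einsum("tw,fwh->tfh", token_vectors, W)

def state_hidden(feature_token_ids):
    """Hidden vector for one parser state.

    feature_token_ids[f] is the index of the token currently filling
    feature slot f (top of stack, first of buffer, ...). Summing the
    pre-computed rows replaces a matrix multiply in the inner loop.
    """
    total = b.copy()
    for f, t in enumerate(feature_token_ids):
        total += precomputed[t, f]
    return np.maximum(total, 0.0)  # ReLU

# Example: a state whose 13 feature slots are filled by tokens 0..12.
h = state_hidden(list(range(n_feats)))
```

Only the final softmax over actions still runs per state, which is why extra entity schemes barely change the runtime: the expensive token-vector computation and the pre-computation are shared.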
Feature description
I would like to contribute a fix to make it possible to isolate the NER and Dependency Parser models using custom attribute namespaces. Currently, because they share a global namespace, `Doc.ents`, which determines a `TransitionSystem` state (in the NER case), duplicate models clobber each other when proceeding through the pipeline. As an example, here is some code we had to write to have multiple NER models for different types of entities for biomedical text:
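A minimal sketch of that kind of workaround (all model paths and attribute names below are invented for illustration, assuming spaCy v2's extension-attribute API; this is not the issue's actual snippet):

```python
import spacy
from spacy.tokens import Doc

# Invented attribute names: one isolated namespace per NER model.
Doc.set_extension("disease_ents", default=None)
Doc.set_extension("chemical_ents", default=None)

# Hypothetical, independently trained biomedical NER models.
disease_nlp = spacy.load("/path/to/disease_ner_model")
chemical_nlp = spacy.load("/path/to/chemical_ner_model")

def biomedical_ner(doc):
    # Run each NER model on the raw text and copy its predictions onto a
    # per-model extension attribute, so neither model ever writes to the
    # shared doc.ents of the main pipeline's Doc.
    for attr, nlp in (("disease_ents", disease_nlp),
                      ("chemical_ents", chemical_nlp)):
        ents = nlp(doc.text).ents
        doc._.set(attr, [(e.start_char, e.end_char, e.label_) for e in ents])
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(biomedical_ner, last=True)
```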
This hack is functional in the sense that at inference time you get all the entities and the different models don't stomp on each other. However, I'm a bit scared of what would happen if, for example, someone tried to fine-tune the pipeline on some new data. It also means the NER models have to be trained independently.
Proposed Solution
Add an optional parameter to these models which the `set_annotations` method of `nn_parser` would use to write entities to a custom attribute instead of `Doc.ents`. The current functionality of the NER and Parser models would be completely unchanged: https://github.com/explosion/spaCy/blob/master/spacy/syntax/nn_parser.pyx#L780
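To illustrate the idea, here is a hypothetical sketch in plain Python (`ents_attr` and the surrounding names are invented, not existing spaCy API, and the real method lives in Cython):

```python
def set_annotations(self, docs, batch_ents):
    # batch_ents[i]: the entity spans the model predicted for docs[i].
    for doc, ents in zip(docs, batch_ents):
        if self.ents_attr is None:
            # Default: today's behaviour, completely unchanged.
            doc.ents = ents
        else:
            # Opt-in: write to an isolated custom attribute instead of the
            # shared Doc.ents, so duplicate models don't clobber each other.
            doc._.set(self.ents_attr, list(ents))
```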
However, it looks like the transition systems for NER and parsing are intimately tied to the `StateC` struct, and I wasn't sure how this relates to the underlying state of the `Doc`: https://github.com/explosion/spaCy/blob/master/spacy/syntax/_state.pxd
So basically this issue is just a request for you to summarise how much work this would be, whether it's possible at all, and what would need to be done to make it possible.