Pipelines with duplicate models #2920
Hmm. You want to have overlapping or nested entities, right? At the end of the day, `Doc.ents` can only hold a single, non-overlapping set of entities. What if, instead of your solution, you made a new pipeline component that acted as a container, holding models for the different NER components? In pseudo-code it would look like this:

```python
from spacy.tokens import Doc

Doc.set_extension("entities", default=None)

def multi_ner(ner_models):
    def predict_entities(doc):
        entities = []
        for model in ner_models:
            # Each NER model writes to doc.ents in place, so collect its
            # predictions before the next model overwrites them.
            entities.extend(model(doc).ents)
        # Clear the shared doc.ents and keep everything on a custom
        # attribute, where overlapping spans are fine.
        doc.ents = []
        doc._.entities = entities
        return doc
    return predict_entities
```
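Hypothetical usage, assuming `disease_ner` and `chemical_ner` are two separately trained NER components (names invented for illustration; spaCy v2 accepts any callable as a pipe):

```python
nlp.add_pipe(multi_ner([disease_ner, chemical_ner]), name="multi_ner", last=True)
doc = nlp("Famotidine has been linked to Stevens-Johnson syndrome.")
print(doc._.entities)  # entities from every model, possibly overlapping
```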
If you want more than just a quick fix, you could also make this much better by creating a custom subclass. You'd want to base this on the version up on develop, as it's refactored much more nicely: https://github.com/explosion/spaCy/blob/develop/spacy/syntax/nn_parser.pyx

The refactored version splits up the model, which makes the pre-computation in the implementation much clearer. When we get the whole batch of documents, we first run the CNN to get the token vectors, and then we pre-compute all the features for the hidden layers. Once the pre-computation is done, we only need to sum the state features. You could make your model really efficient by predicting all of the feature values at once: then you would only have to sum and softmax the features during the parsing loops (there's a toy sketch of this idea at the end of this comment). Have a look at the pre-computation code --- I found it a bit of a mind-warp when I was working through it, so feel free to ask questions if it's not immediately obvious!

Making a subclass that does this shouldn't be too bad, and it'll run really fast --- you'll be able to add more entity schemes without changing the runtime much, as you're reusing most of the computation. You'll probably need to cut and paste some of the code and make small modifications, which is fine imo.

It'd be cool to have a version of the parser which did this for parser+NER, although possibly more trouble than it's worth. Not everyone would want to use this, as it'd be hard to do post-training without hitting catastrophic forgetting problems. But a lot of the time people don't need to post-train, and this would make things faster.
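To make the pre-computation trick concrete, here is a toy NumPy sketch. The shapes, names, and plain-NumPy style are illustrative assumptions, not spaCy's actual internals: the hidden-layer contribution of every (token, feature slot) pair is computed once per document, so each parser state only sums a few pre-computed rows instead of running a matrix multiply.

```python
import numpy as np

# Illustrative sizes (not spaCy's real hyperparameters).
n_tokens, n_feats, width, n_hidden = 50, 13, 128, 64

token_vectors = np.random.randn(n_tokens, width)  # output of the CNN
W = np.random.randn(n_feats, width, n_hidden)     # one weight slab per feature slot
b = np.zeros(n_hidden)

# Pre-compute once per document: the hidden-layer contribution of every
# (token, feature slot) pair.
precomputed = np.einsum("tw,fwh->tfh", token_vectors, W)

def state_hidden(feature_token_ids):
    """Hidden vector for one parser state.

    feature_token_ids[f] is the index of the token currently filling
    feature slot f (top of stack, first of buffer, ...). Summing the
    pre-computed rows replaces a matrix multiply in the inner loop.
    """
    total = b.copy()
    for f, t in enumerate(feature_token_ids):
        total += precomputed[t, f]
    return np.maximum(total, 0.0)  # ReLU

# Example: a state whose 13 feature slots are filled by tokens 0..12.
h = state_hidden(list(range(n_feats)))
```

Only the final softmax over actions still runs per state, which is why extra entity schemes barely change the runtime: the expensive token-vector computation and the pre-computation are shared.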
Feature description
I would like to contribute a fix to make it possible to isolate the NER and Dependency Parser models using custom attribute namespaces. Currently, because they share a global namespace, `Doc.ents`, which determines a `TransitionSystem` state (in the NER case), duplicate models clobber each other when proceeding through the pipeline. As an example, here is some code we had to write to have multiple NER models for different types of entities for biomedical text:
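A minimal sketch of that kind of workaround (all model paths and attribute names below are invented for illustration, assuming spaCy v2's extension-attribute API; this is not the issue's actual snippet):

```python
import spacy
from spacy.tokens import Doc

# Invented attribute names: one isolated namespace per NER model.
Doc.set_extension("disease_ents", default=None)
Doc.set_extension("chemical_ents", default=None)

# Hypothetical, independently trained biomedical NER models.
disease_nlp = spacy.load("/path/to/disease_ner_model")
chemical_nlp = spacy.load("/path/to/chemical_ner_model")

def biomedical_ner(doc):
    # Run each NER model on the raw text and copy its predictions onto a
    # per-model extension attribute, so neither model ever writes to the
    # shared doc.ents of the main pipeline's Doc.
    for attr, nlp in (("disease_ents", disease_nlp),
                      ("chemical_ents", chemical_nlp)):
        ents = nlp(doc.text).ents
        doc._.set(attr, [(e.start_char, e.end_char, e.label_) for e in ents])
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(biomedical_ner, last=True)
```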
This hack is functional in the sense that at inference time you get all the entities and the different models don't stomp on each other. However, I'm a bit scared of what would happen if, for example, someone tried to fine-tune the pipeline on some new data. It also means the NER models have to be trained independently.
Proposed Solution
Add an optional parameter to these models which the `set_annotations` method of `nn_parser` would use to write entities to a custom attribute instead of `Doc.ents`. The current functionality of the NER and Parser models would be completely unchanged: https://github.com/explosion/spaCy/blob/master/spacy/syntax/nn_parser.pyx#L780
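To illustrate the idea, here is a hypothetical sketch in plain Python (`ents_attr` and the surrounding names are invented, not existing spaCy API, and the real method lives in Cython):

```python
def set_annotations(self, docs, batch_ents):
    # batch_ents[i]: the entity spans the model predicted for docs[i].
    for doc, ents in zip(docs, batch_ents):
        if self.ents_attr is None:
            # Default: today's behaviour, completely unchanged.
            doc.ents = ents
        else:
            # Opt-in: write to an isolated custom attribute instead of the
            # shared Doc.ents, so duplicate models don't clobber each other.
            doc._.set(self.ents_attr, list(ents))
```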
However, it looks like the transition systems for NER and parsing are intimately tied to the `StateC` struct, and I wasn't sure how this relates to the underlying state of the `Doc`: https://github.com/explosion/spaCy/blob/master/spacy/syntax/_state.pxd
So basically this issue is just a request for you to summarise how much work this would be, whether it's possible at all, and what would need to be done to make it possible.