Memory usage when using Doc extensions #13566
Unanswered
makp asked this question in Help: Coding & Implementations
Replies: 0 comments
Issue
Before preprocessing my data with spaCy, I typically have it stored in a Pandas Series. Since I'd like to preserve the index of each document before serializing my Docs, I decided to use a custom extension attribute. However, I noticed that memory usage increases dramatically until my system runs out of memory. I'm not sure what I might be doing wrong.
Here is what I do: after initializing the Language class, I register the extension with `Doc.set_extension("idx", default=None)`. I then run `nlp.pipe` on my text and set the `idx` extension on each Doc. When saving my data as a DocBin, I create the DocBin with `store_user_data=True` in order to save my extension.

Question: Am I implementing the extension feature incorrectly? Any thoughts on how I might proceed? Any suggestions are more than welcome!
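The original code snippets did not survive in this copy of the post. As a rough sketch only, the workflow described above might look like the following. The sample Series, the blank English pipeline, the `as_tuples=True` pattern for carrying the index through `nlp.pipe`, and the variable names are all assumptions for illustration, not the poster's actual code.

```python
import pandas as pd
import spacy
from spacy.tokens import Doc, DocBin

# Hypothetical data standing in for the poster's Pandas Series.
texts = pd.Series(
    ["First document.", "Second document."],
    index=[101, 205],
)

# Assumption: a blank pipeline, so no trained model needs to be downloaded.
nlp = spacy.blank("en")

# Register the custom attribute once (guarded so re-running is safe).
if not Doc.has_extension("idx"):
    Doc.set_extension("idx", default=None)

# store_user_data=True keeps extension values when the Docs are serialized.
doc_bin = DocBin(store_user_data=True)

# as_tuples=True lets nlp.pipe hand each index back alongside its Doc.
pairs = ((text, idx) for idx, text in texts.items())
for doc, idx in nlp.pipe(pairs, as_tuples=True):
    doc._.idx = int(idx)  # cast to a plain int before serialization
    doc_bin.add(doc)

# Round trip through bytes to check that the index survives.
loaded = DocBin(store_user_data=True).from_bytes(doc_bin.to_bytes())
restored = list(loaded.get_docs(nlp.vocab))
```

This sketch only illustrates the steps named in the question. It does not attempt to reproduce or diagnose the memory growth, which may depend on the size of the data and on the pipeline components in use.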
Further details