Pipeline Speed for 'et' 15GB = 55 Days #979
-
Greetings!

*CASE*
I am building an Estonian keyword extractor. I am using the Stanza pipeline for tokenization and lemmatization of the language 'et', with GPU support. To build the keyword extractor I need to lemmatize the entire Estonian corpus, which is 15 GB of text. So far, lemmatizing 58 MB of text (about 1.3 million sentences) takes 1 hour, which puts the full 15 GB at roughly 44-55 days. That seems far too long. Am I doing something wrong that makes the process run so slowly, and is there a way to speed up the lemmatization? I tried adding \n\n after each sentence (in a CSV) to speed things up, but found that chunked files with all sentences on a single row (Parquet files) are somewhat faster. My hardware is a single RTX 3070 with 8 GB, 16 GB RAM, and a Ryzen 3700X; PyTorch is installed and GPU = True.

*DATA*
Layout of the data in the CSV file (one sentence per row):

Eestis on seda aastatel ja näidanud Eesti Televisioon.
Saksa ohvitserid soovivad pärast sõja lõppu kunstiteose maha müüa.
Ainult et linnakese rahu häirib Herr Flick gestaapost, kes on šedöövrist haisu ninna saanud.
Ka tema himustab pilti, et see endale jätta, võltsing aga Hitlerile saata.
Ta loodab seeläbi ametikõrgendusele, pensionipõlves aga maali rahaks teha ja rikkaks saada.
Olukorda komplitseerib tõik, et maalist on teadlik vastupanuliikumine ja sellest liigub ringi koopiaid.
Hertogenboschi vald on . järgu haldusüksus Hollandi lõunaosas Põhja Brabandi provintsis.
Valla keskuseks on s Hertogenboschi linn.

Layout of the Parquet file (a single row containing at least 10,000 sentences):

Mõeldes sellest kirjutas ta raamatu Kättemaks Kristofer Marloo surm , mille eest sai . aastal preemia. Raamat Kaardi looja räägib Walter Raleigh ekspeditsioonist Lõuna Ameerikasse.

*CODE*
The current for loop reads the data in chunks and runs each line through the pipeline. The script below is from an Anaconda Jupyter notebook.
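(Script as posted, with indentation restored; `counter`, `re`, and `lemmaDictionary` are set up elsewhere in the notebook.)

```python
stanza.download('et', processors='tokenize,lemma')
nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

with open('Data/data' + str(counter) + '.parquet', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        line = nlp(line)                          # one pipeline call per line
        for sent in line.sentences:
            for word in sent.words:
                if ((word.lemma != "") & (word.lemma != " ")):
                    word.lemma = re.sub('\W', '', word.lemma)
                if word.lemma in lemmaDictionary:
                    lemmaDictionary[word.lemma] += 1
                else:
                    lemmaDictionary[word.lemma] = 1
```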
Replies: 6 comments 8 replies
-
https://stanfordnlp.github.io/stanza/pipeline.html#processing-multiple-documents
Can you use this to batch 100 at a time or so? It should work much faster if your GPU can do parallel operations.
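For reference, the multi-document call from that page looks roughly like this (the example texts here are placeholders):

```python
import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

texts = ["Esimene lause.", "Teine lause."]            # e.g. 100 lines read from the corpus
in_docs = [stanza.Document([], text=t) for t in texts]
out_docs = nlp(in_docs)                               # one annotated Document per input text
```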
-
Break your file into batches of 100 lines (in the script; don't make millions of files out of your 15 GB of text), and every 100 lines run the whole batch through the pipeline, as in the snippet above.
-
I don't quite follow the use of `islice` here.
On a medium sized file, this is 3x faster than reading the document one line at a time. So not perfect, but faster, at least. You can rewrite the iterator to avoid the `while True` if you like.
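The snippet this comment refers to did not survive in the thread; a minimal sketch of a batched read in that spirit, with the iterator rewritten so there is no `while True` (the batch size and file name are illustrative):

```python
from itertools import islice

import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
    # iter() with an empty-list sentinel stops cleanly at end of file
    for batch in iter(lambda: list(islice(f, 100)), []):
        out_docs = nlp([stanza.Document([], text=line) for line in batch])
        # process out_docs here, inside the loop, before reading the next batch
```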
-
Alright, I think the problem here is that the processing of the documents has to be done immediately after the out_docs = nlp(in_docs) statement. You could try to save all of the results, but since you mentioned several GB of data, it seems unlikely you will be able to fit it all in memory.

What is it that makes you say it is doing more than tokenize and lemma? It should print out a list of processors which are used... is it printing out something unexpected?
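A minimal sketch of that idea (not code from the thread; the batch size, file name, and the lemmaDictionary / regex cleanup mirror the code quoted below): update the counts right after each nlp() call, inside the reading loop, so only the counts stay in memory.

```python
import re
from itertools import islice

import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)
lemmaDictionary = {}

with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
    while True:
        batch = list(islice(f, 100))
        if not batch:
            break
        out_docs = nlp([stanza.Document([], text=line) for line in batch])
        # consume the results immediately; only the counts are kept around
        for doc in out_docs:
            for sent in doc.sentences:
                for word in sent.words:
                    lemma = re.sub(r'\W', '', word.lemma or '')
                    if lemma:
                        lemmaDictionary[lemma] = lemmaDictionary.get(lemma, 0) + 1
```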
…On Wed, Mar 16, 2022 at 12:22 AM hp355837 wrote:

Thank you @AngledLuffa for the suggestion!

Actually there is no other reason for islice: I thought it was going to read 100 rows at a time from the file and put them into the document. I am still learning how to use Python and I'm not sure whether I built the right thing.

This is the current version:
```python
stanza.download('et', processors='tokenize,lemma')
nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

lemmatizedData = open('CorpusLemmas.csv', 'w', encoding='utf-8')
lemmaDictionary = dict()

def stanzaDocument():
    documents = []
    documentcount = 0
    with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
        while True:
            batch = islice(f, 100)
            in_docs = [stanza.Document([], text=d) for d in batch]
            if len(in_docs) == 0:
                break
            out_docs = nlp(in_docs)
            return out_docs

def documentBasedLemmatizer():
    documents = stanzaDocument()
    for sent in tqdm.tqdm(documents[ ???? ].sentences):
        for word in sent.words:
            if ((word.lemma != "") & (word.lemma != " ")):
                word.lemma = re.sub('\W', '', word.lemma)
            if word.lemma in lemmaDictionary:
                lemmaDictionary[word.lemma] += 1
            else:
                lemmaDictionary[word.lemma] = 1
    lemmaOutput(lemmaDictionary)

documentBasedLemmatizer()
```
I added islice thinking that it would read the file 100 lines at a time and put them into the document.

I don't understand how stanza.Document works: how should I iterate through the document -> sentences -> words -> lemmas to write them into my CSV at the end? Also, it seems the current document output contains more than just the lemmas, as if other processors were running.

What should the for loop be in this case to extract all the lemmas? And what should stanzaDocument() use if the file is currently one single dataset (data1) with 1.3 million rows?
-
The id, text, and start/end char fields are artifacts of the tokenizer (not extra processors running). Maybe something like:
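(The original snippet was not preserved; a sketch in that spirit, counting only the lemma field and ignoring the tokenizer bookkeeping:)

```python
from collections import Counter

lemma_counts = Counter()
for doc in out_docs:                 # out_docs from the batched pipeline call
    for sent in doc.sentences:
        for word in sent.words:
            if word.lemma:           # only the lemma matters; skip empty ones
                lemma_counts[word.lemma] += 1
```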
-
My error was pointed out to me recently: I was calling the nlp pipeline per sentence instead of per chunk, so the for loop went through the sentences one by one instead of passing in the whole chunk. The new script runs much faster: 15 GB is now about 7.5 days, and 1.6 GB takes about 16 hours. Reading with f.read() instead of f.readline() also improved the situation; at that rate, 15 GB of data is much more manageable. Furthermore, my current data has sentences separated by \n\n, which improved the speed to about 37,000 it/s, a massive improvement over the 300 it/s I was getting for 1,000 rows of sentences.
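For reference, a minimal sketch of the per-chunk approach described here (the file path is illustrative, and the chunk is assumed to hold \n\n-separated sentences):

```python
import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
    text = f.read()                  # the whole chunk at once, not line by line

doc = nlp(text)                      # one pipeline call per chunk, not per sentence
lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
```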