Pipeline Speed for 'et' 15GB = 55 Days #979
-
Greetings!

*CASE*
I am building an Estonian keyword extractor. I am using the Stanza pipeline for tokenization and lemmatization of the language 'et', with GPU support. To build the keyword extractor I need to lemmatize the entire Estonian corpus, which is 15 GB of text. So far, lemmatizing 58 MB of text (about 1.3 million sentences) takes 1 hour, which puts the full 15 GB at roughly 44-55 days. That seems far too long. Am I doing something wrong that makes the process run so slowly, and is there a way to speed up the lemmatization? I tried adding \n\n after each sentence (in a CSV) to speed things up, but found that chunked files with all sentences on a single row (Parquet files) are somewhat faster. My hardware is a single RTX 3070 with 8 GB, 16 GB RAM, and a Ryzen 3700X; PyTorch is installed and GPU = True.

*DATA*
Layout of the data in the CSV file (one sentence per row):

Eestis on seda aastatel ja näidanud Eesti Televisioon.
Saksa ohvitserid soovivad pärast sõja lõppu kunstiteose maha müüa.
Ainult et linnakese rahu häirib Herr Flick gestaapost, kes on šedöövrist haisu ninna saanud.
Ka tema himustab pilti, et see endale jätta, võltsing aga Hitlerile saata.
Ta loodab seeläbi ametikõrgendusele, pensionipõlves aga maali rahaks teha ja rikkaks saada.
Olukorda komplitseerib tõik, et maalist on teadlik vastupanuliikumine ja sellest liigub ringi koopiaid.
Hertogenboschi vald on . järgu haldusüksus Hollandi lõunaosas Põhja Brabandi provintsis.
Valla keskuseks on s Hertogenboschi linn.

Layout of the Parquet file (a single row containing at least 10,000 sentences):

Mõeldes sellest kirjutas ta raamatu Kättemaks Kristofer Marloo surm , mille eest sai . aastal preemia. Raamat Kaardi looja räägib Walter Raleigh ekspeditsioonist Lõuna Ameerikasse.

*CODE*
The current for loop reads the data in chunks and runs each line through the pipeline. The script below is from an Anaconda Jupyter notebook.
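(Script as posted, with indentation restored; `counter`, `re`, and `lemmaDictionary` are set up elsewhere in the notebook.)

```python
stanza.download('et', processors='tokenize,lemma')
nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

with open('Data/data' + str(counter) + '.parquet', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        line = nlp(line)                          # one pipeline call per line
        for sent in line.sentences:
            for word in sent.words:
                if ((word.lemma != "") & (word.lemma != " ")):
                    word.lemma = re.sub('\W', '', word.lemma)
                if word.lemma in lemmaDictionary:
                    lemmaDictionary[word.lemma] += 1
                else:
                    lemmaDictionary[word.lemma] = 1
```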
Replies: 6 comments 8 replies
-
https://stanfordnlp.github.io/stanza/pipeline.html#processing-multiple-documents
Can you use this to batch 100 at a time or so? It should work much faster if your GPU can do parallel operations.
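For reference, the multi-document call from that page looks roughly like this (the example texts here are placeholders):

```python
import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

texts = ["Esimene lause.", "Teine lause."]            # e.g. 100 lines read from the corpus
in_docs = [stanza.Document([], text=t) for t in texts]
out_docs = nlp(in_docs)                               # one annotated Document per input text
```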
-
Break your file into batches of 100 lines (in the script; don't make millions of files out of your 15 GB of text), and every 100 lines run the whole batch through the pipeline, as in the snippet above.
-
I don't quite follow the use of `islice` here.
On a medium sized file, this is 3x faster than reading the document one line at a time. So not perfect, but faster, at least. You can rewrite the iterator to avoid the `while True` if you like.
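The snippet this comment refers to did not survive in the thread; a minimal sketch of a batched read in that spirit, with the iterator rewritten so there is no `while True` (the batch size and file name are illustrative):

```python
from itertools import islice

import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
    # iter() with an empty-list sentinel stops cleanly at end of file
    for batch in iter(lambda: list(islice(f, 100)), []):
        out_docs = nlp([stanza.Document([], text=line) for line in batch])
        # process out_docs here, inside the loop, before reading the next batch
```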
-
Alright, I think the problem here is that the processing of the documents has to be done immediately after the out_docs = nlp(in_docs) statement. You could try to save all of the results, but since you mentioned several GB of data, it seems unlikely you will be able to fit it all in memory.

What is it that makes you say it is doing more than tokenize and lemma? It should print out a list of processors which are used... is it printing out something unexpected?
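A minimal sketch of that idea (not code from the thread; the batch size, file name, and the lemmaDictionary / regex cleanup mirror the code quoted below): update the counts right after each nlp() call, inside the reading loop, so only the counts stay in memory.

```python
import re
from itertools import islice

import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)
lemmaDictionary = {}

with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
    while True:
        batch = list(islice(f, 100))
        if not batch:
            break
        out_docs = nlp([stanza.Document([], text=line) for line in batch])
        # consume the results immediately; only the counts are kept around
        for doc in out_docs:
            for sent in doc.sentences:
                for word in sent.words:
                    lemma = re.sub(r'\W', '', word.lemma or '')
                    if lemma:
                        lemmaDictionary[lemma] = lemmaDictionary.get(lemma, 0) + 1
```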
…On Wed, Mar 16, 2022 at 12:22 AM hp355837 wrote:

Thank you @AngledLuffa for the suggestion!

Actually there is no other reason for islice: I thought it was going to read 100 rows at a time from the file and put them into the document. I am still learning how to use Python and I'm not sure whether I built the right thing.

This is the current version:
```python
stanza.download('et', processors='tokenize,lemma')
nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

lemmatizedData = open('CorpusLemmas.csv', 'w', encoding='utf-8')
lemmaDictionary = dict()

def stanzaDocument():
    documents = []
    documentcount = 0
    with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
        while True:
            batch = islice(f, 100)
            in_docs = [stanza.Document([], text=d) for d in batch]
            if len(in_docs) == 0:
                break
            out_docs = nlp(in_docs)
            return out_docs

def documentBasedLemmatizer():
    documents = stanzaDocument()
    for sent in tqdm.tqdm(documents[ ???? ].sentences):
        for word in sent.words:
            if ((word.lemma != "") & (word.lemma != " ")):
                word.lemma = re.sub('\W', '', word.lemma)
            if word.lemma in lemmaDictionary:
                lemmaDictionary[word.lemma] += 1
            else:
                lemmaDictionary[word.lemma] = 1
    lemmaOutput(lemmaDictionary)

documentBasedLemmatizer()
```
I added islice thinking that it would read the file 100 lines at a time and put them into the document.

I don't understand how stanza.Document works: how should I iterate through the document -> sentences -> words -> lemmas to write them into my CSV at the end? Also, it seems the current document output contains more than just the lemmas, as if other processors were running.

What should the for loop be in this case to extract all the lemmas? And what should stanzaDocument() use if the file is currently one single dataset (data1) with 1.3 million rows?
-
The id, text, and start/end char fields are artifacts of the tokenizer (not extra processors running). Maybe something like:
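(The original snippet was not preserved; a sketch in that spirit, counting only the lemma field and ignoring the tokenizer bookkeeping:)

```python
from collections import Counter

lemma_counts = Counter()
for doc in out_docs:                 # out_docs from the batched pipeline call
    for sent in doc.sentences:
        for word in sent.words:
            if word.lemma:           # only the lemma matters; skip empty ones
                lemma_counts[word.lemma] += 1
```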
-
My error was pointed out to me recently: I was calling the nlp pipeline per sentence instead of per chunk, so the for loop went through the sentences one by one instead of passing in the whole chunk. The new script runs much faster: 15 GB is now about 7.5 days, and 1.6 GB takes about 16 hours. Reading with f.read() instead of f.readline() also improved the situation; at that rate, 15 GB of data is much more manageable. Furthermore, my current data has sentences separated by \n\n, which improved the speed to about 37,000 it/s, a massive improvement over the 300 it/s I was getting for 1,000 rows of sentences.
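For reference, a minimal sketch of the per-chunk approach described here (the file path is illustrative, and the chunk is assumed to hold \n\n-separated sentences):

```python
import stanza

nlp = stanza.Pipeline('et', processors='tokenize,lemma', use_gpu=True)

with open('Data/data1.parquet', 'r', encoding='utf-8') as f:
    text = f.read()                  # the whole chunk at once, not line by line

doc = nlp(text)                      # one pipeline call per chunk, not per sentence
lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
```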