Bringing word2vec to the German language.
The Leipzig Corpora Collection contains German sentences from news articles and Wikipedia for each year from 1995 to 2015.
To download all news and Wikipedia corpora for all years, run:
>>> from goethe.utils.leipzig_corpora_downloader import download_corpora_news, download_corpora_wiki
>>> download_corpora_news()
>>> download_corpora_wiki()
The Leipzig Corpora Collection is a quick way to start training models for German. You can load a corpus and iterate over its sentences with the following code:
from goethe.corpora import LeipzigCorpus
sentences = LeipzigCorpus('path/containing/corpora')
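Assuming that LeipzigCorpus yields tokenized sentences (lists of tokens), as the gensim example below expects, you can peek at the first few sentences without reading the whole collection:

import itertools

from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')

# Print the first three sentences; islice stops early instead of
# iterating the entire corpus.
for sentence in itertools.islice(sentences, 3):
    print(sentence)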
Assuming that you have a file structure like this:
path/containing/corpora/
    deu_news_2015_3M/
        deu_news_2015_3M-sentences.txt
        ...
    deu_wikipedia_2014_3M/
        deu_wikipedia_2014_3M-sentences.txt
        ...
    ...
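For orientation, here is a minimal sketch of what iterating over such a layout involves. This is not the goethe implementation; it assumes the standard Leipzig file format, in which each line of a *-sentences.txt file consists of a numeric index, a tab, and the sentence text:

import glob
import os

class SimpleLeipzigReader:
    """Sketch of a sentence iterator over Leipzig-style directories."""

    def __init__(self, root):
        self.root = root

    def __iter__(self):
        pattern = os.path.join(self.root, '*', '*-sentences.txt')
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding='utf-8') as f:
                for line in f:
                    # Assumed line format: "<index>\t<sentence>"
                    _, _, text = line.partition('\t')
                    yield text.split()  # naive whitespace tokenization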
You can train models using gensim:
import gensim
from goethe.corpora import LeipzigCorpus

sentences = LeipzigCorpus('path/containing/corpora')
model = gensim.models.Word2Vec(sentences)
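Training on millions of sentences takes a while, so it is worth persisting the result. gensim models can be saved and reloaded; the filename below is arbitrary, and the keyword arguments are gensim's usual training knobs (in gensim 4+ the size parameter is called vector_size):

model = gensim.models.Word2Vec(sentences, size=300, window=5,
                               min_count=5, workers=4)
model.save('german.w2v')

# In a later session:
model = gensim.models.Word2Vec.load('german.w2v')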
A trained model can be queried for semantic similarity. For example, we can pose the analogy "Obama is to the USA what Putin is to X" and ask the model for words that match X:
>>> model.most_similar(['Putin', 'USA'], ['Obama'], topn=3)
[('Russland', 0.7132166028022766),
 ('USA,', 0.7057479619979858),
 ('China', 0.6795132160186768)]
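Under the hood, most_similar adds the vectors of the positive words and subtracts the negative ones, so the query above searches for words near vec('Putin') + vec('USA') - vec('Obama'). The same pattern works for other analogies; for instance, a capital-city query (the exact output depends on your training data, but ideally 'Berlin' ranks near the top):

>>> model.most_similar(['Paris', 'Deutschland'], ['Frankreich'], topn=3)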
To test the model on multiple such queries, you can use the model_accuracy function:
>>> from goethe.evaluation import model_accuracy
>>> model_accuracy(model, 'evaluation/bestmatch-questions.txt', topn=5)
[('Land-Währung', 0.5238095238095238),
('Hauptstad-Land', 0.47619047619047616),
('Land-Kontinent', 0.34615384615384615),
('Land-Sprache', 0.15384615384615385),
('Politik', 0.0),
('Technik', 0.6666666666666666),
('Geschlecht', 0.5220588235294118)]
The resulting list contains a tuple for each section, giving its name and accuracy. The accuracy is the fraction of 4-tuples for which the topn words returned by most_similar contained the correct word.
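The exact format of the questions file is not shown here. For illustration, a rough sketch of this kind of evaluation, assuming the common word2vec conventions (": section-name" header lines followed by four space-separated words a b c d, where d is the expected answer to "a is to b what c is to ?"), might look like this:

def naive_accuracy(model, path, topn=5):
    """Sketch: per-section top-n accuracy for analogy questions."""
    sections = []  # one [name, hits, total] entry per section
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(':'):  # a new section begins
                sections.append([line[1:].strip(), 0, 0])
                continue
            a, b, c, d = line.split()
            try:
                # d should be close to vec(b) - vec(a) + vec(c)
                predicted = model.most_similar([b, c], [a], topn=topn)
            except KeyError:  # skip questions with out-of-vocabulary words
                continue
            sections[-1][2] += 1
            if d in (word for word, _ in predicted):
                sections[-1][1] += 1
    return [(name, hits / total if total else 0.0)
            for name, hits, total in sections]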