Pygrunn 14 article #34
base: pelican
Changes from 1 commit
cb0253e
2da68e1
6393cd1
fd9af3d
b0a5bc1
c8c26cf
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
Pygrunn 2014 | ||
============ | ||
|
||
:date: 2014-05-10 15:56 | ||
:tags: conference, talk, nlp, pygrunn | ||
:category: life | ||
:author: Dmitrijs Milajevs | ||
:template: article_cover | ||
:cover: 016-pygrunn14.jpg | ||
|
||
`Pygrunn <http://www.pygrunn.org/>`_ is an awesome conference for Python | ||
developers and friends, which takes place in | ||
`Groningen <http://en.wikipedia.org/wiki/Groningen>`_. | ||
|
||
As usual, the conference was perfectly organized. This is one of the most | ||
stylish conferences I've ever attended. It keeps growing, and next year the | ||
conference moves to a bigger venue, so keep the beginning of May 2015 free and | ||
attend the event. | ||
|
||
Another positive trend is the growing proportion of science-related talks. One | ||
of the conference's themes turned out to be (scientific) code quality and | ||
collaboration between professional developers and scientists. | ||
|
||
Check out the awesome summaries of the talks by | ||
`Reinout van Rees <http://reinout.vanrees.org/weblog/tags/pygrunn.html>`_ | ||
and | ||
`Maurits van Rees <http://maurits.vanrees.org/weblog/topics/pygrunn>`_. Get the | ||
`#pygrunn <https://twitter.com/search?q=%23PyGrunn>`_ tweets and follow | ||
`@pygrunn <https://twitter.com/PyGrunn>`_. | ||
|
||
|
||
Computational linguistics 101 | ||
----------------------------- | ||
|
||
`My presentation`__ started as a demonstration of modern Pythonic scientific | ||
tools (my subjective classification): | ||
|
||
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb | ||
|
||
1. Data structures | ||
* `numpy <http://www.numpy.org/>`_ | ||
Review comment: Being picky, but: |
||
* `scipy <http://www.scipy.org/scipylib/index.html>`_ | ||
Review comment: SciPy |
||
* `pandas <http://pandas.pydata.org/>`_ | ||
Review comment: Pandas |
||
2. Algorithms | ||
* `scikit-learn <http://scikit-learn.org/>`_ | ||
* `nltk <http://www.nltk.org/>`_ | ||
Review comment: NLTK |
||
* `textblob <http://textblob.readthedocs.org>`_ | ||
Review comment: TextBlob |
||
* `gensim <http://radimrehurek.com/gensim/>`_ | ||
3. Reporting | ||
* `ipython <http://ipython.org/>`__ | ||
Review comment: IPython |
||
* `matplotlib <http://matplotlib.org/>`__, `seaborn <http://www.stanford.edu/~mwaskom/software/seaborn/>`__ | ||
|
||
However, I find technical talks with a lot of code rather boring, so I | ||
decided to show how these libraries can be used to solve simple computational linguistics (CL) tasks. | ||
|
||
First, I `covered`__ `Zipf's law <http://en.wikipedia.org/wiki/Zipf%27s_law>`_ | ||
and showed that it holds for an English text. As homework, I asked whether the | ||
Review comment: First, I covered Zipf's law, which states that the frequency of any word in a corpus of texts is inversely proportional to its rank in the frequency table. With help of pandas??? I showed that it holds for an English text.
Review comment: good point |
||
same behavior is observed in other languages and what the differences are. | ||
|
||
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#english-word-frequencies | ||
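
A minimal sketch of the rank-frequency table behind Zipf's law, using only the standard library rather than the notebook's own code. The function name ``rank_frequencies``, the crude regular-expression tokenizer and the toy sentence are assumptions made for illustration; a real check needs a much larger text.

```python
import re
from collections import Counter


def rank_frequencies(text):
    """Return (rank, word, count) triples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


sample = "the cat and the dog and the bird"
table = rank_frequencies(sample)
for rank, word, count in table:
    # Under Zipf's law, rank * count stays roughly constant in a real corpus.
    print(rank, word, count, rank * count)
```

On a realistic corpus the product of rank and count in the last column should stay within the same order of magnitude for the top-ranked words; the toy sentence is far too short to show that.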
|
||
I could not resist and presented my `research area`__ :) by extracting | ||
co-occurrence counts and projecting the word vectors to 2 dimensions. I managed to | ||
get a plot where ``girl`` is close to ``boy`` but far from ``manager``. | ||
|
||
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#distributional-semantics | ||
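
To give a flavour of the co-occurrence idea, here is a small standard-library sketch. It is not the notebook's code: the window size, the toy corpus and the helper names ``cooccurrence_vectors`` and ``cosine`` are all assumptions, and instead of projecting the vectors to 2 dimensions it simply compares them with cosine similarity.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (an assumption); real experiments use millions of tokens.
tokens = ("the girl plays in the park . the boy plays in the park . "
          "the manager signs the contract .").split()
vectors = cooccurrence_vectors(tokens)
print(cosine(vectors["girl"], vectors["boy"]))
print(cosine(vectors["girl"], vectors["manager"]))
```

Words that appear in similar contexts end up with similar count vectors, which is why ``girl`` scores closer to ``boy`` than to ``manager`` even in this tiny example.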
|
||
Sprint | ||
------ | ||
|
||
`@_spyreto_ <https://twitter.com/_spyreto_>`_ and | ||
`Sjoerd de Haan <https://www.linkedin.com/profile/view?id=22830170>`_ liked the | ||
Review comment: @spyreto and Sjoerd de Haan liked the idea of counting word frequencies among various languages and see how they compare in relation to Zipf's law. |
||
idea of counting word frequencies across various languages and seeing how the | ||
slope behaves. | ||
|
||
Initially, we wanted to take EU directives and compare the official EU languages. | ||
However, the website was down, and we were kindly redirected to | ||
`this page <http://sorry.ec.europa.eu/>`_ every time we tried to get a legal | ||
document. | ||
|
||
Luckily, we found already prepared `word frequencies for many languages | ||
<http://invokeit.wordpress.com/frequency-word-lists/>`_ and reused them. | ||
|
||
Review comment: Here we can expand a bit. We wrote a simple function to plot the frequency of the words against the rank of the words in the frequency table. One thing we can compare amongst languages is how well this plot follows a straight line. Also the slope of the line contains interesting information. It tells what kind of power law we are dealing with exactly. Here I can explain slope of line = exponent in power law
Review comment: yes, you can use latex syntax here |
||
The task was to | ||
|
||
* plot the ranked frequency distribution on a log-log scale | ||
* estimate the slope :math:`\alpha` of the resulting line, which is the exponent | ||
of the power law relating a word's frequency to its rank. | ||
Review comment: You wanted to make the blog post human readable, perhaps you could explain this problem better |
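
The slope estimate can be sketched as follows. If frequency follows a power law :math:`f(r) = C r^{\alpha}`, then :math:`\log f = \log C + \alpha \log r`, so an ordinary least-squares line through the log-log points gives :math:`\alpha` as its slope. This is a hedged illustration and not the notebook's implementation: the function name ``fit_loglog``, the use of natural logarithms and the synthetic data are assumptions.

```python
import math


def fit_loglog(frequencies):
    """Least-squares fit of log(frequency) = intercept + slope * log(rank)."""
    # Rank 1 is the most frequent word.
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic counts following f(r) = 1000 / r, so the slope should be close to -1.
ideal = [1000.0 / rank for rank in range(1, 51)]
slope, intercept = fit_loglog(ideal)
print(round(slope, 3), round(intercept, 3))
```

The slope does not depend on the base of the logarithm, but the intercept does, so intercepts are only comparable between fits that use the same base.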
||
|
||
`We tried`__ English, Dutch, Russian, Latvian, Spanish and Italian. All languages | ||
obey Zipf's law, at least visually. Here is a plot for English (see `the notebook`__ | ||
for other plots): | ||
|
||
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Word%20frequencies.ipynb | ||
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Word%20frequencies.ipynb | ||
|
||
.. image:: {filename}/static/images/016-en_zipf.png | ||
:align: center | ||
:alt: English word frequency counts on the log-log scale. | ||
Review comment: The image is not found
Review comment: github doesn't know how to find it, but our blog engine does :) |
||
:target: {filename}/static/images/016-en_zipf.png | ||
|
||
The blue line shows the provided frequencies; the green line is the regression fit. | ||
Theory [Li1992]_ says that the slope coefficient should be close to -1. As the | ||
table below shows, the values deviate from -1 quite drastically (-1.7 for Spanish). | ||
Also, the `slope estimate`__ for English from the `British National Corpus`__ is | ||
-1.18. | ||
|
||
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#estimating-the-slope | ||
__ http://www.natcorp.ox.ac.uk/ | ||
|
||
========= ========= =========== | ||
Language Slope Intercept | ||
========= ========= =========== | ||
uk -1.044263 11.212273 | ||
nl -1.566664 19.635268 | ||
ru -1.395736 17.781756 | ||
lv -1.055992 11.541761 | ||
es -1.707326 22.161790 | ||
it -1.601567 20.000540 | ||
========= ========= =========== | ||
|
||
Conclusion | ||
Review comment: Maybe you could add a few closing words on Pygrunn, maybe something relating to linguistics as well. Or if you don't want to add anything you could just change Conclusion to References & keep the link below. |
||
---------- | ||
|
||
.. [Li1992] Li, Wentian. | ||
`Random texts exhibit Zipf's-law-like word frequency distribution.`__ | ||
Information Theory, IEEE Transactions on 38.6 (1992): 1842-1845. | ||
|
||
__ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.8422&rep=rep1&type=pdf |
Review comment: It looks like the notebook is not loading.
Review comment: It works now.
Review comment: The article is improving!
Still, the notebook is not loading. It is not found on the server.
Review comment: Strange, it works for me; maybe there are some problems on the server. I'll give a link to the original file and to the rendered version.
Review comment: We can use the word frequencies available here: http://wacky.sslmit.unibo.it/doku.php?id=frequency_lists