Pygrunn 14 article #34

Open
wants to merge 6 commits into
base: pelican
129 changes: 129 additions & 0 deletions content/articles/016-pygrunn14.rst
@@ -0,0 +1,129 @@
Pygrunn 2014
============

:date: 2014-05-10 15:56
:tags: conference, talk, nlp, pygrunn
:category: life
:author: Dmitrijs Milajevs
:template: article_cover
:cover: 016-pygrunn14.jpg

`Pygrunn <http://www.pygrunn.org/>`_ is an awesome conference for Python
developers and friends, which takes place in
`Groningen <http://en.wikipedia.org/wiki/Groningen>`_.

As usual, the conference was perfectly organized. This is one of the most
stylish conferences I've ever attended. It keeps growing, and next year the
conference moves to a bigger venue, so keep the beginning of May 2015 free and
attend the event.

Another positive trend is the growing proportion of science-related talks. One
of the themes of the conference was (scientific) code quality and
collaboration between professional developers and scientists.

Check out the awesome summaries of talks by
`Reinout van Rees <http://reinout.vanrees.org/weblog/tags/pygrunn.html>`_
and
`Maurits van Rees <http://maurits.vanrees.org/weblog/topics/pygrunn>`_. Get the
`#pygrunn <https://twitter.com/search?q=%23PyGrunn>`_ tweets and follow
`@pygrunn <https://twitter.com/PyGrunn>`_.


Computational linguistics 101
-----------------------------

`My presentation`__ started as a demonstration of the modern pythonic scientific
tools (my subjective classification):

__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb

It looks like the notebook is not loading.

Contributor Author

it works now.

The article is improving!
Still the notebook is not loading. It is not found on the server.

Contributor Author

strange, it works for me, maybe there are some problems on the server. I'll give a link to the original file and to the rendered version.

Contributor Author

we can use word frequencies available here http://wacky.sslmit.unibo.it/doku.php?id=frequency_lists


1. Data structures
* `numpy <http://www.numpy.org/>`_

Being picky, but: s/numpy/NumPy

* `scipy <http://www.scipy.org/scipylib/index.html>`_

SciPy

* `pandas <http://pandas.pydata.org/>`_

Pandas

2. Algorithms
* `scikit-learn <http://scikit-learn.org/>`_
* `nltk <http://www.nltk.org/>`_

NLTK

* `textblob <http://textblob.readthedocs.org>`_

TextBlob

* `gensim <http://radimrehurek.com/gensim/>`_
3. Reporting
* `ipython <http://ipython.org/>`__

IPython

* `matplotlib <http://matplotlib.org/>`__, `seaborn <http://www.stanford.edu/~mwaskom/software/seaborn/>`__

However, I find technical talks with a lot of code rather boring, so I
decided to show how these libraries can be used to solve simple computational
linguistics (CL) tasks.

First, I `covered`__ `Zipf's law <http://en.wikipedia.org/wiki/Zipf%27s_law>`_
and showed that it holds for an English text. As homework, I asked whether the
First, I covered Zipf's law, which states that the frequency of any word in a corpus of texts is inversely proportional to its rank in the frequency table. With help of pandas??? I showed that it holds for an English text.

Contributor Author

good point

same behavior is observed in other languages and what the differences are.

__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#english-word-frequencies
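A minimal sketch of the rank-frequency table behind Zipf's law, using only the
standard library rather than the notebook's own code. The function name
``rank_frequency``, the tokenizing regular expression and the sample sentence
are assumptions made for illustration.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


sample = "the cat and the dog and the bird"
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

If Zipf's law held exactly, the count at rank 2 would be about half of the
count at rank 1, the count at rank 3 about a third, and so on. A sentence this
short is far too small to show that.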

I could not resist and presented my `research area`__ :) by extracting
co-occurrence counts and projecting the word vectors to 2 dimensions. I managed
to get a plot where ``girl`` is close to ``boy`` but far from ``manager``.

__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#distributional-semantics
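The idea of co-occurrence vectors can be sketched as follows. This is an
illustration under assumptions, not the notebook's code: the toy corpus, the
window size of 2, and the helper names ``cooccurrence_vectors`` and ``cosine``
are all made up for the example. The projection to 2 dimensions is left out,
and cosine similarity stands in for "closeness" on the plot.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus; real experiments need a large text collection.
toy_corpus = [
    "the girl plays outside",
    "the boy plays outside",
    "the manager signs contracts",
]
vectors = cooccurrence_vectors(toy_corpus)
sim_boy = cosine(vectors["girl"], vectors["boy"])
sim_manager = cosine(vectors["girl"], vectors["manager"])
print(sim_boy > sim_manager)
```

In this toy corpus ``girl`` and ``boy`` share all of their context words,
while ``girl`` and ``manager`` share only ``the``, so the first similarity is
higher.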

Sprint
------

`@_spyreto_ <https://twitter.com/_spyreto_>`_ and
`Sjoerd de Haan <https://www.linkedin.com/profile/view?id=22830170>`_ liked the
@spyreto and Sjoerd de Haan liked the idea of counting word frequencies among various languages and see how they compare in relation to Zipf's law.

idea of counting word frequencies across various languages and seeing how the
slope behaves.

Initially, we wanted to take EU directives and compare the official EU
languages. However, the website was down, and we were kindly redirected to
`this page <http://sorry.ec.europa.eu/>`_ every time we wanted to get a legal
document.

Luckily, we found already prepared `word frequency lists for many languages
<http://invokeit.wordpress.com/frequency-word-lists/>`_ and reused them.

Here we can expand a bit.

We wrote a simple function to plot the frequency of the words against the rank of the words in the frequency table.
[Example table with some words, frequency and rank for English and another language]
For these plots we used a log-log scale. A straight line on a log-log plot implies that the quantities on the two axes are related through a power law. Thus, if our data fit a straight line perfectly, that would mean that the frequency of a word is exactly proportional to a power of the rank of that word in the frequency table. This is the content of Zipf's law, but of course such laws are never exact.
For English, however, it works quite well. [Graph]

One thing we can compare amongst languages is how well this plot follows a straight line. The slope of the line also contains interesting information: it tells us exactly what kind of power law we are dealing with.
A power law has the shape
[Can I use latex formula?]

Here I can explain slope of line = exponent in power law
etc
etc

Contributor Author

yes, you can use latex syntax here

The task was to

* plot the ranked frequency distribution on the log-log scale
* estimate the slope :math:`\alpha` of the regression line, which is the
exponent of the power law that relates the frequency of a word to its rank.
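One way to estimate the slope is an ordinary least-squares fit on the
log-transformed ranks and frequencies. The sketch below is an assumption about
how this could be done, not the sprint's actual code: ``fit_loglog`` is a
hypothetical helper, natural logarithms are an arbitrary choice, and the input
is synthetic data that follows an exact power law.

```python
import math


def fit_loglog(frequencies):
    """Least-squares fit of log(frequency) = intercept + slope * log(rank).

    All frequencies must be positive. Natural logarithms are assumed here.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic counts that follow an exact power law with exponent -1.
ideal = [1000.0 / rank for rank in range(1, 201)]
slope, intercept = fit_loglog(ideal)
print(round(slope, 3), round(intercept, 3))
```

For this synthetic input the fitted slope is -1 and the intercept is the
logarithm of the top frequency. Real frequency lists scatter around the line,
so the estimate depends on how many ranks are included in the fit.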

You wanted to make the blog post human readable, perhaps you could explain this problem better :neckbeard:


`We tried`__ English, Dutch, Russian, Latvian, Spanish and Italian. All languages
obey Zipf's law, at least visually. Here is a plot for English (see `the notebook`__
for other plots):

__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Word%20frequencies.ipynb
__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Word%20frequencies.ipynb

.. image:: {filename}/static/images/016-en_zipf.png
   :align: center
   :alt: English word frequency counts on the log-log scale.
The image is not found

Contributor Author

github doesn't know how to find it, but our blog engine does :)

   :target: {filename}/static/images/016-en_zipf.png

The blue line shows the provided frequencies; the green line is a regression
fit. Theory [Li1992]_ says that the slope coefficient should be close to -1. As
the table below shows, the estimated values deviate from -1 quite drastically
(-1.7 for Spanish). For comparison, the `slope estimate`__ for English from the
`British National Corpus`__ is -1.18.

__ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#estimating-the-slope
__ http://www.natcorp.ox.ac.uk/
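
As a sketch of what the slope means: if the frequency :math:`f` of a word with
rank :math:`r` follows a power law with exponent :math:`\alpha`, then taking
logarithms turns it into a straight line. The constant :math:`C` is introduced
here for illustration only, and the base of the logarithm is left unspecified.

.. math::

   f(r) = C \, r^{\alpha}
   \quad\Longrightarrow\quad
   \log f(r) = \log C + \alpha \log r

Under this reading, a fitted slope estimates :math:`\alpha`, a fitted intercept
estimates :math:`\log C`, and Zipf's law in its classic form corresponds to
:math:`\alpha \approx -1`.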

=========  =========  ===========
Language   Slope      Intercept
=========  =========  ===========
uk         -1.044263  11.212273
nl         -1.566664  19.635268
ru         -1.395736  17.781756
lv         -1.055992  11.541761
es         -1.707326  22.161790
it         -1.601567  20.000540
=========  =========  ===========

Conclusion
Member

Maybe you could add a few closing words on Pygrunn, maybe something relating to linguistics as well. Or if you don't want to add anything you could just change Conclusion to References & keep the link below.

----------

.. [Li1992] Li, Wentian.
   `Random texts exhibit Zipf's-law-like word frequency distribution.`__
   Information Theory, IEEE Transactions on 38.6 (1992): 1842-1845.

__ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.8422&rep=rep1&type=pdf
Binary file added content/static/article_covers/016-pygrunn14.jpg
Binary file added content/static/images/016-en_zipf.png