<!DOCTYPE html>
<html lang="en">
<head>
        <meta charset="utf-8" />
        <title>:: RaPrism ::</title>
        <link rel="stylesheet" href="./theme/css/main.css" />
        <link href="https://raprism.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title=":: RaPrism :: Atom Feed" />

        <!--[if IE]>
            <script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
</head>

<body id="index" class="home">
        <header id="banner" class="body">
                <h1><a href="./">:: RaPrism :: </a></h1>
                <nav><ul>
                    <li><a href="./category/publishing.html">publishing</a></li>
                    <li><a href="./category/python.html">python</a></li>
                </ul>
                </nav>
        </header><!-- /#banner -->

            <aside id="featured" class="body">
                <article>
                    <h1 class="entry-title"><a href="./PyDataBerlin2015.html">PyData Berlin 2015</a></h1>
<footer class="post-info">
        <span>Thu 18 June 2015</span>
<span>| tags: <a href="./tag/python.html">python</a><a href="./tag/data.html">data</a><a href="./tag/berlin.html">berlin</a><a href="./tag/pydata.html">pydata</a></span>
</footer><!-- /.post-info --><p>After visiting the <a href="http://pydata.org/berlin2015">PyData Berlin 2015</a>
event, which took place on 29-30 May at
<a href="http://www.betahaus.com">Betahaus</a>, I wrote down some links as a
reminder and then decided to put them here, along with some
thoughts on the topics covered.</p>
<p>As announced on the meeting site, the talks were 'related to the
use of Python in data management and analysis' and came from a variety of
fields, with an emphasis on applications of machine learning.</p>
<p>At the 2014 PyData Berlin event, Travis Oliphant gave a
<a href="https://www.youtube.com/watch?v=d9Qm3PPoYNQ">talk</a> in which
'PyData: the First 20 Years' was also summarized on a slide. The
story began with basic matrix calculation packages that were the
precursors of today's standard <a href="http://www.numpy.org">NumPy</a> and of
extensions like <a href="https://www.scipy.org">SciPy</a> and
<a href="http://matplotlib.org">matplotlib</a>. With
<a href="http://pandas.pydata.org">pandas</a> and several machine learning
packages, e.g. <a href="http://scikit-learn.org">scikit-learn</a>, the PyData
ecosystem rivals <a href="http://www.r-project.org">R</a> and interacts with
the large Java-driven 'big data' solutions centered on
<a href="https://projects.apache.org/indexes/category.html#big-data">apache.org</a>,
as described
e.g. <a href="http://www.blue-yonder.com/blog-e/2014/11/12/environment-choose-data-science">here</a>.</p>
<p>The topics presented and discussed in Berlin this year were commented on in the
'official'
<a href="https://twitter.com/hashtag/PyDataBerlin?src=hash">#PyDataBerlin</a>
Twitter feed, and <a href="https://www.youtube.com/user/PyDataTV">videos</a> are
online.</p>
<p>The keynote by <a href="http://matthewrocklin.com">Matthew
Rocklin</a> of <a href="http://continuum.io">Continuum
Analytics</a> was about
<a href="http://dask.pydata.org/">Dask</a>, which seems to be quite an elegant
approach to getting multi-core performance 'instantly' for a large subset
of NumPy functionality through parallel processing. This topic usually
leads to remarks about the limits that the
<a href="https://wiki.python.org/moin/GlobalInterpreterLock">GIL</a> places on
multi-threaded data processing, and consequently one of Matthew's messages
was to release it in the main parts of the PyData ecosystem (using the
<a href="http://docs.cython.org/src/userguide/external_C_code.html#nogil">with
nogil</a>
statement in Cython).</p>
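<p>The basic idea of splitting an array-like computation into chunks and
combining partial results can be sketched with the standard library alone.
This is not Dask's API; <code>chunked_sum</code> and <code>n_chunks</code> are
names made up for this illustration. Since the per-chunk work here is pure
Python, the threads will not really run in parallel because of the GIL, which
is the reason why releasing it in compiled code matters.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_sum(values, n_chunks=4):
    """Sum a sequence by reducing chunks independently and combining them.

    Illustrative only: this mimics the 'blocked algorithm' idea, not Dask.
    """
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Each chunk is reduced on its own; threads only run truly in parallel
    # when the per-chunk work releases the GIL (e.g. in compiled code).
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = chunked_sum(list(range(1000)))
```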
<p><a href="http://haenel.co">Valentin Haenel</a> talked about Blosc and related
higher-level packages (see the links on his homepage). In particular, the
'columnar data container' <a href="https://github.com/Blosc/bcolz">bcolz</a> could
be considered a lightweight alternative to the HDF5 file format
(<a href="http://www.pytables.org">PyTables</a> also offers a Blosc-based
compression filter).</p>
<p><a href="https://github.com/c-abird">Claas Abert</a>'s talk about the numerical
treatment of PDEs with NumPy gave an impression of one 'classic' use
case for PyData packages, i.e. physical simulations of
<a href="http://micromagnetics.org">micromagnetic
problems</a> by finite-difference and finite-element methods. The
finite-element package is
based on <a href="http://fenicsproject.org/">FEniCS</a>. Number crunching
also involves optimizing Python code, and an overview of the
options was given. Thanks for the hint about a new JIT
compiler: <a href="https://github.com/cosmo-ethz/hope">HOPE</a>!</p>
<p>Mobile app marketing was one example of 'new' 'Big Data' and its
challenges. Nakul Selvaraj from <a href="http://www.trademob.com">Trademob</a>
explained the demands of real-time monitoring, and his colleague Tobias
Kuhn gave some insights into statistical methods, with specific
algorithms like <a href="https://github.com/trademob/t-digest">t-digest</a> used
e.g. for tracking anomalies, and into how to separate trends from
seasonal variations.</p>
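<p>To make the idea of separating a trend from seasonal variations more
concrete, here is a minimal sketch of a classical additive decomposition. It
is not the method shown in the talk; the function <code>decompose</code> is
hypothetical, and it assumes an odd period and a series of at least two full
periods.</p>

```python
def decompose(series, period):
    """Split a series into trend, seasonal and residual parts (additive model).

    Assumes an odd period, so that a centered moving average is well defined,
    and a series covering at least two full periods.
    """
    half = period // 2
    n = len(series)
    # Trend: centered moving average over one period; undefined at the edges.
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / period
    # Seasonal: mean detrended value for each position within the cycle.
    buckets = [[] for _ in range(period)]
    for i in range(half, n - half):
        buckets[i % period].append(series[i] - trend[i])
    seasonal_means = [sum(b) / len(b) for b in buckets]
    # Shift the seasonal component so that it averages to zero over a period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[i % period] - offset for i in range(n)]
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual


# Synthetic data: linear trend plus a repeating pattern of period 3.
pattern = [1.0, 0.0, -1.0]
series = [0.5 * i + pattern[i % 3] for i in range(12)]
trend, seasonal, residual = decompose(series, period=3)
```

<p>Large residuals would then be candidates for anomalies, although a real
monitoring system needs something more robust than this.</p>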
<p>There were two talks from <a href="https://pivotal.io">Pivotal</a> folks. <a href="http://cloudfoundry.org">Cloud
Foundry</a> was mentioned as their
<a href="https://en.wikipedia.org/wiki/Platform_as_a_service">PaaS</a> product,
and it's interesting to see which open source software is <a href="https://pivotal.io/open-source">used
@Pivotal</a>. Smart GPS tracking of
cars was given as one example of
<a href="https://en.wikipedia.org/wiki/Internet_of_Things">IoT</a>. In this study,
presented by Ronert Obst, <a href="http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm">Random
Forests</a>
were used to learn and classify the routes a car might have taken. Although
<a href="https://github.com/apache/spark">Spark</a> usage was mentioned here and
in other talks, there were also hints about
<a href="https://github.com/apache/flink">Flink</a> as an alternative with a different
memory model and the ability to process data
<a href="http://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/languagebinding/api/java/python/streaming/PythonStreamer.html">streams</a>.</p>
<p>On Friday the presentation session ended with a 'Get Together', but one
should not forget to mention the (en)lightning talks just before it. I
pick two: Matthew Rocklin gave a short example to show the benefit
of <a href="http://pandas.pydata.org/pandas-docs/dev/categorical.html#categorical">pandas' dtype
category</a>,
and <a href="http://tdj.si">Tadej Štajner</a> gave a presentation about
<a href="http://tdj.si/automl_pydataberlin.pdf">automatic machine learning</a>
with some <a href="https://github.com/tadejs/autokit">code</a>. So how different
will future PyData events be if one takes one statement on the
<a href="http://automl.org">AutoML</a> site too seriously: 'taking the human expert
out of the loop'?</p>
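<p>Coming back to the category dtype: as far as I understand it, the benefit
comes from storing each distinct value only once and keeping small integer
codes per row. A plain-Python sketch of that idea follows; it is not pandas'
actual implementation, and <code>encode_categorical</code> is a hypothetical
helper.</p>

```python
def encode_categorical(values):
    """Return (categories, codes) so that categories[codes[i]] == values[i]."""
    categories = sorted(set(values))
    index = {cat: code for code, cat in enumerate(categories)}
    codes = [index[v] for v in values]
    return categories, codes


cities = ["Berlin", "Paris", "Berlin", "Berlin", "Paris"]
categories, codes = encode_categorical(cities)
# The original column can be rebuilt from the two compact pieces.
decoded = [categories[c] for c in codes]
```

<p>For a long column with few distinct strings, the integer codes take far
less memory than the repeated strings, and comparisons or group-by operations
can work on integers.</p>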
<p>In his keynote, Trent McConaghy not only presented how <a href="https://www.ascribe.io">ascribe</a>
uses Python modules, such as
<a href="https://github.com/ascribe/transactions">transactions</a> ('Bitcoin
for Humans'), but also gave an
interesting view of how blockchains and a suitable protocol
(<a href="https://github.com/ascribe/spool">Spool</a>) under the hood can help
e.g. artists keep control of the intellectual property of digital
objects.</p>
<p>The keynote by Felix Wick from <a href="http://www.blue-yonder.com">Blue
Yonder</a> contained a lot of
information, and not only about the technical things a data scientist might or
should know when doing e.g. <a href="http://www.gartner.com/it-glossary/predictive-analytics">predictive
analytics</a>. In
a data scientist's life it does no harm to know a bit about hype cycles,
but one should not care too much about them. You should really have a look at the
<a href="https://www.youtube.com/watch?v=Fo0Ne2pYWW4">video</a>, because it
is almost a lecture that gives a nice introduction to
'data science' in our Big Data world (be it a peak or not ...).</p>
<p>Peadar Coyle very nicely presented an interesting example of how <a href="http://nbviewer.ipython.org/format/slides/github/springcoil/Probabilistic_Programming_and_Rugby/blob/master/Bayesian_Rugby.ipynb#/">sports
analytics</a>,
in this case for the <a href="http://www.rbs6nations.com">6 Nations</a> rugby tournament,
can be done by applying Bayesian statistical models with
<a href="https://github.com/pymc-devs/pymc">PyMC</a>. Someone in the audience
mentioned that it could already be worth switching to the successor
<a href="https://github.com/pymc-devs/pymc3">PyMC 3</a> to perform Markov chain
Monte Carlo fitting.
<a href="http://nbviewer.ipython.org/github/aloctavodia/Doing_bayesian_data_analysis/blob/master/IPython/Kruschkes_Doing_Bayesian_Data_Analysis_in_PyMC3.ipynb">This</a>
notebook seems to be a good tutorial for learning or migration
purposes.</p>
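<p>For readers who have never seen Markov chain Monte Carlo, the core loop is
short. The following random-walk Metropolis sketch uses only the standard
library and has nothing to do with the PyMC API; the function names, the toy
target density and all parameter values are my own assumptions for
illustration.</p>

```python
import math
import random


def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Draw samples from an unnormalized 1-D density (random-walk Metropolis)."""
    rng = random.Random(seed)
    current = start
    current_logp = log_density(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_logp = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        log_ratio = proposal_logp - current_logp
        if log_ratio >= 0 or math.exp(log_ratio) > rng.random():
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


def log_normal_density(x):
    """Log of an unnormalized normal density with mean 3 and unit variance."""
    return -0.5 * (x - 3.0) ** 2


samples = metropolis(log_normal_density, start=0.0, n_samples=20000)
kept = samples[1000:]  # discard the first draws as burn-in
mean = sum(kept) / len(kept)
```

<p>The sample mean should end up close to 3. Packages like PyMC add model
specification, better samplers and diagnostics on top of this basic idea.</p>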
<p>The second presentation by Blue Yonder folks was an
introduction to
<a href="https://spark.apache.org/docs/latest/api/python/index.html">PySpark</a>
DataFrame objects, which are designed to make use of, for instance,
columnar data stored on <a href="http://hadoop.apache.org/">Hadoop</a> HDFS
clusters in <a href="http://parquet.apache.org">Parquet</a> format. The
conclusion was that, when the data fits on one machine, such an API for
distributed file access is too costly in terms of efficiency compared to
NumPy arrays or pandas DataFrames.</p>
<p><a href="http://albahnsen.com">Alejandro C. Bahnsen</a> presented his work
<a href="https://github.com/albahnsen/CostSensitiveClassification">CostCla</a>,
which exemplifies the use of scikit-learn classifiers on financial
topics like <a href="http://nbviewer.ipython.org/github/albahnsen/CostSensitiveClassification/blob/master/doc/tutorials/tutorial_edcs_credit_scoring.ipynb">credit
scoring</a>.</p>
<p>Brian Carter (IBM Software Group) gave an overview of web text mining
processes. In this field the <a href="http://www.nltk.org">NLTK</a> module seems
to be the standard for language processing with Python.</p>
<p>When dealing with the similarity of words, as Miguel Fernando Cabrera (TrustYou) did with hotel reviews, <a href="https://github.com/danielfrg/word2vec">word2vec</a> provides the metrics needed.</p>
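<p>The similarity between word vectors is usually measured by the cosine of
the angle between them. A minimal sketch without any word2vec dependency
follows; the three-dimensional vectors are made up for illustration and are
not real embeddings.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings", invented for this example only.
vectors = {
    "hotel": [0.9, 0.1, 0.0],
    "hostel": [0.8, 0.2, 0.1],
    "breakfast": [0.1, 0.9, 0.2],
}
sim_close = cosine_similarity(vectors["hotel"], vectors["hostel"])
sim_far = cosine_similarity(vectors["hotel"], vectors["breakfast"])
```

<p>With these toy numbers, 'hotel' comes out closer to 'hostel' than to
'breakfast'.</p>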
<p>I didn't attend the tutorial sessions, so it will help to have a look
at the respective videos if one wants to learn and better understand,
for example, how interactive graphics can be created with
<a href="http://bokeh.pydata.org">Bokeh</a>, e.g. within <a href="http://ipython.org/notebook.html">IPython
notebooks</a>, how
<a href="https://docs.docker.com">Docker</a> fits in between tools like
virtualenv and 'normal' virtual machines, and what one could use
instead of
<a href="https://ipython.org/ipython-doc/dev/interactive/magics.html?highlight=timeit#magic-timeit">%timeit</a>
for code profiling.</p>
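<p>One option that ships with Python itself is <code>cProfile</code>. I don't
know whether this is what the tutorial recommended; the sketch below only
shows how a per-function report can be collected, and <code>slow_sum</code>
is a made-up function to have something to measure.</p>

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive function, just to have something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Collect the report in a string instead of printing it to stdout.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

<p>Unlike %timeit, which gives one timing for a whole statement, the report
lists call counts and times per function.</p>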
<p>Someone referred to the
<a href="https://amplab.cs.berkeley.edu/projects/velox">velox</a> model server,
and I also noted <a href="https://github.com/spotify/luigi">Luigi</a> for
creating pipelines of batch jobs. If you remember the context,
talk or discussion in which these were mentioned, feel free to message me
<a href="https://twitter.com/RaPrism">@RaPrism</a>.</p>
                </article>
            </aside><!-- /#featured -->
                <section id="content" class="body">
                    <h1>Other articles</h1>
                    <ol id="posts-list" class="hfeed">

            <li><article class="hentry">
                <header>
                    <h1><a href="./test-post.html" rel="bookmark"
                           title="Permalink to Trapattoni '98">Trapattoni '98</a></h1>
                </header>

                <div class="entry-content">
<footer class="post-info">
        <span>Sun 17 May 2015</span>
<span>| tags: <a href="./tag/pelican.html">pelican</a><a href="./tag/publishing.html">publishing</a></span>
</footer><!-- /.post-info -->                <p>This is a test page.</p>
                <a class="readmore" href="./test-post.html">read more</a>
                </div><!-- /.entry-content -->
            </article></li>
            </ol><!-- /#posts-list -->
<p class="paginator">
    Page 1 / 1
</p>
            </section><!-- /#content -->
        <section id="extras" class="body">
                <div class="blogroll">
                        <h2>blogroll</h2>
                        <ul>
                            <li><a href="http://planetpython.org/">Planet Python</a></li>
                        </ul>
                </div><!-- /.blogroll -->
                <div class="social">
                        <h2>social</h2>
                        <ul>
                            <li><a href="https://raprism.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate">atom feed</a></li>

                            <li><a href="https://github.com/prismv">GitHub #source</a></li>
                            <li><a href="https://github.com/RaPrism">GitHub #publish</a></li>
                            <li><a href="https://twitter.com/RaPrism">Twitter</a></li>
                        </ul>
                </div><!-- /.social -->
        </section><!-- /#extras -->

        <footer id="contentinfo" class="body">
                <p>Powered by <a href="http://getpelican.com/">Pelican</a>. Theme <a href="https://github.com/blueicefield/pelican-blueidea/">blueidea</a>, inspired by the default theme.</p>
        </footer><!-- /#contentinfo -->

</body>
</html>