<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="generator" content="Pelican" />
<title>Insights from IEEE Big Data 16</title>
<link rel="stylesheet" href="/theme/css/main.css" />
<meta name="description" content="I have attended the IEEE Big Data 16 conference in Washington DC. I thank my company for sponsoring the trip. The conference included a special..." />
</head>
<body id="index" class="home">
<header id="banner" class="body">
<h1><a href="/">Marco Santoni</a></h1>
<nav><ul>
<li><a href="/pages/about.html">about</a></li>
<li><a href="/pages/bookshelf.html">bookshelf</a></li>
<li class="active"><a href="/category/posts.html">posts</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header>
<h1 class="entry-title">
<a href="/ieee_big_data_16.html" rel="bookmark"
title="Permalink to Insights from IEEE Big Data 16">Insights from IEEE Big Data 16</a></h1>
</header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2016-12-26T16:22:00+01:00">
Published: Mon 26 December 2016
</abbr>
<address class="vcard author">
By <a class="url fn" href="/author/marco-santoni.html">Marco Santoni</a>
</address>
<p>In <a href="/category/posts.html">posts</a>.</p>
</footer><!-- /.post-info --> <p>I attended the IEEE Big Data 16 conference in Washington, DC, and I thank my company for sponsoring the trip. The conference included a <a href="http://cci.drexel.edu/bigdata/bigdata2016/SpecialSymposium.html">special symposium</a> dedicated to manufacturing, which hosted some participants of the <a href="https://www.kaggle.com/c/bosch-production-line-performance">Bosch Production Line Performance</a> competition on Kaggle.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">2016 IEEE International Conference on Big Data kicked off today in Washington, DC. Share highlights w/ hashtag <a href="https://twitter.com/hashtag/IEEEBigData16?src=hash">#IEEEBigData16</a> & we’ll RT!</p>— IEEE Big Data (@ieeebigdata) <a href="https://twitter.com/ieeebigdata/status/805799488128425984">December 5, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Here are a few notes I took during the conference.</p>
<ul>
<li><strong>Stream Processing.</strong> I heard about today's most popular architectures, and I highly recommend reading the blog posts by their authors:<ul>
<li><a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html">Lambda architecture</a></li>
<li><a href="https://www.oreilly.com/ideas/questioning-the-lambda-architecture">Kappa architecture</a></li>
</ul>
</li>
<li><strong>K-Spectral Centroid.</strong> The K-Spectral Centroid algorithm clusters time series by their shape, and finds the most representative shape (the cluster centroid) for each cluster.</li>
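<li><strong>K-D Tree Partitioning.</strong> An algorithm for space partitioning: a k-d tree recursively splits a set of k-dimensional points along one coordinate axis at a time.</li>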
<li><strong>Database Decay.</strong> Interesting keynote by Michael Stonebraker. In short, large applications often share a centralized database used by different groups within a company. From the DBA's point of view:<ul>
<li>High risk: changing the DB schema means finding every application across the company that depends on it and updating each one accordingly (do I have the budget for that?).</li>
<li>Low risk: leave the schema unchanged and work around the problem in the data.</li>
<li>Claim: DBAs want to lower the risk &rarr; no change in schema &rarr; the ER diagram diverges from reality &rarr; database decay.</li>
<li>At some point, a total rewrite is the only way forward.</li>
<li>If you work in analytics and pull data from an operational DB, you notice that the data gets dirtier and dirtier.</li>
</ul>
</li>
<li><strong>PMML Scoring Engine.</strong> Max Ferguson introduced the Predictive Model Markup Language (PMML). If you train a model and want to use it in a different application, PMML is a standard that defines how to store the model as XML.</li>
<li><strong>Uncertainty in RFs.</strong> Random Forests can express uncertainty. One just needs to look at the distribution of predictions across the decision trees of the model: the wider the spread, the less certain the prediction.</li>
<li><strong>Bosch.</strong> Rumi Ghosh introduced the data science team at Bosch.<ul>
<li>Insight from production plants: plant managers prefer interpretable models (logistic regression or decision trees) over black-box models.</li>
<li>Research directions:<ul>
<li>Root cause analysis (via Bayesian inference)</li>
<li>Class imbalance</li>
</ul>
</li>
</ul>
</li>
<li><strong>3 Approaches in Kaggle Competition.</strong> <a href="https://www.kaggle.com/bpavlyshenko">Bohdan Pavlyshenko</a> gave a talk on the three approaches he explored during the Kaggle competition on failure detection:<ul>
<li>Pure machine learning approach: two levels of model ensembling, a pure black box.</li>
<li>Generalized Linear Model with Lasso regularization: informative about the impact of each feature.</li>
<li>Bayesian model in BUGS: it yields an estimate of the probability distribution of each coefficient.</li>
</ul>
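<p>To illustrate why the Lasso-regularized model in the list above is informative about feature impact, here is a minimal sketch in Python. It is not the model from the talk: it fits a plain linear model (the Gaussian case of a GLM) by cyclic coordinate descent on made-up data, and the names <code>soft_threshold</code> and <code>lasso_coordinate_descent</code>, as well as the penalty <code>alpha=0.1</code>, are assumptions chosen for the example.</p>

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within [-threshold, threshold] becomes 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1, one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the residual that ignores feature j
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            if norm == 0.0:
                continue  # an all-zero feature carries no information
            w[j] = soft_threshold(rho / n, alpha) / (norm / n)
    return w


# Made-up data: the target depends on the first two features only.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [2.0 * row[0] - 1.0 * row[1] + rng.gauss(0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
print([round(w, 2) for w in weights])
```

<p>On this toy data the coefficient of the third, uninformative feature should shrink to exactly zero, while the first two stay close to their true values (slightly smaller in magnitude because of the penalty). The surviving coefficients are what makes the model readable.</p>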
</li>
<li><strong>FTRL.</strong> Follow the Regularized Leader: an online learning algorithm, presented here as a feature engineering method to convert all categorical features into one numerical feature.</li>
<li><strong>CRF.</strong> Conditional Random Fields are a class of predictive models used when the dataset is represented as a graph: each node is a sample with a feature vector X and a target variable y.</li>
</ul>
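<p>The note on uncertainty in Random Forests can be made concrete with a small sketch. This is a toy stand-in, not a real random forest library: it bags one-split regression trees (stumps) on made-up one-dimensional data, and all function names (<code>fit_stump</code>, <code>fit_forest</code>, <code>tree_predictions</code>) are hypothetical. The point is only to show that the spread of the per-tree predictions can serve as an uncertainty signal.</p>

```python
import random
import statistics


def fit_stump(sample):
    """Fit a one-split regression tree (a stump) to (x, y) pairs by least squares."""
    sample = sorted(sample)
    best = None  # (squared error, threshold, left mean, right mean)
    for i in range(1, len(sample)):
        if sample[i - 1][0] == sample[i][0]:
            continue  # no threshold fits between two equal x values
        left = [y for _, y in sample[:i]]
        right = [y for _, y in sample[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            threshold = (sample[i - 1][0] + sample[i][0]) / 2
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every x in the sample is identical: predict the overall mean everywhere.
        mean = sum(y for _, y in sample) / len(sample)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    """Return the mean of the side of the split that x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_forest(data, n_trees, seed):
    """Train each stump on a bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]


def tree_predictions(forest, x):
    """Collect one prediction per tree instead of averaging them away."""
    return [predict_stump(stump, x) for stump in forest]


# Toy data: a step from 0 to 1 between x = 9 and x = 10.
data = [(x, 0.0 if x < 10 else 1.0) for x in range(20)]
forest = fit_forest(data, n_trees=100, seed=0)

for query in (2, 9.5):
    predictions = tree_predictions(forest, query)
    print(query, statistics.mean(predictions), statistics.pstdev(predictions))
```

<p>Far from the step the stumps should agree, so the spread stays close to zero. Next to the step they split between 0 and 1, and the larger standard deviation flags the prediction as uncertain.</p>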
</div><!-- /.entry-content -->
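<p>For the K-D tree note above, here is a minimal sketch of how such a partition can be built. It uses the common textbook construction (split at the median, cycling through the axes by depth), which is an assumption rather than the specific variant shown at the conference. The function name <code>build_kdtree</code> and the sample points are made up for illustration.</p>

```python
def build_kdtree(points, depth=0):
    """Recursively split points at the median along one axis per tree level."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    points = sorted(points, key=lambda point: point[axis])
    median = len(points) // 2
    return {
        "point": points[median],  # the splitting point stored at this node
        "axis": axis,
        "left": build_kdtree(points[:median], depth + 1),       # smaller along axis
        "right": build_kdtree(points[median + 1:], depth + 1),  # greater or equal
    }


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)

print(tree["point"])           # (7, 2): root, splits the plane on x
print(tree["left"]["point"])   # (5, 4): splits the left half on y
print(tree["right"]["point"])  # (9, 6): splits the right half on y
```

<p>Each level halves the remaining points, so the tree stays balanced and every node owns an axis-aligned region of the space.</p>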
</article>
</section>
<section id="extras" class="body">
<div class="social">
<h2>social</h2>
<ul>
<li><a href="https://linkedin.com/in/msantoni">linkedin</a></li>
<li><a href="https://twitter.com/mrsantoni">twitter</a></li>
</ul>
</div><!-- /.social -->
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="https://getpelican.com/">Pelican</a>, which takes great advantage of <a href="https://www.python.org/">Python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="https://www.smashingmagazine.com/2009/08/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
</body>
</html>