<!DOCTYPE html>
<html prefix="            og: http://ogp.me/ns# article: http://ogp.me/ns/article#     " vocab="http://ogp.me/ns" lang="en">
<head>
<meta charset="utf-8">
<meta name="description" content="Data-related analyses, tools, and news.">
<meta name="viewport" content="width=device-width">
<title>Tanya Schlusser</title>
<link href="assets/css/custom.css" rel="stylesheet" type="text/css">
<meta name="theme-color" content="#5670d4">
<meta name="generator" content="Nikola (getnikola.com)">
<link rel="alternate" type="application/rss+xml" title="RSS" href="rss.xml">
<link rel="canonical" href="https://tanyaschlusser.github.io/">
<link rel="apple-touch-icon" href="apple-touch-icon.png" sizes="180x180">
<link rel="icon" href="favicon-32x32.png" sizes="32x32">
<link rel="icon" href="favicon-16x16.png" sizes="16x16">
<link rel="icon" href="favicon.ico" sizes="48x48 32x32 16x16">
<link rel="manifest" href="site.webmanifest">
<link rel="mask-icon" href="safari-pinned-tab.svg" color="#1f91c2">
<meta name="msapplication-TileColor" content="#00aba9">
<meta name="theme-color" content="#cceeff">
<!-- favicons generated using http://realfavicongenerator.net/ --><!--[if lt IE 9]><script src="assets/js/html5shiv-printshiv.min.js"></script><![endif]--><link rel="prefetch" href="posts/whats-so-great-about-knime/" type="text/html">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.10.0-beta/katex.min.css" integrity="sha256-sI/DdD47R/Sa54XZDNFjRWlS+Dv8MC5xfkqQLRh0Jes=" crossorigin="anonymous">
</head>
<body>
    <a href="#content" class="sr-only sr-only-focusable">Skip to main content</a>
         
    <header id="header" class="hidden-print"><nav id="menu"><a href="https://tanyaschlusser.github.io/" title="Tanya Schlusser" rel="home">

        <svg viewbox="0 0 120 20" xmlns="http://www.w3.org/2000/svg"><desc>Tanya Schlusser</desc><text id="blog-title" y="15" transform="scale(1, 1.1)">Tanya Schlusser</text></svg></a>

    <ul>
<li><a href="projects/">Projects</a></li>
                <li><a href="slides/">Slides</a></li>
                <li><a href="archive.html">Archive</a></li>
    
    
    </ul></nav></header><main id="content"><div class="postindex">
    <article class="h-entry post-text"><header><h1 class="p-name entry-title"><a href="posts/whats-so-great-about-knime/" class="u-url">What's so great about Knime?</a></h1>
        <span class="metadata">
            <time datetime="2019-01-08T00:00:42-05:00">08 Jan 2019</time><span>Filed under <a class="tag p-category" href="categories/cat_tools/" rel="category">Tools</a>.</span>
        </span>
    </header><div class="p-summary entry-summary">
    <div>
<p>Last March (for the fifth time, according to
<a href="https://www.forestgt.com.au/latest-news/2018/3/2/knime-2018-gartner-magic-quadrant-market-leader">Forest Grove Technology</a>),
the <a href="https://www.knime.com/">Knime Analytics platform</a> was named a Gartner Magic Quadrant
leader. This year's other leaders are <a href="https://www.alteryx.com/">Alteryx</a>,
<a href="https://www.sas.com/">SAS</a>, <a href="https://rapidminer.com/">RapidMiner</a>,
and <a href="https://www.h2o.ai/">H2O.ai</a>.
The best thing I learned from the announcement? Knime is open source,
and free for individual users—I can afford to look at it!</p>
<p>Knime (silent "k"; rhymes with "dime") provides a graphical user interface
to chain together blocks that represent steps in a data science workflow.
(Think Pentaho or Informatica, but for machine learning,
or <a href="http://www.ni.com/en-us/shop/labview.html">LabVIEW</a> if you have an
engineering background.)</p>
<p>It has dozens of built-in data access and transformation functions,
statistical inference and machine learning algorithms,
<a href="https://sourceforge.net/projects/pmml/">PMML</a> support,
custom <a href="https://www.knime.com/blog/blending-knime-and-python">Python</a>,
<a href="https://www.knime.com/nodeguide/scripting/java/example-of-java-snippet">Java</a>,
<a href="https://www.knime.com/nodeguide/scripting/r/example-of-r-snippet">R</a>,
and <a href="https://informationentropy.wordpress.com/2016/03/24/programming-nodes-in-knime-with-scala-part-i/">Scala</a> nodes,
a <a href="https://www.knime.com/nodeguide">zillion other nodes</a>,
and community plugins (since it's open source, anyone can
<a href="https://www.knime.com/developer/example/extension-wizard">make a plugin</a>).
Even better, Knime imposes structure and modularity
on a data science workflow by requiring that code fit into specified building blocks.</p>
<p>This post implements the Bayesian NFL model from
<a href="https://tanyaschlusser.github.io/posts/bayesian-updating-and-the-nfl/">last month</a>
in Knime.
It adds the upstream and downstream workflows to pull new data each week and
write the model output to a spreadsheet: enough for a first look at this tool.</p>
<p class="more"><a href="posts/whats-so-great-about-knime/">Read more</a></p>
</div>    <hr>
</div>
    </article><article class="h-entry post-text"><header><h1 class="p-name entry-title"><a href="posts/bayesian-updating-and-the-nfl/" class="u-url">Bayesian updating and the NFL</a></h1>
        <span class="metadata">
            <time datetime="2018-09-09T00:00:42-05:00">09 Sep 2018</time><span>Filed under <a class="tag p-category" href="categories/cat_methods/" rel="category">Methods</a>.</span>
        </span>
    </header><div class="p-summary entry-summary">
    <div tabindex="-1" id="notebook" class="border-box-sizing">
    <div class="container" id="notebook-container">

<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It's football season again, hooray! Every year for my friends' football pool I try out a different algorithm. Invariably, my picks are around 60% accurate. Not terrible, but according to NFL Pickwatch (<a href="https://web.archive.org/web/20180811195907/http://nflpickwatch.com/">archive</a>, <a href="https://nflpickwatch.com/">current season</a>), the best pickers get to 68 or 69%. So, an amazing performance—my upper bound—is just under 70%, and the lower bound for a competitive model—the FiveThirtyEight baseline—is 60%.</p>
<p>I've been modeling NFL outcomes for a couple of years, and running linear (predicting point spread) and logistic (predicting win probability) regressions given various team and player data. My best year so far incorporated the Vegas spread into the model, and my biggest disaster so far was an aggressive lasso model on every player in every offensive line, with team defenses lumped as a group. Attempting to track <a href="https://www.pro-football-reference.com/players/injuries.htm">injuries</a>, suspensions, and other changes to the starting lineup was not sustainable for the amount of time I wanted to spend.</p>
<p>Enter Nate Silver's awesome <span class="vocabulary" title="Arpad Elo was a Hungarian-American physics professor who invented the system to rank chess players. Silver adapted it for football, baseball, and most of the other sports on FiveThirtyEight."><a href="https://fivethirtyeight.com/features/how-our-2017-nfl-predictions-work/">NFL Elo rankings</a></span>, the aspirational target for this year. What's impressive is that he gets something like 60% accuracy out of literally no information but home field advantage and past scores. I particularly love that it updates weekly to incorporate the new information. This immediately says "Bayesian" and is, in fact, a lot like how people make their picks by intuition anyway. A system like his, but with a more straightforward Bayesian model, is the goal of this post.</p>
<p class="more"><a href="posts/bayesian-updating-and-the-nfl/">Read more</a></p>
</div>
</div>
</div>
</div>
</div>    <hr>
</div>
    </article><article class="h-entry post-text"><header><h1 class="p-name entry-title"><a href="posts/property-tax-cook-county/" class="u-url">Modeling property tax assessment in Cook County, IL</a></h1>
        <span class="metadata">
            <time datetime="2018-08-19T00:00:42-05:00">19 Aug 2018</time><span>Filed under <a class="tag p-category" href="categories/cat_data/" rel="category">Data</a>.</span>
        </span>
    </header><div class="p-summary entry-summary">
    <div tabindex="-1" id="notebook" class="border-box-sizing">
    <div class="container" id="notebook-container">

<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The year my Mom moved in down the street from us, my husband tried to get some local property tax appeal company to reduce her assessment. They refused, saying they thought there wasn't a case.</p>
<p>The next year, she got a postcard from that same company: they would appeal her case and split the savings with her 50/50. Who wants to give up 50% of their tax savings? Plus, I was still miffed about the prior year. I decided to try the appeal myself. Success!</p>
<p><span class="vocabulary" title="A tool for automated testing of web applications, written in Java. I use the Python bindings."><a href="https://selenium-python.readthedocs.io/">Selenium</a></span> via Python bindings was used to pull the data from the web, and <a href="https://www.statsmodels.org">statsmodels</a>, with an interface that resembles R, was used to make the model.
</p>
<p class="more"><a href="posts/property-tax-cook-county/">Read more</a></p>
</div>
</div>
</div>
</div>
</div>    <hr>
</div>
    </article><article class="h-entry post-text"><header><h1 class="p-name entry-title"><a href="posts/mcmc-and-the-ising-model/" class="u-url">MCMC and the Ising Model</a></h1>
        <span class="metadata">
            <time datetime="2018-07-29T00:00:42-05:00">29 Jul 2018</time><span>Filed under <a class="tag p-category" href="categories/cat_methods/" rel="category">Methods</a>.</span>
        </span>
    </header><div class="p-summary entry-summary">
    <div tabindex="-1" id="notebook" class="border-box-sizing">
    <div class="container" id="notebook-container">

<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><span class="vocabulary" title="A sequence of random outcomes in which each step depends only on the step immediately before it, not on the rest of the history.">Markov chain</span>
<span class="vocabulary" title="A computer simulation technique using pseudo-random numbers to simulate random events.">Monte Carlo</span> (MCMC) methods are a class of numerical techniques used in Bayesian statistics. They numerically estimate the distribution of a variable (the <span class="vocabulary" title="The prior times the likelihood, normalized, is the posterior distribution: the probability distribution of the target variable after incorporating the observed data.">posterior</span>) given two other distributions: the <span class="vocabulary" title="A distribution that represents existing knowledge of a system. Often people choose a uniform (flat) distribution, or else the known conjugate prior for the chosen likelihood.">prior</span> and the <span class="vocabulary" title="A special name for the probability mass (or density) function when you fix the observed data (e.g. `x`) and treat it as a function of the parameters (e.g. `mu` and `theta`). It's renamed 'likelihood' just to make that swap explicit when talking about it. The integral over the parameters may not equal one so you have to normalize.">likelihood function</span>, and are useful when direct integration of the prior times the likelihood is not tractable.</p>
<p>I am new to Bayesian statistics, but became interested in the approach partly from exposure to the <a href="posts/mcmc-and-the-ising-model/">PyMC3 library</a>, and partly from FiveThirtyEight's promoting it in a <a href="https://fivethirtyeight.com/features/statisticians-found-one-thing-they-can-agree-on-its-time-to-stop-misusing-p-values/">commentary</a> soon after the p-hacking scandals a few years back (<a href="https://www.ncbi.nlm.nih.gov/pubmed/22006061">Simmons et al.</a> coined 'p-hacking' in 2011, and <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359000">Head et al.</a> quantified the scale of the issue in 2015).</p>
<p>Until the 1980s, it was not realistic to use Bayesian techniques except when analytic solutions were possible. (Here's Wikipedia's <a href="https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions">list of analytic options</a>. They're still useful.) MCMC opens up more options.</p>
<p>The Python library <a href="https://docs.pymc.io/">PyMC3</a> provides a suite of modern Bayesian tools: both MCMC algorithms and variational inference. One of its core contributors, Thomas Wiecki, wrote a blog post entitled <a href="https://twiecki.github.io/blog/2015/11/10/mcmc-sampling/">MCMC sampling for dummies</a>, which was the inspiration for this post. It was enthusiastically received, and cited by people I follow as the best available explanation of MCMC. To my dismay, I didn't understand it, probably because he comes from a stats background and I come from engineering. This post is for people like me.</p>
<p class="more"><a href="posts/mcmc-and-the-ising-model/">Read more</a></p>
</div>
</div>
</div>
</div>
</div>    <hr>
</div>
    </article>
</div>

<aside class="bio">
    Thank you for visiting! <span class="red">❤ </span> I'm a developer based in
    <a href="https://chipy.org" title="My home Python user group!">Chicago</a>.
    I'm not working right now because my Mom is sick, but am happy to connect.
    You can find me on
    <a href="https://www.linkedin.com/in/tanyatickel" title="Say you read my blog so I know you're not a robot.">LinkedIn</a> or
    <a href="https://github.com/tanyaschlusser" title="Where this site is hosted...">GitHub</a>.
</aside><script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.10.0-beta/katex.min.js" integrity="sha256-mxaM9VWtRj1wBtn50/EDUUe4m3t39ExE+xEPyrxVB8I=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.10.0-beta/contrib/auto-render.min.js" integrity="sha256-9uFJqVHnc71lPswxPcpJP49zqhdqp7DFqX68yHs358I=" crossorigin="anonymous"></script><script>
                renderMathInElement(document.body,
                    {
                        
delimiters: [
    {left: "$$", right: "$$", display: true},
    {left: "\\[", right: "\\]", display: true},
    {left: "$", right: "$", display: false},
    {left: "\\(", right: "\\)", display: false}
]

                    }
                );
            </script></main><footer id="footer"><p class="light-sans">© Tanya Schlusser · Subscribe via <a href="rss.xml">RSS</a> · Powered by <a href="https://getnikola.com" rel="nofollow">Nikola</a> using a <a href="https://github.com/tanyaschlusser/tanyaschlusser.github.io/tree/src">custom theme</a> </p>
            
        </footer><!-- Global site tag (gtag.js) - Google Analytics --><script async src="https://www.googletagmanager.com/gtag/js?id=UA-130209649-2"></script><script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-130209649-2');
</script>
</body>
</html>