2014-06

My First Data Science Project

What are your motivations to do a Data Science Project?

  • Improved job satisfaction

  • Become more efficient at:

    • finding better questions to ask of the data
    • determining the ROI of business actions
    • improving the accuracy of current processes
    • responding quickly to market changes
    • finding correlations between events
    • Knowing "about" data science versus "knowing" data science.
    • Improve technical chops.
    • Curiosity.
    • Tell the difference between cats and dogs

How to identify a problem that can be solved by a Data Science project?

  • Has a lot of diverse data.
  • Problems that were left unsolved/solved unsatisfactorily for the lack of computing power/tools.
  • Problem has to lead to a solution or a course of action/decision/product.
  • Identifying data variables that have influence on decision making.
  • Talk to people. Investigate and query them: "Ask the right questions to the right people".
  • Start from the data and identify interesting questions from it: "What questions can we answer, given this data?"
  • Convert qualitative values to quantitative values (e.g., whiskey reviews to star ratings).
  • Tell a Compelling Story from data
  • Validate the "truth" of existing BI/reporting systems.
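
The "whiskey reviews to star ratings" idea above can be sketched in a few lines. This is only a toy illustration, assuming a hand-made word lexicon: the words, their weights, the `review_to_stars` name, and the sample reviews are all hypothetical, and a real project would likely learn the mapping from labeled data instead.

```python
import re

# Hypothetical lexicon: words and weights are made up for illustration.
LEXICON = {
    "smooth": 1.0, "rich": 1.0, "balanced": 1.0, "excellent": 2.0,
    "harsh": -1.0, "thin": -1.0, "bland": -1.0, "awful": -2.0,
}

def review_to_stars(text, lexicon=LEXICON):
    """Map a free-text review to a 1-5 star rating by summing word weights."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(word, 0.0) for word in words)
    # Text with no known words lands on 3 stars; clamp the result to 1..5.
    return max(1, min(5, round(3 + score)))

# Made-up reviews, not real data.
reviews = [
    "Smooth and rich, an excellent dram.",
    "Harsh, thin and bland.",
    "Arrived on a Tuesday.",
]
stars = [review_to_stars(r) for r in reviews]
print(stars)
```

Once reviews are numbers, they can be averaged, compared across products, or fed into a statistical model, which is the point of the conversion.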

A Data Science project may have one or more of these:

  1. Integrates data sources that were ignored before.
  2. Uses one or more statistical methods.

What is not a Data Science Project?

  1. There is no quantifiable aspect to the data.

  2. There is not enough data to have confidence in your results.

  3. It is a SQL query (or set of queries) alone.

  4. There are no statistics beyond aggregation (BI).

  5. It does not question the credibility and generalizability of the results, and is concerned solely with reporting existing data.

  6. It uses ONLY a prepackaged, single-purpose package.
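
To make the difference between plain aggregation and questioning the credibility of a result concrete, here is a minimal sketch that reports a mean together with a percentile-bootstrap confidence interval. The bootstrap is just one possible technique, chosen here for illustration; the `bootstrap_mean_ci` helper and the `daily_orders` numbers are hypothetical.

```python
import random
import statistics

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up data for illustration only.
daily_orders = [12, 15, 9, 22, 14, 13, 30, 11, 16, 10]

mean = statistics.fmean(daily_orders)      # the BI-style aggregate
low, high = bootstrap_mean_ci(daily_orders)  # how far that aggregate can be trusted
print(f"mean = {mean:.1f}, 95% CI roughly [{low:.1f}, {high:.1f}]")
```

A wide interval on a small sample is a signal that there is not yet enough data to have confidence in the result.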


What are the traits of the people involved in your Data Science Project?

  • your therapist

  • Experts with domain knowledge

  • Willingness to share (just enough)

  • bottom-up experience in the field (+ve trait)

  • ability to communicate

  • identify value

  • explaining the meaning of results (eg: spurious results)

  • "T-shaped people" (from Valve employee handbook).

  • Data Engineer - (pull data, data cleaning, ETL)

  • Project Sponsor

  • Strong will

  • political capital

  • knows how to set expectations

  • good prioritization ability

  • upselling

  • championing the cause/getting the buy-in

  • Data Artist

    • Visualization,
    • Narration (eg: Beautiful Evidence by Tufte),
    • Nate Silver
    • Upshot
    • Mike Bostock (D3 creator)
  • Statistician/Machine Learning expert (Math+Stats background)

  • Sanity -- reasonable generalizations, soundness of the models

  • actual models

  • ability to identify the usefulness of the data

  • methods -- A/B testing, etc.

  • Programmer traits

    • GUI
    • Gluing systems together - scripting ability
    • Reporting
    • Productionizing
      • covering edge cases
      • efficiency
      • scalability
      • documentation
      • product life cycle / maintenance
      • testing
      • version management
  • Data science product engineer(!)
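
As an illustration of the A/B testing method listed among the statistician's skills above, here is a minimal sketch of a two-proportion z-test using only the standard library. It assumes large samples and a pooled standard error; the function name and the conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 100 of 1000 visitors converted on A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone, but judging whether the effect is large enough to matter still needs domain knowledge.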

What will you need to start a data science project?

What does your toolkit look like?

Version Control systems:

  • git
  • mercurial

Delivery

  • Data visualization toolkit
  • Reporting tools

Model discovery and generation

  • Exploratory data analysis

ETL tools

Data storage/management

Glue programming

Dev/System environment setup

  • VMs

Infrastructure

  • Cloud based services
  • High end hardware (GPU)
  • Network (pipeline)

Programming languages

  • Libraries/toolkits

Project Management tools

  • Trello
  • Jira

Documentation

Testing and Continuous integration

What will you have to show at the end of your project?

  • Delivery
  • dynamic documents -- R/Shiny, JavaScript

Related topics to discuss in future meetups:

  • Trends – why DS?

  • Applications in various industries

  • Using Python vs R? Why not together?

  • What is a good data scientist?

  • What is the minimum level of competence required in:

    • Mathematics
    • Statistics
    • Programming
    • Visualizations
    • Project management – setting expectations
    • Product lifecycle
  • Resources

    • Classes

    • Datasets

    • Tools

    • Books

      • Practical Data Science with R
    • Competitions