What are your motivations for doing a Data Science project?
- Improved job satisfaction
- Become more efficient at:
  - Finding better questions to ask of the data
  - Determining the ROI on business actions
  - Improving the accuracy of current processes
  - Responding quickly to market changes
  - Finding correlations between events
- Moving from knowing "about" data science to "knowing" data science.
- Improve technical chops.
- Curiosity.
- Tell the difference between cats and dogs
How to identify a problem that can be solved by a Data Science project?
- Has a lot of diverse data.
- Problems that were left unsolved, or solved unsatisfactorily, for lack of computing power or tools.
- Problem has to lead to a solution or a course of action/decision/product.
- Identifying data variables that have influence on decision making.
- Talk to people and investigate: "ask the right questions of the right people".
- Start from the data and identify interesting questions from it: "what questions can we answer, given this data?"
- Convert qualitative values to quantitative values (e.g., whiskey reviews to star ratings).
- Tell a Compelling Story from data
- Validate the "truth" of existing BI/reporting systems.
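The whiskey-reviews-to-star-ratings idea above can be sketched in a few lines of Python. This is only an illustration: the `WORD_SCORES` lexicon and the `review_to_stars` function are hypothetical names invented here, and a real project would more likely fit a model to labelled reviews than hand-pick words.

```python
# Hypothetical hand-picked lexicon (an assumption for illustration only).
WORD_SCORES = {
    "smooth": 1, "rich": 1, "balanced": 1,
    "harsh": -1, "bland": -1, "thin": -1,
}

def review_to_stars(review: str) -> int:
    """Turn a free-text review into a 1-5 star rating."""
    # Lowercase and strip simple punctuation before splitting into words.
    cleaned = review.lower().replace(",", " ").replace(".", " ")
    score = sum(WORD_SCORES.get(word, 0) for word in cleaned.split())
    # Start from a neutral 3 stars and clamp the result to the 1..5 range.
    return max(1, min(5, 3 + score))

if __name__ == "__main__":
    print(review_to_stars("Smooth and rich, well balanced."))
    print(review_to_stars("Harsh and thin."))
```

Once reviews are numbers, they can be aggregated, compared, and fed to statistical methods like any other quantitative variable.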
A Data Science project may do one or more of the following:
- Integrate data sources that were ignored before
- Use one or more statistical methods
What is not a Data Science Project?
- There is no quantifiable aspect to the data.
- There is not enough data to have confidence in your results.
- It is a SQL query (or set of queries) alone.
- There are no statistics beyond aggregation (that is BI).
- It does not question the credibility and generalizability of the results, and concerns itself solely with reporting existing data.
- It uses ONLY a prepackaged, single-purpose package.
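The point about not having enough data to trust your results can be made concrete with a confidence interval. The sketch below is a rough illustration, not a recommended procedure: `mean_confidence_interval` is a hypothetical helper, and it assumes a normal approximation with z = 1.96 (for very small samples a t-distribution would be more appropriate).

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval for the mean of `values`.

    Assumes a normal approximation (z = 1.96), which is an
    illustrative simplification.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - z * sem, mean + z * sem

if __name__ == "__main__":
    small = [1, 2, 3, 4, 5]
    large = small * 20  # same spread of values, 20 times as many observations
    print(mean_confidence_interval(small))
    print(mean_confidence_interval(large))
```

With the same spread of values, the interval narrows as the number of observations grows, which is one way to judge whether there is enough data to act on.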
What are the traits of the people involved in your Data Science Project?
- Your therapist
- Experts with domain knowledge
- Willingness to share (just enough)
- Bottom-up experience in the field (a positive trait)
- Ability to communicate
- Ability to identify value
- Explaining the meaning of results (e.g., spurious results)
- "T-shaped people" (from the Valve employee handbook)
- Data Engineer (pulling data, data cleaning, ETL)
- Project Sponsor
- Strong will
- Political capital
- Knows how to set expectations
- Good prioritization ability
- Upselling
- Championing the cause/getting the buy-in
- Data Artist
  - Visualization
  - Narration (e.g., Beautiful Evidence by Tufte)
  - Nate Silver
  - Upshot
  - Mike Bostock (D3 creator)
- Statistician/Machine Learning expert (Math+Stats background)
- Sanity: reasonable generalizations, soundness of the models
- Actual models
- Ability to identify the usefulness of the data
- Methods: A/B testing, etc.
- Programmer traits
  - GUI
  - Gluing systems together (scripting ability)
  - Reporting
  - Productionizing
  - Covering edge cases
  - Efficiency
  - Scalability
  - Documentation
  - Product life cycle/maintenance
  - Testing
  - Version management
- Data science product engineer(!)
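A/B testing, listed above among the statistical methods, is often analysed with a two-proportion z-test. The following is a minimal sketch under that assumption; `two_proportion_z_test` and its parameter names are hypothetical, and the counts in the usage example are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of conversions in variants A and B.
    n_a, n_b: number of visitors in variants A and B.
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

if __name__ == "__main__":
    # Made-up counts: 100/1000 conversions for A, 150/1000 for B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be chance alone; judging whether it is large enough to matter still calls for the domain knowledge described in the list above.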
What will you need to start a data science project?
What does your toolkit look like?
Version Control systems:
- git
- mercurial
Delivery
- Data visualization toolkit
- Reporting tools
Model discovery and generation
- Exploratory data analysis
ETL tools
Data storage/management
Glue programming
Dev/System environment setup
- VMs
Infrastructure
- Cloud-based services
- High-end hardware (GPU)
- Network (pipeline)
Programming languages
- Libraries/toolkits
Project Management tools
- Trello
- Jira
Documentation
Testing and Continuous integration
What will you have to show at the end of your project?
- Delivery
- Dynamic documents: R/Shiny, JavaScript
Related topics to discuss in future meetups:
- Trends: why DS?
- Applications in various industries
- Using Python vs. R? Why not together?
- What is a good data scientist?
- What is the minimum level of competence required in:
  - Mathematics
  - Statistics
  - Programming
  - Visualizations
  - Project management (setting expectations)
  - Product lifecycle
- Resources
- Classes
- Datasets
- Tools
- Books
  - Practical Data Science with R
- Competitions