Hi there! This guide is for you if:
- You know Python or you're learning it 🐍
- You're new to Machine Learning
- You care about the ethics of ML
- You learn by doing
If that's you, join me in getting a bit ahead of yourself, and see if you want to learn more about the field. (For alternatives, jump to the end of the guide or check out Nam Vu's guide, Machine Learning for Software Engineers.)
Get your feet wet and boost your confidence.
- Python. Python 3 is the best option.
- Jupyter Notebook. (Formerly known as IPython Notebook.)
- Some scientific computing packages:
- numpy
- pandas
- scikit-learn
- matplotlib
You can install Python 3 and all of these packages in a few clicks with the Anaconda Python distribution. Anaconda is popular in Data Science and Machine Learning communities. (Use whichever tool you want.)
Some options you can use from your browser:
- Binder is Jupyter Notebook's official choice to try JupyterLab
- Deepnote allows for real-time collaboration
- Google Colab provides "free" GPUs
For other options, see:
- markusschanta/awesome-jupyter, "Hosted Notebook Solutions"
- ml-tooling/best-of-jupyter, "Notebook Environments"
Learn how to use Jupyter Notebook (5-10 minutes). (You can learn by screencast instead.)
Now, follow along with this brief exercise: An introduction to machine learning with scikit-learn. Do it in IPython or Jupyter Notebook. It'll really boost your confidence.
You just classified some hand-written digits using scikit-learn. Neat huh?
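In case you want to revisit what that exercise covered, here's a condensed sketch of the same idea (the hyperparameters and split here are illustrative, not the tutorial's exact code):

```python
# Condensed sketch of the scikit-learn digits exercise:
# train a classifier on hand-written digits, then check its accuracy.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 1797 images of digits, each flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = SVC(gamma=0.001)  # a support vector classifier, as in the tutorial
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```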
Let's learn a bit more about Machine Learning, and a couple of common ideas and concerns. Read "A Visual Introduction to Machine Learning, Part 1" by Stephanie Yee and Tony Chu.
It won't take long. It's a beautiful introduction ... Try not to drool too much!
OK. Let's dive deeper.
Read "A Few Useful Things to Know about Machine Learning" by Prof. Pedro Domingos. It's densely packed with valuable information, but not opaque.
Take your time with this one. Take notes. Don't worry if you don't understand it all yet.
The whole paper is packed with value, but I want to call out two points:
- Data alone is not enough. This is where science meets art in machine learning. Quoting Domingos: "... the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs."
- More data beats a cleverer algorithm. Listen up, programmers. We like cool tools. Resist the temptation to reinvent the wheel, or to over-engineer solutions. Your starting point is to Do the Simplest Thing that Could Possibly Work. Quoting Domingos: "Suppose you’ve constructed the best set of features you can, but the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data. [...] As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)"
When you work on a real Machine Learning problem, you should focus your efforts on your domain knowledge and data before optimizing your choice of algorithms. Prefer to Do Simple Things until you have to increase complexity. You should not rush into neural networks because you think they're cool. To improve your model, get more data. Then use your knowledge of the problem to explore and process the data. You should only optimize the choice of algorithms after you have gathered enough data, and you've processed it well.
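One concrete habit that follows from "Do the Simplest Thing": always fit a trivial baseline before anything clever. Here's an illustrative sketch using scikit-learn's `DummyClassifier` (dataset and models chosen just for the example):

```python
# Sketch: establish a trivial baseline before reaching for complex models.
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Predict the most common class" - the floor any real model must beat.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# A simple, well-understood model as the next step up.
simple = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("logistic regression accuracy:", simple.score(X_test, y_test))
```

Anything fancier you try later has to beat those numbers to justify its complexity.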
(Graphic inspired by slide 28 from Alex Pinto's talk, "Secure Because Math: A Deep-Dive on ML-Based Monitoring")
- What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?
- Another handy term: "Data Engineering."
- "MLOps" overlaps with Data Eng, and there's an introductory MLOps section later in this guide.
Totally optional: some podcast episodes of note
First, download an interview with Prof. Domingos on the _Data Skeptic_ podcast (2018). Prof. Domingos wrote the paper we read earlier. You might also start reading his book, The Master Algorithm, a clear and accessible overview of machine learning. (It's available as an audiobook too.)
Next, subscribe to more machine learning and data science podcasts! These are great, low-effort resources that you can casually learn more from. To learn effectively, listen over time, with plenty of headspace. By the way, don't speed up technical podcasts; that can hinder your comprehension.
Subscribe to Talking Machines.
I suggest this listening order:
- Download the "Starting Simple" episode, and listen to that soon. It supports what we read from Domingos. Ryan Adams talks about starting simple, as we discussed above. Adams also stresses the importance of feature engineering. Feature engineering is an exercise of the "knowledge" Domingos writes about. In a later episode, they share many concrete tips for feature engineering.
- Then, over time, you can listen to the entire podcast series (start from the beginning).
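Since feature engineering comes up in that episode and in Domingos's paper, here's a small illustration of the idea, on entirely made-up data: use your knowledge of the problem to derive columns a model can actually use.

```python
# Hypothetical illustration of feature engineering: derive informative
# columns from raw ones, using domain knowledge, before any model sees them.
import pandas as pd

raw = pd.DataFrame({
    "signup_time": pd.to_datetime(["2024-01-06 09:00", "2024-01-08 23:30"]),
    "height_cm": [170, 185],
    "weight_kg": [65.0, 90.0],
})

features = pd.DataFrame({
    # Domain knowledge: weekend signups may behave differently (dayofweek: Mon=0).
    "signed_up_on_weekend": raw["signup_time"].dt.dayofweek >= 5,
    "signup_hour": raw["signup_time"].dt.hour,
    # A ratio (here, BMI) is often more informative than its raw parts.
    "bmi": raw["weight_kg"] / (raw["height_cm"] / 100) ** 2,
})
print(features)
```

None of these derived columns existed in the raw data; each encodes something you, not the algorithm, know about the problem.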
Want to subscribe to more podcasts? Here's a good listicle of suggestions, and another.
OK! Take a break, come back refreshed.
Next, pick one or two of these Jupyter Notebooks and play along.
- Dr. Randal Olson's Example Machine Learning notebook: "let's pretend we're working for a startup that just got funded to create a smartphone app that automatically identifies species of flowers from pictures taken on the smartphone. We've been tasked by our head of data science to create a demo machine learning model that takes four measurements from the flowers (sepal length, sepal width, petal length, and petal width) and identifies the species based on those measurements alone."
- Various notebooks by topic:
- trekhleb/machine-learning-experiments - "This is a collection of interactive machine-learning experiments. Each experiment consists of 🏋️ Jupyter/Colab notebook (to see how a model was trained) and 🎨 demo page"
- trekhleb/homemade-machine-learning
- Notebooks in a series:
- ageron/handson-ml2 - "Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2."
Find more great Jupyter Notebooks when you're ready:
- Jupyter's official Gallery of Interesting Jupyter Notebooks: Statistics, Machine Learning and Data Science (permalink)
Now you should be hooked, and hungry to learn more. Pick one of the courses below and start on your way.
Prof. Andrew Ng's Machine Learning is a popular and esteemed free online course. I've seen it recommended often. And emphatically.
You might like to have a pet project to play with, on the side. When you are ready for that, you could explore one of these Awesome Public Datasets.
Also, you should grab an in-depth textbook to use as a reference. The two recommendations I saw repeatedly were Understanding Machine Learning and Elements of Statistical Learning. You only need to use one of the two options as your main reference; here's some context/comparison to help you pick which one is right for you. You can download each book free as PDFs at those links - so grab them!
- Busy schedule? Read Ray Li's review of Prof. Andrew Ng's course for some helpful tips.
- Review some of the "Learning How to Learn" videos. This is just about how to study in general. In the course, they advocate the learn-by-doing approach, as we're doing here. You'll get various other tips that are easy to apply and go a long way toward making your time investment more effective.
- Review tips from Nam Vu's guide to learning ML as a software engineer.
Here are some other free online courses I've seen recommended. (Machine Learning, Data Science, and related topics.)
- Data science courses as Jupyter Notebooks:
- Prof. Pedro Domingos's introductory video series. Domingos wrote the paper "A Few Useful Things to Know About Machine Learning", recommended earlier in this guide.
- Kevin Markham's video series, Intro to Machine Learning with scikit-learn, starts with what we've already covered, then continues on at a comfortable pace.
- UC Berkeley's Data 8: The Foundations of Data Science course and the textbook Computational and Inferential Thinking teaches critical concepts in Data Science.
- Prof. Mark A. Girolami's Machine Learning Module (GitHub Mirror). Good for people with a strong mathematics background.
- Coursera's Data Science Specialization
- Advanced Statistical Computing (Vanderbilt BIOS8366). Interactive.
- Harvard CS109: Data Science
- An epic Quora thread: How can I become a data scientist?
- There are more alternatives linked at the bottom of this guide
Start with the support forums and chats related to the course(s) you're taking.
Check out datascience.stackexchange.com and stats.stackexchange.com – in particular, the machine-learning tag. There are some subreddits, like /r/LearningMachineLearning and /r/MachineLearning.
Don't forget about meetups. Also, nowadays there are many active and helpful online communities around the ML ecosystem. Look for chat invitations on project pages and so on.
You'll want to get more familiar with Pandas.
- Essential: Things in Pandas I Wish I'd Had Known Earlier (as a Jupyter Notebook)
- Essential: 10 Minutes to Pandas
- Another helpful tutorial: Real World Data Cleanup with Python and Pandas
- Video series from Data School, about Pandas. "Reference guide to 30 common pandas tasks (plus 6 hours of supporting video)."
- Useful Pandas Snippets
- Here are some docs I found especially helpful as I continued learning:
- Bookmarks for later when you need to scale
- dask: A Pandas-like interface, but for larger-than-memory data and "under the hood" parallelism. Very interesting, but only needed when you're getting advanced.
- See also: the MLOps section later in this guide.
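To give you a feel for what those Pandas tutorials drill into, here's the everyday loop of exploratory work - inspect, clean, aggregate - on a toy DataFrame (the data is made up for illustration):

```python
# A few core pandas idioms on toy data: inspect, clean, aggregate.
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima", "Lima"],
    "temp_c": [3.0, None, 22.0, 24.0, None],
})

df.info()  # inspect dtypes and missing values first

# Impute missing values (here, crudely, with the overall mean).
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

# Split-apply-combine: group rows, then aggregate each group.
summary = df.groupby("city")["temp_c"].agg(["mean", "count"])
print(summary)
```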
Some good cheat sheets I've come across. (Please submit a Pull Request to add other useful cheat sheets.)
- scikit-learn algorithm cheat sheet
- Stanford CS 229 cheat sheets, available on the web and as PDFs
"Machine learning systems automatically learn programs from data." Pedro Domingos, in "A Few Useful Things to Know about Machine Learning." The programs you generate will require maintenance. Like any way of creating programs faster, you can rack up technical debt.
Here is the abstract of Machine Learning: The High-Interest Credit Card of Technical Debt:
Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.
If you're following this guide, you should read that paper. You can also listen to a podcast episode interviewing one of the authors of this paper.
- Awesome Production Machine Learning, "a curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning." It includes a section about privacy-preserving ML, by the way!
- "Rules of Machine Learning: Best Practices for [Reliable] ML Engineering," by Martin Zinkevich, regarding ML engineering practices. There's an accompanying video.
- The High Cost of Maintaining Machine Learning Systems
- 11 Clever Methods of Overfitting and How to Avoid Them
- "So, you want to build an ethical algorithm?" An interactive tool to prompt discussions (source)
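Many of those "clever methods of overfitting" come down to one mistake: evaluating a model on data it has already seen. This sketch (dataset and model chosen just for illustration) contrasts training-set accuracy with a cross-validated estimate:

```python
# Sketch: training-set accuracy flatters the model; cross-validation is honest.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)  # unconstrained tree: memorizes easily

train_acc = model.fit(X, y).score(X, y)             # evaluated on data it memorized
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # evaluated on held-out folds

print(f"train accuracy: {train_acc:.3f}")           # typically near-perfect
print(f"cross-validated accuracy: {cv_acc:.3f}")    # the honest estimate
```

The gap between those two numbers is the overfitting you'd never have noticed without held-out data.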
So you are dabbling with Machine Learning. You've got Hacking Skills. Maybe you've got some "knowledge" in Domingos' sense (some "Substantive Expertise" or "Domain Knowledge"). This diagram is modified slightly from Drew Conway's "Data Science Venn Diagram." It isn't a perfect fit for our purposes here, but it might get the point across:
Please don't sell yourself as a Machine Learning expert while you're still in the Danger Zone. Don't build bad products or publish junk science. Please do bookmark the Institute for Ethical AI & Machine Learning's Responsible Machine Learning Principles. This guide can't tell you how you'll know you've "made it" into Machine Learning competence ... let alone expertise. It's hard to evaluate proficiency without schools or other institutions. So, practice!
You need practice. On Hacker News, user olympus commented to say you could use competitions to practice and evaluate yourself. Kaggle and ChaLearn are hubs for Machine Learning competitions. (You can find more competitions here or here.)
You also need understanding. You should review what Kaggle competition winners say about their solutions, for example on the "No Free Hunch" blog. These write-ups might be over your head at first, but once you start to understand and appreciate them, you'll know you're getting somewhere.
Competitions and challenges are just one way to practice. You shouldn't limit yourself, though - and you should also understand that Machine Learning isn't all about Kaggle competitions.
Here's a complementary way to practice: do practice studies.
- Ask a question. Start exploring some data. The "most important thing in data science is the question" (Dr. Jeff T. Leek). So start with a question. Then, find real data. Analyze it. Then ...
- Communicate results. When you think you have a novel finding, ask for review.
- Fix issues. Learn. Share what you learn.
And repeat. Re-phrasing this, it fits with the scientific method: formulate a question (or problem statement), create a hypothesis, gather data, analyze the data, and communicate results. (Here's a video about the scientific method in data science.)
How can you come up with interesting questions? Here's one way. Every Sunday, browse datasets and write down some questions. Also, sign up for Data is Plural, a newsletter of interesting datasets; look at these datasets and write down questions. Stay curious. When a question inspires you, start a study.
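Here's a practice study in miniature, to show how small the starting point can be - pose one concrete question, then answer it with real data (the question here is just an example):

```python
# A practice study in miniature: pose a question, answer it with real data.
# Question: which iris species has the longest petals, on average?
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = iris.target_names[iris.target]  # map class ids to names

answer = df.groupby("species")["petal length (cm)"].mean().sort_values()
print(answer)
```

The answer itself matters less than what you do next: ask why, look for confounders, and write up what you find for review.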
This advice, to do practice studies and learn from peer review, is based on a conversation with Dr. Randal S. Olson. Here's more advice from Olson, quoted with permission:
I think the best advice is to tell people to always present their methods clearly and to avoid over-interpreting their results. Part of being an expert is knowing that there's rarely a clear answer, especially when you're working with real data.
As you repeat this process, your practice studies will become more scientific, interesting, and focused. The most important part of this process is peer review.
Here are some communities where you can reach out for informal peer review:
- /r/LearnMachineLearning
- /r/DataIsBeautiful
- /r/DataScience
- /r/MachineLearning
- Cross-Validated: stats.stackexchange.com
Post to any of those, and ask for feedback. You'll get it - and as experts review your work, you'll learn a lot about the field. You'll also be practicing a crucial skill: accepting critical feedback.
Production, Deployment, MLOps
If you are learning about MLOps but find it overwhelming, these resources might help you get your bearings:
- MLOps Stack Template by Henrik Skogström
- Lessons on ML Platforms from Netflix, DoorDash, Spotify, and more by Ernest Chan in Towards Data Science
- MLOps Stack Canvas at ml-ops.org
Recommended awesomelists to save/star/watch:
- EthicalML/awesome-artificial-intelligence-guidelines
- EthicalML/awesome-production-machine-learning
- visenger/awesome-ml-model-governance
- visenger/awesome-MLOps
In early editions of this guide, there was no specific "Deep Learning" section. There are experts in the field who warn against jumping too far ahead.
Maybe this is a way to check your progress: ask yourself, does Deep Learning seem like magic? If so, take that as a sign that you aren't ready to work with it professionally, and let the fascination motivate you to learn more. Some people argue you can learn Deep Learning in isolation; others recommend mastering traditional Machine Learning first. Why not start with traditional Machine Learning, and develop your reasoning and intuition there? You'll have an easier time learning Deep Learning after that, and you'll be able to tackle all sorts of interesting problems.
In any case, when you're ready to dive into Deep Learning, here are some helpful resources.
- Dive into Deep Learning - An interactive book about deep learning
- "Interactive deep learning book with code, math, and discussions"
- "Implemented with NumPy/MXNet, PyTorch, and TensorFlow"
- "Adopted at 200 universities from 50 countries"
- labmlai/annotated_deep_learning_paper_implementations - "Deep learning papers implemented, with side-by-side notes" - "We are actively maintaining this repo and adding new implementations almost weekly."
- "Have Fun With [Deep] Learning" by David Humphrey. This is an excellent way to "get ahead of yourself" and hack-first. Then you will feel excited to move on to...
- Prof. Andrew Ng's courses on Deep Learning! There are five courses, as part of the Deep Learning Specialization on Coursera. These courses are part of his new venture, deeplearning.ai
- Deep Learning, a free book published by MIT Press. By Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- Quora: "What are the best ways to pick up Deep Learning skills as an engineer?" — answered by Greg Brockman (Co-Founder & CTO at OpenAI, previously CTO at Stripe)
- Distill.pub publishes explorable explanations that are really great.
- Creative Applications of Deep Learning with Tensorflow
- replicate.ai "makes it easy to share a running machine learning model" for the sake of reproducible research.
Machine Learning can be powerful, but it is not magic.
Whenever you apply Machine Learning to solve a problem, you are going to be working in some specific problem domain. To get good results, you or your team will need "substantive expertise" AKA "domain knowledge." Learn what you can, for yourself... But you should also collaborate. You'll have better results if you collaborate with domain experts. (What's a domain expert? See this useful, subjective blurb from the ol' c2 wiki, or the Wikipedia entry.)
I couldn't say it better:
Machine learning won’t figure out what problems to solve. If you aren’t aligned with a human need, you’re just going to build a very powerful system to address a very small—or perhaps nonexistent—problem.
That quote is from "The UX of AI" by Josh Lovejoy. In other words, You Are Not The User. Suggested reading: Martin Zinkevich's "Rules of ML Engineering", Rule #23: "You are not a typical end user"
Here are some useful links regarding Big Data and ML.
- 10 things statistics taught us about big data analysis (and some more food for thought: "What Statisticians think about Data Scientists")
- "Talking Machines" #12: Interviews Prof. Andrew Ng (from his course, which has its own module on big data); this episode covers some problems relevant to high-dimensional data
- "Talking Machines" #15: "Really Really Big Data and Machine Learning in Business"
- 0xnr/awesome-bigdata
See also: the MLOps section!
If you are working with data-intensive applications at all, I'll recommend this book:
- Designing Data-Intensive Applications by Martin Kleppman. (You can start reading it online, free, via Safari Books.) It's not specific to Machine Learning, but you can bridge that gap yourself.
Here are some additional Data Science resources:
- Python Data Science Handbook, as Jupyter Notebooks
- Accessible data science book, no coding experience required: Data Smart by John Foreman
- Data Science Workflow: Overview and Challenges (read the article and also the comment by Joseph McCarthy)
From the "Bayesian Machine Learning" overview on Metacademy:
... Bayesian ideas have had a big impact in machine learning in the past 20 years or so because of the flexibility they provide in building structured models of real world phenomena. Algorithmic advances and increasing computational resources have made it possible to fit rich, highly structured models which were previously considered intractable.
Here are some awesome resources for learning Bayesian methods.
- The free book Probabilistic Programming and Bayesian Methods for Hackers. Made with a "computation/understanding-first, mathematics-second point of view." Uses PyMC. It's available in print too!
- Like learning by playing? Me too. Try 19 Questions, "a machine learning game which asks you questions and guesses an object you are thinking about," and explains which Bayesian statistics techniques it's using!
- Time Series Forecasting with Bayesian Modeling by Michael Grogan, a 5-project series - paid but the first project is free.
- Bayesian Modelling in Python. Uses PyMC as well.
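If you want the core Bayesian move in miniature before diving into those resources, here's an illustrative example (not taken from any of the books above): update a prior with data to get a posterior, using the Beta-Binomial conjugate pair so no sampler is needed.

```python
# The core Bayesian move: combine a prior with data to get a posterior.
# Example: estimate a coin's heads-probability after seeing 7 heads in 10 flips.
from scipy import stats

heads, flips = 7, 10
prior_a, prior_b = 1, 1  # Beta(1, 1) is the uniform prior: no initial opinion

# Beta prior + binomial likelihood => Beta posterior (conjugacy does the math).
posterior = stats.beta(prior_a + heads, prior_b + (flips - heads))

print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

Note that the posterior mean (8/12 ≈ 0.667) sits between the prior's 0.5 and the data's raw 0.7 - the "flexibility in building structured models" the Metacademy quote describes starts from exactly this kind of update.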
- Bookmark awesome-machine-learning, a curated list of awesome Machine Learning libraries and software.
- Bookmark Pythonidae, a curated list of awesome libraries and software in the Python language - with a section on Machine Learning.
- For Machine-Learning libraries that might not be on PyPI, GitHub, etc., there's MLOSS (Machine Learning Open Source Software). Seems to feature many academic libraries.
- Julia: Julia.jl, a curated list of awesome libraries and software in the Julia language - with a section on Machine Learning.
Here are some other guides to learning Machine Learning. They can be alternatives or supplements to this guide.
- Example Machine Learning notebook, exercise, and guide by Dr. Randal S. Olson. Mentioned in the Notebooks section as well, but it has a similar goal to this guide (introduce you, and show you where to go next). Rich "Further Reading" section.
- Courses by cloud vendors (might be specific to their tools/platforms)
- Machine Learning Crash Course from Google with TensorFlow APIs.
- Amazon AWS: Amazon has opened up its internal training to the public and also offers certification.
- Machine Learning for Developers is good for people who are more familiar with Java or Scala than Python.
- ageron/handson-ml2 aka Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Geron
- rasbt/python-machine-learning-book-3rd-edition aka Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2 by Sebastian Raschka and Vahid Mirjalili
- Machine Learning for Software Engineers, by Nam Vu. In their words, it's a "top-down and results-first approach designed for software engineers." Definitely bookmark and use it, as well - it can answer lots of questions and connect you with great resources.