diff --git a/deep-learning-intro-for-hep/00-intro.md b/deep-learning-intro-for-hep/00-intro.md index 3f3d97f..493d920 100644 --- a/deep-learning-intro-for-hep/00-intro.md +++ b/deep-learning-intro-for-hep/00-intro.md @@ -13,9 +13,9 @@ The course materials include some inline problems, intended for active learning ## Software for the course -This course uses [Scikit-Learn](https://scikit-learn.org/) and [PyTorch](https://pytorch.org/) for examples and problem sets. [TensorFlow](https://www.tensorflow.org/) is also a popular machine learning library, but its functionality is mostly the same as PyTorch, and I didn't want to hide the concepts behind incidental differences in software interfaces. (I _did_ include Scikit-Learn because its interface is much simpler than PyTorch. When I want to emphasize issues that surround fitting in general, I'll use Scikit-Learn because the fit itself is just two lines of code, and when I want to emphasize the details of the machine learning model, I'll use PyTorch, which expands the fit into tens of lines of code and allows for more control of this part.) +This course uses [Scikit-Learn](https://scikit-learn.org/) and [PyTorch](https://pytorch.org/) for examples and problem sets. [TensorFlow](https://www.tensorflow.org/) is also a popular machine learning library, but its functionality is mostly the same as PyTorch, and I didn't want to hide the concepts behind incidental differences in software interfaces. (I _did_ include Scikit-Learn because its interface is much simpler than PyTorch. When I want to emphasize issues that surround fitting in general, I'll use Scikit-Learn because the fit itself is just two lines of code, and when I want to emphasize the details of the machine learning model, I'll use PyTorch, which expands the fit into tens of lines of code and allows for more control.) -I didn't take the choice of PyTorch over TensorFlow lightly (since I'm a newcomer to both). 
I verified that PyTorch is about as popular as TensorFlow among CMS physicists using the plot below (derived using the methodology in [this GitHub repo](https://github.com/jpivarski-talks/2023-05-09-chep23-analysis-of-physicists) and [this talk](https://indico.jlab.org/event/459/contributions/11547/)). Other choices, such as [JAX](https://jax.readthedocs.io/), would be a mistake because a reader of this tutorial would not be prepared to collaborate with machine learning as it is currently practiced in particle physics. +I didn't take the choice of PyTorch over TensorFlow lightly. I verified that PyTorch is about as popular as TensorFlow among CMS physicists using the plot below (derived using the methodology in [this GitHub repo](https://github.com/jpivarski-talks/2023-05-09-chep23-analysis-of-physicists) and [this talk](https://indico.jlab.org/event/459/contributions/11547/)). Other choices, such as [JAX](https://jax.readthedocs.io/), would be a mistake because a reader of this tutorial would not be prepared to collaborate with machine learning as it is currently practiced in particle physics. ![](img/github-ml-package-cmsswseed.svg){. width="100%"} @@ -37,7 +37,7 @@ and navigate to the `notebooks` directory in the left side-bar. The pages are fo ## To run everything on your own computer -Make sure that you have the following packages installed with [conda](https://scikit-hep.org/user/installing-conda), pip, uv, pixi, etc. +Make sure that you have the following packages installed with [conda](https://scikit-hep.org/user/installing-conda) (or pip, uv, pixi, etc.): ```{include} ../environment.yml :literal: true diff --git a/deep-learning-intro-for-hep/01-overview.md b/deep-learning-intro-for-hep/01-overview.md index 81c0698..0ed7c74 100644 --- a/deep-learning-intro-for-hep/01-overview.md +++ b/deep-learning-intro-for-hep/01-overview.md @@ -37,11 +37,9 @@ All of these terms describe the following general procedure: 2. 
vary those parameters ("train the model") until the algorithm returns expected results on a labeled dataset ("supervised learning") or until it finds patterns according to some desired metric ("unsupervised learning"); 3. either apply the trained model to new data, to describe the new data in the same terms as the training dataset ("predictive"), or use the model to generate new data that is plausibly similar to the training dataset ("generative"; AI and ML only). -Apart from the word "huge," this procedure also describes curve-fitting, a ubiquitous analysis technique that most experimental physicists use on a daily basis. Consider a dataset with two observables (called "features" in ML), $x$ and $y$, and suppose that they have an approximate, but not exact, linear relationship. There is [an exact algorithm](https://en.wikipedia.org/wiki/Linear_regression#Formulation) to compute the best fit of $y$ as a function of $x$, and this linear fit is a model with two parameters: the slope and intercept of the line. If $x$ and $y$ have a non-linear relationship expressed by $N$ parameters, a non-deterministic optimizer like [MINUIT](https://en.wikipedia.org/wiki/MINUIT) can be used to search for the best fit. +Apart from the word "huge," this procedure also describes curve-fitting, a ubiquitous analysis technique that most experimental physicists use on a daily basis. ML fits differ from curve-fitting in the number of parameters used and their interpretation—or rather, their lack of interpretation. In curve fitting, the values of the parameters and their uncertainties are regarded as the final product, often quoted in a table as the result of the data analysis. In ML, the parameters are too numerous to present this way and wouldn't be useful if they were, since the calculation of predicted values from these parameters is complex. 
Instead, the ML model is used as a machine, trained on "features" $x$ and "targets" $y$, to predict new $y$ for new $x$ values (inference) or to randomly generate new $x$, $y$ pairs with the same distribution as the training set (generation). In fact, most ML models don't even have a unique minimum in parameter space—different combinations of parameters would result in the same predictions. -ML fits differ from curve-fitting in the number of parameters used and their interpretation—or rather, their lack of interpretation. In curve fitting, the values of the parameters and their uncertainties are regarded as the final product, often quoted in a table as the result of the data analysis. In ML, the parameters are too numerous to present this way and wouldn't be useful if they were, since the calculation of predicted values from these parameters is complex. Instead, the ML model is used as a machine to predict $y$ for new $x$ values (prediction) or to randomly generate new $x$, $y$ pairs with the same distribution as the training set (generation). In fact, most ML models don't even have a unique minimum in parameter space—different combinations of parameters would result in the same predictions. - -Today, the most accurate and versatile class of ML models are "deep" Neural Networks (NN), where "deep" means having a large number of internal layers. I will describe this type of model in much more detail, since this course will focus exclusively on them. However, it's worth pointing out that NNs are just one type of ML model; others include: +Today, the most accurate and versatile class of ML models is "deep" Neural Networks (NN), where "deep" means having a large number of internal layers. I will describe this type of model in much more detail, since this course will focus on them. 
However, it's worth pointing out that NNs are just one type of ML model; others include: * [Naive Bayes classifiers](https://en.wikipedia.org/wiki/Naive_Bayes_classifier), * [k-Nearest Neighbors (kNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm), @@ -67,6 +65,6 @@ This course focuses on NNs for several reasons. At the end of the course, I want students to 1. understand neural networks at a deep level, to know _why_ they work and therefore _when_ to apply them, and -2. be sufficiently familiar with the tools and techniques of model-building to be able to start writing code (in PyTorch). The two hour exercise asks students to do exactly this, for a relevant particle physics problem. +2. be sufficiently familiar with the tools and techniques of model-building to be able to start writing code (in PyTorch). The [Main Project](20-main-project.md) asks students to do exactly this, for a relevant particle physics problem. In particular, I want to dispel the notion that ML is a black box or a dark art. Like the scientific method itself, ML development requires tinkering and experimentation, but it is possible to debug and it is possible to make informed decisions about it. diff --git a/deep-learning-intro-for-hep/02-history.md b/deep-learning-intro-for-hep/02-history.md index b453ef8..891130b 100644 --- a/deep-learning-intro-for-hep/02-history.md +++ b/deep-learning-intro-for-hep/02-history.md @@ -1,6 +1,6 @@ # History of HEP and ML -The purpose of this section is to show that particle physics, specifically High Energy Physics (HEP), has always needed ML. It is not a fad: ML (in a usable form) would have brought major benefits to HEP at any stage in its history. It's being applied to many problems in HEP now because it's just becoming _possible_ now. +The purpose of this section is to show that particle physics, specifically High Energy Physics (HEP), has always needed ML. 
It's not a fad: ML (in a usable form) would have brought major benefits to HEP at any stage in its history. It's being applied to many problems in HEP now because it's just becoming _possible_ now. ![](img/hep-plus-ml.jpg){. width="40%"} @@ -30,7 +30,7 @@ Below, I extended the plot to the present day: the number of events per second h These event rates have been too fast for humans since the 1970's, when human scanners were replaced by heuristic track-finding routines, usually by hand-written algorithms that iterate through all combinations within plausible windows (which are now a bottleneck in high track densities). -Although many computing tasks in particle physics are suitable for hand-written algorithms, the field also has and has always had tasks that are a natural fit for ML and artificial intelligence, to such an extent that human intelligence was enlisted to solve them. While ML would have been beneficial to HEP from the very beginning of the field, algorithms and computational resources have only recently made it possible. +Although many computing tasks in particle physics are suitable for hand-written algorithms, the field also has and has always had tasks that are a natural fit for artificial intelligence, to such an extent that human intelligence was enlisted to solve them. While ML would have been beneficial to HEP from the very beginning of the field, algorithms and computational resources have only recently made it possible. ## Symbolic AI and connectionist AI @@ -45,9 +45,9 @@ Symbolic AI consists of hand-written algorithms, which today wouldn't be called Symbolic AI is called "symbolic" because the starting point is a system of abstract symbols and rules—like programming in general. An associated philosophical idea is that this is the starting point for intelligence itself: human and artificial intelligence consists in manipulating propositions like an algebra. 
Mathematician-philosophers like George Boole, Gottlob Frege, and Bertrand Russell formulated these as "the laws of thought". -Connectionist AI makes a weaker claim about what happens in an intelligent system (natural or artificial): only that the system's inputs and outputs are correlated appropriately, and an intricate network of connections can implement that correlation. As we'll see, neural networks are an effective way to implement it, and they were (loosely) inspired by the biology of human brains. The idea that we can only talk about the inputs and outputs of human systems, without proposing symbols as entities in the mind, was a popular trend in psychology called Behaviorism in the first half of the 20th century. Today, Cognitive Psychologists can measure the scaling time and other properties of algorithms in human minds, so Behaviorism is out of favor. But it's ironic that large language models like ChatGPT are an implementation of what Behaviorists proposed as a model of human intelligence a century ago. +Connectionist AI makes a weaker claim about what happens in an intelligent system (natural or artificial): only that the system's inputs and outputs are correlated appropriately, and an intricate network of connections can implement that correlation. As we'll see, neural networks are an effective way to implement it, and they were (loosely) inspired by the biology of human brains. The idea that we can only talk about the inputs and outputs of human systems, without proposing symbols as entities in the mind, was a popular trend in psychology called Behaviorism in the first half of the 20th century. Today, Cognitive Psychologists can measure the scaling time and other properties of algorithms in human minds, so Behaviorism is out of favor. But it's ironic that large language models like ChatGPT are an implementation of what Behaviorists proposed as a model of human intelligence a century ago! 
-Although connectionist systems like neural networks don't start with propositions and symbols, something like these structures may form among the connections as the most effective way to produce the desired outputs, similar to emergent behavior in dynamical systems. Practitioners of explainable AI (xAI) try to find patterns like these in trained neural networks—far from treating a trained model as a black box, they treat it as a natural system to study! +Although connectionist systems like neural networks don't start with propositions and symbols, something like these structures may form among the connections as the most effective way to produce the desired outputs, similar to emergent behavior in dynamical systems. Practitioners of explainable AI (xAI) try to find patterns like these in trained neural networks—far from treating a trained model as a black box, they treat it as a natural system to study. ## AI's summers and winters @@ -82,13 +82,13 @@ More importantly for us, AI was adopted in experimental particle physics, starti ![](img/chep-acat-2024-ml.svg){. width="90%"} -Clearly, the physicists' interests are following the developments in computer science, including the latest "winter." From personal experience, I know that Boosted Decision Trees (BDTs) were popular in the gap between the 2000 and 2015, but they rarely appear in CHEP conference talks. (Perhaps they were less exciting?) +Clearly, the physicists' interests are following the developments in computer science, starting before the most recent winter. From personal experience, I know that Boosted Decision Trees (BDTs) were popular in the gap between 2000 and 2015, but they rarely appear in CHEP conference talks. (Perhaps they were less exciting?) 
-The 2024 Nobel Prize in Physics has drawn attention to connections between the theoretical basis of neural networks and statistical mechanics—particularly John Hopfield's memory networks (which have the same mathematical structure as Ising spin-glass systems in physics) and Geoffrey Hinton's application of simulated annealing to training Hopfield and restricted Boltzmann networks (gradually reducing random noise in training to find a more global optimum). However, this connection is distinct from the _application_ of neural networks to solve experimental physics problems—which seemed promising in the 1990's and is yielding significant results today. +The 2024 Nobel Prize in Physics has drawn attention to connections between the theoretical basis of neural networks and statistical mechanics—particularly John Hopfield's memory networks (which have the same mathematical structure as Ising spin-glass systems in physics) and Geoffrey Hinton's application of simulated annealing to training Hopfield and restricted Boltzmann networks (gradually reducing random noise in training to find a more global optimum). However, this connection is distinct from the _application_ of neural networks to solve experimental physics problems—which is what I'll be focusing on. ## Conclusion -HEP has always needed ML. Since the beginning of the HEP as we know it, high energy physicists have invested heavily in computing, but their problems could not be solved without human intelligence in the workflow, which doesn't scale to large numbers of collision events. Today, we're finding that many of the hand-written algorithms from the decades in which AI was not ready are less efficient and less capable than connectionist AI solutions, especially deep learning. +HEP has always needed ML. 
Since the beginning of HEP as we know it, high energy physicists have invested heavily in computing, but their problems could not be solved without human intelligence in the workflow, which doesn't scale to large numbers of collision events. Today, we're finding that many of the hand-written algorithms from the decades in which AI was not ready are less efficient and less capable than connectionist AI solutions, especially deep learning. Meanwhile, the prospect of connectionist AI has been unclear until very recently. Interest and funding vacillated throughout its history (including a brief dip in 2020‒2022 ([ref](https://doi.org/10.1109/AIMLA59606.2024.10531477)), before ChatGPT) as hype alternated with pessimism. Given this history, one could find examples to justify either extreme: it has been heavily oversold and undersold. diff --git a/deep-learning-intro-for-hep/03-basic-fitting.md b/deep-learning-intro-for-hep/03-basic-fitting.md index 2601a2e..2deb70f 100644 --- a/deep-learning-intro-for-hep/03-basic-fitting.md +++ b/deep-learning-intro-for-hep/03-basic-fitting.md @@ -17,7 +17,7 @@ kernelspec: +++ -In this section, we'll start doing some computations, so get a Python interface (terminal, file, or notebook) handy! I'll start with basic fitting, which I assume you're familiar with, and show how neural networks are a generalization of linear fits. +In this section, we'll start doing some computations, so get a Python interface (terminal, file, or notebook) handy! (See the [front page](00-intro) for instructions.) I'll start with basic fitting, which I assume you're familiar with, and show how neural networks are a generalization of linear fits. +++ @@ -77,7 +77,7 @@ a, b which should be close to $a = 2$ and $b = 3$. Differences from the true values depend on the details of the noise. Regenerate the data points several times and you should see $a$ jump around $2$ and $b$ jump around $3$. 
-We can also confirm that the line goes through the points: +We can also visually confirm that the line goes through the points: ```{code-cell} ipython3 fig, ax = plt.subplots() @@ -88,9 +88,9 @@ ax.plot([-5, 5], [a*-5 + b, a*5 + b], color="tab:blue") plt.show() ``` -One thing that you should keep in mind is that we're treating the $x$ dimension and the $y$ dimension differently: $\chi^2$ is minimizing differences in predicted $y$ ($a x_i + b$) and measured $y$ ($y_i$): only the _vertical_ differences between points and the line matter. In an experiment, you'd use $x$ to denote the variables you can control, such as the voltage you apply to a circuit, and $y$ is the measured response of the system, like a current in that circuit. In ML terminology, $x$ is a "feature" and $y$ is a "prediction," and this whole fitting process is called "regression." +One thing that you should keep in mind is that we're treating the $x$ dimension and the $y$ dimension differently: $\chi^2$ is minimizing differences in predicted $y$ (that is, $a x_i + b$) and measured $y$ (that is, $y_i$): only the _vertical_ differences between points and the line matter. In an experiment, you'd use $x$ to denote the variables you can control, such as the voltage you apply to a circuit, and $y$ is the measured response of the system, like a current in that circuit. In ML terminology, $x$ is a "feature" and $y$ is a "prediction," and this whole fitting process is called "regression." -Now suppose you control two features. The $x$ values are now 2-dimensional vectors and you need a 2-dimensional $a$ parameter to write a linear relationship: +Now suppose you control two features. 
The $x$ values become 2-dimensional vectors and you need a 2-dimensional $a$ parameter to write a linear relationship: $$\left(\begin{array}{c c} a^1 & a^2 \\ @@ -219,26 +219,26 @@ best_fit.intercept_ Before we leave linear fitting, I want to point out that getting the indexes right is hard, and that both the mathematical notation and the array syntax hide this difficulty. -* The first of the two fits with Scikit-Learn takes features (`X`) with shape `(100, 1)` and targets (`y`) with shape `(100,)`. -* The second takes features (`X`) with shape `(100000, 2)` and targets (`Y`) with shape `(100000, 3)`. +* The first of the two fits above takes features (`X`) with shape `(100, 1)` and targets (`y`) with shape `(100,)`. +* The second fit takes features (`X`) with shape `(100000, 2)` and targets (`Y`) with shape `(100000, 3)`. -Scikit-Learn's `fit` function is operating in two modes: the first takes a rank-1 array (the `shape` tuple has length 1) as a set of scalar targets and the second takes a rank-2 array as a set of vector targets. In both cases, Scikit-Learn requires the features to be rank-2, even if its second dimension just has length 1. (Why isn't it just as strict about the targets? I don't know.) +Scikit-Learn's `fit` function is operating in two modes: the first takes a rank-1 array (the `y.shape` tuple has length 1) as a sequence of scalar targets and the second takes a rank-2 array (the `Y.shape` tuple has length 2) as a sequence of vector targets. In both cases, Scikit-Learn requires the features (`X.shape`) to be rank-2, even if its second dimension just has length 1. (Why isn't it just as strict about the targets? I don't know...) 
-The mathematical notation is just as tricky: in the fully general case, we want to fit n-dimensional feature vectors $\vec{x}_i$ to m-dimensional target vectors $\vec{y}_i$, and we're looking for a best-fit matrix $\hat{A}$ (or a best-fit matrix $\hat{a}$ _and_ vector $\vec{b}$, depending on whether we use the "fake dimension" trick or not). I'm using the word "vector" (with an arrow over the variable) to mean rank-1 and "matrix" (with a hat over the variable) to mean rank-2. Each pair of $\vec{x}_i$ and $\vec{y}_i$ vectors should be close to the +The mathematical notation is just as tricky: in the fully general case, we want to fit $n$-dimensional feature vectors $\vec{x}_i$ to $m$-dimensional target vectors $\vec{y}_i$, and we're looking for a best-fit matrix $\hat{A}$ (or a best-fit matrix $\hat{a}$ _and_ vector $\vec{b}$, depending on whether we use the "fake dimension" trick or not). I'm using the word "vector" (with an arrow over the variable) to mean rank-1 and "matrix" (with a hat over the variable) to mean rank-2. Each pair of $\vec{x}_i$ and $\vec{y}_i$ vectors should be close to the $$\hat{A} \cdot \vec{x}_i = \vec{y}_i$$ relationship and we minimize $\chi^2$ for the whole dataset, summing over all $i$. -This $i$ can be thought of as _another_ dimension, which is why we have a matrix $\hat{X}$ and a matrix $\hat{Y}$ (but still only a matrix $\hat{A}$: the model parameters are not as numerous as the number of data points in the dataset). +This $i$ can be thought of as _another_ dimension, which is why we have a matrix $\hat{X}$ and a matrix $\hat{Y}$ (but still only a matrix $\hat{A}$: the model parameters don't scale with the number of data points in the dataset). In machine learning applications like computer vision, the individual data points are images, which we'd like to think of as having two dimensions. Thus, we can get into higher and higher ranks, and that's why we usually talk about "tensors." 
It will be worth paying special attention to which dimensions mean what. The notation gets complicated because it's hard to decide where to put all those indexes. In the above, I've tried to consistently put the data-points-in-dataset index $i$ as a subscript and the features/targets-are-vectors index as a superscript, but if we have more of them, then we just have to list them somehow. -A concept that you should _not_ carry over from physics is the idea that tensors are defined by how they transform under spatial rotations—like the inertia tensor, the stress tensor, or tensors in general relativity. These "tensors" are just rectilinear arrays of numbers. +A concept that you should _not_ carry over from physics is the idea that tensors are defined by how they transform under spatial rotations—like the inertia tensor, the stress tensor, or tensors in general relativity. The "tensors" of ML are just rectilinear arrays of numbers. +++ -## Non-linear fitting +## Non-linear "ansatz" fitting +++ @@ -299,7 +299,7 @@ Instead, we use our theoretical knowledge of the shape of the functional form, o $$\chi^2 = \sum_i \left[f(x) - y\right]^2$$ -In HEP, our favorite search algorithm is implemented by the Minuit library. +In HEP, our favorite search algorithm is implemented by the [Minuit](https://en.wikipedia.org/wiki/MINUIT) library. ```{code-cell} ipython3 from iminuit import Minuit @@ -399,7 +399,7 @@ The fit might converge to the wrong value or it might fail to converge entirely. +++ -If you _do_ know enough to write a (correct) functional form and seed the fit with good starting values, then ansatz fitting is the best way to completely understand a system. Not only is the fitted function an accurate predictor of new values, but the parameters derived from the fit tell you about the underlying reality by filling in numerical values that were missing from the theory. 
In the above example, we could have used $\mu$ and $t_f$ to derive the force of air resistance on the tossed object—we'd learn something new. In general, all of physics is one big ansatz fit: we hypothesize general relativity and the Standard Model, then perform fits to measurements and learn the values of the constant of universal gravitation, the masses of quarks, leptons, and bosons, the strengths of interactions between them, etc. I didn't show it in the examples above, but fitting procedures can also provide uncertainties on each parameter, their correlations, and likelihoods that the ansatz is correct. +If you _do_ know enough to write a (correct) functional form and seed the fit with good starting values, then ansatz fitting is the best way to completely understand a system. Not only is the fitted function an accurate predictor of new values, but the parameters derived from the fit tell you about the underlying reality by filling in numerical values that were missing from the theory. In the above example, we could have used $\mu$ and $t_f$ to derive the force of air resistance on the tossed object—that is, we'd learn something new. In general, all of physics is one big ansatz fit: we hypothesize the functional form of general relativity and the Standard Model, then perform fits to measurements and learn the values of the constant of universal gravitation, the masses of quarks, leptons, and bosons, the strengths of interactions between them, etc. I didn't show it in the examples above, but fitting procedures can also provide uncertainties on each parameter, their correlations, and likelihoods that the ansatz is correct. _However_, most scientific problems beyond physics don't have this much prior information. This is especially true in sciences that study the behavior of human beings. What is the underlying theory for a kid preferring chocolate ice cream over vanilla? What are the variables, and what's the functional form? 
Even if you think that human behavior is determined by underlying chemistry and physics, it would be horrendously complex. @@ -417,8 +417,8 @@ Here's an example: the [Boston Housing Prices](https://www.kaggle.com/datasets/v * full-value property-tax rate per \$10,000 * pupil-teacher ratio by town * $1000(b - 0.63)^2$ where $b$ is the proportion of Black residents -* % lower status by population +* \% lower status by population -All of these seem like they would have an effect on housing prices, but it's almost impossible to guess which would be more important. Problems like these are usually solved by a generic linear fit of many variables. Unimportant features would have a best-fit slope near zero, and if our goal is to find out which features are most important, we can force unimportant features toward zero with "regularization" (to be discussed in a later section). The idea of ML as "throw everything into a big fit" is close to what you have to do if you have no ansatz, and neural networks are a natural generalization of high-dimensional linear fitting. +All of these seem like they would have an effect on housing prices, but it's almost impossible to guess which would be more important. Problems like these are usually solved by a generic linear fit of many variables. Unimportant features would have a best-fit slope near zero, and if our goal is to find out which features are most important, we can force unimportant features toward zero with "regularization" (to be discussed in [a later section](16-regularization)). The idea of ML as "throw everything into a big fit" is close to what has traditionally been done in these fields with high-dimensional linear fits. In the next section, we'll try to fit arbitrary non-linear _curves_ without knowing an ansatz. 
diff --git a/deep-learning-intro-for-hep/04-universal-approximators.md b/deep-learning-intro-for-hep/04-universal-approximators.md index c9e22ab..afcfb06 100644 --- a/deep-learning-intro-for-hep/04-universal-approximators.md +++ b/deep-learning-intro-for-hep/04-universal-approximators.md @@ -59,7 +59,7 @@ ax.legend(loc="lower right") plt.show() ``` -I don't think I need to demonstrate that a linear fit would be terrible. +I don't need to demonstrate that a linear fit would be terrible. We can get a good fit from a theory-driven ansatz, but as I showed in the previous section, it's very sensitive to the initial guess that we give the fitter. @@ -186,7 +186,7 @@ b_n &&=&& \frac{2}{P} \int_P f(x) \sin\left(2\pi\frac{n}{P}x\right) \, dx \\ \end{align} $$ -NumPy has a function for computing integrals using the trapezoidal rule, which I'll use below to fit a Fourier series to the function. +NumPy has Fast Fourier Transform (FFT) algorithms built-in, and they accept inputs of any length, but they return a discrete transform rather than the coefficient integrals written above. NumPy also has a function for computing integrals using the trapezoidal rule, so I'll use that below to fit a Fourier series to the function. ```{code-cell} ipython3 NUMBER_OF_COS_TERMS = 7 @@ -230,7 +230,7 @@ Both the 15-term Taylor series and the 15-term Fourier series are not good fits +++ -The classic methods of universal function approximation—Taylor series, Fourier series, and others—have one thing in common: they all approximate the function with a fixed set of basis functions $\psi_i$ for $i \in [0, N)$. +The classic methods of universal function approximation—Taylor series, Fourier series, and the like—have one thing in common: they all approximate the function with a fixed set of basis functions $\psi_i$ for $i \in [0, N)$. 
$$f(x) = \sum_i^N c_i \, \psi_i(x)$$ @@ -240,7 +240,7 @@ Suppose, instead, that we had a set of functions that could also change shape: $$f(x) = \sum_i^N c_i \, \psi(x; \alpha_i, \beta_i)$$ -These are functions of $x$, parameterized by $\alpha_i$ and $\beta_i$. Here's an example set that we can use: [sigmoid functions](https://en.wikipedia.org/wiki/Sigmoid_function) with an adjustable center $\alpha$ and width $\beta$: +These are functions of $x$, parameterized by $\alpha_i$ and $\beta_i$. Here's a useful example of a set of functions: [sigmoid functions](https://en.wikipedia.org/wiki/Sigmoid_function) with an adjustable center $\alpha$ and width $\beta$: $$\psi(x; \alpha, \beta) = \frac{1}{1 + \exp\left((x - \alpha)/\beta\right)}$$ @@ -265,7 +265,7 @@ ax.legend(loc="lower left", bbox_to_anchor=(0.05, 0.1)) plt.show() ``` -Fitting with these adaptive sigmoids requires an iterative search, rather than computing the parameters with an exact formula. These basis functions are not orthogonal to each other (unlike Fourier components), and they're not even related through a linear transformation (unlike Taylor components). +Fitting with these adaptive sigmoids requires an iterative search, like Minuit, rather than computing the parameters with an exact formula. These basis functions are not orthogonal to each other (unlike Fourier components), and they're not even related to an orthogonal basis through a linear transformation (unlike Taylor components). In fact, this is a harder-than-usual problem for Minuit because the search space has many local minima. To get around this, let's run it 15 times and take the best result (minimum of minima). @@ -323,7 +323,7 @@ assert np.sum((model_y - curve_y)**2) < 10 It's a beautiful fit (usually)! -Since you used 5 sigmoids with 3 parameters each (scaling coefficient $c_i$, center $\alpha_i$, and width $\beta_i$), this is 15 parameters, and the result is much better than it is with 15 Taylor components or 15 Fourier components. 
+Since we used 5 sigmoids with 3 parameters each (scaling coefficient $c_i$, center $\alpha_i$, and width $\beta_i$), it's a total of 15 parameters, and the result is much better than it is with 15 Taylor components or 15 Fourier components. Moreover, it generalizes reasonably well: @@ -383,11 +383,11 @@ the full fit function is $$y = \sum_i^n c_i \, f\left(x'_i\right)$$ -We took a 1-dimensional $x$, linear transformed it into an n-dimensional $\vec{x}'$, applied a non-linear function $f$, and then linear-transformed that into a 1-dimensional $y$. Let's draw it (for $n = 5$) like this: +We took a 1-dimensional $x$, linearly transformed it into an $n$-dimensional $\vec{x}'$, applied a non-linear function $f$, and then linearly transformed that into a 1-dimensional $y$. Let's draw it (for $n = 5$) like this: ![](img/artificial-neural-network-layers-3.svg){. width="100%"} -If you've seen diagrams of neural networks before, this should look familiar! The input is on the left is a vertical column of boxes—only one in this case because our input is 1-dimensional—and the linear transformation is represented by arrows to the next vertical column of boxes, our 5-dimensional $\vec{x}'$. The sigmoid $f$ is not shown in diagram, and the next set of arrows represent another linear transformation to the outputs, $y$, which is also 1-dimensional so only one box. +If you've seen diagrams of neural networks before, this should look familiar! The input is on the left as a vertical column of boxes—only one in this case because our input is 1-dimensional—and the linear transformation is represented by arrows to the next vertical column of boxes, our 5-dimensional $\vec{x}'$. The sigmoid $f$ is not shown in the diagram, and the next set of arrows represents another linear transformation to the outputs, $y$, which is also 1-dimensional, so only one box.
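The data flow in this diagram can be sketched in a few lines of NumPy. This is only an illustration: the parameter values are random placeholders rather than fitted values, and the names `sigmoid`, `network`, `alpha`, `beta`, and `c` are hypothetical, chosen to match the symbols in the formulas above.

```python
import numpy as np

def sigmoid(z):
    # Same sign convention as psi(x; alpha, beta) earlier in this section.
    return 1 / (1 + np.exp(z))

n = 5                            # number of components in the hidden layer
rng = np.random.default_rng(0)
alpha = rng.uniform(-1, 1, n)    # centers (placeholder values, not a fit result)
beta = rng.uniform(0.1, 1, n)    # widths
c = rng.uniform(-1, 1, n)        # output coefficients

def network(x):
    x = np.asarray(x, dtype=float)
    # First linear transformation: 1-dimensional x -> n-dimensional x'
    # (shift by each center, scale by each width).
    x_prime = (x[..., np.newaxis] - alpha) / beta
    # Non-linear function applied to each component, followed by the
    # second linear transformation: n-dimensional -> 1-dimensional y.
    return np.sum(c * sigmoid(x_prime), axis=-1)

xs = np.linspace(-2, 2, 7)
ys = network(xs)

# The same sum written term by term, as in the formula for y.
direct = sum(c[i] / (1 + np.exp((xs - alpha[i]) / beta[i])) for i in range(n))
```

Writing the hidden layer as one broadcasted array operation, rather than a loop over the $n$ components, is what the columns of boxes in the diagram represent.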
The first linear transform has slopes $1/\beta$ and intercepts $\alpha_i/\beta_i$ and the second linear transformation in our example has only slopes $c_i$, but we could have added another intercept $y_0$ if we wanted to, to let the vertical offset float. diff --git a/deep-learning-intro-for-hep/05-neural-networks.md b/deep-learning-intro-for-hep/05-neural-networks.md index bd9ae34..9358a75 100644 --- a/deep-learning-intro-for-hep/05-neural-networks.md +++ b/deep-learning-intro-for-hep/05-neural-networks.md @@ -60,7 +60,7 @@ f\left[a_{4,1}x_1 + a_{4,2}x_2 + \ldots + a_{4,10}x_{10} + b_4\right] = y_4 \\ f\left[a_{5,1}x_1 + a_{5,2}x_2 + \ldots + a_{5,10}x_{10} + b_5\right] = y_5 \\ \end{array}$$ -I personally don't know whether real neurons are deterministic—can be modeled as a strict function of their inputs—or that it's the same function $f$ for all dendrites ($y_i$), but early formulations like [McCulloch & Pitts (1943)](https://doi.org/10.1007/BF02478259) used a sharp, binary step function: +I personally don't know whether real, biological neurons are deterministic—whether they can be modeled as a strict function of their inputs—or if they share the same activation function $f$ for all dendrites ($y_i$), but early formulations like [McCulloch & Pitts (1943)](https://doi.org/10.1007/BF02478259) used a sharp, binary step function: $$f(x) = \left\{\begin{array}{c l} 0 & \mbox{if } x < 0 \\ 1 & \mbox{if } x \ge 0 \\ @@ -88,9 +88,9 @@ x_2 \\ x_{10} \end{array}\right) + b\right] = y$$ -The 1-dimensional binary output is good for classification. A binary classification model is trained with a set of $N$ data points, $\vec{x}_i$ with $i \in [0, N)$, and corresponding binary targets, $y_i \in \{A, B\}$, to produce a predictor-machine that associates every point in the space $\vec{x}$ with a probability that it is $A$ or $B$, $P_A(\vec{x})$ and $P_B(\vec{x})$. Naturally, $P_A(\vec{x}) + P_B(\vec{x}) = 1$, so knowing $P_A(\vec{x})$ is equivalent to knowing $P_B(\vec{x})$. 
+The 1-dimensional binary output is good for classification. A binary classification model is trained with a set of $N$ data points, $\vec{x}_i$ with $i \in [0, N)$, and corresponding binary targets, $y_i \in \{A, B\}$, to produce a predictor-machine that associates every point in the space $\vec{x}$ with a probability that it is $A$ or $B$, $P_A(\vec{x})$ and $P_B(\vec{x})$. Since they're probabilities, $P_A(\vec{x}) + P_B(\vec{x}) = 1$, so knowing $P_A(\vec{x})$ is equivalent to knowing $P_B(\vec{x})$. -Here's a sample problem to learn: which regions of the plane are orange and which are blue? +Here's a sample problem to learn: which regions of the plane are covered with orange dots and which are covered with blue dots? ```{code-cell} ipython3 blob1 = np.random.normal(0, 1, (1000, 2)) + np.array([[0, 3]]) @@ -126,7 +126,7 @@ ax.set_ylim(-4, 7) plt.show() ``` -Since a perceptron doesn't have a hidden layer, it's not even considered a neural network by Scikit-Learn. A model consisting of a linear transformation passed into a sigmoid function is called logistic regression (sigmoid is sometimes called "logistic"). +Since a perceptron doesn't have a hidden layer, it's not even considered a neural network by Scikit-Learn. A model consisting of a linear transformation passed into a sigmoid function is called logistic regression (the sigmoid function is also called a "logistic"). ```{code-cell} ipython3 from sklearn.linear_model import LogisticRegression @@ -244,9 +244,9 @@ The solution starts with the observation that, in a brain, the output of one neu ![](img/nerve-cells-sem-steve-gschmeissner.jpg){. width="100%"} -The full case of a general graph is hard to think about: what happens if the output of one neuron connects, perhaps through a series of other neurons, back to one of its own inputs? Since each $x_i$ component is single-valued, a cycle has to be handled in a time-dependent way. 
A value of $x_i = 0$ might, though some connections, force $x_i \to 1$, but only _at a later time_. +The full case of a general graph is hard to think about: what happens if the output of one neuron connects, perhaps through a series of other neurons, back to one of its own inputs? Since each $x_i$ component is single-valued, a cycle has to be handled in a time-dependent way. A value of $x_i = 0$ might, through some connections, force $x_i \to 1$, but only _at a later time_. (Signals in the brain take time to propagate, too.) -There have been a few different approaches. +There have been a few different ways to address cycles: 1. Require the graph to not have cycles. This is what McCulloch & Pitts did in their original formulation, since they were trying to build neuron diagrams that make logical propositions, like AND, OR, and NOT in digital circuits. These have clear inputs and outputs and should be time-independent. 2. Update the graph in discrete time-steps. If $x_i = 0$ implies, through some connections, that $x_i$ will be $1$, it is updated in a later time-step. @@ -255,11 +255,11 @@ The layers that we now use in most neural networks are a special case of #1. Eve ![](img/boltzmann-machine.svg){. width="100%"} -This is equivalent to the system of adaptive basis functions that we developed in the previous section: +The Restricted Boltzmann machine is equivalent to the system of adaptive basis functions that we developed in the previous section—just move the nodes into tidy columns, maintaining their connections: ![](img/artificial-neural-network-layers-4.svg){. width="100%"} -Now let's use it to classify orange and blue points, the problem described in the previous subsection. +Now let's use this neural network to classify the orange and blue points that a single perceptron couldn't fit: ```{code-cell} ipython3 from sklearn.neural_network import MLPRegressor @@ -307,7 +307,7 @@ The basis functions didn't have to be sigmoid-shaped.
Here are a few common opti | ![](img/Activation_prelu.svg){. width="100%"} | leaky ReLU | $\displaystyle f(x) = \left\{\begin{array}{c l}\alpha x & \mbox{if } x < 0 \\ x & \mbox{if } x \ge 0\end{array}\right.$ | | ![](img/Activation_swish.svg){. width="100%"} | sigmoid linear unit or swish | $\displaystyle f(x) = \frac{x}{1 + e^{-x}}$ | -Let's use the ReLU shape instead. This is perhaps the most common activation function, because of its simplicity (and its derivatives don't asymptotically approach zero). +Let's use the ReLU shape instead. This is the most widely used activation function in ML, because of its simplicity (and because its derivatives don't asymptotically approach zero). ```{code-cell} ipython3 best_fit = MLPRegressor( @@ -337,13 +337,13 @@ The boundaries still separate the orange and blue points, but now they're made o With enough components in the hidden layer, we can approximate any shape. After all, each component is one adaptive basis function, and our favorite activation functions (the table above) have one wiggle each. Adding a component to the hidden layer adds a wiggle, which the fitter can use to wrap around the training points. -However, if the fit function has too many wiggles to fit around the outliers in the training data, it will overfit the data (to be discussed in an upcoming section). We want a model that _generalizes_ the training data, not one that _memorizes_ it. +However, if the fit function has too many wiggles to fit around the outliers in the training data, it will overfit the data (to be discussed in an [upcoming section](15-under-overfitting.md)). We want a model that _generalizes_ the training data, not one that _memorizes_ it. An effective way to do that is to add more hidden layers, like the diagram below: ![](img/artificial-neural-network-layers-5.svg){. width="100%"} -$\vec{x}^{L1}$ (layer 1) is the input and $\vec{x}^{L2}$ (layer 2) is the first hidden layer. 
Then the output of that is passed through another linear transformation and activation function, as many times as we want. Each layer is a function composition. (Be careful when specifying how many layers a neural network has. We're interested in how many linear transformations the network has, so we'd count the above as 4, not 5.) +$\vec{x}^{L1}$ (layer 1) is the input and $\vec{x}^{L2}$ (layer 2) is the first hidden layer. Then the output of that is passed through another linear transformation and activation function, as many times as we want. Each layer is a function composition. (Be careful when specifying how many layers a neural network has. We're interested in how many linear transformations the network has, so we'd count the above as 4 layers, not 5 layers.) This technique is called "deep learning" (especially when many, many layers are used) and it is largely responsible for the resurgence of interest in neural networks since 2015. Below is a plot of Google search volume for the words "neural network" and "deep learning": @@ -370,7 +370,7 @@ But what really got things started was that deep neural networks started winning The general adage is that "one layer memorizes, many layers generalize." -Each layer in a neural network is a function composition: the arbitrary curve that a set of adaptive basis functions learn is in a space that has already been transformed by a previous set of adaptive basis functions. Each layer warps space to make the next layer's problem simpler. +Each layer in a neural network is a function composition: the arbitrary curve that a set of adaptive basis functions learns is in a space that has already been transformed by a previous set of adaptive basis functions. Each layer warps space to make the next layer's problem simpler. 
Roy Keyes has a [fantastic demo](https://gist.github.com/jpivarski/f99371614ecaa48ace90a6025d430247) (too much detail for this section) that classifies three categories of points that are arranged like spiral arms of a galaxy, shown below on the left. The first layer of the neural network transforms the $x$-$y$ space of the original problem into the mesh shown in grey on the right. In the transformed space, the three spiral arms are now linearly separable. It is the starting point for the second layer. @@ -382,4 +382,4 @@ For another illustration ([ref](https://arxiv.org/abs/1402.1869)), consider the One hidden layer with many components has a lot of adjustable handles to fit a curve. Multiple hidden layers search for symmetries and try to use them to describe the data in a more generalizable way. -These arguments are heuristic: you could certainly have a problem with no internal symmetries. However, the fact that deep learning has been so successful might mean that most problems do. +These arguments are heuristic: you could certainly have a problem with no internal symmetries. However, the fact that deep learning has been so successful might mean that most problems do have internal symmetries. diff --git a/deep-learning-intro-for-hep/07-regression.md b/deep-learning-intro-for-hep/07-regression.md index ba9b715..2032832 100644 --- a/deep-learning-intro-for-hep/07-regression.md +++ b/deep-learning-intro-for-hep/07-regression.md @@ -17,7 +17,7 @@ kernelspec: +++ -This and the next section introduce PyTorch so that we can use it for the remainder of the course. Whereas Scikit-Learn gives you a function for just about [every type of machine learning model](https://scikit-learn.org/stable/machine_learning_map.html), PyTorch gives you the pieces and expects you to build it yourself. (The [JAX](https://jax.readthedocs.io/) library is even more extreme in providing only the fundamental pieces. PyTorch's level of abstraction is between JAX and Scikit-Learn.) 
+This and the [next section](09-classification.md) introduce PyTorch so that we can use it for the remainder of the course. Whereas Scikit-Learn gives you a function for just about [every type of machine learning model](https://scikit-learn.org/stable/machine_learning_map.html), PyTorch gives you the pieces and expects you to build it yourself. (The [JAX](https://jax.readthedocs.io/) library is even more extreme in providing only the fundamental pieces. PyTorch's level of abstraction is between JAX and Scikit-Learn.) I'll use the two types of problems we've seen so far—regression and classification—to show Scikit-Learn and PyTorch side-by-side. First, though, let's get a dataset that will provide us with realistic regression and classification problems. @@ -31,7 +31,7 @@ import matplotlib.pyplot as plt +++ -This is my new favorite dataset: basic measurements on 3 species of penguins. You can get the data as a CSV file from the [original source](https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris) or from this project's GitHub: [deep-learning-intro-for-hep/data/penguins.csv](https://github.com/hsf-training/deep-learning-intro-for-hep/blob/main/deep-learning-intro-for-hep/data/penguins.csv). +This is my new favorite dataset: basic measurements on 3 species of penguins. You can get the data as a CSV file from the [original source](https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris) or from GitHub: [deep-learning-intro-for-hep/data/penguins.csv](https://github.com/hsf-training/deep-learning-intro-for-hep/blob/main/deep-learning-intro-for-hep/data/penguins.csv). If you're using Codespaces or cloned the repository, it's already in the `data` directory. ![](img/culmen_depth.png){. width="50%"} @@ -93,7 +93,7 @@ plot_regression_problem(ax) plt.show() ``` -Next, let's add a layer of ReLU functions using Scikit-Learn's [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html). 
The reason we set `alpha=0` is because its regularization is not off by default, and we haven't talked about regularization yet. The `solver="lbfgs"` picks a more robust optimization method for this low-dimension problem. +Next, let's add a layer of ReLU functions using Scikit-Learn's [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html). ("MLP" stands for "Multi Layer Perceptron.") The reason we set `alpha=0` is because its regularization is not off by default, and we haven't talked about [regularization](16-regularization.md) yet. The `solver="lbfgs"` picks a more robust optimization method for this low-dimension problem. ```{code-cell} ipython3 from sklearn.neural_network import MLPRegressor @@ -136,7 +136,7 @@ A model has parameters that the optimizer will vary in the fit. When you create list(model.parameters()) ``` -We can't pass NumPy arrays directly into PyTorch—they have to be converted into PyTorch's own array type (which can reside on CPU or GPU), called `Tensor`. +We can't pass NumPy arrays directly into PyTorch—they have to be converted into PyTorch's own array type (which can reside on CPU or GPU), [torch.Tensor](https://pytorch.org/docs/stable/tensors.html). PyTorch's functions are very sensitive to the exact data types of these tensors: the difference between integers and floating-point can make PyTorch run a different algorithm! For floating-point numbers, PyTorch prefers 32-bit. @@ -148,7 +148,7 @@ tensor_targets = torch.tensor(regression_targets[:, np.newaxis], dtype=torch.flo Now we need to say _how_ we're going to train the model. * What will the loss function be? For a regression problem, it would usually be $\chi^2$, or mean squared error: [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html) -* Which optimizer should we choose? (This is the equivalent of `solver="lbfgs"` in Scikit-Learn.) 
We'll talk more about these later, and the right choice will usually be [nn.Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam), but not for this linear problem. For now, we'll use [nn.Rprop](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html#torch.optim.Rprop). +* Which optimizer should we choose? (This is the equivalent of `solver="lbfgs"` in Scikit-Learn.) We'll talk more about optimizers in an [upcoming section](11-minimizers.md), and the right choice will usually be [optim.Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam), but not for this linear problem. For now, we'll use [optim.Rprop](https://pytorch.org/docs/stable/generated/torch.optim.Rprop.html#torch.optim.Rprop). The optimizer has access to the model's parameters, and it will modify them in-place. @@ -158,7 +158,7 @@ loss_function = nn.MSELoss() optimizer = optim.Rprop(model.parameters()) ``` -To actually train the model, you have to write your own loop! It's more verbose, but you get to control what happens and debug it. +To actually train the model, you have to write your own loop! The code you'll write is more verbose, but you get to control what happens and debug it. One step in optimization is called an "epoch." In Scikit-Learn, we set `max_iter=1000` to get 1000 epochs. In PyTorch, we write, @@ -178,11 +178,11 @@ for epoch in range(1000): optimizer.step() ``` -The `optimizer.zero_grad()`, `loss.backward()`, and `optimizer.step()` calls change the state of the optimizer and the model parameters, but you can think of them just as the beginning and end of an optimization step.
+The [optim.Optimizer.zero_grad](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html), [torch.Tensor.backward](https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html), and [optim.Optimizer.step](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html) calls change the state of the optimizer and the model parameters, but you can think of them just as the beginning and end of an optimization step. -There are other state-changing functions, like `model.train()` (to tell it we're going to start training) and `model.eval()` (to tell it we're going to start using it for inference), but we won't be using any of the features that depend on the variables that these set. +There are other state-changing functions, like [nn.Module.train](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train) (to tell it we're going to start training) and [nn.Module.eval](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval) (to tell it we're going to start using it for inference), but these only matter for a few techniques, such as dropout regularization, [discussed later](16-regularization.md). -Now, to draw a plot with this model, we'll have to turn the NumPy `x` positions into a `Tensor`, run it through the model, and then convert the model's output back into a NumPy array. The output has derivatives as well as values, so those will need to be detached. +Now, to draw a plot with this model, we'll have to turn the NumPy `x` positions into a [torch.Tensor](https://pytorch.org/docs/stable/tensors.html), run it through the model, and then convert the model's output back into a NumPy array. The output has derivatives as well as values, so those will need to be detached. 
* NumPy `x` to Torch: `torch.tensor(x, dtype=torch.float32)` (or other dtype) * Torch `y` to NumPy: `y.detach().numpy()` @@ -279,7 +279,7 @@ list(model.parameters()) Initially, the model parameters are all random numbers between $-1$ and $1$. After fitting, _some_ of the parameters are in the few-hundred range. -Now look at the $x$ and $y$ ranges on the plot: flipper lengths are hundreds of millimeters and body masses are thousands of grams. The optimizer had to gradually step values of order 1 up to values of order 100‒1000, and it took small steps to avoid jumping over the solution. In the end, the optimizer found a reasonably good fit by scaling just a few parameters up and effectively performed a purely linear fit. +Now look at the $x$ and $y$ ranges on the plot: flipper lengths are hundreds of millimeters and body masses are thousands of grams. The optimizer had to gradually increase parameters of order 1 up to order 100‒1000, and it takes small steps to avoid jumping over the solution. In the end, the optimizer found a reasonably good fit by scaling just a few parameters up and effectively performed a purely linear fit. We should have scaled the inputs and outputs so that the values the fitter sees are _all_ of order 1. This is something that PyTorch _assumes_ you will do. @@ -374,8 +374,8 @@ This time, we see the effect of the ReLU steps because the data and the model pa I think this illustrates an important point about working with neural networks: you cannot treat them as black boxes—you have to understand the internal parts to figure out why it is or isn't fitting the way you want it to. Nothing told us that the ReLU parameters were effectively being ignored because the data were at the wrong scale. We had to step through the pieces to find that out. -Hand-written code, called "craftsmanship" in the [Overview](01-overview.md), is generally designed to be more compartmentalized than this. 
If you're coming from a programming background, this is something to look out for! Andrej Karpathy's excellent [recipe for training neural networks](https://karpathy.github.io/2019/04/25/recipe/) starts with the warning that neural network training is a "[leaky abstraction](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/)," which is to say, you have to understand its inner workings to use it effectively—more so than other software products. +Hand-written code, called "craftsmanship" in the [Overview](01-overview.md), is generally designed to be more compartmentalized than this. If you're coming from a programming background, look out for these interdependencies! Andrej Karpathy's excellent [recipe for training neural networks](https://karpathy.github.io/2019/04/25/recipe/) starts with the warning that neural network training is a "[leaky abstraction](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/)," which is to say, you have to understand its inner workings to use it effectively—more so than with hand-written software. That may be why PyTorch is so popular: it forces you to look at the individual pieces, rather than maintaining the illusion that pressing a `fit` button will give you what you want. -Next, we'll see how to use it for classification problems. +After an exercise, we'll see how to use it for [classification problems](09-classification.md). diff --git a/deep-learning-intro-for-hep/08-exercise-2.md b/deep-learning-intro-for-hep/08-exercise-2.md index d1eafaf..ff22467 100644 --- a/deep-learning-intro-for-hep/08-exercise-2.md +++ b/deep-learning-intro-for-hep/08-exercise-2.md @@ -43,11 +43,11 @@ I [previously mentioned](03-basic-fitting.md) the Boston House Prices dataset. 
I * TAX: full-value property-tax rate per \$10,000 * PTRATIO: pupil-teacher ratio by town * B: $1000(b - 0.63)^2$ where $b$ is the proportion of Black residents -* LSTAT: % lower status by population +* LSTAT: \% lower status by population as well as MEDV, the median prices of owner-occupied homes. Your job is to predict the prices, given all of the other data as features. You will do this with both a linear fit and a neural network with 5 hidden sigmoid components. -You can get the dataset from [the original source](https://www.kaggle.com/datasets/vikrishnan/boston-house-prices) or from this project's GitHub: [deep-learning-intro-for-hep/data/boston-house-prices.csv](https://github.com/hsf-training/deep-learning-intro-for-hep/blob/main/deep-learning-intro-for-hep/data/boston-house-prices.csv). +You can get the dataset from [the original source](https://www.kaggle.com/datasets/vikrishnan/boston-house-prices) or from this project's GitHub: [deep-learning-intro-for-hep/data/boston-house-prices.csv](https://github.com/hsf-training/deep-learning-intro-for-hep/blob/main/deep-learning-intro-for-hep/data/boston-house-prices.csv). If you're using Codespaces or cloned the repository, it's already in the `data` directory. ```{code-cell} ipython3 housing_df = pd.read_csv( @@ -127,6 +127,8 @@ ax.legend(loc="upper right", bbox_to_anchor=(1.2, 1), framealpha=1) plt.show() ``` +And it does: the residuals are narrower after the fit. + ## Neural network solution in Scikit-Learn +++ @@ -156,7 +158,7 @@ def unscale_predictions(predictions): return (predictions * regression_targets.std()) + regression_targets.mean() ``` -In this neural network, let's use 5 sigmoids ("logistic" functions) in the hidden layer. The `alpha=0` turns off regularization, which we haven't covered yet. +In this neural network, let's use 5 sigmoids ("logistic" functions) in the hidden layer. The `alpha=0` turns off regularization, which we [haven't covered yet](16-regularization.md). 
```{code-cell} ipython3 best_fit_nn = MLPRegressor( @@ -184,7 +186,7 @@ plt.show() The neural network is a further improvement, beyond the linear model. This shouldn't be a surprise, since we've given the fitter more knobs to turn: instead of just a linear transformation, we have a linear transformation followed by 5 adaptive basis functions (the sigmoids). -Note that this is not a proper procedure for modeling yet: you can get an _arbitrarily_ good fit by adding more and more sigmoids. Try replacing the `(5,)` with `(100,)` to see what happens to the neural network residuals. If we have enough adaptive basis functions, we can center one on every input data point, and our "quality check" is looking at the model's prediction of those same input data points. We'll talk more about this in the sections on overfitting and splitting data into training, validation, and test datasets. +Note that this is not a proper procedure for modeling yet: you can get an _arbitrarily_ good fit by adding more and more sigmoids. Try replacing the `(5,)` with `(100,)` to see what happens to the neural network residuals. If we have enough adaptive basis functions, we can center one on every input data point, and our "quality check" is looking at the model's prediction of those same input data points. We'll talk more about this in the sections on [overfitting](15-under-overfitting.md) and splitting data into [training, validation, and test datasets](18-hyperparameters.md). +++ diff --git a/deep-learning-intro-for-hep/09-classification.md b/deep-learning-intro-for-hep/09-classification.md index 8d28395..a0e7646 100644 --- a/deep-learning-intro-for-hep/09-classification.md +++ b/deep-learning-intro-for-hep/09-classification.md @@ -57,7 +57,7 @@ Python's `TypeError` won't tell you if you're inappropriately multiplying or div Categorical problems involve at least one categorical variable. For instance, given a penguin's bill length and bill depth, what's its species? 
We might ask a categorical model to predict the most likely category or we might ask it to tell us the probabilities of each category. -We can't pass string-valued variables into a fitter, so we need to convert the strings into numbers. Since these categories are nominal (not ordinal), equality/inequality is the only meaningful operation, so the numbers should only indicate which strings are the same as each other and which are different. +We can't pass string-valued variables into a fitter, so we need to convert the strings into numbers. Since these categories are nominal (not ordinal), equality/inequality is the only meaningful operation. The numbers should only indicate which strings are the same as each other and which are different. Pandas has a function, [pd.factorize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html), to turn unique categories into unique integers and an index to get the original strings back. (You can also use Pandas's [categorical dtype](https://pandas.pydata.org/docs/user_guide/categorical.html).) @@ -80,9 +80,9 @@ categorical_1hot_df = pd.get_dummies(penguins_df.dropna()[["bill_length_mm", "bi categorical_1hot_df ``` -This is called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) and it's generally more useful than integer encoding, though it takes more memory (especially if you have a lot of distinct categories). +This is called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) and, because it doesn't even imply an ordering or equal-sized spacings between categories, it's conceptually better. However, it takes more memory, especially if you have a lot of distinct categories. 
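To make the two encodings concrete, here is a small sketch on a made-up four-row series (not the penguin dataset itself; the variable names `species`, `codes`, `index`, and `one_hot` are only for illustration). It shows how the integer codes map back to the original strings and that every one-hot row has exactly one active column.

```python
import pandas as pd

# A toy categorical column, standing in for penguins_df["species"].
species = pd.Series(["Adelie", "Gentoo", "Chinstrap", "Adelie"], name="species")

# Integer encoding: codes are assigned in order of first appearance,
# and `index` holds the unique strings needed to invert the encoding.
codes, index = pd.factorize(species)
recovered = index[codes]

# One-hot encoding: one column per category, exactly one "hot" entry per row.
one_hot = pd.get_dummies(species)

print(codes)
print(list(recovered))
print(one_hot)
```

The integer codes depend on the order in which categories first appear, which is one more sign that their numerical values carry no meaning beyond "same" or "different."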
-For instance, suppose that the categorical variable is the feature and we're trying to predict something numerical: +To illustrate the conceptual limitations of integer encoding, suppose that the categorical variable is the feature and we're trying to predict something numerical: ```{code-cell} ipython3 fig, ax = plt.subplots() @@ -96,7 +96,7 @@ plt.show() If you were to fit a straight line through $x = 0$ and $x = 1$, it would have _some_ meaning: the intersections would be the average bill lengths of Adelie and Gentoo penguins, respectively. But if the fit also includes $x = 2$, it would be meaningless, since it would be using the order of Adelie, Gentoo, and Chinstrap, as well as the equal spacing between them, as relevant for determining the $y$ predictions. -On the other hand, the one-hot encoding is difficult to visualize, but any fits through this high-dimensional space are meaningful. +One-hot encoding is difficult to visualize, but any fits through this high-dimensional space are meaningful. ```{code-cell} ipython3 fig = plt.figure(figsize=(8, 8)) @@ -152,26 +152,24 @@ plot_categorical_problem(ax) plt.show() ``` -The model will be numerical, a function from bill length and depth to a 3-dimensional probability space. Probabilities have two hard constraints: +Scikit-Learn and PyTorch models must involve numerical inputs and outputs, so the model will be a function from bill length and depth to a 3-dimensional probability space. Probabilities must: -* they are all strictly bounded between $0$ and $1$ -* all the probabilities in a set of possibilities need to add up to $1$. +* be greater than or equal to $0$ and +* add up to $1$. -If we define $P_A$, $P_G$, and $P_C$ for the probability that a penguin is Adelie, Gentoo, or Chinstrap, respectively, then $P_A + P_G + P_C = 1$ and all are non-negative. 
+If we define $P_A$, $P_G$, and $P_C$ for the probability that a penguin is Adelie, Gentoo, or Chinstrap, respectively, then $0 \le P_A, P_G, P_C \le 1$ and $P_A + P_G + P_C = 1$. -One way to ensure the first constraint is to let a model predict values between $-\infty$ and $\infty$, then pass them through a sigmoid function: +One way to ensure the first constraint is to let a model predict values $y$ between $-\infty$ and $\infty$, then pass them through a sigmoid function: -$$p(x) = \frac{1}{1 + \exp(x)}$$ +$$p(y) = \frac{1}{1 + \exp(y)}$$ -If we only had 2 categories, $P_1$ and $P_2$, this would be sufficient: we'd model the probability of $P_1$ only by applying a sigmoid as the last step in our model. $P_2$ can be inferred from $P_1$. +If we only had 2 categories, $P_1$ and $P_2$, we could define $P_1 = p(y)$ for an unconstrained $y$ and $P_2 = 1 - P_1$. But what about 3 categories, as in the penguin problem? -But what if we have 3 categories, as in the penguin problem? +The sigmoid function has a multidimensional generalization called [softmax](https://en.wikipedia.org/wiki/Softmax_function). Given an $n$-dimensional vector $\vec{y}$ with components $y_1, y_2, \ldots, y_n$, -The sigmoid function has a multidimensional generalization called [softmax](https://en.wikipedia.org/wiki/Softmax_function). Given an $n$-dimensional vector $\vec{x}$ with components $x_1, x_2, \ldots, x_n$, +$$P(\vec{y})_i = \frac{\exp(y_i)}{\displaystyle \sum_j^n \exp(y_j)}$$ -$$P(\vec{x})_i = \frac{\exp(x_i)}{\displaystyle \sum_j^n \exp(x_j)}$$ - -For any $x_i$ between $-\infty$ and $\infty$, all $0 \le P_i \le 1$ and $\sum_i P_i = 1$. Thus, we can pass the output of any $n$-dimensional vector through a softmax to get probabilities. +For any $y_i$ between $-\infty$ and $\infty$, all $0 \le P_i \le 1$ and $\sum_i P_i = 1$. Thus, we can pass the output of any $n$-dimensional vector through a softmax to get probabilities. 
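As a quick check of these two properties, here is a minimal NumPy sketch of the softmax formula. The function name `softmax`, the example scores, and the max-subtraction step (a common trick to avoid overflow in the exponential, which does not change the result) are assumptions added for illustration; they are not taken from the course code.

```python
import numpy as np

def softmax(y):
    """Map unconstrained scores (last axis = categories) to probabilities."""
    # Subtracting the row maximum cancels in the ratio but keeps exp() finite.
    shifted = y - np.max(y, axis=-1, keepdims=True)
    exp_y = np.exp(shifted)
    return exp_y / np.sum(exp_y, axis=-1, keepdims=True)

# Hypothetical model outputs: one row per penguin, one column per species.
scores = np.array([
    [2.0, -1.0, 0.5],
    [-300.0, 0.0, 300.0],   # extreme values still give valid probabilities
])
probabilities = softmax(scores)

print(probabilities)
print(probabilities.sum(axis=-1))
```

Each row of `probabilities` is non-negative and sums to 1, and the largest score in a row always gets the largest probability, so the most likely category can be read off with `np.argmax` either before or after the softmax.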
+++ @@ -228,7 +226,7 @@ from sklearn.neural_network import MLPClassifier As in the previous section, we need to scale the features to be order 1. The bill lengths and depths are 30‒60 mm and 13‒22 mm, respectively, which would be hard for the optimizer to find in small steps, starting from numbers between $-1$ and $1$. -But... do we need to scale the prediction targets? Why not? +But... do we need to scale the prediction targets? Why or why not? ```{code-cell} ipython3 def scale_features(x): @@ -237,7 +235,7 @@ def scale_features(x): categorical_features_scaled = scale_features(categorical_features) ``` -Below, `alpha=0` because we haven't discussed regularization yet. +Below, `alpha=0` because we haven't discussed [regularization](16-regularization.md) yet. ```{code-cell} ipython3 best_fit = MLPClassifier( @@ -254,7 +252,7 @@ plot_categorical_problem(ax) plt.show() ``` -The 50% threshold lines can now be piecewise linear, because of the ReLU adaptive basis functions. (If we had chosen sigmoid/logistic, they'd be smooth curves.) These thresholds even be shrink-wrapping around individual training points, especially those that are far from the bulk of the distributions, which is less constrained. +The 50% threshold lines can now be piecewise linear, because of the ReLU adaptive basis functions. (If we had chosen sigmoid/logistic, they'd be smooth curves.) These thresholds may even be shrink-wrapping around individual training points, especially those that are far from the bulk of the distributions, which is less constrained. +++ @@ -398,7 +396,7 @@ nn.CrossEntropyLoss()(predictions, targets_as_labels) (We get the same answer because these `targets_as_labels` correspond to the `targets_as_probabilities`.) -As another PyTorch technicality, notice that most of these functions create functions (or, as another way of saying it, they're class instances with a `__call__` method, so they can be called like functions). 
`nn.CrossEntropyLoss` is not a function of predictions and targets; it returns a function of predictions and targets: +As another PyTorch technicality, notice that most of these functions create functions (or, as another way of saying it, they're classes with a `__call__` method, so that class instances can be called like functions). `nn.CrossEntropyLoss` is not a function of predictions and targets; it _returns_ a function of predictions and targets: ```{code-cell} ipython3 nn.CrossEntropyLoss() diff --git a/deep-learning-intro-for-hep/11-minimizers.md b/deep-learning-intro-for-hep/11-minimizers.md index f17c794..239e4f6 100644 --- a/deep-learning-intro-for-hep/11-minimizers.md +++ b/deep-learning-intro-for-hep/11-minimizers.md @@ -87,7 +87,7 @@ The adaptive basis functions that we use in neural networks, however, are not so * $L$ takes $n$ floating-point numbers as input, which can be thought of as an $n$-dimensional space of real numbers. This number of parameters $n$ may be hundreds or thousands (or more). * $L$ returns $1$ floating-point number as output. $1$-dimensional real numbers are strictly ordered, which is what makes it possible to say that some value of loss is better than another. (In practice, this means that if we want to optimize two quantities, such as signal strength and background rejection in a HEP analysis, we have to put them both on the same scale: how much signal can we afford to lose to reject enough background? If we're running a bank, how much credit card fraud are we willing to ignore to avoid annoying card-holders with too many alerts? All loss problems have to become $1$-dimensional.) * $L$ is continuous and differentiable at every point (even if the definition of the derivative seems artificial in some cases). -* $L$ may be very noisy and will have many local minima. Unless the neural network architecture is trivial, $L$ _will_ have non-unique global minima because of symmetries. 
+* $L$ may be very noisy and will have many local minima. Unless the neural network architecture is trivial, $L$ will at least have non-unique global minima because of symmetries. * $L$ is bounded below. Its output doesn't, for instance, limit to $-\infty$ as some combination of its input parameters limit to $\infty$ or some divergent point. At least the global minima _exist_. +++ @@ -103,7 +103,7 @@ All of the minimization algorithms we'll look at have the following basic approa 3. Use the derivative to take a step in the direction that decreases $L$, but don't jump all the way to the estimated minimum. Take a small step whose size is controlled by the "learning rate." 4. Either repeat step 2 or do something to avoid getting stuck in a local minimum or narrow valley. (There's a great deal of variety in how minimization algorithms choose to approach the apparent minimum.) -Minuit follows this approach, though it also (numerically) calculates a second derivative, $\nabla^2 L(\vec{p})$, which it uses to describe the shape of the minimum. This is because Minuit is intended for ansatz fits—the steepness of the loss function at the minimum tells us the uncertainty in the fitted parameters. Neural network training doesn't need it. +Minuit follows this approach, though it also (numerically) calculates a second derivative, $\nabla^2 L(\vec{p})$, which it uses to describe the shape of the minimum. This is because Minuit is intended for ansatz fits—the second derivative of the loss function at the minimum tells us the uncertainty in its fitted parameters. Neural network training doesn't need it. ```{code-cell} ipython3 from iminuit import Minuit @@ -278,7 +278,7 @@ Momentum is the reason why Adam (and variants) are robust minimizers: they dynam -But we can see it by trying to minimize the following function, which has a long, narrow valley that tends to trap the minimization process (like golf). 
+But we can see it for ourselves by trying to minimize the following function, which has a long, narrow valley that tends to trap the minimization process (like sand traps in golf). ```{code-cell} ipython3 def function_to_minimize(x, y): @@ -336,7 +336,7 @@ plt.show() Keep in mind that what we're doing is unusual: it's very unlikely that you'll ever minimize 2-dimensional functions, and certainly not by hand like this. The reason we're doing this is to understand what these options mean, to have an idea of what to tune when a neural network isn't improving during training. -Most likely, the functions you'll be minimizing will have hundreds or thousands of dimensions, but they'll have complicated structure like the 2-dimensional examples above. This is a 2-dimensional cross-section taken from a real machine learning application ([ref](https://doi.org/10.48550/arXiv.1712.09913)): +Most likely, the functions you'll be minimizing will have hundreds or thousands of dimensions, but they'll have complicated structure like the 2-dimensional examples above. Below is a 2-dimensional cross-section taken from a real machine learning application ([ref](https://doi.org/10.48550/arXiv.1712.09913)): ![](img/loss-visualization-noshort.png){. width="65%"} @@ -348,9 +348,9 @@ Once you find a best fit for all sigmoids $\psi(x; \alpha_i, \beta_i)$ in $$y = \sum_i^n c_i \psi(x; \alpha_i, \beta_i) = \sum_i^n c_i \, \frac{1}{1 + \exp\left((x - \alpha_i)/\beta_i\right)}$$ -swap any sigmoid $i$ for sigmoid $j$ (i.e. swap $c_i \leftrightarrow c_j$, $\alpha_i \leftrightarrow \alpha_j$, and $\beta_i \leftrightarrow \beta_j$) and the result will be identical. However, we have just replaced the values of 3 components of parameter $\vec{p}$ with 3 other components: this is a new point $\vec{p}'$. If one is a minimum, then the other must also be a minimum with exactly the same value. 
+swap any sigmoid $i$ for sigmoid $j$ (that is, swap $c_i \leftrightarrow c_j$, $\alpha_i \leftrightarrow \alpha_j$, and $\beta_i \leftrightarrow \beta_j$) and the result will be identical. However, we have just replaced the values of 3 components of parameter $\vec{p}$ with 3 other components: this is a new point $\vec{p}'$. If one is a minimum, then the other must also be a minimum with exactly the same value. -This underscores the distinction between ansatz-fitting and neural network training: when fitting a function to an ansatz, we want to find the unique minimum because the location and shape of the minimum tells us the value and uncertainty in some theoretical parameter—something about the real world, such as the strength of gravity or the charge of an electron. In neural network training, the minimum _cannot_ be unique, at least because of the symmetries in swapping adaptive basis functions, but also because the landscape is very complicated with many local minima. _This_ is why neural networks are evaluated on their ability to predict, not on the values of their internal parameters. +This underscores the distinction between ansatz-fitting and neural network training: when fitting a function to an ansatz, we want to find the unique minimum because the location and shape of the minimum tells us the value and uncertainty in some theoretical parameter—something about the real world, such as the strength of gravity or the charge of an electron. In neural network training, the fitted $\vec{p}$ _cannot_ be unique, at least because of the symmetries in swapping adaptive basis functions, but also because the landscape is very bumpy with many traps. _This_ is why neural networks are judged only by their ability to predict, not on the values of their internal parameters. 
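The swapping symmetry can be checked numerically. This is a minimal sketch in plain Python, using the sum-of-sigmoids formula above; the helper name `sigmoid_sum` and the parameter values are made up for illustration and are not fitted to any dataset.

```python
import math

def sigmoid_sum(x, params):
    """y = sum_i c_i / (1 + exp((x - alpha_i) / beta_i))."""
    return sum(c / (1.0 + math.exp((x - alpha) / beta)) for c, alpha, beta in params)

# Hypothetical "fitted" parameters: one (c, alpha, beta) triple per sigmoid
p = [(1.5, -0.3, 0.2), (-0.7, 0.4, 0.1), (2.0, 1.1, 0.5)]

# Swap sigmoid 0 with sigmoid 2: a different point in parameter space
p_swapped = [p[2], p[1], p[0]]

# Compare the two models on a grid of x values
xs = [i / 10 for i in range(-20, 21)]
max_difference = max(abs(sigmoid_sum(x, p) - sigmoid_sum(x, p_swapped)) for x in xs)

print(p != p_swapped, max_difference)
```

The two parameter lists are different, but the models agree at every $x$ up to floating-point rounding, so any loss computed from their predictions is the same at both points.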
+++ @@ -404,7 +404,7 @@ for fan_out in [10, 20, 50, 100, 200, 500, 1000]: print_values(model) ``` -This is usually sufficient, but the optimal distributions depend on both $n_{\mbox{in}}$ and $n_{\mbox{out}}$. For smooth activation functions like sigmoids and inverse tangent, Xavier initialization is best and for ReLU and its variants, He initialization is best. This can be relevant if you build an architecture in which the size of one layer is very different from the size of the next. +This default is usually sufficient, but the optimal distributions depend on both $n_{\mbox{in}}$ and $n_{\mbox{out}}$. For smooth activation functions like sigmoids and inverse tangent, Xavier initialization is best and for ReLU and its variants, He initialization is best. This may be relevant if you build an architecture in which the size of one layer is very different from the size of the next. The [nn.init](https://pytorch.org/docs/main/nn.init.html) module provides initialization functions, so you can set initial parameters appropriately: @@ -425,9 +425,9 @@ for fan_out in [10, 20, 50, 100, 200, 500, 1000]: +++ -You might have noticed the `requires_grad` option in all of the PyTorch `Tensor` objects. By default, PyTorch computes derivatives for every computation it performs and most optimizers use these derivatives. The `loss_function.backward()` step in every model-training `for` loop that you write computes derivatives from the output of the neural network (where it's constrained by the target you're trying to optimize for) through to the inputs. +You might have noticed the `requires_grad` option in all of the [torch.Tensor](https://pytorch.org/docs/stable/tensors.html) objects. By default, PyTorch computes derivatives for every computation it performs and most optimizers use these derivatives.
The `loss_function.backward()` step in every model-training `for` loop that you write computes derivatives from the output of the neural network (where it's constrained by the target you're trying to optimize for) through to the inputs. -Here's an example of using a PyTorch `Tensor` to compute a function's derivative. +Here's an example of using a [torch.Tensor](https://pytorch.org/docs/stable/tensors.html) to compute a function's derivative. ```{code-cell} ipython3 x = torch.linspace(-np.pi, np.pi, 1000, requires_grad=True) @@ -452,15 +452,15 @@ ax.legend(loc="upper left") plt.show() ``` -Unlike Minuit, this derivative is not computed by finite differences ($df/dx \approx (f(x + \Delta x) - f(x))/\Delta x$ for some small $\Delta x$), and it is not a fully symbolic derivative from a computer algebra system (like Mathematica, Maple, or SymPy). [Autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) is a technique that passes arrays of derivatives through the same calculation that the main arrays have gone through, applying the chain rule at every step. Thus, it's not approximate, it doesn't depend on a choice of scale ($\Delta x$), and the functions don't have to be closed-form formulae. +Unlike Minuit, this derivative is not computed by finite differences ($df/dx \approx (f(x + \Delta x) - f(x))/\Delta x$ for some small $\Delta x$), and it is not a fully symbolic derivative from a computer algebra system (like Mathematica, Maple, or SymPy). [Autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html) is a technique that passes arrays of derivatives through the same calculation that the main arrays have gone through, swapping functions like sine for their derivatives, applying the chain rule at every step. Thus, it's not approximate, it doesn't depend on a choice of scale ($\Delta x$), and the functions don't have to be closed-form formulae. 
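To make the idea of carrying derivatives through a calculation concrete, here is a minimal sketch in plain Python. It uses forward-mode "dual numbers," which is a simplification: PyTorch's autograd records operations and propagates derivatives backward, so this is not its implementation. The names `Dual`, `sin`, and `f` are hypothetical, chosen only for this illustration.

```python
import math

class Dual:
    """A value together with its derivative with respect to the input."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (u v)' = u' v + u v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(u):
    # Chain rule: sin(u)' = cos(u) u'
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)

def f(x):
    # An ordinary-looking calculation, applied to Dual objects
    return sin(x * x) + x

x0 = 0.7
y = f(Dual(x0, 1.0))  # seed with dx/dx = 1

# Hand-derived derivative of sin(x^2) + x, for comparison
exact = 2 * x0 * math.cos(x0 * x0) + 1

# Finite-difference estimate, which depends on the choice of dx
dx = 1e-6
finite = ((math.sin((x0 + dx) ** 2) + (x0 + dx)) - (math.sin(x0 * x0) + x0)) / dx

print(y.deriv - exact, finite - exact)
```

The derivative carried by the `Dual` object matches the hand-derived formula to rounding error, while the finite-difference estimate is off by an amount that scales with `dx`.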
-This is the reason why you have to [Tensor.detach](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) before converting a `Tensor` to a NumPy array—the removal of information about the derivative shouldn't be implicit, or it could lead to wrong calculations. +This is the reason why you have to [Tensor.detach](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) before converting a [torch.Tensor](https://pytorch.org/docs/stable/tensors.html) to a NumPy array—the removal of information about the derivative shouldn't be implicit, or it could lead to wrong calculations. But this is also relevant because you might be wondering how we can use ReLU functions in loss functions that are supposed to be differentiable. Formally, $$\mbox{ReLU}(x) = \left\{\begin{array}{c l}0 & \mbox{if } x < 0 \\ x & \mbox{if } x \ge 0\end{array}\right.$$ -doesn't have a true derivative at $x = 0$. (Limiting from the left, the derivative is $0$; limiting from the right, the derivative is $1$.) However, as an artifact of the autograd technique non-differentiable points get "fake" derivatives (that derive from details of their implementation). +doesn't have a true derivative at $x = 0$. (Limiting from the left, the derivative is $0$; limiting from the right, the derivative is $1$.) However, as an artifact of the autograd technique non-differentiable points get "fake" derivatives that derive from details of their implementation. ```{code-cell} ipython3 x = torch.arange(1000, dtype=torch.float32, requires_grad=True) * 0.004 - 2 @@ -538,7 +538,7 @@ It could have been defined to be $0$ or $1$, but $0$ is more popular. * Use the [optim.Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) minimizer. * Pay close attention to the learning rate (`lr`). Try different, exponentially spaced steps (like `0.001`, `0.01`, `0.1`) if the optimization doesn't seem to be doing anything or if the output is erratic. 
-* Maybe even play with the momentum (`betas`), but the primary learning rate is more important. +* Maybe even play with the momentum (`betas`), but the learning rate is more important. * The final, fitted parameters are not meaningful. Unlike ansatz fitting, solutions are even guaranteed to not be unique. The parameters you get from one loss-minimization will likely be different from the parameters you get from another. * The variance of the initial random parameter values matters! If your network architecture has roughly the same number of vector components in each layer, you can use PyTorch's default. If not, initialize the parameter values using the appropriate function from [nn.init](https://pytorch.org/docs/main/nn.init.html), which depends on your choice of activation function. * Most of the function minimizers need to be given derivatives. Despite appearances, ReLU is differentiable in the sense that it needs to be. diff --git a/deep-learning-intro-for-hep/12-mini-batches.md b/deep-learning-intro-for-hep/12-mini-batches.md index 3cac14b..a2ed160 100644 --- a/deep-learning-intro-for-hep/12-mini-batches.md +++ b/deep-learning-intro-for-hep/12-mini-batches.md @@ -33,9 +33,7 @@ while last_loss is None or loss - last_loss > EPSILON: optimizer.step() ``` -But usually, we just pick a large enough number of iterations for the specific application. (Minuit and Scikit-Learn have to work for _all_ applications; hence, they have to be more clever about the stopping condition.) These iterations are called "epochs." - -(I haven't been able to figure out why they're called that.) +But usually, we just pick a large enough number of iterations for the specific application. (Minuit and Scikit-Learn have to work for _all_ applications; hence, they have to be more clever about the stopping condition.) These iterations are called "epochs." I haven't been able to find out why they're called that. 
+++ @@ -119,7 +117,7 @@ targets = torch.tensor(boston_prices_df["MEDV"]).float()[:, np.newaxis] In this and the next code block, we'll train a 5-hidden layer model, first without mini-batches (all data in one big batch) and then with mini-batches. In both cases, we'll set the [torch.manual_seed](https://pytorch.org/docs/stable/generated/torch.manual_seed.html) to make sure that both optimization processes start in the same state, since random luck can also contribute to the time needed for convergence. -Also, let's track the value of the loss function and number of function calls (`model`, `loss_function`, and `optimizer.step()`) as a function of epoch. +Also, let's track the value of the loss function and number of function calls (`model`, `loss_function`, and `optimizer.step`) as a function of epoch. ```{code-cell} ipython3 torch.manual_seed(12345) @@ -267,7 +265,7 @@ If the whole dataset does not fit in the processor's working memory, then some k are much more expensive than almost any conceivable model. Thus, you'll generally do better to pass the batches of data that you get from the source into the training step to update the optimization state as much as possible with the data you have, before undertaking the time-consuming step of accumulating the next batch. -Generally, a good mini-batch size for minimizing training time is just small enough for the working memory to fit within your processor's memory size, such as the GPU global memory. If you're tuning other hyperparameters, such as learning rate, the numbers of layers and hidden layer sizes, and regularization, re-tune the mini-batch size _last_, since it depends on all of the rest. +Generally, a good mini-batch size for minimizing training time is just small enough for the working memory to fit within your processor's memory size, such as the GPU global memory. 
If you're tuning fitter options, such as learning rate, or the numbers of layers and hidden layer sizes and [regularization](16-regularization.md), re-tune the mini-batch size _last_, since it depends on all of the rest and they don't depend on it. +++ @@ -277,7 +275,7 @@ Generally, a good mini-batch size for minimizing training time is just small eno There's another reason to train in small batches: it helps to avoid getting stuck in local minima and end up with a model that generalizes well. -In problem domains like stock market prediction, models need to be updated with every new data point: this is called [online machine learning](https://en.wikipedia.org/wiki/Online_machine_learning). It's an extreme form of mini-batching in which the `NUMBER_IN_BATCH` is 1. Each new data point pulls the fitter toward a different minimum, but roughly in the same direction, and this noise prevents the model from memorizing specific features of a constant dataset. Due to the success that online training processes have had over fixed-dataset processes, we'd expect to find some balance between very noisy online training and smooth-but-biased full-dataset training. We'll discuss techniques to _measure_ overfitting in later sections, but reducing the batch size is one way to reduce overfitting. +In problem domains like stock market prediction, models need to be updated with every new data point: this is called [online machine learning](https://en.wikipedia.org/wiki/Online_machine_learning). It's an extreme form of mini-batching in which the `NUMBER_IN_BATCH` is 1. Each new data point pulls the fitter toward a different minimum, but roughly in the same direction, and this noise prevents the model from memorizing specific features of a constant dataset. Due to the success that online training processes have had over fixed-dataset processes, we'd expect to find some balance between very noisy online training and smooth-but-biased full-dataset training. 
We'll discuss techniques to _measure_ [overfitting](15-under-overfitting.md) in [later sections](18-hyperparameters.md), but reducing the batch size is one way to reduce overfitting. Let's see the "roughness" of the loss function as a function of batch size for our sample problem. We'll use the `model` that has already been optimized above, so we're near one of its minima. diff --git a/deep-learning-intro-for-hep/14-kernel-trick.md b/deep-learning-intro-for-hep/14-kernel-trick.md index 66c4e14..21ef67d 100644 --- a/deep-learning-intro-for-hep/14-kernel-trick.md +++ b/deep-learning-intro-for-hep/14-kernel-trick.md @@ -274,7 +274,7 @@ But we _can't_ do that with the adaptive basis functions. +++ -This process of choosing relevant combinations of input variables and adding them to a model as features is sometimes called "feature engineering." If you know what features are relevant, as in the circle problem, then computing them explicitly as inputs will improve the fit and its generalization. If you're not sure, including them in the mix with the basic features can only help the model find a better optimum (possibly by overfitting; see the next section). The laundry list of features in the Boston House Prices dataset is an example of feature engineering. +This process of choosing relevant combinations of input variables and adding them to a model as features is sometimes called "feature engineering." If you know what features are relevant, as in the circle problem, then computing them explicitly as inputs will improve the fit and its generalization. If you're not sure, including them in the mix with the basic features can only help the model find a better optimum (possibly by overfitting; see the [next section](15-under-overfitting.md)). The laundry list of features in the Boston House Prices dataset is an example of feature engineering. 
Sometimes, feature engineering is presented as a bad thing, since you, as data analyst, are required to be more knowledgeable about the details of the problem. The ML algorithm isn't "figuring it out for you." That's a reasonable perspective if you're developing ML algorithms and you want them to apply to any problem, regardless of how deeply understood those problems are. It's certainly impressive when an ML algorithm rediscovers features we knew were important or discovers new ones. However, if you're a data analyst, trying to solve one particular problem, and you happen to know about some relevant features, it's in your interest to include them in the mix! diff --git a/deep-learning-intro-for-hep/15-under-overfitting.md b/deep-learning-intro-for-hep/15-under-overfitting.md index c96ea94..1d6b1a0 100644 --- a/deep-learning-intro-for-hep/15-under-overfitting.md +++ b/deep-learning-intro-for-hep/15-under-overfitting.md @@ -59,7 +59,7 @@ plt.show() The first, a 0th degree polynomial, is just an average. An average is the simplest model that has any relationship to the training data, but it's usually too simple: it doesn't characterize any functional relationships between features and targets and its predictions are sometimes far from the targets. This model _underfits_ the data. -The second, a 2nd degree polynomial, is just right for this data because it was generated from a 2nd degree polynomial ($1 + 4x - 5x^2$). +The second, a 2nd degree polynomial, is just right for this dataset because it was generated to have an underlying quadratic relationship, $1 + 4x - 5x^2$. The third, a 9th degree polynomial, has more detail than the dataset itself. It makes false claims about what data would do between the points and beyond the domain. It _overfits_ the data. @@ -185,7 +185,7 @@ plt.show() The optimizer used all of the parameters it had available to shrink-wrap the model around the training data. 
ReLU segments are carefully positioned to draw the decision boundary around every data point, sometimes even making islands to correctly categorize outlier penguins. This model is very overfitted. -The problem is that this model does not generalize. If we had more penguins to measure (sadly, we don't), they probably wouldn't line up with this decision boundary. There might be more orange (Gentoo) outliers in the region dominated by green (Chinstrap), but not likely in the same position as the outlier in this dataset, any islands this model might have drawn around the training-set outliers would be useless for categorizing new data. +The problem is that this model does not generalize. If we had more penguins to measure (sadly, we don't), they probably wouldn't line up with this decision boundary. There might be more orange (Gentoo) outliers in the region dominated by green (Chinstrap), but not likely in the same position as the outlier in this dataset, and any islands this model might have drawn around the training-set outliers would be useless for categorizing new data. +++ @@ -193,7 +193,7 @@ The problem is that this model does not generalize. If we had more penguins to m +++ -First, you need a way to even know whether you're under or overfitting. In the regression example, overlaying the fits on more data drawn from the same distribution revealed that the 0th and 9th degree polynomials are inaccurate/biased in some regions of $x$. Judging a model by data that were not used to fit it is a powerful technique. In general, you'll want to split your data into a subsample for training and another subsample reserved for validation and tests (which we'll cover in a later section). +First, you need a way to even know whether you're under or overfitting. In the regression example, overlaying the fits on more data drawn from the same distribution revealed that the 0th and 9th degree polynomials are inaccurate/biased in some regions of $x$. 
Judging a model by data that were not used to fit it is a powerful technique. In general, you'll want to split your data into a subsample for training and another subsample reserved for validation and tests (which we'll cover in a [later section](18-hyperparameters.md)). If you know that you're underfitting a model, you can always add more parameters. That's the benefit of neural networks over ansatz fits: if an ansatz is underfitting the data, then there are correlations in the data that you don't know about. You need to fully understand them and put them in your fit function before you can make accurate predictions. But if a neural network is underfitting, just add more layers and/or more vector components per layer. diff --git a/deep-learning-intro-for-hep/16-regularization.md b/deep-learning-intro-for-hep/16-regularization.md index ab48e22..15eae30 100644 --- a/deep-learning-intro-for-hep/16-regularization.md +++ b/deep-learning-intro-for-hep/16-regularization.md @@ -118,7 +118,7 @@ In the second plot, a very weak $\lambda_{L2} = 10^{-14}$ is almost—but not qu To get visible effects, I had to set $\lambda_{L1}$ and $\lambda_{L2}$ to very different orders of magnitude, and the Lasso (L1) regularization needed more iterations (`max_iter`) to converge. -(One more thing that I should point out: if you want to use this technique in a real application, use [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials) as the terms in the linear fit, rather than $x^0$, $x^1$, $x^2$, etc., since [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials) are orthonormal with uniform weight on the interval $(-1, 1)$. Also, scale $(-1, 1)$ to the domain of your actual data. Notice that in this fit, the highly constrained polynomials are more constrained by the point at $x = 1$ than any other point!) 
+(One more thing that I should point out: if you want to use this technique in a real application, use [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials) as the terms in the linear fit, rather than $x^0$, $x^1$, $x^2$, etc., since [Legendre polynomials](https://en.wikipedia.org/wiki/Legendre_polynomials) are orthonormal with uniform weight on the interval $(-1, 1)$. Also, scale $(-1, 1)$ to the domain of your actual data. Notice that in this fit, the highly constrained polynomials are more constrained by the point at $x = 1$ than any other point—that's because of the non-uniform weight on this interval!) +++ @@ -235,7 +235,7 @@ Remember that the 13 features of this dataset quantify a variety of things, all * TAX: full-value property-tax rate per \$10,000 * PTRATIO: pupil-teacher ratio by town * B: $1000(b - 0.63)^2$ where $b$ is the proportion of Black residents -* LSTAT: % lower status by population +* LSTAT: \% lower status by population This is a "throw everything into the model and see what's relevant" kind of analysis. If the dataset is large enough, unimportant features should have a slope that is statistically consistent with zero, but this dataset only has 506 rows (towns near Boston). Rather than a null hypothesis analysis, let's find out which features are the most disposable by fitting it with Lasso (L1) regularization, to see which features it zeros out first. @@ -482,7 +482,7 @@ Unlike L1 and L2 regularization, the parameter values are not much smaller than +++ -In an upcoming section, we'll cover hyperparameter optimization and validation, in which we'll divide a dataset into a subset for training and another (one or two) for tests. The value of the loss function on the training dataset can get arbitrarily small as the model overfits, but the value of the loss function on data not used in training is a good indication of how well the model generalizes. 
+In an [upcoming section](18-hyperparameters.md), we'll cover hyperparameter optimization and validation, in which we'll divide a dataset into a subset for training and another (one or two) for tests. The value of the loss function on the training dataset can get arbitrarily small as the model overfits, but the value of the loss function on data not used in training is a good indication of how well the model generalizes. Let's see what the training and test loss look like for an overfitted model. Below, we're using [Datasets](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [DataLoaders](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for convenience, because they work with PyTorch's [random_split](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split) function. diff --git a/deep-learning-intro-for-hep/18-hyperparameters.md b/deep-learning-intro-for-hep/18-hyperparameters.md index bac7816..5fe1bd2 100644 --- a/deep-learning-intro-for-hep/18-hyperparameters.md +++ b/deep-learning-intro-for-hep/18-hyperparameters.md @@ -57,9 +57,9 @@ as well as other changeable aspects of the training procedure, * distribution of initial parameter values in each layer, * number of epochs and mini-batch size. -It would be confusing to call these choices "parameters," so we call them "**hyperparameters**" ("hyper" means "over or above"). The problem of finding the best model is split between you, the human, choosing hyperparameters and the optimizer choosing parameters. Following the farming analogy from the [Overview](01-overview.md), the hyperparameters are choices that the farmer gets to make—how much water, how much sun, etc. The parameters are the low-level aspects of how a plant grows, where its leaves branch, how its veins and roots organize themselves to survive. Generally, there are a lot more parameters than hyperparameters. 
+It would be confusing to call these choices "parameters," so we call them "**hyperparameters**" ("hyper" means "over or above"). The problem of finding the best model is a collaboration between you, the human, choosing hyperparameters and the optimizer choosing parameters. Following the farming analogy from the [Overview](01-overview.md), the hyperparameters are choices that the farmer gets to make—how much water, how much sun, etc. The parameters are the low-level aspects of how a plant grows, where its leaves branch, how its veins and roots organize themselves to survive. Generally, there are a lot more parameters than hyperparameters. -If there are a lot of hyperparameters to tune, we might want to tune them algorithmically—maybe with a grid search, randomly, or with Bayesian optimization. Technically, I suppose they become parameters, or we get a three-level hierarchy: parameters, hyperparameters, and hyperhyperparameters! Practitioners might not use consistent terminology ("using ML to tune hyperparameters" is a contradiction in terms), but just don't get confused about who is optimizing which: algorithm 1, algorithm 2, or human. Even if some hyperparameters are being tuned by an algorithm, some of them must be chosen by hand. For instance, you choose a type of ML algorithm, maybe a neural network, maybe something else, and non-numerical choices about the network topology are generally hand-chosen. If a grid search, random search, or Bayesian optimization is choosing the rest, you do have to set the grid spacing for the grid search, the number of trials and measure of the random search, or various options in the Bayesian search. Or, a software package that you use chooses for you. +If there are a lot of hyperparameters to tune, we might want to tune them algorithmically—maybe with a grid search, randomly, or with Bayesian optimization. 
Technically, I suppose they then become parameters, or we get a three-level hierarchy: parameters, hyperparameters, and hyperhyperparameters! Practitioners might not use consistent terminology ("using ML to tune hyperparameters" is a contradiction in terms), but just don't get confused about who is optimizing which: algorithm 1, algorithm 2, or human. Even if some hyperparameters are being tuned by an algorithm, some of them must be chosen by hand. For instance, you choose a type of ML algorithm, maybe a neural network, maybe something else, and non-numerical choices about the network topology are generally hand-chosen. If a grid search, random search, or Bayesian optimization is choosing the rest, you do have to set the grid spacing for the grid search, the number of trials and measure of the random search, or various options in the Bayesian search. Or, a software package that you use chooses for you. +++ @@ -69,12 +69,12 @@ If there are a lot of hyperparameters to tune, we might want to tune them algori In the section on [Regularization](16-regularization.md), we split a dataset into two samples and computed the loss function on each. -* **Training:** loss computed from the training dataset is used to change the parameters of the model. Training loss can get arbitrarily small as the model is adjusted to fit the training data points exactly (if it has enough parameters to be so flexible). +* **Training:** loss computed from the training dataset is used to change the parameters of the model. Thus, the loss computed in training can get arbitrarily small as the model is adjusted to fit the training data points exactly (if it has enough parameters to be so flexible). * **Test:** loss computed from the test dataset acts as an independent measure of the model quality. A model generalizes well if it is a good fit (has minimal loss) on both the training data and data drawn from the same distribution: the test dataset. 
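The difference between the two bullets above can be seen in a small sketch. This is not the course's own example: it assumes made-up toy data (a noisy straight line), uses NumPy polynomial fits instead of a neural network, and all of the names (`train_idx`, `test_idx`, `mse`, `losses`) are hypothetical. Only the training points are used to choose the parameters; the held-out points are only used to measure the loss.

```python
import numpy as np

rng = np.random.default_rng(12345)  # fixed seed so that the sketch is reproducible

# Hypothetical toy data: a straight line plus Gaussian noise
x = rng.uniform(-1, 1, 40)
y = 2.0 * x + rng.normal(0, 0.3, 40)

# Hold out a quarter of the points as the test dataset
order = rng.permutation(len(x))
test_idx, train_idx = order[:10], order[10:]


def mse(coefficients, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    prediction = np.polynomial.polynomial.polyval(x, coefficients)
    return np.mean((prediction - y) ** 2)


losses = {}
for degree in (1, 5, 15):
    # Only the training points are used to choose the parameters
    coefficients = np.polynomial.polynomial.polyfit(
        x[train_idx], y[train_idx], degree
    )
    train_loss = mse(coefficients, x[train_idx], y[train_idx])
    test_loss = mse(coefficients, x[test_idx], y[test_idx])
    losses[degree] = (train_loss, test_loss)
    print(f"degree {degree:2d}: training loss {train_loss:.4f}, test loss {test_loss:.4f}")
```

The training loss cannot increase as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The test loss has no such guarantee, which is why it is the more honest measure of how well the fit generalizes.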
Suppose that I set up an ML model with some hand-chosen hyperparameters, optimize it for the training dataset, and then I don't like how it performs on the test dataset, so I adjust the hyperparameters and run again. And again. After many hyperparameter adjustments, I find a set that optimizes both the training and the test datasets. Is the test dataset an independent measure of the model quality? -It's not a fair test because my hyperparameter optimization is not a fundamentally different thing from the automated parameter optimization. When I adjust hyperparameters, look at how the loss changes, and use that information to either revert the hyperparameters or make another change, I am acting as a minimization algorithm—just a slow, low-dimensional one. +It's not a fair test because my hyperparameter optimization is the same kind of thing as the automated parameter optimization. When I adjust hyperparameters, look at how the loss changes, and use that information to either revert the hyperparameters or make another change, I am acting as a minimization algorithm—just a slow, low-dimensional one. Since we do need to optimize (some of) the hyperparameters, we need a third data subsample: @@ -142,7 +142,7 @@ For completeness, I should mention an alternative to allocating a validation dat ![](img/grid_search_cross_validation.png){. width="75%"} -After isolating a test sample (maybe 20%), you +After isolating a test sample, you 1. subdivide the remaining sample into $k$ subsamples, 2. for each $i \in [0, k)$, combine all data except for subsample $i$ into a training dataset $T_i$ and use subsample $i$ as a validation dataset $V_i$, diff --git a/deep-learning-intro-for-hep/19-goodness-of-fit.md b/deep-learning-intro-for-hep/19-goodness-of-fit.md index a0a0cbb..a9c0b09 100644 --- a/deep-learning-intro-for-hep/19-goodness-of-fit.md +++ b/deep-learning-intro-for-hep/19-goodness-of-fit.md @@ -77,11 +77,11 @@ Story time: > > The next day, the same thing happened. 
> -> The day after that, there really was a wolf and the shepherd cried, "Wolf!" but the villagers refused to be fooled again and didn't help. The wolf ate all the sheep. +> The day after that, there actually was a wolf and the shepherd cried, "Wolf!" but the villagers refused to be fooled again and didn't help. The wolf ate all the sheep. Each day, there are two possible truth states: -* there really is a wolf +* there actually is a wolf * there is no wolf and two possible claims by a model that purports to represent reality (the shepherd boy's utterances): @@ -277,9 +277,9 @@ The F₁ score ranges from 0 (worst) to 1 (best), and there is a [family of Fᵦ +++ -Returning to the Boy Who Cried Wolf, the 4 outcomes are not equally good or bad. Public scorn is not as bad as all the sheep being eaten, so if the shepherd boy has any doubt about whether a wolf-like animal is really a wolf, he should err on the side of calling it a wolf. +Returning to the Boy Who Cried Wolf, the 4 outcomes are not equally good or bad. Public scorn is not as bad as all the sheep being eaten, so if the shepherd boy has any doubt about whether a wolf-like animal is actually a wolf, he should err on the side of calling it a wolf. -Suppose that we really want to catch as many blue (positive) points as possible, and we're willing to accept some contamination of orange (negative). Instead of cutting the model at its 50% threshold, perhaps we should cut it at its 90% threshold: +Suppose that we want to catch as many blue (positive) points as possible, and we're willing to accept some contamination of orange (negative). Instead of cutting the model at its 50% threshold, perhaps we should cut it at its 90% threshold: ```{code-cell} ipython3 fig, ax = plt.subplots(figsize=(5, 5)) @@ -365,7 +365,8 @@ Some properties of the ROC curve: * The better a model is, the closer it gets to $(0, 1)$ (zero false positive rate and perfect true positive rate).
* If the model is completely uninformative (randomly guessing), then the ROC curve is a diagonal line from $(0, 0)$ to $(1, 1)$. (For finite datasets, it's a noisy diagonal line: it only approaches a perfect diagonal when evaluated on very large datasets.) * If you plot the ROC curve with higher resolution than the size of the dataset it is evaluated on, as above, it consists of horizontal and vertical segments as the threshold curve scans across individual points (of the two categories). -* Be sure to plot the ROC curve using the validation dataset while tuning hyperparameters and only use the test dataset after the model is fixed. + +Be sure to plot the ROC curve using the validation dataset while tuning hyperparameters and only use the test dataset after the model is fixed. For completeness, let's see the ROC curves for a perfect model (always correct) and a model that randomly guesses. diff --git a/deep-learning-intro-for-hep/22-beyond-supervised.md b/deep-learning-intro-for-hep/22-beyond-supervised.md index 612e960..526f4e5 100644 --- a/deep-learning-intro-for-hep/22-beyond-supervised.md +++ b/deep-learning-intro-for-hep/22-beyond-supervised.md @@ -14,3 +14,5 @@ kernelspec: --- # Beyond supervised regression & classification + +_To be completed soon!_ diff --git a/deep-learning-intro-for-hep/23-autoencoders.md b/deep-learning-intro-for-hep/23-autoencoders.md index 3b42eb3..581e634 100644 --- a/deep-learning-intro-for-hep/23-autoencoders.md +++ b/deep-learning-intro-for-hep/23-autoencoders.md @@ -14,3 +14,5 @@ kernelspec: --- # Autoencoders + +_To be completed soon!_ diff --git a/deep-learning-intro-for-hep/24-convolutional.md b/deep-learning-intro-for-hep/24-convolutional.md index 87ee6e8..1e28b5d 100644 --- a/deep-learning-intro-for-hep/24-convolutional.md +++ b/deep-learning-intro-for-hep/24-convolutional.md @@ -14,3 +14,5 @@ kernelspec: --- # Convolutional Neural Networks (CNNs) + +_To be completed soon!_ diff --git
a/deep-learning-intro-for-hep/25-deepsets-and-graphs.md b/deep-learning-intro-for-hep/25-deepsets-and-graphs.md index 95b7f6f..6c25a74 100644 --- a/deep-learning-intro-for-hep/25-deepsets-and-graphs.md +++ b/deep-learning-intro-for-hep/25-deepsets-and-graphs.md @@ -14,3 +14,5 @@ kernelspec: --- # DeepSets and Graph Neural Networks (GNNs) + +_To be completed soon!_ diff --git a/deep-learning-intro-for-hep/img/artificial-neural-network-layers-4.svg b/deep-learning-intro-for-hep/img/artificial-neural-network-layers-4.svg index 6c31b4c..d41b177 100644 --- a/deep-learning-intro-for-hep/img/artificial-neural-network-layers-4.svg +++ b/deep-learning-intro-for-hep/img/artificial-neural-network-layers-4.svg @@ -25,9 +25,9 @@ showgrid="false" inkscape:object-nodes="false" inkscape:snap-midpoints="false" - inkscape:zoom="0.73573123" - inkscape:cx="1071.0433" - inkscape:cy="255.5281" + inkscape:zoom="1.0404811" + inkscape:cx="1090.8416" + inkscape:cy="182.60784" inkscape:window-width="2554" inkscape:window-height="1365" inkscape:window-x="1920" @@ -2111,26 +2111,26 @@ width="26.68256" height="26.68256" x="80.085602" - y="421.52692" /> + y="405.65195" /> xx2 + d="m 106.76816,418.99319 75.37553,79.37488" + id="path26261" + sodipodi:nodetypes="cc" /> + d="m 106.76816,418.99319 75.37553,47.62494" + id="path26831" + sodipodi:nodetypes="cc" /> + d="m 106.76816,418.99319 75.37553,15.875" + id="path27001" + sodipodi:nodetypes="cc" /> + d="m 106.76816,418.99319 75.37553,-15.87494" + id="path27171" + sodipodi:nodetypes="cc" /> + d="m 106.76816,418.99319 75.37553,-47.62488" + id="path27374" + sodipodi:nodetypes="cc" /> + y="373.90201" /> xx1 + y="437.40189" /> xx3 + y="421.52692" /> yy2 + y="389.77698" /> yy1 + + x4 + + + + + + + + y3 + + + + + + diff --git a/make-notebooks.sh b/make-notebooks.sh index c284801..09dc9ba 100755 --- a/make-notebooks.sh +++ b/make-notebooks.sh @@ -5,5 +5,6 @@ cd deep-learning-intro-for-hep for x in *.md; do y=`echo $x | sed s/.md/.ipynb/`; jupytext --to 
notebook $x -o ../notebooks/$y --update-metadata '{"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}}'; done cp -a img ../notebooks/img +cp -a data ../notebooks/data cd ..