<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Double Descent 2</title>
<meta name="description" content="Double Descent" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta
name="description"
content="MLU-Explain: An Explanation of Double Descent."
/>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta
property="og:image"
content="https://mlu-explain.github.io/assets/ogimages/ogimage-double-descent2.png"
/>
<meta property="og:title" content="Double Descent 2" />
<meta
property="og:description"
content="An explanation of the Double Descent phenomena in modern machine learning."
/>
<meta property="og:image:width" content="1000" />
<meta property="og:image:height" content="630" />
<link rel="icon" href="./assets/mlu_robot.png" />
<link rel="stylesheet" href="css/katex.min.css" />
<link rel="stylesheet" href="css/styles.scss" />
<!-- Global site tag (gtag.js) - Google Analytics -->
<script
async
src="https://www.googletagmanager.com/gtag/js?id=G-1FYW57GW3G"
></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag("js", new Date());
gtag("config", "G-1FYW57GW3G");
</script>
</head>
<body>
<main>
<div id="intro-icon">
<a href="https://mlu-explain.github.io"
><svg
width="50"
height="50"
viewBox="0 0 234 216"
fill="none"
xmlns="https://www.w3.org/2000/svg"
>
<g id="mlu_robot 1" clip-path="url(#clip0)">
<g>
<path
id="Vector"
d="M90.6641 83.1836C96.8828 83.1836 101.941 78.1289 101.941 71.8906V71.8242C101.941 65.5898 96.8945 60.5312 90.6641 60.5312C84.4453 60.5312 79.3828 65.5898 79.3828 71.8242V71.8906C79.3828 78.1289 84.4336 83.1836 90.6641 83.1836Z"
fill="black"
></path>
<path
id="Vector_2"
d="M143.305 83.1836C149.523 83.1836 154.586 78.1289 154.586 71.8906V71.8242C154.586 65.5898 149.535 60.5312 143.305 60.5312C137.09 60.5312 132.027 65.5898 132.027 71.8242V71.8906C132.027 78.1289 137.078 83.1836 143.305 83.1836Z"
fill="black"
></path>
<path
id="Vector_3"
d="M163.586 159.402H173.609V122.641H163.586V159.402Z"
fill="black"
></path>
<path
id="Vector_4"
d="M60.3594 159.402H70.3867V122.641H60.3594V159.402Z"
fill="black"
></path>
<g id="Group">
<path
id="Vector_5"
d="M182.16 30.0781H51.8047V10.0234H182.16V30.0781ZM182.16 103.609H51.8047V40.1055H182.16V103.609ZM144.559 168.789H89.4062V113.641H144.559V168.789ZM0 0V10.0234H15.8789V46.7891H25.9023V10.0234H41.7812V113.641H79.3867V178.816H96.9297V215.578H106.957V178.816H127.016V215.578H137.039V178.816H154.586V113.641H192.188V10.0234H233.969V0"
fill="black"
></path>
</g>
</g>
</g>
<defs>
<clipPath id="clip0">
<rect width="233.97" height="215.58" fill="white"></rect>
</clipPath>
</defs>
</svg>
<h2 class="logo">MLU-expl<span id="ai">AI</span>n</h2>
</a>
</div>
<section id="intro">
<h1 id="intro-hed">Double Descent</h1>
<h1 class="intro-sub">Part 2: A Mathematical Explanation</h1>
<h3 id="intro__date">
Brent Werness & <a href="https://twitter.com/jdwlbr">Jared Wilber</a>,
December 2021
</h3>
<p id="top-note">
Note: this is part 2 of a two-article series on
<span class="bold">Double Descent</span>. Part 1 is available
<a href="https://mlu-explain.github.io/double-descent/">here</a>.
</p>
<h2 class="subtitle">A Sketch of the Mathematics</h2>
<p class="body-text">
In our
<a href="https://mlu-explain.github.io/double-descent/">
previous discussion of the double descent phenomenon</a
>, we made use of a piecewise linear model fit to our
data. This may seem a somewhat strange choice, but we made it for a
simple reason: you can rigorously prove that it will show double
descent for essentially any dataset! This argument does not seem to
appear in the literature, so we've taken it upon ourselves to at
least sketch it here, providing some mathematical justification for
why double descent occurs.
</p>
<p class="body-text">
This article has three parts. First, we give a very short
discussion of the behavior of the model with a small number of
segments—the classical regime. Second, we explain why the
model must perform poorly at the interpolation threshold. Third, we
identify what happens in the limit of infinitely many linear
pieces.
</p>
<p class="body-text">
Throughout this discussion, we will see that the behavior boils down
to <span class="bold">two core principles:</span>
</p>
<p class="block-text">
<span class="bold">1.</span> At the interpolation threshold, there can
often be a
<i>single</i>
choice of model that works, and there is no reason to believe that
said model will be good.
<br /><br />
<span class="bold">2.</span> In the limit of infinitely large models,
there will be a vast number of interpolating models, and we can pick
the best amongst them.
</p>
<p class="body-text">
To demonstrate our points, we will use a simple example dataset.
It is the following collection of points:
</p>
<span class="katex-eq" id="katex-points"></span>
<p class="body-text">
This example dataset has no noise and is simply a quadratic function
sampled at 6 equally spaced points. We will get to know this dataset
well.
</p>
<div id="chart1" class="chart"></div>
<h2 class="subtitle">Our Piecewise Linear Model</h2>
<p class="body-text">
Let's be precise about exactly what model we are working with. We work
entirely in one dimension, so our input data is a vector
<span class="katex-eq" id="katex-input"></span>, and our target is a
vector <span class="katex-eq" id="katex-target"></span>. Our model
will attempt to fit a piecewise linear function to this dataset, and
the way we'll do that is to pick
<span class="katex-eq" id="katex-knot"></span>
<i>knot points</i> where our linear function will be allowed to bend.
This will be represented by use of <i>radial basis functions</i>; in
particular, we will define:
</p>
<span class="katex-eq" id="katex-radial"></span>
<p class="body-text">and say that our model will be written as</p>
<span class="katex-eq" id="katex-model"></span>
<p class="body-text">
where <span class="katex-eq" id="katex-alpha"></span>,
<span class="katex-eq" id="katex-beta"></span>, and
<span class="katex-eq" id="katex-gamma"></span> are learnable
parameters. It can be readily checked (since the absolute value
function is linear aside from a singularity at the origin) that this
is a parametrization of piecewise linear functions with breakpoints
only allowed at the <span class="katex-eq" id="katex-t1"></span>
points. We will consider the knot points
<span class="katex-eq" id="katex-t2"></span> to be fixed, and picked
as independent and identically distributed random points from a
continuous distribution with density
<span class="katex-eq" id="katex-p1"></span>. Typically we take
<span class="katex-eq" id="katex-p2"></span> to be uniform on the
range of <span class="katex-eq" id="katex-x"></span>, although we
discuss the problem in generality.
</p>
<p class="body-text">
We will fit this model by ordinary least squares regression on our
learnable parameters. As is standard, we will seek the minimum norm
solution when the system is under-determined (when we have more
parameters than data points). To simplify the analysis, we will assume
we are only looking to minimize
<span class="katex-eq" id="katex-gamma2"></span> (excluding
<span class="katex-eq" id="katex-alpha2"></span> and
<span class="katex-eq" id="katex-beta2"></span> from the
minimization).
</p>
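<p class="body-text">
To make this concrete, here is a minimal sketch of the model and its
fitting procedure in Python (assuming NumPy). The dataset coordinates,
the knot interval, and the helper names are our own illustrative
choices, and for simplicity the sketch minimizes the norm of
<i>all</i> coefficients rather than only the &gamma; terms as in the
analysis above.
</p>
<pre><code>
import numpy as np

# Illustrative stand-in for the article's dataset: a noiseless quadratic
# sampled at 6 equally spaced points (exact coordinates are an assumption).
x = np.linspace(-1, 1, 6)
y = x ** 2

def design_matrix(x, knots):
    # Columns: intercept, x, and one radial basis |x - t_j| per knot point.
    return np.column_stack([np.ones_like(x), x] + [np.abs(x - t) for t in knots])

def fit_piecewise_linear(x, y, knots):
    # np.linalg.lstsq returns the minimum-norm least squares solution
    # when the system is under-determined, matching the fitting rule above
    # (up to also penalizing the intercept and slope coefficients).
    coef, *_ = np.linalg.lstsq(design_matrix(x, knots), y, rcond=None)
    return coef

# k = 10 knot points drawn i.i.d. uniformly over the range of x.
knots = np.random.default_rng(0).uniform(-1, 1, size=10)
coef = fit_piecewise_linear(x, y, knots)
</code></pre>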
<h2 class="subtitle">Below The Interpolation Threshold</h2>
<p class="body-text">
While not the focus of the document, let's take a moment to consider
what happens when we are well below the interpolation threshold.
Without any knot points
<span class="katex-eq" id="katex-k0"></span> this is ordinary least
squares regression.
</p>
<div id="chart2" class="chart"></div>
<p class="body-text">
As our dataset is not linear, this fit is not particularly great, and
we will be able to improve it by adding a knot point.
</p>
<div id="chart3" class="chart"></div>
<p class="body-text">
How much this improves performance depends heavily on the dataset, the
ground truth, and the choice of random knot point. Either numerical
testing or an exact computation shows that, on average, the
performance improves over the linear solution with respect to the
<span class="katex-eq" id="katex-q"></span>-th power of any
<span class="katex-eq" id="katex-lq"></span>-norm with
<span class="katex-eq" id="katex-q1"></span>. This includes the MSE
and the MAE metrics.
</p>
<p class="body-text">
So far, this perfectly matches traditional statistical intuition. Our
simple linear fit is poor since the ground truth function is
non-linear. Expanding our model to allow for non-linearity should
improve the fit, as long as there is sufficient data. In this case,
our six data points are enough to provide a better model when you
allow for a single bend at a random location.
</p>
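<p class="body-text">
Continuing the Python sketch from above (same x, y, and helpers), we
can check this numerically: with one random knot, the average training
error drops below that of the plain least squares line.
</p>
<pre><code>
rng = np.random.default_rng(1)

# k = 0: ordinary least squares with a line.
mse_line = np.mean((design_matrix(x, []) @ fit_piecewise_linear(x, y, []) - y) ** 2)

# k = 1: average training error over many random single-knot models.
mse_one_knot = np.mean([
    np.mean((design_matrix(x, [t]) @ fit_piecewise_linear(x, y, [t]) - y) ** 2)
    for t in rng.uniform(-1, 1, size=2000)
])

print(mse_line, mse_one_knot)  # the one-knot average should be smaller
</code></pre>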
<p class="body-text"></p>
<h2 class="subtitle">At The Interpolation Threshold</h2>
<p class="body-text">
We now turn our attention to the behavior <i>at</i> the interpolation
threshold. Our goal here will be to illustrate why this will perform
poorly by showing that, for this model, the average error (averaged
over our random knot points) is infinite!
</p>
<p class="body-text">
Let's visualize an example of a model at the interpolation threshold.
</p>
<div id="chart4" class="chart"></div>
<p class="body-text">
In this case, we see that we have four knot points placed between the
five right-most data points, and then no knot point between the two
left-most ones. The main takeaway we want you to have is that, for
this choice of points, there is no freedom in how the interpolating
model is selected once the knot points are fixed. There is a unique
line which passes through the left two points, and if we want to
exactly interpolate the data we must start with that line.
</p>
<p class="body-text">
This line intersects the vertical blue line denoting the knot point at
<span class="katex-eq" id="katex-point1"></span> at a point which is
uniquely determined by the line on the left, and then it must bend to
pass through the data point at
<span class="katex-eq" id="katex-point2"></span>. From there it hits
the knot point at <span class="katex-eq" id="katex-point3"></span>, at
which point again it must bend to hit the point at
<span class="katex-eq" id="katex-point4"></span>, and so on. There is
never any choice since it needs to be an interpolating function, so
the entire solution is fixed.
</p>
<p class="body-text">
This is one of the main lessons of what happens at the interpolation
threshold—there is <i>rigidity</i> in what model you fit.
Indeed, it is not uncommon that there is exactly <i>one</i> such
choice (and a count of variables and constraints tells us that this
should frequently happen when we have two fewer knot points than data
points, although this is not guaranteed). If there is only one choice,
we have no control over whether that model is good or not, and we
should expect a bad fit.
</p>
<p class="body-text">
To see an example of a bad fit, let's move the left-most knot
point towards the right.
</p>
<div id="animation-chart" class="chart"></div>
<p class="body-text">
This performance is extremely bad—jumping to a value of almost
<span class="katex-eq" id="katex-point5"></span> when predicting a
function whose maximum value is
<span class="katex-eq" id="katex-point6"></span> in this region.
Indeed, we can shift again and make it even more pronounced, jumping
all the way to almost
<span class="katex-eq" id="katex-point7"></span>.
</p>
<div id="chart5" class="chart"></div>
<p class="body-text">
This phenomenon can be made so pronounced that our <i>average</i>
error is infinite for all
<span class="katex-eq" id="katex-lp"></span>-norms.
</p>
<p class="body-text">
First, let's focus on what this picture is showing. We are throwing
down four random knot points, three of which are to the right of
<span class="katex-eq" id="katex-point8"></span>
(for concreteness, say the first three), and one (say the last) occurs
just to the left of <span class="katex-eq" id="katex-point9"></span>.
To be explicit, let's make that left-most point fall in the interval
of <span class="katex-eq" id="katex-interval"></span>. This whole
event occurs with probability
</p>
<span class="katex-eq" id="katex-intervalp"></span>
<p class="body-text">
We will throw out all constants as they are immaterial for the
argument being made, and concentrate only on how this depends on
<span class="katex-eq" id="katex-epsilon"></span>.
</p>
<p class="body-text">
When this occurs, the left-most segment will hit the left-most knot
point somewhere below the point
<span class="katex-eq" id="katex-point10"></span> (since the dataset
comes from a quadratic function that is curving up). To be concrete,
let's say that we ensure this by picking
<span class="katex-eq" id="katex-epsilon2"></span> sufficiently small
that it is no larger than
<span class="katex-eq" id="katex-point11"></span>. Then, at that knot
point, the next linear segment will need to go from a point no better
than <span class="katex-eq" id="katex-better"></span>, which means
that it needs to have a slope of at least
<span class="katex-eq" id="katex-slope"></span> to pass through the
point <span class="katex-eq" id="katex-point12"></span>.
</p>
<p class="body-text">
With this enormous slope, we have our key component. It means that our
function has started oscillating wildly, and we simply need to show it
is wild enough to make the average error infinite under any
<span class="katex-eq" id="katex-lp2"></span>
norm. Notice that we have no other knot points in the region from
<span class="katex-eq" id="katex-interval2"></span>. On that region,
our interpolating function takes on a value no smaller than
<span class="katex-eq" id="katex-smaller"></span>, where
<span class="katex-eq" id="katex-slope2"></span> is the slope, and
thus the value and the error are both of order
<span class="katex-eq" id="katex-error"></span> on that interval of
width <span class="katex-eq" id="katex-width"></span>.
</p>
<p class="body-text">
Now we finish the argument. By considering the sequence of
<span class="katex-eq" id="katex-seq"></span> we can create a sequence
of disjoint events which we can use to lower-bound our expected error.
In particular, letting
<span class="katex-eq" id="katex-fhat"></span> be this piecewise
linear interpolating function, and
<span class="katex-eq" id="katex-f"></span> be the ground-truth
quadratic, we see that we may estimate (discarding constants)
</p>
<span class="katex-eq" id="katex-estimate"></span>
<p class="body-text">
where the norm is understood as the integral over the range
<span class="katex-eq" id="katex-interval3"></span> (in essence the
infinite data limit). Thus the average error, as measured by the
<span class="katex-eq" id="katex-q2"></span>-norm (for
<span class="katex-eq" id="katex-qge1"></span>), is infinite. Indeed,
for many reasonable metrics beyond the
<span class="katex-eq" id="katex-q3"></span>-norm, this error should
be infinite.
</p>
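<p class="body-text">
Because the equations above are injected at render time (see
js/katexCalls.js), we also spell out the shape of this bookkeeping as
plain LaTeX source. This is only a schematic with our own assumed
constants: the event for a given &epsilon; occurs with probability of
order &epsilon; and forces an error of order 1/&epsilon;, with region
widths and all other constants absorbed.
</p>
<pre><code>
% Schematic only: assume Pr(event) is of order eps, the forced error is of
% order 1/eps, and sum over the disjoint events given by eps_m = 2^{-m}.
\mathbb{E}\left[ \lVert \hat{f} - f \rVert_q^q \right]
  \;\gtrsim\; \sum_{m=1}^{\infty}
      \Pr(\text{event for } \varepsilon_m) \cdot \varepsilon_m^{-q}
  \;\approx\; \sum_{m=1}^{\infty} \varepsilon_m^{\,1-q}
  \;=\; \sum_{m=1}^{\infty} 2^{m(q-1)}
  \;=\; \infty \qquad (q \ge 1).
</code></pre>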
<p class="body-text">
If you follow through this argument in generality, you get the
following fact:
</p>
<p class="block-text">
<span class="bold">Proposition 1.</span> For any dataset
<span class="katex-eq" id="katex-prop1"></span> with at
<span class="katex-eq" id="katex-prop2"></span> points, not all
co-linear, any bounded ground truth function
<span class="katex-eq" id="katex-prop3"></span>, and any
<span class="katex-eq" id="katex-prop4"></span> , then the best fit
model of this type,
<span class="katex-eq" id="katex-prop5"></span> with
<span class="katex-eq" id="katex-prop6"></span> knot points has
<span class="katex-eq-prop" id="katex-prop7"></span>
</p>
<p class="body-text">
The moral of this story is that these models are so rigid that bad
behavior is guaranteed independent of the dataset or ground truth.
Simply the fact that you have a unique interpolating function that can
only bend at random points forces the fit to be poor on average when
you are at the interpolation threshold.
</p>
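<p class="body-text">
A quick numerical illustration of this rigidity, continuing the Python
sketch from above (our illustrative dataset has 6 points, so the
interpolation threshold sits at k = 4 knots): across random knot
draws, the mean error dwarfs the median, the numerical signature of
the heavy tails that make the average infinite.
</p>
<pre><code>
rng = np.random.default_rng(2)
grid = np.linspace(-1, 1, 400)   # stand-in for error over the whole interval

errors = []
for _ in range(20000):
    knots = rng.uniform(-1, 1, size=4)          # k = 4: at the threshold
    coef = fit_piecewise_linear(x, y, knots)    # the unique interpolating model
    pred = design_matrix(grid, knots) @ coef
    errors.append(np.mean(np.abs(pred - grid ** 2)))  # error vs. the quadratic

print(np.median(errors), np.mean(errors))  # mean is far above the median
</code></pre>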
<h2 class="subtitle">To Infinity! (But Not Beyond)</h2>
<p class="body-text">
We now understand why the model performs poorly at the interpolation
threshold. In essence, there is only one model that can be used, and
that model has no reason to be a reasonable interpolating function at
all! This provides the first hint as to why larger models might be
better: they provide many choices about which interpolating model to
pick, and perhaps we can structure the model in such a way that the
interpolation of the prediction between points is much better behaved.
</p>
<p class="body-text">
To see why this is so, suppose we have a nice function (at least twice
continuously differentiable) and we select a large number of random
knot points
<span class="katex-eq" id="katex-inf1"></span>from our density
<span class="katex-eq" id="katex-inf2"></span>. Recall that the way we
are selecting the model is to minimize
<span class="katex-eq" id="katex-inf3"></span>, which is the norm of
the coefficients we put in front of our basis functions. Our goal is
to provide an interpretation of this norm in terms of our smooth
function <span class="katex-eq" id="katex-inf4"></span>.
</p>
<div id="delta-chart" class="chart"></div>
<p class="body-text">
In the above figure we see three consecutive knot points:
<span class="katex-eq" id="katex-inf5"></span>. We will let
<span class="katex-eq" id="katex-inf6"></span> and
<span class="katex-eq" id="katex-inf7"></span> represent the slopes of
the two segments adjacent to
<span class="katex-eq" id="katex-inf8"></span>.
<span class="katex-eq" id="katex-inf9"></span> denotes the distance
between the midpoints of those two segments. We now collect three
facts:
</p>
<div class="list">
<p class="list-text">
<span class="bold">1.</span> Since
<span class="katex-eq" id="katex-inf10"></span>and
<span class="katex-eq" id="katex-inf11"></span>are two consecutive
intervals in our piecewise linear function, we can express the
difference in the slope in terms of the coefficients of the model.
In particular, note that
<span class="katex-eq" id="katex-inf12"></span>,
<span class="katex-eq" id="katex-inf13"></span>, and
<span class="katex-eq" id="katex-inf14"></span> for all
<span class="katex-eq" id="katex-inf15"></span> are all linear
across both intervals. This means that they all contribute the same
constant expression to the slope on both sides; thus the difference
in slopes <span class="katex-eq" id="katex-inf16"></span> is simply
the difference in slopes between the two sides of
<span class="katex-eq" id="katex-inf17"></span>. This difference is
<span class="katex-eq" id="katex-inf18"></span>. Thus:
</p>
<span class="katex-eq-list" id="katex-inf19"></span>
<p class="list-text">
<span class="bold">2.</span> Since the points on our piecewise
linear interpolation are assumed to be close together, and all lie
on our smooth function
<span class="katex-eq" id="katex-inf20"></span>, we can say that
both <span class="katex-eq" id="katex-inf21"></span> and
<span class="katex-eq" id="katex-inf22"></span> are approximately
equal to the derivative of
<span class="katex-eq" id="katex-inf23"></span> at the midpoints of
the intervals. Thus, their difference quotient is approximately the
second derivative at
<span class="katex-eq" id="katex-inf24"></span>:
</p>
<span class="katex-eq-list" id="katex-inf25"></span>
<p class="list-text">
<span class="bold">3.</span> Finally, let us collect two
approximations for the probability that a randomly sampled point
from <span class="katex-eq" id="katex-inf26"></span> lands in the
interval connecting the two midpoints. On the one hand, the
definition of density tells us this probability is obtained by
integrating <span class="katex-eq" id="katex-inf27"></span> over the
interval, which (because the interval is small) is approximately
equal to the rectangular approximation
<span class="katex-eq" id="katex-inf28"></span>. On the other hand,
we sampled <span class="katex-eq" id="katex-inf29"></span> points
from <span class="katex-eq" id="katex-inf30"></span>, and one landed
in that interval, so we may also approximate it by
<span class="katex-eq" id="katex-inf31"></span>. Thus:
</p>
<span class="katex-eq-list" id="katex-inf32"></span>
</div>
<p class="body-text"></p>
<p class="body-text">
Note: this third approximation is too rough to derive what follows
fully rigorously. Instead, you need to consider an intermediate scale made
by collecting together, for instance,
<span class="katex-eq" id="katex-inf33"></span> many points. Since
<span class="katex-eq" id="katex-inf34"></span> and
<span class="katex-eq" id="katex-inf35"></span>, this lets us both
assume the second derivative is approximately constant in that region
and that the length of the interval has random fluctuations much
smaller than the length itself. Given this is not a formal paper, we
do not follow this any further.
</p>
<p class="body-text">
We can now try to understand what our norm minimization means in
terms of the curve. By the second fact above, we may write:
</p>
<span class="katex-eq" id="katex-inf36"></span>
<p class="body-text">
Applying the third fact above to one of the
<span class="katex-eq" id="katex-inf37"></span> yields:
</p>
<span class="katex-eq" id="katex-inf38"></span>
<p class="body-text">
The last equation is a discrete approximation to the integral of
<span class="katex-eq" id="katex-inf39"></span>, so we can conclude:
</p>
<span class="katex-eq" id="katex-inf40"></span>
<p class="body-text">
We typically consider the case where
<span class="katex-eq" id="katex-inf41"></span> is constant on an
interval and zero outside it, so if we assume that
<span class="katex-eq" id="katex-inf42"></span> outside our interval
(it is linear outside the interval), then we may conclude that we are
trying to minimize
</p>
<span class="katex-eq" id="katex-inf43"></span>
<p class="body-text">
This is the key observation. By minimizing the
<span class="katex-eq" id="katex-inf44"></span>-norm of our
parameters, we are actually minimizing the integral of the square of
the second derivative of our smooth interpolating function. Since
second derivatives tell us the change in derivative, this can be
interpreted as trying to find an interpolating function whose
derivative changes as little as possible, or equivalently one which is
as close to linear as possible.
</p>
<p class="body-text">
This idea is so fundamental that it has been studied extensively
and goes by the name of the <i>natural cubic spline</i> (see, for
example, these notes
<a class="reference-link" href="#reference1">[1]</a>
or Chapter 11 of Ascher &amp; Greif
<a class="reference-link" href="#reference2">[2]</a>). This is a
well-studied class of interpolating functions.
</p>
<p class="body-text">
This is only one portion of a full proof. Some additional work, such
as showing that any convergent subsequence of discrete minimizers must
converge to a function that is twice differentiable, is needed.
However, once that is done, we obtain the following theorem.
</p>
<p class="block-text">
<span class="bold">Theorem.</span> Take a sequence of our
approximating functions
<span class="katex-eq" id="katex-inf45"></span> associated to the
first <span class="katex-eq" id="katex-inf46"></span> from an infinite
sequence of independent and identically distributed knot points drawn
uniformly from an interval containing your dataset. Then the sequence
<span class="katex-eq" id="katex-inf47"></span> almost surely
converges uniformly on that interval to the natural cubic spline.
</p>
<p class="body-text">
We picture the cubic spline and an interpolating function with 1000
knot points below. Notice that, as predicted by the theory above, the
two curves are indistinguishable.
</p>
<div id="chart6" class="chart"></div>
<h2 class="subtitle">Summary</h2>
<p class="body-text">Let's recall what we've seen.</p>
<p class="body-text">
In the first section, we briefly discussed the classical regime to see
that indeed adding a single point does better than the simple linear
fit. This tells us that, for this problem, there is a benefit in
adding non-linearity.
</p>
<p class="body-text">
In the second section, we continued to add knot points until the
number of parameters exactly matched the number of data points. This
is the point at which we expect interpolation to hold. In this case,
we showed that no matter the choice of data or target function, the
average error <i>must</i> be infinite (as measured by any
<span class="katex-eq" id="katex-inf48"></span>-norm) simply owing to
the fact that there is often only a single interpolating function,
which with reasonably high probability is wildly behaved.
</p>
<p class="body-text">
Finally, in the third section, we took the limit of an infinite number
of knot points and saw that our condition of taking the minimum norm
solution corresponds to taking the smoothest interpolating function as
measured by the integral of the square of the second derivative.
<span class="bold"
>This is exactly the energy minimized by the
<i>natural cubic spline</i> interpolation.</span
>
</p>
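<p class="body-text">
Putting the three regimes together, the sketch below (again reusing
the Python helpers from above) traces test error against model size
for our illustrative dataset: it dips in the classical regime, spikes
near the threshold at k = 4 (where the average is infinite, so the
Monte Carlo numbers there are erratic), and descends again for large
k.
</p>
<pre><code>
def mean_test_error(k, n_trials=500, rng=np.random.default_rng(4)):
    # Test error against the ground-truth quadratic on a dense grid,
    # averaged over random draws of k knot points.
    grid = np.linspace(-1, 1, 200)
    errs = []
    for _ in range(n_trials):
        knots = rng.uniform(-1, 1, size=k)
        coef = fit_piecewise_linear(x, y, knots)
        pred = design_matrix(grid, knots) @ coef
        errs.append(np.mean((pred - grid ** 2) ** 2))
    return np.mean(errs)

for k in [0, 1, 2, 3, 4, 5, 8, 16, 64, 256]:   # threshold at k = 4
    print(k, mean_test_error(k))
</code></pre>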
<p class="block-text">
In this case, we've reduced double descent to a completely
unremarkable fact: that for a quadratic function, the cubic spline
interpolating the points is a better approximation than any piecewise
linear function with four or fewer segments! The fact that the model
performs poorly at the interpolation threshold of five segments is an
unavoidable consequence of the rigidity of the model forcing a single
choice of interpolating function over which we have essentially no
control.
</p>
<p class="body-text">
This exemplifies one of the reasons why people believe that double
descent occurs: that the interpolating functions you find using very
large models can be better behaved <i>between</i> the points you are
interpolating, and thus can match the ground truth better if you
structure your large model in a way to build in an inductive bias
towards the correct type of solution.
</p>
<p class="body-text">
In many ways, one of the most powerful aspects of modern deep models
is that the architecture can be tuned to be specific to the type of
problem at hand (convolutions for images, recurrent networks for text,
and now transformers for either). This tuning changes the types of functions
that can be found, and thus changes the way that these models
interpolate between the data points. It is a reasonable conjecture to
believe that this tuning contributes to the observation of double
descent commonly seen today.
</p>
<hr />
<p class="body-text">
<br /><br />
Thanks for reading! To learn more about machine learning, check out
our
<a href="https://aws.amazon.com/machine-learning/mlu/"
>self-paced courses</a
>, our
<a href="https://www.youtube.com/channel/UC12LqyqTQYbXatYS9AA7Nuw"
>YouTube videos</a
>, and the
<a href="https://d2l.ai/">Dive into Deep Learning</a> textbook. If you
have any comments, ideas, etc. related to MLU-Explain articles, feel
free to <a href="https://twitter.com/jdwlbr">reach out directly</a>.
The code is available
<a href="https://github.com/aws-samples/aws-mlu-explain">here</a>.
</p>
<br /><br />
<h2 id="references" class="subtitle">Resources + Open Source:</h2>
<br />
<ul>
<li id="reference1">
<a
href="https://www.cs.ubc.ca/~jf/courses/303.S2020/Handouts/spline_energy.pdf"
>[1] CPSC 303: Energy in Cubic Splines, Power Series as
Algorithms, and the Initial Value Problem</a
><br />
(Joel Friedman, 2019)
</li>
<li id="reference2">
<a
href="https://epubs.siam.org/doi/book/10.1137/9780898719987?mobileUi=0"
>[2] A First Course in Numerical Methods</a
>
<br />(Uri M. Ascher, Chen Greif, 2011)
</li>
<li>
<a href="https://d3js.org/">D3.js</a> (Mike Bostock, Philippe
Rivière)
</li>
<li>
<a href="https://katex.org/">KaTeX</a> (Emily Eisenberg, Sophie
Alpert)
</li>
</ul>
</section>
</main>
<script src="js/index.js"></script>
<script src="js/katexCalls.js"></script>
</body>
</html>