% Default to the notebook output style
% Inherit from the specified cell style.
\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
% Nicer default font (+ math font) than Computer Modern for most use cases
\usepackage{mathpazo}
% Basic figure setup, for now with no caption control since it's done
% automatically by Pandoc (which extracts ![](path) syntax from Markdown).
\usepackage{graphicx}
% We will generate all images so they have a width \maxwidth. This means
% that they will get their normal width if they fit onto the page, but
% are scaled down if they would overflow the margins.
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth
\else\Gin@nat@width\fi}
\makeatother
\let\Oldincludegraphics\includegraphics
% Set max figure width to be 80% of text width, for now hardcoded.
\renewcommand{\includegraphics}[1]{\Oldincludegraphics[width=.8\maxwidth]{#1}}
% Ensure that by default, figures have no caption (until we provide a
% proper Figure object with a Caption API and a way to capture that
% in the conversion process - todo).
\usepackage{caption}
\DeclareCaptionLabelFormat{nolabel}{}
\captionsetup{labelformat=nolabel}
\usepackage{adjustbox} % Used to constrain images to a maximum size
\usepackage{xcolor} % Allow colors to be defined
\usepackage{enumerate} % Needed for markdown enumerations to work
\usepackage{geometry} % Used to adjust the document margins
\usepackage{amsmath} % Equations
\usepackage{amssymb} % Equations
\usepackage{textcomp} % defines textquotesingle
% Hack from http://tex.stackexchange.com/a/47451/13684:
\AtBeginDocument{%
\def\PYZsq{\textquotesingle}% Upright quotes in Pygmentized code
}
\usepackage{upquote} % Upright quotes for verbatim code
\usepackage{eurosym} % defines \euro
\usepackage[mathletters]{ucs} % Extended unicode (utf-8) support
\usepackage[utf8x]{inputenc} % Allow utf-8 characters in the tex document
\usepackage{fancyvrb} % verbatim replacement that allows latex
\usepackage{grffile} % extends the file name processing of package graphics
% to support a larger range
% The hyperref package gives us a pdf with properly built
% internal navigation ('pdf bookmarks' for the table of contents,
% internal cross-reference links, web links for URLs, etc.)
\usepackage{hyperref}
\usepackage{longtable} % longtable support required by pandoc >1.10
\usepackage{booktabs} % table support for pandoc > 1.12.2
\usepackage[inline]{enumitem} % IRkernel/repr support (it uses the enumerate* environment)
\usepackage[normalem]{ulem} % ulem is needed to support strikethroughs (\sout)
% normalem makes italics be italics, not underlines
% Colors for the hyperref package
\definecolor{urlcolor}{rgb}{0,.145,.698}
\definecolor{linkcolor}{rgb}{.71,0.21,0.01}
\definecolor{citecolor}{rgb}{.12,.54,.11}
% ANSI colors
\definecolor{ansi-black}{HTML}{3E424D}
\definecolor{ansi-black-intense}{HTML}{282C36}
\definecolor{ansi-red}{HTML}{E75C58}
\definecolor{ansi-red-intense}{HTML}{B22B31}
\definecolor{ansi-green}{HTML}{00A250}
\definecolor{ansi-green-intense}{HTML}{007427}
\definecolor{ansi-yellow}{HTML}{DDB62B}
\definecolor{ansi-yellow-intense}{HTML}{B27D12}
\definecolor{ansi-blue}{HTML}{208FFB}
\definecolor{ansi-blue-intense}{HTML}{0065CA}
\definecolor{ansi-magenta}{HTML}{D160C4}
\definecolor{ansi-magenta-intense}{HTML}{A03196}
\definecolor{ansi-cyan}{HTML}{60C6C8}
\definecolor{ansi-cyan-intense}{HTML}{258F8F}
\definecolor{ansi-white}{HTML}{C5C1B4}
\definecolor{ansi-white-intense}{HTML}{A1A6B2}
% commands and environments needed by pandoc snippets
% extracted from the output of `pandoc -s`
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\newenvironment{Shaded}{}{}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.00,0.44,0.13}{\textbf{{#1}}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.56,0.13,0.00}{{#1}}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.25,0.63,0.44}{{#1}}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.25,0.63,0.44}{{#1}}}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.25,0.63,0.44}{{#1}}}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.25,0.44,0.63}{{#1}}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.25,0.44,0.63}{{#1}}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.38,0.63,0.69}{\textit{{#1}}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.00,0.44,0.13}{{#1}}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{1.00,0.00,0.00}{\textbf{{#1}}}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.02,0.16,0.49}{{#1}}}
\newcommand{\RegionMarkerTok}[1]{{#1}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{1.00,0.00,0.00}{\textbf{{#1}}}}
\newcommand{\NormalTok}[1]{{#1}}
% Additional commands for more recent versions of Pandoc
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.53,0.00,0.00}{{#1}}}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.25,0.44,0.63}{{#1}}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.25,0.44,0.63}{{#1}}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.73,0.40,0.53}{{#1}}}
\newcommand{\ImportTok}[1]{{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.73,0.13,0.13}{\textit{{#1}}}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.38,0.63,0.69}{\textbf{\textit{{#1}}}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.38,0.63,0.69}{\textbf{\textit{{#1}}}}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.10,0.09,0.49}{{#1}}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.00,0.44,0.13}{\textbf{{#1}}}}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.40,0.40,0.40}{{#1}}}
\newcommand{\BuiltInTok}[1]{{#1}}
\newcommand{\ExtensionTok}[1]{{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.74,0.48,0.00}{{#1}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.49,0.56,0.16}{{#1}}}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.38,0.63,0.69}{\textbf{\textit{{#1}}}}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.38,0.63,0.69}{\textbf{\textit{{#1}}}}}
% Define a nice break command that doesn't care if a line doesn't already
% exist.
\def\br{\hspace*{\fill} \\* }
% Math Jax compatability definitions
\def\gt{>}
\def\lt{<}
% Document parameters
\title{w207\_Final\_Jake\_Tim\_Pierce\_Debasish}
% Pygments definitions
\makeatletter
\def\PY@reset{\let\PY@it=\relax \let\PY@bf=\relax%
\let\PY@ul=\relax \let\PY@tc=\relax%
\let\PY@bc=\relax \let\PY@ff=\relax}
\def\PY@tok#1{\csname PY@tok@#1\endcsname}
\def\PY@toks#1+{\ifx\relax#1\empty\else%
\PY@tok{#1}\expandafter\PY@toks\fi}
\def\PY@do#1{\PY@bc{\PY@tc{\PY@ul{%
\PY@it{\PY@bf{\PY@ff{#1}}}}}}}
\def\PY#1#2{\PY@reset\PY@toks#1+\relax+\PY@do{#2}}
\expandafter\def\csname PY@tok@w\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.73,0.73}{##1}}}
\expandafter\def\csname PY@tok@c\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
\expandafter\def\csname PY@tok@cp\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.74,0.48,0.00}{##1}}}
\expandafter\def\csname PY@tok@k\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@kp\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@kt\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.69,0.00,0.25}{##1}}}
\expandafter\def\csname PY@tok@o\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@ow\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
\expandafter\def\csname PY@tok@nb\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@nf\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
\expandafter\def\csname PY@tok@nc\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
\expandafter\def\csname PY@tok@nn\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
\expandafter\def\csname PY@tok@ne\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.82,0.25,0.23}{##1}}}
\expandafter\def\csname PY@tok@nv\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
\expandafter\def\csname PY@tok@no\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.53,0.00,0.00}{##1}}}
\expandafter\def\csname PY@tok@nl\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.63,0.63,0.00}{##1}}}
\expandafter\def\csname PY@tok@ni\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.60,0.60,0.60}{##1}}}
\expandafter\def\csname PY@tok@na\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.49,0.56,0.16}{##1}}}
\expandafter\def\csname PY@tok@nt\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@nd\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
\expandafter\def\csname PY@tok@s\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@sd\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@si\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
\expandafter\def\csname PY@tok@se\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.73,0.40,0.13}{##1}}}
\expandafter\def\csname PY@tok@sr\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
\expandafter\def\csname PY@tok@ss\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
\expandafter\def\csname PY@tok@sx\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@m\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@gh\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
\expandafter\def\csname PY@tok@gu\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.50,0.00,0.50}{##1}}}
\expandafter\def\csname PY@tok@gd\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.63,0.00,0.00}{##1}}}
\expandafter\def\csname PY@tok@gi\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.63,0.00}{##1}}}
\expandafter\def\csname PY@tok@gr\endcsname{\def\PY@tc##1{\textcolor[rgb]{1.00,0.00,0.00}{##1}}}
\expandafter\def\csname PY@tok@ge\endcsname{\let\PY@it=\textit}
\expandafter\def\csname PY@tok@gs\endcsname{\let\PY@bf=\textbf}
\expandafter\def\csname PY@tok@gp\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
\expandafter\def\csname PY@tok@go\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.53,0.53,0.53}{##1}}}
\expandafter\def\csname PY@tok@gt\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.27,0.87}{##1}}}
\expandafter\def\csname PY@tok@err\endcsname{\def\PY@bc##1{\setlength{\fboxsep}{0pt}\fcolorbox[rgb]{1.00,0.00,0.00}{1,1,1}{\strut ##1}}}
\expandafter\def\csname PY@tok@kc\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@kd\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@kn\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@kr\endcsname{\let\PY@bf=\textbf\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@bp\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
\expandafter\def\csname PY@tok@fm\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
\expandafter\def\csname PY@tok@vc\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
\expandafter\def\csname PY@tok@vg\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
\expandafter\def\csname PY@tok@vi\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
\expandafter\def\csname PY@tok@vm\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
\expandafter\def\csname PY@tok@sa\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@sb\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@sc\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@dl\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@s2\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@sh\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@s1\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
\expandafter\def\csname PY@tok@mb\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@mf\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@mh\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@mi\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@il\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@mo\endcsname{\def\PY@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
\expandafter\def\csname PY@tok@ch\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
\expandafter\def\csname PY@tok@cm\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
\expandafter\def\csname PY@tok@cpf\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
\expandafter\def\csname PY@tok@c1\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
\expandafter\def\csname PY@tok@cs\endcsname{\let\PY@it=\textit\def\PY@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
\def\PYZbs{\char`\\}
\def\PYZus{\char`\_}
\def\PYZob{\char`\{}
\def\PYZcb{\char`\}}
\def\PYZca{\char`\^}
\def\PYZam{\char`\&}
\def\PYZlt{\char`\<}
\def\PYZgt{\char`\>}
\def\PYZsh{\char`\#}
\def\PYZpc{\char`\%}
\def\PYZdl{\char`\$}
\def\PYZhy{\char`\-}
\def\PYZsq{\char`\'}
\def\PYZdq{\char`\"}
\def\PYZti{\char`\~}
% for compatibility with earlier versions
\def\PYZat{@}
\def\PYZlb{[}
\def\PYZrb{]}
\makeatother
% Exact colors from NB
\definecolor{incolor}{rgb}{0.0, 0.0, 0.5}
\definecolor{outcolor}{rgb}{0.545, 0.0, 0.0}
% Prevent overflowing lines due to hard-to-break entities
\sloppy
% Setup hyperref package
\hypersetup{
breaklinks=true, % so long urls are correctly broken across lines
colorlinks=true,
urlcolor=urlcolor,
linkcolor=linkcolor,
citecolor=citecolor,
}
% Slightly bigger margins than the latex defaults
\geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in}
\begin{document}
\maketitle
\section{W207 Spring 2019 Final
Project}\label{w207-spring-2019-final-project}
\subsection{Kaggle Competition: Forest Cover
Prediction}\label{kaggle-competition-forest-cover-prediction}
\textbf{Pierce Coggins, Jake Mitchell, Debasish Mukhopadhyay, and Tim
Slade}
\section{Table of Contents/Section
Notes}\label{table-of-contentssection-notes}
\begin{itemize}
\tightlist
\item
Section \ref{introduction}
\begin{itemize}
\tightlist
\item
In which we discuss the problem and why it matters
\end{itemize}
\item
Section \ref{housekeeping}
\begin{itemize}
\tightlist
\item
In which we deal with basic prep and setup issues
\end{itemize}
\item
Section \ref{aboutthedata}
\begin{itemize}
\tightlist
\item
EDA, charts, data cleaning
\end{itemize}
\item
Section \ref{featureengineering}
\begin{itemize}
\tightlist
\item
Describe a basic model that we will use to test the usefulness of new
features (LR or NB)
\item
Normalization
\item
Each added or removed feature
\end{itemize}
\item
Section \ref{models}
\begin{itemize}
\tightlist
\item
Maybe choose 4 to test out? Don't want this section to get too
lengthy, and each model should be covered in some detail
\end{itemize}
\item
Section \ref{results}
\begin{itemize}
\tightlist
\item
What went well, what went poorly
\item
Final comparison of models on test data
\end{itemize}
\item
Section \ref{conclusion}
\item
Section \ref{annexa}
\end{itemize}
\section{Introduction}\label{introduction}
In this report, we will attempt to predict the forest cover type
(defined as the predominant type of tree cover) for a given area of land
in Colorado given only cartographic variables as inputs. This problem
and dataset were initially posted as a Kaggle competition in 2015. We
have chosen to tackle this problem as it allows for many different
machine learning techniques to be attempted and explored. The report
will go through the process of building a capable model from data
cleaning through final testing.
The problem of understanding what type of vegetation is present in a
difficult-to-access area is a surprisingly important one. The forests of
Colorado are very diverse, and each type of tree cover has its own
benefits and dangers. For example, many of the pine trees in Colorado
are susceptible to the
\href{https://csfs.colostate.edu/forest-management/common-forest-insects-diseases/mountain-pine-beetle/}{mountain
pine beetle}, while spruce and fir trees are relatively safe from
the beetles. Without visiting every location in the mountains of
Colorado, it is very difficult to distinguish these types of trees, as
they look very similar from the air. Cartographic data, however, is
relatively easy to obtain for large swaths of the mountains. If tree
type can be predicted accurately from cartographic information alone,
then all of Colorado's forests could be mapped by likely cover type.
That information would be invaluable to firefighters and forest service
personnel in directing their efforts where they will have the most
impact.
If you would like to learn more about the problem or try it for
yourself, all information and data can be found on the Kaggle
competition page:
\href{https://www.kaggle.com/c/forest-cover-type-prediction}{Kaggle's
Forest Cover Type Prediction}.
\subsection{Housekeeping}\label{housekeeping}
\subsubsection{Importing Libraries, Helper Functions, and Loading
Data}\label{importing-libraries-helper-functions-and-loading-data}
\begin{Verbatim}[commandchars=\\\{\}]
{\color{incolor}In [{\color{incolor}1}]:} \PY{o}{\PYZpc{}\PYZpc{}}\PY{k}{capture}
\PYZsh{} \PYZpc{}matplotlib inline
\PYZsh{} \PYZpc{}matplotlib notebook
\PYZpc{}matplotlib qt
\PYZsh{} General libraries
import pandas as pd
import numpy as np
import os
import copy
import warnings
import statsmodels.api as sm
from scipy import stats
import math
\PYZsh{} Plotting and printing libraries
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.patches as mpatches
from matplotlib.pyplot import figure, imshow, axis
from matplotlib.image import imread
import pprint
\PYZsh{} Model\PYZhy{}building libraries
from sklearn.model\PYZus{}selection import train\PYZus{}test\PYZus{}split, StratifiedKFold
from sklearn.preprocessing import normalize, MinMaxScaler, StandardScaler, RobustScaler, Normalizer, scale
\PYZsh{} SK\PYZhy{}learn libraries for learning
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear\PYZus{}model import LogisticRegression, LinearRegression
from sklearn.naive\PYZus{}bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.model\PYZus{}selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
\PYZsh{} SK\PYZhy{}learn libraries for evaluation
from sklearn.metrics import confusion\PYZus{}matrix, classification\PYZus{}report
from sklearn import metrics
from sklearn.model\PYZus{}selection import cross\PYZus{}val\PYZus{}score
import warnings
warnings.filterwarnings(\PYZsq{}ignore\PYZsq{})
\PYZsh{} Run the helper functions notebook
\PYZpc{}run w207\PYZus{}final\PYZus{}helper\PYZus{}functions.ipynb
\end{Verbatim}
The forest cover types we aim to predict are bundled with the features
used to predict them. Our first step is therefore to separate them out,
lest we accidentally let our models peek at the outcomes. We also want
to split the dataset into \emph{train} and \emph{test} subsets; this
will give us insight into how well our chosen models and parameters will
perform against out-of-sample data.
The original dataset contains 15,120 observations. We will train our
models on 90\% of the data and hold out 10\% for testing. We thus expect
0.9 * 15,120 = 13,608 observations in our training dataset and the
remaining 1,512 in our test dataset.
\begin{Verbatim}[commandchars=\\\{\}]
{\color{incolor}In [{\color{incolor}2}]:} \PY{o}{\PYZpc{}\PYZpc{}}\PY{k}{capture} \PYZhy{}\PYZhy{}no\PYZhy{}stdout \PYZhy{}\PYZhy{}no\PYZhy{}display
full\PYZus{}data = pd.read\PYZus{}csv(\PYZsq{}./train.csv\PYZsq{}, index\PYZus{}col=0)
full\PYZus{}data.shape
\PYZsh{} Separating out the labels
full\PYZus{}labels = full\PYZus{}data[\PYZsq{}Cover\PYZus{}Type\PYZsq{}]
full\PYZus{}features = full\PYZus{}data.drop(\PYZsq{}Cover\PYZus{}Type\PYZsq{}, axis=1)
\PYZsh{} Setting seed so we get consistent results from our splitting
np.random.seed(0)
X\PYZus{}train, X\PYZus{}test, y\PYZus{}train, y\PYZus{}test = train\PYZus{}test\PYZus{}split(full\PYZus{}features, full\PYZus{}labels, test\PYZus{}size=0.10)
\PYZsh{} Verifying our data shapes are as expected
print(f\PYZsq{}\PYZsq{}\PYZsq{}
\PYZob{}\PYZsq{}\PYZsq{}:\PYZca{}16\PYZcb{} | \PYZob{}\PYZsq{}Observations\PYZsq{}:\PYZca{}12\PYZcb{} | \PYZob{}\PYZsq{}Features\PYZsq{}:\PYZca{}10\PYZcb{} |
\PYZob{}\PYZsq{}\PYZhy{}\PYZsq{}*46\PYZcb{}
\PYZob{}\PYZsq{}Training dataset\PYZsq{}:\PYZca{}16\PYZcb{} | \PYZob{}X\PYZus{}train.shape[0]:\PYZca{}12\PYZcb{} | \PYZob{}X\PYZus{}train.shape[1]:\PYZca{}10\PYZcb{} |
\PYZob{}\PYZsq{}Training labels\PYZsq{}:\PYZca{}16\PYZcb{} | \PYZob{}y\PYZus{}train.shape[0]:\PYZca{}12\PYZcb{} | \PYZob{}\PYZsq{}\PYZhy{}\PYZhy{}\PYZsq{}:\PYZca{}10\PYZcb{} |
\PYZob{}\PYZsq{}Test dataset\PYZsq{}:\PYZca{}16\PYZcb{} | \PYZob{}X\PYZus{}test.shape[0]:\PYZca{}12\PYZcb{} | \PYZob{}X\PYZus{}test.shape[1]:\PYZca{}10\PYZcb{} |
\PYZob{}\PYZsq{}Test labels\PYZsq{}:\PYZca{}16\PYZcb{} | \PYZob{}y\PYZus{}test.shape[0]:\PYZca{}12\PYZcb{} | \PYZob{}\PYZsq{}\PYZhy{}\PYZhy{}\PYZsq{}:\PYZca{}10\PYZcb{} |
\PYZsq{}\PYZsq{}\PYZsq{})
\end{Verbatim}
\begin{Verbatim}[commandchars=\\\{\}]
                 | Observations |  Features  |
----------------------------------------------
Training dataset |    13608     |     54     |
Training labels  |    13608     |     --     |
  Test dataset   |     1512     |     54     |
  Test labels    |     1512     |     --     |
\end{Verbatim}
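
The expected split sizes can be checked without loading the data. The
following is a minimal standard-library sketch, not the notebook's own
code. It assumes that, for a fractional \texttt{test\_size}, the test
count is rounded up and the remainder is used for training, which is how
we understand \texttt{train\_test\_split} to behave. The helpers
\texttt{split\_sizes} and \texttt{shuffled\_split} are hypothetical
names introduced for illustration.

\begin{Verbatim}
```python
import math
import random


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a fractional hold-out split."""
    # Assumption: the test count is rounded up; the rest is training data.
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


def shuffled_split(indices, test_fraction, seed=0):
    """Shuffle a copy of the indices and cut it into train/test parts."""
    n_train, _ = split_sizes(len(indices), test_fraction)
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split repeatable
    return shuffled[:n_train], shuffled[n_train:]


n_train, n_test = split_sizes(15120, 0.10)
train_idx, test_idx = shuffled_split(range(15120), 0.10)
print(n_train, n_test)
```
\end{Verbatim}

Passing an explicit seed to a local random number generator, rather than
seeding global state, is one way to keep a split reproducible even when
other code draws random numbers in between.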
\section{About the Data}\label{aboutthedata}
The data comes from several wilderness areas in northern Colorado,
specifically the Rawah Wilderness Area, Neota Wilderness Area, Comanche
Peak Wilderness Area, and the Cache la Poudre Wilderness Area. These are
all fairly remote areas of Colorado, which may be why they were chosen;
there is less human influence in these places to complicate the
prediction task.
The features in the dataset are all cartographic measures of a
30\,m \(\times\) 30\,m square plot of land. We have 10 simple numeric
features. The 11th and 12th, \texttt{wilderness\_area} and
\texttt{soil\_type}, are categorical variables, represented in our
dataset as 4 and 40 binary columns respectively. We therefore have a
total of 10 + 4 + 40 = 54 features to work with. The list below contains
a short description of each feature, including, where relevant, its
range, mean, and median. (See Section \ref{annexa} for the associated
code and further discussion of the exploratory data analysis.)
\begin{itemize}
\item
\texttt{Elevation}: \emph{Elevation in meters}
\begin{itemize}
\item
\textbf{Range}: 1863 to 3849 \textbar{} \textbf{Mean}: 2749.3
\textbar{} \textbf{Median}: 2752
\end{itemize}
\item
\texttt{Aspect}: \emph{Aspect in degrees azimuth, i.e., degrees
clockwise from a line pointed at true North. So North = 0\(^\circ\),
East = 90\(^\circ\), South = 180\(^\circ\), and West = 270\(^\circ\)}
\begin{itemize}
\item
\textbf{Range}: 0 to 360 \textbar{} \textbf{Mean}: 156.7 \textbar{}
\textbf{Median}: 126.0
\end{itemize}
\item
\texttt{Slope}: \emph{Slope in degrees. 0\(^\circ\) would indicate a
flat plane; greater values represent steeper slopes.}
\begin{itemize}
\item
\textbf{Range}: 0 to 52 \textbar{} \textbf{Mean}: 16.5 \textbar{}
\textbf{Median}: 15.0
\end{itemize}
\item
\texttt{Horizontal\_Distance\_To\_Hydrology}: \emph{Horizontal
distance to nearest surface water features. Units unspecified.}
\begin{itemize}
\item
\textbf{Range}: 0 to 1343 \textbar{} \textbf{Mean}: 227.2 \textbar{}
\textbf{Median}: 180
\end{itemize}
\item
\texttt{Vertical\_Distance\_To\_Hydrology}: \emph{Vertical distance to
nearest surface water features. Units unspecified.}
\begin{itemize}
\item
\textbf{Range}: -146 to 554 \textbar{} \textbf{Mean}: 51.1 \textbar{}
\textbf{Median}: 32.0
\end{itemize}
\item
\texttt{Horizontal\_Distance\_To\_Roadways}: \emph{Horizontal distance
to nearest roadway. Units unspecified.}
\begin{itemize}
\item
\textbf{Range}: 0 to 6890 \textbar{} \textbf{Mean}: 1714.0 \textbar{}
\textbf{Median}: 1316
\end{itemize}
\item
\texttt{Hillshade\_9am}: \emph{(0 to 255 index) - Hillshade index at
9am, summer solstice}
\begin{itemize}
\item
\textbf{Range}: 0 to 254 \textbar{} \textbf{Mean}: 212.7 \textbar{}
\textbf{Median}: 220
\end{itemize}
\item
\texttt{Hillshade\_Noon}: \emph{(0 to 255 index) - Hillshade index at
noon, summer solstice}
\begin{itemize}
\item
\textbf{Range}: 99 to 254 \textbar{} \textbf{Mean}: 219.0 \textbar{}
\textbf{Median}: 223
\end{itemize}
\item
\texttt{Hillshade\_3pm}: \emph{(0 to 255 index) - Hillshade index at
3pm, summer solstice}
\begin{itemize}
\item
\textbf{Range}: 0 to 248 \textbar{} \textbf{Mean}: 135.1 \textbar{}
\textbf{Median}: 138.0
\end{itemize}
\item
\texttt{Horizontal\_Distance\_To\_Fire\_Points}: \emph{Horizontal
distance to nearest wildfire ignition points. Units unspecified.}
\begin{itemize}
\item
\textbf{Range}: 0 to 6993 \textbar{} \textbf{Mean}: 1511.2 \textbar{}
\textbf{Median}: 1256
\end{itemize}
\item
\texttt{Wilderness\_Area}: \emph{(4 binary columns, 0 = absence or 1 =
presence) - Wilderness area designation}
\begin{itemize}
\item
\% of cases - \textbf{Area 1}: 24\% \textbar{} \textbf{Area 2}: 3\%
\textbar{} \textbf{Area 3}: 42\% \textbar{} \textbf{Area 4}: 31\%
\end{itemize}
\item
\texttt{Soil\_Type}: \emph{(40 binary columns, 0 = absence or 1 =
presence) - Soil type designation}
\begin{itemize}
\item
The soil type descriptions can be found at the
\href{https://www.kaggle.com/c/forest-cover-type-prediction/data}{Kaggle
Competition Data Page}
\end{itemize}
\end{itemize}
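
Because \texttt{Wilderness\_Area} and \texttt{Soil\_Type} are each
spread across several binary columns, it can help to collapse them back
into a single category index when exploring the data. The sketch below
illustrates the idea with plain Python dictionaries rather than our
DataFrame. The column naming scheme (a prefix followed by a 1-based
number, e.g. \texttt{Soil\_Type29}), the helper names, and the example
row are all assumptions made for illustration.

\begin{Verbatim}
```python
def one_hot_columns(prefix, count):
    """Column names such as Soil_Type1 ... Soil_Type40 (naming is an assumption)."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def decode_one_hot(row, columns):
    """Return the 1-based position of the single column set to 1 in this row."""
    active = [i for i, col in enumerate(columns, start=1) if row[col] == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active column, found {len(active)}")
    return active[0]


wilderness_cols = one_hot_columns("Wilderness_Area", 4)
soil_cols = one_hot_columns("Soil_Type", 40)

# 10 simple features + 4 wilderness indicators + 40 soil indicators
n_features = 10 + len(wilderness_cols) + len(soil_cols)

# A made-up observation: wilderness area 3, soil type 29
row = {col: 0 for col in wilderness_cols + soil_cols}
row["Wilderness_Area3"] = 1
row["Soil_Type29"] = 1

print(n_features, decode_one_hot(row, wilderness_cols), decode_one_hot(row, soil_cols))
```
\end{Verbatim}

The explicit check for exactly one active column doubles as a data
quality test: a row with zero or several active indicators would signal
a problem with the encoding.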
\subsubsection{Initial Exploration of the
Challenge}\label{initial-exploration-of-the-challenge}
The label for each observation is contained in the
\texttt{Cover\_Type} variable, which takes one of 7 different
designations. While the tree species discussed in the Colorado State
Forest Service's
\href{https://csfs.colostate.edu/colorado-trees/colorados-major-tree-species/}{\emph{Colorado's
Major Tree Species}} article do not map perfectly to these categories,
the article provides some insights that may prove useful in our
categorization exercise.
\paragraph{\texorpdfstring{{Category 1}:
'Spruce/Fir'}{Category 1: 'Spruce/Fir'}}\label{category-1-sprucefir}
\begin{itemize}
\tightlist
\item
Species that might fit into this category include the \textbf{Blue
Spruce} (which thrives at an altitude of 6700-11500 ft in sandy soils
near moisture), the \textbf{Engelmann Spruce} (8000-11000 ft, moist
north-facing slopes), the \textbf{Subalpine Fir} (8000-12000 ft, cold
high-elevation forests), and the \textbf{White Fir} (7900-10200 ft,
moist soils in valleys).
\end{itemize}
\emph{[Images: Blue Spruce, Engelmann Spruce, Subalpine Fir, White Fir]}
\paragraph{\texorpdfstring{{Category 2}: 'Lodgepole Pine' and {Category
3}: 'Ponderosa
Pine'}{Category 2: 'Lodgepole Pine' and Category 3: 'Ponderosa Pine'}}\label{category-2-lodgepole-pine-and-category-3-ponderosa-pine}
\begin{itemize}
\tightlist
\item
The \textbf{Lodgepole Pine} thrives in well-drained soils at high
elevations (6000-11000 ft).
\item
The \textbf{Ponderosa Pine} thrives in dry, nutrient-poor soils at
elevations of 6300-9500 ft. It is often found with Douglas Firs.
\end{itemize}
\emph{[Images: Lodgepole Pine, Ponderosa Pine]}
\paragraph{\texorpdfstring{{Category 4}:
'Cottonwood/Willow'}{Category 4: 'Cottonwood/Willow'}}\label{category-4-cottonwoodwillow}
\begin{itemize}
\tightlist
\item
Species that might fit into this category include the \textbf{Plains
Cottonwood} (which thrives at altitudes of 3500-6500 ft near sources
of water), the \textbf{Narrowleaf Cottonwood} (5000-8000 ft, moist
soils along streams), and the \textbf{Peachleaf Willow} (3500-7500 ft,
near water sources).
\end{itemize}
\emph{[Images: Plains Cottonwood, Narrowleaf Cottonwood, Peachleaf Willow]}
\paragraph{\texorpdfstring{{Category 5}: 'Aspen' and {Category 6}:
'Douglas
Fir'}{Category 5: 'Aspen' and Category 6: 'Douglas Fir'}}\label{category-5-aspen-and-category-6-douglas-fir}
\begin{itemize}
\tightlist
\item
The \textbf{Quaking Aspen} thrives at altitudes of 6500-11500 ft.
While it can grow in many soil types, it is especially common on sandy
and gravelly slopes.
\item
The \textbf{Douglas Fir} thrives at altitudes of 6000-9500 ft in rocky
soils of moist northern slopes.
\end{itemize}
\emph{[Images: Quaking Aspen, Douglas Fir]}
\paragraph{\texorpdfstring{{Category 7}:
'Krummholz'}{Category 7: 'Krummholz'}}\label{category-7-krummholz}
\begin{itemize}
\tightlist
\item
Interestingly, \emph{krummholz} is not a species of tree; it is a type
of tree formation (which can emerge among various tree species) that
results from consistent long-term exposure to strong, cold winds. Per
\href{https://en.wikipedia.org/wiki/Krummholz}{Wikipedia}, Subalpine
Fir and Engelmann Spruce are often associated with Krummholz
conditions (as is Lodgepole Pine, although that is more common in
British Columbia).
\end{itemize}
\emph{[Images: Krummholz Banner Tree, Krummholz White Pine, Krummholz Bristlecone]}
\subsubsection{Where do we start?}\label{where-do-we-start}
The brief descriptions we've seen already suggest some avenues of
exploration: altitude ranges and access to water seem to be of primary
importance.
\paragraph{What can we learn from elevation
alone?}\label{what-can-we-learn-from-elevation-alone}
One place to begin would be to plot out the idealized elevation ranges
within which the various tree species thrive. There may be certain
elevations where certain tree species would be far more prevalent than
others. The graph below illustrates the ranges in which the species of
trees discussed in the Colorado State Forest Service's
\href{https://csfs.colostate.edu/colorado-trees/colorados-major-tree-species/}{\emph{Colorado's
Major Tree Species}} thrive, per the article.
It appears that lower elevations would be strongly suggestive of the
\texttt{Cottonwood/Willow} \texttt{Cover\_Type}, while higher elevations
might be more suggestive of the \texttt{Spruce/Fir},
\texttt{Lodgepole\ Pine}, \texttt{Aspen}, and \texttt{Krummholz}
\texttt{Cover\_Type}s. The graph above is based upon idealized data from
outside sources, though, and our actual dataset might tell a different
story. The graphs below present the observed \emph{elevation} ranges and
quartiles by \texttt{Cover\_Type} in our data.
\emph{[Figures: Elevation Ranges, Elevation Quartiles]}
When looking at the ranges, our dataset appears to differ from the
idealized one in that the \texttt{Cottonwood/Willow}
\texttt{Cover\_Type} does not seem to occur at markedly lower
elevations. When looking at the quartiles, though, patterns emerge that
appear similar to what we would expect from the idealized presentation:
\texttt{Cottonwood/Willow} tends to cluster at lower elevations, with
the higher elevations dominated by \texttt{Spruce/Fir} and
\texttt{Krummholz} cover types.
The separations are surprisingly clean, suggesting that
\texttt{Elevation} will be a powerful feature in our models. It might be
especially powerful if we could develop a method to cluster the
altitudes into the interquartile ranges presented in the graph above.
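The per-category quartiles discussed above can be computed directly with a pandas group-by. The block below is a minimal sketch on a tiny synthetic stand-in for the data; the \texttt{toy} frame and the \texttt{elevation\_quartiles} helper are hypothetical names introduced for illustration and are not part of the notebook.
\begin{Verbatim}
import pandas as pd

# Tiny synthetic stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'Elevation':  [2000, 2100, 2200, 2300, 3300, 3400, 3500, 3600],
    'Cover_Type': [4, 4, 4, 4, 7, 7, 7, 7],
})

def elevation_quartiles(df):
    """Return the 25th, 50th and 75th Elevation percentiles per Cover_Type."""
    q = df.groupby('Cover_Type')['Elevation'].quantile([0.25, 0.5, 0.75])
    # Pivot the quantile level into columns: one row per Cover_Type.
    return q.unstack()

quartiles = elevation_quartiles(toy)
print(quartiles)
\end{Verbatim}
The resulting table has one row per \texttt{Cover\_Type}, so the interquartile range of each category can be read off (or used as bin edges) without re-plotting.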
\paragraph{What if we bring water into the
picture?}\label{what-if-we-bring-water-into-the-picture}
The other feature that the article suggests might be highly salient is
moisture. How does the picture evolve if we add a measure of the
distance to water to the mix?
The graph below is a scatterplot of the Euclidean distance (derived from
the \texttt{Horizontal\_Distance\_To\_Hydrology} and
\texttt{Vertical\_Distance\_To\_Hydrology} features) and the
\texttt{Elevation}, with data points colored by the
\texttt{Cover\_Type}.
The distance to hydrology appears to be informative:
\texttt{Cover\_Type}s 3, 4, and 6 are essentially not found when the
distance to water exceeds 750. That said, it remains clear that
\texttt{Elevation} is the predominant distinguishing feature.
\paragraph{What if we consider exposure to sunlight and
wind?}\label{what-if-we-consider-exposure-to-sunlight-and-wind}
From a layperson's perspective, the amount of sunlight to which a given
plot of land is exposed would seem likely to influence the vegetation
which thrives there. In our dataset, the \texttt{Hillshade} variables
encode this information.
The plot below compares the 1st quartile, median, and 3rd quartile for
each measure of \texttt{Hillshade} for each category of
\texttt{Cover\_Type}.
While the median \texttt{Hillshade} values appear to vary a little
across categories in the morning and afternoon, the interquartile range
largely overlaps across categories. The overall impression is that
\texttt{Hillshade} is unlikely to be determinative on its own.
Exposure to sunlight and wind would also be affected by the
\texttt{Aspect}, which is essentially the compass direction (0\(^\circ\)
is true North, 90\(^\circ\) is East, 180\(^\circ\) is South,
270\(^\circ\) is West) the plot is facing. While the exact nature of the
interaction between these features may not be clear \emph{a priori}, we
can attempt to collapse the effect into a single feature by taking the
first principal component of the \texttt{Hillshade\_9am} and
\texttt{Hillshade\_3pm} features with the \texttt{Aspect} feature.
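As a rough illustration of this step, the sketch below extracts a first principal component with scikit-learn. It runs on random synthetic values, and it standardizes the columns before the decomposition; that scaling is our assumption, since the text does not say whether the notebook scales the features first. The helper \texttt{first\_principal\_component} and the column name \texttt{Exposure\_PC1} are hypothetical.
\begin{Verbatim}
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random synthetic stand-in for the three exposure-related features.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'Hillshade_9am': rng.randint(0, 255, size=50),
    'Hillshade_3pm': rng.randint(0, 255, size=50),
    'Aspect': rng.randint(0, 360, size=50),
})

def first_principal_component(df, columns):
    """Project the given columns onto their first principal component."""
    # Assumption: put the columns on a common scale before PCA.
    scaled = StandardScaler().fit_transform(df[columns].astype(float))
    return PCA(n_components=1).fit_transform(scaled)[:, 0]

toy['Exposure_PC1'] = first_principal_component(
    toy, ['Hillshade_9am', 'Hillshade_3pm', 'Aspect'])
\end{Verbatim}
Note that \texttt{Aspect} is an angle, so 0 and 359 degrees are treated as far apart by a linear method such as PCA; this sketch does not address that.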
The graph below plots this first principal component against
\texttt{Elevation}, as we already know \texttt{Elevation} is strongly
informative.
What patterns we see are weak at best. While the \texttt{Douglas\ Fir}
category appears to be more prevalent for greater and lesser values of
this first principal component, and the \texttt{Ponderosa\ Pine} appears
to be slightly more prevalent nearer to zero, it is clear that the
\texttt{Elevation} remains the dominant feature.
\paragraph{What about the 'Kitchen Sink'
approach?}\label{what-about-the-kitchen-sink-approach}
So far we've examined \texttt{Elevation}, \texttt{Hydrology},
\texttt{Aspect}, and \texttt{Hillshade} features on the basis of the
write-ups regarding the various tree species. But what if we just took a
look at all of our key features and how they relate to one another?
The graph below is a scatterplot matrix incorporating all of the raw
simple features in our data, as well as the
\texttt{Euclidean\_Distance\_To\_Hydrology} feature we composed from the
horizontal and vertical distances to hydrology.
While \texttt{Elevation} remains the feature that seems to provide the
cleanest separation between \texttt{Cover\_Type}s, two additional
features seem to perform pretty well at discriminating the
\texttt{Lodgepole\ Pine}s: \texttt{Horizontal\_Distance\_To\_Roadways}
and \texttt{Horizontal\_Distance\_To\_Fire\_Points}.
\subsubsection{Cleaning the Data}\label{cleaning-the-data}
While exploring the data (see Section \ref{annexa}), we noted that the
\texttt{Soil\_Type7} and \texttt{Soil\_Type15} variables are never true.
Because there is no variation in these features, they contribute nothing
to any of our models.
\begin{Verbatim}[commandchars=\\\{\}]
{\color{incolor}In [{\color{incolor}3}]:} \PY{c+c1}{\PYZsh{} Removing uninformative features}
\PY{n}{full\PYZus{}features} \PY{o}{=} \PY{n}{full\PYZus{}features}\PY{o}{.}\PY{n}{drop}\PY{p}{(}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type7}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type15}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{p}{,} \PY{n}{axis}\PY{o}{=}\PY{l+m+mi}{1}\PY{p}{)}
\end{Verbatim}
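Rather than naming such columns by hand, constant features can also be detected programmatically. The following is a minimal sketch on a synthetic frame; \texttt{toy}, \texttt{constant\_columns}, and \texttt{cleaned} are hypothetical names used only for illustration.
\begin{Verbatim}
import pandas as pd

# Synthetic stand-in: two soil indicators never vary, one does.
toy = pd.DataFrame({
    'Soil_Type7':  [0, 0, 0],
    'Soil_Type8':  [0, 1, 0],
    'Soil_Type15': [0, 0, 0],
})

# A column with at most one distinct value carries no information.
constant_columns = [c for c in toy.columns if toy[c].nunique() <= 1]
cleaned = toy.drop(columns=constant_columns)
print(constant_columns)
\end{Verbatim}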
\section{Feature Engineering}\label{feature-engineering}
Successful machine learning projects often depend heavily on feature
engineering. The most important feature in a dataset may be a latent one
- that is, 'hidden' behind other features which serve as proxies for
it. In such a case, the latent feature needs to be explicitly extracted.
While we are exploring the potential of various synthetic/constructed
features, we will also try to remove original features which are proving
uninformative. Doing so will reduce the noise passed into our models. We
can keep the engineered and source datasets separate by creating a deep
copy of the data.
\subsubsection{Euclidean Distance to
Hydrology}\label{euclidean-distance-to-hydrology}
As we saw in Section \ref{aboutthedata}, the
\texttt{Cover\_Type}s can be visually broken up based on their distance
to hydrology, both horizontally and vertically. By combining the
features into a single feature, we can reduce the overall number of
features.
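The combined distance is simply the hypotenuse of the horizontal and vertical offsets. As a sketch, it can be computed in one vectorized call with \texttt{numpy.hypot}; the \texttt{toy} frame below is synthetic and only illustrates the arithmetic.
\begin{Verbatim}
import numpy as np
import pandas as pd

# Synthetic distances; the vertical offset may be negative.
toy = pd.DataFrame({
    'Horizontal_Distance_To_Hydrology': [3, 0, 30],
    'Vertical_Distance_To_Hydrology':   [4, -5, 40],
})

# hypot(h, v) = sqrt(h**2 + v**2), evaluated element-wise.
toy['Euclidean_Distance_To_Hydrology'] = np.hypot(
    toy['Horizontal_Distance_To_Hydrology'],
    toy['Vertical_Distance_To_Hydrology'],
)
print(toy['Euclidean_Distance_To_Hydrology'].tolist())
\end{Verbatim}
A vectorized call avoids a row-wise \texttt{apply}, which tends to be noticeably slower on larger frames.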
\subsubsection{Elevation of Hydrology}\label{elevation-of-hydrology}
Elevation and Hydrology are very important features when it comes to
predicting the \texttt{Cover\_Type} of an area. By subtracting the
vertical distance to hydrology from the elevation, we can find what the
elevation of the hydrology itself is. This may prove useful by providing
a feature that would be able to discern an alpine lake vs a valley
stream.
\subsubsection{Mean Distance to Feature}\label{mean-distance-to-feature}
As we saw in Section \ref{aboutthedata}, the distance
metrics group the data pretty well for classification. We can engineer a
new feature that incorporates the mean distance to hydrology, fire
points, and roadways - the latter two features providing a fair
approximation of an area's remoteness.
\subsubsection{Stony}\label{stony}
This data set features 40 different types of soils. Compared to the
7 possible labels, this number of soil types seems excessive.
Some types of trees favor rockier soils, so combining all of
the stony soil types into a single feature should allow a model to pick
up on that preference more easily.
\subsubsection{Hillshade}\label{hillshade}
\begin{Verbatim}[commandchars=\\\{\}]
{\color{incolor}In [{\color{incolor}4}]:} \PY{n}{full\PYZus{}features}\PY{p}{[}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Hillshade\PYZus{}9am}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Hillshade\PYZus{}3pm}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{p}{]}\PY{o}{.}\PY{n}{describe}\PY{p}{(}\PY{p}{)}
\end{Verbatim}
\begin{Verbatim}[commandchars=\\\{\}]
{\color{outcolor}Out[{\color{outcolor}4}]:} Hillshade\_9am Hillshade\_3pm
count 15120.000000 15120.000000
mean 212.704299 135.091997
std 30.561287 45.895189
min 0.000000 0.000000
25\% 196.000000 106.000000
50\% 220.000000 138.000000
75\% 235.000000 167.000000
max 254.000000 248.000000
\end{Verbatim}
\subparagraph{Key Data Assumptions
Made}\label{key-data-assumptions-made}
One thing to notice about the data is that the \texttt{Hillshade\_9am}
and \texttt{Hillshade\_3pm} features contain zero values (see the
minimum in the summary above), which we treat as missing. We
replace these zeros with the median value of the corresponding feature,
so that the affected areas no longer carry unusable data and can be
classified more reliably.
\begin{Verbatim}[commandchars=\\\{\}]
{\color{incolor}In [{\color{incolor}5}]:} \PY{n}{engineered\PYZus{}features} \PY{o}{=} \PY{n}{pd}\PY{o}{.}\PY{n}{DataFrame}\PY{o}{.}\PY{n}{copy}\PY{p}{(}\PY{n}{full\PYZus{}features}\PY{p}{)}
\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Euclidean\PYZus{}Distance\PYZus{}To\PYZus{}Hydrology}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]} \PY{o}{=} \PY{n}{engineered\PYZus{}features}\PY{o}{.}\PY{n}{apply}\PY{p}{(}\PY{k}{lambda} \PY{n}{row}\PY{p}{:} \PY{n}{math}\PY{o}{.}\PY{n}{sqrt}\PY{p}{(}\PY{n}{row}\PY{o}{.}\PY{n}{Horizontal\PYZus{}Distance\PYZus{}To\PYZus{}Hydrology}\PY{o}{*}\PY{o}{*}\PY{l+m+mi}{2} \PY{o}{+} \PY{n}{row}\PY{o}{.}\PY{n}{Vertical\PYZus{}Distance\PYZus{}To\PYZus{}Hydrology}\PY{o}{*}\PY{o}{*}\PY{l+m+mi}{2}\PY{p}{)}\PY{p}{,} \PY{n}{axis}\PY{o}{=}\PY{l+m+mi}{1}\PY{p}{)}
\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Elevation\PYZus{}Of\PYZus{}Hydrology}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]} \PY{o}{=} \PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Elevation}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{o}{\PYZhy{}}\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Vertical\PYZus{}Distance\PYZus{}To\PYZus{}Hydrology}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}
\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Mean\PYZus{}Distance\PYZus{}To\PYZus{}Feature}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]} \PY{o}{=} \PY{p}{(}\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Horizontal\PYZus{}Distance\PYZus{}To\PYZus{}Hydrology}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{o}{+}\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Horizontal\PYZus{}Distance\PYZus{}To\PYZus{}Roadways}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{o}{+}\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Horizontal\PYZus{}Distance\PYZus{}To\PYZus{}Fire\PYZus{}Points}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{p}{)}\PY{o}{/}\PY{l+m+mi}{3}
\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Stony}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]} \PY{o}{=} \PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type1}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type2}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type6}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type9}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type12}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type18}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type24}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type25}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type26}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type27}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type28}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type29}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type30}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type31}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type32}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type33}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type34}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type35}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type36}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type37}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type38}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type39}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} 
\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Soil\PYZus{}Type40}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{p}{]}\PY{p}{]}\PY{o}{.}\PY{n}{any}\PY{p}{(}\PY{n}{axis}\PY{o}{=}\PY{l+m+mi}{1}\PY{p}{)}
\PY{n}{median\PYZus{}hillshade\PYZus{}9am} \PY{o}{=} \PY{n}{np}\PY{o}{.}\PY{n}{median}\PY{p}{(}\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Hillshade\PYZus{}9am}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{p}{)}
\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Hillshade\PYZus{}9am}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]} \PY{o}{=} \PY{n}{engineered\PYZus{}features}\PY{o}{.}\PY{n}{apply}\PY{p}{(}\PY{k}{lambda} \PY{n}{row}\PY{p}{:} \PY{n}{median\PYZus{}hillshade\PYZus{}9am} \PY{k}{if} \PY{n}{row}\PY{o}{.}\PY{n}{Hillshade\PYZus{}9am} \PY{o}{==} \PY{l+m+mi}{0} \PY{k}{else} \PY{n}{row}\PY{o}{.}\PY{n}{Hillshade\PYZus{}9am}\PY{p}{,} \PY{n}{axis}\PY{o}{=}\PY{l+m+mi}{1}\PY{p}{)}
\PY{n}{median\PYZus{}hillshade\PYZus{}3pm} \PY{o}{=} \PY{n}{np}\PY{o}{.}\PY{n}{median}\PY{p}{(}\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Hillshade\PYZus{}3pm}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]}\PY{p}{)}
\PY{n}{engineered\PYZus{}features}\PY{p}{[}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Hillshade\PYZus{}3pm}\PY{l+s+s1}{\PYZsq{}}\PY{p}{]} \PY{o}{=} \PY{n}{engineered\PYZus{}features}\PY{o}{.}\PY{n}{apply}\PY{p}{(}\PY{k}{lambda} \PY{n}{row}\PY{p}{:} \PY{n}{median\PYZus{}hillshade\PYZus{}3pm} \PY{k}{if} \PY{n}{row}\PY{o}{.}\PY{n}{Hillshade\PYZus{}3pm} \PY{o}{==} \PY{l+m+mi}{0} \PY{k}{else} \PY{n}{row}\PY{o}{.}\PY{n}{Hillshade\PYZus{}3pm}\PY{p}{,} \PY{n}{axis}\PY{o}{=}\PY{l+m+mi}{1}\PY{p}{)}
\PY{n}{np}\PY{o}{.}\PY{n}{random}\PY{o}{.}\PY{n}{seed}\PY{p}{(}\PY{l+m+mi}{0}\PY{p}{)}
\PY{n}{e\PYZus{}X\PYZus{}train}\PY{p}{,} \PY{n}{e\PYZus{}X\PYZus{}test}\PY{p}{,} \PY{n}{e\PYZus{}y\PYZus{}train}\PY{p}{,} \PY{n}{e\PYZus{}y\PYZus{}test} \PY{o}{=} \PY{n}{train\PYZus{}test\PYZus{}split}\PY{p}{(}\PY{n}{engineered\PYZus{}features}\PY{p}{,} \PY{n}{full\PYZus{}labels}\PY{p}{,} \PY{n}{test\PYZus{}size}\PY{o}{=}\PY{l+m+mf}{0.10}\PY{p}{)}
\end{Verbatim}
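The zero-replacement step can also be written without a row-wise \texttt{apply}. The sketch below is a variation rather than a restatement of the cell above: it computes the median over the non-zero values only, whereas the cell above takes the median of the full column (zeros included). The helper \texttt{impute\_zeros\_with\_median} and the \texttt{toy} frame are hypothetical.
\begin{Verbatim}
import pandas as pd

# Synthetic stand-in where 0 marks a missing Hillshade reading.
toy = pd.DataFrame({
    'Hillshade_9am': [0, 200, 220, 240],
    'Hillshade_3pm': [100, 0, 140, 0],
})

def impute_zeros_with_median(series):
    """Replace zeros with the median of the non-zero values."""
    median = series[series != 0].median()
    # Keep values where the condition holds; substitute the median elsewhere.
    return series.where(series != 0, median)

for column in ['Hillshade_9am', 'Hillshade_3pm']:
    toy[column] = impute_zeros_with_median(toy[column])
print(toy)
\end{Verbatim}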
\subsubsection{How to Test Feature
Changes}\label{how-to-test-feature-changes}
Without \emph{a priori} knowledge of how the interplay between soil
types, topography, hydrology, etc. affects forest cover, we need a way
to view the performance of new features. As such, we will use a simple
Gaussian Naive Bayes model to make predictions, and quantify the results
using cross-validation. We will track precision, recall, and f1-score.
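The notebook's own \texttt{cross\_validate\_model} helper is not reproduced in this section. As a rough sketch of the same idea, scikit-learn's \texttt{cross\_validate} can report several metrics at once. The function \texttt{summarize\_cv}, the synthetic data, the choice of 5 folds, and the use of macro averaging are all assumptions made for illustration.
\begin{Verbatim}
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic multi-class data standing in for the forest cover features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

def summarize_cv(model, X, y, folds=5):
    """Return mean macro-averaged precision, recall and f1 across folds."""
    scoring = ['precision_macro', 'recall_macro', 'f1_macro']
    results = cross_validate(model, X, y, cv=folds, scoring=scoring)
    # cross_validate stores each metric under the key 'test_<name>'.
    return {name: float(np.mean(results['test_' + name]))
            for name in scoring}

scores = summarize_cv(GaussianNB(), X, y)
print(scores)
\end{Verbatim}
Running the same function on the base and the engineered feature sets gives directly comparable numbers for each candidate feature.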
\paragraph{Naïve Bayes}\label{nauxefve-bayes}
One reasonable place to begin might be a Naïve Bayes classifier. While
it is unlikely that all of the features at our disposal are
\emph{strictly} independent, we may be able to relax the assumption of
independence enough to explore how a NB model performs.
We don't want a Bernoulli NB model: our features are not uniformly
binary-valued. We also don't want a Multinomial NB model: per the
documentation, it assumes integer feature counts. A Gaussian NB, on the
other hand, might work well. While it assumes that the likelihoods of
the features are Gaussian - and this is not necessarily strictly the
case - it may be worth trying.
\begin{Verbatim}[commandchars=\\\{\}]
{\color{incolor}In [{\color{incolor}6}]:} \PY{c+c1}{\PYZsh{} Testing on the base data}
\PY{n}{cross\PYZus{}validate\PYZus{}model}\PY{p}{(}\PY{n}{GaussianNB}\PY{p}{(}\PY{p}{)}\PY{p}{,} \PY{n}{X\PYZus{}train}\PY{p}{,} \PY{n}{y\PYZus{}train}\PY{p}{,} \PY{n}{name}\PY{o}{=}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Base Data GaussianNB}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{n}{verbose}\PY{o}{=}\PY{k+kc}{True}\PY{p}{)}
\PY{c+c1}{\PYZsh{} Testing on the engineered data}
\PY{n}{cross\PYZus{}validate\PYZus{}model}\PY{p}{(}\PY{n}{GaussianNB}\PY{p}{(}\PY{p}{)}\PY{p}{,} \PY{n}{e\PYZus{}X\PYZus{}train}\PY{p}{,} \PY{n}{e\PYZus{}y\PYZus{}train}\PY{p}{,} \PY{n}{name}\PY{o}{=}\PY{l+s+s1}{\PYZsq{}}\PY{l+s+s1}{Engineered Data GaussianNB}\PY{l+s+s1}{\PYZsq{}}\PY{p}{,} \PY{n}{verbose}\PY{o}{=}\PY{k+kc}{True}\PY{p}{)}
\end{Verbatim}
\begin{Verbatim}[commandchars=\\\{\}]