% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
%
\documentclass[
]{article}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{subcaption}
\usepackage{float}
\captionsetup{compatibility=false}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\hypersetup{
hidelinks,
pdfcreator={LaTeX via pandoc}}
\urlstyle{same} % disable monospaced font for URLs
\usepackage{longtable,booktabs}
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{-\maxdimen} % remove section numbering
\ifluatex
\usepackage{selnolig} % disable illegal ligatures
\fi
\author{}
\date{}
\newcommand{\hhref}[3][blue]{\href{#2}{\color{#1}{#3}}}%
\begin{document}
\begin{titlepage}
\begin{center}
\vspace*{1cm}
\textbf{Automatic splitting of gameplay videos into scenes}
\vspace{0.5cm}
\hhref{https://github.com/diegoami/DA_ML_Capstone}{https://github.com/diegoami/DA\_ML\_Capstone}
\vspace{1.5cm}
\textbf{Diego Amicabile}
\vfill
Final Capstone Project for the Machine Learning Engineering Nanodegree
\vspace{0.8cm}
\textbf{Udacity}
\vspace{0.8cm}
2 November 2020
\end{center}
\end{titlepage}
\hypertarget{definition}{%
\section{DEFINITION}\label{definition}}
\hypertarget{project-overview}{%
\subsection{PROJECT OVERVIEW}\label{project-overview}}
During the last few years it has become more and more common to stream
on platforms such as YouTube and Twitch while playing video games, or to
upload recorded sessions. The volume of videos produced is overwhelming.
Many of the games being streamed contain different types of
scenes. For both content producers and consumers it would be useful to
be able to split videos automatically, to find out in which time
intervals the different types of scenes occur.
The game that I have chosen to analyze is \emph{Mount\&Blade:
Warband}, of which I have attempted several walkthroughs. In this game you command a hero, and later on a warband, in their quest for power on the fictitious continent of Calradia, which is divided between six warring factions. Along the way you fight battles, sieges and tournaments, clear bandit hideouts and fulfill quests in what is usually a very long game, spanning hundreds of hours.
You spend most of the time on a ``strategic map'', taking your warband to
any of the towns or villages, following or running away from other
warbands belonging to friendly or rival factions, or looking for
quest objectives.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0041_00_09_28.jpg}
\caption{Close to a town}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0049_00_15_38.jpg}
\caption{Bird's view}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0063_00_17_06.jpg}
\caption{Your warband}
\end{subfigure}
\end{figure}
Other in-game screens show the character's inventory or the warband
composition, allow interaction with non-player characters (NPCs), or display status
messages\ldots{}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0056_00_31_30.jpg}
\caption{In a shop}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0056_00_58_08.jpg}
\caption{Manage your warband}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0059_00_26_26.jpg}
\caption{Talk to a NPC}
\end{subfigure}
\end{figure}
The hero can also take a walk in towns, villages and castles, hold training sessions with soldiers, or spar with them in the arena.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0055_00_33_54.jpg}
\caption{In a noble's castle}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0064_00_52_16.jpg}
\caption{Training Session}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0076_00_42_48.jpg}
\caption{Challenge in the arena}
\end{subfigure}
\end{figure}
However, what we are interested in is locating the scenes when the
warband engages enemies and the game switches to a tactical view, such
as a battle in an open field or in a village\ldots{}
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0060_00_44_20.jpg}
\caption{Raid on caravan}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0066_01_21_18.jpg}
\caption{Major battle}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0065_00_29_08.jpg}
\caption{Desert battle}
\end{subfigure}
\end{figure}
or when they siege a town or a castle...
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0068_00_48_04.jpg}
\caption{Unuzdaq castle}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0068_00_20_08.jpg}
\caption{Dhirim}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0075_00_25_50.jpg}
\caption{Halmar}
\end{subfigure}
\end{figure}
or when they assault a bandit hideout.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0041_00_39_58.jpg}
\caption{Forest bandits}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0054_00_10_54.jpg}
\caption{Tundra bandits}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0042_00_05_08.jpg}
\caption{Steppe bandits}
\end{subfigure}
\end{figure}
The hero often takes part in tournaments, which are a very important
part of the game and which we want to locate in gameplay videos. Regrettably, they may look similar to less interesting situations such as training or sparring in the arena.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0057_00_21_48.jpg}
\caption{Tulga tournament}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0061_00_25_56.jpg}
\caption{Tulga tournament}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0042_00_12_48.jpg}
\caption{Tulga tournament}
\end{subfigure}
\end{figure}
Quests and ambushes are fairly infrequent, and their screenshots may look
similar to those of more peaceful situations. For instance, scenes in which the hero rescues a lord from prison or fights a bandit in a village are not very different from scenes in which she is just taking a stroll in a town or village.
\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0054_00_40_24.jpg}
\caption{Lord Rescue}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0063_00_27_08.jpg}
\caption{Ambush in Town}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\includegraphics[width=\linewidth]{docimages/E_0067_00_52_32.jpg}
\caption{Ambush in village}
\end{subfigure}
\end{figure}
\newpage
\hypertarget{problem-statement}{%
\subsection{PROBLEM STATEMENT}\label{problem-statement}}
The goal is to create and deploy a model which classifies
images from the game \emph{Mount\&Blade: Warband}, returning a category
such as ``Battle'', ``Hideout'', ``Siege'', ``Tournament'' or
``Other'' for each frame.
The approach should be robust enough to allow adding more categories later, once enough samples are available.
It would be ideal to have categories for less frequent scenes such as ``Prison escape'', ``Ambush'' or ``Quest'', but this is out of scope and such scenes are lumped together with the closest category.
An additional goal is to have a model which identifies contiguous scenes in a gameplay video of \emph{Mount\&Blade: Warband}, providing the beginning and the end of each scene and its most likely category.
A necessary prerequisite for this project is to gather a dataset of screenshots from gameplay videos, along with the category to which each belongs.
\hypertarget{metrics}{%
\subsection{METRICS}\label{metrics}}
While training, validating and testing the image classifier, I will evaluate it by means of accuracy and cross-entropy loss.
Overall accuracy is not a good metric on an unbalanced dataset, but cross-entropy is a good loss function for multi-class classification problems, as minimizing it brings the predicted probability distribution closer to the actual one.
I will also use scikit-learn functions to measure precision, recall and F1 for each class, as well as their macro and weighted averages. A model will not be considered good enough if it cannot deliver good precision and, even more importantly, good recall on classes that have relatively few samples.
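As an illustration, these metrics can be computed with scikit-learn and PyTorch roughly as follows (a minimal sketch with placeholder arrays, not the code of the project scripts):
\begin{verbatim}
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, classification_report

# placeholders: true class indices and per-frame log probabilities
y_true = np.array([0, 2, 2, 4, 1])
log_probs = torch.log_softmax(torch.randn(5, 5), dim=1)
y_pred = log_probs.argmax(dim=1).numpy()

# cross-entropy between predicted log probabilities and true labels
loss = F.nll_loss(log_probs, torch.tensor(y_true))

print("accuracy:", accuracy_score(y_true, y_pred))
print("cross-entropy:", loss.item())
# per-class precision, recall and F1, plus macro and weighted averages
print(classification_report(y_true, y_pred))
\end{verbatim}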
\hypertarget{analysis}{%
\section{ANALYSIS}\label{analysis}}
Most files referenced in this section are relative to \hhref{https://github.com/diegoami/DA_ML_Capstone/}{this Github repository}.
\hypertarget{data-exploration}{%
\subsection{DATA EXPLORATION}\label{data-exploration}}
\hypertarget{creating-a-dataset}{%
\subsubsection{CREATING A DATASET}\label{creating-a-dataset}}
To create a dataset I took some videos from a game walkthrough of mine,
the adventures of Wendy (\hhref{https://www.youtube.com/playlist?list=PLNP_nRm4k4jfNLo7FkjXewFH9Xe5Uc2Pa}{I}, \hhref{https://www.youtube.com/playlist?list=PLNP_nRm4k4jd-AJ0GwTPS1ld2YP8FdT4h}{II}).
I used episodes 41 to 68 to build the model, keeping episodes 69 to 77 to test it. To simplify downloading, I also created the following playlists on YouTube, containing just the episodes used in this project:
\begin{itemize}
\tightlist
\item
\hhref{https://www.youtube.com/playlist?list=PLNP_nRm4k4jfVfQobYTRQAXV\_uOzt8Bov}{CNN-Wendy-I}
\item
\hhref{https://www.youtube.com/playlist?list=PLNP_nRm4k4jdEQ-OM31xNqeE64svvx-aT}{CNN-Wendy-II}
\item
\hhref{https://www.youtube.com/playlist?list=PLNP_nRm4k4jeoJ8H7mtTUbbOJ6_Rx_god}{CNN-Wendy-III}
\end{itemize}
To build the dataset, I identified the scenes in these episodes and wrote them all down in the video descriptions on YouTube.
For instance, in episode 54 I wrote down the following scenes in the
description; the category (``Hideout'', ``Battle'', ``Tournament'' or ``Town'', where ``Town'' is
eventually remapped to ``Battle'') is the first word of each description line. All other parts of the video are
categorized as ``Other''.
\begin{itemize}
\tightlist
\item
09:51-12:21 Hideout Tundra Bandits (Failed)
\item
18:47-19:44 Battle with Sea Raiders
\item
20:50-21:46 Battle with Sea Raiders
\item
22:54-23:42 Battle with Sea Raiders
\item
34:06-37:44 Tournament won in Tihr
\item
38:46-40:48 Town escape for Boyar Vlan
\end{itemize}
\hypertarget{companion-projects}{%
\subsubsection{COMPANION PROJECT}\label{companion-projects}}
To prepare the dataset, I created \hhref{https://github.com/diegoami/DA_split_youtube_frames\_s3/}{a companion project} which makes it possible to:
\begin{itemize}
\tightlist
\item
download the relevant videos from YouTube in 640x360 format, using the
\emph{youtube-dl} Python library
\item
extract a frame every two seconds and save it as a \emph{jpeg} file, using
the \emph{opencv} Python library, resizing it to the practical format
320x180 (a minimal sketch of this step follows the list)
\item
download the text of the YouTube descriptions, which describes the scenes, and save it alongside the
video as \emph{metadata}
\item
assign the images created in the second step to labels, according to the metadata produced in the third step
\item
save the images into subdirectories named after the labels they were assigned
\end{itemize}
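As an illustration of the frame-extraction step, here is a minimal sketch using \emph{opencv}; it assumes a previously downloaded episode and uses placeholder file and directory names, so it only shows the idea behind the companion project, not its actual code.
\begin{verbatim}
import os
import cv2

VIDEO_PATH = "E_0054.mp4"   # placeholder: a downloaded episode
OUT_DIR = "frames/E_0054"   # placeholder output directory
os.makedirs(OUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(VIDEO_PATH)
fps = cap.get(cv2.CAP_PROP_FPS)
step = int(fps * 2)         # one frame every two seconds

index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        # resize from 640x360 to the practical 320x180 format
        small = cv2.resize(frame, (320, 180))
        seconds = int(index / fps)
        name = "E_0054_%02d_%02d_%02d.jpg" % (
            seconds // 3600, (seconds % 3600) // 60, seconds % 60)
        cv2.imwrite(os.path.join(OUT_DIR, name), small)
    index += 1
cap.release()
\end{verbatim}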
The result of this process has been uploaded \hhref{https://da-youtube-ml.s3.eu-central-1.amazonaws.com/}{to a public bucket on Amazon S3}, from which you can browse and download all files produced in the stages described above.
\hypertarget{dataset-characteristics}{%
\subsubsection{DATASET CHARACTERISTICS}\label{dataset-characteristics}}
The dataset contains 51216 images, 320 x 180, in jpeg format,
distributed over the following classes. It can be downloaded and extracted from \hhref{https://da-youtube-ml.s3.eu-central-1.amazonaws.com/wendy-cnn/frames/wendy_cnn_frames_data_5.zip}{this zip file on S3}.
\begin{longtable}[]{@{}llll@{}}
\toprule
& Category & Amount & Percentage\tabularnewline
\midrule
\endhead
& Battle & 7198 & 14.0\%\tabularnewline
& Hideout & 1163 & 2.2\%\tabularnewline
& Other & 35425 & 69.2\%\tabularnewline
& Siege & 634 & 1.2\%\tabularnewline
& Tournament & 6796 & 13.3\%\tabularnewline
& TOTAL & 51216 &\tabularnewline
\bottomrule
\end{longtable}
You can browse them using the \emph{analysis.ipynb} notebook. While
looking at the images, it seemed to me that a 320x180 format, even in
grayscale, keeps most of the information necessary to categorize them.
\hypertarget{exploratory-visualization}{%
\subsubsection{EXPLORATORY
VISUALIZATION}\label{exploratory-visualization}}
Using PCA on a flattened version of the image matrices, in 80x45 grayscale format,
I produced visualizations of the distribution of the two main principal components of the full dataset.
\begin{figure}[H]
\centering
\includegraphics{visualizations/pca_sklearn_2d_80_45_L.png}
\caption{2 Main Principal Components}
\end{figure}
and also of the three main principal components, using the same color scheme.
\begin{figure}[H]
\centering
\includegraphics{visualizations/pca_sklearn_3d_80_45_L.png}
\caption{3 Main Principal Components}
\end{figure}
It can be seen that ``Other'' scenes are separated into several clusters.
``Battle'', ``Tournament'', ``Siege'' and ``Hideout'' images do group in
certain regions, but there is a lot of overlap among them and
with some of the ``Other'' images.
Later on, after we have created a VGG13 model, we can plot an alternative PCA representation of the
dataset, using the additional features that can be recovered from the last layer of the neural net.
\begin{figure}[H]
\centering
\includegraphics{visualizations/pca_v5n_2d_f.png}
\caption{VGG13-2 principal components}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics{visualizations/pca_v5n_3d_f4.png}
\caption{VGG13-3 principal components}
\end{figure}
Here there is much less overlap between the regions where the different
classes are located. This shows how important the features learned by a neural network are
for the categorization task.
The scripts used to produce these visualizations are \emph{pca\_vgg.py} and \emph{pca\_sklearn.py}.
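The following is a minimal sketch of how such a projection can be produced with scikit-learn and matplotlib; it uses placeholder arrays in place of the flattened 80x45 grayscale images and their labels, and only illustrates the idea behind \emph{pca\_sklearn.py}.
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# placeholders for the flattened grayscale images and their labels
X = np.random.rand(1000, 80 * 45)
y = np.random.randint(0, 5, size=1000)

# project onto the two main principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# scatter plot, one color per class
names = ["Battle", "Hideout", "Other", "Siege", "Tournament"]
for label, name in enumerate(names):
    mask = y == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=2, label=name)
plt.legend()
plt.savefig("pca_2d.png")
\end{verbatim}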
\hypertarget{algorithms-and-techniques}{%
\subsection{ALGORITHMS AND
TECHNIQUES}\label{algorithms-and-techniques}}
The general idea is to first create an image classifier to categorize the
images extracted from gameplay videos, and then to use the predictions generated by this classifier as the basis for splitting the videos into scenes.
The main approach to building the image classifier will be a Convolutional
Neural Network (CNN), since
\begin{itemize}
\tightlist
\item
this is a well-tried way to approach the problem
\item
I can use standard CNN topologies available in PyTorch, such as VGG,
picking one with the best performance/complexity tradeoff
\item
at the very least the model would give me more information on
what to look for
\end{itemize}
The topology I decided to use was VGG13, which is simple to use, flexible,
powerful enough to give good results and small enough to fit on the GPU I used for training.
As the images extracted from game walkthroughs are quite unlike real-world
photographs, using a pre-trained net and extending it with transfer
learning did not seem a sensible approach. Instead, I opted for a full
training from scratch.
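For instance, an untrained VGG13 with a classifier head adapted to our five categories can be set up along these lines with \emph{torchvision} (a minimal sketch, not the exact code of \emph{train.py}):
\begin{verbatim}
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # Battle, Hideout, Other, Siege, Tournament

# untrained VGG13: we train from scratch instead of transfer learning
model = models.vgg13(pretrained=False)

# replace the last fully connected layer so that the network
# outputs one score per category of our domain
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, NUM_CLASSES)
\end{verbatim}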
\hypertarget{benchmark}{%
\subsection{BENCHMARK}\label{benchmark}}
As 69.2\% of the images belong to the category ``Other'', a model
should have an accuracy well above 69\% to be considered better
than a model that always picks the category ``Other''.
On top of that, I created a very simple benchmark that feeds flattened
image matrices through PCA into standard scikit-learn
classifiers.
Using grayscale images in 80x45 format, I got the following results
with a Random Forest and a Stochastic Gradient Descent classifier trained on 100 PCA-produced features,
evaluated on a validation set which had not been used for training (a minimal sketch of this pipeline follows).
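The benchmark pipeline looks roughly like the following sketch, which uses placeholder arrays in place of the flattened images and labels and is not the exact benchmark script:
\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# placeholders for the flattened 80x45 grayscale images and labels
X = np.random.rand(2000, 80 * 45)
y = np.random.randint(0, 5, size=2000)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

for clf in (RandomForestClassifier(n_estimators=200),
            SGDClassifier(alpha=0.001, max_iter=10000)):
    # 100 principal components as features for the classifier
    pipeline = make_pipeline(PCA(n_components=100), clf)
    pipeline.fit(X_train, y_train)
    print(clf.__class__.__name__)
    print(classification_report(y_val, pipeline.predict(X_val)))
\end{verbatim}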
It is not surprising that these results do not look bad at all: the
simple models are actually good at separating ``Other'' images from images depicting
any kind of engagement or fight (Tournament, Battle, Siege, Hideout).
That is expected, as there are some GUI components that appear only in
these scenes.
\hypertarget{randomforestclassifier}{%
\paragraph{RandomForestClassifier}\label{randomforestclassifier}}
\begin{verbatim}
RandomForestClassifier(n_estimators=200)
Accuracy: 0.932
F1 Score: 0.926
precision recall f1-score support
0 0.83 0.91 0.87 2375
1 0.99 0.18 0.31 384
2 0.98 0.97 0.98 11691
3 1.00 0.34 0.51 209
4 0.81 0.92 0.86 2243
accuracy 0.93 16902
macro avg 0.92 0.67 0.70 16902
weighted avg 0.94 0.93 0.93 16902
\end{verbatim}
\hypertarget{sgdclassifier}{%
\paragraph{SGDClassifier}\label{sgdclassifier}}
\begin{verbatim}
SGDClassifier(alpha=0.001, max_iter=10000)
Accuracy: 0.841
F1 Score: 0.843
precision recall f1-score support
0 0.72 0.53 0.61 2375
1 0.53 0.37 0.44 384
2 0.97 0.93 0.95 11691
3 0.56 0.30 0.39 209
4 0.52 0.83 0.64 2243
accuracy 0.84 16902
macro avg 0.66 0.59 0.60 16902
weighted avg 0.86 0.84 0.84 16902
\end{verbatim}
\hypertarget{methodology}{%
\section{METHODOLOGY}\label{methodology}}
\hypertarget{data-processing}{%
\subsection{DATA PROCESSING}\label{data-processing}}
The image dataset has been created by extracting frames from videos on
YouTube and putting them in subdirectories named after the category they
belong to.
These images have been resized to 320 x 180 in RGB format and are
normalized before being fed to the network.
While training the benchmark models, images are converted to
grayscale and resized to 80 x 45; when training the deep learning
models we use the original 320x180 color images without further resizing,
since this is the format the modified VGG model expects.
As these images originate from a video game, data
augmentation such as mirroring or cropping does not make sense: the game
would never produce such images, since, for instance, many GUI components
always appear on the same side of the screen. A sketch of the preprocessing transforms is shown below.
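A minimal sketch of such a preprocessing pipeline with \emph{torchvision} transforms might look as follows; the directory name and normalization constants are placeholders, not necessarily those used in \emph{train.py}.
\begin{verbatim}
from torchvision import datasets, transforms

# resize to 320x180 (torchvision expects height x width) and normalize
transform = transforms.Compose([
    transforms.Resize((180, 320)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# ImageFolder reads images from subdirectories named after their labels
dataset = datasets.ImageFolder("wendy_cnn_frames_data", transform=transform)
\end{verbatim}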
\hypertarget{implementation}{%
\subsection{IMPLEMENTATION}\label{implementation}}
I created scripts and notebooks so that they would work both locally and
on Sagemaker. For more details see the file \url{README.md} in the \hhref{https://github.com/diegoami/DA_ML_Capstone}{Github repository}.
\hypertarget{training-script}{%
\paragraph{TRAINING SCRIPT}\label{training-script}}
The training script \emph{train.py} performs the following steps (a minimal sketch follows the list):
\begin{itemize}
\tightlist
\item
uses a PyTorch image loader to create a generator scanning all
files in the data directory
\item
uses a torchvision transform to resize and normalize the images
\item
splits the dataset into train, validation and test sets, using
stratification and shuffling
\item
loads a VGG13 neural network, modified so that the output layer produces
a category from our domain (5 categories in the final version)
\item
for each epoch, executes a training step and an evaluation step, trying to
minimize the cross-entropy loss on the validation set
\item
saves the model so that it can be used by the following steps
\item
evaluates the produced model on the test dataset, and prints a
classification report and a confusion matrix.
\end{itemize}
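The core of such a training script might look like this sketch, which uses placeholder paths and hyperparameters and a plain random split; the real \emph{train.py} additionally performs stratification, reporting and model export.
\begin{verbatim}
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose(
    [transforms.Resize((180, 320)), transforms.ToTensor()])
dataset = datasets.ImageFolder("wendy_cnn_frames_data", transform=transform)

# simple random split shown here; the actual script stratifies the classes
n_val = len(dataset) // 10
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

model = models.vgg13(pretrained=False)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 5)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(8):  # 8 epochs in the final model
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # evaluation step: track cross-entropy loss on the validation set
    model.eval()
    val_loss, n = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            val_loss += criterion(model(images), labels).item() * len(labels)
            n += len(labels)
    print("epoch %d, validation loss %.4f" % (epoch, val_loss / n))

torch.save(model.state_dict(), "model.pth")
\end{verbatim}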
\hypertarget{verification-script}{%
\paragraph{VERIFICATION SCRIPT}\label{verification-script}}
The verification script \emph{verify\_model.py} works only locally, as
it assumes the model and the dataset have been saved locally in the previous
step. This script (sketched below):
\begin{itemize}
\tightlist
\item
loads the model created in the previous step
\item
walks through all the images in the dataset, one by one, returning a list of predictions (by means of the predictor) and a list of true labels
\item
prints the average accuracy, a classification report, a confusion matrix,
and a list of images whose predicted category does not coincide with
their labels, so that the dataset can be corrected.
\end{itemize}
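A minimal sketch of this verification loop, assuming the images are stored in one subdirectory per label and using a placeholder \emph{predict} helper that stands in for the real predictor:
\begin{verbatim}
import os
from PIL import Image
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

DATA_DIR = "wendy_cnn_frames_data"  # placeholder: <label>/<image>.jpg

def predict(image):
    # placeholder for the real predictor, which applies the training
    # transforms and returns the predicted class name for one image
    raise NotImplementedError

y_true, y_pred, mismatches = [], [], []
for label in sorted(os.listdir(DATA_DIR)):
    for filename in os.listdir(os.path.join(DATA_DIR, label)):
        path = os.path.join(DATA_DIR, label, filename)
        prediction = predict(Image.open(path))
        y_true.append(label)
        y_pred.append(prediction)
        if prediction != label:
            mismatches.append(path)  # candidates for relabeling

print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print("possibly mislabeled images:", mismatches)
\end{verbatim}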
\hypertarget{predictor}{%
\paragraph{PREDICTOR}\label{predictor}}
The file \emph{predict.py} contains the functions necessary to
deploy the model to an endpoint. It works both locally and in a
Sagemaker container and requires a previously trained model. A sketch of the data flow through these functions follows the list.
\begin{itemize}
\tightlist
\item
input\_fn: the endpoint entry point, which converts a Numpy
matrix payload to a Pillow Image and applies the same preprocessing
steps as during training
\item
model\_fn: predicts a category using the previously trained model,
accepting an image from the domain space (a screenshot from
\emph{Mount\&Blade: Warband} in 320 x 180 format) transformed in
the same way as during training
\item
output\_fn: returns the model output as a numpy array: a list of log
probabilities for each class
\end{itemize}
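The conversions performed by these functions amount to something like the following sketch; the actual Sagemaker handler signatures differ, and the transform shown here is a placeholder.
\begin{verbatim}
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose(          # placeholder transform
    [transforms.Resize((180, 320)), transforms.ToTensor()])

def payload_to_tensor(payload):
    # input_fn: convert the Numpy matrix payload to a Pillow Image
    # and apply the same preprocessing as during training
    image = Image.fromarray(np.uint8(payload))
    return transform(image).unsqueeze(0)

def predict_log_probs(model, batch):
    # model_fn/output_fn: run the trained model and return the log
    # probabilities for each class as a numpy array
    model.eval()
    with torch.no_grad():
        return torch.log_softmax(model(batch), dim=1).numpy()
\end{verbatim}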
\hypertarget{endpoint}{%
\paragraph{ENDPOINT}\label{endpoint}}
The file \emph{endpoint.py} contains a method \emph{evaluate} that calls
an endpoint on Sagemaker and collects predictions,
which can then be used to produce a classification report and a confusion
matrix. It requires locally saved data, but the model is accessed
through a published endpoint, unlike the \emph{verify\_model.py}
component, which requires a locally saved model.
\emph{endpoint.py} works only on Sagemaker, when called from a Jupyter
Notebook. An example can be found in
\emph{CNN\_Fourth\_iteration.ipynb}.
\hypertarget{refinement}{%
\subsection{REFINEMENT}\label{refinement}}
How the approach evolved can be seen in the Jupyter Notebooks \emph{CNN\_First\_Iteration.ipynb}, \emph{CNN\_Second\_Iteration.ipynb}, \emph{CNN\_Third\_Iteration.ipynb} and \emph{CNN\_Fourth\_Iteration.ipynb} (the final one, which can be executed on Sagemaker).
At the beginning I was trying to recognize 8 types of scenes. Apart from
the categories BATTLE, HIDEOUT, OTHER, SIEGE and TOURNAMENT I also had the
categories TOWN, TRAP and TRAINING. But as can be seen in
\url{CNN_First_Iteration.ipynb}, I had too few examples of these and
merged TRAINING into OTHER, while TOWN and TRAP became BATTLE.
Another important refinement was finding images in the dataset that had
been wrongly labeled. For this, I used the output of
\emph{verify\_model.py}, which points to images whose
prediction does not match the true label. The only sensible way to deal with
that was to correct the metadata at the source and use the companion
project \emph{DA\_split\_youtube\_frames\_S3} to regenerate the frames in the
correct directories.
After a few tries, 320 x 180 also proved to be the optimal image size.
\hypertarget{results}{%
\section{RESULTS}\label{results}}
\hypertarget{model-evaluation-and-validation}{%
\subsection{MODEL EVALUATION AND
VALIDATION}\label{model-evaluation-and-validation}}
\hypertarget{image-classification}{%
\paragraph{IMAGE CLASSIFICATION}\label{image-classification}}
The final model uses the 51216 color images (320 x 180) with a VGG13 network trained for 8
epochs. These are the confusion matrix and metrics on the full dataset.
Confusion matrix
\begin{longtable}[]{@{}llllll@{}}
\toprule
X & 0 & 1 & 2 & 3 & 4\tabularnewline
\midrule
\endhead
0 & 7126 & 13 & 49 & 3 & 7\tabularnewline
1 & 0 & 1159 & 1 & 2 & 1\tabularnewline
2 & 163 & 13 & 35154 & 5 & 90\tabularnewline
3 & 5 & 6 & 0 & 623 & 0\tabularnewline
4 & & 181 & 181 & 2 & 6609\tabularnewline
\bottomrule
\end{longtable}
Classification Report
\begin{longtable}[]{@{}llllll@{}}
\toprule
class name & class & precision & recall & f1-score &
support\tabularnewline
\midrule
\endhead
Battle & 0 & 0.98 & 0.99 & 0.98 & 7198\tabularnewline
Hideout & 1 & 0.97 & 1.00 & 0.98 & 1163\tabularnewline
Other & 2 & 0.99 & 0.99 & 0.99 & 35425\tabularnewline
Siege & 3 & 0.98 & 0.98 & 0.98 & 634\tabularnewline
Tournament & 4 & 0.99 & 0.97 & 0.98 & 6796\tabularnewline
& macro avg & 0.98 & 0.99 & 0.98 & 51216\tabularnewline
& weighted avg & 0.99 & 0.99 & 0.99 & 51216\tabularnewline
\bottomrule
\end{longtable}
When evaluated on the test dataset alone, the model returns an accuracy of 98.7\%,
a cross-entropy of 0.0026 and the following confusion matrix and
classification report:
Confusion Matrix
\begin{longtable}[]{@{}llllll@{}}
\toprule
X & 0 & 1 & 2 & 3 & 4\tabularnewline
\midrule
\endhead
0 & 995 & 3 & 7 & 1 & 2\tabularnewline
1 & 0 & 163 & 0 & 0 & 0\tabularnewline
2 & 26 & 3 & 4911 & 2 & 18\tabularnewline
3 & 3 & 1 & 0 & 85 & 0\tabularnewline
4 & 0 & 0 & 24 & 0 & 927\tabularnewline
\bottomrule
\end{longtable}
\begin{longtable}[]{@{}llllll@{}}
\toprule
class name & class & precision & recall & f1-score &
support\tabularnewline
\midrule
\endhead
Battle & 0 & 0.97 & 0.99 & 0.98 & 1008\tabularnewline
Hideout & 1 & 0.96 & 1.00 & 0.98 & 163\tabularnewline
Other & 2 & 0.99 & 0.99 & 0.99 & 4960\tabularnewline
Siege & 3 & 0.97 & 0.96 & 0.96 & 89\tabularnewline
Tournament & 4 & 0.98 & 0.97 & 0.98 & 951\tabularnewline
& macro avg & 0.97 & 0.98 & 0.98 & 7171\tabularnewline
& weighted avg & 0.99 & 0.99 & 0.99 & 7171\tabularnewline
\bottomrule
\end{longtable}
\hypertarget{interval-identification}{%
\paragraph{INTERVAL IDENTIFICATION}\label{interval-identification}}
However, this is not the only result I was striving for. I wanted to
create a tool not just to categorize images, but to split videos into
scenes. This problem would be worth a project in itself, possibly
building a model on top of another model, or maybe considering an RNN. At
the moment I think that would make the problem too complex, as I only expect
this tool to help with writing descriptions, not to create
them without human supervision.
To convert image classifications into scenes, I use an empirical
procedure which is very dependent on the model and the dataset. Using
the probabilities calculated for each frame, I build a ``string
visualization'' of the frame, a set of characters
representing the initials of the classes the frame is thought to
belong to. From these frame representations, scenes are then
extracted.
The algorithms are in the \emph{interval} package.
Locally I use \emph{predict\_intervals\_walkdir};
on Sagemaker you can use \emph{predict\_intervals\_endpoint}. A sketch of the idea follows.
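The sketch below illustrates one plausible way to build such a string visualization and to group consecutive frames into scenes; the exact encoding and thresholds used by the \emph{interval} package may differ, and the helper names are placeholders.
\begin{verbatim}
import numpy as np

CLASSES = ["Battle", "Hideout", "Other", "Siege", "Tournament"]
INITIALS = ["B", "H", "_", "S", "T"]
OTHER = CLASSES.index("Other")

def frame_string(probs, width=20):
    # allocate character slots proportionally to class probabilities,
    # e.g. P(Battle)=0.9 gives roughly 18 'B' characters out of 20
    counts = np.round(np.asarray(probs) * width).astype(int)
    return "".join(INITIALS[i] * c for i, c in enumerate(counts))[:width]

def extract_scenes(frame_probs, step=2, min_prob=0.5):
    # group consecutive non-"Other" frames into scenes and report the
    # average class probabilities over each scene
    scenes, start = [], None
    for idx, probs in enumerate(frame_probs + [[0, 0, 1, 0, 0]]):
        hit = np.argmax(probs) != OTHER and max(probs) >= min_prob
        if hit and start is None:
            start = idx
        elif not hit and start is not None:
            avg = np.mean(frame_probs[start:idx], axis=0)
            scenes.append((start * step, idx * step,
                           CLASSES[int(np.argmax(avg))], float(avg.max())))
            start = None
    return scenes  # (start seconds, end seconds, category, confidence)
\end{verbatim}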
I applied this procedure to a few episodes whose images hadn't been used
for training, from E69 to E77, skipping E70 which is an outlier (Siege
Defence in Slow motion, bugged by the mod I was using). The
results, along with the correct metadata, can be found in the
\url{split_scenes} directory of the main repository.
Pretty much all battles, sieges, hideouts and tournaments are recognized
correctly. An example of the output:
\begin{verbatim}
...
41:10 __BBBBBBBBBBBBBBB___
41:12 _BBBBBBBBBBBBBBBBBBB
41:14 _BBBBBBBBBBBBBBBBBBS
41:16 __BBBBBBBBBBBBBBBBBB
41:18 _BBBBBBBBBBBBBBBBBBB
41:20 __BBBBBBBBBBBBBBBBBB
41:22 __BBBBBBBBBBBBBBBBHS
41:24 _BBBBBBBBBBBBBBBBBBB
41:26 _BBBBBBBBBBBBBBBBBBB
41:28 __BBBBBBBBBBBBBBBB_T
41:30 _BBBBBBBBBBBBBBBBB_T
41:32 __BBBBBBBBBBBBBBB__T
41:34 ___BBBBBBBBBBBBBBBHH
41:36 __BBBBBBBBBBBBBBB__T
41:38 ___BBBBBBBBBBBBBBBBH
01:44-03:06 | Battle : 95%
19:50-21:54 | Battle : 89% , Other : 7%
36:50-38:04 | Battle : 94%
41:10-41:38 | Battle : 85% , Other : 10%
\end{verbatim}
There are some instances where the output is less than perfect, but still
very helpful. For instance:
\begin{itemize}
\tightlist
\item
In E72, E73 and E74 the hero takes a walk around her estates, but the algorithm is confused and hints at some kind of battle, as these are outdoor scenes
\item
In E73 a siege is recognized correctly, but the program is confused
when the hero gets knocked out and the troops keep fighting without
her
\item
In E72 there are a couple of mountain battles in which the hero fights
dismounted, and which are therefore partly classified as Hideouts
\item
In E72 some noise is generated by a ``Training'' session,
which I usually put in the ``Other'' bucket but which is actually a fight
scene with some similarity to Tournaments.
\item
The Siege of Halmar in E75 gets about 45\% probability Siege, 40\%
probability Hideout
\item
In E76 some sparring in the Arena is confused with a Tournament
\end{itemize}
The idea in the long term is to gather more data and introduce additional categories.
\hypertarget{justification}{%
\subsection{JUSTIFICATION}\label{justification}}
In the final model, both F1 and recall are in the 96-100\% range for
every category, unlike in the simple models used as benchmarks. It is
particularly important to be able to tell tournaments from battles and to
show good precision and recall on Siege and Hideout frames, which is
exactly the information we need.
To see that, we can compare how VGG13, the Random Forest (RF) and the
Stochastic Gradient Descent (SGD) classifier score when classifying each
category.
\begin{longtable}[]{@{}llll@{}}
\toprule
F1 & VGG13 & RF & SGD\tabularnewline
\midrule
\endhead
Battle & 0.98 & 0.87 & 0.61\tabularnewline
Hideout & 0.98 & 0.31 & 0.44\tabularnewline
Other & 0.99 & 0.98 & 0.95\tabularnewline
Siege & 0.96 & 0.51 & 0.39\tabularnewline
Tournament & 0.98 & 0.86 & 0.64\tabularnewline
\bottomrule
\end{longtable}
The difference in Recall is even greater:
\begin{longtable}[]{@{}llll@{}}
\toprule
Recall & VGG13 & RF & SGD\tabularnewline
\midrule
\endhead
Battle & 0.99 & 0.91 & 0.53\tabularnewline
Hideout & 1.00 & 0.18 & 0.37\tabularnewline
Other & 0.99 & 0.97 & 0.93\tabularnewline
Siege & 0.96 & 0.34 & 0.30\tabularnewline
Tournament & 0.97 & 0.92 & 0.83\tabularnewline
\bottomrule
\end{longtable}
In future improvements I plan to introduce more categories, which
will also have few samples. This makes the difference in recall even
more critical.
\end{document}