\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8x]{inputenc}
\usepackage[colorinlistoftodos, textwidth=2cm, shadow]{todonotes}
\usepackage{listings}
\usepackage{appendix}
\usepackage[colorlinks=true,linkcolor=blue]{hyperref}
\usepackage[section]{placeins}
\usepackage{color}
\usepackage{amsmath}
\definecolor{darkgreen}{rgb}{0,0.4,0}
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}}
\lstset{
language=C
}
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\let\subsectionautorefname\sectionautorefname
\let\subsubsectionautorefname\sectionautorefname
\begin{document}
%Polish title page
\begin{titlepage}
\begin{center}
\textsc{\LARGE \\ Uniwersytet Wrocławski }\\
Wydział Matematyki i Informatyki \\
Instytut Informatyki \\[1.5cm]
\textsc{\Large Praca Magisterska}\\[0.5cm]
\HRule \\[0.4cm]
{ \huge \bfseries \emph{dsinf}: Inferencja struktur danych oparta na kodzie źródłowym \\[0.4cm] }
\HRule \\[1.5cm]
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Autor:}\\
Aleksander \textsc{Balicki}
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Promotor:} \\
Prof. Witold \textsc{Charatonik}
\end{flushright}
\end{minipage}
\vfill
{\large Wrzesień 2014}
\thispagestyle{empty}
\end{center}
\end{titlepage}
% Polish abstract
\renewcommand{\abstractname}{Streszczenie}
\begin{abstract}
Większość popularnych dzisiaj języków programowania posiada już zaimplementowane biblioteki z wieloma dostępnymi strukturami danych
do przechowywania danych.
Programiści po prostu muszą nauczyć się, kiedy należy używać danej struktury. W niektórych przypadkach ten wybór jest na tyle prosty,
że program mógłby automatycznie dobrać najlepszą strukturę danych. Praca ta opisuje \emph{dsinf} --- framework do dobierania najlepszej struktury danych pasującej do zadania na podstawie kodu źródłowego programu w języku C. \emph{dsinf} analizuje przypadki użycia struktur danych w całym programie i proponuje najszybszą.
Do zapewnienia jakości tej sugestii, używa danych profilera.
\end{abstract}
\thispagestyle{empty}
\pagebreak
% English title page
\begin{titlepage}
\begin{center}
\textsc{\LARGE \\ University of Wrocław} \\
Faculty of Mathematics and Computer Science \\
Institute of Computer Science\\[1.5cm]
\textsc{\Large Master's Thesis}\\[0.5cm]
\HRule \\[0.4cm]
{ \huge \bfseries \emph{dsinf}: Source based data structure inference \\[0.4cm] }
\HRule \\[1.5cm]
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
Aleksander \textsc{Balicki}
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Supervisor:} \\
Prof. Witold \textsc{Charatonik}
\end{flushright}
\end{minipage}
\vfill
{\large September 2014}
\thispagestyle{empty}
\end{center}
\end{titlepage}
\vfill
\pagebreak
% English abstract
\renewcommand{\abstractname}{Abstract}
\begin{abstract}
Most programming languages popular today already ship libraries with convenient implementations of many
data structures. Programmers only have to learn when to use which one. In some cases the choice of the
right data structure is simple enough that a program could make it automatically. This work describes the
\emph{dsinf} project, an analysis framework that infers the data structure best matching the task from a
program's C source code. \emph{dsinf} analyses how data structures are used throughout the program and
suggests the fastest one. To further improve the quality of the suggestion, it can use profiler data.
\end{abstract}
\thispagestyle{empty}
\pagebreak
\tableofcontents
\vfill
\section{Introduction} \label{sec:intro}
In computer science, a data structure is a particular way of storing and organizing data in a computer so that
it can be used efficiently~\cite{Wids}.
The interface of a data structure is the set of operations that can be performed on it, together with the
semantics of those operations.
An implementation of a data structure is a realization of such an interface in some programming language
that is consistent with those semantics.
There are many well-known kinds of data structures:
\begin{itemize}
\item Map/associative array/dictionary is a data structure for storing $(key,\;value)$ pairs; it allows
insertion, deletion, modification and lookup,
\item multimap is the same as a map, but it can store multiple values for a single key,
\item list is a data structure for storing ordered sequences of values,
\item set is a data structure for storing unordered collections of distinct values,
\item multiset is the same as a set, but it can store multiple elements that are equal
under some equivalence relation,
\item queue is a data structure implementing operations for enqueuing and dequeuing elements,
\item deque is a queue in which elements can be added and removed at both ends instead of one,
\item stack is a data structure implementing operations for putting an element on top of the stack and
removing the top element,
\item priority queue is like a regular queue, but each element has an associated priority, which
determines the order in which elements are dequeued,
\item string is a data structure for storing text,
\item tree is a data structure for holding tree-like hierarchical data,
\item graph is a data structure for storing elements together with a relation on those elements,
\item others, which are usually modifications or combinations of the previous ones.
\end{itemize}
Different implementations of these data structures trade the running times of specific operations off
against each other, so no single data structure is perfect for every task.
\subsection{What is dsinf?} \label{sub:intro}
dsinf is a source code analyzer. It works on specially prepared C code in which every instance of a data
structure uses a type defined in the dsinf library, called a data structure wildcard. This type represents a
data structure with the most general interface, i.e.\ the union of the interfaces of all data structures
available in the dsinf framework. dsinf connects the data structures with the operations performed on them,
reasons about which of the available data structure implementations would be best for each instance of a data
structure wildcard in the code, and finally (if possible) chooses the best matches and either prints them out
or links them into the code to produce a working binary.

The analysis currently runs after C preprocessing (using CPP or the preprocessor included in a C compiler) and
before the actual compilation. In the future it could be moved to the phase between compilation and linking,
as described in \autoref{sec:future}.

The analysis uses asymptotic complexity to compare data structures. The process can also be steered with
source code annotations, and test data can be used to tailor the data structure further, as described in
\autoref{sub:importance}.

The idea of the framework is to provide a data structure that is always ``good enough''. Since the general
problem is unsolvable, the result will always be worse than a data structure hand-tailored to the task, but it
can save time (both programming time and execution time) when time is an important factor. To reiterate: using
dsinf for the most critical bottleneck of your application is probably not a good idea.
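To make the workflow concrete, the sketch below shows roughly what such specially prepared code might look
like. It follows the simplified API convention of \autoref{fig:C-api-v1}; the header name and the exact
signatures of \emph{init\_d}, \emph{insert\_d}, \emph{search\_d} and \emph{delete\_d} are assumptions made
only for this illustration.
\begin{lstlisting}
/* Illustrative sketch: a program written against the data structure
   wildcard; dsinf later substitutes a concrete implementation. */
#include "ds.h"            /* assumed name of the dsinf header */

int main(void) {
    ds d = init_d();       /* one data structure wildcard instance */
    for (int i = 0; i < 1000; i++)
        insert_d(d, i);    /* dsinf records that Insert is used    */
    if (search_d(d, 42))   /* assumed to return nonzero when found */
        delete_d(d, 42);   /* dsinf records that Delete is used    */
    return 0;
}
\end{lstlisting}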
\subsection{Comparing to a database} \label{sub:database}
There is already a mature solution for storing data and querying it efficiently --- databases.
The main difference between dsinf and a database is that a database has to implement all the operations
needed for every possible SQL query, including cross-checks, counting, summation and extremal elements.
With dsinf, the data structure implementation does not have to be ready for every SQL query that could be
constructed, but only for the operations actually used in the source code, which lets it potentially
outperform a database in both space and time.
\subsection{Comparing to class clusters} \label{sub:classcluster}
A class cluster is an architecture that groups a number of private, concrete subclasses under a public,
abstract superclass. The grouping of classes in this way provides a simplified interface to the user,
who sees only the publicly visible architecture. Behind the scenes, though, the abstract class is
calling up the private subclass most suited for performing a particular task\cite{AppleCC}. One could
have an object representing e.g. an array, and have a lot of different private implementations for it, like
large, sparse arrays, or arrays with sizes known at compile time and optimize operations for those
cases. The downside of this method is that the program may alternate between representations that are
expensive to transform into each other. dsinf knows exactly which operations can occur and tailors the data
structure to the operations that will actually happen.
\subsection{C API}
The framework detects specific C functions in the source code and bases the inference on them. Now we define
what those functions are. The first API version is in \autoref{fig:C-api-v1}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/ds1.h}
\caption{The simplified dsinf C API}
\label{fig:C-api-v1}
\end{figure}
There is a problem with the API in \autoref{fig:C-api-v1}. If a data structure supports an
operation that is asymptotically faster than an element lookup, this API cannot deliver it in the
desired time. Consider, for example, a balanced search tree with $O(\log n)$ lookups, augmented with a
linked list of all the elements (as in \autoref{sec:gdsm:le}) to speed up the successor and predecessor
operations to $O(1)$. A successor function defined in this API convention would look like this:
\begin{lstlisting}
dstype successor_d(ds, dstype);
\end{lstlisting}
The problem with this function is that it does not receive a pointer to the element inside the data structure,
so it first has to find the value passed as its argument and only then compute the successor in $O(1)$; the
search takes $O(\log n)$, so the whole operation takes $O(\log n)$, which defeats the purpose.
To fix this we declare a new type for a data structure element. The type wraps a pointer to the element's
location inside the data structure, so an operation can start from that pointer and no time is wasted on
finding the element.
The second version of the API is in \autoref{fig:C-api-v2}. It is updated to use the \emph{dselem} struct: we pass these element handles instead of the plain values used before.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/ds2.h}
\caption{The dsinf C API}
\label{fig:C-api-v2}
\end{figure}
The following code examples will sometimes use \autoref{fig:C-api-v1} API convention for simplicity.
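For orientation, the declarations below sketch how the element-handle convention of the second API might look;
the field layout of \emph{dselem}, the opaque \emph{ds} handle and the return types are assumptions for
illustration only, not the literal contents of \autoref{fig:C-api-v2}.
\begin{lstlisting}
/* Sketch only: the element handle wraps a pointer into the structure,
   so order operations can start from the element itself.            */
typedef struct ds_s *ds;          /* assumed opaque handle type */

typedef struct {
    void *node;                   /* assumed: pointer into the structure */
    int   value;                  /* assumed: the stored value           */
} dselem;

dselem insert_d(ds d, int value);     /* returns an element handle      */
dselem successor_d(ds d, dselem e);   /* can be O(1) with a linked list */
\end{lstlisting}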
\section{Data structure inference}
\subsection{Comparison of the complexities}
To find the best possible data structure for a task, we have to define some ordering
on data structures, so that we can judge which structures are better than others. Here we want to compare the
asymptotic complexities of the operations on the data structures. When trying to define such an ordering we
run into a problem: we cannot do it in the general case, because the complexity of an operation
can be an arbitrary function, and comparing arbitrary asymptotic complexities is complicated.
Asymptotic complexity of an operation is stored as a pair of type:
\begin{eqnarray}
AsymptoticalComplexity = Int \times Int,
\end{eqnarray}
where
\begin{eqnarray} \label{eqn:linlog}
(k, \; l) \; \textrm{means} \; O(n^k \log^l n).
\end{eqnarray}
The reason for choosing such a type is that it is easier to compare than the general case (a
lexicographic comparison of the two exponents) and it still distinguishes most data structure operation
complexities.
Sometimes we have to use some qualified complexities:
\begin{eqnarray}
ComplexityType = \{ Normal, \; Amortized, \; Amortized \;Expected, \; Expected \}
\end{eqnarray}
The overall complexity can be seen as a type:
\begin{eqnarray}
Complexity = AsymptoticalComplexity \times ComplexityType
\end{eqnarray}
Here we can also use a lexicographical comparison, but we have to specify that (reading $>$ as ``is worse than'')
\begin{eqnarray}
Amortized > Normal,\\
Expected > Amortized,\\
Amortized \; Expected > Expected
\end{eqnarray}
and that $>$ is transitive.
The ordering is defined this way because a normal complexity of $O(f(n))$ guarantees that no single operation
exceeds that time multiplied by a constant. Amortized is worse, because it only guarantees that $n$ consecutive
operations have an average time of $O(f(n))$, while individual operations can spike heavily. Expected is worse
than amortized, because the stated complexity is only achieved with some probability and there is no bound on
the number of spikes, which were under control with an amortized complexity.
We also always record the smallest complexity in the asymptotic sense. For example, the search operation on a
splay tree is $O(n)$ in the worst case but $O(\log n)$ amortized, so it is represented as
$((0,1),Amortized)$.
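This representation translates directly into code. The sketch below is one possible C encoding of the
$Complexity$ type and its lexicographic comparison; it is an illustration only and is not taken from the dsinf
sources.
\begin{lstlisting}
/* Illustrative encoding of Complexity = ((k, l), kind), where (k, l)
   stands for O(n^k log^l n); smaller compares as asymptotically faster. */
typedef enum { NORMAL, AMORTIZED, EXPECTED, AMORTIZED_EXPECTED } ctype;

typedef struct {
    int   n_exp;    /* k in O(n^k log^l n) */
    int   log_exp;  /* l in O(n^k log^l n) */
    ctype kind;     /* Normal < Amortized < Expected < Amortized Expected */
} complexity;

/* Lexicographic comparison: n exponent, then log exponent, then kind. */
int cmp_complexity(complexity a, complexity b) {
    if (a.n_exp   != b.n_exp)   return a.n_exp   < b.n_exp   ? -1 : 1;
    if (a.log_exp != b.log_exp) return a.log_exp < b.log_exp ? -1 : 1;
    if (a.kind    != b.kind)    return a.kind    < b.kind    ? -1 : 1;
    return 0;
}
\end{lstlisting}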
\subsection{Choosing the best data structure} \label{sec:choose-ds}
We define a set \emph{DataStructureOperation}. We can further extend this set, but for now assume that
\begin{eqnarray}
\mathit{DataStructureOperation} = \{Insert, \; Update, \; Delete, \; FindMax,\; DeleteMax, \; \dots\}.
\end{eqnarray}
Each element of \emph{DataStructureOperation} represents an operation that can be performed on a data
structure.
The type
\begin{equation}\label{data-structure-type}
DataStructure = DataStructureOperation \rightarrow Complexity
\end{equation}
represents a data structure \emph{ds} and all of the operations implemented for it, with their complexities, as a
partial function \emph{f} from \emph{DataStructureOperation} to \emph{Complexity}. It represents some
implementation of a data structure, where the arguments on which \emph{f} is defined are the operations that can
be performed on \emph{ds}, and the value of the function is the asymptotic complexity of the operation. For
example, \autoref{tab:rbt} presents such a representation for Red-Black Trees.
\begin{table}[h!]
\centering
\begin{tabular}{|l|l|}
\hline
\multicolumn{2}{|c|}{Red-Black Trees} \\
\hline
Operation & Complexity \\
\hline
Insert & $O(\log n)$ \\
Update & $O(\log n)$ \\
Delete & $O(\log n)$ \\
Find Max & $O(\log n)$/$O(1)$\\
Delete Max & $O(\log n)$ \\
Search & $O(\log n)$ \\
Successor & $O(\log n)$/$O(1)$\\
\hline
\end{tabular}
\caption{The representation of Red-Black Trees; the complexity of some operations depends on the
implementation, using the speed-up technique described in \autoref{sub:gdsm}}
\label{tab:rbt}
\end{table}
When trying to find the best-suited data structure for a given program \emph{p}, we look for data structure
uses in \emph{p}. Let
\begin{equation} \label{dsu-type}
DSU(p) :: [P(DataStructureOperation)]
\end{equation}
be a list of groups of data structure operations. Each group represents the operations performed on one
persistent data structure identity in the source code of \emph{p}.
For every $ds \in DSU(p)$, we define a parametrized comparison operator for data structures $<_{ds}$
defined as:
\begin{center}
\begin{equation}
d_1 <_{ds} d_2
\end{equation}
$\Updownarrow$
\begin{equation} \label{data-structure-order}
|\{(o, c_1) \in d_1 | o \in ds \wedge o \texttt{ used in $p$} \wedge (o,c_2) \in d_2 \wedge c_1 < c_2 \}| < 0.5 *
|\{o \in ds | o \texttt{ used in $p$} \}|
\end{equation}
\end{center}
\autoref{data-structure-order} defines the order we use for comparing data structures.
Intuitively, a data structure $d_1$ is better than a data structure $d_2$ if
it implements ``faster'' at least half of the types of operations used in $p$.
For a fixed $p$, we have a preorder on data structures induced by $<_{ds}$ and we can sort data structures
available to the framework using this order. The maximum element is the best data structure
implementation for the persistent data structure identity.
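To make the ordering concrete, the sketch below is one possible C rendering of \autoref{data-structure-order};
the integer complexity encoding, the array layout and the assumption that both structures implement every used
operation are made only for this illustration.
\begin{lstlisting}
/* Sketch of the order: d1 <_ds d2 holds when d1 is strictly faster on
   fewer than half of the operation types used on this identity in p.
   Complexities are pre-encoded so that a smaller int means faster;
   ops[] lists the operation types used on this identity.            */
int ds_less(const int *cost1, const int *cost2,
            const int *ops, int used_count) {
    int faster = 0;
    for (int i = 0; i < used_count; i++)
        if (cost1[ops[i]] < cost2[ops[i]])
            faster++;
    return 2 * faster < used_count;   /* faster on fewer than half */
}
\end{lstlisting}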
\subsection{Collecting the program data} \label{dsu-definition}
In \autoref{dsu-type} the type of the $DSU()$ operation was given. This section shows how $DSU()$ is
defined. We do not consider the problem of two different variables having the same name or shadowing each
other; to avoid it we can simply rename each variable to a UUID and remember the mapping for nicer
output.
Let $DSOPS(p) :: [(\mathit{VariableName}, \mathit{DataStructureOperation})]$ be the union of:
\begin{enumerate}
\item \label{it:global} $\{(g, o)\; |\; g \texttt{ is a global data structure variable in $p$ }
\wedge \\ o \texttt{ is an operation performed on the value of $g$, somewhere in $p$ } \}$
\item \label{it:auto} $\{(d, o)\; |\; d \texttt{ is a data structure declared somewhere } \\
\texttt{ in a body of a function $f$ in $p$ } \wedge \\ o \texttt{ is an operation
performed on the value of $d$, somewhere in $f$ } \}$
\item \label{it:param} $\{(p', o)\; |\; p' \texttt{ is a formal parameter of data structure type,}
\\ \texttt{of a function $f$ in $p$ } \wedge \\ o \texttt{ is an operation performed on the value of
$p'$, somewhere in $f$ } \}$
\end{enumerate}
An example of the first rule is in \autoref{fig:global-collection-rule}. In the example there are two global data structure variables declared ($global\_ds\_1$, $global\_ds\_2$), and in the function bodies we perform some operations on them. The rule returns a list of pairs of data structures and operations as they were called.
An example of the second rule is in \autoref{fig:declared-collection-rule}. This rule is similar to the first one, but it works on variables with the auto storage class (local variables in C), like the $declared\_ds$ variable in the example.
An example of the third rule is in \autoref{fig:parameter-collection-rule}. This rule works similarly to the two above, but gathers results on variables declared as function parameters of any function in the program, like the $parameter\_ds$ variable in the example.
\begin{figure}[h]
\lstinputlisting{thesis-pics/collecting_global.c}
\caption{Global variables collection rule}
\label{fig:global-collection-rule}
\end{figure}
\begin{figure}[h]
\lstinputlisting{thesis-pics/collecting_declared.c}
\caption{Declared variables collection rule}
\label{fig:declared-collection-rule}
\end{figure}
\begin{figure}[h]
\lstinputlisting{thesis-pics/collecting_function.c}
\caption{Function parameters collection rule}
\label{fig:parameter-collection-rule}
\end{figure}
\clearpage
We want to group the elements of $DSOPS(p)$ to detect the persistent identities~\cite{Okasaki} of
data structures: one group contains the operations performed on one data structure from its allocation to its
deallocation or the end of the program, including passing the structure as a parameter to another function or
copying the pointer to it.
$DSU(p)$ is the list of groups created in this step. After this operation each group contains the operations performed on the same persistent identity of a data structure. For every such group we find the best matching data structure as shown in
\autoref{sec:choose-ds}.
To group the operations:
\begin{itemize}
\item \textbf{Persistent Identities} - for every two pairs $(d_1, o_1)$ and $(d_2, o_2)$ created using
\autoref{it:global} or \autoref{it:auto}, we put them in the same group if $d_1 = d_2$. For the example in \autoref{fig:global-collection-rule}, this rule groups all the operations on $global\_ds\_1$ into one group and all the operations on $global\_ds\_2$ into another.
\item \textbf{Function calls} - for every pair $(p', o_1)$ that was created by using \autoref{it:param}
on a function $f$, which was called with the actual data structure parameter $d$ passed as the formal parameter
$p'$, we put $o_1$ into the group of operations on $d$.
For the example in \autoref{fig:parameter-collection-rule}, this rule groups all the operations on $declared\_ds$ and $parameter\_ds$ into one group. A more complex example is described in \autoref{fig:function-call-grouping}.
\item \textbf{Copy propagation} - for every $(do, o_1)$ and $(dc, o_2)$, where the variable $dc$ was
obtained by copying the value of the variable $do$, we put $o_1$ and $o_2$ in the same group. In \autoref{fig:copy-grouping} this rule groups
$declared\_ds$ and $copied\_ds$ into one group.
\item \textbf{Uniqueness} - in every group we delete repeated elements. Notice that for \autoref{fig:global-collection-rule} the previous rules yield a list containing $(global\_ds\_2, delete\_d)$ twice; this rule leaves only one copy. A union--find view of the whole grouping step is sketched after the figures below.
\end{itemize}
\begin{figure}[h]
\lstinputlisting{thesis-pics/function-call-grouping.c}
\caption{Function call grouping rule}
\label{fig:function-call-grouping}
\end{figure}
\begin{figure}[h]
\lstinputlisting{thesis-pics/grouping_copy.c}
\caption{Copy propagation grouping rule}
\label{fig:copy-grouping}
\end{figure}
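One way to picture the grouping rules above is as a union--find computation over variable names: every rule
merges the groups of two variables, and the final groups are the persistent identities. The sketch below is
only an illustration of that view, not dsinf's actual implementation.
\begin{lstlisting}
/* Illustrative union-find over variable ids; each grouping rule
   (same identity, function call, copy propagation) becomes a merge. */
#define MAX_VARS 1024
static int parent[MAX_VARS];

static void init_groups(int n) {      /* each variable starts alone */
    for (int v = 0; v < n; v++) parent[v] = v;
}

static int find(int v) {              /* representative of v's group */
    while (parent[v] != v) v = parent[v] = parent[parent[v]];
    return v;
}

static void merge(int a, int b) {     /* put a and b in one group */
    parent[find(a)] = find(b);
}
/* e.g. a call f(declared_ds) whose formal parameter is parameter_ds
   triggers merge(id(declared_ds), id(parameter_ds)).                */
\end{lstlisting}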
\clearpage
\pagebreak
\section{Extensions of the idea}
\subsection{Second extremal element}
If we want to find the maximal element in a heap, we just look it up in $O(1)$, that is what heaps are
for. If we want to find the minimal element we can modify the heap, for it to use an order, which would
allow us to lookup the minimal element in $O(1)$. What happens if we want to find the max and the min
element in the duration of one program? How to modify our framework to handle this kind of situations?
\begin{eqnarray*}
DataStructureOperation = \{\dots, \; FindFirstExtremalElement,\\
DeleteFirstExtremalElement,\\
FindSecondExtremalElement,\\
DeleteSecondExtremalElement\}.
\end{eqnarray*}
Now we can add two complexity costs to the data structure definition. We always map the first extremal element
to the kind of extremum that is used more often. Since we can always reverse the order, the cheaper operation
serves the dominant extremum and the more expensive one is used only when both kinds of extremal elements are
needed.
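As a small illustration of why the mapping is free to pick either direction, reversing the comparison function
turns a min-ordered structure into a max-ordered one, so whichever extremum dominates can be made the cheap
``first'' one; the comparator signature below is an assumption made for this illustration.
\begin{lstlisting}
/* Reversing a comparator flips which extremum is the cheap one:
   a min-structure built with rev_cmp behaves as a max-structure. */
static int cmp(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) -
           (*(const int *)a < *(const int *)b);
}
static int rev_cmp(const void *a, const void *b) {
    return cmp(b, a);   /* swap the arguments to reverse the order */
}
\end{lstlisting}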
\subsection{Importance of operations} \label{sub:importance}
Detecting the importance of a single data structure operation is an important problem, because many programs
fall into a class for which compile-time data structure selection is undecidable. The program in
\autoref{fig:undecidable} shows why deciding on the best data structure at compile time is impossible in
general.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/undecidable.c}
\caption{An example of the problem of choosing the fastest data structure for a program being
undecidable, user interaction}
\label{fig:undecidable}
\end{figure}
In \autoref{fig:undecidable}, the best data structure depends on the user input, which is not known at compile
time. If we analyze the program as is, the framework only knows that every operation is used somewhere, so the
final data structure will not be specific to the task, but merely average at everything.
Code that does not depend on user input at all can also be hard to analyze.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/weighted-instructions.c}
\caption{An example of the problem of choosing the fastest data structure for a program being hard to
analyze, no user interaction}
\label{fig:hard-no-user}
\end{figure}
In \autoref{fig:hard-no-user} we see a very costly operation used only once
($delete\_max\_d$), and a few operations ($insert$, $search$) run in a loop a few million times.
To anyone familiar with program complexities it is obvious that the single heavy operation does not affect
the execution time of the whole program, yet the framework in its current state treats these
operations equally and will probably choose something like Red-Black Trees to minimize the time of
$delete\_max\_d$, instead of ignoring its cost and using a very fast hashtable.
We can show formally that the problem of choosing the fastest data structure is undecidable, by showing that it
is not easier than the halting problem. In \autoref{fig:undecidable-no-user} we run $delete\_max\_d$ only if a
Turing machine that we defined accepts an input, so we cannot know whether we should choose a data structure
that implements $delete\_max\_d$ fast, or ignore it because it will never be executed when the machine does not
halt.
\begin{figure}
\lstinputlisting{thesis-pics/turing-machine.c}
\caption{An example formally showing the undecidability of the problem, no user interaction}
\label{fig:undecidable-no-user}
\end{figure}
\subsubsection{Code pragmas} \label{sec:pragmas}
A possible solution to this problem is to let the programmer add code pragmas to the source code, deciding how
important an operation is relative to the other operations; the framework then uses those values when choosing
the data structure.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/code-pragmas.c}
\caption{Source code with pragmas that state the importance of the operation}
\label{fig:code-pragmas}
\end{figure}
In the example in \autoref{fig:code-pragmas} the programmer adds weights to operations, assigning very low values
for statements that are used rarely or for debug purposes, and very high values for crucial
parts of the program.
We would have to extend the API like in \autoref{fig:code-pragmas-api}.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/dsw.h}
\caption{dsinf API, using additional arguments for operation importance}
\label{fig:code-pragmas-api}
\end{figure}
This is not a perfect solution, because we still need the programmer to judge which operations should have high
weights, but it is useful when a programmer wants to experiment, debug, or otherwise tune the choice by hand.
\subsubsection{Choosing the best data structure with weights} \label{sec:choose-weights}
We need to change the algorithm for choosing the best data structure so that it can handle weights
attached to the operations in the source code. We change the type of the $DSU()$ operation from
\autoref{dsu-type} to:
\begin{equation}
\mathit{DSU}_w(p) :: [(\mathit{VariableName}, \mathit{DataStructureOperation}, \mathit{Int})]
\end{equation}
This change introduces weights for data structure operations in program \emph{p}. We can use the
additional \emph{Int} field to store the specified weight of the operation. For operations that are used
multiple times, we sum the weights and the resulting sum is the value for that operation.
The $DSU()$ definition needs the following changes relative to \autoref{dsu-definition}:
\begin{itemize}
\item Every rule in $DSOPS(p)$ now adds a triple, where the additional component is the
weight obtained by using a method from \autoref{sec:pragmas} or
\autoref{sec:pgo}.
\item The last step of grouping, which removed repeated elements, now sums the weights
of identical operations and replaces them with a single element whose weight is that
sum.
\end{itemize}
A change to the definition of \autoref{data-structure-order} is also needed. We could still simply compare data
structure operations on their complexity, award one point per faster operation, and multiply that point by the
weight of the operation for that data structure identity. That is not a very precise solution; consider the
example in \autoref{fig:dsu-weight-bad-example}.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/dsu-weight-bad-example.c}
\caption{Example for the weighted data structure choice algorithm, using the pragmas API; the weight is
counted per call site, not per function call, so the weight of $search\_d$ will be $x$, not $x^2$.}
\label{fig:dsu-weight-bad-example}
\end{figure}
Let us consider the class of programs like the one in \autoref{fig:dsu-weight-bad-example}, but with different
$n$ and $x$ values. Assume for simplicity that we ignore the insertions and that the complexity functions of
$search\_d$, $delete\_max\_d$ and $delete\_min\_d$ are $\log_2 n$, $\log_2 n$, $\log_2 n$ for Red-Black Trees
and $1$, $n$ and $n$ for hashtables respectively --- we discard the big-O constants. Then the question which
data structure is better for this case can be stated as the inequality:
\begin{equation}
x + 2n > (x + 2)\log_2 n
\end{equation}
The left side is the cost of $x$ searches ($x \cdot 1$) plus deleting the maximal ($n$) and the minimal ($n$) element on hashtables.
The right side is the cost of $x$ searches ($x \cdot \log_2 n$) plus deleting the maximal ($\log_2 n$) and the minimal ($\log_2 n$) element on Red-Black Trees. When the inequality holds, Red-Black Trees are the better choice.
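For a rough sense of the two regimes, still using the simplified cost model above: with $n = 2^{20}$ and
$x = 100$ the hashtable side is about $100 + 2 \cdot 2^{20} \approx 2.1 \cdot 10^6$ while the Red-Black Tree
side is $102 \cdot 20 = 2040$, so the trees win; with $n = 100$ and $x = 10^6$ the hashtable side is about
$10^6 + 200$ while the tree side is roughly $6.6 \cdot 10^6$, so the hashtable wins.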
\begin{figure}[!h]
\begin{center}
\includegraphics[width=0.7\textwidth]{thesis-pics/dsu-weight-example.png}
\end{center}
\caption{Plot of profitability of data structures for \autoref{fig:dsu-weight-bad-example} for different
$n$ and $x$; blue --- Red Black Trees, white --- Hashtables}
\label{fig:dsu-weight-bad-plot}
\end{figure}
The plot in \autoref{fig:dsu-weight-bad-plot} shows where the better solution is to use Red-Black Trees (blue
area) and where to use hashtables (white area). The blue cone, although smaller, still covers a lot of cases.
The earlier algorithm chooses hashtables for virtually all programs in this class (whenever $x > 2$), which
means that for the cases inside the blue cone the chosen data structure is not the best one.
A better approach is to approximate the current element count $N$, evaluate each operation's complexity
function at that count, multiply the result by the operation's weight, and take the sum of those products as
the metric of profitability of a data structure implementation. That describes the cost more accurately and
prevents lightly weighted operations from having an outsized impact. We modify the ordering as follows:
\begin{equation} \label{data-structure-order-weights}
d_1 <_{dsu} d_2 \Leftrightarrow DSCOST(d_1, dsu) \leq DSCOST(d_2, dsu)
\end{equation}
\emph{DSCOST} describes the cost to use the data structure \emph{d} for \emph{dsu} usages.
\begin{equation}
DSCOST(d,dsu) = \sum_{u \in dsu} OPCOST(u,d)
\end{equation}
\emph{OPCOST} describes the cost of a single usage $u = (v, o, w)$ on the data structure \emph{d}, where
\emph{w} is the weight of the operation \emph{o} and \emph{c} is its complexity function in \emph{d}.
\begin{equation}
OPCOST(u,d) = c(N) * w \; \textrm{where} \; (o,c) \in d, u = (v, o, w)
\end{equation}
This generalization removes the need to store data structure complexities in an easily comparable form. We can
now use any cost functions that are cheap to evaluate, or even more complex ones when no runtime techniques
such as \autoref{sec:transforming-on-line} are involved. One could even try to approximate the running time
more precisely than big-O notation allows, so that the actual constants would matter.
For user-guided approaches such as \autoref{sec:pragmas}, the user would have to specify a global approximation
of the element count manually for the cost optimization.
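The sketch below shows one way the weighted cost $\mathit{DSCOST}$ could be evaluated; the function pointer
table, the usage record and the example complexity functions are assumptions made only for this illustration.
\begin{lstlisting}
/* Illustrative evaluation of DSCOST: sum over usages of the complexity
   function evaluated at N, multiplied by the usage weight.            */
#include <math.h>

typedef double (*cost_fn)(double n);            /* complexity function */
typedef struct { int op; double weight; } usage;

double dscost(cost_fn ops[], usage u[], int count, double N) {
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += ops[u[i].op](N) * u[i].weight;  /* OPCOST(u, d) */
    return total;
}

/* Example complexity functions (constants discarded, as in the text). */
static double log_n(double n) { return log2(n); }
static double lin_n(double n) { return n; }
static double one(double n)   { (void)n; return 1.0; }
\end{lstlisting}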
\subsubsection{Profile-guided optimization} \label{sec:pgo}
Profile-guided optimization (PGO) is a compiler optimization technique in computer programming
to improve program runtime performance. In contrast to traditional optimization techniques that
solely use the source code, PGO uses the results of test runs of the instrumented program to
optimize the final generated code \cite{Wipgo}.
Usually the technique is used for optimizing hot loops and branches: the instrumented binary records which
source lines are hit most often while it runs, and the program is then recompiled so that the common case is
faster.
If the user has test data to run against the program, we can take advantage of it. First we choose a data
structure using the unmodified method and link the corresponding library into the executable; of course this
does not have to be the best data structure possible. Then the user runs the test suite with a code coverage
option enabled in the compiler, such as \emph{gcov} support in GCC. This generates a file like the one shown in
\autoref{fig:gcov}.
\begin{figure}
\lstinputlisting{thesis-pics/gcov.c}
\centering \line(1,0){450}
\lstinputlisting{thesis-pics/gcov.c.gcov}
\caption{Example of a source code file and the .gcov file, generated by running the
compiled program}
\label{fig:gcov}
\end{figure}
Then the user passes the \emph{gcov}-generated files together with the source code to the framework again. The
framework extracts the line hit counts for the data structure operations from the \emph{gcov} files, sets the
operation weights according to the extracted data, and then uses the weighted choosing algorithm from
\autoref{sec:choose-weights}.
With this method we can also approximate the element count \emph{N} from the coverage data, by counting all
insert and delete operations.
It is a better solution than letting the user set the weights by hand, but we have to keep in mind that the
inference is based on test data and there is no guarantee that real-world data will match it.
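Extracting the hit counts amounts to reading the leading execution counter of each \emph{.gcov} line. The
sketch below assumes the usual ``count: line number: source'' layout and skips lines marked `-' or `\#\#\#\#\#';
matching the line numbers against recorded operation positions is left out.
\begin{lstlisting}
/* Illustrative extraction of per-line hit counts from a .gcov file. */
#include <stdio.h>

void read_gcov(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) return;
    char buf[1024];
    while (fgets(buf, sizeof buf, f)) {
        long count, lineno;
        if (sscanf(buf, " %ld: %ld:", &count, &lineno) == 2)
            printf("line %ld hit %ld times\n", lineno, count);
        /* a real pass would add `count' to the weight of the data
           structure operation recorded at line `lineno' */
    }
    fclose(f);
}
\end{lstlisting}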
\subsubsection{Transforming data structures on-line} \label{sec:transforming-on-line}
Another technique known in compilers we might use is called Just-In-Time Compilation (JIT). JIT, also known
as dynamic translation, is a method to improve the runtime performance of computer programs based on byte
code (virtual machine code). Since byte code is interpreted it executes more slowly than compiled machine
code, unless it is actually compiled to machine code, which could be performed before the execution --- making
the program loading slow --- or during the execution. In this latter case, which is the basis for JIT
compilation, the program is stored in memory as byte code, but the code segment currently running is
preparatively compiled to physical machine code in order to run faster.\cite{Wijit}
This technique, like PGO (\autoref{sec:pgo}), is used mostly for hot-spot optimizations, meaning we always look
at a small part of the code, such as a hot \emph{if} statement or a loop. With PGO we gathered information over
time and changed the binary; JIT does basically the same thing, but works at runtime. Usually some proxy
abstraction is implemented that decides whether, at the current moment, it is profitable to compile the
executed part to machine code instead of interpreting it.
Instead of compiling bytecode to machine code, we can use a proxy that counts the individual operations on our
data structure and uses the usage counts as new weights for the data structure inference algorithm from
\autoref{sec:choose-weights}. We substitute the data structure with a faster one if the program would clearly
benefit time-wise from the substitution. The new implementation could of course be compiled from bytecode
(e.g.\ LLVM IR) to machine code just as the standard JIT approach does.
The element count \emph{N} can be computed exactly, because all the data structure operations go through the
proxy, so we simply keep a counter.
The heuristics for detecting a good moment for the substitution should be stricter and more careful than in
standard JIT compilation. Compiling a piece of code to machine code is, in most cases, less costly than
rebuilding a data structure: needlessly compiled code simply will not be run, whereas a wrong data structure
can slow down the whole program considerably. Building a sensible set of heuristics is very hard even for
standard JIT compilers; for example, the PyPy project tried many different JIT approaches before finding one
that works well \cite{PyPy}.
Depending on the heuristics, this may be the most beneficial option for a program, or a big performance hit.
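A counting proxy of the kind described above might look roughly like the sketch below; the operation
enumeration, the bookkeeping layout and the moment of re-evaluation are all illustrative assumptions.
\begin{lstlisting}
/* Illustrative runtime proxy: count operations and the element count,
   and periodically ask whether a different structure would be cheaper. */
enum { OP_INSERT, OP_DELETE, OP_SEARCH, OP_DELETE_MAX, OP_COUNT };

static unsigned long op_counts[OP_COUNT];
static long element_count;               /* the exact N */

static void note_op(int op) {
    op_counts[op]++;
    if (op == OP_INSERT) element_count++;
    if (op == OP_DELETE || op == OP_DELETE_MAX) element_count--;
    /* every so often, re-run the weighted choice with op_counts as
       weights and element_count as N; rebuild only if the projected
       saving clearly exceeds the rebuild cost (the hard heuristic). */
}
\end{lstlisting}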
\subsection{Different element types}
Currently the framework works only for integer elements. We could extend it to every primitive type in C, but
that would require some changes. Some data structures require the element type to be comparable, which is not a
big problem, because comparison semantics are already defined for those types. A hash function is also needed,
because some data structures, such as hashtables, have to compute hash values to work. We can simply take the
binary representation of the other types and feed it to the hash function as an integer. This whole
modification does not need any user interaction.
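For instance, the bit pattern of a \emph{double} can be reinterpreted as an integer and passed to an existing
integer hash. The sketch below only illustrates that idea; the particular integer hash is an arbitrary choice.
\begin{lstlisting}
#include <stdint.h>
#include <string.h>

/* Reuse an integer hash for a double by hashing its bit pattern. */
static uint64_t hash_u64(uint64_t x) {        /* any integer hash */
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return x;
}

static uint64_t hash_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* binary representation */
    return hash_u64(bits);
}
\end{lstlisting}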
There is a bigger problem with composite types. Comparing pointers is not obvious: one may want to compare the
addresses or the contents behind them, depending on whether object or structural equality is intended. The same
problem appears for hash function arguments. There is also a danger of memory leaks: a pointer to a string can
be passed to the data structure and then lost in the program, and if the structure later drops the pointer, the
string stays in memory until the program ends. Adding this to the framework would require passing a comparison
function, a hash function and some kind of destructor to the data structure, or possibly using some reference
counting system; that would be very technical and is beyond the scope of this thesis. For an array or a big
struct passed to the framework, there is also the question of whether to share the data and just copy the
pointer; copying the whole value may be a bad idea, because it can contain quite a lot of data.
\subsubsection{Linked data structures}
When we want to store structs like the one in \autoref{fig:linked-struct} in a data structure, the framework,
as it stands, could generate some data structure for this kind of struct. The operations on the data structure
would look like the ones in \autoref{fig:struct-old}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/linked-struct.c}
\caption{An example record that we want to store in a data structure}
\label{fig:linked-struct}
\end{figure}
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-old.c}
\caption{Use of the dsinf calls, to perform some data structure operations with the
records; not very comfortable to use}
\label{fig:struct-old}
\end{figure}
In \autoref{fig:struct-old} a comparison function is used; it compares the height. We can use
any function that compares a coordinate or a combination of coordinates, but only one order of
the elements is available per data structure. This can also be achieved without using structs at
all, like in \autoref{fig:struct-pointer-sizeof-reduction}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-pointer-sizeof-reduction.c}
\caption{Encoding of the problem of having one order data structure on records in C,
into code using pointers}
\label{fig:struct-pointer-sizeof-reduction}
\end{figure}
In \autoref{fig:struct-pointer-sizeof-reduction} we access structs as arrays of chars: to reach any field, we
take a pointer to its offset in the array and cast it to the right type, so structs do not really give us any
more expressiveness. It would be nice if the framework allowed things like comparing structs by more than one
criterion within one program. We would like to be able to use operations like the ones in
\autoref{fig:struct-new}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-new.c}
\caption{Examples of the new dsinf API that takes different orders on structs into
account}
\label{fig:struct-new}
\end{figure}
We change the current API of dsinf (\autoref{fig:struct-api-change}) so that an arbitrary number of comparison
functions, each defining an order on the data, can be passed to the constructor \emph{init\_d}. We use C's
variadic argument mechanism (\emph{stdarg.h}); the order count comes first because \emph{va\_start} needs a
named parameter to begin with and the count tells us how many comparison functions to read with \emph{va\_arg}.
The following arguments are function pointers of type $int (*)(void*, void*)$.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-api-change.c}
\caption{Change in the API, enabling us to define multiple orders on data}
\label{fig:struct-api-change}
\end{figure}
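A sketch of how such a constructor might consume its variadic comparator arguments is shown below; the internal
representation \emph{ds\_impl} and the lack of error handling are assumptions made only for this illustration.
\begin{lstlisting}
#include <stdarg.h>
#include <stdlib.h>

typedef int (*cmp_fn)(void *, void *);

/* Assumed internal representation: one comparator per order. */
typedef struct { int order_count; cmp_fn *cmps; } ds_impl;

ds_impl *init_d(int order_count, ...) {
    ds_impl *d = malloc(sizeof *d);
    d->order_count = order_count;
    d->cmps = malloc(order_count * sizeof *d->cmps);
    va_list ap;
    va_start(ap, order_count);              /* iterate the comparators */
    for (int i = 0; i < order_count; i++)
        d->cmps[i] = va_arg(ap, cmp_fn);
    va_end(ap);
    return d;
}
\end{lstlisting}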
To support multiple orders we also need to change the way structures are chosen for record types. We check all
tuples of data structures whose length equals the order count, i.e.\ the cross-product of all data structures.
The result is a linked data structure: a number of data structures in which the elements hold pointers to the
elements representing the same value in the other data structures, as in \autoref{fig:linked-data-structures}.
For example, for the code from the earlier example in \autoref{fig:struct-new}, the result might be a
(Red-Black Tree, Hashtable) pair, where the tree serves the \emph{weight} order and the hashtable the
\emph{height} order.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.9\textwidth]{thesis-pics/linked-dses.png}
\end{center}
\caption{Visualization of a linked data structure: elements of a balanced tree, representing one order of the data,
connected with the elements of a linked list, representing the second order}
\label{fig:linked-data-structures}
\end{figure}
In the choice algorithm we distinguish between operations that are order specific and those that are order
independent. An example of an order specific operation is \emph{max\_d}, which finds the element that is
maximal in a given order; an example of an order independent operation is \emph{insert\_d}, which does not take
an order as an argument because it has to act on all the linked data structures.
Sometimes order specific operations, besides acting on the data structure for their own order, also have to
update the state of the other data structures: \emph{delete\_d}, for example, has to run a \emph{delete\_d}
operation on the corresponding element in the other data structures responsible for the remaining orders. The
follow-up operation does not have to be the same operation. For example, \emph{delete\_max\_d} does not have to
call \emph{delete\_max\_d} on the other structures: it already holds the pointer to the element, so it can just
call \emph{delete\_d} --- we call the operation that achieves the desired effect at the smallest complexity
cost. Calling \emph{delete\_max\_d} there would in fact be incorrect, because it would delete the maximal
element of another order.
We define a function:
\begin{eqnarray}
\mathit{RELATED} :: \mathit{DataStructureOperation} \rightarrow \mathit{DataStructureOperation}
\end{eqnarray}
For read-only order independent operations the function is left undefined, as these are handled by the special case in \autoref{eqn:opcost}:
\begin{eqnarray}
\mathit{RELATED}(Size) = undefined\\
\mathit{RELATED}(AnyElement) = undefined
\end{eqnarray}
For read-write order independent operations the function is defined as:
\begin{eqnarray}
\mathit{RELATED}(Insert) = Insert\\
\mathit{RELATED}(Delete) = Delete
\end{eqnarray}
For read-only order specific operations we use \emph{Nop}, an operation that does nothing:
\begin{eqnarray}
\mathit{RELATED}(FindMax) = Nop\\
\mathit{RELATED}(FindMin) = Nop\\
\mathit{RELATED}(Find) = Nop\\
\mathit{RELATED}(Successor) = Nop\\
\mathit{RELATED}(Predecessor) = Nop
\end{eqnarray}
For read-write order specific operations the function is defined as:
\begin{eqnarray}
\mathit{RELATED}(DeleteMax) = Delete\\
\mathit{RELATED}(DeleteMin) = Delete
\end{eqnarray}
For every supported operation one has to decide whether it is read-only or read-write and whether it is order specific or order independent, and define its related operation.
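Implemented directly, $\mathit{RELATED}$ is just a table or a switch over the operation kind. The enumeration
below uses illustrative names, not dsinf's actual identifiers.
\begin{lstlisting}
typedef enum { INSERT, DELETE, DELETE_MAX, DELETE_MIN,
               FIND_MAX, FIND_MIN, FIND, SUCCESSOR, PREDECESSOR,
               SIZE, ANY_ELEMENT, NOP, UNDEFINED } op;

/* Direct rendering of the RELATED mapping defined above. */
op related(op o) {
    switch (o) {
    case INSERT:                      return INSERT;
    case DELETE:                      return DELETE;
    case DELETE_MAX: case DELETE_MIN: return DELETE;
    case FIND_MAX: case FIND_MIN:
    case FIND: case SUCCESSOR:
    case PREDECESSOR:                 return NOP;
    default:                          return UNDEFINED; /* Size, AnyElement */
    }
}
\end{lstlisting}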
To create an ordering of linked data structures we modify \autoref{data-structure-order-weights} as follows:
\begin{equation} \label{data-structure-order-different-types}
ld_1 <_{dsu} ld_2 \Leftrightarrow \mathit{DSCOST}(ld_1, dsu) \leq \mathit{DSCOST}(ld_2, dsu)
\end{equation}
Variables $ld_1$ and $ld_2$ are linked data structures consisting of \emph{m} linked simple structures.
\begin{eqnarray}
ld_1 = d_{1,1}, d_{1,2}, ..., d_{1,m}\\
ld_2 = d_{2,1}, d_{2,2}, ..., d_{2,m}
\end{eqnarray}
\emph{DSCOST} describes the cost of the linked data structure \emph{ld} for usages \emph{dsu}.
\begin{eqnarray}
\mathit{DSCOST}(ld,dsu) = \sum_{u \in dsu} \mathit{OPCOST}(u, ld)
\end{eqnarray}
\emph{OPCOST} describes the cost of a single usage \emph{u} of the linked data structure \emph{ld}. It has two
cases: for read-only order independent operations we can use whichever substructure performs the operation
fastest; otherwise the cost is that of the main operation on the substructure for the operation's order plus
the cost of the related operations on all the other substructures.
\begin{equation} \label{eqn:opcost}
OPCOST(u, ld) = \begin{cases}
\displaystyle \min_{d \in ld} \mathit{SOPCOST}(u,d) & \text{if } \mathit{ROOI}(u)\\
\displaystyle \mathit{SOPCOST}(u,d_u) + \mathit{RELCOST}(u, d_u, ld) & \text{otherwise,}
\end{cases}
\end{equation}
where $d_u \in ld$ denotes the substructure responsible for the order that the usage $u$ refers to.
\emph{ROOI} is a predicate that is true only for read-only order independent operations:
\begin{equation}
\mathit{ROOI}(u) = \mathit{True} \text{ iff } u = (v, o, w) \wedge o \text{ is a read-only order independent operation}
\end{equation}
\emph{SOPCOST} describes the cost of a single usage \emph{u} on one of the substructures \emph{d}
related to the ordering the operation uses.
\begin{eqnarray}
\mathit{SOPCOST}(u,d) = c(N) * w \; \textrm{where} \; (o,c) \in d, u = (v, o, w)
\end{eqnarray}
\emph{RELCOST} describes the cost of a single usage \emph{u} on all the substructures other than \emph{d}. It uses \emph{ROPCOST} for each of them, since the related operation of \emph{o} is performed on all the other structures.
\begin{eqnarray}
\mathit{RELCOST}(u, d, ld) = \sum_{d' \in ld, d \neq d'} \mathit{ROPCOST}(u,d')
\end{eqnarray}
\begin{eqnarray}
\mathit{ROPCOST}(u,d) = c(N) * w \; \textrm{where} \; (\mathit{RELATED}(o),c) \in d, u = (v, o, w)
\end{eqnarray}
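As a small worked instance of these definitions, consider the (Red-Black Tree, Hashtable) pair from the earlier
example and a usage $u = (v, DeleteMax, w)$ in the tree's order. Assuming the complexities from
\autoref{tab:rbt} for the tree and an expected $O(1)$ delete for the hashtable (possible because the element
handle is already known), the main cost is $\mathit{SOPCOST}(u, \mathit{RBT}) = w \cdot \log_2 N$, and since
$\mathit{RELATED}(DeleteMax) = Delete$, the related cost is $\mathit{ROPCOST}(u, \mathit{Hashtable}) = w \cdot 1$,
giving $\mathit{OPCOST}(u, ld) = w(\log_2 N + 1)$.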
\subsection{Upper bound on the element count}
When analyzing the input program, we can try to determine whether the maximum element count in the data
structure is bounded by some constant. If we can obtain this information (in general the question is of course
undecidable), we can use it to improve the data structure generated for the program.
When the element count in the data structure is potentially unbounded, we have to create an implementation of
the structure that allocates memory lazily, whenever it needs space for new elements. Such a structure might
allocate one chunk on every insertion and free a chunk on every deletion, or allocate chunks of doubling size
and amortize the number of allocations to $O(\log n)$.
When we know a bound on the element count, we can allocate one static buffer for the whole data structure,
which removes the entire cost of allocation during insertions and deletions. The example in
\autoref{fig:upper-bound-transform} shows such a program.
\begin{figure}
\lstinputlisting{thesis-pics/upper-bound-transform.c}