\documentclass[a4paper,11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8x]{inputenc}
\usepackage[colorinlistoftodos, textwidth=2cm, shadow]{todonotes}
\usepackage{listings}
\usepackage{appendix}
\usepackage[colorlinks=true,linkcolor=blue]{hyperref}
\usepackage[section]{placeins}
\usepackage{color}
\usepackage{amsmath}
\definecolor{darkgreen}{rgb}{0,0.4,0}
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}}
\lstset{
language=C
}
\topmargin 0pt
\advance \topmargin by -\headheight
\advance \topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt
\evensidemargin \oddsidemargin
\marginparwidth 0.5in
\textwidth 6.5in
\parindent 0in
\parskip 1.5ex
\let\subsectionautorefname\sectionautorefname
\let\subsubsectionautorefname\sectionautorefname
\begin{document}
%Polish title page
\begin{titlepage}
\begin{center}
\textsc{\LARGE \\ Uniwersytet Wrocławski }\\
Wydział Matematyki i Informatyki \\
Instytut Informatyki \\[1.5cm]
\textsc{\Large Praca Magisterska}\\[0.5cm]
\HRule \\[0.4cm]
{ \huge \bfseries \emph{dsinf}: Inferencja struktur danych oparta na kodzie źródłowym \\[0.4cm] }
\HRule \\[1.5cm]
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Autor:}\\
Aleksander \textsc{Balicki}
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Promotor:} \\
Prof. Witold \textsc{Charatonik}
\end{flushright}
\end{minipage}
\vfill
{\large Wrzesień 2014}
\thispagestyle{empty}
\end{center}
\end{titlepage}
% Polish abstract
\renewcommand{\abstractname}{Streszczenie}
\begin{abstract}
Większość popularnych dzisiaj języków programowania posiada już zaimplementowane biblioteki z wieloma dostępnymi strukturami danych
do przechowywania danych.
Programiści po prostu muszą nauczyć się, kiedy należy używać danej struktury. W niektórych przypadkach ten wybór jest na tyle prosty,
że program mógłby automatycznie dobrać najlepszą strukturę danych. Praca ta opisuje \emph{dsinf} --- framework do dobierania najlepszej struktury danych pasującej do zadania na podstawie kodu źródłowego programu w języku C. \emph{dsinf} analizuje przypadki użycia struktur danych w całym programie i proponuje najszybszą.
Do zapewnienia jakości tej sugestii, używa danych profilera.
\end{abstract}
\thispagestyle{empty}
\pagebreak
% English title page
\begin{titlepage}
\begin{center}
\textsc{\LARGE \\ University of Wrocław} \\
Faculty of Mathematics and Computer Science \\
Institute of Computer Science\\[1.5cm]
\textsc{\Large Master's Thesis}\\[0.5cm]
\HRule \\[0.4cm]
{ \huge \bfseries \emph{dsinf}: Source based data structure inference \\[0.4cm] }
\HRule \\[1.5cm]
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
Aleksander \textsc{Balicki}
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Supervisor:} \\
Prof. Witold \textsc{Charatonik}
\end{flushright}
\end{minipage}
\vfill
{\large September 2014}
\thispagestyle{empty}
\end{center}
\end{titlepage}
\vfill
\pagebreak
% English abstract
\renewcommand{\abstractname}{Abstract}
\begin{abstract}
Most programming languages popular today already ship libraries with convenient implementations of many
data structures. Programmers only have to learn when to use which one. In some cases the choice of the
right data structure is simple enough that a program could make it automatically. This work describes the
\emph{dsinf} project, an analysis framework that infers the data structure best matching the task from a
program's C source code. \emph{dsinf} analyses how data structures are used throughout the program and
suggests the fastest one. To further improve the quality of the suggestion, it can use profiler data.
\end{abstract}
\thispagestyle{empty}
\pagebreak
\tableofcontents
\vfill
\section{Introduction} \label{sec:intro}
In computer science, a data structure is a particular way of storing and organizing data in a computer so that
it can be used efficiently~\cite{Wids}.
The interface of a data structure is the set of operations that can be performed on it, together with the
semantics of those operations.
An implementation of a data structure is a realization of such an interface in some programming language
that is consistent with those semantics.
There are many well-known kinds of data structures:
\begin{itemize}
\item Map/associative array/dictionary is a data structure for storing $(key,\;value)$ pairs; it allows
insertion, deletion, modification and lookup,
\item multimap is the same as a map, but it can store multiple values for a single key,
\item list is a data structure for storing ordered sequences of values,
\item set is a data structure for storing unordered collections of distinct values,
\item multiset is the same as a set, but it can store multiple elements that are equal
under some equivalence relation,
\item queue is a data structure implementing operations for enqueuing and dequeuing elements,
\item deque is a queue in which elements can be added and removed at both ends instead of one,
\item stack is a data structure implementing operations for putting an element on top of the stack and
removing the top element,
\item priority queue is like a regular queue, but each element has an associated priority, which
determines the order in which elements are dequeued,
\item string is a data structure for storing text,
\item tree is a data structure for holding tree-like hierarchical data,
\item graph is a data structure for storing elements together with a relation on those elements,
\item others, which are usually modifications or combinations of the previous ones.
\end{itemize}
Different implementations of these data structures trade the running times of specific operations off
against each other, so no single data structure is perfect for every task.
\subsection{What is dsinf?} \label{sub:intro}
dsinf is a source code analyzer. It works on specially prepared C code in which every instance of a data
structure uses a type defined in the dsinf library, called a data structure wildcard. This type represents a
data structure with the most general interface, i.e.\ the union of the interfaces of all data structures
available in the dsinf framework. dsinf connects the data structures with the operations performed on them,
reasons about which of the available data structure implementations would be best for each instance of a data
structure wildcard in the code, and finally (if possible) chooses the best matches and either prints them out
or links them into the code to produce a working binary.

The analysis currently runs after C preprocessing (using CPP or the preprocessor included in a C compiler) and
before the actual compilation. In the future it could be moved to the phase between compilation and linking,
as described in \autoref{sec:future}.

The analysis uses asymptotic complexity to compare data structures. The process can also be steered with
source code annotations, and test data can be used to tailor the data structure further, as described in
\autoref{sub:importance}.

The idea of the framework is to provide a data structure that is always ``good enough''. Since the general
problem is unsolvable, the result will always be worse than a data structure hand-tailored to the task, but it
can save time (both programming time and execution time) when time is an important factor. To reiterate: using
dsinf for the most critical bottleneck of your application is probably not a good idea.
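To make the workflow concrete, the sketch below shows roughly what such specially prepared code might look
like. It follows the simplified API convention of \autoref{fig:C-api-v1}; the header name and the exact
signatures of \emph{init\_d}, \emph{insert\_d}, \emph{search\_d} and \emph{delete\_d} are assumptions made
only for this illustration.
\begin{lstlisting}
/* Illustrative sketch: a program written against the data structure
   wildcard; dsinf later substitutes a concrete implementation. */
#include "ds.h"            /* assumed name of the dsinf header */

int main(void) {
    ds d = init_d();       /* one data structure wildcard instance */
    for (int i = 0; i < 1000; i++)
        insert_d(d, i);    /* dsinf records that Insert is used    */
    if (search_d(d, 42))   /* assumed to return nonzero when found */
        delete_d(d, 42);   /* dsinf records that Delete is used    */
    return 0;
}
\end{lstlisting}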
\subsection{Comparing to a database} \label{sub:database}
There is already a mature solution for storing data and querying it efficiently --- databases.
The main difference between dsinf and a database is that a database has to implement all the operations
needed for every possible SQL query, including cross-checks, counting, summation and extremal elements.
With dsinf, the data structure implementation does not have to be ready for every SQL query that could be
constructed, but only for the operations actually used in the source code, which lets it potentially
outperform a database in both space and time.
\subsection{Comparing to class clusters} \label{sub:classcluster}
A class cluster is an architecture that groups a number of private, concrete subclasses under a public,
abstract superclass. The grouping of classes in this way provides a simplified interface to the user,
who sees only the publicly visible architecture. Behind the scenes, though, the abstract class is
calling up the private subclass most suited for performing a particular task\cite{AppleCC}. One could
have an object representing e.g. an array, and have a lot of different private implementations for it, like
large, sparse arrays, or arrays with sizes known at compile time and optimize operations for those
cases. The downside of this method is that the program may alternate between representations that are
expensive to transform into each other. dsinf knows exactly which operations can occur and tailors the data
structure to the operations that will actually happen.
\subsection{C API}
The framework detects specific C functions in the source code and bases the inference on them. Now we define
what those functions are. The first API version is in \autoref{fig:C-api-v1}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/ds1.h}
\caption{The simplified dsinf C API}
\label{fig:C-api-v1}
\end{figure}
There is a problem with the API in \autoref{fig:C-api-v1}. If a data structure supports an
operation that is asymptotically faster than an element lookup, this API cannot deliver it in the
desired time. Consider, for example, a balanced search tree with $O(\log n)$ lookups, augmented with a
linked list of all the elements (as in \autoref{sec:gdsm:le}) to speed up the successor and predecessor
operations to $O(1)$. A successor function defined in this API convention would look like this:
\begin{lstlisting}
dstype successor_d(ds, dstype);
\end{lstlisting}
The problem with this function is that it does not receive a pointer to the element inside the data structure,
so it first has to find the value passed as its argument and only then compute the successor in $O(1)$; the
search takes $O(\log n)$, so the whole operation takes $O(\log n)$, which defeats the purpose.
To fix this we declare a new type for a data structure element. The type wraps a pointer to the element's
location inside the data structure, so an operation can start from that pointer and no time is wasted on
finding the element.
The second version of the API is in \autoref{fig:C-api-v2}. It is updated to use the \emph{dselem} struct: we pass these element handles instead of the plain values used before.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/ds2.h}
\caption{The dsinf C API}
\label{fig:C-api-v2}
\end{figure}
The following code examples will sometimes use \autoref{fig:C-api-v1} API convention for simplicity.
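For orientation, the declarations below sketch how the element-handle convention of the second API might look;
the field layout of \emph{dselem}, the opaque \emph{ds} handle and the return types are assumptions for
illustration only, not the literal contents of \autoref{fig:C-api-v2}.
\begin{lstlisting}
/* Sketch only: the element handle wraps a pointer into the structure,
   so order operations can start from the element itself.            */
typedef struct ds_s *ds;          /* assumed opaque handle type */

typedef struct {
    void *node;                   /* assumed: pointer into the structure */
    int   value;                  /* assumed: the stored value           */
} dselem;

dselem insert_d(ds d, int value);     /* returns an element handle      */
dselem successor_d(ds d, dselem e);   /* can be O(1) with a linked list */
\end{lstlisting}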
\section{Data structure inference}
\subsection{Comparison of the complexities}
To find the best possible data structure for a task, we have to define some ordering
on data structures, so that we can judge which structures are better than others. Here we want to compare the
asymptotic complexities of the operations on the data structures. When trying to define such an ordering we
run into a problem: we cannot do it in the general case, because the complexity of an operation
can be an arbitrary function, and comparing arbitrary asymptotic complexities is complicated.
Asymptotic complexity of an operation is stored as a pair of type:
\begin{eqnarray}
AsymptoticalComplexity = Int \times Int,
\end{eqnarray}
where
\begin{eqnarray} \label{eqn:linlog}
(k, \; l) \; \textrm{means} \; O(n^k \log^l n).
\end{eqnarray}
The reason for choosing such a type is that it is easier to compare than the general case (a
lexicographic comparison of the two exponents) and it still distinguishes most data structure operation
complexities.
Sometimes we have to use some qualified complexities:
\begin{eqnarray}
ComplexityType = \{ Normal, \; Amortized, \; Amortized \;Expected, \; Expected \}
\end{eqnarray}
The overall complexity can be seen as a type:
\begin{eqnarray}
Complexity = AsymptoticalComplexity \times ComplexityType
\end{eqnarray}
Here we can also use a lexicographical comparison, but we have to specify that (reading $>$ as ``is worse than'')
\begin{eqnarray}
Amortized > Normal,\\
Expected > Amortized,\\
Amortized \; Expected > Expected
\end{eqnarray}
and that $>$ is transitive.
The ordering is defined this way because a normal complexity of $O(f(n))$ guarantees that no single operation
exceeds that time multiplied by a constant. Amortized is worse, because it only guarantees that $n$ consecutive
operations have an average time of $O(f(n))$, while individual operations can spike heavily. Expected is worse
than amortized, because the stated complexity is only achieved with some probability and there is no bound on
the number of spikes, which were under control with an amortized complexity.
We also always record the smallest complexity in the asymptotic sense. For example, the search operation on a
splay tree is $O(n)$ in the worst case but $O(\log n)$ amortized, so it is represented as
$((0,1),Amortized)$.
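This representation translates directly into code. The sketch below is one possible C encoding of the
$Complexity$ type and its lexicographic comparison; it is an illustration only and is not taken from the dsinf
sources.
\begin{lstlisting}
/* Illustrative encoding of Complexity = ((k, l), kind), where (k, l)
   stands for O(n^k log^l n); smaller compares as asymptotically faster. */
typedef enum { NORMAL, AMORTIZED, EXPECTED, AMORTIZED_EXPECTED } ctype;

typedef struct {
    int   n_exp;    /* k in O(n^k log^l n) */
    int   log_exp;  /* l in O(n^k log^l n) */
    ctype kind;     /* Normal < Amortized < Expected < Amortized Expected */
} complexity;

/* Lexicographic comparison: n exponent, then log exponent, then kind. */
int cmp_complexity(complexity a, complexity b) {
    if (a.n_exp   != b.n_exp)   return a.n_exp   < b.n_exp   ? -1 : 1;
    if (a.log_exp != b.log_exp) return a.log_exp < b.log_exp ? -1 : 1;
    if (a.kind    != b.kind)    return a.kind    < b.kind    ? -1 : 1;
    return 0;
}
\end{lstlisting}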
\subsection{Choosing the best data structure} \label{sec:choose-ds}
We define a set \emph{DataStructureOperation}. We can further extend this set, but for now assume that
\begin{eqnarray}
\mathit{DataStructureOperation} = \{Insert, \; Update, \; Delete, \; FindMax,\; DeleteMax, \; \dots\}.
\end{eqnarray}
Each element of \emph{DataStructureOperation} represents an operation that can be performed on a data
structure.
The type
\begin{equation}\label{data-structure-type}
DataStructure = DataStructureOperation \rightarrow Complexity
\end{equation}
represents a data structure \emph{ds} and all of the operations implemented for it, with their complexities, as a
partial function \emph{f} from \emph{DataStructureOperation} to \emph{Complexity}. It represents some
implementation of a data structure, where the arguments on which \emph{f} is defined are the operations that can
be performed on \emph{ds}, and the value of the function is the asymptotic complexity of the operation. For
example, \autoref{tab:rbt} presents such a representation for Red-Black Trees.
\begin{table}[h!]
\centering
\begin{tabular}{|l|l|}
\hline
\multicolumn{2}{|c|}{Red-Black Trees} \\
\hline
Operation & Complexity \\
\hline
Insert & $O(\log n)$ \\
Update & $O(\log n)$ \\
Delete & $O(\log n)$ \\
Find Max & $O(\log n)$/$O(1)$\\
Delete Max & $O(\log n)$ \\
Search & $O(\log n)$ \\
Successor & $O(\log n)$/$O(1)$\\
\hline
\end{tabular}
\caption{The representation of Red-Black Trees; the complexity of some operations depends on the
implementation, using the speed-up technique described in \autoref{sub:gdsm}}
\label{tab:rbt}
\end{table}
When trying to find the best-suited data structure for a given program \emph{p}, we look for data structure
uses in \emph{p}. Let
\begin{equation} \label{dsu-type}
DSU(p) :: [P(DataStructureOperation)]
\end{equation}
be a list of groups of data structure operations. Each group represents the operations performed on one
persistent data structure identity in the source code of \emph{p}.
For every $ds \in DSU(p)$, we define a parametrized comparison operator for data structures $<_{ds}$
defined as:
\begin{center}
\begin{equation}
d_1 <_{ds} d_2
\end{equation}
$\Updownarrow$
\begin{equation} \label{data-structure-order}
|\{(o, c_1) \in d_1 | o \in ds \wedge o \texttt{ used in $p$} \wedge (o,c_2) \in d_2 \wedge c_1 < c_2 \}| < 0.5 *
|\{o \in ds | o \texttt{ used in $p$} \}|
\end{equation}
\end{center}
\autoref{data-structure-order} defines the order we use for comparing data structures.
Intuitively, a data structure $d_1$ is better than a data structure $d_2$ if
it implements ``faster'' at least half of the types of operations used in $p$.
For a fixed $p$, we have a preorder on data structures induced by $<_{ds}$ and we can sort data structures
available to the framework using this order. The maximum element is the best data structure
implementation for the persistent data structure identity.
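To make the ordering concrete, the sketch below is one possible C rendering of \autoref{data-structure-order};
the integer complexity encoding, the array layout and the assumption that both structures implement every used
operation are made only for this illustration.
\begin{lstlisting}
/* Sketch of the order: d1 <_ds d2 holds when d1 is strictly faster on
   fewer than half of the operation types used on this identity in p.
   Complexities are pre-encoded so that a smaller int means faster;
   ops[] lists the operation types used on this identity.            */
int ds_less(const int *cost1, const int *cost2,
            const int *ops, int used_count) {
    int faster = 0;
    for (int i = 0; i < used_count; i++)
        if (cost1[ops[i]] < cost2[ops[i]])
            faster++;
    return 2 * faster < used_count;   /* faster on fewer than half */
}
\end{lstlisting}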
\subsection{Collecting the program data} \label{dsu-definition}
In \autoref{dsu-type} the type of the $DSU()$ operation was given. This section shows how $DSU()$ is
defined. We do not consider the problem of two different variables having the same name or shadowing each
other; to avoid it we can simply rename each variable to a UUID and remember the mapping for nicer
output.
Let $DSOPS(p) :: [(\mathit{VariableName}, \mathit{DataStructureOperation})]$ be the union of:
\begin{enumerate}
\item \label{it:global} $\{(g, o)\; |\; g \texttt{ is a global data structure variable in $p$ }
\wedge \\ o \texttt{ is an operation performed on the value of $g$, somewhere in $p$ } \}$
\item \label{it:auto} $\{(d, o)\; |\; d \texttt{ is a data structure declared somewhere } \\
\texttt{ in a body of a function $f$ in $p$ } \wedge \\ o \texttt{ is an operation
performed on the value of $d$, somewhere in $f$ } \}$
\item \label{it:param} $\{(p', o)\; |\; p' \texttt{ is a formal parameter of data structure type,}
\\ \texttt{of a function $f$ in $p$ } \wedge \\ o \texttt{ is an operation performed on the value of
$p'$, somewhere in $f$ } \}$
\end{enumerate}
An example of the first rule is in \autoref{fig:global-collection-rule}. In the example there are two global data structure variables declared ($global\_ds\_1$, $global\_ds\_2$), and in the function bodies we perform some operations on them. The rule returns a list of pairs of data structures and operations as they were called.
An example of the second rule is in \autoref{fig:declared-collection-rule}. This rule is similar to the first one, but it works on variables with the auto storage class (local variables in C), like the $declared\_ds$ variable in the example.
An example of the third rule is in \autoref{fig:parameter-collection-rule}. This rule works similarly to the two above, but gathers results on variables declared as function parameters of any function in the program, like the $parameter\_ds$ variable in the example.
\begin{figure}[h]
\lstinputlisting{thesis-pics/collecting_global.c}
\caption{Global variables collection rule}
\label{fig:global-collection-rule}
\end{figure}
\begin{figure}[h]
\lstinputlisting{thesis-pics/collecting_declared.c}
\caption{Declared variables collection rule}
\label{fig:declared-collection-rule}
\end{figure}
\begin{figure}[h]
\lstinputlisting{thesis-pics/collecting_function.c}
\caption{Function parameters collection rule}
\label{fig:parameter-collection-rule}
\end{figure}
\clearpage
We want to group the elements of $DSOPS(p)$ to detect the persistent identities~\cite{Okasaki} of
data structures: one group contains the operations performed on one data structure from its allocation to its
deallocation or the end of the program, including passing the structure as a parameter to another function or
copying the pointer to it.
$DSU(p)$ is the list of groups created in this step. After this operation each group contains the operations performed on the same persistent identity of a data structure. For every such group we find the best matching data structure as shown in
\autoref{sec:choose-ds}.
To group the operations:
\begin{itemize}
\item \textbf{Persistent Identities} - for every two pairs $(d_1, o_1)$ and $(d_2, o_2)$ created using
\autoref{it:global} or \autoref{it:auto}, we put them in the same group if $d_1 = d_2$. For the example in \autoref{fig:global-collection-rule}, this rule groups all the operations on $global\_ds\_1$ into one group and all the operations on $global\_ds\_2$ into another.
\item \textbf{Function calls} - for every pair $(p', o_1)$ that was created by using \autoref{it:param}
on a function $f$, which was called with the actual data structure parameter $d$ passed as the formal parameter
$p'$, we put $o_1$ into the group of operations on $d$.
For the example in \autoref{fig:parameter-collection-rule}, this rule groups all the operations on $declared\_ds$ and $parameter\_ds$ into one group. A more complex example is described in \autoref{fig:function-call-grouping}.
\item \textbf{Copy propagation} - for every $(do, o_1)$ and $(dc, o_2)$, where the variable $dc$ was
obtained by copying the value of the variable $do$, we put $o_1$ and $o_2$ in the same group. In \autoref{fig:copy-grouping} this rule groups
$declared\_ds$ and $copied\_ds$ into one group.
\item \textbf{Uniqueness} - in every group we delete repeated elements. Notice that for \autoref{fig:global-collection-rule} the previous rules yield a list containing $(global\_ds\_2, delete\_d)$ twice; this rule leaves only one copy. A union--find view of the whole grouping step is sketched after the figures below.
\end{itemize}
\begin{figure}[h]
\lstinputlisting{thesis-pics/function-call-grouping.c}
\caption{Function call grouping rule}
\label{fig:function-call-grouping}
\end{figure}
\begin{figure}[h]
\lstinputlisting{thesis-pics/grouping_copy.c}
\caption{Copy propagation grouping rule}
\label{fig:copy-grouping}
\end{figure}
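One way to picture the grouping rules above is as a union--find computation over variable names: every rule
merges the groups of two variables, and the final groups are the persistent identities. The sketch below is
only an illustration of that view, not dsinf's actual implementation.
\begin{lstlisting}
/* Illustrative union-find over variable ids; each grouping rule
   (same identity, function call, copy propagation) becomes a merge. */
#define MAX_VARS 1024
static int parent[MAX_VARS];

static void init_groups(int n) {      /* each variable starts alone */
    for (int v = 0; v < n; v++) parent[v] = v;
}

static int find(int v) {              /* representative of v's group */
    while (parent[v] != v) v = parent[v] = parent[parent[v]];
    return v;
}

static void merge(int a, int b) {     /* put a and b in one group */
    parent[find(a)] = find(b);
}
/* e.g. a call f(declared_ds) whose formal parameter is parameter_ds
   triggers merge(id(declared_ds), id(parameter_ds)).                */
\end{lstlisting}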
\clearpage
\pagebreak
\section{Extensions of the idea}
\subsection{Second extremal element}
If we want to find the maximal element in a heap, we just look it up in $O(1)$, that is what heaps are
for. If we want to find the minimal element we can modify the heap, for it to use an order, which would
allow us to lookup the minimal element in $O(1)$. What happens if we want to find the max and the min
element in the duration of one program? How to modify our framework to handle this kind of situations?
\begin{eqnarray*}
DataStructureOperation = \{\dots, \; FindFirstExtremalElement,\\
DeleteFirstExtremalElement,\\
FindSecondExtremalElement,\\
DeleteSecondExtremalElement\}.
\end{eqnarray*}
Now we can add two complexity costs to the data structure definition. We always map the first extremal element
to the kind of extremum that is used more often. Since we can always reverse the order, the cheaper operation
serves the dominant extremum and the more expensive one is used only when both kinds of extremal elements are
needed.
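As a small illustration of why the mapping is free to pick either direction, reversing the comparison function
turns a min-ordered structure into a max-ordered one, so whichever extremum dominates can be made the cheap
``first'' one; the comparator signature below is an assumption made for this illustration.
\begin{lstlisting}
/* Reversing a comparator flips which extremum is the cheap one:
   a min-structure built with rev_cmp behaves as a max-structure. */
static int cmp(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) -
           (*(const int *)a < *(const int *)b);
}
static int rev_cmp(const void *a, const void *b) {
    return cmp(b, a);   /* swap the arguments to reverse the order */
}
\end{lstlisting}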
\subsection{Importance of operations} \label{sub:importance}
Detecting the importance of a single data structure operation is an important problem, because many programs
fall into a class for which compile-time data structure selection is undecidable. The program in
\autoref{fig:undecidable} shows why deciding on the best data structure at compile time is impossible in
general.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/undecidable.c}
\caption{An example of the problem of choosing the fastest data structure for a program being
undecidable, user interaction}
\label{fig:undecidable}
\end{figure}
In \autoref{fig:undecidable}, the best data structure depends on the user input, which is not known at compile
time. If we analyze the program as is, the framework only knows that every operation is used somewhere, so the
final data structure will not be specific to the task, but merely average at everything.
Code that does not depend on user input at all can also be hard to analyze.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/weighted-instructions.c}
\caption{An example of the problem of choosing the fastest data structure for a program being hard to
analyze, no user interaction}
\label{fig:hard-no-user}
\end{figure}
In \autoref{fig:hard-no-user} we see a very costly operation used only once
($delete\_max\_d$), and a few operations ($insert$, $search$) run in a loop a few million times.
To anyone familiar with program complexities it is obvious that the single heavy operation does not affect
the execution time of the whole program, yet the framework in its current state treats these
operations equally and will probably choose something like Red-Black Trees to minimize the time of
$delete\_max\_d$, instead of ignoring its cost and using a very fast hashtable.
We can show formally that the problem of choosing the fastest data structure is undecidable, by showing that it
is not easier than the halting problem. In \autoref{fig:undecidable-no-user} we run $delete\_max\_d$ only if a
Turing machine that we defined accepts an input, so we cannot know whether we should choose a data structure
that implements $delete\_max\_d$ fast, or ignore it because it will never be executed when the machine does not
halt.
\begin{figure}
\lstinputlisting{thesis-pics/turing-machine.c}
\caption{An example formally showing the undecidability of the problem, no user interaction}
\label{fig:undecidable-no-user}
\end{figure}
\subsubsection{Code pragmas} \label{sec:pragmas}
A possible solution to this problem is to let the programmer add code pragmas to the source code, deciding how
important an operation is relative to the other operations; the framework then uses those values when choosing
the data structure.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/code-pragmas.c}
\caption{Source code with pragmas that state the importance of the operation}
\label{fig:code-pragmas}
\end{figure}
In the example in \autoref{fig:code-pragmas} the programmer adds weights to operations, assigning very low values
for statements that are used rarely or for debug purposes, and very high values for crucial
parts of the program.
We would have to extend the API like in \autoref{fig:code-pragmas-api}.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/dsw.h}
\caption{dsinf API, using additional arguments for operation importance}
\label{fig:code-pragmas-api}
\end{figure}
This is not a perfect solution, because we still need the programmer to judge which operations should have high
weights, but it is useful when a programmer wants to experiment, debug, or otherwise tune the choice by hand.
\subsubsection{Choosing the best data structure with weights} \label{sec:choose-weights}
We need to change the algorithm for choosing the best data structure so that it can handle weights
attached to the operations in the source code. We change the type of the $DSU()$ operation from
\autoref{dsu-type} to:
\begin{equation}
\mathit{DSU}_w(p) :: [(\mathit{VariableName}, \mathit{DataStructureOperation}, \mathit{Int})]
\end{equation}
This change introduces weights for data structure operations in program \emph{p}. We can use the
additional \emph{Int} field to store the specified weight of the operation. For operations that are used
multiple times, we sum the weights and the resulting sum is the value for that operation.
The $DSU()$ definition needs the following changes relative to \autoref{dsu-definition}:
\begin{itemize}
\item Every rule in $DSOPS(p)$ now adds a triple, where the additional component is the
weight obtained by using a method from \autoref{sec:pragmas} or
\autoref{sec:pgo}.
\item The last step of grouping, which removed repeated elements, now sums the weights
of identical operations and replaces them with a single element whose weight is that
sum.
\end{itemize}
A change to the definition of \autoref{data-structure-order} is also needed. We could still simply compare data
structure operations on their complexity, award one point per faster operation, and multiply that point by the
weight of the operation for that data structure identity. That is not a very precise solution; consider the
example in \autoref{fig:dsu-weight-bad-example}.
\begin{figure}[!h]
\lstinputlisting{thesis-pics/dsu-weight-bad-example.c}
\caption{Example for the weighted data structure choice algorithm, using the pragmas API; the weight is
counted per call site, not per function call, so the weight of $search\_d$ will be $x$, not $x^2$.}
\label{fig:dsu-weight-bad-example}
\end{figure}
Let us consider the class of programs like the one in \autoref{fig:dsu-weight-bad-example}, but with different
$n$ and $x$ values. Assume for simplicity that we ignore the insertions and that the complexity functions of
$search\_d$, $delete\_max\_d$ and $delete\_min\_d$ are $\log_2 n$, $\log_2 n$, $\log_2 n$ for Red-Black Trees
and $1$, $n$ and $n$ for hashtables respectively --- we discard the big-O constants. Then the question which
data structure is better for this case can be stated as the inequality:
\begin{equation}
x + 2n > (x + 2)\log_2 n
\end{equation}
The left side is the cost of $x$ searches ($x \cdot 1$) plus deleting the maximal ($n$) and the minimal ($n$) element on hashtables.
The right side is the cost of $x$ searches ($x \cdot \log_2 n$) plus deleting the maximal ($\log_2 n$) and the minimal ($\log_2 n$) element on Red-Black Trees. When the inequality holds, Red-Black Trees are the better choice.
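For a rough sense of the two regimes, still using the simplified cost model above: with $n = 2^{20}$ and
$x = 100$ the hashtable side is about $100 + 2 \cdot 2^{20} \approx 2.1 \cdot 10^6$ while the Red-Black Tree
side is $102 \cdot 20 = 2040$, so the trees win; with $n = 100$ and $x = 10^6$ the hashtable side is about
$10^6 + 200$ while the tree side is roughly $6.6 \cdot 10^6$, so the hashtable wins.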
\begin{figure}[!h]
\begin{center}
\includegraphics[width=0.7\textwidth]{thesis-pics/dsu-weight-example.png}
\end{center}
\caption{Plot of profitability of data structures for \autoref{fig:dsu-weight-bad-example} for different
$n$ and $x$; blue --- Red Black Trees, white --- Hashtables}
\label{fig:dsu-weight-bad-plot}
\end{figure}
The plot in \autoref{fig:dsu-weight-bad-plot} shows where the better solution is to use Red-Black Trees (blue
area) and where to use hashtables (white area). The blue cone, although smaller, still covers a lot of cases.
The earlier algorithm chooses hashtables for virtually all programs in this class (whenever $x > 2$), which
means that for the cases inside the blue cone the chosen data structure is not the best one.
A better approach is to approximate the current element count $N$, evaluate each operation's complexity
function at that count, multiply the result by the operation's weight, and take the sum of those products as
the metric of profitability of a data structure implementation. That describes the cost more accurately and
prevents lightly weighted operations from having an outsized impact. We modify the ordering as follows:
\begin{equation} \label{data-structure-order-weights}
d_1 <_{dsu} d_2 \Leftrightarrow DSCOST(d_1, dsu) \leq DSCOST(d_2, dsu)
\end{equation}
\emph{DSCOST} describes the cost to use the data structure \emph{d} for \emph{dsu} usages.
\begin{equation}
DSCOST(d,dsu) = \sum_{u \in dsu} OPCOST(u,d)
\end{equation}
\emph{OPCOST} describes the cost of a single usage $u = (v, o, w)$ on the data structure \emph{d}, where
\emph{w} is the weight of the operation \emph{o} and \emph{c} is its complexity function in \emph{d}.
\begin{equation}
OPCOST(u,d) = c(N) * w \; \textrm{where} \; (o,c) \in d, u = (v, o, w)
\end{equation}
This generalization removes the need to store data structure complexities in an easily comparable form. We can
now use any cost functions that are cheap to evaluate, or even more complex ones when no runtime techniques
such as \autoref{sec:transforming-on-line} are involved. One could even try to approximate the running time
more precisely than big-O notation allows, so that the actual constants would matter.
For user-guided approaches such as \autoref{sec:pragmas}, the user would have to specify a global approximation
of the element count manually for the cost optimization.
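The sketch below shows one way the weighted cost $\mathit{DSCOST}$ could be evaluated; the function pointer
table, the usage record and the example complexity functions are assumptions made only for this illustration.
\begin{lstlisting}
/* Illustrative evaluation of DSCOST: sum over usages of the complexity
   function evaluated at N, multiplied by the usage weight.            */
#include <math.h>

typedef double (*cost_fn)(double n);            /* complexity function */
typedef struct { int op; double weight; } usage;

double dscost(cost_fn ops[], usage u[], int count, double N) {
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += ops[u[i].op](N) * u[i].weight;  /* OPCOST(u, d) */
    return total;
}

/* Example complexity functions (constants discarded, as in the text). */
static double log_n(double n) { return log2(n); }
static double lin_n(double n) { return n; }
static double one(double n)   { (void)n; return 1.0; }
\end{lstlisting}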
\subsubsection{Profile-guided optimization} \label{sec:pgo}
Profile-guided optimization (PGO) is a compiler optimization technique in computer programming
to improve program runtime performance. In contrast to traditional optimization techniques that
solely use the source code, PGO uses the results of test runs of the instrumented program to
optimize the final generated code \cite{Wipgo}.
Usually the technique is used for optimizing hot loops and branches: the instrumented binary records which
source lines are hit most often while it runs, and the program is then recompiled so that the common case is
faster.
If the user has test data to run against the program, we can take advantage of it. First we choose a data
structure using the unmodified method and link the corresponding library into the executable; of course this
does not have to be the best data structure possible. Then the user runs the test suite with a code coverage
option enabled in the compiler, such as \emph{gcov} support in GCC. This generates a file like the one shown in
\autoref{fig:gcov}.
\begin{figure}
\lstinputlisting{thesis-pics/gcov.c}
\centering \line(1,0){450}
\lstinputlisting{thesis-pics/gcov.c.gcov}
\caption{Example of a source code file and the .gcov file, generated by running the
compiled program}
\label{fig:gcov}
\end{figure}
Then the user passes the \emph{gcov}-generated files together with the source code to the framework again. The
framework extracts the line hit counts for the data structure operations from the \emph{gcov} files, sets the
operation weights according to the extracted data, and then uses the weighted choosing algorithm from
\autoref{sec:choose-weights}.
With this method we can also approximate the element count \emph{N} from the coverage data, by counting all
insert and delete operations.
It is a better solution than letting the user set the weights by hand, but we have to keep in mind that the
inference is based on test data and there is no guarantee that real-world data will match it.
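Extracting the hit counts amounts to reading the leading execution counter of each \emph{.gcov} line. The
sketch below assumes the usual ``count: line number: source'' layout and skips lines marked `-' or `\#\#\#\#\#';
matching the line numbers against recorded operation positions is left out.
\begin{lstlisting}
/* Illustrative extraction of per-line hit counts from a .gcov file. */
#include <stdio.h>

void read_gcov(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f) return;
    char buf[1024];
    while (fgets(buf, sizeof buf, f)) {
        long count, lineno;
        if (sscanf(buf, " %ld: %ld:", &count, &lineno) == 2)
            printf("line %ld hit %ld times\n", lineno, count);
        /* a real pass would add `count' to the weight of the data
           structure operation recorded at line `lineno' */
    }
    fclose(f);
}
\end{lstlisting}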
\subsubsection{Transforming data structures on-line} \label{sec:transforming-on-line}
Another technique known in compilers we might use is called Just-In-Time Compilation (JIT). JIT, also known
as dynamic translation, is a method to improve the runtime performance of computer programs based on byte
code (virtual machine code). Since byte code is interpreted it executes more slowly than compiled machine
code, unless it is actually compiled to machine code, which could be performed before the execution --- making
the program loading slow --- or during the execution. In this latter case, which is the basis for JIT
compilation, the program is stored in memory as byte code, but the code segment currently running is
preparatively compiled to physical machine code in order to run faster.\cite{Wijit}
This technique, like PGO (\autoref{sec:pgo}), is used mostly for hot-spot optimizations, meaning we always look
at a small part of the code, such as a hot \emph{if} statement or a loop. With PGO we gathered information over
time and changed the binary; JIT does basically the same thing, but works at runtime. Usually some proxy
abstraction is implemented that decides whether, at the current moment, it is profitable to compile the
executed part to machine code instead of interpreting it.
Instead of compiling bytecode to machine code, we can use a proxy that counts the individual operations on our
data structure and uses the usage counts as new weights for the data structure inference algorithm from
\autoref{sec:choose-weights}. We substitute the data structure with a faster one if the program would clearly
benefit time-wise from the substitution. The new implementation could of course be compiled from bytecode
(e.g.\ LLVM IR) to machine code just as the standard JIT approach does.
The element count \emph{N} can be computed exactly, because all the data structure operations go through the
proxy, so we simply keep a counter.
The heuristics for detecting a good moment for the substitution should be stricter and more careful than in
standard JIT compilation. Compiling a piece of code to machine code is, in most cases, less costly than
rebuilding a data structure: needlessly compiled code simply will not be run, whereas a wrong data structure
can slow down the whole program considerably. Building a sensible set of heuristics is very hard even for
standard JIT compilers; for example, the PyPy project tried many different JIT approaches before finding one
that works well \cite{PyPy}.
Depending on the heuristics, this may be the most beneficial option for a program, or a big performance hit.
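A counting proxy of the kind described above might look roughly like the sketch below; the operation
enumeration, the bookkeeping layout and the moment of re-evaluation are all illustrative assumptions.
\begin{lstlisting}
/* Illustrative runtime proxy: count operations and the element count,
   and periodically ask whether a different structure would be cheaper. */
enum { OP_INSERT, OP_DELETE, OP_SEARCH, OP_DELETE_MAX, OP_COUNT };

static unsigned long op_counts[OP_COUNT];
static long element_count;               /* the exact N */

static void note_op(int op) {
    op_counts[op]++;
    if (op == OP_INSERT) element_count++;
    if (op == OP_DELETE || op == OP_DELETE_MAX) element_count--;
    /* every so often, re-run the weighted choice with op_counts as
       weights and element_count as N; rebuild only if the projected
       saving clearly exceeds the rebuild cost (the hard heuristic). */
}
\end{lstlisting}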
\subsection{Different element types}
Currently the framework works only for integer elements. We could extend it to every primitive type in C, but
that would require some changes. Some data structures require the element type to be comparable, which is not a
big problem, because comparison semantics are already defined for those types. A hash function is also needed,
because some data structures, such as hashtables, have to compute hash values to work. We can simply take the
binary representation of the other types and feed it to the hash function as an integer. This whole
modification does not need any user interaction.
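For instance, the bit pattern of a \emph{double} can be reinterpreted as an integer and passed to an existing
integer hash. The sketch below only illustrates that idea; the particular integer hash is an arbitrary choice.
\begin{lstlisting}
#include <stdint.h>
#include <string.h>

/* Reuse an integer hash for a double by hashing its bit pattern. */
static uint64_t hash_u64(uint64_t x) {        /* any integer hash */
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return x;
}

static uint64_t hash_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* binary representation */
    return hash_u64(bits);
}
\end{lstlisting}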
There is a bigger problem with composite types. Comparing pointers is not obvious: one may want to compare the
addresses or the contents behind them, depending on whether object or structural equality is intended. The same
problem appears for hash function arguments. There is also a danger of memory leaks: a pointer to a string can
be passed to the data structure and then lost in the program, and if the structure later drops the pointer, the
string stays in memory until the program ends. Adding this to the framework would require passing a comparison
function, a hash function and some kind of destructor to the data structure, or possibly using some reference
counting system; that would be very technical and is beyond the scope of this thesis. For an array or a big
struct passed to the framework, there is also the question of whether to share the data and just copy the
pointer; copying the whole value may be a bad idea, because it can contain quite a lot of data.
\subsubsection{Linked data structures}
When we want to store structs like the one in \autoref{fig:linked-struct} in a data structure, the framework,
as it stands, could generate some data structure for this kind of struct. The operations on the data structure
would look like the ones in \autoref{fig:struct-old}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/linked-struct.c}
\caption{An example record that we want to store in a data structure}
\label{fig:linked-struct}
\end{figure}
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-old.c}
\caption{Use of the dsinf calls, to perform some data structure operations with the
records; not very comfortable to use}
\label{fig:struct-old}
\end{figure}
In \autoref{fig:struct-old} a comparison function is used; it compares the height. We can use
any function that compares a coordinate or a combination of coordinates, but only one order of
the elements is available per data structure. This can also be achieved without using structs at
all, like in \autoref{fig:struct-pointer-sizeof-reduction}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-pointer-sizeof-reduction.c}
\caption{Encoding of the problem of having one order data structure on records in C,
into code using pointers}
\label{fig:struct-pointer-sizeof-reduction}
\end{figure}
In \autoref{fig:struct-pointer-sizeof-reduction} we access structs as arrays of chars: to reach any field, we
take a pointer to its offset in the array and cast it to the right type, so structs do not really give us any
more expressiveness. It would be nice if the framework allowed things like comparing structs by more than one
criterion within one program. We would like to be able to use operations like the ones in
\autoref{fig:struct-new}.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-new.c}
\caption{Examples of the new dsinf API that takes different orders on structs into
account}
\label{fig:struct-new}
\end{figure}
We change the current API of dsinf (\autoref{fig:struct-api-change}) so that an arbitrary number of comparison
functions, each defining an order on the data, can be passed to the constructor \emph{init\_d}. We use C's
variadic argument mechanism (\emph{stdarg.h}); the order count comes first because \emph{va\_start} needs a
named parameter to begin with and the count tells us how many comparison functions to read with \emph{va\_arg}.
The following arguments are function pointers of type $int (*)(void*, void*)$.
\begin{figure}[h!]
\lstinputlisting{thesis-pics/struct-api-change.c}
\caption{Change in the API, enabling us to define multiple orders on data}
\label{fig:struct-api-change}
\end{figure}
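A sketch of how such a constructor might consume its variadic comparator arguments is shown below; the internal
representation \emph{ds\_impl} and the lack of error handling are assumptions made only for this illustration.
\begin{lstlisting}
#include <stdarg.h>
#include <stdlib.h>

typedef int (*cmp_fn)(void *, void *);

/* Assumed internal representation: one comparator per order. */
typedef struct { int order_count; cmp_fn *cmps; } ds_impl;

ds_impl *init_d(int order_count, ...) {
    ds_impl *d = malloc(sizeof *d);
    d->order_count = order_count;
    d->cmps = malloc(order_count * sizeof *d->cmps);
    va_list ap;
    va_start(ap, order_count);              /* iterate the comparators */
    for (int i = 0; i < order_count; i++)
        d->cmps[i] = va_arg(ap, cmp_fn);
    va_end(ap);
    return d;
}
\end{lstlisting}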
To support multiple orders we also need to change the way structures are chosen for record types. We check all
tuples of data structures whose length equals the order count, i.e.\ the cross-product of all data structures.
The result is a linked data structure: a number of data structures in which the elements hold pointers to the
elements representing the same value in the other data structures, as in \autoref{fig:linked-data-structures}.
For example, for the code from the earlier example in \autoref{fig:struct-new}, the result might be a
(Red-Black Tree, Hashtable) pair, where the tree serves the \emph{weight} order and the hashtable the
\emph{height} order.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.9\textwidth]{thesis-pics/linked-dses.png}
\end{center}
\caption{Visualization of a linked data structure: elements of a balanced tree, representing one order of the data,
connected with the elements of a linked list, representing the second order}
\label{fig:linked-data-structures}
\end{figure}
In the choice algorithm we distinguish between operations that are order specific and those that are order
independent. An example of an order specific operation is \emph{max\_d}, which finds the element that is
maximal in a given order; an example of an order independent operation is \emph{insert\_d}, which does not take
an order as an argument because it has to act on all the linked data structures.
Sometimes order specific operations, besides acting on the data structure for their own order, also have to
update the state of the other data structures: \emph{delete\_d}, for example, has to run a \emph{delete\_d}
operation on the corresponding element in the other data structures responsible for the remaining orders. The
follow-up operation does not have to be the same operation. For example, \emph{delete\_max\_d} does not have to
call \emph{delete\_max\_d} on the other structures: it already holds the pointer to the element, so it can just
call \emph{delete\_d} --- we call the operation that achieves the desired effect at the smallest complexity
cost. Calling \emph{delete\_max\_d} there would in fact be incorrect, because it would delete the maximal
element of another order.
We define a function:
\begin{eqnarray}
\mathit{RELATED} :: \mathit{DataStructureOperation} \rightarrow \mathit{DataStructureOperation}
\end{eqnarray}
For read-only order independent operations the function is left undefined, as these are handled by the special case in \autoref{eqn:opcost}:
\begin{eqnarray}
\mathit{RELATED}(Size) = undefined\\
\mathit{RELATED}(AnyElement) = undefined
\end{eqnarray}
For read-write order independent operations the function is defined as:
\begin{eqnarray}
\mathit{RELATED}(Insert) = Insert\\
\mathit{RELATED}(Delete) = Delete
\end{eqnarray}
For read-only order specific operations we use \emph{Nop}, an operation that does nothing:
\begin{eqnarray}
\mathit{RELATED}(FindMax) = Nop\\
\mathit{RELATED}(FindMin) = Nop\\
\mathit{RELATED}(Find) = Nop\\
\mathit{RELATED}(Successor) = Nop\\
\mathit{RELATED}(Predecessor) = Nop
\end{eqnarray}
For read-write order specific operations the function is defined as:
\begin{eqnarray}
\mathit{RELATED}(DeleteMax) = Delete\\
\mathit{RELATED}(DeleteMin) = Delete
\end{eqnarray}
For every supported operation one has to decide whether it is read-only or read-write and whether it is order specific or order independent, and define its related operation.
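Implemented directly, $\mathit{RELATED}$ is just a table or a switch over the operation kind. The enumeration
below uses illustrative names, not dsinf's actual identifiers.
\begin{lstlisting}
typedef enum { INSERT, DELETE, DELETE_MAX, DELETE_MIN,
               FIND_MAX, FIND_MIN, FIND, SUCCESSOR, PREDECESSOR,
               SIZE, ANY_ELEMENT, NOP, UNDEFINED } op;

/* Direct rendering of the RELATED mapping defined above. */
op related(op o) {
    switch (o) {
    case INSERT:                      return INSERT;
    case DELETE:                      return DELETE;
    case DELETE_MAX: case DELETE_MIN: return DELETE;
    case FIND_MAX: case FIND_MIN:
    case FIND: case SUCCESSOR:
    case PREDECESSOR:                 return NOP;
    default:                          return UNDEFINED; /* Size, AnyElement */
    }
}
\end{lstlisting}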
To create an ordering of linked data structures we modify \autoref{data-structure-order-weights} as follows:
\begin{equation} \label{data-structure-order-different-types}
ld_1 <_{dsu} ld_2 \Leftrightarrow \mathit{DSCOST}(ld_1, dsu) \leq \mathit{DSCOST}(ld_2, dsu)
\end{equation}
Variables $ld_1$ and $ld_2$ are linked data structures consisting of \emph{m} linked simple structures.
\begin{eqnarray}
ld_1 = d_{1,1}, d_{1,2}, ..., d_{1,m}\\
ld_2 = d_{2,1}, d_{2,2}, ..., d_{2,m}
\end{eqnarray}
\emph{DSCOST} describes the cost of the linked data structure \emph{ld} for usages \emph{dsu}.
\begin{eqnarray}
\mathit{DSCOST}(ld,dsu) = \sum_{u \in dsu} \mathit{OPCOST}(u, ld)
\end{eqnarray}
\emph{OPCOST} describes the cost of a single usage \emph{u} of the linked data structure \emph{ld}. It has two
cases: for read-only order independent operations we can use whichever substructure performs the operation
fastest; otherwise the cost is that of the main operation on the substructure for the operation's order plus
the cost of the related operations on all the other substructures.
\begin{equation} \label{eqn:opcost}
OPCOST(u, ld) = \begin{cases}
\displaystyle \min_{d \in ld} \mathit{SOPCOST}(u,d) & \text{if } \mathit{ROOI}(u)\\
\displaystyle \mathit{SOPCOST}(u,d_u) + \mathit{RELCOST}(u, d_u, ld) & \text{otherwise,}
\end{cases}
\end{equation}
where $d_u \in ld$ denotes the substructure responsible for the order that the usage $u$ refers to.
\emph{ROOI} is a predicate that is true only for read-only order independent operations:
\begin{equation}
\mathit{ROOI}(u) = \mathit{True} \text{ iff } u = (v, o, w) \wedge o \text{ is a read-only order independent operation}
\end{equation}
\emph{SOPCOST} describes the cost of a single usage \emph{u} on one of the substructures \emph{d}
related to the ordering the operation uses.
\begin{eqnarray}
\mathit{SOPCOST}(u,d) = c(N) * w \; \textrm{where} \; (o,c) \in d, u = (v, o, w)
\end{eqnarray}
\emph{RELCOST} describes the cost of a single usage \emph{u} on all the substructures other than \emph{d}. It uses \emph{ROPCOST} for each of them, since the related operation of \emph{o} is performed on all the other structures.
\begin{eqnarray}
\mathit{RELCOST}(u, d, ld) = \sum_{d' \in ld, d \neq d'} \mathit{ROPCOST}(u,d')
\end{eqnarray}
\begin{eqnarray}
\mathit{ROPCOST}(u,d) = c(N) * w \; \textrm{where} \; (\mathit{RELATED}(o),c) \in d, u = (v, o, w)
\end{eqnarray}
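As a small worked instance of these definitions, consider the (Red-Black Tree, Hashtable) pair from the earlier
example and a usage $u = (v, DeleteMax, w)$ in the tree's order. Assuming the complexities from
\autoref{tab:rbt} for the tree and an expected $O(1)$ delete for the hashtable (possible because the element
handle is already known), the main cost is $\mathit{SOPCOST}(u, \mathit{RBT}) = w \cdot \log_2 N$, and since
$\mathit{RELATED}(DeleteMax) = Delete$, the related cost is $\mathit{ROPCOST}(u, \mathit{Hashtable}) = w \cdot 1$,
giving $\mathit{OPCOST}(u, ld) = w(\log_2 N + 1)$.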
\subsection{Upper bound on the element count}
When analyzing the input program, we can try to determine whether the maximum element count in the data
structure is bounded by some constant. If we can obtain this information (in general the question is of course
undecidable), we can use it to improve the data structure generated for the program.
When the element count in the data structure is potentially unbounded, we have to create an implementation of
the structure that allocates memory lazily, whenever it needs space for new elements. Such a structure might
allocate one chunk on every insertion and free a chunk on every deletion, or allocate chunks of doubling size
and amortize the number of allocations to $O(\log n)$.
When we know a bound on the element count, we can allocate one static buffer for the whole data structure,
which removes the entire cost of allocation during insertions and deletions. The example in
\autoref{fig:upper-bound-transform} shows such a program.
\begin{figure}
\lstinputlisting{thesis-pics/upper-bound-transform.c}