\newpage
\section{Decision Trees}\label{s:decision-trees}
Decision trees are tree-like structures in which each internal node represents a ``test'' on an attribute, each branch represents the outcome of the specific test, and each leaf node represents a class label (the decision taken after testing all attributes along the path).
More specifically, decision trees depict models of decisions and their possible results.
They are also a simple way to display an algorithm that only contains conditional control statements.
Each path from the root node to a leaf represents a classification rule.
For instance, the attribute ``Gender'' would have two edges leaving it, one for ``Male'' and
one for ``Female''.
Each edge could point to a new attribute (for example ``Age Groups''), and so on.
The leaves of the tree contain the expected class value for transactions matching the path from the root to that leaf.
Given a decision tree, one can predict the class of a new transaction just by following the ``correct'' path, which ends in a class label.
In other words, the class of a transaction can be predicted from its non-class attributes alone, using the constructed decision tree.
For instance, two friends wish to decide whether they will go and play tennis outside, depending on the weather conditions.
They also remember from previous days which weather conditions did not allow them to play.
Thus, they decide to create a decision tree to assist them.
First they define the class attribute to be ``Play Tennis'', and three more attributes that will help them decide: ``Outlook'' (``Sunny'', ``Overcast'', ``Rainy''), ``Windy'' (``High'', ``Low'') and ``Humidity'' (``High'', ``Low'').
Then they construct a decision tree like the one in figure \ref{f:outlook}, which corresponds to the data they have collected from the past days.
\begin{figure}[th]
\centering
\includegraphics[width=0.8\linewidth]{figures/outlook.pdf}
\caption{Play-Tennis decision tree example}\label{f:outlook}
\end{figure}
From figure \ref{f:outlook}, the two friends are able to extract a set of rules, listed in Code \ref{sc:outlook-rules}, that will help them decide if they should go play tennis given the weather conditions.
{
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{bash}
IF (Outlook = Sunny) AND (Windy = Low) THEN Play=Yes
IF (Outlook = Sunny) AND (Windy = High) THEN Play=No
IF (Outlook = Overcast) THEN Play=Yes
IF (Outlook = Rainy) AND (Humidity = High) THEN Play=No
IF (Outlook = Rainy) AND (Humidity = Low) THEN Play=Yes
\end{minted}
\captionof{lstlisting}{Rules for decision tree from figure \ref{f:outlook}}
\label{sc:outlook-rules}
}
Of course, remembering the weather conditions of more days in the past, and whether they had played tennis on those days -- that is, having an extended dataset -- would probably result in more specific rules.
Decision tree rules cover all possible future decisions\hyp transactions and can be used to classify them.
Examples such as this make it evident why this type of learning has become so popular.
The algorithms by which those trees are generated vary and may lead to different trees.
One famous algorithm for decision tree construction is ID3 (Iterative Dichotomiser 3).
ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.
Both algorithms build decision trees from a set of training data using the concept of information entropy; however, C4.5 achieves better classification by utilizing a more effective splitting criterion.
\subsection{Textbook ID3}\label{s:id3}
The ID3 algorithm was first introduced in \cite{quinlan1986induction} and assumes that each attribute is categorical, such as the aforementioned attribute ``Age Groups'', which is separated into four disjoint categories: ``Infant'', ``Adolescent'', ``Child'' and ``Adult''.
The ID3 algorithm constructs the classification tree top-down in a recursive fashion.
The idea is to find the \textit{best} attribute that classifies the transactions.
At the start, the algorithm searches for and chooses the best attribute for the root node, and the transactions are subsequently partitioned by its values.
On each partition, ID3 is then recursively called.
The \textit{best} attribute is defined as the attribute that yields the smallest conditional entropy -- or, in other words, the highest information gain.
Let $T$ be a set of transactions, $C$ the class attribute and $A$ some non-class attribute.
On each iteration, ID3 iterates through every unused attribute $A$ of the dataset and calculates the entropy $H_C$ (equation \ref{eq:entropy} and algorithm \ref{a:id3-entropy-simple}) of every subset that results from splitting the dataset on attribute $A$.
\import{./}{algorithms/entropy_textbook.tex}
\begin{equation}\label{eq:entropy}
H_C(T) = \sum_{i=1}^{l} - \frac{\mid T(c_i) \mid}{\mid T \mid} \log_2{\frac{\mid T(c_i) \mid}{\mid T \mid}}
\end{equation}
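As a concrete, non\hyp private illustration, equation \ref{eq:entropy} translates directly into a few lines of Python; in the sketch below, the representation of transactions as dictionaries keyed by attribute name and the function name itself are our own choices and not part of algorithm \ref{a:id3-entropy-simple}.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
import math
from collections import Counter

def entropy(transactions, class_attribute):
    """H_C(T): entropy of the class attribute over a list of transactions (dicts)."""
    total = len(transactions)
    if total == 0:
        return 0.0
    counts = Counter(t[class_attribute] for t in transactions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
\end{minted}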
The idea is to identify the class of a transaction $t$, given that the value of $A$ has been obtained.
Let $A$ obtain values $a_1, \dots, a_m$ and let $T(a_j)$ be the transactions obtaining value $a_j$ for $A$.
Then, the conditional information of $T$ given $A$ is given by equation \ref{eq:TgivenA}.
\import{./}{algorithms/gain_textbook.tex}
\import{./}{algorithms/best_textbook.tex}
\begin{equation}\label{eq:TgivenA}
H_C(T \mid A) = \sum_{j=1}^{m} \frac{\mid T(a_j) \mid}{\mid T \mid} H_C(T(a_j))
\end{equation}
\begin{equation}\label{eq:gain}
Gain(A) = H_C(T) - H_C(T \mid A)
\end{equation}
Finally, the set $T$ is split by the selected attribute $A$ (best-attribute algorithm \ref{a:id3-best-simple}) that has the maximum gain (equation \ref{eq:gain} and algorithm \ref{a:id3-gain-simple}) -- or, equivalently, minimum $H_C(T \mid A)$ -- to produce subsets of the data.
The algorithm continues to recurse on each subset, considering only attributes never selected before.
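Equations \ref{eq:TgivenA} and \ref{eq:gain}, together with the choice of the splitting attribute of algorithm \ref{a:id3-best-simple}, admit an equally short sketch; it reuses the \texttt{entropy} helper from the previous listing, and again the function names are illustrative only.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
def conditional_entropy(transactions, attribute, class_attribute):
    """H_C(T | A): entropy of each subset T(a_j), weighted by |T(a_j)| / |T|."""
    total = len(transactions)
    subsets = {}
    for t in transactions:
        subsets.setdefault(t[attribute], []).append(t)
    return sum(len(s) / total * entropy(s, class_attribute) for s in subsets.values())

def gain(transactions, attribute, class_attribute):
    """Gain(A) = H_C(T) - H_C(T | A)."""
    return entropy(transactions, class_attribute) - conditional_entropy(transactions, attribute, class_attribute)

def best_attribute(transactions, attributes, class_attribute):
    """The attribute with maximum information gain, i.e. minimum H_C(T | A)."""
    return max(attributes, key=lambda a: gain(transactions, a, class_attribute))
\end{minted}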
The \f{AllExamplesSame} (see algorithm \ref{a:id3-same-simple}) procedure is used to determine if all transactions are of the same class, by checking the value of $classAttribute$ for each transaction.
It returns $true$ if they are, $false$ otherwise.
\import{./}{algorithms/allsame_textbook.tex}
The \f{MostCommonLabel} (algorithm \ref{a:id3-most-common-label-simple}) procedure returns the class label that is most common in all available transactions.
\import{./}{algorithms/most_common_label_textbook.tex}
The recursive \f{ID3} algorithm (algorithm \ref{a:id3-simple}, shown below) has the following structure, with three halting conditions.
\begin{enumerate}
\item If the set of remaining attributes is empty, then the algorithm returns the label that is most common in all transactions as a leaf (lines 2-3).
\item If all transactions are of the same class, then the algorithm returns this class as a leaf (lines 4-5).
\item If none of the above conditions hold, then the algorithm finds the best splitting attribute (using algorithm \ref{a:id3-best-simple}) and makes a branch for every possible value of that attribute (lines 7-19).
\end{enumerate}
\import{./}{algorithms/id3_textbook.tex}
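A minimal Python rendering of this recursion, under the same illustrative assumptions as the previous sketches (transactions as dictionaries, the \texttt{best\_attribute} helper defined earlier, and an inner node represented as a nested dictionary), could look as follows.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
from collections import Counter

def most_common_label(transactions, class_attribute):
    """Class label occurring most often in the given transactions."""
    return Counter(t[class_attribute] for t in transactions).most_common(1)[0][0]

def all_examples_same(transactions, class_attribute):
    """True iff every transaction carries the same class label."""
    return len({t[class_attribute] for t in transactions}) == 1

def id3(transactions, attributes, class_attribute):
    """Returns a leaf (a class label) or an inner node {attribute: {value: subtree}}."""
    if not attributes:                                    # halting condition 1
        return most_common_label(transactions, class_attribute)
    if all_examples_same(transactions, class_attribute):  # halting condition 2
        return transactions[0][class_attribute]
    best = best_attribute(transactions, attributes, class_attribute)
    remaining = [a for a in attributes if a != best]
    node = {best: {}}
    # for brevity we branch only on attribute values observed in the data; the
    # textbook version also adds a most-common-label leaf for unseen values
    for value in {t[best] for t in transactions}:
        subset = [t for t in transactions if t[best] == value]
        node[best][value] = id3(subset, remaining, class_attribute)
    return node
\end{minted}
Called on the Play\hyp Tennis data with the attributes ``Outlook'', ``Windy'' and ``Humidity'', such a sketch would produce a nested dictionary mirroring the tree of figure \ref{f:outlook}.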
\subsection{Privacy Preserving ID3}\label{s:pp-id3}
In the previous subsection we described the ID3 algorithm which operates on public data.
We now describe the privacy\hyp preserving version of the same algorithm, where all the transactions (\textit{examples}) from which the algorithm builds the tree are private data.
First of all, a key difference from the textbook algorithm is the inability to maintain subsets of the transactions based on private conditions.
In particular, as can be seen in line 11 of algorithm \ref{a:id3-simple}, we wish to keep a subset of all $examples$ that have a certain value in the $bestAttribute$ column.
As we cannot have conditional statements on private data, the only thing we can do is to apply the oblivious selection technique described in section \ref{s:secrec}.
Thus, we are not able to keep only the subset of rows that satisfy our condition.
A solution to this problem is to define $subset$ as a copy of the $examples$ array.
We keep a vector $eq$ (see line 17 of algorithm \ref{a:id3-pp}) of the same length as $examples$.
The value of $eq[i]$ is $\enc{1}$ if the $bestAttribute$ value of $examples[i]$ is equal to the attribute value $v$ under consideration, or $\enc{0}$ otherwise.
Since the values of $eq$ are encrypted, we perform a vector multiplication between $eq$ and $examples$, so the result is identical to $examples$ for every row $i$ where the condition holds ($eq[i]$ is equal to $\enc{1}$), and $\enc{0}$ otherwise.
Then we add a dummy row $[\enc{-1}, \enc{-1}, \dots, \enc{-1}]$ (a row of all $\enc{-1}$, denoted simply as $\enc{-1}$) to all rows that do not satisfy the condition (using multiplication with $\enc{1} - eq$).
The resulting array has all rows from $examples$ that satisfy the condition, and rows full of $\enc{-1}$ wherever the condition does not hold.
For that reason, all the algorithms listed below operate on these kinds of ``sets''.
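The following plaintext sketch illustrates the masking technique; in the actual protocol the comparison, the multiplications and the additions all operate on encrypted values, and the function name and the list\hyp of\hyp lists representation are our own.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
def masked_subset(examples, attribute_index, value):
    """Plaintext simulation of the oblivious subset construction: rows that do not
    match the condition are replaced by the all -1 dummy row."""
    subset = []
    for row in examples:
        eq = 1 if row[attribute_index] == value else 0              # enc(1) / enc(0)
        # eq * row keeps the original row; (1 - eq) * (-1) produces the dummy row
        subset.append([eq * cell + (1 - eq) * (-1) for cell in row])
    return subset

# e.g. masked_subset([[0, 1], [1, 2]], 0, 1) == [[-1, -1], [1, 2]]
\end{minted}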
In algorithm \ref{a:id3-count-positives-pp} we describe the \f{CountPositives} procedure which returns the (encrypted) number of rows that are not the dummy $[\enc{-1}, \enc{-1}, \dots, \enc{-1}]$ row.
The \f{CountPositives} procedure is equivalent to the call of $Length$ on sets with public data.
\import{./}{algorithms/count_positives.tex}
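In the clear, the corresponding bookkeeping reduces to counting the non\hyp dummy rows; the sketch below assumes, purely for illustration, that genuine attribute values are non\hyp negative integers.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
def count_positives(masked_examples):
    """Plaintext sketch of CountPositives: the number of rows that are not the
    all -1 dummy row; in the protocol this count is computed on encrypted values."""
    return sum(1 for row in masked_examples if any(cell != -1 for cell in row))
\end{minted}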
The \f{AllExamplesSame} procedure described in algorithm \ref{a:id3-same-pp} returns the same result as that in algorithm \ref{a:id3-same-simple}.
The way this is achieved is by counting the number of examples\myslash transactions that are of class $c_i$, for each possible class $c_i$.
The private variable $res$ is initialized to $\enc{0}$, and gets increased if any of those counts is equal to the total number of examples, \textit{i.e.} all examples are of that class.
If $res$ is greater than $\enc{0}$ (equal to $\enc{1}$) then all the examples have the same class.
\import{./}{algorithms/allsame_pp.tex}
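A plaintext sketch of this logic is shown below; it reuses the \texttt{count\_positives} helper from the previous sketch, and in the real procedure every comparison and addition is of course performed obliviously on encrypted values.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
def all_examples_same(masked_examples, class_index, class_values):
    """Plaintext sketch of the privacy-preserving AllExamplesSame: res grows by one
    whenever the count of some class equals the total number of genuine examples."""
    total = count_positives(masked_examples)
    res = 0
    for c in class_values:
        count_c = sum(1 for row in masked_examples if row[class_index] == c)
        res += 1 if count_c == total else 0   # oblivious comparison in the protocol
    # in the protocol the result is an encrypted bit: enc(1) iff all genuine
    # examples share a class, enc(0) otherwise
    return res > 0
\end{minted}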
Similarly to the \f{AllExamplesSame} procedure, the \f{MostCommonLabel} procedure also maintains the transaction counts for each possible class $c_i$.
Using the technique described in algorithm \ref{a:max}, we obliviously keep the maximum count and the class label corresponding to that count, which is returned.
\import{./}{algorithms/most_common_label_pp.tex}
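The same counting pattern, combined with a running maximum, can be sketched in the clear as follows; in the protocol the comparisons and selections are oblivious.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
def most_common_label(masked_examples, class_index, class_values):
    """Plaintext sketch of the privacy-preserving MostCommonLabel: keep a running
    maximum count together with its label, as the oblivious maximum technique does."""
    best_count, best_label = -1, class_values[0]
    for c in class_values:
        count_c = sum(1 for row in masked_examples if row[class_index] == c)
        greater = count_c > best_count                      # oblivious comparison
        best_count = count_c if greater else best_count     # oblivious selection
        best_label = c if greater else best_label
    return best_label
\end{minted}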
The \f{Best} procedure obliviously chooses the attribute that has the greatest information gain.
This attribute is considered to be the best to split the dataset on and will be included in an output tree node.
That is why this attribute can be handled as public data (note the usage of the \f{Declassify} operator in line 10) after it has been privately computed.
\import{./}{algorithms/best_pp.tex}
As in algorithms \ref{a:id3-entropy-simple} and \ref{a:id3-gain-simple}, procedures \f{Entropy} and \f{InformationGain} (algorithms \ref{a:id3-entropy-pp}, \ref{a:id3-gain-pp}) compute the entropy of a set of transactions, and an attribute's information gain respectively.
The key differences with the previously described textbook algorithms include the usage of the \f{CountPositives} procedure instead of \f{Length}, due to the difference in the sets involved in the computation, and the modified $\log_2$ function that handles zero input.
Note that the subsets (algorithm \ref{a:id3-gain-pp}, line 5) are also constructed with the technique described at the beginning of this section.
\import{./}{algorithms/entropy_pp.tex}
\import{./}{algorithms/gain_pp.tex}
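A plaintext sketch of these two procedures is given below; it uses a stand\hyp in for the modified $\log_2$ function and reuses the \texttt{count\_positives} and \texttt{masked\_subset} helpers from the earlier sketches, while in the actual protocol the divisions and logarithms are computed on encrypted values.
\begin{minted}[xleftmargin=21pt, framesep=3mm, frame=single, linenos, tabsize=2, breaklines, breaksymbolleft=, fontsize=\footnotesize]{python}
import math

def log2_or_zero(x):
    """Stand-in for the modified log2 that tolerates zero input: empty classes
    or empty subsets simply contribute nothing to the sums below."""
    return math.log2(x) if x > 0 else 0.0

def entropy(masked_examples, class_index, class_values):
    """H_C over a masked set: Length is replaced by CountPositives."""
    total = count_positives(masked_examples)
    h = 0.0
    for c in class_values:
        count_c = sum(1 for row in masked_examples if row[class_index] == c)
        p = count_c / total if total else 0.0
        h -= p * log2_or_zero(p)
    return h

def information_gain(masked_examples, attr_index, attr_values, class_index, class_values):
    """Gain(A) over a masked set; the subsets are built with masked_subset."""
    total = count_positives(masked_examples)
    h_cond = 0.0
    for v in attr_values:
        subset = masked_subset(masked_examples, attr_index, v)
        weight = count_positives(subset) / total if total else 0.0
        h_cond += weight * entropy(subset, class_index, class_values)
    return entropy(masked_examples, class_index, class_values) - h_cond
\end{minted}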
The complete \f{ID3} procedure is described in algorithm \ref{a:id3-pp}.
In lines 6-11 of the algorithm we retrieve the class label that is common to all examples and return it.
In order to do that, we cannot simply take the value of an arbitrary example in the $classAttribute$ column.
We must make sure that this is not a dummy ($\enc{-1}$) example (line 8).
The rest of the algorithm is quite similar to the textbook one.
The differences mainly concern the handling of the subsets (creation, counting, etc.).
\import{./}{algorithms/id3_pp.tex}
\subsubsection{Privacy Assessment}\label{s:id3-privacy-assessment}
In privacy-preserving algorithms, all intermediate values should remain private.
In many cases, intermediate values may reveal patterns and, in general, sensitive information about the private inputs.
However, in the case of ID3, some of these intermediate values are part of the output and eventually will be revealed.
For instance, the attributes chosen in each node of the tree will be revealed in the final results.
Thus, there is no reason to try to protect them during the protocol execution.
In the privacy\hyp preserving ID3, as in all algorithms in this thesis, we explicitly define which values are private (and thus encrypted), and which are public.
As stated in \cite{lindell2000privacy}, although the name of the attribute with the highest information gain is revealed, nothing is learned of the actual $H_C(T \mid A)$ values themselves.
This observation holds for all such cases, since neither the information gain nor any other intermediate value that is not explicitly denoted as public is leaked.
For an algorithm to remain private, the requirement is that any information learned by the algorithm can also be learned directly from the public input and output \cite{lindell2000privacy}.
As mentioned in section \ref{s:secrec}, the \f{Declassify} operator publishes a private value.
In the following, we reason about the selection of public and private data in the described algorithms, as well as about the usage of the \f{Declassify} operator.
First of all, we consider the case in which all involved parties use the same schema for their data.
Thus, attribute names are known to all parties and should not be considered private information.
Lindell and Pinkas have constructed a protocol in \cite{lindell2000privacy} for privately computing ID3 that works in the two\hyp party setting.
This two\hyp party protocol is proven to be private.
However, that algorithm does not consider the case where the set of transactions $T$ is empty. Algorithm \ref{a:id3-pp} is a generalization of the two\hyp party protocol described in \cite{lindell2000privacy} to the multi\hyp party setting.
The algorithm also addresses the empty transaction set case.
The ID3 algorithm (see algorithm \ref{a:id3-pp}) can terminate (\textit{i.e.} return a value) in three different cases.
Each time, the algorithm's output is a node of the resulting tree.
That is why everything the algorithm returns in any of the three cases should be considered public data (see the \f{Declassify} operator in lines 9, 11 and the public $branches$ variable in line 14).
\textbf{Empty attribute set:}
The first termination case of the algorithm is when the set of remaining attributes is empty, \textit{i.e.} when there are no attributes left for the algorithm to check.
This information is publicly known since the remaining attributes can be derived from the output tree.
In this case, the algorithm returns the most common label of the remaining transactions.
Here, we use the \f{Declassify} operator to publish that label and add it as a leaf to the tree.
\textbf{All examples same:}
The second way that algorithm \ref{a:id3-pp} can terminate is when all transactions of the dataset are of the same class.
We use the \f{Declassify} operator to reveal that information (\textit{i.e.} whether the whole dataset is of the same class).
\textbf{After recursive calls:}
Finally, if none of the above cases terminate the execution of ID3, the algorithm will recursively add either a leaf or a sub\hyp tree for each possible value of the selected attribute, and then return.
In order to determine whether to add a leaf or a sub\hyp tree, we publish the information that there are no transactions left in the subset of the dataset that corresponds to a particular value of the selected attribute.
By observing the output tree, as well as the public input, one can make the following observations.
Wherever there is a leaf node with class $c_i$, one can tell if the set of remaining attributes is empty, simply by observing the path from the root of the tree.
If all attributes have been used then the set is empty, otherwise it is not.
If the attribute set is not empty, we can deduce that one of two things holds: either all transactions that belong to the current path are of the same class ($c_i$), or there are no transactions at all in that path.
This exact information is also published by algorithm \ref{a:id3-pp}; the only difference is that the information published by the algorithm distinguishes between the case where all transactions are of the same class and the corner case of having no transactions for a particular path of the tree.