-
Notifications
You must be signed in to change notification settings - Fork 1
/
histograms.tex
252 lines (184 loc) · 19.2 KB
/
histograms.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
\section{Privacy Preserving Histograms}\label{s:histograms}
Despite the challenges presented in sections \ref{s:challenges} and \ref{s:simple-algorithms}, the majority of algorithms can be transformed to their privacy preserving equivalent.
However, this is not a straightforward translation , as we examined in section \ref{s:simple-algorithms}, and in most cases it adds a complexity overhead to every algorithm that depends its control flow decisions in private data.
Histograms is a practical and notable example of algorithms that are widely used and the complexity of their privacy preserving version remains in computationally feasible levels, comparing to the textbook algorithm.
But first of all, what is a histogram?
As stated in \cite{ioannidis2003history}, histograms initially conceived as a visual aid to statistical approximations.
Webster’s defines a histogram as ``a bar graph of a frequency distribution in which the widths of the bars are proportional to the classes into which the variable has been divided and the heights of the bars are proportional to the class frequencies".
A histogram is generally a form of classifying and representing data in some categories of a specific range; the range is an individual ``base" element associated with each column.
More specifically, a histogram on some attributes $\{A, B, \dots, Z\}$ is constructed by partitioning the data distribution of those attributes into some ranges which are mutually disjoint subsets called buckets and approximating the frequencies and values in each bucket.
For the sake of simplicity let us suppose that the number of ranges ($\beta$) of all attributes are equal ($\beta \geq 1$).
In figures \ref{f:simple-1d-hist} and \ref{f:simple-2d-hist} we present two histograms, the first one\hyp dimensional over the attribute ``Patient Age" and the second is a two\hyp dimensional over the attributes ``Patient Age" and ``Heart rate".
In both histograms $\beta = 4$, since the range of values for each attribute has been partitioned in four mutually disjoint subsets.
In the 1\hyp dimensional histogram \ref{f:simple-1d-hist} the $y-axis$ corresponds to the total number of occurrences of values that belong to each bucket.
In the 2\hyp dimensional histogram \ref{f:simple-2d-hist} since $y-axis$ corresponds to the ranges for the buckets for the second attribute, the occurrences are depicted with different colors.
\begin{figure}[t]
\centering
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/1d-simple-hist.png}
\captionof{figure}{An one-dimensional histogram with $\beta = 4$}
\label{f:simple-1d-hist}
\end{minipage}%
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/2d-simple-hist.png}
\captionof{figure}{A two-dimensional histogram\myslash heatmap with $\beta = 4$}
\label{f:simple-2d-hist}
\end{minipage}
\end{figure}
\begin{figure}[t]
\centering
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/AgeGroups.png}
\captionof{figure}{One-dimensional histogram displaying Age Groups}
\label{f:simple-1d-categorical-hist}
\end{minipage}%
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/AgeGroups_PopulationGroups.png}
\captionof{figure}{A two-dimensional histogram displaying Age Groups with Continental Population Groups}
\label{f:simple-2d-categorical-hist}
\end{minipage}
\end{figure}
Similarly, in figures \ref{f:simple-1d-categorical-hist} and \ref{f:simple-2d-categorical-hist} we present two more histograms, but with categorical values.
The attribute ``Age Groups" in \ref{f:simple-1d-categorical-hist} is separated in four disjoint categories; ``Infant", ``Adolescent", ``Child" and ``Adult".
In \ref{f:simple-2d-categorical-hist} we present demographic information (attribute ``Continental Population Groups") in relation with ``Age Groups".
\subsection{Algorithms for Privacy Preserving Histograms}\label{ss:histogram-algos}
Histograms are one of the simplest ways to visualize and easily understand data; rendering them very useful in many data analysis applications.
As we already mentioned, histograms consist of bars that are mutually disjoint and their heights are proportional to the counts of values in the corresponding ranges.
\textbf{Challenges with Privacy-Preserving Histograms}: The textbook algorithm splits the dataset to buckets and consecutively for each item it increments the corresponding ``bucket-counter".
Implementing this algorithm with respect to the privacy of the dataset introduces the problem that for each value it is not trivial to find the bucket that this specific value should be placed.
For instance, let us suppose that we are constructing the histogram in figure \ref{f:simple-1d-categorical-hist}.
In this case $\beta = 4$, thus the algorithm should obliviously split the values into four disjoint buckets.
The range of each bucket is computed considering the $\beta$ parameter, as well as the minimum and maximum age in the dataset.
In this specific example, the minimum age is $18$, while the maximum is $83$.
Having $\beta = 4$ results to bucket-range $(83 - 18) / 4 = 16.25$.
Therefore, the ranges for attribute ``Patient Age" are formed as follows: $[18, 34.25),$ $[34.25, 50.50),$ $[50.50, 66.75),$ $[66.75, 83]$.
Consecutively, for each patient in the dataset the algorithm calculates the number of bucket that the patient belongs by dividing his\myslash her value with the bucket-range; however, this index is encrypted.
Finally, the algorithm iterates through all $\beta$ buckets and obliviously adding $\enc{1}$ if the bucket is the correct one, or $\enc{0}$ otherwise.
The aforementioned algorithm works for numerical values; however, the one for categorical values is very similar.
Bucket-range is fixed to one, and $\beta$ equals to all different possible values in the dataset.
Also, there is no notion of terms minimum and maximum in categorical values.
\subsubsection{Privacy Preserving Histogram Computation: A Naive Approach}\label{sss:histogram-simple}
As we mentioned in section \ref{s:two-types-of-data}, we have separated our algorithms in two major categories, for categorical and for numerical data.
The procedure described in \ref{ss:histogram-algos} is shown in algorithms \ref{a:1d-simple-histogram-categorical} and \ref{a:1d-simple-histogram-numerical}.
These algorithms iterate through all $N$ items and check all possible cells for that item.
When the correct cell is found, they increment the corresponding counter.
The former is tailored to categorical datasets, while the latter is tailored to numerical\myslash continuous values.
\import{./}{algorithms/hist_simple_categorical.tex}
\import{./}{algorithms/hist_simple_numerical.tex}
Both algorithms are straightforward and simple to understand, however, not very efficient.
In the following subsections we delve into details for the privacy preserving algorithms that are based in the same idea, but leverage \texttt{SIMD} operations.
\subsubsection{Privacy Preserving Histograms for Categorical Values}\label{sss:histogram-categorical}
\textbf{One-Dimensional Histograms}: In algorithm \ref{a:1d-histogram-categorical} we present the privacy preserving algorithm of an one-dimensional (1D) histogram for categorical values.
In simple words, categorical data means that the values are discrete.
Hence, the second parameter in the algorithm (dubbed $P$) is the number of possible choices that exist in $array[N]$.
The input data is given in the form of a private array\myslash vector.
The algorithm creates a boolean array of equal size as the input data, and then for each possible encrypted value of the dataset checks (line 3) and counts (line 4) its occurrences.
Method {\fontfamily{lmss} $\textsc{Sum}$} is the same as in algorithm \ref{a:sum}.
Finally, the counts are gathered into a vector and returned -- still encrypted -- to the user.
\import{./}{algorithms/1d_hist_categorical.tex}
\textbf{Multi-Dimensional Histograms}: In algorithm \ref{a:multidim-histogram-categorical} we present the privacy preserving algorithm of multi-dimensional histograms for categorical values.
As in algorithm \ref{a:1d-histogram-categorical}, here, the third parameter ($Ps$) is the number of possible choices that exist in $array$.
However, since here the input data is a private array with multiple dimensions $array[N][M]$, the possible values should express the possible values for each dimension.
Thus, $Ps$ is an array of $A$ slots, where $A$ is the number of attributes.
The algorithm that computes a private multi-dimensional histogram is similar to the one regarding one-dimensional histograms but also addresses the issue of having a histogram of arbitrarily many dimensions.
In case we had a fixed (\textit{i.e.} known \textit{a-priori}) number of dimensions the simplest solution would be to use nested loops, as many as the histogram dimensions.
Since the number of dimensions is not known, we have to think of something else.
We represent the multi-dimensional histogram as a serialized version with an one-dimensional array (a vector) instead of using a multi-dimensional array (a matrix).
For example a 2-dimensional $3 \times 4$ histogram will be represented as a vector whose length is $ 12 $ ($= 3 \cdot 4$), and a $3 \times 4 \times 5$ 3-dimensional histogram wit a vector of length $ 60 $ ($= 3 \cdot 4 \cdot 5$)
When it comes to indexing if we wish to access the 2-dimensional $N \times M$ histogram represented as an one-dimensional at row $ i $ and column $ j $, instead of using $h[i][j]$ we use $h[i \cdot M + j]$.
Similarly, for the 3-dimensional $L \times N \times M$ histogram, $h[i][j][k]$ becomes $h[i \cdot N \cdot M + j \cdot M + k]$. So there needs to be a computation of a single index based on the multiple dimension indexes. In algorithm \ref{a:multidim-histogram-categorical}, the $positions$ vector holds this index for each record of the provided array, as can be seen in lines $ 6 $ and $ 7 $
Similarly to the 1D algorithm, the multi-dimensional one creates a boolean array ($eq$) for each possible histogram cell, which counts the occurrences of that cell in the $positions$ vector, as can be seen in lines $12$ and $13$ of algorithm \ref{a:multidim-histogram-categorical}.
\import{./}{algorithms/multdim_hist_categorical.tex}
\subsubsection{Privacy Preserving Histograms for Numerical Values}\label{sss:histogram-numerical}
\textbf{One-Dimensional Histograms}: In algorithm \ref{a:1d-histogram-numerical} we present the privacy preserving algorithm of one-dimensional (1D) histograms for numerical values.
In contrast with the categorical data, numerical data are not discrete, which means that the buckets in which the histogram will separate the dataset should be fixed (the $\beta$ parameter, as mentioned in section \ref{s:histograms})
First and foremost, the input data is given in the form of a private array\myslash vector of $N$ positions.
The second parameter is open (since the final results will eventually disclose it), and is the number of buckets\myslash cells that the algorithm will create, namely the $\beta$ factor.
The two last parameters are also encrypted and are the minimum and maximum values found in the dataset\myslash array (first parameter of the algorithm).
Those two parameters are necessary in order to determine for each element in the array in which cell should be placed.
This algorithm is very similar and in accord with algorithm \ref{a:1d-histogram-categorical}, however does some extra steps.
In lines 2 and 3, it first determines the width for each cell and then creates an array of $N$ elements that each one indicates the bucket that the element in the corresponding position of $array$ should be placed.
Consecutively, the algorithm creates another boolean array of equal size as the input data, and then for each possible encrypted value of the $cellMap$ checks (line 5) and counts (line 6) its occurrences.
Finally, the counts are gathered into a vector and returned -- still encrypted -- to the user.
Line 5 of algorithm \ref{a:1d-histogram-numerical} is more complex than line 3 of algorithm \ref{a:1d-histogram-categorical}, since the former has to check some corner cases of data that belong to the last bucket.
\import{./}{algorithms/1d_hist_numerical.tex}
\textbf{Multi-Dimensional Histograms}:
The algorithm that implements multi\hyp dimensional histograms for numerical, \textit{i.e.} continuous data is a straightforward generalization of algorithms \ref{a:multidim-histogram-categorical} and \ref{a:1d-histogram-numerical}.
Similarly, for each specified attribute we compute the corresponding cell width (as in the 1\hyp dimensional for numerical values -- alg. \ref{a:1d-histogram-numerical}), that depends on the minimum and maximum values and the specified cell number of that attribute.
Consecutively, we compute the $positions$ vector containing the histogram cell for each one of the $N$ rows of $array$.
Like in algorithm \ref{a:multidim-histogram-categorical}, the $positions$ array contains single indexes that correspond to multiple indexes according to the procedure described above in section \ref{sss:histogram-categorical}.
% \import{./}{algorithms/multdim_hist_numerical.tex}
\subsubsection{Filters in Privacy Preserving Histograms}\label{sss:histogram-filters}
Although histograms are one of the simplest ways to visualize and easily understand data, they also come with some limitations.
For instance, it is difficult to visualize in one image more than three different attributes.
Even though we can compute histograms of arbitrarily many dimensions, even a 3D representation can sometimes be very obscure and difficult to understand.
Oftentimes, one needs to take into consideration multiple attributes in order to get meaningful results back.
However, adding more attributes to the computed histogram is not always an option as for the reasons described above, the output could be obscure or even useless whatsoever.
For all those reasons, we have implemented a \textit{filtering} technique in order to take more attributes into consideration for the secure computation.
With filtering, the user is able to select a histogram computation over some attributes, and also specify some extra constraints over the same or different attributes.
Each tuple from the dataset has to meet the specified criteria\myslash filters in ordered to be counted in the resulting histogram.
The specified filters are represented as a list of constraints that should be met.
Each constraint is defined by the corresponding attribute, an operator ( one of $\{<, >, =\}$ for continuous attributes, and just $=$ for categorical ones) and a value.
Also a boolean operator (\textit{e.g.} \texttt{AND, OR, XOR}) that will be applied between the different constraints is specified.
For example, one could want to see the correlation of age and height but only for a subset of the dataset.
Let us suppose two examples, in the first we want the subset that include all patients of age above $45$, while in the second subset we want to include all patients of age above $45$ and that in the same time weigh less than $90$ kg.
This can be achieved with two requests of two\hyp dimensional histograms on the attributes \textit{Patient Age} and \textit{Height (cm)} that also includes the filters \textit{Patient Age} $ > 45$, and the second additionally have the limitation that \textit{Weight (kg)} $ < 90$, and also the boolean operator \texttt{AND} between them.
These will result to two 2\hyp dimensional histograms (heatmap) with axes the \textit{Patient Age} and \textit{Height (cm)} attributes, but at the same time the results will be restricted by the selection of these filters.
The described example from above, is depicted in figures \ref{f:2d_no_filter}, \ref{f:2d_1_filter} and \ref{f:2d_filter}, in which the impact of the filters is obvious.
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{figures/2d_no_filter.png}
\caption{Histogram on \textit{Patient Age} and \textit{Height (cm)} with no filters}
\label{f:2d_no_filter}
\end{figure}
\begin{figure}[th]
\centering
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/2d_1_filter.png}
\captionof{figure}{Histogram on \textit{Patient Age} and \textit{Height (cm)} with filters on \textit{Patient Age}}
\label{f:2d_1_filter}
\end{minipage}%
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/2d_filter.png}
\captionof{figure}{Histogram on \textit{Patient Age} and \textit{Height (cm)} with filters on \textit{Patient Age} and \textit{Weight (kg)}}
\label{f:2d_filter}
\end{minipage}
\end{figure}
Another example can be an one\hyp dimensional histogram on Body mass Index (BMI) on the subset of patients that weigh more than $90$ kg or are less than $170$ cm in height.
This can be achieved withl a request of an one\hyp dimensional histogram on the attribute \textit{BMI (kg/msq)} that also includes the filters \textit{Weight (kg)} $ > 90$, \textit{Height (cm)} $ < 170$, and also the boolean operator \texttt{OR} between them.
The effect of this example can be found in figures \ref{f:1d_no_filter} and \ref{f:1d_filter}.
\begin{figure}
\centering
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/1d_no_filter.png}
\captionof{figure}{Histogram on \textit{BMI (kg/msq)} with no filters}
\label{f:1d_no_filter}
\end{minipage}%
\begin{minipage}{.5\textwidth}
\centering
\includegraphics[width=\columnwidth]{figures/1d_filter.png}
\captionof{figure}{Histogram on \textit{BMI (kg/msq)} with filters on \textit{Height (cm)} and \textit{Weight (kg)}}
\label{f:1d_filter}
\end{minipage}
\end{figure}
\import{./}{algorithms/filters.tex}
The way that \textit{filters} have been implemented is by constructing a vector of size $N$ (\textit{i.e.} as long as the number of tuples in the dataset), that works as a boolean mask corresponding to the specified constraints.
Each position of that vector corresponds to a data tuple from the dataset.
If that tuple satisfies all the specified constraints, then the vector should have the (encrypted) value \texttt{True} in that position, or (encrypted) \texttt{False} otherwise.
This vector then is taken into consideration when we build the histogram.
Only tuples which have ``scored'' \texttt{True} are counted into the histogram.
In algorithm \ref{a:filters-constraint-mask}, the aforementioned vector is called $constraintMask$.
In lines 14-20 we build this boolean mask for each constraint into the $eq$ vector, and in lines 21-27 we combine these vectors to build the total one, according to the specified boolean operator applied between them.
The algorithm for computing histograms for either categorical or numerical attributes utilizes this $constraintMask$ vector.
After having computed the constraints, in order to join them with the data computed by the algorithms, a private multiplication with the $N$-size vector $constraintMask$ is needed.
This multiplication can be done with SIMD operation, which saves a lot of communication overhead between the computing nodes.
As the multi\hyp dimensional histogram computation algorithms for numerical and categorical data, constraint mask can be used for as many dimensions as requested.
Since the constraint\hyp mask is dependent to the number of tuples in the dataset, the dimensions of the requested histogram can be as many as the user requests.
% \import{./}{algorithms/filters2.tex}