\chapter{Enabling Technologies}
\label{chap:enabling_technologies}
\begin{chapterintro}
This chapter provides an overview of the existing technologies on which the development of the project will rely. Starting from a general description of the \emph{Knowledge Discovery in Databases} process, a deeper insight into the actual procedures is given, and several \emph{Data Mining} algorithms are presented, among them our choice for this project.
Finally, the environment and other technologies to be used for the development of this project are introduced, such as \emph{the R language} and \emph{RStudio}.
\end{chapterintro}
\section{Knowledge Discovery in Databases}
\label{sec:kdd}
The overall process we are approaching in this project is usually known as \emph{Data Mining} or, more generally, \emph{Knowledge Discovery in Databases}. A great deal of research has already been done in this field, which will serve as background for our project, along with tools and algorithms designed to treat similar problems, which we can adapt or use as a basis to develop our own~\cite{chen1996data, fayyad1996kdd}.
Knowledge Discovery in Databases (\emph{KDD}) is a term used to describe the procedure of acquiring high-level knowledge from low-level data. As a formal definition, \emph{Knowledge Discovery} is the non-trivial extraction of implicit, previously unknown, and potentially useful information from data~\cite{frawley1992knowledge}. This knowledge is usually found in the form of patterns and relations between variables that were not previously thought to be related.
The KDD process involves several steps~\cite{feyyad1996data} which can be summarized as follows:
\begin{enumerate}
\litem{Understanding the problem:} The first step involves understanding the environment we are studying and gaining relevant prior knowledge. In this step we must identify the goals we want to set for the knowledge discovery process; that is, the kind of knowledge we want to obtain and the data available for the process.
\litem{Creating a target dataset:} We will usually need to select a subset of variables from the available datasets. While the system we are studying may log many variables to record events or relate data, we are unlikely to need all of them to characterize our problem. Reducing the dimensionality of the problem provides better results and eases the following steps.
\litem{Data cleaning and preprocessing:} In this step we must discern which data is actually relevant and significant for our study, and which is merely noise or outliers that should be disregarded. Operations such as noise modelling or the mapping of missing and unknown values are also performed in this step; a minimal sketch of these two steps in R follows this list.
\litem{Data mining:} For this step we must first have decided on the purpose of the model derived by the data mining algorithm, for example summarization, regression or clustering. Depending on this decision, some data mining algorithms will be more appropriate than others.
\litem{Interpretation of results:} This consists of interpreting the discovered patterns, removing those that are redundant or irrelevant, and translating the useful ones into understandable terms.
\litem{Consolidation of discovered knowledge:} The discovered knowledge is finally consolidated in an appropriate form. Depending on the context of our project, it might simply be documented or integrated into predictive modules for the analysed systems.
\end{enumerate}
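As an illustration of the target dataset and cleaning steps, the following is a minimal sketch in R; the file \texttt{events.csv} and its column names are hypothetical.
\begin{verbatim}
# Minimal sketch of target dataset creation and cleaning.
# The file "events.csv" and its columns are hypothetical.
events <- read.csv("events.csv", stringsAsFactors = FALSE)

# Target dataset: keep only the variables relevant to the study.
target <- events[, c("date", "machine", "event_type")]

# Basic cleaning: drop rows with missing values (alternatively,
# they could be imputed or mapped to a special value).
target <- na.omit(target)

# Discard extremely rare event types, treating them as noise.
counts <- table(target$event_type)
target <- target[target$event_type %in% names(counts[counts >= 5]), ]
\end{verbatim}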
\emph{Data Mining} comprises a large number of different algorithms which can be used for the Knowledge Discovery process. Depending on the nature of the data to which we will apply these algorithms, and on the kind of knowledge we expect or want to acquire, we will need algorithms of different types.
These algorithms can usually be classified into the following categories:
\begin{enumerate}
\litem{Classification:} Learning a function that maps an item into predefined classes.
\litem{Regression:} Learning a function that maps an item to a predicted variable.
\litem{Segmentation:} Identifying a set of clusters to categorise the data.
\litem{Summarization:} Finding a compact description for the data.
\litem{Association:} Finding significant dependencies between different variables~\cite{Zhao2003association}.
\litem{Sequence analysis:} Finding frequent sequences or episodes in data~\cite{zhao2003sequential,weiss2002predicting}.
\end{enumerate}
\section{Choosing an Algorithm}
\label{sec:algorithms}
From the previous classification, \emph{Segmentation} and \emph{Summarization} algorithms are obviously out of the question, as their functionality differs completely from the objectives of our project. \emph{Regression} algorithms are inadequate as well due to the nature of our data: we do not have variables whose future values we want to predict.
\emph{Classification} algorithms might not seem like a good choice at first, as we do not need to classify the events into any existing categories. However, if we define an appropriate set of categories and an appropriate model of the items to classify, classification algorithms can actually be useful for our task. If we model the current events as the \emph{item} to classify, and the possible categories as the possible events which can happen in the future, we can classify the current situation (defined by the events which have already happened) into categories, each defining what would happen next.
However, \emph{sequence analysis} seems to be the most appropriate category at first glance: we have historical data and we want to find event patterns from which we can foresee future events. Frequent patterns offer useful information in this direction: if we know a set of events which often happen in the same sequence, we can expect the later events in the sequence to happen once we have seen the first ones.
These sequences offer a good starting point, but it is important to realize that \emph{frequency} in a pattern refers to the total number of occurrences during the observed period, and does not in any way indicate the \emph{probability} that the last events in a sequence will happen once the first ones have been observed. In other words, we want to obtain predictions with a high \emph{probability} rather than a high \emph{frequency}.
In this direction, we will use the approach of the \emph{association} algorithms. Using frequent sequences as a starting point, we will build \emph{association rules} which will relate events in the form of boolean variables.
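To make this distinction concrete, for an association rule $A \Rightarrow B$ the \emph{support} measures how frequent the pattern is, while the \emph{confidence} estimates the conditional probability we are interested in:
\[
\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)},
\]
where $\mathrm{supp}(A \cup B)$ is the fraction of observations containing both $A$ and $B$, and $\mathrm{supp}(A)$ the fraction containing $A$. A rule can thus be frequent yet have low confidence, and vice versa; for prediction we select rules by confidence.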
Therefore, we find this to be the most appropriate approach, as it addresses our problem directly without the need of transforming it into a different kind of situation.
\subsection{General Purpose Algorithms}
Even though we are specifically aiming to perform sequence mining, there are several general purpose algorithms which could be used in our project with the necessary changes. Algorithms such as \emph{Apriori}~\cite{kumar2010implementation} could, for example, be used after a pertinent data transformation. However, these additional steps would require significant extra work, and therefore we will look for alternative algorithms which can handle these issues on their own.
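For reference, this is a minimal sketch of how Apriori is typically invoked in R through the \emph{arules} package; the file \texttt{trans.csv} and its basket format are assumptions.
\begin{verbatim}
# Hedged sketch: frequent itemsets and rules with Apriori via the
# arules package. The file "trans.csv" (one comma-separated basket
# per line) is an assumption.
library(arules)

trans <- read.transactions("trans.csv", format = "basket", sep = ",")
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8))
inspect(head(sort(rules, by = "confidence"), 5))
\end{verbatim}
Note that the transactions carry no temporal information; encoding it is precisely the data transformation mentioned above.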
\subsubsection{Neural networks}
An alternative approach for our problem is to treat it as a \emph{classification} issue instead of an \emph{association} one. This can be achieved simply by defining the current situation as the item to classify, and the categories as all the possible future situations. The set of today's events will therefore fall into one category which indicates which events are likely to happen tomorrow.
\emph{Neural Networks}~\cite{rojas1996neural} are amongst the most popular tools for this kind of problem. Neural networks are a computational tool modelled on the interconnection of neurons in the nervous systems of the human brain and other organisms. They are a type of non-linear processing system that is ideally suited for a wide range of tasks, especially tasks for which no existing algorithm is available. They can be trained to solve certain problems using a teaching method and sample data, so that identically constructed networks can perform different tasks depending on the training received. With proper training, neural networks are capable of generalization: the ability to recognize similarities among different input patterns, especially patterns that have been corrupted by noise.
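As a rough sketch of this approach in R using the \emph{nnet} package, assuming a hypothetical data frame \texttt{days} with binary indicators for today's events and a factor \texttt{tomorrow} naming the event expected on the following day:
\begin{verbatim}
# Minimal sketch: next-day event prediction as classification with
# the nnet package. The data frames "days" and "today" and their
# columns are hypothetical.
library(nnet)

# One hidden layer with 10 units; "tomorrow" is a factor whose
# levels are the possible next-day events.
fit <- nnet(tomorrow ~ ., data = days, size = 10, maxit = 200)

# Predicted next-day event for a new day's indicators.
predict(fit, newdata = today, type = "class")
\end{verbatim}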
\subsection{Sequence Mining Algorithms}
In this section we will analyse some of the existing algorithms designed for the specific purpose of mining sequences. These algorithms will likely require much less adaptation and preliminary work in order to be applicable to our project, as their purpose is the same as, or very similar to, our goals.
\subsubsection{The SPADE Algorithm}
\label{sec:the_spade_algorithm}
One algorithm for frequent sequence mining is SPADE~\cite{zaki2001spade} (Sequential PAttern Discovery using Equivalence classes). It uses a vertical id-list database format, where each sequence is associated with a list of the objects in which it occurs. Frequent sequences can then be found efficiently using intersections on id-lists. The method also reduces the number of database scans, and therefore the execution time.
The first step of SPADE is to compute the frequencies of 1-sequences, which are sequences with only one item. This is done in a single database scan. The second step consists of counting 2-sequences, by transforming the vertical representation into a horizontal representation in memory and counting the number of sequences for each pair of items using a two-dimensional matrix. Therefore, this step can also be executed in a single scan.
Subsequent n-sequences can then be formed by joining (n-1)-sequences using their id-lists. The size of an id-list is the number of sequences in which an item appears; if this number is greater than the minimum support (\emph{minsup}), the sequence is frequent. The algorithm stops when no more frequent sequences can be found, and can use either a breadth-first or a depth-first search method for finding new sequences.
One implementation of the SPADE algorithm is the cSPADE algorithm, which additionally allows for timing constraints and is available as an R library~\cite{zaki2000cspade}.
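As an illustration, the following sketch uses the \emph{arulesSequences} package, which provides a cSPADE implementation; the file \texttt{events.txt} and its layout (sequence id, event time, basket size, items) are assumptions.
\begin{verbatim}
# Hedged sketch: frequent sequence mining with cSPADE through the
# arulesSequences package. The file "events.txt" and its layout
# are assumptions.
library(arulesSequences)

events <- read_baskets("events.txt", sep = " ",
                       info = c("sequenceID", "eventID", "SIZE"))

# Frequent sequences with 40% minimum support and a timing
# constraint: at most 2 time units between consecutive elements.
seqs <- cspade(events, parameter = list(support = 0.4, maxgap = 2))
inspect(seqs)

# Temporal association rules induced from the frequent sequences.
rules <- ruleInduction(seqs, confidence = 0.6)
inspect(rules)
\end{verbatim}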
\subsubsection{Timeweaver}
A particular type of algorithm which can be very useful for our purposes is the genetic algorithm~\cite{goldberg1989genetic}. Specifically, a problem similar to ours has already been approached in previous research through the implementation of the \emph{Timeweaver} algorithm~\cite{weiss1999timeweaver}. This implementation is not publicly available, but there are several tools for the implementation of custom genetic algorithms which could be used.
Although this is a very good option for approaching our project, it requires more implementation work than other algorithms such as cSPADE (see section~\ref{sec:the_spade_algorithm}).
\subsubsection{The Eclat Algorithm}
The Eclat algorithm~\cite{zaki1997new} is used to perform itemset mining, which lets us find frequent patterns in data, such as the observation that a consumer who buys milk also buys bread. Patterns of this type are called association rules and are used in many application domains.
The basic idea of the Eclat algorithm is to use tidset intersections to compute the support of a candidate itemset, avoiding the generation of subsets that do not exist in the prefix tree.
This algorithm could be used in our context by treating events as items: if event \emph{A} happens, it is frequent that \emph{B} also happens. However, as with the shopping basket example, we would only find relations between items in the same basket; we would therefore find events which are likely to happen together, but could not predict them over a time window. Although this could still be achieved by encoding, for example, today's and tomorrow's events differently (\emph{A1} meaning \emph{A today} while \emph{A2} means \emph{A tomorrow}), this adds significant complexity, and since there are solutions which already contemplate these issues (see section~\ref{sec:the_spade_algorithm}), this algorithm is not very appropriate for our purposes.
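For completeness, a minimal sketch of Eclat in R via the \emph{arules} package; the file name and basket format are again assumptions.
\begin{verbatim}
# Minimal sketch: frequent itemset mining with Eclat via arules.
# The file "trans.csv" (one basket per line) is an assumption.
library(arules)

trans <- read.transactions("trans.csv", format = "basket", sep = ",")
itemsets <- eclat(trans, parameter = list(supp = 0.05, maxlen = 4))
inspect(head(sort(itemsets, by = "support"), 5))

# Association rules can then be induced from the itemsets.
rules <- ruleInduction(itemsets, trans, confidence = 0.6)
\end{verbatim}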
\subsubsection{The FP-Growth Algorithm}
The \emph{FP-Growth} algorithm, proposed by Han in \cite{han2000mining}, is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure, called the frequent-pattern tree (FP-tree), to store compressed and crucial information about frequent patterns. In his study, Han showed that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori algorithm~\cite{agrawal1994fast} and TreeProjection~\cite{agarwal2001tree}. Later works~\cite{kumar2010implementation} showed that FP-Growth also performs better than other methods, including Eclat~\cite{zaki1997new}.
Although the FP-Growth algorithm is very efficient and one of the most popular association mining methods, it does not provide tools to apply temporal constraints in the simple and efficient way that cSPADE does (see section~\ref{sec:the_spade_algorithm}).
\section{The R language}
\label{sec:the_r_language}
For the implementation and execution of the aforementioned methods and algorithms, it is necessary to choose a development environment. Several options are available for Data Mining purposes and, furthermore, the algorithms are usually already implemented as libraries in several languages. However, we have found a language of special interest for our purposes: \emph{the R language}~\cite{ihaka1996r}.
R is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis\cite{vance2009data,quick2010statistical}.
As a development environment, the \emph{RStudio}\cite{racine2012rstudio} IDE provides a very comfortable environment for R development. Furthermore, it has a server version which allows for easy computation on remote data mining servers.
\section{The Rule Engine}
It is important to distinguish between the development environment, for which we chose the R language and RStudio (see section~\ref{sec:the_r_language}), and the final system itself, which does not need to be implemented with, or rely on, the same technologies used for knowledge discovery.
For the implementation of the final prototype, it is convenient to use an already existing rule engine~\cite{liang2009openrulebench}: such engines are significantly complex, so a custom implementation is less advisable than relying on an existing, mature one.
Due to the environment in which the prototype is to be deployed, we are required to use Java for its implementation. One of the most popular rule engines for Java environments is the \emph{Drools Expert}~\cite{browne2009jboss,proctor2007relational} framework, which we will use for the implementation of the prototype.