
% !TEX root = ThesisGchatzi.tex

\chapter{Conclusions and Future Work}
\label{chap:conclusions}

In this thesis, we focused on scalable hybrid indexing, querying and exploration of big time series data, geolocated or not. We also developed a framework for efficient data analysis and knowledge extraction from Big Data. Section~\ref{sec:conc} concludes this work, while Section~\ref{sec:future} presents our main directions for future work.

\section{Conclusions}
\label{sec:conc}
Our main focus lies on \textit{geolocated} time series, a novel data type that combines time series data with a spatial extent. First, we introduced the \btsr index, a hybrid index for geolocated time series. Then, we proposed a variety of new hybrid queries that utilize \btsr to apply thresholds in both the spatial and the time series domain. To efficiently explore large datasets of geolocated time series, we developed two summarization approaches for visual exploration, named bundle and tile map summary. Next, we introduced the measure of local similarity, which considers two time series similar if the pairwise distance of their values per timestamp does not exceed a given threshold during a pre-defined time interval. Based on this new measure, we introduced two approaches for pair and bundle discovery on time series datasets. We also used local similarity for a new type of hybrid similarity search on large geolocated time series data, using the \btsr index and a modified version of it, named \sbtsr. Finally, we developed a novel distributed framework for analytics, named FML-$k$NN. The framework applies $k$NN joins on Big Data of various data types to allow efficient mining and knowledge extraction.

In more detail, regarding hybrid indexing and querying:
\begin{itemize}
	\item We introduced the \textit{\tsr}, an extension of the R-tree spatial index. In the \tsr, each node is augmented with additional information corresponding to the bounds of the time series contained in its subtree, in addition to the standard \emph{Minimum Bounding Rectangle} (MBR), denoting the spatial bound of its contents. Maintaining both kinds of bounds in each node allows us to prune the search space simultaneously in the spatial and in the time series dimension while traversing the index. Thus, the number of required node accesses is significantly reduced, since we only retrieve the contents of nodes that may actually contain objects satisfying both types of predicates.
	\item We proposed the \textit{\btsr}, an optimized variant of \textit{\tsr}, whose node entries have more refined bounds, obtained by bundling together similar time series. This allows us to compute and maintain tighter bounds for each individual bundle, hence increasing the pruning effectiveness. To allow for a larger number of bundles in nodes at higher levels in the tree hierarchy, we exploit \emph{Piecewise Aggregate Approximation} \cite{keogh2001paa,faloutsos2000vldb} to trade off between the number of bundles and the resolution of the bounding time series for each bundle.
	\item We utilized \btsr to answer a variety of hybrid similarity queries on large geolocated time series datasets. To do so, we leveraged its hybrid indexing potential, allowing for more aggressive pruning in the spatial and time series domains simultaneously.
	\item We introduced the hybrid similarity join query, which retrieves pairs of geolocated time series from two datasets such that both the distance between their locations and the distance between the time series themselves do not exceed given thresholds. We utilized the \btsr index to speed up the computations and, since similarity join on time series is an inherently expensive procedure, we further proposed a space-driven data partitioning scheme that enables a parallel and distributed approach for hybrid similarity joins. Our method leverages hybrid indexing methods to efficiently handle similarity join queries locally within each partition. This is then combined with an optimization that minimizes the amount of data transferred between worker nodes at query time without false misses.
	\item We evaluated all the above on several real-world and synthetic datasets, assessing various metrics, such as node accesses, index size and build time, execution time and scalability.
\end{itemize}
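
The join predicate and the pruning test summarized above can be restated compactly. The notation below is introduced here only for illustration and is an assumption of this sketch; it need not coincide with the notation of the earlier chapters. Let a geolocated time series be $T = (\ell_T, \langle t_1, \dots, t_n \rangle)$, with location $\ell_T$ and values $t_i$, and let $d_{sp}$ and $d_{ts}$ denote the spatial and the time series distance, with thresholds $\epsilon_{sp}$ and $\epsilon_{ts}$. The hybrid similarity join of two datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ then returns
\[
\{ (T, T') \in \mathcal{D}_1 \times \mathcal{D}_2 \;:\; d_{sp}(\ell_T, \ell_{T'}) \le \epsilon_{sp} \;\wedge\; d_{ts}(T, T') \le \epsilon_{ts} \}.
\]
Assuming, for the sake of the example, that $d_{ts}$ is the Euclidean distance, a node (or bundle) with lower and upper time series bounds $\langle l_1, \dots, l_n \rangle$ and $\langle u_1, \dots, u_n \rangle$ can be skipped for a query series $Q = \langle q_1, \dots, q_n \rangle$ whenever
\[
\sqrt{\sum_{i=1}^{n} \max(0,\; q_i - u_i,\; l_i - q_i)^2} \;>\; \epsilon_{ts},
\]
since the left-hand side never exceeds the distance between $Q$ and any time series enclosed by the bounds. The spatial test is analogous and compares the minimum distance between the query location and the MBR of the node against $\epsilon_{sp}$. For reference, the standard Piecewise Aggregate Approximation of a series of length $n$ into $m$ segments (assuming $m$ divides $n$) is given by the segment means
\[
\bar{t}_j = \frac{m}{n} \sum_{i = \frac{n}{m}(j-1)+1}^{\frac{n}{m} j} t_i, \qquad j = 1, \dots, m.
\]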

Regarding our approaches on geolocated time series visual exploration:
\begin{itemize}
	\item We introduced two geolocated time series summarization approaches for visual exploration, named \textit{bundle} and \textit{tile map summary}. These are supported and driven by two appropriate hybrid indices that speed up the result computation, providing efficient exploration of geolocated time series data. They consist of a spatial and a time series summary that jointly facilitate knowledge extraction and gaining insight. The spatial summary is similar for both and consists of MBRs of geolocated time series, according to a specific predicate (i.e., spatial proximity or time series similarity). Each MBR is associated with a counter denoting the number of time series it contains.
	\item The bundle summary consists of sets of MBTS, where each MBTS is a band with upper and lower bounds that encloses all time series of a set, thus conveying the range of the time series values along the time axis. To provide prompt visualizations of summaries over geolocated time series data and to minimize latency when drawing the relevant graphic elements, we need early access to both spatial and time series information while traversing the index. For this purpose, we adapted our \btsr index to also include {\em aggregates} per node, i.e., the number of time series pertaining to each bundle. Subsequently, we introduced a new traversal algorithm for efficient retrieval of a given number of bundles that are the most representative in the map area.
	\item The tile map summary is driven by \hisax, a hybrid index we introduced. It constitutes a hybrid variant of the \isax index, augmented with spatial attributes of its nodes' children, to combine spatial and time series information. In each node, besides the SAX word that describes all its children's time series, \hisax also keeps the MBR that they form. To minimize the size and overlap of the MBRs, we proposed a spatial splitting policy that, instead of choosing the splitting dimension in a round-robin fashion (as in \isax), selects the dimension that produces the smallest overlap and overall size of the two generated MBRs. We introduced a traversal algorithm for applying timebox search on large (both vertically and horizontally) geolocated time series datasets. The traversal algorithm is applied on our \hisax index and returns a tile map-like summary of the qualifying geolocated time series, by taking advantage of the SAX representation's properties.
	\item We evaluated the efficiency, scalability and accuracy of our methods using real-world and synthetic datasets. We also assessed the quality of the information they provide, through mock-up visualization examples.
\end{itemize}
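
As a brief illustration of the band that underlies the bundle summary (the symbols are assumptions of this sketch and are used here only for illustration): for a set $\mathcal{T}$ of time series of equal length $n$, the tightest band enclosing all of them is given pointwise by
\[
l_i = \min_{T \in \mathcal{T}} t_i, \qquad u_i = \max_{T \in \mathcal{T}} t_i, \qquad i = 1, \dots, n,
\]
so that $l_i \le t_i \le u_i$ holds for every $T \in \mathcal{T}$ and every timestamp $i$. For example, for the two series $\langle 1, 4, 2 \rangle$ and $\langle 3, 2, 5 \rangle$, the lower bound is $\langle 1, 2, 2 \rangle$ and the upper bound is $\langle 3, 4, 5 \rangle$.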

In the field of pair/bundle discovery and local similarity search:
\begin{itemize}
	\item We introduced the measure of \textit{local similarity}, which can be applied on co-evolving (i.e., time-aligned) time series. Two co-evolving time series are locally similar if the pairwise distance of their values per timestamp does not exceed a given threshold during a time interval that lasts at least a pre-defined number of consecutive timestamps.
	\item Based on local similarity, we introduced two methods for pair and bundle discovery on co-evolving time series datasets. Since discovering all possible pairs and bundles of locally similar time series within large sets is a computationally expensive process, we employed a value discretization approach that divides the value axis into ranges equal to the value difference threshold, in order to reduce the number of candidate pairs or bundles that need to be checked per timestamp. We also introduced a more aggressive filtering that only checks at selected \textit{checkpoints} across time, while ensuring that no false negatives ever occur. To further reduce the number of examined candidates, we proposed a strategy that judiciously places these checkpoints across the time axis in a more efficient manner.
	\item We extended our previous approach on hybrid queries over geolocated time series to support local similarity, thus allowing more flexible and fine-grained queries and analyses. We introduced the \textit{local similarity score} between two time series, which is defined as the maximum number of consecutive timestamps during which their respective values do not differ by more than a user-specified threshold. For evaluating such queries, we employed the \btsr index. To further enhance the evaluation performance, we introduced an improvement to the \btsr index, named \sbtsr. It is based on temporally segmenting the time series bounds within each node and deriving tighter bounds per segment. Once the time series bounds in each node become more fine-grained, pruning the search space for local similarity queries proves much more effective.
	\item We evaluated the efficiency and scalability of our methods in terms of execution time, using real-world and synthetic datasets.
\end{itemize}
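
The definitions summarized above can be written as follows. The symbols $\epsilon$, $\delta$ and $\sigma$ are chosen here for illustration and are assumptions of this sketch, as is the use of the absolute difference as the per-timestamp distance. Two co-evolving time series $T = \langle t_1, \dots, t_n \rangle$ and $T' = \langle t'_1, \dots, t'_n \rangle$ are locally similar with respect to a value threshold $\epsilon$ and a minimum duration $\delta$ if there is an interval $[s, e] \subseteq [1, n]$ with $e - s + 1 \ge \delta$ such that
\[
|t_i - t'_i| \le \epsilon \quad \mbox{for all } i \in [s, e].
\]
The local similarity score is the length of the longest such run,
\[
\sigma_{\epsilon}(T, T') = \max \{\, e - s + 1 \;:\; 1 \le s \le e \le n, \;\; |t_i - t'_i| \le \epsilon \mbox{ for all } s \le i \le e \,\},
\]
taken to be $0$ if no timestamp qualifies. For instance, for $T = \langle 1, 2, 3, 7 \rangle$, $T' = \langle 1.5, 2.2, 3.4, 2 \rangle$ and $\epsilon = 0.5$, the per-timestamp differences are $0.5$, $0.2$, $0.4$ and $5$, hence $\sigma_{\epsilon}(T, T') = 3$. The discretization filter relies on the observation that $|a - b| \le \epsilon$ implies $\left| \lfloor a / \epsilon \rfloor - \lfloor b / \epsilon \rfloor \right| \le 1$, so only values falling in the same or in adjacent ranges need to be compared. Likewise, if checkpoints are placed every $\delta$ timestamps (one possible placement, assumed here), every interval of at least $\delta$ consecutive timestamps contains a checkpoint, which is why examining candidates only at checkpoints cannot miss a qualifying pair.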

Finally, regarding scalable $k$NN joins:
\begin{itemize}
	\item We introduced FML-$k$NN, a framework of methods for scalable management, analysis and mining on Big Data collections. The framework implements a probabilistic classifier and a regressor. Specifically, we introduced a MapReduce-based version of $k$NN joins, which reduces file operations for large amounts of data and is uniquely initialized upon launch. Our approach is unified in a single session to reduce space occupation and cluster overloading.
	\item We evaluated our framework on real-world and synthetic datasets against similar approaches, showing that the proposed method achieves high prediction precision and better scalability, while providing useful knowledge extraction capabilities.
\end{itemize}

\section{Future Work}
\label{sec:future}
In the following, we outline several possible future directions for the work presented in this thesis.

\begin{itemize}
	\item We plan to extend the indexing capabilities of \btsr to multi-dimensional feature spaces on distributed processing frameworks, and also to explore adaptivity to query workloads.
	\item Regarding visual exploration of geolocated time series, we will research and support more detailed visual analytics and identify more fine-grained patterns. An interesting direction would be to support drilling down into a particular summarized result to discover whether there are differences in the distributions of its constituent, more detailed patterns, in both the spatial and the time series domain. Moreover, we will focus on supporting more complex time series distance measures that may boost the quality of our summaries.
	\item For pair and bundle discovery, we plan to further improve the scalability of our algorithms to extend their applicability to very large time series datasets, in terms of both cardinality and length.
	\item Regarding local similarity search, we plan to study the applicability of \sbtsr to various other hybrid query types, enlarging its potential in geolocated time series exploration.
	\item Finally, we will perform extended case studies using FML-$k$NN on more datasets from various sources, in order to establish our framework's ability to perform ad-hoc data mining tasks. Furthermore, we will explore its applicability to data stream mining applications, where the input is a continuous flow of data records. We will also enhance our framework's knowledge discovery capacity by extending it with more distributed machine learning approaches, in an attempt to raise its potential in the continuously growing field of Big Data analytics.
\end{itemize}