\documentclass[twocolumn,10pt]{article}
\usepackage[a4paper,hmargin=1.5cm,vmargin=2cm,]{geometry}
\setlength{\columnsep}{0.7cm}
\usepackage{palatino}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{minted}
\usepackage{balance}
\usepackage{dblfloatfix}
\usepackage[numbers]{natbib}
\usemintedstyle{colorful}
\usepackage[font=small,labelfont=bf]{caption}
\urlstyle{same} % no tt font for URLs
% Fix URL // kerning:
% https://www.joachim-breitner.de/blog/519-Nicer_URL_formatting_in_LaTeX
\makeatletter
\let\UrlSpecialsOld\UrlSpecials
\def\UrlSpecials{\UrlSpecialsOld\do\/{\Url@slash}\do\_{\Url@underscore}}%
\def\Url@slash{\@ifnextchar/{\kern-.11em\mathchar47\kern-.2em}%
{\kern-.0em\mathchar47\kern-.08em\penalty\UrlBigBreakPenalty}}
\def\Url@underscore{\nfss@text{\leavevmode \kern.06em\vbox{\hrule\@width.3em}}}
\makeatother
\newcommand{\blockade}{\rule{3em}{0.7em}} %% Marker for things to change before submission
\newcommand{\fixme}[1]{[ \blockade FIXME: #1]}
\usepackage{natbib}
\bibliographystyle{sn-vancouver}
\setcitestyle{aysep={}}
\usepackage[dvipsnames]{xcolor}
\hypersetup{
colorlinks,
linkcolor={blue!50!black},
citecolor={blue!50!black},
urlcolor={blue!50!black}
}
\renewcommand{\dbltopfraction}{0.7}
\renewcommand{\textfraction}{0.05}
\renewcommand{\bottomfraction}{0.1}
\newcommand{\Supplement}{\href{https://anders-biostat.github.io/lc-paper/}{Supplement}}
\newcommand{\Supplementary}{\href{https://anders-biostat.github.io/lc-paper/}{Supplementary}}
\begin{document}
\setcounter{secnumdepth}{0}
\twocolumn[{%
\centering
\textbf{\Large Simple but powerful interactive data analysis in R with R/LinkedCharts}

\vspace{1.5ex}

Svetlana Ovchinnikova and Simon Anders\\
{\footnotesize Center for Molecular Biology of the University of Heidelberg, Germany}

\vspace{1.5ex}

version 3 (December 2023)

\vspace{6ex}
}]
\section{Abstract}
In research involving data-rich assays, exploratory data analysis is a crucial step. Typically, this involves jumping back and forth between visualizations that provide an overview of the whole data set and others that dive into details. For example, it might be helpful to have one chart showing a summary statistic for all samples, while a second chart provides details whenever a data point is selected in the first chart. We present R/LinkedCharts, a framework that renders this task radically simple: only a few lines of simple R code are needed to obtain complex and general visualizations, which can later be polished to provide interactive data access of publication quality.
\section{Background}
Effective data visualization has been crucial for scientific success since the first quantitative experiments. Yet the amount and complexity of the available data have grown continuously over the last decades, while there are limits to how much information one can learn from a static image \citep{hegarty_2011}. Excessive details and multiple overlapping layers make it harder to grasp the crux of a plot. One solution to the problem is to come up with ever more creative and elaborate types of plots; the other is to add an extra dimension by employing interactivity. The first attempts at the latter started in the 1970s \citep{newman_1979, becker_1987}, and by now interactivity is applied ubiquitously, not only in science but also in marketing, journalism and any other field where there is a need to communicate data-based knowledge to an audience.

Interactive figures are engaging. They allow users to observe data from multiple self-chosen angles, making the conveyed message more credible. They also bolster data exploration, sparing researchers the necessity of handpicking presumably important parts of the data in advance. Therefore, we believe that further integration of interactive tools into a researcher's routine can significantly improve the quality of research.

Numerous tools \citep{caldarola_2017} now provide means of interactive inspection for many specific types of data. Examples from biology include metabolic maps \citep{noronha_2017}, genome assemblies \citep{wick_2015}, scRNA-Seq or other kinds of omics data \citep{hillje_2020, rue_2018}, QTL data \citep{broman_2015} and many more. While such solutions are each tailored for one very specific type of data, there are also a number of general low-level frameworks to create interactive apps, such as D3 \citep{bostock_2011} and Vega-Lite \citep{satyanarayan_2015}, and more high-level but still general-purpose packages, such as Vega \citep{satyanarayan_2016}, Shiny, BPG \citep{p_2019}, plotly, Bokeh and Observable Plot.

A crucial part that is used in many of the special-purpose solutions is the ``linking'' of charts: the user's click on, e.g., a data point in one chart (the overview chart) causes details on this very data item to be shown in another chart (the details chart) \citep{buja_1991}. Even though such linking is often what makes interactive tools for specialized purposes useful, functionality for linking in general is missing from most general-purpose frameworks. For technical reasons, such functionality is, if at all, only offered by frameworks for web programming in JavaScript -- which is most unsatisfactory for bioinformatics, a field where most work is done using R and Python.

In this paper, we present R/LinkedCharts, an R package that makes it very simple to produce linked interactive plots by providing convenient R wrappers around a core built using D3. We will first review the concept of chart linking and explain why it is of so much value, especially for bioinformatics data analysis, and then demonstrate the versatility of our framework and justify our design decisions. We end with a discussion of what sets our approach apart from earlier work.
\section{Results and Discussion}
\subsection{Linking charts}
As its name suggests, the central concept of LinkedCharts is linking and focusing \citep{buja_1991}: one can connect two or more plots such that interacting with one of them affects what is displayed in the others. We illustrate the concept of linking charts with a simple example based on data from \citet{conway_2015}.

In that study, three samples were taken from each of 17 patients with oral cancer: of normal, cancerous, and dysplastic tissue. mRNA from all these samples was sequenced to obtain gene expression values. The goal was to find genes that are differentially expressed between the tissue types -- a standard task in bioinformatics, readily addressed using available software tools \citep{ritchie_2015, love_2014}. Here, we have used the function \mintinline{R}{voom} from the ``limma'' package \citep{law_2014} to compare normal and cancerous tissues. It is common to visualize such a comparison with an MA plot \citep{dudoit_2002}, where each dot represents a gene, showing the gene's average expression on the X-axis and the log fold change between the two groups on the Y-axis (Fig \ref{FigD}A). Red dots correspond to genes that are considered significantly different between the two conditions (adjusted p-value $<$ 0.1).
\begin{figure*}
\centering\includegraphics[width=.9\textwidth]{FigD/figD.png}
\caption{An example for two linked charts, based on a study by \citet{conway_2015} comparing cancerous and normal tissues from 19 patients. The MA plot (A) shows all genes with their average expression on the X-axis and log$_2$-fold change between tumour and normal on the Y-axis. Red indicates genes where the difference was reported as significant by the ``voom'' method \citep{law_2014}. The plot to the right (B) shows, for one selected gene (here, LAMB4), the individual expression values (as counts per million, CPM) for each sample. This figure is a screenshot of a LinkedCharts app, the live version of which is provided in the \Supplement{} (as Interactive Supplementary Figure 1): When the user clicks on any point in the MA plot (A), the expression plot (B) changes, showing the selected gene. Thus, one can rapidly gain an impression of the details hidden in a summarizing plot like the MA plot.}
\label{FigD}
\end{figure*}
About these genes, one may now wonder: What does the difference in expression look like for every single patient? Is it consistent across all the patients or only detected in some of them? Are there any artifacts or outliers that cause the p-value to be too small?

To investigate such questions, we can add another plot that shows expression values (as ``counts per million'', CPM) from each individual sample (Fig \ref{FigD}B). While this second plot can show expression for only one selected gene at a time, the \emph{linking} between the two charts overcomes this limitation: In our implementation, a mouse click on a point in the MA plot causes the plot to the right to switch to displaying the expression values for the gene thus selected. Fig \ref{FigD} depicts our LinkedCharts app, while a live version of the app is provided in this paper's online \Supplement{} (\url{https://anders-biostat.github.io/lc-paper/}) -- and we encourage the reader to pause for a moment and try it out there.
\begin{figure}[bh!]
\begin{minipage}{\columnwidth}
\begin{minted}[xleftmargin=20pt, linenos, highlightlines={2,9-12,17}]{R}
openPage(layout = "table1x2")
gene <- 1

lc_scatter(dat(
    x = AveExpr,
    y = tissuetumour,
    colour = ifelse(adj.P.Val < 0.1,
                    "red", "black"),
    on_click = function(k) {
      gene <<- k
      updateCharts("A2")
    }),
  "A1", with = voomResult)

lc_scatter(dat(
    x = patient,
    y = normCounts[gene, ],
    colourValue = tissue,
    logScaleY = 10),
  "A2", with = sampleTable)
\end{minted}
\end{minipage}
\caption{Code for generating Figure \ref{FigD}. See text for details.}
\label{listing}
\end{figure}
\begin{figure*}[t]
\centering\includegraphics[width=.8\textwidth]{FigB/figB.png}
\caption{Typical syntax of an R/LinkedCharts plot with comparison to the ``ggplot2'' \citep{wickham_2016} package, one of the most widely used plotting libraries. Lines of the code are arranged to put the same aspects of the charts next to each other. The ``iris'' dataset, one of the built-in example datasets of R, was used here. Both pieces of code are complete and fully functional, and their output is shown above the code.}
\label{FigB}
\end{figure*}
\begin{figure*}
\includegraphics[width=\textwidth]{FigA/figA.png}
\caption{Gallery of all available plotting functions in the ``rlc'' package. A scatter plot (A); a bee swarm plot (based on the d3-beeswarm plugin of \citet{lebeau_2017}) (B); a collection of various lines (C); a histogram and a density plot (D); a heatmap (E); a bar chart (F); a collection of interactive elements to gather input from the user (G); functions to add custom HTML code and static plots to the page (H). All these examples with code to create them can be found in the \Supplement.}
\label{FigA}
\end{figure*}
Of course, there are already several tools available for exploring the data from differential-expression assays (e.g., iSEE \cite{rue_2018}), and these may or may not fit the needs of a specific analysis. With R/LinkedCharts, we offer the building blocks to build such an app with minimal effort ``from scratch'', while giving the analyst the flexibility to generate arbitrary plots and arbitrary linkages.

In fact, the minimal code to set up this app takes only the few lines shown in Figure \ref{listing}. (The full code, i.e., including the code for loading the data and running limma/voom, is given in the online \Supplement.)

The two \mintinline{R}{lc_scatter} calls set up the two scatter charts shown in Figure \ref{FigD}. In the left-hand chart (``A1''), each point depicts a gene, and the x and y coordinates and the point color are taken from the indicated columns of \mintinline{R}{voomResult}, the results table provided by the limma/voom differential-expression tool. Similarly, the right-hand chart (``A2'') takes its data from the sample table, and the y axis from the matrix of normalized read counts (that was also used as input to limma/voom).

The lines highlighted in blue cause the linking: In Line 2, we introduce a global variable, \mintinline{R}{gene}, which stores the index of the gene to be shown in the right-hand plot. This index tells the chart which row of the \mintinline{R}{normCounts} matrix (where the normalized counts are stored) to use as \emph{y} values of the expression plot (Line 17). Almost every chart type in R/LinkedCharts has the \mintinline{R}{on_click} argument, which allows the user to define a function that is called whenever someone clicks on an element of the plot (point, line, cell of a heatmap, etc.) and is passed the index of the clicked element (\mintinline{R}{k}). Here, our callback function simply changes the value of \mintinline{R}{gene} to the index of the clicked point (Line 10). Then, we tell R/LinkedCharts to update the second plot (Line 11; ``A2'' is its ID, set in Line 20). Updating means that the package will reevaluate all arguments inside the \mintinline{R}{dat()} function and redraw the chart accordingly. In our case, a new value of \mintinline{R}{gene} will yield new \emph{y} values for the expression plot.
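
Because the listing in Figure \ref{listing} relies on objects computed beforehand (\mintinline{R}{voomResult}, \mintinline{R}{normCounts}, \mintinline{R}{sampleTable}), it may help to try the same linking pattern on simulated data. The following is only a minimal sketch under stated assumptions: the data are random stand-ins rather than the oral-cancer data set, and the names \mintinline{R}{counts}, \mintinline{R}{summaryTable}, \mintinline{R}{avgExpr}, \mintinline{R}{logFC} and \mintinline{R}{sampleId} are hypothetical. The R/LinkedCharts calls are the ones used in the listing; they are wrapped in \mintinline{R}{interactive()} because the app needs a browser.

\begin{minted}{R}
# Simulated stand-in data (an assumption for illustration only):
# 200 "genes" measured in 12 "samples" from two groups
set.seed(1)
nGenes <- 200
group <- rep(c("normal", "tumour"), each = 6)
counts <- matrix(rlnorm(nGenes * 12, meanlog = 3), nrow = nGenes)
# Make the first 20 genes differ between the groups
counts[1:20, group == "tumour"] <- counts[1:20, group == "tumour"] * 4

# Per-gene summaries, analogous to the axes of an MA plot
logCounts <- log2(counts)
summaryTable <- data.frame(
  avgExpr = rowMeans(logCounts),
  logFC = rowMeans(logCounts[, group == "tumour"]) -
    rowMeans(logCounts[, group == "normal"]))
sampleTable <- data.frame(sampleId = seq_along(group), group = group)

gene <- 1  # index of the gene shown in the details chart

if (interactive()) {
  library(rlc)
  openPage(layout = "table1x2")
  # Overview chart: one point per gene
  lc_scatter(dat(
      x = avgExpr,
      y = logFC,
      on_click = function(k) {
        gene <<- k          # remember the clicked gene ...
        updateCharts("A2")  # ... and redraw the details chart
      }),
    "A1", with = summaryTable)
  # Details chart: all samples of the currently selected gene
  lc_scatter(dat(
      x = sampleId,
      y = counts[gene, ],
      colourValue = group,
      logScaleY = 10),
    "A2", with = sampleTable)
}
\end{minted}

In this sketch, clicking one of the first 20 genes in the left-hand chart should show the simulated group difference in the right-hand chart, whereas most other genes should show none.
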
This simple logic is not limited to just two plots, but provides a basis to create many simple and complex apps. In the following, we will showcase a few more examples. The paper's online \Supplement{} contains live versions of all these apps, as well as the full code to generate the apps and links to the necessary data files, allowing the reader to immediately get the app in their R session and experiment with it. For all examples, we provide two versions of the code: a minimal one with only the essential parameters needed to make the app functional, and a more extended one with custom colors, labels, etc. In the paper, we focus only on the minimal code.

Even more examples can be found in our online tutorial at \url{https://anders-biostat.github.io/linked-charts/rlc/tutorials}, including examples dealing with the exploration of single-cell sequencing data.
\subsection{Event handling in R}
In the simple example just discussed, the ability to link the two charts is what made the app useful, and what sets R/LinkedCharts apart from other solutions for R, such as Shiny. A short technical detour might therefore be in order to explain why linking is non-trivial. Here, we first have to clarify that virtually all interactive visualization frameworks leverage the power of browser engines for HTML5 and JavaScript: the actual app is displayed by a browser. Therefore, a user interaction, such as a mouse click, is handled by the browser, and any custom event handler has to be specified in the language that the browser understands, i.e., JavaScript.

If the event handler is to be written in R (such as our \mintinline{R}{on_click} function), the framework must provide specific functionality to connect that R code with the JavaScript code running in the browser in a manner that preserves all details of the user-interaction event. The difficulty in doing so explains why, so far, only native-JavaScript frameworks like D3 and Observable Plot offer linking, while R- and Python-based frameworks (Shiny, Plotly, Bokeh) are (despite recent progress) still very limited with respect to offering custom event handling without having to revert to JavaScript.

The possibility to write custom event handlers in JavaScript is insufficient if the user interaction should trigger a complex calculation that the analyst has already coded in their usual language of choice, here presumably R. This is the gap that R/LinkedCharts fills.

For details on how R/LinkedCharts makes it possible to write event handlers in native R, see the Methods section.
\subsection{Basic syntax, chart types, and HTML5 integration}
We aimed to make R/LinkedCharts simple and familiar to any user with at least some basic knowledge of R. Every chart has a set of properties to define each of its specific aspects. In the previous example, we set the properties \mintinline{R}{x}, \mintinline{R}{y} and \mintinline{R}{colour}, which received vectors of coordinates and colors to specify the scatter plots' data points. This principle will be familiar to most users from other plotting libraries. For example, Figure \ref{FigB} shows a comparison of the syntax in R/LinkedCharts (``rlc'' package) and ggplot (from the widely used ``ggplot2'' package \citep{wickham_2016}) for a simple scatter plot. Lines are arranged to match the same aspects of the plots; above each code block, its output is shown. One can see that the input data structure is identical, and there is hardly any difference in syntax between the two.
R/LinkedCharts is not limited to scatter plots. There are 15 main functions in the ``rlc'' package, each generating a specific type of plot (such as scatter plot, heatmap, bar plot, etc.) or a navigation element (such as sliders or text fields). Figure \ref{FigA} shows them all. Each plot is defined by its properties: some of them are required (such as \mintinline{R}{x} and \mintinline{R}{y} for a scatter plot or \mintinline{R}{value} for a heatmap), others are optional (\mintinline{R}{palette}, \mintinline{R}{title}, \mintinline{R}{ticks} etc.). A full list of all the properties with live examples is available at \url{https://anders-biostat.github.io/linked-charts/rlc/tutorials/props.html} and also on the R man page of each plotting function. For each chart type, event handlers (such as the \mintinline{R}{on_click} function already mentioned, and others) can be defined.
LinkedCharts apps are displayed as HTML pages, using a standard Web browser. This means that the layout, as well as decorations (such as headlines), can easily be specified by producing a standard HTML5 page, in which the elements where the charts are to be placed are marked by their \mintinline{R}{id} attribute. As knowledge of HTML5 is widespread, this allows practitioners to improve the appearance of the LinkedCharts app without having to learn anything new.

Furthermore, it facilitates integrating LinkedCharts with other web-based apps. For example, one can easily link a LinkedCharts app with a web-browser-based genome browser, such as \emph{IGV.js} \cite{robinson_2020}, so that the user's interaction with the LinkedCharts app controls what genomic region is displayed in IGV's genome track. (An example is given on the tutorial web page.)
Once one has developed a rough prototype of a LinkedCharts app, the app's appearance can be easily improved by using HTML5 to specify layout, decorations, and add further static elements. To facilitate this, the web server integrated in R/LinkedCharts provides basic functionality to also serve, e.g., images and CSS style sheets.
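
As an illustration of placing a chart into a hand-written page, the sketch below writes a small HTML file and then refers to its container by ID. It assumes that \mintinline{R}{openPage} accepts the arguments \mintinline{R}{rootDirectory} and \mintinline{R}{startPage} for serving a custom start page, which should be checked against the package's man page; the file name, the page content and the ID ``overview'' are made up for this example.

\begin{minted}{R}
# Write a minimal custom page into a temporary directory
# (file name and container ID are hypothetical)
pageDir <- tempfile("lc_page")
dir.create(pageDir)
pageFile <- file.path(pageDir, "index.html")
writeLines(c(
  '<!DOCTYPE html>',
  '<html>',
  '<head><meta charset="utf-8"><title>Iris demo</title></head>',
  '<body>',
  '  <h1>Iris measurements</h1>',
  '  <div id="overview"></div>',
  '</body>',
  '</html>'),
  pageFile)

if (interactive()) {
  library(rlc)
  # Assumed interface: serve index.html from pageDir as the start page
  openPage(rootDirectory = pageDir, startPage = "index.html")
  # Place the chart into the element whose id is "overview"
  lc_scatter(dat(
      x = Sepal.Length,
      y = Petal.Length,
      colourValue = Species),
    "overview", with = iris)
}
\end{minted}

Headlines, explanatory text or style sheets can then be added to the HTML file without touching the R code.
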
\begin{figure*}
\centering\includegraphics[width=.9\textwidth]{FigC/figC.png}
\caption{LinkedCharts can be used to ``walk backwards'' through an analysis pipeline. This is illustrated here using a drug screening experiment \citep{he_2018, ozkan_2020} as an example.
For an interactive version, see Interactive \Supplementary{} Figure 4.
The \emph{blue arrows} show the direction of a typical analysis pipeline used in drug screening experiments. We start with reading intensity values from plates with different cell lines cultured in the presence of the studied drugs (A). These values are then normalized and turned into the fraction of cells that remained viable. A sigmoid curve is fitted to the obtained viability values at different drug concentrations, and the area under the fitted curve yields a single score for each drug (B). Different drugs' scores are compared to each other across all the tested cell lines (C). A drug-drug correlation heatmap is then produced to identify clusters of similar drugs (D). The \emph{red arrows} illustrate the direction of interactive data exploration: We start by showing the summary heatmap plot (D). If the researcher is interested in a particular drug combination or a cluster of drugs, he or she can examine the corresponding drug scores simply by clicking on the heatmap cell (D) to see the underlying correlation plots (C). Similarly, one can click on a point in (C) to examine the individual viability values at the tested concentrations and check the sigmoid fit (B). And finally, if needed, it is possible to take one more step back and look at the raw read-outs to inspect them for the presence of any artifacts (A).}
\label{FigC}
\end{figure*}
\subsection{Use cases}
\subsubsection{Interactivity for EDA and for presentation}
In general, the use of interactive data visualization falls into two areas, exploratory data analysis (EDA) and data presentation and dissemination. The latter case is becoming well established: more and more authors now accompany their papers with an interactive resource to present their data and results (for example \cite{travaglini_2020, roider_2020, kalucka_2020}) and allow the reader to browse through them. Typically, this chiefly serves to present and communicate research that has already been completed, and, often, it is only after most of the work on a project has been finished and the paper is being written up that researchers spend a couple of days implementing a nice-looking interactive app to accompany their publication.
However, interactive visualization has possibly even more potential in the early stages of an analysis, where the analyst tries to explore new data and to get a feel for it. The reason this is so rarely done (the ``interactive visualization gap'' in the words of \citet{batch_2017}) might be that setting up interactive visualization usually seems time-consuming and cumbersome. This is why R/LinkedCharts is designed to make it easy to rapidly create a simple app with only a few lines of code. The analyst might produce many such ``quick-and-dirty'' visualizations and only keep a few to later turn them into more polished works for presentation.
In the following, we will discuss use cases along this axis from early EDA to polished presentation.
\subsubsection{Back-tracking in analysis pipelines}
Most analysis of big data comprises multiple steps of data summarization, each reducing the total amount of data and thus losing information.
In the oral-cancer example, for instance, we start with expression values from 28 samples for each gene, but the differential-expression analysis summarizes these to just three values: the gene's average expression over all samples, the fold change between tumor and healthy tissue, and the associated p-value. The LinkedCharts app shown in Figure \ref{FigD} allows one to ``undo'' this summarization by inspecting the original values for each gene.

As an example of an analysis pipeline with multiple data-reduction steps, we use the drug-screening study of \citet{ozkan_2020}. A collection of drugs was tested against various pancreatic cancer cell lines at several concentrations per drug. Figure \ref{FigC} illustrates a possible analysis pipeline: Panel A shows the viability read-out from the microtiter plates. For each combination of one cell line and one drug, the values for the different tested concentrations can be shown as a scatter plot, with each point depicting the viability value from one well (panel B). Here, we can fit dose-response curves, which can then be further summarized to a single number, such as the area under the curve, or, in the case of this study, a refined variant of that, called the drug sensitivity score (DSS) \citep{yadav_2014}. If two drugs show an effect on the same subset of cell lines, they likely have similar modes of action. Hence, to assess the similarity for each pair of drugs, we compare their activity over all cell lines, as shown by the scatter plots in panel D, where each point represents a cell line, with its \emph{x} and \emph{y} coordinates denoting the drug sensitivity scores of that cell line for the two compared drugs. Again, we summarize each such plot into a single number, the correlation coefficient, and finally, we depict all the correlation coefficients in a correlation-matrix heatmap (panel C).

Often, such an analysis pipeline is fully automated and no one ever looks at the intermediate plots. Inspecting them is, however, crucial to note problems with data quality or mistakes in the design or programming of the analysis pipeline.

LinkedCharts allows one to ``walk'' such an analysis pipeline backwards: In the \Supplement, we show an app that depicts the plots of Figure \ref{FigC} in an interactive fashion, as follows. As each cell of the final heatmap (panel C of Figure \ref{FigC}) summarizes a scatter plot comparing two drugs (panel D), we can click on any cell in the heatmap and then see the corresponding scatter plot. Each point in that scatter plot represents a pair of drug sensitivity scores, which are, themselves, summaries of dose-response curves. Again, clicking on a point in the correlation scatter plot will display these two dose-response curves. Finally, each value in a dose-response curve stems from a well in a microtiter plate, and hovering over a point there hence highlights the well in a heatmap depicting the plate.

Thus, LinkedCharts allows one to explore the ``parentage'' of any result value. If we find a specific drug-drug correlation value suspicious or surprising, or if we just wish to double-check it before drawing further conclusions from it, we can check its provenance in arbitrary detail. Similarly, we can perform random spot checks.

Each layer in the backwards journey can inform about another type of problem: From the correlation scatter plots, we may find that the correlation coefficient was unduly influenced by a single outlying cell line; from the dose-response plot, we may find that specific dose-response curves fail to have the expected sigmoid shape; and from inspecting plate plots, we may trace back a surprising final result to, say, a normalization issue or a plate-edge effect.
Once such an analysis pipeline has been developed, all the intermediate results are typically available in suitable data structures, which can be readily explored with LinkedCharts. The \Supplement{} provides code for the example just described.
\subsubsection{Quality assurance thresholds}
Typically, analysis pipelines include steps to exclude bad-quality data. Often, this is done by calculating quality metrics and setting thresholds. In the drug screen example, the goodness of fit of the dose-response curves might be quantified by the residual sum of squares, and if this value exceeds a threshold, the drug sensitivity score might be discarded as unreliable. In the oral-cancer example, the log fold change of some genes might be unduly influenced by a single outlying sample, and one might use a threshold on an outlier-detection score such as Cook's distance to flag such genes.
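
To make such quality metrics concrete, the sketch below computes the residual sum of squares and Cook's distance for a single simulated gene with base R. The data, the two-group design and the cut-off of $4/n$ (a common rule of thumb) are assumptions chosen for illustration, not values used in the studies discussed here.

\begin{minted}{R}
# One simulated gene without a true group difference,
# but with a single outlying sample (illustrative data only)
set.seed(3)
group <- factor(rep(c("normal", "tumour"), each = 6))
y <- rnorm(12, mean = 5, sd = 0.3)
y[12] <- y[12] + 4  # the outlier

# Linear model of expression on group, as in a simple two-group test
fit <- lm(y ~ group)

# Goodness of fit: residual sum of squares
rss <- sum(residuals(fit)^2)

# Influence of each sample on the fitted coefficients
cooks <- cooks.distance(fit)

# Flag samples above a rule-of-thumb cut-off (an assumption, see text)
threshold <- 4 / length(y)
flagged <- which(cooks > threshold)
print(flagged)
\end{minted}

Whether a cut-off of this kind suits a given assay is exactly the question that the remainder of this section addresses.
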
Typically, the thresholds on such quality metrics are chosen a priori, often simply taking over values from previous work or from tutorials, even though the characteristics of the assay might have changed. Doing otherwise seems to cause a chicken-and-egg problem: One cannot run the analysis without first somehow deciding on thresholds, and therefore, one cannot use analysis results to guide the choice of thresholds.
The approach of ``walking the pipeline backwards'' with LinkedCharts opens another route: Typically, outliers tend to cause false positive results. Therefore, one can first run the analysis without excluding any outliers and then inspect the provenance of the statistically significant items found; one is thus guided specifically to those places in the raw data where outliers can actually cause false positives. This provides the analyst with a better ``feel'' for the data and the analysis procedure and helps build an intuition that allows one to judge more critically whether traditionally used standard values for quality-assurance thresholds are appropriate for the specific data set under analysis.
\begin{figure*}[b]
\includegraphics[width=\textwidth]{FigE/figE.png}
\caption{An example of an R/LinkedCharts app (C, D) for a simple exploratory analysis and the code to generate it in comparison with static plots (A, B) produced for the same purpose. The heatmaps (A, C) show Spearman correlation of gene expression for all samples from \citet{conway_2015}. Here, we can see, inter alia, two outlier samples in the heatmap's bottom-right corner and some more or less pronounced clusters of samples with similar gene expression levels. The scatter plots (B, D) show expression values for two samples plotted against each other. Browsing through several such plots can help the researcher get a feeling of the data and explore unexpected patterns like the outliers just mentioned. The code is split into two pieces, where the upper one is responsible for generating the plots and the lower part shows the code to update them to show a specific sample pair. For static plots, one has to execute the same lines of code for any pair of samples, while for R/LinkedCharts the provided code should be added to the list of arguments for the heatmap. After that, switching between pairs of samples can be done simply by clicking on the corresponding cell of the heatmap. The static heatmap (A) was generated with the ``pheatmap'' package \citep{kolde_2019}; scatter plot (B) was made with a base R function. The live version of the app can be found in the \Supplement. For simplicity, gene expression for all the samples is subset to 8000 randomly selected genes.}
\label{FigE}
\end{figure*}
\subsubsection{Exploratory analysis}
Analyzing complex data sets from many different angles and asking many different questions about them is crucial to all computational biology, not only to ensure that one does not overlook potential problems but also in order not to miss the chance of serendipitous discoveries. The importance of such exploratory data analysis (EDA) has long been argued, and it therefore forms a large part of computational biologists' everyday work. An important element is to pick examples and study them in detail, similarly to the quality-assurance applications discussed in the previous section, but now with the aim of getting a ``feel'' for the data and looking for insights.
The standard approach in inspecting examples is to pick, e.g., a gene from a result list, produce a plot showing the provenance of this result, then pick another gene, change the code for plotting to now show underlying data for this gene, etc. At that point, it is trivial to alter this code into a linked charts app, using the similarity between code for static and dynamic plots (Figure\ \ref{FigE}).
Figure \ref{FigE} illustrates this with another example based on the oral cancer dataset. A bioinformatician has produced a correlation heatmap depicting correlations between all sample pairs (Figure\ \ref{FigE}A), using, e.g., the pheatmap package \citep{kolde_2019}, and now wishes to inspect a specific correlation value (panel B) and, to this end, writes the short code shown in the figure. To inspect other sample pairs, she would simply change the sample indices in the code. This is routine practice for most bioinformaticians, but cumbersome. As the code example below the plots in Figure \ref{FigE} shows, however, it now takes virtually no effort to transform the code into a LinkedCharts app, by merely making a few simple substitutions.
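
The pattern can be sketched as follows with simulated data in place of the oral-cancer expression matrix. The names \mintinline{R}{expr}, \mintinline{R}{corMat} and \mintinline{R}{pair} are hypothetical, and we assume here that the \mintinline{R}{on_click} handler of \mintinline{R}{lc_heatmap} is passed the row and column index of the clicked cell; this is a sketch of the idea, not the code shown in Figure \ref{FigE}.

\begin{minted}{R}
# Simulated stand-in for an expression matrix: 500 genes x 8 samples
set.seed(2)
expr <- matrix(rnorm(500 * 8), nrow = 500,
               dimnames = list(NULL, paste0("S", 1:8)))
corMat <- cor(expr, method = "spearman")

# Static approach: edit the indices and re-run for every sample pair,
# e.g. plot(expr[, 1], expr[, 2])

pair <- c(1, 2)  # row and column of the selected heatmap cell

if (interactive()) {
  library(rlc)
  openPage(layout = "table1x2")
  # Overview: sample-sample correlation heatmap
  lc_heatmap(dat(
      value = corMat,
      on_click = function(k) {
        pair <<- k          # assumed: k = c(row, column)
        updateCharts("A2")
      }),
    "A1")
  # Details: the two selected samples plotted against each other
  lc_scatter(dat(
      x = expr[, pair[1]],
      y = expr[, pair[2]]),
    "A2")
}
\end{minted}

Compared with the static approach, the only additions are the global variable holding the selected pair and the click handler that updates it.
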
\begin{figure*}[t]
\includegraphics[width=\textwidth]{FigF/figF.png}
\caption{A LinkedCharts app as a paper supplement \cite{wang_2020}. The main chart (upper row, centre) shows for every gene in the study its average expression and so-called $\Delta$-score, which indicates whether evolutionary changes in the translatome compensate for changes in the transcriptome or introduce additional variance. The two plots below show expression values for the selected gene in all the tested samples. The user selects a gene by clicking on the corresponding point of the main plot or by entering the gene name (upper-left corner). The density plot to the right shows the distribution of $\Delta$-scores, and its Y-axis is linked to the Y-axis of the main plot. In the upper-right and bottom-left corners, some additional information on the selected gene is displayed. Icons in the upper-left corner allow switching between the three studied tissues. Detailed information on the data, study goal and the source code for the app can be found in the related publication. The app is written in JavaScript and, thus, can be downloaded and opened in any modern browser without installation requirements. Though largely customized, the app is based on the same principles as other examples throughout this paper. For the live version, see \url{https://ex2plorer.kaessmannlab.org/}.}
\label{FigF}
\end{figure*}
\subsubsection{Public apps and concurrent use}
Technically, an R/LinkedCharts app is provided by a web server running inside the R session and can hence be used from any web browser. Importantly, there is no need for that web browser to be running on the same computer as the R session. This allows a bioinformatician to easily share a LinkedCharts application with colleagues. The bioinformatician only has to direct the operating system's firewall to open, for incoming connections, the TCP/IP port on which the app is listening, and to tell the colleagues the computer's IP address or DNS name and the port number, which they simply enter into their browser's address bar.
As multiple users might now use the app simultaneously, we have to make sure that each user gets their own copy of any global variable, such as the variable \mintinline{R}{gene} in the initial code example. This requires only a trivial change: one lists all such session variables when opening the page. In the initial code example, one would simply amend the first line to
\begin{minted}{R}
openPage(layout = "table1x2",
         sessionVars = list(gene = 1))
\end{minted}
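Put together, a shared version of such an app might look as sketched below. This is an illustration under stated assumptions, not a definitive recipe: the count matrix is simulated, the port number 12345 is an arbitrary choice, and we assume that \mintinline{R}{openPage} accepts a \mintinline{R}{port} argument and that the callback modifies the session's own copy of \mintinline{R}{gene}.
\begin{minted}{R}
library(rlc)

# Simulated counts (genes x samples),
# for illustration only
set.seed(1)
counts <- matrix(rpois(500 * 6, lambda = 20),
  nrow = 500)
means <- rowMeans(counts)
vars <- apply(counts, 1, var)

# Starts a web server, so run interactively
if (interactive()) {
  # Fixed port (arbitrary choice) so that
  # colleagues know where to connect; each
  # session gets its own copy of `gene`
  openPage(useViewer = FALSE,
    layout = "table1x2", port = 12345,
    sessionVars = list(gene = 1))
  lc_scatter(dat(x = means, y = vars,
    on_click = function(k) {
      gene <<- k
      updateCharts("A2")
    }), place = "A1")
  lc_scatter(dat(x = 1:ncol(counts),
    y = counts[gene, ]), place = "A2")
}
\end{minted}
Colleagues would then enter the host name and the chosen port number into their browser, as described above.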
\subsubsection{Apps with complex user interfaces}\label{gui_apps}
\begin{figure*}[t]
\centering\includegraphics[width=.85\textwidth]{FigG/figG.png}
\caption{An example of an app that was used as a GUI to perform manual inspection and classification of LAMP testing for SARS-CoV-2 viral RNA \citep{daothi_2020,Lou_2023}. The app was used during our SARS-CoV-2 surveillance study \citep{deckert_2021} and for voluntary testing for Covid-19 infection on campus (University of Heidelberg) in 2020/21. To the right, the app shows a 96-well plate layout colored either by content type (sample, empty, positive or negative control) or by the assigned result. To the left, it shows the results of three tests and one control for each sample. Accumulation of the LAMP product is indicated by the change of color from red to yellow and is measured as a difference in absorbance at two wavelengths. This difference is plotted as a function of time. Besides exploration (highlighting the corresponding lines for each sample), the app allows the user to manually reassign the status, store results and send them to the server, where they can be queried by the test subjects. The app is provided as an R script; the code and some example data are available on GitHub at \url{https://github.com/anders-biostat/lamp_plate_analysis}.}
\label{lc_FigG}
\end{figure*}
In all examples discussed so far, user input is constrained to selecting data points in one chart in order to affect the display in a linked chart. However, the LinkedCharts library also provides more general means of data input by the user, by leveraging the HTML5 tag \mintinline{html}{<input>} and thus offering buttons, checkboxes, radio buttons, sliders and text fields via the ``rlc'' function \mintinline{R}{lc_input} (Figure \ref{FigA}G). As for any other LinkedCharts element, \mintinline{R}{lc_input} can be given a callback function that is run every time the user changes the state of an input element (e.g., clicks a button or enters new text). This makes it easy to add functionality to enter, say, a gene name rather than clicking on its point (as in Figure \ref{FigF}), but also to build up complex apps.
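As an illustration, a text field for selecting a gene by name might be set up as sketched below. The data are simulated and the helper \texttt{geneIndex} is a hypothetical name introduced only for this example; we also assume that the callback receives the entered text as its argument.
\begin{minted}{R}
library(rlc)

# Simulated counts with gene names as row names
set.seed(1)
counts <- matrix(rpois(50 * 6, lambda = 20),
  nrow = 50,
  dimnames = list(paste0("gene", 1:50), NULL))
gene <- 1

# Hypothetical helper: row index for a gene
# name, NA if the name is unknown
geneIndex <- function(name) {
  match(name[1], rownames(counts))
}

# Opens a browser page, so run interactively
if (interactive()) {
  openPage(useViewer = FALSE,
    layout = "table1x2")
  lc_input(type = "text", labels = "Gene",
    on_click = function(value) {
      idx <- geneIndex(value)
      # ignore names that are not in the data
      if (!is.na(idx)) {
        gene <<- idx
        updateCharts("A2")
      }
    }, place = "A1")
  lc_scatter(dat(x = 1:ncol(counts),
    y = counts[gene, ]), place = "A2")
}
\end{minted}
Keeping the name lookup in a separate function allows it to be checked without opening a page, e.g., \mintinline{R}{geneIndex("gene7")} returns 7 for the simulated data.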
Figure \ref{lc_FigG} shows a screenshot of a more complex LinkedCharts app, which was developed as part of an effort to establish LAMP-based testing for SARS-CoV-2 at the Heidelberg University campus \citep{daothi_2020}. In this project, a colorimetric assay based on loop-mediated isothermal amplification (LAMP, \citep{notomi_2000}) was carried out on microtiter plates. The app allowed
lab technicians to inspect the measured change in pH as a function of incubation time, to link curves to wells and to patients, to compare replicates, and to check and, if needed, amend automatic result calling. (See \citep{Lou_2023} for details.) Such continuous quality control is vital for reliable medical diagnostics and has to be offered in an easy-to-use and quick-to-grasp manner to avoid mistakes from repetition and fatigue.
Here, LinkedCharts turned out to be well suited to quickly develop the app, to continuously refine it while the assay was finalized, and to turn it into a production tool, well integrated into the testing campaign's databases and result reporting services.
\subsubsection{LinkedCharts for Open Science}
Analyses in computational biology are often complex and involved, making them difficult to explain and even more so to verify. It is not uncommon that neither the peer reviewers nor the readers of a publication are effectively able to double-check a result unless they are willing to redo the whole analysis themselves. The importance of making all raw data and code available for this purpose has often been stressed \citep{gentleman_2005}, but even then, verifying a complex analysis is a demanding task.
Published interactive apps for data exploration are hence the next step towards open science. Traditionally, publications illustrate the characteristics of typical data by showing ``typical examples'' -- but whether an example can be considered typical can be quite controversial. A LinkedCharts app in a paper's online supplement allows readers to choose their own examples rather than relying on the authors' potentially ``cherry-picked'' ones. A second, less obvious, advantage is that interactivity can help clarify the details of a complex analysis.
While we consider the main area of application for LinkedCharts to be data analysis, it can be used for result presentation as well. For instance, a LinkedCharts app was used as an online supplement to the paper of \citet{wang_2020}, a big-data study aiming at elucidating to what extent evolution of expression regulation acts on transcription and to what extent on translation. Using RNA-Seq and ribosomal footprinting data from three organs, taken from animals of six species, changes in transcript abundance and in translation of transcripts into proteins were quantified and compared. A core idea of the analysis was that the evolutionary changes to transcription and translation may either compensate for each other (thus compensating deleterious changes in one layer by an opposite change in the other), or reinforce each other (in case of adaptive changes). To this end, a score denoted as $\Delta$ was calculated, which is negative if the between-species difference is lower in the ribosome footprinting data than in the transcriptional data (thus indicating that transcriptional differences are at least partially compensated on the translational layer) and positive if the variance at the ribosomal layer is higher (indicating reinforcement).
The definition of this $\Delta$-score is technical, and it is hard for the reader to form an intuition on its meaning. By ``playing around'' a bit with the app, available at \url{https://ex2plorer.kaessmannlab.org/} (static picture: Figure \ref{FigF}), this is quickly remedied: The reader can click on any gene in the upper scatter plot, inspecting examples of genes with positive, negative, or near-zero $\Delta$-score to see the data from the individual samples. After a few clicks, the relationship between the transcriptional and the translational data on the one hand and the $\Delta$-score on the other hand will be clearer than after reading several paragraphs of text. The use of HTML design elements to position explanatory labels renders the app nearly self-explanatory. Here, it is not a simple picture, but an interactive one, that is worth the proverbial thousand words.
LinkedCharts apps can be used as paper supplements in two ways. As described above, any R/LinkedCharts app supports concurrent use and can therefore be made available for public usage with very few changes to the code. Alternatively, a user with knowledge of JavaScript can use the \emph{linked-charts.js} library, which is the foundation of R/LinkedCharts, to make an app fully contained within an HTML file, as was done for the aforementioned supplement to \citet{wang_2020}. Though this approach requires considerably more effort, the resulting app is extremely easy to share, does not require any form of installation and can be run in any modern web browser. The interface of \emph{linked-charts.js} is in many aspects the same as that of R/LinkedCharts, which facilitates the code transformation. To give readers a feeling for the similarity between the R code of R/LinkedCharts and the JavaScript code of \emph{linked-charts.js}, we provide code in both languages for every example in the \Supplement.
\section{Summary and Conclusion}
The importance of using interactivity in data exploration has long been discussed. In bioinformatics, applications that perform specific analyses for specific data types often offer useful interactive features for data exploration. However, wherever fitting special-purpose tools are not available, analyses are still conducted using static plots, and general-purpose frameworks for interactive visualization are used, if at all, only to present the details of an already finished analysis. The reason that general-purpose frameworks for interactive data visualization are rarely used in the actual analysis is twofold: first, for technical reasons, the most versatile tools are only available for JavaScript, while bioinformaticians typically work with R or Python; second, the available tools for R lack a crucial feature: linking.
We have presented R/LinkedCharts, a general-purpose framework for interactive data visualization for R that pulls all event handling from the JavaScript core to the R-based development side. This enables bioinformaticians to code arbitrary reactions to user interactions with individual data points in a chart and thus to link several charts. We have shown that this makes it easy to set up apps where an ``overview chart'' shows the main results and a click on any item in this overview displays details on this element in a ``detail chart''. We have discussed numerous ways in which variations on this general idea enable powerful data analysis strategies that can be easily incorporated into a data analyst's existing work routine. We have argued that the consistent use of such techniques allows for improvements at all stages of a project.
\subsection{Availability of Data and Materials}
R/LinkedCharts is available as an R package from CRAN, the standard archive for R packages (\url{https://cran.r-project.org/}), i.e., it can simply be installed with \mintinline{R}{install.packages("rlc")}. No further installation is required, as all components, including the web server and the functionality to link to the web browser, are included in the package and started automatically. For an archived version of the software, see \cite{rlc_2023}.
Code and detailed explanations for all examples discussed in this paper are given in the paper's interactive Supplement, which is also available at \url{https://anders-biostat.github.io/lc-paper/}. Several detailed usage tutorials are available at \url{https://anders-biostat.github.io/linked-charts/}.
The dataset used for the example in Figure \ref{FigD} is available on the European Read Archive (ERA) under accession PRJEB7455 (secondary accession: ERP007185). The count data have been downloaded from the \emph{recount2} project \citep{collado_2017} at \url{https://jhubiostatistics.shinyapps.io/recount/}. The dataset for the example in Figure \ref{FigC} was obtained from the authors. All the data necessary to recreate the example apps are provided in the \Supplement{} of the paper alongside the corresponding code.
\section{Methods}
\subsection{Implementation}
The JavaScript foundation of R/LinkedCharts is built on top of the D3 library \citep{bostock_2011}.
\emph{linked-charts.js} is by itself a fully functional tool for interactive data visualization that can be used by those familiar with JavaScript to create stand-alone apps. The library is open-source and available on GitHub at \url{https://github.com/anders-biostat/linked-charts}. %The minified version can be downloaded from \url{https://github.com/anders-biostat/linked-charts/raw/master/lib/linked-charts.min.js}, and its stylesheet is at \url{https://github.com/anders-biostat/linked-charts/raw/master/lib/linked-charts.css}.
The ``jrc'' package is used as a bridge between R and JavaScript. It allows one to run JavaScript code from an R session and vice versa. It also manages client connections to the app and is responsible for all the functionality necessary to make an R/LinkedCharts app public. ``jrc'' in turn is based mainly on the ``httpuv'' package \citep{cheng_2020}, which it uses to run a local server and maintain a WebSocket connection \citep{fette_rfc_2011}. A current version can be found at \url{https://github.com/anders-biostat/jrc}, an archived one at \cite{jrc_2020}.
R/LinkedCharts (``rlc'' package) is an R \citep{R_2019} interface to the JavaScript version of LinkedCharts. In addition to providing access to \emph{linked-charts.js} functionality, it also ensures proper storing of charts and serving them to each connected client by extending the ``App'' class of the ``jrc'' package. ``rlc'' is open source and is available on CRAN and on GitHub at \url{https://github.com/anders-biostat/rlc}.
\subsection{Funding}
The authors acknowledge funding by the Deutsche Forschungsgemeinschaft via CRC 1366 and by the Klaus-Tschira-Stiftung via grant 00.022.2019.
\subsection{Authors’ Contributions}
SO implemented the ``rlc'' package and wrote the manuscript. SA conceived and supervised the project, contributed to its implementation, and edited the manuscript. Both authors jointly designed the software architecture.
\subsection{Ethical Approval}
Not applicable for this publication.
\section{Supplement}
The interactive supplement to this paper can be found online at \url{https://anders-biostat.github.io/lc-paper/} or offline in the zip file accompanying the paper. To see the supplement offline, unpack the zip file and use a web browser to open the file \emph{index.html} contained within.
\begin{small}
\bibliography{lc}
\end{small}
\end{document}