Alzheimer DataLENS is an R Shiny web
application with an HTML/CSS/JavaScript front-end, an R back-end, and a
MongoDB database. Details about the mission
and purpose of DataLENS are provided in README.md
. Briefly, DataLENS
is an open data analysis platform which aims to advance Alzheimer’s
disease research by enabling the analysis, visualization, and sharing of
-omics data, where -omics refers to the branch of science concerned with
quantifying the levels of biological molecules.
The file ui.R
defines the user interface (UI) of the application,
using a shinydashboard
foundation supplemented by custom CSS defined
in www/css/custom.css
. I also employ various R objects from the
htmltools
package which represent HTML tags (see
?htmltools::builder
) and use the tags$zzz
syntax to prevent
namespace conflicts. I aimed to design the UI, color scheme, and
DataLENS logo as modern yet minimalist, and include various icons from
FontAwesome to guide the user through the
web application.
The foundation of the DataLENS back-end is R Shiny, a framework for building web applications using the R programming language and statistical computing environment. I selected R as my programming language of choice due to the predominance of R in the domain of -omics analysis and the availability of various key packages (i.e., libraries) designed for biological datasets which enabled me to assemble state-of-the-art bioinformatics pipelines.
Next, I decided to use Shiny since the interactivity which Shiny facilitates allowed me to help non-technical audiences interpret complex, multi-dimensional -omics datasets and permit users to perform advanced R analyses with ease. Further, Shiny is designed on principles of reactive programming, a declarative programming paradigm concerned with managing interaction flows and the propagation of change to deliver the fastest and most streamlined experience for the end-user. Given the large size of my database (> 20 Gb) and complexity of -omics data, optimizing the application for speed (perhaps at the expense of memory required) was critical. In particular, upon registering a change to a user input, Shiny determines which outputs are relevant (i.e., depend on the value of that input) and selectively modifies only those outputs, thereby minimizing the work required.
When writing the back-end code for DataLENS in server.R
, I maximized
the reactivity of Shiny by considering the reactive graph, which is the
data model that connects reactive inputs, reactive expressions, and
reactive outputs (i.e., reactive producers and reactive consumers). In
particular, I used the reactlog()
functionality to visualize reactive
dependencies in a specific instance of Alzheimer DataLENS. The resulting
reactive graph is shown below:
As demonstrated above, upon user interaction, Shiny will only update outputs downstream of that input in the reactive graph. Some features of the back-end code which increase performance and mitigate errors include:
-
Multiple
validate()
statements which validate user input on the server side, and prompt the user to correct the input if needed. -
Use of the
observeEvent()
function with anactionButton()
for event handling to delay modifying the reactive graph until a specific user action is completed, thereby minimizing database queries or API calls (e.g., minimizing POST requests to the STRING API, since requests are not made until the user clicksGenerate Network
orUpdate Network
). -
Use of
data.tables
, which support fast data wrangling operations including merging and sorting (see here), after database queries withmongolite
, as well as pipes (i.e.,%>%
) frommagrittr
for code readability.
To learn more about R Shiny, please see the book Mastering Shiny by Hadley Wickham, available here.
I selected MongoDB as my database service (which is a NoSQL database, in
contrast to a SQL database such as SQL Server or SQLite) since it
supports multi-dimensional data types such as documents or arrays. These
values are stored in a JSON
-like format. As MongoDB is not a
relational database, more performant indexing (as compared to SQL)
allows for high-throughput queries (e.g., querying large -omics
datasets).
When working with my database at the MongoDB Shell, I defined various
indices to support efficient query execution rather than a collection
scan (see documentation
here). Specifically, in the
expression
collection (where collections are analogous to SQL tables),
I defined indices on the GeneSymbol
and FileName
columns which
dramatically improved the time required per query.
Various components of Alzheimer DataLENS interface with external services, components, or resources, including those listed below:
- Queries to the MongoDB database are handled by
mongolite
functions after the connections is established upon initial execution ofserver.R
. - Gene validation is performed using the
org.Hs.eg.db
genome wide human annotation, which depends on Entrez identifier mappings. - Protein-protein interaction network analysis is performed using the STRING database (version 11). Specifically, the construction of each network is facilitated by a POST request made to the STRING application programming interface (API). STRING is a database of known and predicted protein-protein interactions.
- Brain segmentation plots are created via the
ggseg
package, which in turn relies on the Desikan-Killany cortical atlas and the automatic subcortical segmentation atlas. The manually-constructed mapping between DataLENS brain region identifiers andggseg
identifiers (derived from Freesurfer parcellations, etc.) is provided atwww/assets/ggseg_mapping.xlsx
.
Finally, I thank the original providers of the data contained in the DataLENS database.
The rationale for the inclusion of each required package is given below.
Package | Rationale |
---|---|
shiny |
Framework for web application development in R. |
shinydashboard |
Dashboard infrastructure for Shiny, see here. |
mongolite |
Interface to MongoDB. |
AnnotationDbi |
Implements the base class for all Bioconductor annotation packages that contains a database connection. Specifically, select() allows translation between gene identifiers. |
org.Hs.eg.db |
The latest human genome annotation from Bioconductor. |
httr |
To make POST requests to the STRING database. |
data.table |
Fast and memory efficient data frames in R. |
purrr |
Functional programming tools, including and especially various iterable functions. |
magrittr |
Implements the pipe operator, %>% . |
ggplot2 |
Flexible declarative framework for data visualization in R based on The Grammar of Graphics. |
plotly |
Converting ggplot2 graphics to interactive, web-based versions via ggplotly() . |
DT |
Wrapper to the JavaScript library DataTables . |
ggseg |
Brain segmentation plots in R. |
igraph |
Network analysis in R. |
ggraph |
ggplot2 -based network visualization in R. |
ggiraph |
Converting ggplot2 graphics to interactive, web-based versions; also allows the selection of graphical elements in Shiny. |
Finally, I load tested the application using the shinyloadtest
package
and the accompanying shinycannon
command line tool to determine how
many users DataLENS can support and identify performance bottlenecks.
-
First, I recorded a typical user session using
record_session()
.shinyloadtest::record_session('http://127.0.0.1:3704')
-
I then used the
shinycannon
command line tool to replay the recording in parallel, simulating many simultaneous users accessing DataLENS.java -jar shinycannon-1.1.0-45731f0.jar recording.log http://127.0.0.1:3704 --workers 5 --loaded-duration-minutes 2 --output-dir run
-
Finally, I analyzed the load test logs to generate a load test report.
df = shinyloadtest::load_runs("run") shinyloadtest::shinyloadtest_report(df, "load_test.html")
The load test report is provided in full at www/assets/load_test.html
.
Below, I excerpt the report to show five simulated users executing
back-to-back sessions:
Note that the narrow event bars suggest that DataLENS is performant under high-demand conditions.
To learn more about Shiny load testing, please see the shinyloadtest
documentation.