Training Plan v3.0 / CFDE / December 2020
This Common Fund Data Ecosystem Coordination Center (CFDE-CC) training plan lays out our plan of action for 2021 as well as our overarching goals. Over the coming year, we will issue periodic reports that provide progress updates on our efforts, summarize assessment and evaluation results, and detail next steps for training.
The goals of the CFDE training effort are threefold. First, we want to work with specific CFDE Data Coordinating Centers (DCCs) to develop and run DCC-specific and targeted cross-DCC data set training programs that help their users make better use of their data. Second, we want to provide broad-based training on data analysis in the cloud, to help the entire CFDE user base shift to a more sustainable long-term approach. And third, we expect that broad and deep engagement with a wide range of users will help us identify new use cases for data reuse that can be brought back to the CF DCCs and the CFDE. Collectively, our training program will train users in basic bioinformatics and cloud computing, help the DCCs lower their support burden, improve user experience, and identify new use cases for data reuse and data integration within and across DCCs.
In this training plan, we have no specific plans to interface with training efforts outside the Common Fund. However, we are aware of a number of training efforts with similar goals, including Broad’s Terra training program and AnVIL’s training focus. The underlying technologies and approaches we are using in our trainings and materials (see below) are entirely compatible with these programs and are designed to allow access and reuse across efforts and teams.
All training materials produced by the CFDE-CC will be made available through the central nih-cfde.org web site, under CC0 or CC-BY licenses, which will allow them to be used and remixed by any other stakeholders without limitations. Assessment and iteration on the materials will be carried out by the CFDE-CC’s training team during the pilot period, which we expect to be the first 1-2 months of development for any given lesson; we will engage with external assessment and evaluation as our efforts expand.
The CFDE-CC’s training component is run by Dr. Titus Brown and Dr. Amanda Charbonneau, three training postdocs, and two staff training coordinators. The training component is closely integrated with the engagement plan, and we expect training to interface with user experience evaluation and iteration across the entire CF and CFDE, as well as with use case creation and refinement.
Our initial plan was to run a series of in-person workshops during 2020. However, we have pivoted to an online strategy because of the COVID-19 pandemic; in particular, we expect there to be no in-person meetings for the foreseeable future. While we believe we can leverage online training effectively, setting up the system has required a great deal of experimentation with formats, technologies, and teaching styles. As of late 2020, we have started running pilot workshops to test our training materials in an online setting, and we will continue to host larger and more frequent workshops as we refine our lessons and teaching strategies.
Online training is very different from in-person training. In our experience, in-person training provides a natural focus for many learners and can support extended (~4-6 hrs/day) engagement with materials. Moreover, technology problems on the learner’s side can often be fixed by in-person helpers who have direct access to the learner’s computer. Finally, the intensity of in-person workshops justifies the higher cost of travel: in the past we have successfully run many in-person workshops, lasting between 2 days and 2 weeks, where either the instructors or the students traveled significant distances to attend.
Online training requires different affordances. Learner attention span in the absence of interpersonal interaction is much shorter. Remote debugging is possible but much less effective than in-person debugging. And both instructors and learners must manage more technology, including teleconferencing software and chat, often on the same screen as the lesson itself. These challenges, among others, have limited the effectiveness of online training efforts, including MOOCs (Massive Open Online Courses); several studies of MOOCs have shown that most learners drop out quickly, and that the benefits have accrued mostly to those who already had experience with the material.
In exchange for these downsides, online training offers some opportunities. Asynchronous delivery of material accommodates learners’ different schedules and leaves much more time for offline experimentation and challenge exercises. Moreover, online training can offer somewhat more scalability and can potentially be offered more cheaply, since it involves no travel or local facilities.
We have transitioned our initial materials for in-person workshops to lessons that can be delivered online. For many lessons, we accomplished this by breaking them up into 5-10 minute video chunks, or “vidlets”, that showcase concepts and technical activities. These chunks can be viewed in “flipped classroom” or offline mode, and will be interspersed with opportunities for virtual attendees to seek technical help, explore their own interests, and ask questions in an individual or group setting. In some lessons, we opted for an entirely written approach, with a number of interactive text elements and screenshots. All training materials used for workshops are available online (https://training.nih-cfde.org/en/latest/) as written step-by-step tutorials, providing learners multiple ways to approach the material.
In contrast to in-person materials, which require instructor notes but rely in large part on the presenter, both videos and screenshot-based walkthroughs are laborious to produce. A ~1 hour in-person lesson might take 2-3 days to develop and write out, while the same lesson as an online walkthrough will likely take a week or more. An online lesson will likely require dozens of formatted screenshots as well as more detailed explanations and teaching tips to help users advance through the lesson. Vidlets may reduce the need for detailed documentation, but they require a great deal of time-consuming planning and editing.
These materials also require much more upkeep than in-person lessons. With in-person materials, changes to the Kids First interface, for example, would only matter if they changed the functionality of the portal, and lesson content for new features could generally be developed and added without overhauling the materials. With online materials, however, even minor color and placement changes to the Kids First interface can render our materials useless: vidlets need to be completely re-recorded with nearly every update, and the screenshots in walkthroughs generally all need to be re-taken. For 2021, we are evaluating the pros and cons of each of these methods, as well as continually exploring new ways to deliver online content more efficiently.
After our initial materials revamp, we have started offering online 'in person' lessons via Zoom. We deliver each lesson within the training team first, and then expand to groups outside our team. Each delivery is a walkthrough of an entire lesson with users, followed by an iteration on the materials to reflect the discussion during the walkthrough. After 2-3 iterations delivered to beta users and CF program members, we will set up a formal registration system and encourage adventurous biomedical scientists to attend sessions.
As of late 2020, we have tested two lessons, offered to a larger audience as pilot workshops. Here too, we are experimenting with the exact approach we will use. Online learning, especially for people with slow internet connections or limited screen size, can be extremely difficult. We expect to combine Zoom teleconferences, live streaming, and helpdesk sessions via our CFDE training helpdesk, but we will continue to assess how well these work for our learners and update accordingly. With each session, we conduct assessments and evaluate our overall approach as well as next steps for specific lesson development.
This lesson development approach is slow and cautious, and provides plenty of opportunity to improve the materials in response to the lived experience of both instructors and learners. During the lesson development and delivery period, we will work closely with each partner DCC to make sure our lessons align with their best practices, and we will convey any technical challenges with user experience back to the DCCs in order to identify potential improvements in DCC portals. We expect to develop 2-3 new lessons per website release, as well as update existing content. Our training website release plan provides a timeline for posting new tutorials. In addition to release timelines, the release plan describes the stages between releases, the internal CFDE training material review process, and the format and tags for release documentation on both the public-facing website and the GitHub repo.
Our assessments for the coming year will focus on improving our impact by better understanding the needs of our learners, areas where our materials can be improved, and techniques for better online delivery of our materials. We will do extensive curriculum review following each training with the goal of improving both our materials and our instruction. For each lesson, we will evaluate the training by applying a variety of formative and summative assessment techniques, including within-training checkpoints, pre- and post-training surveys, live observation evaluations of lessons, and, in late 2021, remote interviews with learners and trainers both before and after training. We will also work with DCCs to measure continued use by learners as one of our longer-term metrics.
Throughout 2021, we will issue periodic training reports that describe the lessons learned from each training, as well as summaries of anonymized survey results. The results from these reports will also be used to develop larger-scale instruments that we can use to standardize summative assessment.
We will work with DCCs to build training materials that help their current and future users make use of their data sets. Our primary goals here are to (a) create and expand materials for users, (b) offer regular trainings in collaboration with the DCCs, (c) provide expanded help documentation for users to lower the DCC support burden, and (d) work with the DCCs over time to refine the user experience and further lower the DCC support burden.
In addition to developing DCC-specific training material, we will also test and provide bug reports for new DCC technology/infrastructure/tools and accompanying documentation for GTEx, exRNA, IDG, LINCS, HuBMAP, SPARC, Kids First, and Metabolomics. For example, we plan to help LINCS by testing and reviewing tutorials, documentation, and manuals. The materials will focus on access and use of LINCS resources in the cloud, with use cases focused on combining data from other CF programs. We will also be testing use cases and generating associated tutorials for DCC-developed resources.
We have begun a Whole Genome Sequencing (WGS) and RNAseq tutorial using data from Kids First, and have worked with Kids First and Cavatica (the Kids First data analysis platform) to improve their interface so that it can be used for training. We have added multiple lessons on setup and use of the Kids First data portal, linking to the Cavatica analysis platform, and uploading data to Cavatica. With Kids First, we have conducted two pilot trainings, and we plan to both re-offer these and expand our trainings in 2021.
As part of GTEx's 2021 workplan, we will develop specific workshop materials, test the materials, help host the workshops, and compile workshop assessments for GTEx. One workshop will be aimed at using their gene expression datasets to conduct analysis, and the second will focus on Java tutorials for visualization. We plan to develop and make public RNAseq tutorials for the Kids First/Cavatica platform as well as the GTEx/AnVIL/Terra platform.
The exact timelines for these lessons, and others, will depend on the schedules of the host DCCs, how we recruit participants, and how quickly our lesson development proceeds. We will be incorporating different pieces (user-led walkthroughs, video lessons, live virtual sessions) and assessing the materials and configurations to deploy the best possible lesson implementations.
- Persistent, user-led walkthrough documents
- Accompanying short videos of difficult sections
- Materials are graduate level, research scientist-focused
- Materials available at https://training.nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Self-Guided Materials Assessment
- Elicited user feedback in the webpage interface
- Analysis of web analytics to determine user engagement
- Instructor Guided Materials Assessment
- Materials contain breaks for checking understanding/formative assessment
- Pre-training surveys on prior knowledge of data sets and techniques, specific learning goals, and self-confidence
- Post-training surveys on improved knowledge, learning goals, tutorial format and content, and use case gaps in the training materials
- Conduct remote interviews with learners both before and after training
- Secure any approvals for human data collection
- Collect contact information from learners
- Persistent video lessons
- Videos are accessible
- Include written transcripts
- Include closed-captioning
- Materials are graduate level, research scientist-focused
- Materials available at https://training.nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Assessment
- Elicited user feedback in the webpage interface
- Analysis of web analytics to determine user engagement
We will develop online training materials for biomedical scientists who want to analyze data in the cloud. Many future NIH Common Fund plans for large-scale data analysis rely on analyzing the data on remotely hosted cloud platforms, be they commercial clouds such as Amazon Web Services and Google Cloud Platform (GTEx, Kids First) or on-premise hosting systems like the Pittsburgh Supercomputing Center (HuBMAP). Working in these systems involves several different technologies for data upload, automated workflows, and statistical analysis/visualization on remote platforms.
Since most biomedical scientists have little or no training in these areas, they will need substantial support to take advantage of cloud computing platforms to do large scale data analysis.
We have run a pilot workshop on connecting to AWS and running BLAST analyses there, and we anticipate providing materials for several more workshops on cloud bioinformatics in 2021. These workshops consist of a number of different pieces, including user-led tutorials, custom-made video lessons, virtual forums, and live teaching sessions. The exact timeline and number of workshops will depend on how we recruit participants and how quickly our lesson development proceeds.
On our training website, we have tutorials on setting up and connecting to a virtual computer using Amazon Web Services (AWS) and conducting basic bioinformatic analyses on AWS (genome-wide association study, BLAST sequence similarity analysis). We also have a tutorial on workflow management with Snakemake that runs in a Google Cloud Platform compute environment (Binder). We plan to add tutorials for setting up and connecting to the Google Cloud Platform in 2021.
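To give a sense of the level these lessons are pitched at, the sketch below shows a minimal BLAST run of the sort the AWS tutorial walks through, wrapped in Python purely for illustration. It assumes NCBI BLAST+ is installed on the instance; the file names (ref.fa, query.fa, results.tsv) are placeholders, not files from the actual lesson.

```python
# Minimal sketch: build a BLAST database and run a nucleotide search.
# Assumes NCBI BLAST+ (makeblastdb, blastn) is on the PATH; ref.fa and
# query.fa are placeholder FASTA files, not materials from the lesson.
import subprocess

# Build a nucleotide BLAST database from a reference FASTA
subprocess.run(
    ["makeblastdb", "-in", "ref.fa", "-dbtype", "nucl", "-out", "refdb"],
    check=True,
)

# Search the query sequences against the database, writing tabular
# (outfmt 6) results suitable for downstream filtering
subprocess.run(
    ["blastn", "-query", "query.fa", "-db", "refdb",
     "-out", "results.tsv", "-outfmt", "6", "-evalue", "1e-5"],
    check=True,
)
```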
For workflows, there are two primary workflow systems in use: WDL (used by Terra) and CWL (used by Cavatica). At least one of these (and sometimes both) is supported by every CF program that uses cloud workflow systems. We will develop initial data-analysis training materials, aimed at biomedical scientists, for making use of these workflow systems, based on our existing workflow materials.
For statistics/visualization, two analysis systems, R/RStudio and Python/Jupyter, are used by almost all of the CF programs. We already have in-person training material for these systems, and will adapt it to online delivery.
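As a concrete illustration of the scope of these lessons, the sketch below shows the kind of exploratory analysis a Python/Jupyter session might walk through. The file and column names (expression.tsv, metadata.tsv, tissue) are hypothetical placeholders, not files from an actual CF data set.

```python
# Minimal sketch of a Jupyter-style exploratory analysis: load a
# gene-by-sample expression table plus sample metadata and plot one
# gene's expression across tissues. File/column names are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

expr = pd.read_csv("expression.tsv", sep="\t", index_col=0)  # genes x samples
meta = pd.read_csv("metadata.tsv", sep="\t", index_col=0)    # samples x attributes

gene = "GAPDH"  # any gene (row label) present in the expression table
df = meta.join(expr.loc[gene].rename("expression"))

# Compare the gene's expression distribution across tissue types
df.boxplot(column="expression", by="tissue", rot=90)
plt.ylabel("normalized expression")
plt.tight_layout()
plt.show()
```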
- Persistent, user-led walkthrough documents
- Accompanying short videos of difficult sections
- Materials are graduate level, research scientist-focused
- Materials available at https://training.nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Self-Guided Materials Assessment
- Elicited user feedback in the webpage interface
- Analysis of web analytics to determine user engagement
- Instructor Guided Materials Assessment
- Materials contain breaks for checking understanding/formative assessment
- Pre-training surveys on prior knowledge of data sets and techniques, specific learning goals, and self-confidence
- Post-training surveys on improved knowledge, learning goals, tutorial format and content, and use case gaps in the training materials
- Conduct remote interviews with learners both before and after training
- Secure any approvals for human data collection
- Collect contact information from learners
- Persistent video lessons
- Videos are accessible
- Include written transcripts
- Include closed-captioning
- Materials are graduate level, research scientist-focused
- Materials available at https://training.nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with community best practices
- Assessment
- Elicited user feedback in the webpage interface
- Analysis of web analytics to determine user engagement
As the CFDE community grows, there will be an increased need for training materials that guide members on how to work within the Ecosystem. These resources will cover a broad array of topics related to the CFDE project management infrastructure, including GitHub, ZenHub, Google, groups.io, Slack, and onboarding. We also need to offer tutorials that guide new members on how to update and improve the Ecosystem, such as how to make changes to CFDE-owned websites, how to create and join working groups, how to interact with the CFDE search portal as a DCC collaborator, and how to upload new data to the CFDE portal.
Currently, we have a variety of materials for internal training, housed either publicly on the training site or in member-only portions of GitHub and our Google Drive. Our public training site covers topics such as how to use GitHub branches, edit CFDE websites from GitHub, and contribute to the training website, as well as training on how to format data for inclusion in the CFDE Portal. The site also hosts a style guide to ensure consistency across lessons on the training website. The style guide includes documentation for required lesson components and optional resources (vidlets, binder compute environments, demo GitHub repos), as well as format guidelines and template files. Internal resources, such as how to create and report on working groups, how to interact with our project management system, and how to complete NIH reporting requirements, are stored in non-public-facing spaces but are readily available to onboarded members of the CFDE. All of these resources are continually updated, and new tutorials are created whenever a member of the Ecosystem needs them.
- Persistent, user-led walkthrough documents
- Lessons align with CFDE best practices
- Materials are layperson level
The CFDE portal is still in alpha release and not yet available to the general public; however, in 2021 we expect the portal to be more widely publicized and to move into more widespread use. As the portal accumulates more users, there will be a greater need for trainings specific to this CFDE resource. We also anticipate that as usage increases and new use cases are discovered, the functionality of the portal will grow and change, requiring additional training resources.
We have created two lessons that demonstrate how to extract a manifest containing subsets of DCC metadata from multiple CF programs in the CFDE data portal; the manifest can then be used to search for the data at the originating CF data portal.
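To illustrate what learners might do with such a manifest, the sketch below summarizes a downloaded manifest file before retrieving the underlying data from the originating program's portal. The file name and column names are assumptions for illustration only, not the portal's exact export schema.

```python
# Minimal sketch: summarize a file manifest exported from the CFDE portal.
# "cfde_manifest.tsv" and the column names below are assumed placeholders;
# the real export schema may differ.
import pandas as pd

manifest = pd.read_csv("cfde_manifest.tsv", sep="\t")

# How many files come from each CF program, and in which formats?
print(manifest.groupby(["project", "file_format"]).size())

# Approximate total data volume, assuming a per-file size column in bytes
print("Total size (GB):", manifest["size_in_bytes"].sum() / 1e9)
```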
- Persistent, user-led walkthrough documents
- Accompanying short videos of difficult sections
- Materials are graduate level, research scientist-focused
- Materials available at https://training.nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with DCC best practices
- Self-Guided Materials Assessment
- Elicited user feedback in the webpage interface
- Analysis of web analytics to determine user engagement
- Instructor Guided Materials Assessment
- Materials contain breaks for checking understanding/formative assessment
- Pre-training surveys on prior knowledge of data sets and techniques, specific learning goals, and self-confidence
- Post-training surveys on improved knowledge, learning goals, tutorial format and content, and use case gaps in the training materials
- Conduct remote interviews with learners both before and after training
- Secure any approvals for human data collection
- Collect contact information from learners
- Persistent video lessons
- Videos are accessible
- Include written transcripts
- Include closed-captioning
- Materials are graduate level, research scientist-focused
- Materials available at https://training.nih-cfde.org web site, under CC0 or CC-BY licenses
- Lessons align with CFDE best practices
- Assessment
- Elicited user feedback in the webpage interface
- Analysis of web analytics to determine user engagement
In tandem with the specific workshops above, we will engage with biomedical scientists who are interested in reusing CF data. We will include members of the CF communities, biomedical scientists who attend our training sessions, and biomedical scientists recruited via social media, both for targeted discussions and to build an online forum. The discussions will be used to inform future use case development for data analysis and integration, as well as continuing training engagement. GTEx in particular is in close contact with their end user community, and has suggested that their user base would be available for kickstarting this engagement.
Although Common Fund programs have many high-level goals in common, they each have distinct user bases, specific mandates, and areas of expertise. As such, many of the people who work at these programs know little or nothing about the other programs. Because one goal of the CFDE is to foster cross-DCC collaboration on scientific projects, we are working to engage these programs with each other, and to help them find common goals and shared interests.
Starting in September 2020, we began hosting weekly Cross Pollination events with DCCs to introduce the CFDE portal, discuss data harmonization, and allow conversation between Common Fund programs. These events are continuing on a monthly basis from December 2020 through 2021, with most talks given by DCCs.
In addition to Cross Pollination events, we have also created a framework for DCCs to create interest-based working groups. These working groups will allow member DCCs to guide important CFDE decisions, such as which terms should be included in the CFDE portal and how to harmonize them across groups. To facilitate broad consensus building, we have also created a Request For Comments (RFC) system that allows any working group to write a short description of a standard technology or tool that they would like to see used broadly within the CFDE, so that it can be distributed to all members for consideration.
- Online community space for learner engagement
- Formal registration system
- Code of Conduct
- Moderator Group
- DCC and other expert volunteers to answer questions
- Promotion of materials
- Promotion of online community
- Open to all Common Fund programs
- Formal registration system
- Code of Conduct
- Promotion of materials
- Promotion of online community
The goals of our assessment program are to improve our training and outreach offerings while simultaneously improving the teaching techniques of our instructors. We will accomplish this by iteratively trying, adopting, and assessing new training technologies and methods to improve specific trainings as well as our overall training program and technology platforms. In addition to a variety of surveys, we plan to conduct open-ended interviews with learners in late 2021. These interviews will leave space for open-ended conversation to surface new challenges, unmet training needs, positives and negatives of current training efforts, and issues not covered by surveys.
We will explore a number of technologies to measure within-lesson engagement and to do formative assessment. While asynchronous online training challenges traditional “stop-and-quiz” approaches, low-stakes multiple-choice quizzes can be incorporated into online lessons easily and provide valuable feedback to learners and trainers. Faded examples that learners can fill in on their own time and submit via a common interface can be used to provide feedback asynchronously. More dynamic documentation, supporting both quizzes and executable code, could be used to provide engaging exercises. However, all of these require experimentation and evaluation to determine which choices work best within the context of the platforms we choose to host videos and tutorials. We will also assess overall confidence metrics, covering both “is this training potentially relevant/useful based on its description” and learners’ self-confidence in carrying out bioinformatics analyses. See https://carpentries.org/assessment/ for some examples. This experimentation is an ongoing part of our training work, and will be reviewed in the periodic 2021 training assessments.
In our first few pilot workshops, we have used within-training checkpoints, pre- and post-training surveys, and live observation evaluations of lessons to gather information from our learners. While our results are still limited, we have used this information to make changes for re-running the pilots in 2021.
- For online workshops (now in pilot phase)
- Number of people that show initial interest in training (in progress)
- Number of return trainees within a lesson
- Number of return trainees across lessons
- Number of trainees that indicate interest in additional as-yet-undeveloped training events
- For web sites/documentation
- Site visit metrics
- Page visit metrics
- For forums
- Number of registrations
- Number of logins
- Number of posts
- Number of repeat engagements (e.g. follow ups to posts)
- For videos
- Video watch statistics
- Video completion statistics
- Web site hosting stats