% Template for PLoS
% Version 1.0 January 2009
%
% To compile to pdf, run:
% latex plos.template
% bibtex plos.template
% latex plos.template
% latex plos.template
% dvipdf plos.template
\documentclass[10pt]{article}
% amsmath package, useful for mathematical formulas
\usepackage{amsmath}
% amssymb package, useful for mathematical symbols
\usepackage{amssymb}
% graphicx package, useful for including eps and pdf graphics
% include graphics with the command \includegraphics
\usepackage{graphicx}
% cite package, to clean up citations in the main text. Do not remove.
\usepackage{cite}
\usepackage{color}
% Use doublespacing - comment out for single spacing
%\usepackage{setspace}
%\doublespacing
% Text layout
\topmargin 0.0cm
\oddsidemargin 0.5cm
\evensidemargin 0.5cm
\textwidth 16cm
\textheight 21cm
% Bold the 'Figure #' in the caption and separate it with a period
% Captions will be left justified
\usepackage[labelfont=bf,labelsep=period,justification=raggedright]{caption}
% Use the PLoS provided bibtex style
\bibliographystyle{plos2009}
% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% Leave date blank
\date{}
\pagestyle{myheadings}
%% ** EDIT HERE **
%% ** EDIT HERE **
%% PLEASE INCLUDE ALL MACROS BELOW
%% END MACROS SECTION
\begin{document}
% Title must be 150 characters or less
\begin{flushleft}
{\Large
\textbf{openSNP - Crowdsourcing Genome Wide Association Studies}
}
% Insert Author names, affiliations and corresponding author email.
\\
Bastian Greshake$^{1,\ast}$,
Philipp Bayer$^{2}$,
Fabian Zimmer$^{3}$,
Julia Reda$^{4}$
\\
\bf{1} Goethe University, Frankfurt am Main, Germany
\\
\bf{2} Bond University, Gold Coast, Australia
\\
\bf{3} Westf\"alische Wilhelms Universit\"at, M\"unster, Germany
\\
\bf{4} Johannes-Gutenberg University, Mainz, Germany
\\
$\ast$ E-mail: [email protected]
\end{flushleft}
% Please keep the abstract between 250 and 300 words
\section*{Abstract}
Genome-wide association studies (GWAS) are a cheap and quick way to assess health risks by comparing Single Nucleotide Polymorphisms (SNPs) between groups of participants. Direct-To-Consumer (DTC) companies like 23andMe offer to genotype their customers' SNPs and provide an evaluation of each customer's genetic risks. However, the data that 23andMe and other companies generate are not accessible to other scientists, and the companies withhold some information from their customers for various reasons. In this paper, we present an open approach to GWAS by introducing openSNP, a web platform which allows customers of DTC genetic testing to openly share their SNPs with scientists for free. % misses survey
% Please keep the Author Summary between 150 and 200 words
% Use first person. PLoS ONE authors please skip this step.
% Author Summary not valid for PLoS ONE submissions.
\section*{Author Summary}
\section*{Introduction}
Genome-wide association studies (GWAS) are a comparatively easy and cheap way to find Single Nucleotide Polymorphisms (SNPs) of potential medical relevance. SNPs found through GWAS can be used to identify candidate genes for closer inspection or to predict disease risks. GWAS statistically compare the alleles of patients to the alleles of healthy controls. The method therefore cannot identify causal differences, only correlations. The first GWAS was published in 2005 and compared patients with age-related macular degeneration to a healthy control group (doi:10.1126/science.1109557). Since then the number of participants in such studies has been rising; over 1200 GWAS have been performed (doi:10.1186/1471-2350-10-6), and over 5000 SNPs have been linked to different diseases and traits in those studies. %(http://www.genome.gov/page.cfm?pageid=26525384&clearquery=1#result_table).
Since 2006 companies like 23andMe, deCODEme or FamilyTreeDNA have offered Direct-To-Consumer (DTC) genetic testing. These companies use DNA microarrays to screen for around 1 million SNPs spread over the human genome. In return customers get an analysis of the results, as well as a raw data file that lists the SNP IDs and the customer's respective alleles. In 2011 23andMe alone had over 100,000 customers\footnote{http://spittoon.23andme.com/2011/06/15/23andme-2011-state-of-the-database-address/}. The company recognizes the potential to perform GWAS with that amount of data and uses surveys to ask its customers about traits and diseases. With the consent of the customer, these data are used for association studies. 23andMe has published several articles in which it replicates known findings but also reports new associations for Parkinson's Disease \cite{Eriksson2010, Do2011}. Over 30,000 23andMe customers participated in those association studies.
Although companies like 23andMe are willing to contribute to science, it is not easy for individual scientists to access the data. This is mainly due to the privacy concerns of the customers. Nevertheless, there are individual customers who willingly share their data. Most do so by uploading their data to their personal website or to open software repositories like GitHub. While this makes it possible for scientists to access the data, it requires a lot of work to keep track of all new genotyping data that become available to the public. While projects like SNPedia try to keep track of all the files, this still does not make it possible to perform GWAS, as the phenotypic information is not attached to the genetic information. Projects that attach the phenotype to the genetic information, like the Personal Genome Project, still do not allow for easy reuse of the data, as they lack an application programming interface (API) or other methods by which researchers could download the data.
A possible solution is a community-driven platform that aggregates genetic and phenotypic information of people who are willing to share their data with the general public and have given their informed consent. We designed a survey to assess interest in such a crowdsourcing platform, in which we asked participants whether they would be willing to share their genetic and phenotypic information with the public. Additionally, we built a platform which allows customers of DTC genetic testing to publish their genetic and phenotypic information and gives researchers multiple ways to reuse the data.
% Results and Discussion can be combined.
\section*{Results}
\subsection*{Survey on Sharing Genetic Information}
In total, 229 people participated in the survey, 180 with a self-reported chromosomal sex of XY and 56 with a self-reported chromosomal sex of XX. The mean age of the participants is 33 (SD = 11.29) and over 81.7 \% reported their ethnicity as Caucasian. 39.7 \% of the participants are already customers of at least one DTC genetic testing company and a further 30.1 \% plan to become one in the future. 29.7 \% do not plan to become a DTC customer. There is no significant difference in the usage of DTC companies between chromosomal sexes (Somers' d).
67.7 \% of all participants would share their data with their DTC company without any constraints, and 25.8 \% would do so if the company does not share the data with third parties. 6.6 \% of the participants would not share their data. There is no significant difference in sharing habits between the chromosomal sexes (Somers' d). Those who are customers of a DTC company or are planning to become one in the future are more likely to share their results than those who do not plan to get themselves genotyped (Somers' d).
There are significant differences, tested by Tukey's HSD test, between those people who are already genotyped and those who do not plan to get genotyped: the first group is more likely to agree to share their information because they want to help scientists (mean difference = 0.465, SE = 0.128, $p = 0.001$), because they think of personal benefits (mean difference = 0.448, SE = 0.183, $p = 0.04$) and because they are curious (mean difference = 1.159, SE = 0.193, $p < 0.001$).
On the other hand, those people who are not planning to get genotyped are more likely to not share their data because they agree that they fear discrimination (mean difference = 1.060, SE = 0.195, $p < 0.001$), because they agree that they feel it is a breach of their privacy (mean difference = 0.821, SE = 0.225, $p = 0.001$), because they agree that they fear negative consequences for their family (mean difference = 0.733, SE = 0.21, $p = 0.002$) or because they agree that they fear personalized advertising (mean difference = 0.848, SE = 0.208, $p < 0.001$).
Similarly, those people who would share data with their DTC provider are more likely to agree on sharing the data because they want to help scientists (mean difference = 1.57, SE = 0.199, $p < 0.001$), because they think of personal benefits (mean difference = 0.951, SE = 0.308, $p = 0.006$), and because they are curious (mean difference = 1.99, SE = 0.321, $p < 0.001$).
Those participants who are not planning to get genotyped are more likely to agree to not share their data because they fear discrimination (mean difference = 1.52, SE = 0.322, $p < 0.001$), because they feel it is a breach of their privacy (mean difference = 1.871, SE = 0.324, $p < 0.001$), because they fear consequences for their family (mean difference = 1.146, SE = 0.32, $p = 0.001$) and because they fear personalized advertising (mean difference = 1.112, SE = 0.357, $p = 0.006$).
\subsection*{openSNP}
We created openSNP, a website which allows users to upload their genotyping results from the companies 23andMe, deCODEme and FamilyTreeDNA under the Creative Commons Zero license, which, in accordance with the Panton Principles (doi:10.1371/journal.pbio.1001195), allows a complete reuse of the data without any constraints. Users are encouraged to list as many phenotypes as possible through a simple achievement system, which rewards users who upload their data and enter phenotypic information with small badges that are shown on their profile page.
The possible answers, i.e. the variations of a single phenotype, are not limited, and every user can add completely new phenotypes if a corresponding question is lacking. To reduce the amount of manual data curation, openSNP tries to prevent the same phenotype or variation from being entered with slightly different spellings: while users enter data, an autocompletion feature lists similar entries which are already in the openSNP database.
To access the data, users can download single genotyping files of specific users, get archives of multiple genotyping files grouped by phenotypic variation, or use a single download that includes all genotyping files and all phenotypes in a comma-separated table. Additionally, users can access the genetic data through the Distributed Annotation System (DAS), which makes it possible to retrieve all data for specific chromosomes and for specific positions on single chromosomes.
Between the start of openSNP on 27 September 2011 and 18 December 2011, 214 people signed up with openSNP, and 79 of those uploaded their genotyping files. As a result, the openSNP database lists 69,486,471 SNPs, which are distributed over 1,938,604 unique Rs-IDs. In the same time frame all users combined entered 675 variations, which are distributed over 47 different phenotypes. See figure n for the data acquisition over time.
The mean number of users who have entered their variation for a single phenotype is 14.36 (SD = 12.65), and the median is 10. The distribution of how many users have entered their data per phenotype can be seen in figure n+1. The phenotype provided by the most users is eye color, which has been provided by 54 different users. Two phenotypes have only been provided by a single user: the SAT Writing score and triglyceride levels.
A total of 15,229 documents relevant to the SNP IDs listed in openSNP could be found in the databases of Mendeley, the Public Library of Science and SNPedia. Of the primary literature, 25 \% is published in Open Access journals and can be accessed free of charge by every user (Figure n+2). For usability reasons SNPs are ranked by the amount of information gathered from the external services.
The external services themselves are ranked by how easily users can access information from these sources. The SNPedia entries are given the highest weight, as those are already manually curated, followed by open access publications from the Public Library of Science. The lowest weight is given to the Mendeley results, as those are not necessarily freely available to every user. An SNPedia entry is valued 2.5 times as high as a PLoS publication and 5 times as high as a Mendeley entry.
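
As a rough sketch of how such a weighting could enter the ranking, assume that the score $s$ of an SNP is a weighted sum of the number of entries found per service. The symbols $n_{\mathrm{SNPedia}}$, $n_{\mathrm{PLoS}}$ and $n_{\mathrm{Mendeley}}$, the linear form and the normalization of the Mendeley weight to 1 are illustrative assumptions; only the ratios 2.5 and 5 are taken from the weighting described here.
\begin{equation*}
s = 5\, n_{\mathrm{SNPedia}} + 2\, n_{\mathrm{PLoS}} + 1\, n_{\mathrm{Mendeley}}
\end{equation*}
Under these assumptions a hypothetical SNP with one SNPedia entry, three PLoS publications and four Mendeley entries would score $5 + 6 + 4 = 15$, while one with no SNPedia entry, two PLoS publications and ten Mendeley entries would score $0 + 4 + 10 = 14$.
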
\section*{Discussion}
% You may title this section "Methods" or "Models".
% "Models" is not a valid title for PLoS ONE authors. However, PLoS ONE
% authors may use "Analysis"
\section*{Materials and Methods}
\subsection*{Survey on Sharing Genetic Information}
The survey was conducted with Google Docs and was distributed to possible participants through the 23andMe community, the DIYBiology mailing list, blogs which focus on genetics and DTC genetic testing, and social media like Twitter, Google Plus and Facebook. Because of this, the results are biased in that customers of DTC genetic testing are overrepresented. As we wanted to find out how people who have purchased DTC genetic testing feel about sharing their data with third parties, this does not lead to further problems.
The survey included questions on the age, chromosomal sex and ethnicity of the participants. Additionally, it asked whether they are already a customer of a DTC company, are planning to become one or do not plan to become one. Those who are already customers were also asked whether they already share their genetic and phenotypic data. All participants were asked whether they would share their genetic or phenotypic information with their DTC company; possible answers were ``Yes'', ``Yes, but only if they did not share my medical information with anybody else'' and ``No''.
The survey also included scale questions which measured how strongly participants agree or disagree with different reasons to share or not to share their information with third parties. The scale ranged from 1 = strongly disagree through 3 = neutral to 5 = strongly agree. Motivations given to share data were ``because you want to help scientists with their research'', ``because of possible personal benefits (e.g. getting treatments for a disease you have, possibility of new medication, etc.)'', ``because it may deliver advertising that is relevant to me'' and ``out of curiosity''. Motivations given not to share data were ``because advertisers could use the information for targeted campaigns'', ``because of possible negative consequences for closely related persons'', ``because of the breach of your privacy'' and ``because of the fear of discrimination (e.g. by the employer, the state, some insurance company)''. Additionally, participants could give their own reasons to share or not to share their data.
The survey data were analyzed with SPSS 19.
\subsection*{Technical realization of the platform}
The main platform is implemented using the web framework Ruby on Rails. PostgreSQL is used as the main database backend for Rails. The database stores genotyping results, phenotypic information of the users, literature results from Mendeley and the Public Library of Science, and summaries on SNPs which can be found in SNPedia. The literature database of Mendeley is queried using its REST API, which delivers results in JSON. The literature database of the Public Library of Science is queried using its respective REST API, which delivers results in an XML format. Summaries on SNPs are provided by SNPedia and are retrieved through the MediaWiki API. All databases are queried using the unique identifier of each SNP as the search term.
SNPs are cataloged by their unique identifier, which consists of a prefix (mostly \textit{rs}) and a unique number. This is a common format which is used by the NCBI dbSNP database, is widely used and is easily parsed from different literature sources. Publications from the different databases and the genotypes of the users are associated with single SNPs by the Rs-ID. Allele and genotype frequencies are updated regularly, based on the data present in openSNP.
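
For illustration, the frequencies for a single SNP can be computed as follows. This is a sketch under the assumptions of a diploid, autosomal, biallelic SNP with alleles $A$ and $B$ and no missing calls; the symbols and the example counts are hypothetical and are not taken from the openSNP database. With genotype counts $n_{AA}$, $n_{AB}$ and $n_{BB}$ among $N = n_{AA} + n_{AB} + n_{BB}$ users,
\begin{align*}
f_{AA} &= \frac{n_{AA}}{N}, & p_{A} &= \frac{2 n_{AA} + n_{AB}}{2N}, & p_{B} &= 1 - p_{A}.
\end{align*}
For example, $n_{AA} = 40$, $n_{AB} = 30$ and $n_{BB} = 10$ give $N = 80$, $f_{AA} = 0.5$ and $p_{A} = 110/160 = 0.6875$.
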
Processes with a longer runtime, such as parsing the genotyping results, creating archives of results which are mailed to users and queries to external resources, are handled using the Ruby gem Resque and a Redis server. Search features on the platform itself are implemented using Solr and the Ruby gem Sunspot. Additionally, data can be requested from openSNP using the Distributed Annotation System. The data for this is stored in a MySQL database, and the data is delivered by ProServer, which was modified by Gel et al. for use in easyDAS.
A flowchart of all services incorporated in openSNP and of how users can upload or access the data is given in figure n+3. The source code of openSNP is published under Creative Commons BY-SA 3.0 and can be downloaded at http://github.com/gedankenstuecke/snpr. The genetic and phenotypic data are licensed under Creative Commons Zero.
% Do NOT remove this, even if you are not including acknowledgments
\section*{Acknowledgments}
%\section*{References}
% The bibtex filename
\bibliography{papers}
\section*{Figure Legends}
%\begin{figure}[!ht]
%\begin{center}
%%\includegraphics[width=4in]{figure_name.2.eps}
%\end{center}
%\caption{
%{\bf Bold the first sentence.} Rest of figure 2 caption. Caption
%should be left justified, as specified by the options to the caption
%package.
%}
%\label{Figure_label}
%\end{figure}
\section*{Tables}
%\begin{table}[!ht]
%\caption{
%\bf{Table title}}
%\begin{tabular}{|c|c|c|}
%table information
%\end{tabular}
%\begin{flushleft}Table caption
%\end{flushleft}
%\label{tab:label}
% \end{table}
\end{document}