Similarity-based Prediction of Crystallization Conditions
Knowledge of the three-dimensional structure of a protein is important for understanding and influencing its biological functions. The most common technique for determining the structure of a protein is X-ray crystallography, but obtaining high quality crystals for X-ray diffraction remains the bottleneck of this technique. Predicting chemical conditions that would produce high quality crystals from the amino acid sequence of a protein would considerably speed up this process. Previous efforts have attempted to determine the optimal pH for crystallization from a protein's pI, predict a list of chemical conditions for crystallization from a protein's amino acid sequence, or predict the crystallization outcome from a set of chemical conditions on a small (single-protein) scale. However, to our knowledge, no attempt has been made to predict the optimal technique, pH, temperature, chemical conditions, and concentrations of those chemical conditions all together solely from a protein's amino acid sequence. Here, we present and evaluate SPreCC, a method of predicting categorical, continuous, and presence/absence information for crystallization experiments based on amino acid sequence similarity.
We will present four models for different types of crystallization information. The first will be a model for predicting the presence/absence status of chemical conditions such as sodium chloride or tris. The second will be a model for predicting categorical values, which will be applied to predicting the optimal crystallization technique. The third will be a model for predicting continuous variables like the molarity of sodium chloride, the percent volume per volume of dimethyl sulfoxide, or the pH. The fourth will be a model for simultaneously predicting concentrations and polymer lengths for chemical conditions like polyethylene glycol (PEG). A single condition might use multiple models; for example, sodium chloride uses both the presence/absence model for predicting whether it should be in the crystallization mixture and the concentration model for predicting its optimal concentration if it is present. Sequence similarities will be determined using Mash, and "the database" will refer to proteins in the train set (see data curation) with the relevant data available.
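As a rough illustration of how these similarity lookups can be performed, the sketch below shells out to Mash and parses its tab-separated output; the file names, flags, and p-value cutoff are illustrative assumptions rather than the exact settings used here.

```python
# Minimal sketch (not the exact pipeline): sketch the train-set sequences with
# Mash, then query a new protein against them and keep hits passing a p-value cutoff.
import subprocess

P_VALUE_CUTOFF = 1e-10  # hypothetical threshold for calling a database protein "similar"

# Sketch each sequence in a multi-FASTA of train proteins (-i: per sequence, -a: amino acids).
subprocess.run(["mash", "sketch", "-i", "-a", "-o", "train_db", "train_proteins.faa"], check=True)
subprocess.run(["mash", "sketch", "-i", "-a", "-o", "query", "query_protein.faa"], check=True)

def similar_proteins():
    """Return {train_protein_id: approximate identity} for hits passing the p-value cutoff."""
    out = subprocess.run(["mash", "dist", "train_db.msh", "query.msh"],
                         capture_output=True, text=True, check=True).stdout
    hits = {}
    for line in out.splitlines():
        ref_id, query_id, distance, p_value, shared_hashes = line.split("\t")
        if float(p_value) <= P_VALUE_CUTOFF:
            # Treat 1 - Mash distance as a rough sequence identity score.
            hits[ref_id] = 1.0 - float(distance)
    return hits
```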
Let
where
The specified model enables the fitting of two parameters:
Because of the memory requirements involved in manipulating all the amino acid identity scores at once, we will use stochastic gradient descent to pick a protein at random, determine its amino acid identity against all the proteins in the database, predict its binary value, and update the weights according to the loss. With a learning rate
To achieve an initially plausible weight scheme, the following initializations will be used:
Plot 1: Initialization weights versus sequence identity for the presence/absence model.
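For intuition, one stochastic gradient descent step of this kind might look like the sketch below; the logistic weighting function, its two parameters, and the squared-error loss are assumptions chosen for illustration rather than the exact forms fit here.

```python
# Minimal sketch of one SGD step for the presence/absence model: predict the
# sampled protein's label as a similarity-weighted average of database labels,
# then update the two weighting parameters from the loss gradient.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(identities, labels, target, a, b, lr):
    """identities: sequence identities of database proteins to the sampled protein;
    labels: 0/1 presence of the chemical for those proteins; target: 0/1 truth."""
    w = sigmoid(a * (identities - b))             # weight each database protein (illustrative form)
    pred = np.sum(w * labels) / np.sum(w)         # weighted-average prediction
    resid = pred - target                         # residual for the squared-error loss
    dw_da = w * (1 - w) * (identities - b)        # exact partial derivatives of the weights
    dw_db = w * (1 - w) * (-a)
    dpred_da = np.sum(dw_da * (labels - pred)) / np.sum(w)
    dpred_db = np.sum(dw_db * (labels - pred)) / np.sum(w)
    return a - lr * 2 * resid * dpred_da, b - lr * 2 * resid * dpred_db

# Toy usage with three database proteins (values purely illustrative).
identities = np.array([0.95, 0.60, 0.40])
labels = np.array([1.0, 1.0, 0.0])
a, b = sgd_step(identities, labels, target=1.0, a=10.0, b=0.5, lr=0.05)
```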
The learning rate will start as
Let
where
where
The specified model enables the fitting of two parameters:
Because of the memory requirements involved in manipulating all the amino acid identity scores at once, we will use stochastic gradient descent to pick a protein at random, determine its amino acid identity against all the proteins in the database, predict its class, and update the weights according to the loss. With a learning rate
To achieve an initially plausible weight scheme, the following initializations will be used:
Let
Intuitively, this is a kernel density estimate combining the distribution of the variable for all proteins and the distribution of the variable for only proteins similar to the protein of interest. However, two issues arise: not all of these similar proteins are equally similar, so they should not be weighted equally, and the optimal bandwidth
where
with
(In the implementation, all values
The specified model enables the fitting of three parameters:
However, we are actually interested in the integrals of these quantities, which are obtained with the Leibniz integral rule:
Because of the memory requirements involved in manipulating all the amino acid identity scores at once, we will use stochastic gradient descent to pick a protein at random, determine its amino acid identity against the proteins in the database, compute its density function, and update the weights according to the loss. With a learning rate
The partial derivative of the density function with respect to each parameter will be computed exactly, but the Leibniz integrals will be approximated with a left Riemann sum and a
By linearity of expectation, the expectation of
The mode of the distribution can be found by evaluating the probability density function (PDF) from
with
We then take the
To achieve an initially plausible bandwidth scheme, the following initializations will be used:
Plot 2: Initialization bandwidths versus sequence identity for the continuous variable model.
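As a rough sketch of the idea (not the fitted model), a similarity-weighted Gaussian kernel density estimate with a similarity-dependent bandwidth might look like the following; the bandwidth schedule, the use of similarity itself as the kernel weight, and the evaluation grid are illustrative assumptions.

```python
# Minimal sketch: mixture of Gaussians centered on the values observed for
# database proteins, with narrower kernels for more similar sequences; the
# mode of the grid-evaluated density serves as the point prediction.
import numpy as np
from scipy.stats import norm

def bandwidth(identity, overall_sd):
    # Hypothetical schedule: shrink the bandwidth as sequence identity rises.
    return overall_sd * (1.5 - identity)

def predicted_density(grid, values, identities, overall_sd):
    density = np.zeros_like(grid)
    for v, s in zip(values, identities):
        density += s * norm.pdf(grid, loc=v, scale=bandwidth(s, overall_sd))
    return density / np.trapz(density, grid)     # normalize so the density integrates to 1

# Toy example: predict a sodium chloride concentration from three similar proteins.
grid = np.linspace(0.0, 4.0, 1001)                # molarity grid (illustrative)
values = np.array([0.2, 0.25, 1.0])               # concentrations used by similar proteins
identities = np.array([0.95, 0.90, 0.55])         # their sequence identities
density = predicted_density(grid, values, identities, overall_sd=0.6)
mode = grid[np.argmax(density)]                   # point prediction
```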
The learning rate will start as
Let
Intuitively, this is a kernel density estimate weighting together the distribution of the variable for all proteins and the distribution of the variable for only proteins similar to the protein of interest. However, two issues arise: not all of these similar proteins are equally similar, so they should not be weighted equally, and the optimal bandwidths
for
with
(As before, all values
The specified model enables the fitting of six parameters:
However, we are actually interested in the integrals of these quantities, which are obtained with the Leibniz integral rule:
Because of the memory requirements involved in manipulating all the amino acid identity scores at once, we will use stochastic gradient descent to pick a protein at random, determine its amino acid identity against the proteins in the database, compute its density function, and update the weights according to the loss. With a learning rate
The partial derivative of the density function with respect to each parameter will be computed exactly, but the Leibniz integrals will be approximated with a left Riemann sum and a
By linearity of expectation, the expectation of
The mode of the distribution can be found by evaluating the PDF from
To achieve an initially plausible bandwidth scheme, the following initializations will be used for
Crystallization condition data was downloaded from the Protein Data Bank (PDB) on July 26, 2022 using PyPDB, and any proteins with crystallization data available were downloaded in CIF format from the PDB, resulting in 160,136 initial proteins. Because many CIF files contain multiple amino acid chains (from proteins crystallized as dimers or complexes), we extracted the longest amino acid chain from each file, and we deduplicated any pairs of entries whose sequences and metadata free text matched exactly. We found that there were often dozens of proteins deposited with identical sequences and free-text crystallization descriptions, usually from papers crystallizing a protein multiple times to evaluate different binding interactions. This deduplication step removed 27,154 proteins (17% of the original proteins) for a final set of 132,982 proteins. To obtain a realistic evaluation metric and further reduce the chance of train/evaluation overlap, we divided the proteins by date into train (before 2018, n=100,690), validation (2018-2019, n=16,733), and test (2020-2022, n=15,559) sets. The fact that so many identical proteins had identical crystallization information, and that (as the evaluation will show) proteins deposited around the same time are more informative of each other, suggests that previous evaluations that split all proteins randomly without any apparent deduplication likely overestimated the effectiveness of their methods.
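As a rough sketch of the deduplication and date-based split (not the exact curation code), assume the parsed PDB entries are available in a pandas DataFrame; the file and column names ("sequence", "conditions_text", "deposition_year") are hypothetical.

```python
# Minimal sketch: drop exact sequence/free-text duplicates, then split by date
# rather than randomly to limit train/evaluation overlap.
import pandas as pd

entries = pd.read_csv("pdb_crystallization_entries.csv")   # illustrative file name

entries = entries.drop_duplicates(subset=["sequence", "conditions_text"])

train = entries[entries["deposition_year"] < 2018]
validation = entries[entries["deposition_year"].between(2018, 2019)]
test = entries[entries["deposition_year"].between(2020, 2022)]
```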
Because the crystallization conditions are provided in a free text field, we further processed the conditions to obtain standardized, numerical values. Of the 116,713 proteins with a crystallization technique (e.g. vapor diffusion) provided, we grouped 116,529 (99.8%) into one of thirteen standardized categories, with the remainder falling into an "other" category. Initially, 106,220 proteins provided a pH value in the designated pH field, and we were able to extract another 7,306 values from the free text and pH range fields for a total of 113,526 pH values. In addition, 117,068 entries provided the temperature at crystallization. Of the 132,982 deduplicated proteins with any crystallization information, 115,254 had some information in their free text fields, and 103,762 of those had information that included numerical values and recognizable units. Many entries listed the same chemicals with units of "percent," "percent volume per volume," "percent volume per weight," and "percent weight per volume"; because the solvents usually had densities near that of water, these versions of "percent" were similar, and we combined them all as "percent." If the experimental setup was recorded in the Biological Macromolecule Crystallization Database (BMCD), we replaced the free text field with the BMCD's parsed version because their parsing scripts showed a higher level of reliability than ours. Otherwise, we parsed the free text field with our own scripts to produce a list of chemicals and concentrations for each entry. From these 103,762 entries, we determined the 100 most common chemicals and the concentration of each of these chemicals in each crystallization solution. Each chemical beyond this top 100 was present in 168 entries (0.16% of usable entries) or fewer. Of the usable entries, 67.3% (69,823) contained only chemicals in this top-100 list, with the remainder containing rare or sufficiently misspelled chemicals. Because most chemicals in most of the remaining entries (63.6%) were still in the top-100 list, we retained these entries for training and evaluation despite the model's inability to predict their complete sets of conditions. Finally, to ensure that mistyped concentrations did not skew our results, we excluded any instances of a chemical's concentration falling far outside its typical range. Specifically, we excluded values beyond three standard deviations of the original mean and then additionally excluded values beyond twenty standard deviations of the mean of this reduced set; this appeared to have the desired effect. (This can be verified by examining the distribution of each chemical in the figures folder.)
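The two-pass concentration filter could be implemented roughly as follows; this is a minimal sketch of one reading of the rule above (exclusions accumulate across the two passes), not the exact script used.

```python
# Minimal sketch of the two-pass outlier exclusion for one chemical's concentrations.
import numpy as np

def exclude_outliers(values):
    """values: reported concentrations of one chemical across entries."""
    # Pass 1: drop values more than 3 standard deviations from the original mean.
    keep1 = np.abs(values - values.mean()) <= 3 * values.std()
    reduced = values[keep1]
    # Pass 2: additionally drop values more than 20 standard deviations
    # from the mean/SD of the reduced set.
    keep2 = np.abs(values - reduced.mean()) <= 20 * reduced.std()
    return values[keep1 & keep2]
```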
In general, to keep as many proteins for training and evaluation as possible, we included in training and evaluation any fields with recognizable and plausible information but excluded any fields without information. For example, if a PDB entry only provided the protein's method of crystallization and pH at crystallization, we included that protein for training or evaluation on method and pH but excluded it from training or evaluation on temperature, chemical presence or absence, and chemical concentration.
The optimal crystallization condition prediction tool would be one that (1) predicts, accurately and with high precision and recall, which chemicals are necessary for crystallization, (2) predicts the optimal concentration or an optimal range of concentrations for those chemicals with a small margin of error, (3) predicts the optimal value of or an optimal range for the pH and temperature for crystallization with a small margin of error, and (4) accurately predicts the optimal crystallization method to use. We consider this last prediction the least useful because almost all proteins use vapor diffusion, and those that do not (e.g. many membrane proteins) are usually well characterized enough before X-ray crystallography for researchers to know that an alternative method will work better. However, in the spirit of making our results comparable with previous work, we will include this prediction.
Five regularization models were tested in this evaluation. (Each of these models contained as many of the above presence/absence, classification, continuous, and two-variable continuous models as necessary for its crystallization condition predictions. From here on, "model" will generally refer to one of the five regularization models tested rather than to one of the four mathematical models described above, each of which can only predict one type of data.) Model 1, the least regularized model, uses the initializations described above with
Train statistics were generated by obtaining, for each train protein, the predicted crystallization technique, the predicted presence/absence and concentration values for all chemical conditions, and the predicted polymer lengths when relevant, using weights from the fit parameters and a protein bank of all proteins in the train set excluding the protein being evaluated. Validation statistics were generated likewise, except that the weights came from the parameters fit on the relevant model's train set and the protein bank for evaluating similarity contained only proteins from the train set (i.e. validation proteins were not searched against each other). Except when calculating the capture rates of density intervals, the concentration and polymer length of polymeric chemicals were treated as separate continuous variables for evaluation.
Figure 1: Number of proteins passing the Mash p-value threshold for similarity to the target sequence.
Because the model is highly dependent on similar proteins for its predictions, it is important that the database contains enough proteins to support predictions on new sequences. While the train set usually had more similar proteins per sequence evaluated and fewer sequences with no similar proteins than the validation set, 77% of proteins in the validation set still had at least one similar protein, and most had far more (Figure 1). This suggests that in a significant majority of cases, the model will be able to generate more specific predictions than the overall means, even for proteins that are entirely novel.
Figure 2: Weighting curves from fit parameters. (Bandwidths are normalized to their conditions' standard deviations for comparison.)
As expected, in the presence/absence and classification models, the model parameters updated to give larger weights to amino acid sequences that were more similar to the target sequence (Figure 2). The fit parameters for predicting the optimal crystallization technique in Model 1 show an extreme example of this: sequences with 90% similarity to the target sequence are weighted 38.5 times as heavily as sequences with 50% similarity. A slight correlation was observed between how often a chemical or technique was used and the magnitudes of its fit parameters, but this was largely driven by a few very common chemicals and techniques with large fit parameters. In the concentration and polymer length prediction models, the bandwidths shrink with increasing sequence similarity as expected: predictive distributions should be narrower around more similar, more informative sequences. In these models, increasing regularization of the parameters leads to more uniform weights as expected, and almost no correlation was observed between the magnitudes of a chemical's parameters and how often that chemical was used.
Figure 3: Areas Under the ROC Curves and F1 statistics for predicted probabilities of inclusion from the presence/absence models for all chemical conditions. Thresholds are chosen for the presence/absence models by taking the F1-maximizing threshold from the train set for each condition. The F1 statistic weighted by class abundance (equivalent to the accuracy) is also shown for the classification model that predicts the crystallization technique. The results shown are from Model 1.
We first evaluated the ability of the model to predict the presence or absence of chemical conditions. The predicted inclusion values for each condition show considerable separability between conditions that were and were not used (Figure 3). Despite the presence/absence model's construction, the predicted value should not be taken as a probability; using a threshold of 0.5 for inclusion of a chemical condition dramatically reduces the model's recall, and therefore its F1, because very few inclusions are predicted. Instead, choosing an F1-maximizing threshold based on the train data (usually between 0.2 and 0.3) produces much better precision, recall, and F1 on both the train and validation sets. This creates noticeable overfitting to the train set, but it still improves the predictions on the validation set. Generating the AUC and F1 metrics only from proteins with similar proteins in the train set (that is, proteins with at least one train-set protein passing the Mash p-value threshold for similarity) shows that the metrics improve slightly when similar proteins are present. Interestingly, the metrics were very similar for all models, suggesting that the inclusion or exclusion of proteins is much more important than the weights assigned to them. (The metrics were so similar across models that we chose not to display them separately.) Importantly, though the individual F1 statistics are rather poor, a chemical's F1 statistic is highly correlated with the frequency with which it is used.
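For reference, a minimal sketch of selecting the F1-maximizing threshold for one chemical on the train set and applying it to the validation set, using scikit-learn's precision-recall curve; the toy arrays are purely illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

def f1_maximizing_threshold(y_true, y_score):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]   # the last precision/recall point has no threshold

# Toy illustration: threshold chosen on "train" scores, applied to "validation" scores.
train_labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
train_scores = np.array([0.05, 0.10, 0.22, 0.30, 0.28, 0.60, 0.15, 0.40])
threshold = f1_maximizing_threshold(train_labels, train_scores)

val_labels = np.array([0, 1, 1, 0])
val_scores = np.array([0.10, 0.35, 0.20, 0.25])
val_f1 = f1_score(val_labels, val_scores >= threshold)
```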
Figure 4: Overall F1, Precision, and Recall weighted by the frequency of a chemical or technique's use.
In keeping with other attempts, we report the weighted precision, recall, and F1, calculated by determining each metric for each chemical condition and crystallization technique and then averaging with weights corresponding to each condition's or technique's frequency of use (Figure 4). Because the precision, recall, and F1 are much better for common chemicals like PEG and common techniques like vapor diffusion than for uncommon chemicals and techniques, these metrics give a considerably more optimistic view of the model than Figure 3 does. While these metrics are nominally inferior to those of the one other known attempt at predicting presence/absence values for chemical conditions and techniques, we believe that study's randomized construction of train and test sets and apparent lack of experiment deduplication inflated its performance metrics, though it is not clear exactly how much inflation occurred. As before, the metrics were very similar for all models, suggesting that the inclusion or exclusion of proteins is much more important than the weights assigned to them. (As before, the metrics were so similar across models that we chose not to display them separately.)
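As a small illustration of how the weighted summary metrics are assembled, the sketch below averages per-condition metrics with weights proportional to each condition's frequency of use; the per-condition numbers are invented for the example, not results from the models.

```python
# Minimal sketch of frequency-weighted precision, recall, and F1.
import numpy as np

# {condition: (n_uses, precision, recall, f1)} -- illustrative numbers only.
per_condition = {
    "peg 3350":        (20000, 0.61, 0.55, 0.58),
    "sodium chloride": (15000, 0.52, 0.47, 0.49),
    "cadmium sulfate": (  200, 0.10, 0.05, 0.07),
}

counts = np.array([v[0] for v in per_condition.values()], dtype=float)
weights = counts / counts.sum()                       # weight by frequency of use
weighted_precision = float(np.sum(weights * np.array([v[1] for v in per_condition.values()])))
weighted_recall    = float(np.sum(weights * np.array([v[2] for v in per_condition.values()])))
weighted_f1        = float(np.sum(weights * np.array([v[3] for v in per_condition.values()])))
```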
Figure 5: Absolute error between the mode of the predicted density function and the actual concentration, polymer length, pH, or temperature. The results are reported in multiples of the variable's standard deviation for comparison. Predictions were only included for analysis if the true value of the variable was non-zero.
We next assessed the ability of the model to predict continuous values, specifically the concentration of each chemical condition, the length of a polymer, the pH, and the temperature. Of the mean, median, and mode of the predicted density function, the mode was the best predictor of the actual value by a slight margin. For almost all variables, the model was able to predict within 1 standard deviation of the true value on average (Figure 5). For example, when predicting pH, all models, whether predicting on the train or validation set, produced density functions with modes less than 1 pH unit away from the reported pH on average, even when including predictions on proteins with no similar proteins available. However, with regard to absolute error alone, the model usually performed only modestly better than much simpler baselines. For example, always predicting the train-set pH mean gave an average absolute error of 1.02 on the validation set, while Model 1's predictions yielded an average absolute error of 0.99. Likewise, always predicting the train-set mean sodium chloride concentration gave an average absolute error of 0.56 M on validation experiments that used sodium chloride, while Model 1 gave an average absolute error of 0.45 M in the same evaluation. Additionally, as expected, the model's predictions became slightly less precise when predicting on proteins with no similar proteins available.
Figure 6: Slope of the linear relationship between the true concentration, polymer length, pH, or temperature and the mode of the predicted density function. A linear model was fit for each continuous variable, and the value and p-value of the slope are reported. The horizontal red line shows p=0.05, and the vertical red line shows a fit coefficient of 0. No multiple testing correction was performed. Predictions were only included for analysis if the true value of the variable was non-zero.
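For one continuous variable, this slope analysis could be reproduced roughly as follows; the regression direction (predicted mode as a function of the true value) and the toy numbers are assumptions for illustration.

```python
# Minimal sketch: fit a line relating predicted modes to true values and report
# the slope and its p-value, as in Figure 6.
import numpy as np
from scipy.stats import linregress

true_values = np.array([5.5, 6.5, 7.0, 7.5, 8.0])        # e.g. reported pH values (illustrative)
predicted_modes = np.array([6.0, 6.8, 7.1, 7.2, 7.6])    # modes of the predicted densities
fit = linregress(true_values, predicted_modes)
print(fit.slope, fit.pvalue)   # a positive slope with a small p-value indicates responsiveness
```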
To evaluate how the model's predicted continuous values changed in response to changes in the true values, we fit a linear model with the formula
Figure 7: Proportion of true values falling within the predicted density interval for continuous one- and two-variable models. Red lines mark 0.75 and 0.9. Predictions were only included for analysis if the true value of the variable was non-zero.
To provide a range of plausible values for chemical concentrations, polymer lengths, the pH, and the temperature, the model can take a user-specified density interval and return the bounds associated with that interval. More specifically, given a desired density level
Figure 8: Widths of density intervals. The results are reported in multiples of the variable's standard deviation for comparison. Predictions were only included for analysis if the true value of the variable was non-zero.
To assess whether the reported density intervals were narrow enough to be useful, we calculated the average interval width in terms of the variable's standard deviation (to ease comparisons between variables). Model 1, the most specific model, generated density intervals narrower than 2 standard deviations for 77% of variables in the validation set at the 75% density level. As a specific example, at the 75% density level, Model 1 produced pH density intervals with an average width of 2.57, capturing the true pH value 72.4% of the time. However, as in the absolute error evaluation, the model does not perform tremendously better than much simpler baselines. For example, an interval 2.57 wide centered at the mean of the train-set pH captures 73.3% of validation pH values, outperforming the model. Similarly, at the 75% density level on the validation set, Model 1 produced sodium chloride density intervals with an average width of 1.41 M for experiments using sodium chloride, capturing the true sodium chloride value in 90.0% of the intervals. However, a fixed interval of the same width, from 0 to 1.41 M, would still have captured the true sodium chloride value in 88.4% of validation experiments that used sodium chloride (centering the interval at the train mean would have included negative values, so this is a more realistic comparison).
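For completeness, one way such bounds can be extracted from a grid-evaluated predicted density is a highest-density construction like the sketch below; the exact interval definition used by the model is not reproduced here, and the uniform grid is an assumption.

```python
# Minimal sketch: accumulate probability mass from the densest grid points down
# until the requested level is reached, then report the spanning bounds.
import numpy as np

def density_interval(grid, density, level=0.75):
    """grid: uniformly spaced values; density: normalized PDF on the grid."""
    dx = grid[1] - grid[0]
    mass = density * dx
    order = np.argsort(density)[::-1]                 # densest grid points first
    cumulative = np.cumsum(mass[order])
    keep = order[: np.searchsorted(cumulative, level) + 1]
    return grid[keep].min(), grid[keep].max()
```

Combined with a grid-evaluated density like the earlier sketch, `density_interval(grid, density, 0.75)` would return the kind of bounds reported above.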
Predictions of the PEG concentration were the least precise, with average 75% density intervals wider than 3 standard deviations and average 90% density intervals wider than 4 standard deviations in every trained model. This was likely because the model was simultaneously attempting to predict a concentration and a polymer length for PEG.
This project was envisioned and planned by Will Nickols, Benjamin Tang, Jorge Guerra, Srihari Ganesh, and Andrew Lu. Will Nickols additionally implemented the models and workflows. The computations were run in part on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. We would like to thank the Curtis Huttenhower lab at the Harvard School of Public Health for their computing resources. The content and any mistakes herein are solely the responsibility of W. Nickols, B. Tang, J. Guerra, S. Ganesh, and A. Lu and do not necessarily reflect the views of the Huttenhower lab.