ENH: Performance improvements for generating candidate models #254
…ngle set of features
…s successive The goal here is that we may have some models that don't want to use `make_successive` for feature sets.
modifies ESPEI code and def breaks gibbs energies, using this commit for backup
Factor out TDB analysis from notebook 2 Start moving get_data_quantities to a FittingStep staticmethod
Binary VM (VA) fitting looks like it works now!
The changes are mostly in adding parameters and not really the fitting itself - that was working. This may be throwaway code (see the comment added), so it's not too complex.
The idea of the modified version is that we also compute the actual site fractions, because individual site fractions are not currently handled by ESPEI but can slip in from existing models when the reference state does not cancel those contributions (e.g. without _MIX or _FORM refstates, the unary extrapolation is kept). To do this, we'll use the config tuple and create site fractions from the points dict. Tests currently pass locally.
tests still passing
fit_formation_energy -> fit_parameters
passing tests
It's working!
Pass through all the function indirection.
Add it to AbstractRKMPropertyStep
…tion.utils from espei.parameter_selection.utils import _get_sample_condition_dicts to from espei.error_functions.non_equilibrium_thermochemical_error import get_sample_condition_dicts
VA is normalized per atom
This is useful for organizing datasets for different runs while having one single source of truth for the data
This algorithm has N*M complexity, which is an enormous simplification compared to the more complex algorithm, which converges to N^M complexity as N->inf. Before this change, even moderate N would cause _build_feature_matrix to become the dominant time-limiting function in profiling.
Note that the performance issues this resolved are mostly due to a combination of the number of candidate models and the amount of data. Generating the candidate models has an up front cost, but the result is cached so it's not overly expensive. The main contributor is that with many candidate models and data, most of the time is eventually spent in |
This set of changes improves the performance of parameter selection with two primary changes:
1. When we build candidate models (renamed `build_feature_sets` to `build_candidate_models`) we take all combinations of the product of composition-independent features with interaction features. The implication of this is that for some models that have a lot of features, for example heat capacity temperature features with four binary interaction features, generating candidate models can get very expensive because the current implementation has geometric complexity with respect to the temperature and interaction features (as documented). Here we make an optimization for cases when the general implementation will generate more than `complex_algorithm_candidate_limit` (default=`1000`) candidate models, where the simplified version will have the same number of composition-independent features for all interaction features. Instead of geometric complexity $N(1-N^M)/(1-N)$, the simplified version has complexity $NM$, where $N$ and $M$ are the number of composition-independent features and interaction features, respectively.
2. A profiling-guided optimization in `espei.paramselect._build_feature_matrix`. The feature matrix is a concrete matrix of reals (rows: observations, columns: feature coefficients). We use a symengine vector (`ImmutableDenseMatrix`) to fill the feature matrix row-wise, moving an inner loop to fast SymEngine rather than slow Python. Roughly 3x speedup of this function after this change.
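To make the difference between the two candidate-generation strategies concrete, here is a minimal pure-Python sketch. It is not ESPEI's implementation: the function names (`successive`, `general_candidates`, `simplified_candidates`, `build_candidates_sketch`) are hypothetical, features are plain strings, and the enumeration order is an assumption. Only the keyword name `complex_algorithm_candidate_limit` and its default of 1000 are taken from the description above.

```python
from itertools import product


def successive(items):
    """Return growing prefixes: [a, b, c] -> [[a], [a, b], [a, b, c]]."""
    return [items[: i + 1] for i in range(len(items))]


def general_candidates(temp_features, inter_features):
    """Every interaction feature independently picks a temperature prefix.

    With N temperature and M interaction features this yields
    N + N**2 + ... + N**M candidate models.
    """
    temp_sets = successive(temp_features)
    models = []
    for inter_set in successive(inter_features):
        for choice in product(temp_sets, repeat=len(inter_set)):
            models.append([(t, x) for x, ts in zip(inter_set, choice) for t in ts])
    return models


def simplified_candidates(temp_features, inter_features):
    """All interaction features share the same temperature prefix: N * M models."""
    temp_sets = successive(temp_features)
    models = []
    for inter_set in successive(inter_features):
        for ts in temp_sets:
            models.append([(t, x) for x in inter_set for t in ts])
    return models


def build_candidates_sketch(temp_features, inter_features,
                            complex_algorithm_candidate_limit=1000):
    """Fall back to the simplified scheme when the general one would be too large."""
    n, m = len(temp_features), len(inter_features)
    # Geometric sum, equal to N(1 - N**M)/(1 - N) for N != 1.
    n_general = sum(n ** k for k in range(1, m + 1))
    if n_general > complex_algorithm_candidate_limit:
        return simplified_candidates(temp_features, inter_features)
    return general_candidates(temp_features, inter_features)
```

In this sketch, four temperature features with four interaction features give 340 general candidates and 16 simplified ones, so the general scheme is kept under the default limit. With six temperature features the general count is 1554, which exceeds the limit, and the sketch returns the 24 simplified candidates instead. Every simplified candidate is also a member of the general set, so the fallback narrows the search rather than introducing new model forms.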