Discussion of methods parameters #83
Replies: 26 comments
-
Hi, perhaps a partial solution to the fact that tree size plays a role and some methods use mini-batches is to have a budget that takes both of these into account (the number of nodes evaluated and the mini-batch size).
Or, perhaps even simpler to implement, set the total budget to be a total number of component calls (i.e. exclude
This does not account for everything of course, as some methods carry extra computational complexity in parts other than evaluation (e.g., I know mine: GP-GOMEA builds a linkage model every generation to decide what swaps to perform at crossover time, and RDO parses a library of subfunctions to decide which to use as the mutating branch). Maybe C++-based implementations could additionally be compared in terms of total time?
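To illustrate the idea, here is a minimal, hypothetical Python sketch (not taken from any of the benchmarked codebases): a budget expressed in component calls could charge each fitness evaluation by tree size times the number of rows it touches, so both model size and mini-batching are accounted for.

```python
# Hypothetical component-call budget: each fitness evaluation is charged
# tree_size * batch_size "component calls".
class ComponentCallBudget:
    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.used = 0

    def charge(self, tree_size, n_samples):
        """Record one fitness evaluation of a tree with `tree_size` nodes
        on a (mini-)batch of `n_samples` rows."""
        self.used += tree_size * n_samples

    @property
    def exhausted(self):
        return self.used >= self.max_calls


# Usage inside a hypothetical evolutionary loop:
budget = ComponentCallBudget(max_calls=500_000 * 1_000)  # e.g. 500k evals x 1k rows
# while not budget.exhausted:
#     for individual in population:
#         budget.charge(individual.size, len(minibatch))
#         ...
```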
-
Yes, I'm currently planning to compare all methods on process time. I think you are right @marcovirgolin that having a total component evaluation budget (sometimes called "point evaluations") would be even fairer... but I'm skeptical we can get it implemented equivalently (and quickly) in each of the 10 or so methods we're benchmarking, most of which provide only the number of generations and population size as arguments. ellyn supports setting max point evals, but I'm not sure about others. And of course nodes with different computational complexity (e.g.
Let me look into this. For sure I can get RAM usage; I'm not sure about load, especially since it is a multi-use, heterogeneous cluster.
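For what it's worth, a minimal sketch of per-run measurement on a Unix node, assuming the method runs inside the Python process (purely illustrative; note `ru_maxrss` is in kilobytes on Linux and bytes on macOS, and load average is node-wide rather than per-method):

```python
import os
import resource
import time

start_cpu = time.process_time()    # CPU time of this process only
start_wall = time.perf_counter()

# ... run the regression method here ...

cpu_seconds = time.process_time() - start_cpu
wall_seconds = time.perf_counter() - start_wall
peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
load_1min, load_5min, load_15min = os.getloadavg()                # node-wide

print(f"process time: {cpu_seconds:.1f}s, wall: {wall_seconds:.1f}s, "
      f"peak RSS: {peak_rss_kb / 1024:.1f} MB, 1-min load: {load_1min:.2f}")
```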
-
If we think only about the evolutionary methods, how many of them have local search and/or mini-batches? If it's only those maintained by the people involved in this repo, we could go with @foolnotion's and @marcovirgolin's suggestions:
In any case, it would also be interesting to have an estimate of running times for these algorithms. Maybe running a single run of
-
Just to add my 2 cents to the discussion as well. Unfortunately, I haven't come across the ultimate method for performance comparison of algorithms for symbolic regression. Every metric has some kind of shortcoming: wall-clock time includes the efficiency of the implementation, the number of evaluated solutions does not take local search into account, and point evaluations omit the overhead of some more advanced calculations (for example, linkage calculations). IMHO the fairest comparison is to estimate computational effort by using point evaluations and, if that's not feasible, the number of evaluated solutions (including local search evaluations) will do just fine (as @folivetti suggests). I would neglect the fact that different symbols have different complexities (e.g. AQ or exp) and also ignore the model length while calculating the spent evaluation budget. However, as an additional measure, an estimate of the actual running time of the algorithms would be interesting and would correct some of the shortcomings of using the number of evaluations as the computational budget.
-
Based on the discussion it seems we're agreeing on a limit of 500,000 evaluations, including local search evaluations and accounting for mini-batch size. Another practical issue is maximum run-time on the cluster. Our short priority queue has a max of 24 hours, which would be an ideal limit for us, but based on our last experiment I'm not sure some of these methods will finish individual runs on datasets on a single thread in that time. Important differences from last time: I've added the ability to sub-sample the datasets in PMLB that are large (currently set to 10,000 samples max), and we're using HalvingGridSearch this time which is a bit quicker. So my questions are:
-
I have added a time limit for Operon. I was reluctant to do it before because I thought it would introduce non-deterministic behavior and impact reproducibility, but that's something the user must be aware of. However, like @mkommend mentioned above, a time limit might mean that you benchmark not only the method but also the implementation.
Maybe let them time out the first time around (it probably won't happen on all problems, only on a subset), then rerun each method with sub-sampled training data on the specific problems where it timed out.
10,000 looks OK to me. I looked at https://epistasislab.github.io/pmlb/ and found 14 problems with > 10,000 samples. You could also choose to sample a % of the observations (e.g. 10%) with an upper ceiling for the really large datasets.
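A minimal sketch of that percentage-with-a-ceiling idea, assuming numpy arrays; the function name and the particular floor/ceiling values are illustrative only:

```python
import numpy as np

def subsample(X, y, frac=0.10, floor=10_000, ceiling=100_000, seed=0):
    """Keep ~`frac` of the rows, but no fewer than `floor` (so small
    datasets are untouched) and no more than `ceiling` (for the really
    large ones)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_keep = int(np.clip(frac * n, min(floor, n), ceiling))
    idx = rng.choice(n, size=n_keep, replace=False)
    return X[idx], y[idx]

# e.g. n=500,000 -> 50,000 rows kept; n=20,000 -> 10,000; n=5,000 -> unchanged.
```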
-
I would rather reduce the max number of samples for everyone and, only then, reduce it even further for those algorithms that timed out. But I would make sure everyone can run all of the 500k evaluations, for the same reasons pointed out by @foolnotion and @mkommend. Could you make a single run of each algorithm (not the grid search, just a single hyperparameter setting) with the

About changing the code to implement a time limit: in my case it could hurt performance badly, since I'm relying on lazy evaluation to optimize some steps, and this would require me to force strictness in some key points of my code. In any case, I think 24 hours should be enough for a single run on the largest dataset. But I haven't tested it yet to be sure.
-
I've run ITEA on the 564_fried dataset on my local machine using 500k evaluations. A single run of the grid search with 6 combos using HalvingGridSearch with 5 folds took about 10 hours.
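For reference, a minimal sketch of how a 6-combination halving grid search can be wired up in scikit-learn; the estimator and parameter grid here are placeholders, not the benchmark's actual configuration:

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeRegressor  # stand-in for an SR method

param_grid = {                 # 3 x 2 = 6 combinations
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [1, 10],
}

search = HalvingGridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,        # 5-fold CV, as in the run described above
    factor=2,    # successive halving keeps half of the candidates per rung
    n_jobs=1,    # single thread, to match the benchmark setting
)
# search.fit(X_train, y_train)
# model = search.best_estimator_
```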
-
Did you restrict it to one thread?
I would also like to do 1 million, but I know that this will make the experiment much more difficult. Then again, if we set time limits, it shouldn't be too much of a problem... As others have mentioned, this would put us further towards benchmarking implementations rather than methods. Personally I am OK with that, given that 24-72 hours is a reasonably generous time budget for training a model on a dataset with 10,000 samples.
Excellent.
This sounds like a cool approach. I wonder if it would bias our results though, especially since we are comparing to random forests. I think I'm more comfortable just limiting the samples for all methods equivalently, so that, even if we are making generalization variance higher in doing so, it is the same restriction across the benchmarked methods.
Yes.
-
Forgot about that :-D It used two threads. I'll try to repeat this experiment tomorrow with a single thread. Update about the execution time: I left it running again with the same settings as before, with a single thread and, this time, without any other CPU-hungry process (e.g. Chrome) running. It took 10 hours and 9 minutes.
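For anyone repeating this, a minimal sketch of one way to keep a run on a single thread by pinning the common threading back-ends via environment variables before numpy and friends are imported; whether a given SR implementation honors these variables is an assumption:

```python
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"   # must be set before the numeric libraries load

import numpy as np  # imported after the limits are set
# ... run the method ...
```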
-
@lacava great that we use the same data splits for each algorithm. That'd basically mean we don't have an issue with variance due to sub-sampling.

About my suggestion to reach 1M evaluations:
My personal preference then would be to keep the evaluation budget smaller (even though my own code is C++) and hope most algorithms can consistently reach it, just because I personally find algorithmic search capability more interesting than implementation speed. Then we have both sets of results (evaluation-based and time-based) to present.

Update about using an eval budget: I just tried Standard GP against GP-GOMEA on my (slow) laptop with the same code base, with a limit of 500,000 evaluations, on Boston housing (~500 samples). Because GP-GOMEA's trees don't really grow over time (it uses a template-like encoding and performs a form of homologous crossover), GP-GOMEA takes ~10 times less runtime than Standard GP. So that goes back to our discussion about point evaluations.
-
Thanks everyone for your input. I'm working on finalizing the settings, adding a sklearn API for AI-Feynman, and testing runtime on 201_pol as @folivetti suggested. (It does seem like most methods finish under 8 hours, but not all have completed.) I've done some dry runs and everything is more or less working. Here's what we have:

Experiment Settings
Todo:
-
Now that it's off, I will make a pull request with hyper-parameter settings so that there are 6 settings in total and the evolution stops at 500K evaluations. I will also add
I personally think that depends on what you want to claim. Simplification might cause misattribution of merit. For example, algorithm A that includes a smart complexity penalization term in the loss might be beaten by algorithm B that includes no penalization, solely out of luck, because B's giant models happen to simplify a lot. Is that a merit of the intended algorithmic design of B? So, I think potential merits are best evaluated without simplification. Perhaps the best way is to collect data on both settings (w/ and w/o simplification), see if they differ significantly, and then decide what to present, in what light.
I agree 100%.
-
That's a good point. I'll try to capture both separately.
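A minimal sketch of how both could be captured with sympy; the example expression and the node-count definition are illustrative, not the benchmark's actual post-processing code:

```python
import sympy as sp

model_str = "sin(x0)**2 + cos(x0)**2 + x0"   # hypothetical raw model string
raw = sp.sympify(model_str)
simplified = sp.simplify(raw)                # -> x0 + 1

def n_nodes(expr):
    """Expression-tree size, one possible definition of model complexity."""
    return sum(1 for _ in sp.preorder_traversal(expr))

print("raw:", n_nodes(raw), "simplified:", n_nodes(simplified))  # 10 vs 3
```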
-
I have a few questions:
-
I think it's OK to simplify, especially for GP, where the encoding is inherently redundant (e.g. typically only binary symbols are used, so n-ary operations are emulated via nesting), not to mention the inevitable bloat. It seems perfectly fine to compare simplified models, as this is a routine step during post-processing anyway. As @marcovirgolin suggested, you could compare both versions (simplified and non-simplified) and see the differences. Our infix exporter was designed with
Also agree. One practical thing to keep in mind is that if the test set is really huge, it might impact the runtime of methods that calculate the test error between iterations/generations (e.g. if they report any stats during the run).
-
None of the methods will have the (potentially large) test data passed to them, except through calls to
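A minimal sketch of the scikit-learn-style contract being described (class and attribute names are hypothetical): training data goes to `fit`, and the potentially large test set is only ever seen through `predict`:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MySRRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, max_evaluations=500_000):
        self.max_evaluations = max_evaluations

    def fit(self, X, y):
        # ... run the symbolic regression search on (X, y) only ...
        self.expr_ = "placeholder"            # best model found
        self.coef_ = np.zeros(X.shape[1])     # placeholder parameters
        return self

    def predict(self, X):
        # The only point at which (test) data reaches the fitted model.
        return X @ self.coef_
```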
-
At the moment, no. We are assessing generalization error, complexity, and run-time, which means methods with bigger max sizes might be more accurate but also more complex and/or slower. We could try to decide on a common setting, but I'd worry about differences in how each method defines solution "size" as well. What do you think?
I'm wary of giving the exact minimal set, since I think it's easy to over-fit to the data/problem once you know it. But I also don't want to prevent any method from finding a solution due to it not having a basic function. What I will say is this: the list of operators below covers all the functions in the problems with exact solutions:
This is not the minimal set and there are equivalences (e.g. tanh can be expressed with exp, +, and /; see the identity below), but everything is in there. You might consider using hyperparameter tuning to choose function sets if you think it will be useful.
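For reference, the equivalence mentioned above is the standard identity (the subtraction can be realized with + and a negative constant):

```latex
\tanh(x) \;=\; \frac{\exp(2x) - 1}{\exp(2x) + 1} \;=\; 1 - \frac{2}{\exp(2x) + 1}
```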
-
Reflecting on this a bit, not many of our methods have
-
BTW, one of my students added the functions mentioned above. He hasn't changed anything beyond that. Unless you want to run the experiment only with code provided by the authors, feel free to use this version.
-
Hm, surely the limit we set does make a difference runtime-wise, at least for the more "traditional" GP encodings. In my experience, given enough run time, the average model size of the population tends to converge to the maximum allowed size. Given what you say, I guess we could consider this max size to be a hyperparameter setting as well then, aimed e.g. at controlling for generalization. This way we don't have the issue of deciding a limit for everyone to adhere to (which is, moreover, possibly complicated for less-traditional encodings).
Thank you :)
-
OK, some updates: I'm planning to set a single training-instance limit of 2 hours for each method (where possible). That translates to ~40-48 hours per dataset:
Some runs have started; they should all be submitted tonight. Thanks everyone!
-
Very nice! I'm looking forward to the results. It will certainly help us determine the current state of symbolic regression and what we need to improve in our own methods. Just to be sure, have the experiments already started? I'm asking because I'm doing some optimizations in my own code that could help cut down the total runtime.
-
Yep, everything is running now. But feel free to update for the next round!
-
Seems like most of these issues are settled; closing for now.
-
Hi,
I am opening this issue to start a discussion about the settings and computational limits of the methods. I added some of the email comments in there, feel free to edit!
Evaluation budget
There will be a fixed evaluation budget that each method can expend in its own way. Suggested budget: 500,000 evaluations.
Things to consider:
It seems reasonable to take into account local search iterations and adjust for minibatch sampling.
It might also be interesting to monitor the load average on the cluster, if you can isolate it on a per-method basis, maybe combine with other measurements like memory usage. This would give a more general measure of each method's computational requirements.
Hyper-parameters
Model complexity
Not much to comment here, maybe just a minor nitpick: I noticed some methods also use the "AQ" (analytical quotient) symbol, which can be decomposed into basic math operations: `aq(a,b) = a / sqrt(1 + b^2)`. What, then, is the complexity of the AQ symbol?
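One possible accounting, if complexity is measured by node counts and AQ is charged by its decomposition (just an illustration of the trade-off, not a recommendation):

```latex
\operatorname{aq}(a,b) = \frac{a}{\sqrt{1 + b^{2}}}
\quad\Rightarrow\quad
\underbrace{b^{2}}_{1},\;
\underbrace{1 + (\cdot)}_{1},\;
\underbrace{\sqrt{\cdot}}_{1},\;
\underbrace{a / (\cdot)}_{1}
\;=\; 4 \text{ primitive operations, vs. } 1 \text{ if AQ is a single symbol.}
```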