[SRBench 2024] Datasets #174

folivetti · 2024-06-25T14:28:41Z

folivetti
Jun 25, 2024
Maintainer

We've been discussing internally about the next iteration of SRBench. One of the key points is the selection of a subset of the current datasets and the introduction of new ones. What we discussed so far:

10 datasets from the current set (avoiding the bias of Friedman's datasets)
10 real-worl datasets with known GT (i.e., Feynman and data from Mile's paper)
2 synthetic generator function that will be used to create 10 datasets with varying amount of noise and 10 datasets with varying number of irrelevant features

Regarding the selection from the current set, we've come up with three different criteria:

Select those that have a large variation in R^2 across different algorithms
Select a minimal set that maintains the rank
Remove datasets in which the edit distance between pair of models from different algorithms is small

Maybe we should start with the whole set and simply filter them out by removing all that doesn't change rank, all that are trival (By the edit distance measure), and see what remains.

folivetti · 2024-06-28T15:29:14Z

folivetti
Jun 28, 2024
Maintainer Author

@lacava @MilesCranmer @gbomarito @hengzhe-zhang @gAldeia @foolnotion

0 replies

gAldeia · 2024-07-01T01:08:37Z

gAldeia
Jul 1, 2024
Maintainer

Are we considering keeping Strogatz datasets in this track?

If we use Feynman, I would suggest replacing our datasets with those from Matsubara et al. in https://arxiv.org/pdf/2206.10540, taking the same number of easy/medium/hard problems to use.

0 replies

gAldeia · 2024-07-01T01:13:23Z

gAldeia
Jul 1, 2024
Maintainer

When selecting a subset of our current datasets, I like the idea that this subset should have the slightest effect on overall results from the current SRBench version.

I can see the reasons why we could keep/drop many datasets. However, IMO this should be done without affecting the overall results so we can avoid inserting bias based on our interpretation of the results.

Also, from my perspective, I would like to use the ol' reliable number of 30 repetitions for each dataset since this gives us more statistical robustness and allows us to expect less variation in the results. If you agree, the final number of datasets we are going to use should be defined by taking this computational burden into account.

2 replies

gAldeia Jul 5, 2024
Maintainer

Continuing this line of thought and after discussing it in an online meeting with @lacava and @folivetti, I did the following.

Inspired by the idea from @MilesCranmer that suggested looking at some dimensionality reduction plots to search for patterns, and based on suggestion of @lacava to use UMAP instead of tSNE (I guess because UMAP is better at preserving global structure through cluster distances, I did a dimensionality reduction with a table where each row is a dataset, and each column is a combination of (algorithm, metric) reporting the median metric value of the algorithm for that given dataset.

Here are some scatterplots with labeling based on dataset attributes:

I then tried to pick a number of clusters by looking at the elbow of the SSE curve and decided to work with 5 clusters:

Applying a KMeans, I got:

Selecting 4 datasets from each cluster based on the distance to its centroid, I ended up with the following list of 20 datasets:

Finally, to check if ranks are preserved within this subset, as @gbomarito suggested, we get the following:

The left side is the Pareto plot using only 20 datasets; the right side is the entire benchmark.

I can try playing around with this a bit further if any of you think that makes sense. Also, I can try including edit distance into this calculation (@hengzhe-zhang I would really appreciate your help if that's the case!).

gAldeia Jul 23, 2024
Maintainer

As we discussed in last meeting, I would make the following changes:

Regenerate clusters without using time. 20 clusters, get 1 dataset from each (instead of 4 datasets in each of the 5 clusters). Get the dataset with the highest variance among the algorithms within each cluster.
Test with and without normalization, always using ranks
Ask Hengzhe for help with te edit distance.
Try using silhouette instead of SSE when investigating the elbow of the number of clusters view knee for silhouette. Geoff tried DBScan and it wasn't better than what he got with k-means.

Just a reminder, this selection was made by using kmeans(n_clusters=20) over two latent dimentions obtained by applying UMAP to a table with datasets as rows, and metric_algorithm as columns (for each metric in r2, model size, avg edit distance and each algorithm in the sbench results).

Final datasets selection:

Performance on this subset:

I tried plotting the R2 and model size ranks for the selected/excluded datasets from my approach and got this:

	Selected datasets (10 feynman and 10 non-feynman)	Excluded
model size rank
r2 rank

We could use these plots to visually determine if we are getting a representative selection.

I'll share the code with you all as soon as I can.

hengzhe-zhang · 2024-07-07T10:28:22Z

hengzhe-zhang
Jul 7, 2024

Hi! Here is an update from my side. In our previous meeting, @lacava mentioned that our goal might be to find datasets that can differentiate algorithms. Thus, I now use the pairwise differences between different algorithms on each dataset to select representative datasets.

This approach is different from clustering. An intuitive way to understand this is that many Friedman datasets are selected because Friedman effectively differentiates different algorithms. In other words, some datasets may not be able to distinguish different algorithms, and all algorithms achieve an R2 of -1 (just an example).

So, if clustering is used, I'm not sure if a dataset where all algorithms perform equally poorly will be selected or not. If so, we might select problematic datasets from which no learning algorithm can extract useful information.

The following two tables use pairwise differences to select top datasets based on R2/Edit distance. They show stark differences. I think R2 might be more appropriate in the current situation.

                     dataset  average_R2_difference
70         605_fri_c2_250_25                236568.262516
118      706_sleuth_case1202                    85.751254
107         656_fri_c1_100_5                    61.320601
112      665_sleuth_case2002                    58.765072
29   485_analcatdata_vehicle                    45.849505
119      712_chscase_geyser1                    26.347469
33                  522_pm10                    14.554056
86          624_fri_c0_100_5                    10.525743
38             542_pollution                     9.166740
20              218_house_8L                     9.138097
39                   547_no2                     7.859947
18                 210_cloud                     7.175195
13              192_vineyard                     5.808021
25           230_machine_cpu                     4.949792
78         615_fri_c4_250_10                     2.796081
113        666_rmftsa_ladata                     2.775495
97         644_fri_c4_250_25                     2.583776
67         602_fri_c3_250_10                     2.178911
57         591_fri_c1_100_10                     1.662001
75          611_fri_c3_100_5                     1.549416

                dataset  average_edit_distance
105   653_fri_c0_250_25           21471.208791
64   598_fri_c0_1000_25           20768.428571
97    644_fri_c4_250_25           18123.483516
82   620_fri_c1_1000_25           10006.208791
61   595_fri_c0_1000_10            7703.351648
19         215_2dplanes            7635.153846
71   606_fri_c2_1000_10            7344.747253
73   608_fri_c3_1000_10            7270.406593
85   623_fri_c4_1000_10            7219.681319
45            564_fried            7154.164835
95    641_fri_c1_500_10            6858.087912
59   593_fri_c1_1000_10            6812.000000
88    627_fri_c2_500_10            6728.758242
106   654_fri_c0_500_10            6704.175824
99    646_fri_c3_500_10            6682.087912
11    1203_BNG_pwLinear            6544.296703
69    604_fri_c4_500_10            6492.505495
113   666_rmftsa_ladata            6195.516484
14       195_auto_price            5197.395604
17        207_autoPrice            5098.747253

0 replies

gbomarito · 2024-07-16T21:29:35Z

gbomarito
Jul 16, 2024

Hey all,

I just saw this at GECCO: https://arxiv.org/pdf/2301.01488

its an informed down sampling for test cases in lexicase selection. As I was watching I was thinking about how similar this is to down sampling of datasets in SRBENCH. Could/Should we do something similar where we choose a subset of datasets based on maximizing distance of "performance vectors" from our current algorithms?

0 replies

hengzhe-zhang · 2024-07-22T09:41:02Z

hengzhe-zhang
Jul 22, 2024

@gbomarito This is a great idea. I have implemented a simple function to select datasets based on the error matrix T and the sampling ratio r, hoping this will be useful.

def euclidean_distance(vec1, vec2):
    return np.sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(vec1, vec2)))


def farthest_first_traversal(T, r):
    ds = set()  # the down-sample
    size = int(r * len(T))  # desired size of down-sample
    # Add a random case to the down-sample
    ds.add(random.choice(list(T)))
    T = T - ds  # remove the selected case from the training set

    while len(ds) < size:
        max_min_dist = -1
        case_to_add = None
        for case in T:
            min_dist = min(euclidean_distance(case, c) for c in ds)
            if min_dist > max_min_dist:
                max_min_dist = min_dist
                case_to_add = case
        ds.add(case_to_add)
        T.remove(case_to_add)

    return ds

4 replies

gAldeia Jul 23, 2024
Maintainer

This is useful, I can play around with this idea if you want to.

One thing (that may or may not be a problem) is that we have multiple objectives, Im not sure if (and how) we should handle this.

folivetti Jul 25, 2024
Maintainer Author

I like this idea. It should be insightful to see the list of a selection of 10 datasets using this method.

gAldeia Jul 26, 2024
Maintainer

I tried using it with our current results. It has stochastic behavior (but it seems to be a very stable method based on some runs I tried). To share one possible result (maybe this can be insightful), here are 10 datasets with this method (as @folivetti suggested):

Below is the corresponding Pareto front with these 10 datasets:

(If anyone knows how to make collapsable sections in github+markdown, please let me know)

folivetti Jul 26, 2024
Maintainer Author

ok, I don't like this list 😆
the way we're doing it we simply picked the datasets with different achievable R^2, but in many of those selection the algorithms behave very similar to each other.

After putting some thoughts into that, and reading @gAldeia manifesto (#177):

We should characterize subclasses of problems and identify which algorithms are better, mapping problems to algorithms.
Remove easy problems.

I would propose we simplify what we are doing and simply to apply that in this order:

remove datasets with low variance between algorithms in the test set
remove datasets with worst r2_test greater than 0.9
pick a sample of different classes of data as in New datasets + reorganization of current benchmarks #153 (if we're picking 10, we could go for 5 of R [1 friedman and 4 non-friedman?], 3 Z+, and 2 R+)

folivetti · 2024-07-25T12:23:16Z

folivetti
Jul 25, 2024
Maintainer Author

During GECCO, Alberto Tonda mentioned this recently curated benchmark for regression
https://openml.org/search?type=benchmark&sort=tasks_included&study_type=task&id=353
https://openreview.net/pdf?id=HebAOoMm94

It contains only 23 datasets. I think most of them are currently contained in srbench. Maybe we should consider integrating these for srbench 2025?

2 replies

gAldeia Jul 30, 2024
Maintainer

With help of Alberto implementations (https://github.com/albertotonda/symbolic-regression-conformal-prediction/tree/main), I downloaded and pre-processed these datasets.

After taking a look, many of these are familiar from previously published papers (not from me), and I agree that they could be used.

We could have one track that uses this benchmark suite (it would also make srbench directly comparable with other regression algorithms based on this OpenML suite as well).

Obs: it's actually 35 datasets!

gAldeia Aug 7, 2024
Maintainer

I did manual labor in finding equivalences between PMLB datasets and the 35 problems from this OpenML suite. I looked at descriptive statistics and dataset dimensionality to determine whether they are the same, and in a few cases, I opened up the data:

cars (openml)--> It is not the same from srbench (195_auto_price, 207_autoPrice, 485_analcatdata_vehicle);
cpu_activity --> same as 197_cpu_act and 573_cpu_act;
geographical_origin_of_music --> same as 4544_GeographicalOriginalofMusic looks the same;
red_wine --> same as quality_red;
white_wine --> same as quality_white;
solar_flare --> no similar named file in srbench. PMLB has flare_1, flare_2, and flare, but these are all classification problems;
california_housing, miami_housing, brazilian_houses --> no equivalent housing dataset in srbench. SRBench has 537_houses, 574_house_16H, and 218_house_8L;
Airfoil, energy_efficiency, and concrete are in OpenMLbut not in SRBench, although they are common in the regression literature as benchmarks.

So, if running SRBench (as it is) and OpenML, we would have 4 identical datasets (cpu_activity, geographical_origin_of_music, red_wine, white_wine), and several (cars and housing) which could be dropped from SRBench. I will proceed my reduced SRBench selection excluding these, assuming that OpenML will be a track on the next iteration of the paper

gAldeia · 2024-08-05T19:05:52Z

gAldeia
Aug 5, 2024
Maintainer

Some datasets we should ignore when doing this analysis:

print(df_results.shape)

# Removing mislabeled datasets (these are clf, but PMLB v1.0 had it as regr)
df_results = df_results[ ~df_results["dataset"].isin(["banana", "titanic"]) ]

# ignoring new datasets from PMLB that hasnt been benchmarked with other methods yet
# df_results[['algorithm', 'dataset']].value_counts().unstack().sum(axis=0).sort_values()
df_results = df_results[ ~df_results["dataset"].isin([
    "nikuradse_2",
    "nikuradse_1"
]) ]

# Removing duplicated datasets
# 562 and 227, 573 and 197, 1203 and 229, 207 and 195
df_results = df_results[ ~df_results["dataset"].isin([
    "562_cpu_small",
    "573_cpu_act",
    "229_pwLinear",
    "207_autoPrice",
]) ]

# with pd.option_context('display.max_rows', None, 'display.max_columns', None):
#     display(df_results.dataset.value_counts())

print(df_results.shape)

1 reply

gAldeia Aug 7, 2024
Maintainer

Found some more that we can discard before doing our analysis:

197_cpu_act and 573_cpu_act are the same. 562_cpu_small and 227_cpu_small are also the same but with fewer features. Maybe we should stick to the version without fewer features? Also, when looking at the values, the output is weird: half zeros, half between 50 and 100, looking like user IDs. I would drop all of these.
561_cpu and 230_machine_cpu are almost the same, with one extra column on the former. We could use only 561_cpu.
537_houses and 218_house_8L are pretty similar in feature number, data, and imbalance. 574_house_16L looks the same as 218_house_8L but has twice as many features (the same number of rows). Should we keep only 574_house_16L?

folivetti · 2024-08-06T10:48:40Z

folivetti
Aug 6, 2024
Maintainer Author

Alright, so as we discussed yesterday, we will pick $20$ datasets from the current srbench
by sampling datasets with different characteristics.
Thinking about the range of the values of the target variable, we have:

range | count
R         74
R+        15
Z          1
Z+        30
{0, 1}     2
Name: dataset, dtype: int64

I suggest we discard the Z and {0,1} and then we pick 10 R, 4 R+, 6 Z+ (more or less proportional). We should be careful to pick Z+ with many distinct values, as we are not compatible with multiclass yet.

From the 10 R we can try to select 2 of each:
few samples, low dimensional
few samples, high dimensional
many samples, low dimensional
many samples, high dimensional

From the above selection we try to pick 1 or 2 Friedman, maybe those with high variance in the results.
What do you all think?

@gAldeia @gbomarito @hengzhe-zhang @lacava

3 replies

gAldeia Aug 6, 2024
Maintainer

Sounds good. I will try to come up with something, then will list the main properties and criteria for the selection!

hengzhe-zhang Aug 7, 2024

Alright, so as we discussed yesterday, we will pick 20 datasets from the current srbench by sampling datasets with different characteristics. Thinking about the range of the values of the target variable, we have:
range | count
R         74
R+        15
Z          1
Z+        30
{0, 1}     2
Name: dataset, dtype: int64
I suggest we discard the Z and {0,1} and then we pick 10 R, 4 R+, 6 Z+ (more or less proportional). We should be careful to pick Z+ with many distinct values, as we are not compatible with multiclass yet.

From the 10 R we can try to select 2 of each: few samples, low dimensional few samples, high dimensional many samples, low dimensional many samples, high dimensional

From the above selection we try to pick 1 or 2 Friedman, maybe those with high variance in the results. What do you all think?

@gAldeia @gbomarito @hengzhe-zhang @lacava

I have a question regarding dataset selection: should we consider including other OpenML datasets in this round or purely select from PMLB? Including more datasets might allow us to work with a more diverse range of data.

lacava Aug 7, 2024
Maintainer

I'm happy to incorporate non-pmlb datasets

gAldeia · 2024-08-09T10:28:30Z

gAldeia
Aug 9, 2024
Maintainer

EDIT: trying to do two clusters for reduced srbench, one with only-friedman and other with non-friedman

I come to a suggestion for the next iteration. There are 4 different tracks:

I come to a suggestion for the next iteration. There are four different tracks:

Reduced SRBench: datasets from the previous iteration after clustering the results from that paper. The idea was to get the most diverse subset. I removed duplicates from OpenML and also problems where 50%> of the algorithms would achieve R2>0.95. I did one clustering with non-Friedman with 9 clusters and random selection from each cluster, then another clustering with only Friedman and picked 3. I tried UMAP, PCA and tSNE, but UMAP was showing the most diverse R2 performance (At the bottom is the R2 cluster map of the selected datasets).
OpenML: I'm using the OpenML suite 353. It has many famous datasets and may allow us to compare some of our partial results with other papers (also outside the SR community). I also noticed four problems belonging to previous SRBenches and four very similar ones. That is why I kept the reduced SRbench with only ten datasets.
First principle: real world data curated by @folivetti and @MilesCranmer. One idea was to include the first principle in the Pareto plot to see if any algorithm is dominating it (we need to ensure we use the exact measurement of size)
Feynman: hard set selected by Matsubara (https://arxiv.org/abs/2206.10540). They did a fantastic job making this selection (and making the problems even harder with log-uniform distribution). Given the efforts to run SRBench, we could restrict the use of challenging problems without dummy variables for now, and we leave the in-depth evaluation to them.

5 replies

folivetti Aug 13, 2024
Maintainer Author

@gAldeia yesterday we decided to:

discard the OpenML (we may use it in the next iteration after we put some more thought into that)
Sample 12 datasets from the selected Feynman, so we will have 12 datasets of each track (you can use your clustering approach to pick them)
For the Feynman we will go with the noiseless, no dummies, uniform distribution and 10k data points (was that it @lacava ? I forgot whether we would use the log-uniform distribution)

gAldeia Aug 13, 2024
Maintainer

I will filter the hard problems based on the distribution type. If there is more than 12 datasets left, I will do the clustering!

gAldeia Aug 14, 2024
Maintainer

The goal was to select 12 datasets from the hard problems identified by Matsubara et al. (https://arxiv.org/pdf/2206.10540).

I first removed all problems that use operators not common to all symbolic regression (SR) implementations (e.g., arcsin).

Next, I filtered out datasets where a linear regression (LR) model could achieve an accuracy >= 0.999.

Then, I examined the current results from the latest SRBench iteration and selected problems where at least one algorithm could find the correct solution. This process yielded 11 candidate problems.

The 12th dataset was chosen at random (using df.sample()).

Final selection:

II.11.27
I.40.1
I.30.3
III.9.52
III.10.19
II.6.15b
III.21.20
II.35.21
II.36.38
I.44.4
I.32.17
II.11.20

@folivetti Could you check and share your thoughts?

folivetti Aug 14, 2024
Maintainer Author

I think that's a good selection :-)
Let's move on to the next step!

gAldeia Aug 14, 2024
Maintainer

I forgot to add the clustermaps. Here it is:

hengzhe-zhang · 2024-08-20T05:09:27Z

hengzhe-zhang
Aug 20, 2024

Hi everyone! I came across a paper published at ICML 2024 that reports excellent performance on the Feynman dataset, achieving an R² score of 0.99. The author had previously published a paper at ICLR 2023 with similar results (0.99) on the same dataset.

ICML 2024 A Neural-Guided Dynamic Symbolic Network for Exploring Mathematical Expressions from Data:

ICLR 2023 Transformer-based model for symbolic regression via joint supervised learning:

Given these high-performance results, should we consider increasing the difficulty of the AI Feynman dataset?

For example, we could try methods we used in the SRBench competition, such as increasing the number of confounding variables.

1 reply

folivetti Aug 20, 2024
Maintainer Author

Hi @hengzhe-zhang nice findings.
Well, to be fair, their results claim to solve almost every dataset :-D
For this edition we can add a little noise (N(0, 0.05*std(y), or something like that) and see what happens.
If we see that many algorithms solve it, then we tune that for the next iteration

What do you think? We can discuss in our meeting later.

hengzhe-zhang · 2024-08-20T21:23:15Z

hengzhe-zhang
Aug 20, 2024

I've explored Kaggle competitions a bit more. Kaggle allows for custom evaluation scripts, but the time budget for evaluation is limited to 30 minutes. Some existing competitions, especially those hosted by large companies, seem to have an 8-hour evaluation period, but it's unclear if this is available to everyone.

Nevertheless, we might consider using Kaggle for SRBench 2025. The advantage of Kaggle is that it allows participants to see their results in real time. However, a potential drawback is that it might be challenging to control the number of evaluations unless we implement post-competition checks on their code.

2 replies

folivetti Aug 21, 2024
Maintainer Author

I would say we can keep those two in parallel in the next years:

Kaggle SRBench competition with the objective of simulating a more realistic scenario in which the participant will go through all of the pipelines and enjoy more freedom to explore
current SRBench as a way to measure the performance of the algorithms given a controlled environment

both formats have their values and measure different things

gAldeia Aug 21, 2024
Maintainer

I really like the idea of using Kaggle! Evaluation will be standardized for all participants, and the time limit will be the main constraint on submissions.

Also, with our plans of reducing the benchmark size and re-running the experiments, we will get more realistic time estimations in a few months to decide wether it would be possible.

[SRBench 2024] Datasets #174

folivetti Jun 25, 2024 Maintainer

Replies: 12 comments · 20 replies

folivetti Jun 28, 2024 Maintainer Author

gAldeia Jul 1, 2024 Maintainer

gAldeia Jul 1, 2024 Maintainer

gAldeia Jul 5, 2024 Maintainer

gAldeia Jul 23, 2024 Maintainer

hengzhe-zhang Jul 7, 2024

gbomarito Jul 16, 2024

hengzhe-zhang Jul 22, 2024

gAldeia Jul 23, 2024 Maintainer

folivetti Jul 25, 2024 Maintainer Author

gAldeia Jul 26, 2024 Maintainer

folivetti Jul 26, 2024 Maintainer Author

folivetti Jul 25, 2024 Maintainer Author

gAldeia Jul 30, 2024 Maintainer

gAldeia Aug 7, 2024 Maintainer

gAldeia Aug 5, 2024 Maintainer

gAldeia Aug 7, 2024 Maintainer

folivetti Aug 6, 2024 Maintainer Author

gAldeia Aug 6, 2024 Maintainer

hengzhe-zhang Aug 7, 2024

lacava Aug 7, 2024 Maintainer

gAldeia Aug 9, 2024 Maintainer

folivetti Aug 13, 2024 Maintainer Author

gAldeia Aug 13, 2024 Maintainer

gAldeia Aug 14, 2024 Maintainer

folivetti Aug 14, 2024 Maintainer Author

gAldeia Aug 14, 2024 Maintainer

hengzhe-zhang Aug 20, 2024

folivetti Aug 20, 2024 Maintainer Author

hengzhe-zhang Aug 20, 2024

folivetti Aug 21, 2024 Maintainer Author

gAldeia Aug 21, 2024 Maintainer

folivetti
Jun 25, 2024
Maintainer

Replies: 12 comments 20 replies

folivetti
Jun 28, 2024
Maintainer Author

gAldeia
Jul 1, 2024
Maintainer

gAldeia
Jul 1, 2024
Maintainer

gAldeia Jul 5, 2024
Maintainer

gAldeia Jul 23, 2024
Maintainer

hengzhe-zhang
Jul 7, 2024

gbomarito
Jul 16, 2024

hengzhe-zhang
Jul 22, 2024

gAldeia Jul 23, 2024
Maintainer

folivetti Jul 25, 2024
Maintainer Author

gAldeia Jul 26, 2024
Maintainer

folivetti Jul 26, 2024
Maintainer Author

folivetti
Jul 25, 2024
Maintainer Author

gAldeia Jul 30, 2024
Maintainer

gAldeia Aug 7, 2024
Maintainer

gAldeia
Aug 5, 2024
Maintainer

gAldeia Aug 7, 2024
Maintainer

folivetti
Aug 6, 2024
Maintainer Author

gAldeia Aug 6, 2024
Maintainer

lacava Aug 7, 2024
Maintainer

gAldeia
Aug 9, 2024
Maintainer

folivetti Aug 13, 2024
Maintainer Author

gAldeia Aug 13, 2024
Maintainer

gAldeia Aug 14, 2024
Maintainer

folivetti Aug 14, 2024
Maintainer Author

gAldeia Aug 14, 2024
Maintainer

hengzhe-zhang
Aug 20, 2024

folivetti Aug 20, 2024
Maintainer Author

hengzhe-zhang
Aug 20, 2024

folivetti Aug 21, 2024
Maintainer Author

gAldeia Aug 21, 2024
Maintainer