[Question] Where does the Bayesian Optimisation work for the hyperparameter search? #1717
Comments
We use SMAC as the Bayesian optimization library. You can find the entry point in auto-sklearn/autosklearn/smbo.py (line 16 at commit 6732112), although it's quite convoluted considering it inherits and overrides some of SMAC's functionality.
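For reference, a minimal sketch (not from the original thread) of how to see what SMAC's Bayesian optimization actually evaluated after a fit; the toy dataset and short budget are only for illustration, and cv_results_ is the estimator's sklearn-style results attribute:

import autosklearn.classification
import sklearn.datasets
import sklearn.model_selection

# Toy data and a short budget, purely for illustration.
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
)
automl.fit(X_train, y_train)

# Each entry is one configuration proposed and evaluated during the SMAC run.
for score, params in zip(
    automl.cv_results_["mean_test_score"], automl.cv_results_["params"]
):
    print(score, params)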
Hey @eddiebergman! Thanks for the reply. I actually want to log all the models and hyperparameters used by the search. PS: this is not about the ensemble models; I want the models that were trained along the way, before the best model was found.
Hiyo, unfortunately the easy ways are not the most informative:
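A sketch of the usual entry points, assuming automl is a fitted AutoSklearnClassifier; leaderboard and show_models are part of the estimator API, and only models that survive pruning are still available as objects:

# Summary of every run; detailed=True adds hyperparameter columns and
# ensemble_only=False includes models that did not make the ensemble.
print(automl.leaderboard(detailed=True, ensemble_only=False))

# Fitted pipelines that are still on disk, keyed by model id.
print(automl.show_models())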
I am getting the cost and the configuration like this:
And just letting you know that I am using a bi-objective function, in which I return a combined score. Is that the correct way to do so? I am also dumping all the info into a JSON file.
Seems correct to me :)
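For comparison, a minimal sketch of dumping the cost and configuration of every run to JSON, assuming the SMAC run history is reachable as automl.automl_.runhistory_ (attribute names may differ between versions):

import json

# Assumes the fitted estimator exposes the SMAC run history on its underlying
# AutoML object; this layout is an assumption, not confirmed by the thread.
runhistory = automl.automl_.runhistory_

records = []
for run_key, run_value in runhistory.data.items():
    config = runhistory.ids_config[run_key.config_id]
    records.append(
        {
            "config_id": run_key.config_id,
            "cost": run_value.cost,  # the minimized value SMAC sees
            "status": str(run_value.status),
            "config": config.get_dictionary(),
        }
    )

with open("runhistory_dump.json", "w") as fh:
    json.dump(records, fh, indent=2)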
While training, I mentioned that I am using a bi-objective function in autosklearn, like this:

def bi_objective_fn(solution, prediction):
"""
Calculate a combined score of accuracy and fairness.
:param solution: True labels.
:param prediction: Predicted labels.
:return: Combined score.
"""
protected_attr = "Sex"
metric_id = 2
split = generate_train_subset("test_split.txt")
subset_data_orig_train = data_orig_train.subset(split)
if os.stat("beta.txt").st_size == 0:
default = RandomForestClassifier(
n_estimators=1750,
criterion="gini",
max_features=0.5,
min_samples_split=6,
min_samples_leaf=6,
min_weight_fraction_leaf=0.0,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
bootstrap=True,
max_depth=None,
)
degrees = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
mutation_strategies = {"0": [1, 0], "1": [0, 1]}
dataset_orig = subset_data_orig_train
res = create_baseline(
default,
dataset_orig,
privileged_groups,
unprivileged_groups,
data_splits=10,
repetitions=10,
odds=mutation_strategies,
options=[0, 1],
degrees=degrees,
)
acc0 = np.array(
[np.mean([row[0] for row in res["0"][degree]]) for degree in degrees]
)
acc1 = np.array(
[np.mean([row[0] for row in res["1"][degree]]) for degree in degrees]
)
fair0 = np.array(
[
np.mean([row[metric_id] for row in res["0"][degree]])
for degree in degrees
]
)
fair1 = np.array(
[
np.mean([row[metric_id] for row in res["1"][degree]])
for degree in degrees
]
)
if min(acc0) > min(acc1):
beta = (max(acc0) - min(acc0)) / (max(acc0) - min(acc0) + max(fair0))
else:
beta = (max(acc1) - min(acc1)) / (max(acc1) - min(acc1) + max(fair1))
f = open("beta.txt", "w")
f.write(str(beta))
f.close()
else:
f = open("beta.txt", "r")
beta = float(f.read())
f.close()
beta += 0.2
if beta > 1.0:
beta = 1.0
try:
num_keys = sum(1 for line in open("num_keys.txt"))
print(num_keys)
beta -= 0.050 * int(int(num_keys) / 10)
if int(num_keys) % 10 == 0:
os.remove(temp_path + "/.auto-sklearn/ensemble_read_losses.pkl")
f.close()
except FileNotFoundError:
pass
fairness_metrics = [
1 - np.mean(solution == prediction),
disparate_impact(subset_data_orig_train, prediction, protected_attr),
statistical_parity_difference(
subset_data_orig_train, prediction, protected_attr
),
equal_opportunity_difference(
subset_data_orig_train, prediction, solution, protected_attr
),
average_odds_difference(
subset_data_orig_train, prediction, solution, protected_attr
),
]
print(
fairness_metrics[metric_id],
1 - np.mean(solution == prediction),
fairness_metrics[metric_id] * beta
+ (1 - np.mean(solution == prediction)) * (1 - beta),
beta,
)
combined_score = fairness_metrics[metric_id] * beta + (
1 - np.mean(solution == prediction)
) * (1 - beta)
print(
f"Beta: {beta}, Combined Score: {combined_score}, Fairness Metric: {fairness_metrics}, Accuracy: {np.mean(solution == prediction)}"
)
write_file(
"./titanic_rf_spd_results/titanic_rf_score.txt",
str(
f"Combined Score: {combined_score}, Fairness Metric: {fairness_metrics}, Accuracy: {np.mean(solution == prediction)}\n"
),
mode="a",
)
return combined_score

# Create a custom metric object (bi-objective function)
accuracy_scorer = autosklearn.metrics.make_scorer(
    name="accu",
    score_func=bi_objective_fn,
    optimum=1,
    greater_is_better=False,
    needs_proba=False,
    needs_threshold=False,
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60 * 60,
    memory_limit=10000000,
    include_estimators=["CustomRandomForest"],
    ensemble_size=1,
    initial_configurations_via_metalearning=25,
    include_preprocessors=[
        "kernel_pca",
        "select_percentile_classification",
        "select_rates_classification",
    ],
    tmp_folder=temp_path,
    delete_tmp_folder_after_terminate=False,
    metric=accuracy_scorer,
)

So I am unable to understand what the cost actually represents, as most of the cost values are 0.0. Can you help me with this?
I can't really tell you why it's 0.0 all the time, but one thing that might help to know about is the … You're sure that your metric is able to return a result? It seems like it's just constantly setting the …
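One way to rule out a silently failing metric is to wrap it so errors surface and the scorer's bounds are explicit; safe_bi_objective_fn below is a hypothetical wrapper around the function from the snippet above, and the bounds are assumptions:

import traceback

import autosklearn.metrics


def safe_bi_objective_fn(solution, prediction):
    # Hypothetical wrapper: make sure a plain float is returned and re-raise
    # with a visible traceback instead of letting a failure collapse every run
    # to the same cost.
    try:
        return float(bi_objective_fn(solution, prediction))
    except Exception:
        traceback.print_exc()
        raise


combined_scorer = autosklearn.metrics.make_scorer(
    name="combined",
    score_func=safe_bi_objective_fn,
    optimum=0,                  # the combined value is a loss to minimize
    worst_possible_result=1,    # assumed upper bound of the combined loss
    greater_is_better=False,
)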
Thanks for the help! Can you help me out with …? And also, what does this file … do?
I would advise not touching the config files; honestly, they're quite outdated given the version of sklearn they ran on. You can read more in the autosklearn paper, but essentially the … Therefore the choice comes down to: do you think your data is suitably unique such that the metalearning configurations are all going to perform worse than a random set of configurations? Longer story: I'm still in the process of slowly building a revamped AutoSklearn, and there we hope to include user-provided metadata. Part of this will also be to provide an updated metadataset that solves some issues in the current set of configurations from metalearning.
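If the data is believed to be unusual enough, metalearning can simply be switched off; a sketch reusing the scorer and temp_path from the snippet above:

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60 * 60,
    initial_configurations_via_metalearning=0,  # skip the metalearning warm-start configs
    metric=accuracy_scorer,  # custom scorer defined earlier
    tmp_folder=temp_path,
    delete_tmp_folder_after_terminate=False,
)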
Thanks for the detailed info. Is there any type of caching that happens when we run the same model on the same dataset a couple of times?
Nope, AutoSklearn doesn't cache between calls. In fact there's almost no caching that happens at all, other than dumping models and predictions to disk to use later for ensemble building.
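Those artifacts live under the temporary folder, so they can be inspected directly; a sketch assuming temp_path is the tmp_folder passed above and that the folder is kept after the run:

import os

# The ensemble builder reads models and predictions back from this folder.
for root, _dirs, files in os.walk(os.path.join(temp_path, ".auto-sklearn")):
    for name in files:
        print(os.path.join(root, name))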
Can I get intermediate results of the models which are being ensembled, and apply some technique to make the models better, keeping a mutation-based or out-of-AutoML meta-learning-based idea in mind?
Intermediate results of models while it's running... not easily at all. Intermediate results in terms of post-analysis, yes, although models which are not in the top 50 (default) are pruned to save disk space. Whether you can improve these models further, yup, absolutely. We are revisiting the pipelines in the newer version.
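A sketch of keeping every evaluated model around for post-analysis instead of the default top-50 pruning; max_models_on_disc=None keeps all models at the cost of disk space, and X_train/y_train stand in for your training data:

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60 * 60,
    max_models_on_disc=None,  # keep every model instead of the default top 50
    tmp_folder=temp_path,
    delete_tmp_folder_after_terminate=False,
)
automl.fit(X_train, y_train)
print(automl.leaderboard(ensemble_only=False))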
Hey! Can you tell me what the cost value represents?
It's just the metric value converted in some manner such that it's something to be minimized, which is what SMAC needs. For bounded metrics, this also means it's min-max normalized between 0 and 1.
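As a rough illustration only (the exact conversion lives in autosklearn.metrics and is not spelled out in this thread), a min-max style normalization of a bounded metric into a loss could look like this:

def approximate_cost(score, optimum, worst_possible_result):
    # Maps the best possible score to 0 and the worst possible score to 1,
    # which is the kind of minimized, normalized value SMAC works with.
    return (optimum - score) / (optimum - worst_possible_result)

print(approximate_cost(0.95, optimum=1.0, worst_possible_result=0.0))  # 0.05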
Hello @eddiebergman. In autosklearn, when logging the run history, is it sequential, i.e. in the order in which the automl runs the configurations?
Should be, i.e. the runhistory shows the order in which configurations finished |
I have also noticed that in some cases the … Also, can we print every run configuration as soon as it is executed? Another thing as well: I have implemented a custom scorer, which I named …
Short Question Description
I just want to know where the Bayesian Optimisation works for the hyperparameter search. I am currently working on fairness, so I have a query about that.
Thanks