Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

161 add gradient boosting, logistic regression and random forests #210

Merged
merged 24 commits into from
Dec 13, 2023

Conversation

nmaarnio
Copy link
Collaborator

@nmaarnio nmaarnio commented Oct 23, 2023

Basic implementations of gradient boosting, logistic regression and random forests using Sklearn.

@nmaarnio nmaarnio linked an issue Oct 23, 2023 that may be closed by this pull request
@nmaarnio nmaarnio added the modelling Features related to modelling functions label Oct 23, 2023
@nmaarnio
Copy link
Collaborator Author

@nialov could you take a look at this PR at some point? Also @zelioluca , let me know if you have any comments since you're now also working with Sklearn and modelling stuff. I might be unaware of some best practices in this domain, and I am not sure if the parameter optimization is good like this.

@nmaarnio nmaarnio marked this pull request as ready for review October 26, 2023 09:24
@nmaarnio nmaarnio mentioned this pull request Oct 26, 2023
@nialov
Copy link
Collaborator

nialov commented Nov 2, 2023

Will take a look next week hopefully!

@zelioluca
Copy link
Collaborator

zelioluca commented Nov 2, 2023 via email

Copy link
Collaborator

@nialov nialov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, very small stuff. Fix if you think it is worth the time. I am not an expert on the functionality of these sklearn functions so difficult to say anything about the business logic itself but looks very solid to me!

**kwargs,
)

# Training and optionally tuning the model
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on this comment I understand the fitting should be done regardless of using tuning or not. Should update the comment if this is not the case and remove the else-clause if this is the case.

eis_toolkit/prediction/gradient_boosting.py Show resolved Hide resolved
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

metrics = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky stuff but I would refactor the strings to module variables e.g.

MAE = "MAE"
MSE = "MSE"
...

Could then import these in the tests instead of using strings manually there.

tests/prediction/gradient_boosting_test.py Outdated Show resolved Hide resolved
@zelioluca
Copy link
Collaborator

zelioluca commented Nov 8, 2023 via email

@zelioluca
Copy link
Collaborator

very good job guys! I like it!

@nmaarnio
Copy link
Collaborator Author

@msmiyels I have now pushed the reworked versions of these three ML models. The tests should cover most cases, but I might have missed something.

@nmaarnio nmaarnio requested a review from msmiyels November 15, 2023 10:10
@nmaarnio nmaarnio changed the title 161 add gradient boosting and random forests 161 add gradient boosting, logistic regression and random forests Nov 15, 2023
Copy link
Collaborator

@msmiyels msmiyels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Niko,

thank you, this is very clean and well structured code 🐱‍🏍

Regarding the methods, I have some points to discuss/to check on your side:

  1. Personally, I would like to have an option for the visibility of the training progress (verbose) since it can take quite a while to get finished. Most sklearn functions are supporting this.

  2. I would think about to add the shuffle parameter for the train-test-splits. If you have data which are ordered in a certain way, it would make sense to shuffle the dataset before splitting into train-test-data.

  3. Another point is something we already started to discuss in the meeting last week:
    It would be nice to have the option to split before the actual training and test the model against data that it has never seen before.

    Currently, we have
    1. training without any split
    2. training with some split (simple, cv),

    but the test-portion of the data will always be used for the model to optimize a respective score.

    I think it would be useful to have a real test, even if it becomes only relevant when a sufficient amount of data are available.

  4. There are some comments regarding the test functions within the code review.

tests/prediction/logistic_regression_test.py Outdated Show resolved Hide resolved

# Test that all predicted labels have perfect metric scores since we are predicting with the test data
for metric in metrics:
np.testing.assert_equal(out_metrics[metric], 1.0)
Copy link
Collaborator

@msmiyels msmiyels Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite dangerous. For whatever reason, the returned score for accurary here is 1.0. However, there are some misclassifications (3) when using the same parameters as provided in the test. The correct score should be 0.98 with max_iter = 100.

What I did to check:

# load tool stuff
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from eis_toolkit.prediction.logistic_regression import logistic_regression_train

# get data
X_IRIS, Y_IRIS = load_iris(return_X_y=True)

# define data to use
X = X_IRIS
Y = Y_IRIS

# run training
metrics = ["accuracy"]
model, out_metrics = logistic_regression_train(X, Y, random_state=42, max_iter=100)

# run prediction
y_pred = model.predict(X)

# get number of false classifications and compare
count_false = np.count_nonzero(y_pred - Y)
print(f"Returned metrics: {out_metrics}\nSum of FALSE classifications: {count_false}")

# Returned metrics: {'accuracy': 1.0}
# Sum of FALSE classifications: 3

# calculate and compare accuracy manually
score = accuracy_score(Y, y_pred)
print(f"Manual score: {score}\nDifference to returned score from EIS: {score - out_metrics['accuracy']}")

# Manual score: 0.98
# Difference to returned score from EIS: -0.020000000000000018

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out the reason why accuracy is 1.0 in out_metrics but 0.98 when it is calculated like in your snippet. The out_metrics are produced during training, and in this case a simple split of 20%-80% is used. The model trained with 80% of the data manages to classify correctly all in the 20%, therefore accuracy is 1.0. However, after training the whole data is fit to the model. So both Y and y_pred in score = accuracy_score(Y, y_pred) are different in your code than what they are in the training function, and model trained with the whole data apparently doesn't manage to predict all its labels correctly.

tests/prediction/logistic_regression_test.py Outdated Show resolved Hide resolved
eis_toolkit/prediction/gradient_boosting.py Show resolved Hide resolved
eis_toolkit/prediction/random_forests.py Show resolved Hide resolved
@nmaarnio
Copy link
Collaborator Author

nmaarnio commented Nov 20, 2023

Hi Micha,

thanks for the thorough review and good comments!

I'll add both verbose mode and shuffle to simple train-test split (I noticed shuffle=True by default in the Sklearn function, but I can make it explicit).

About the option to split data before, do you think running a model with test_method = "none" does not do the trick here? Of course, if you want to evaluate your model in this case, you need to do it after training the model using a Sklearn method. I might have misunderstood what you meant here, so can you maybe describe it a bit more?

@msmiyels
Copy link
Collaborator

Hi Niko,

sorry for my delay. Of course 😉

What's currently implemented uses either

  • the complete dataset for training (no split option), or
  • a portion of it, with the other portion used to optimize a certain score (simple split, cv)

What's missing is the option to train with a either the first or second approach, but to keep another independent part of if input data away from the modeling. This third part can be used for a "real" test szenario, as the data were neither seen nor incorporated in the modeling part.

Its only useful if enough data are available, which is, especially in geo-space, often not the case, but in principle, it would look like this:

  1. split the data into two parts. One for training (train) and one for testing (test) on the "unseen" data
  2. train the model, either without any split or with simple/cv options on the train dataset and optimize scores on a certain metric and the validation (valid) part, if those.
  3. Test the model on the test dataset, which was not used for any modeling stage.

Does this makes more sense now? Maybe it's already possible to build the workflow like this calling the functions with different parameters, but if so, it would be good to have this kind of workflow integrated in the Plugin.

Cheers 😉

@nmaarnio
Copy link
Collaborator Author

Hey,

no problem.

Okay I understand now. Do you think we should parameterize the training functions so that this whole workflow could be accomplished with one function call? What the user can now do is just as you described, first split their data (using train_test_split for example, we can make this importable from EIS or create a wrapper), then train how they like, and finally use predict, score or some other Sklearn method. Of course, this assumes the user is aware that all these functions/options exist and can be used.

I guess we could add parameter evaluate_split or validate_split that would define the other split of data (and maybe swap the names so validate_split is the "inner" one and test_split for the unseen data), or give the option to give already separated test dataset. However, I am a bit worried these functions will grow too large and complicated to use. In the Plugin, I am sure we can find some ways to show all these different options to the user. But what do you think? I am open to suggestions how to best incorporate this feature :)

This was linked to issues Nov 22, 2023
@msmiyels
Copy link
Collaborator

Ahoi Niko 🏴‍☠️,

hope u had a nice start into the new week!

I think this train/valid/test split is quite common in the ML world. However, I agree on the point of keeping things simple and do not overload it with complexity. So, even if it would be nice to have something like this in the core functionality of the toolkit, we could assume that advanced users who use the toolkit in code rather than the plugin are able to do

  1. the train_test_split based on the sklearn functionality
  2. train the model with the eistoolkit (either with or without simple split and validation data)
  3. predict with the test data to get test results with the eistoolkit on unseen data

For the non-advanced users, it would be great to have this integration in the plugin, but its totally okay if it uses the "workflow" above instead of calling a single complex toolkit function. We just have to take care that we describe the different approaches in the guidelines, for both user types.

But to keep the terms consistent, I would suggest to call everything in the eistookit functions

  • related to the training: train
  • related to the validation and scoring during training and model-building: valid
  • related to the preditction and scoring with unseen data: test
  • related to the prediction with unknown labels (i.e. the purpose of this whole thing here): prediction

Would you agree on this?

- Renamed test -> validation where validation was meant
- Renamed simple_split -> split
- Renamed simple_split_size -> split_size
@nmaarnio
Copy link
Collaborator Author

nmaarnio commented Nov 29, 2023

Hi!

I definitely agree with term consistency – good that you pointed it out! The terms used were quite unclear.

Now:

  • core/private function is called _train_and_validate_sklearn_model
  • test_method is called validation_method
  • simple_split is called just split and simple_split_size is called split_size (I thought the "simple" didn't add much to explain this validation method)
  • docstrings are talking only about validation instead of testing / validation

Regarding the workflow for testing/scoring model with unseen data, I think this 3-step workflow is good for the Toolkit side. However, I ended up creating simple EIS Toolkit functions split_data, test_model and predict, so now users can still use our own functions/wrappers to conduct this workflow. What do you think about the structure now?

@nmaarnio nmaarnio requested a review from msmiyels December 7, 2023 13:26
@nmaarnio nmaarnio mentioned this pull request Dec 7, 2023
# Approach 2: Validation with splitting data once
elif validation_method == SPLIT:
X_train, X_valid, y_train, y_valid = split_data(
X, y, split_size=split_size, random_state=random_state, shuffle=True
Copy link
Collaborator

@msmiyels msmiyels Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the split_data function allows the shuffle on/off, it could be added to this function as parameter, too (but kept as a True default). Also not crucial and most likely for advanced users only.

Copy link
Collaborator Author

@nmaarnio nmaarnio Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the shuffle as a parameter to the inner function (_train_and_validate_sklearn_model). I didn't add it yet to the public ones, but can do that too if you think it should be possible.

Copy link
Collaborator

@msmiyels msmiyels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahoi Niko, 🏴‍☠️

nice work! 🐱‍🏍

I've only got these two points:

  • random_seed to None would be better
  • no "hardcoding" for shuffle parameter

Except of these, ready to merge 🚀

@nmaarnio nmaarnio merged commit 0b84fc9 into master Dec 13, 2023
4 checks passed
@nmaarnio nmaarnio deleted the 161-add-gradient-boosting-and-random-forests branch December 13, 2023 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
modelling Features related to modelling functions
Projects
None yet
6 participants