Fix package dependency issues #36

Open
wants to merge 32 commits into master
Conversation

@ChristopherGS (Contributor) commented Apr 9, 2022

Update: For students checking this PR, the fixes have now been applied to the commit history. Please update your local copy from the current master branch.

Fix for the "Unit Testing a Production ML Model" section.

ChristopherGS and others added 29 commits February 1, 2020 10:03
# by the `fit` method, since this allows us to access the transformed
# dataframe. For other models we could use the `transform` method, but
# the GradientBoostingRegressor does not have a `transform` method.
X_transformed, _ = pipeline.price_pipe._fit(X_train, y_train)
@solegalli previously I was able to use the `_fit` method to access the transformed dataframe of the pipeline. This method seems to have been updated (possibly here: scikit-learn/scikit-learn@88ce8cd) and is throwing an error.

Any idea how to access the transformed dataframe for the GradientBoostingRegressor (since it doesn't have a `transform` method)?


Not straightforward. Here are some potential solutions:

https://stackoverflow.com/questions/54332654/how-to-analyse-the-intermediate-steps-of-sklearn-pipeline
https://stackoverflow.com/questions/48743032/get-intermediate-data-state-in-scikit-learn-pipeline?

In short, I think you need to apply all the transformations one after the other, just up to the one before the model.
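
A minimal sketch of that step-by-step approach, assuming `price_pipe` has already been fitted and that every step except the final GradientBoostingRegressor implements `transform` (the helper name below is made up for illustration):

from gradient_boosting_model import pipeline


def transform_up_to_model(fitted_pipe, X):
    # Run X through each fitted transformer in order, stopping before
    # the final estimator (the regressor has no `transform` method).
    transformed = X
    for _name, step in fitted_pipe.steps[:-1]:
        transformed = step.transform(transformed)
    return transformed


# e.g. transformed_inputs = transform_up_to_model(pipeline.price_pipe, X_train)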


OK, I asked the sklearn community on the mailing list and one of the developers told me to do this:

Using slicing: `model[:-1].transform(X)`


That worked perfectly :)

from gradient_boosting_model import pipeline
from gradient_boosting_model.config.core import config

from functools import reduce


def test_pipeline_drops_unnecessary_features(pipeline_inputs):
    # Given
    X_train, X_test, y_train, y_test = pipeline_inputs
    assert config.model_config.drop_features in X_train.columns
    pipeline.price_pipe.fit(X_train, y_train)

    # When
    # We access the transformed inputs with slicing
    transformed_inputs = pipeline.price_pipe[:-1].transform(X_train)

    # Then
    assert config.model_config.drop_features in X_train.columns
    assert config.model_config.drop_features not in transformed_inputs.columns


def test_pipeline_transforms_temporal_features(pipeline_inputs):
    # Given
    X_train, X_test, y_train, y_test = pipeline_inputs

    # When
    # We access the transformed inputs with slicing
    transformed_inputs = pipeline.price_pipe[:-1].transform(X_train)

    # Then
    assert (
        transformed_inputs.iloc[0]["YearRemodAdd"]
        == X_train.iloc[0]["YrSold"] - X_train.iloc[0]["YearRemodAdd"]
    )

@@ -1,2 +1,3 @@
flask>=1.1.1,<1.2.0
markupsafe==2.0.1 # https://github.com/aws/aws-sam-cli/issues/3661
Key Section 5 fix

feature_engine>=0.3.1,<0.4.0
joblib>=0.14.1,<0.15.0
numpy>=1.20.0,<1.21.0
pandas>=1.3.5,<1.4.0
Key section 4 fix 1


# old model for testing purposes
# source code: https://github.com/trainindata/deploying-machine-learning-models/tree/master/packages/regression_model
tid-regression-model>=2.0.20,<2.1.0
tid-regression-model==3.1.2
Key section 4 fix 2

@@ -1,4 +1,5 @@
Flask>=1.1.1,<1.2.0
markupsafe==2.0.1 # https://github.com/aws/aws-sam-cli/issues/3661
Key fix for section 9

@@ -1,12 +1,13 @@
# ML Model
-tid-gradient-boosting-model>=0.1.18,<0.2.0
+tid-gradient-boosting-model>=0.3.0,<0.4.0
Key fix for section 9 (2/2)

Note: this required publishing a new version of the gradient boosting model (so the sklearn versions are compatible).
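
A minimal sketch (not from the PR itself) for sanity-checking that an environment satisfies the new pins, assuming both packages are already installed:

from importlib.metadata import version  # Python 3.8+

# Hypothetical check, not part of the repo: the model package should now be a
# 0.3.x release, and scikit-learn should be a version that package was built against.
print(version("tid-gradient-boosting-model"))
print(version("scikit-learn"))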

@ChristopherGS force-pushed the master branch 7 times, most recently from 1716ef4 to 72ae086 on April 16, 2022 at 20:15.