Data Shift detection over time with Resemblance Model #72
Comments
I see how training a resemblance model could help detect seasonality in the data. I think there should be more context provided before offering a tool like this:
Hmm, I see this throwing confusing outputs. There are two potential flaws
Addressing your comments:
What do you think?
This blog post identifies different types of data shift and provides some ways of tackling them. TL;DR:
In terms of Probatus implementation this can form a module,
@anilkumarpanda That is a great proposal. It seems like a very large feature, though. If we split it into the three parts, tackle each one separately, and then combine them using some wrapper, that would be doable. For the first point, most can be done using components already available in probatus. For the rest, we would need to use other libraries. To work on it, we would probably need the involvement of multiple collaborators and a more structured way of working. Who would like to contribute to that?
I agree with the above points. The feature is a large one and needs to be separated. I will create separate issues for this one and link them to this master issue.
This issue proposes a way to measure at which point in time a data shift can be observed.
One idea would be to split the data into multiple folds based on time, e.g. 10. Then one could train a resemblance model taking:
By measuring the AUC across the different split times, we could observe whether the data changes significantly at any point in time, and we could monitor which features contribute to that.
One drawback is that the first and last iterations would lead to a high AUC, due to the small sample size of one of the classes.
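A rough sketch of how the split-point idea could look, to make the discussion concrete. It assumes that, for split k, rows in the first k time folds get class 0 and all later rows get class 1; that labelling is my reading of the proposal, not something confirmed here. The function name `auc_per_split`, the use of plain scikit-learn with `LogisticRegression` as the resemblance classifier, and the synthetic data are all illustrative assumptions, not probatus code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def auc_per_split(X, n_folds=10, clf=None, seed=0):
    """Return {split index k: cross-validated resemblance AUC}.

    Hypothetical helper. Rows of X must already be sorted by time.
    Assumption: for split k, rows in the first k time folds get label 0
    and all remaining rows get label 1.
    """
    clf = LogisticRegression() if clf is None else clf
    # Assign each row to one of n_folds contiguous, time-ordered folds.
    fold_id = np.arange(len(X)) * n_folds // len(X)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    aucs = {}
    for k in range(1, n_folds):
        y = (fold_id >= k).astype(int)
        aucs[k] = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    return aucs


# Synthetic data: feature 0 shifts by 3 standard deviations halfway through.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
X[300:, 0] += 3.0

aucs = auc_per_split(X)
best_split = max(aucs, key=aucs.get)
for k, auc in aucs.items():
    print(f"split after fold {k}: AUC = {auc:.3f}")
print("largest AUC at split", best_split)
```

On this toy data the AUC should be highest at the split that coincides with the shift. On real data the small-class caveat for the first and last splits still applies, and the classifier can be swapped through the `clf` argument.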
Another idea would be to again split the data into folds based on time. However, in this case we would perform cross-validation, where at each iteration only one fold belongs to class 1. This way we could point to e.g. months that significantly differ from other months in the past and the future.
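The one-fold-versus-rest variant could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `auc_one_fold_vs_rest`, the choice of `LogisticRegression`, and the synthetic data (one time fold deliberately shifted) are mine, and the inner stratified cross-validation is just one way to get a validation AUC for each iteration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def auc_one_fold_vs_rest(X, n_folds=10, clf=None, seed=0):
    """Return {fold index k: resemblance AUC of fold k against all other folds}.

    Hypothetical helper. Rows of X must already be sorted by time.
    """
    clf = LogisticRegression() if clf is None else clf
    # Contiguous, time-ordered folds (e.g. one fold per month).
    fold_id = np.arange(len(X)) * n_folds // len(X)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    aucs = {}
    for k in range(n_folds):
        # Only fold k belongs to class 1 in this iteration.
        y = (fold_id == k).astype(int)
        aucs[k] = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    return aucs


# Synthetic data: only the rows of time fold 3 (rows 180..239) are shifted.
rng = np.random.default_rng(1)
X_months = rng.normal(size=(600, 2))
X_months[180:240, 0] += 3.0

fold_aucs = auc_one_fold_vs_rest(X_months)
odd_fold = max(fold_aucs, key=fold_aucs.get)
print("fold that differs most from the rest:", odd_fold)
```

A fold whose AUC stands well above the others is a candidate for a period that differs from both the past and the future.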
This feature could be part of sample_similarity, and the user could specify which resemblance model to use: permutation or SHAP based (the default could be SHAP).
The init should take the clf and the resemblance_model type as arguments, and fit should take the X dataset, an indication of the number of bins or the split dates, and the date column name.
The output of compute should be a report presenting the validation AUC for each iteration of the process, the current time split date, and the top features of the resemblance model (top_n parameter?).
In the plot method we could plot the AUC over time, but the user should also be able to plot the resemblance model plots for a specific iteration, to analyse it.
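To make the interface above easier to discuss, here is a minimal sketch of what such a class might look like. Everything in it is an assumption for illustration: the class name `TimeSplitResemblance`, the parameter names, and the report columns are hypothetical and not part of probatus. It uses scikit-learn's `permutation_importance` only (no SHAP variant), a single train/validation split per iteration, and omits the plot method.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


class TimeSplitResemblance:  # hypothetical name, not a probatus class
    def __init__(self, clf, resemblance_model="permutation", top_n=1, random_state=0):
        if resemblance_model != "permutation":
            raise ValueError("this sketch only covers the permutation variant")
        self.clf = clf
        self.top_n = top_n
        self.random_state = random_state

    def fit(self, X, date_column, n_bins=10, split_dates=None):
        self.X_ = X.sort_values(date_column).reset_index(drop=True)
        self.date_column_ = date_column
        if split_dates is None:
            # Use the first date of every bin except the first as a split date.
            dates = self.X_[date_column]
            cut_positions = np.arange(1, n_bins) * len(dates) // n_bins
            split_dates = list(dates.iloc[cut_positions])
        self.split_dates_ = list(split_dates)
        return self

    def compute(self):
        dates = self.X_[self.date_column_]
        features = self.X_.drop(columns=self.date_column_)
        rows = []
        for split_date in self.split_dates_:
            # Class 1: rows on or after the split date; class 0: rows before it.
            y = (dates >= split_date).astype(int)
            X_tr, X_val, y_tr, y_val = train_test_split(
                features, y, test_size=0.3, stratify=y,
                random_state=self.random_state)
            model = clone(self.clf).fit(X_tr, y_tr)
            auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
            imp = permutation_importance(
                model, X_val, y_val, scoring="roc_auc", n_repeats=5,
                random_state=self.random_state)
            order = np.argsort(imp.importances_mean)[::-1][: self.top_n]
            rows.append({"split_date": split_date,
                         "validation_auc": auc,
                         "top_features": list(features.columns[order])})
        return pd.DataFrame(rows)


# Synthetic example: the "drifting" column shifts from row 300 onwards.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=600, freq="D"),
    "stable": rng.normal(size=600),
    "drifting": rng.normal(size=600),
})
df.loc[300:, "drifting"] += 3.0

report = TimeSplitResemblance(LogisticRegression(), top_n=1).fit(df, "date").compute()
print(report)
```

A plot method could then simply draw `validation_auc` against `split_date` from this report, and the per-iteration resemblance plots would need the fitted model of each iteration to be stored as well.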