
the validity of employing ml model for the covariate shift calculation in the example [Arangopipe_Feature_ext2_output.ipynb] #167

Open
tomaszek0 opened this issue Dec 22, 2021 · 3 comments

Comments

tomaszek0 commented Dec 22, 2021

Hi,
The approach in the example is a simple way to demonstrate the covariate shift. Thank you for an informative description of the covariate shift detection problem and your work. However, I am a bit confused. What is the sense to engage a machine learning model to solve a problem that is solved at a start by "human learning", ie. predetermined dividing of data on two groups according to generated histogram? This histogram showed us already the covariate shift in the dataset. For me, dividing the dataset into a reference group (a group with lat values less than -119) and a group representing the whole dataset makes more sense. I know that there is demonstrated a simple example, but the addition of a dataset example with a hided covariate shift would be helpful (the breast cancer dataset is a classic and very easy binary classification dataset from sklearn.datasets).

@rajivsam (Contributor)

Hi @tomaszek0 , thanks for the question. The idea is this: in real-world models, dataset shift and covariate shift occur because business conditions change, for example customer tastes shift, market forces move, and so on. In this example we simulate two such datasets drawn from different conditions; in the real world this would happen organically. Does that help? Your request for a real-world example is noted.
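To make the simulation idea concrete, here is a minimal sketch (synthetic data, not the notebook's dataset) of a classifier two-sample test: label each row with the dataset it came from and check whether a model can tell the two apart. An AUC near 0.5 means no detectable shift; well above 0.5 means shift:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulate "before" and "after" datasets whose covariate
# distributions differ; only the first covariate is shifted.
rng = np.random.default_rng(42)
before = rng.normal(0.0, 1.0, size=(500, 3))
after = rng.normal(0.0, 1.0, size=(500, 3))
after[:, 0] += 1.5

# Classifier two-sample test: can a model identify each row's source?
X = np.vstack([before, after])
y = np.concatenate([np.zeros(500), np.ones(500)])
auc = cross_val_score(LogisticRegression(), X, y,
                      cv=5, scoring="roc_auc").mean()

print(f"cross-validated AUC: {auc:.2f}")  # well above 0.5 => shift detected
```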

@tomaszek0 (Author)

Hi @rajivsam , simpler = better (https://towardsdatascience.com/the-limitations-of-machine-learning-a00e0c3040c6). Du Phan shows a similar approach to covariate shift detection (https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6). I feel that in the case of "Arangopipe-feature..." some additional explanation of the feature drift should be given, by computing per-feature drift values as an equivalent of feature importance: we should be able to identify the feature that discriminates the corrupted (shifted) samples from the reference (stationary) samples (see for example https://docs.seldon.io/projects/alibi-detect/en/latest/examples/cd_spot_the_diff_mnist_wine.html).
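One simple way to get such per-feature drift scores (in the spirit of the linked primer, though this is my own minimal sketch on synthetic data) is a two-sample Kolmogorov-Smirnov test on each covariate separately, so the drifting feature stands out:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic reference and current samples; only feature 0 drifts.
rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, size=(400, 3))
current = rng.normal(0.0, 1.0, size=(400, 3))
current[:, 0] += 1.0

# Two-sample KS test per covariate: the drifting feature should show
# the largest statistic (and the smallest p-value).
results = [ks_2samp(reference[:, j], current[:, j]) for j in range(3)]
for j, r in enumerate(results):
    print(f"feature {j}: KS statistic {r.statistic:.3f}, "
          f"p-value {r.pvalue:.3g}")
```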

rajivsam (Contributor) commented Dec 29, 2021

@tomaszek0 noted. In fact, if you choose to implement drift detection with logistic regression, then you get what you are referring to. The choice of classifier is really a design and application preference. Interest in this feature is noted.
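A sketch of what this looks like in practice (synthetic data, standardized features, and my own illustrative code rather than the notebook's): when the drift detector is a logistic regression, the absolute values of its coefficients double as a per-feature drift importance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic reference and current samples; only feature 1 drifts.
rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, size=(600, 4))
current = rng.normal(0.0, 1.0, size=(600, 4))
current[:, 1] += 1.2

# Standardize so coefficient magnitudes are comparable across features.
X = StandardScaler().fit_transform(np.vstack([reference, current]))
y = np.concatenate([np.zeros(600), np.ones(600)])

clf = LogisticRegression().fit(X, y)
importance = np.abs(clf.coef_[0])
print("per-feature drift importance:", np.round(importance, 2))
```

The drifting feature dominates the coefficient vector, which is the "drift as feature importance" view requested above.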
