Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Usinglog_metric() with numpy types or other invalid values crashes Redis 😰 #127

Open
mohammedri opened this issue Apr 22, 2020 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@mohammedri
Copy link

We have faced this issue before, but for some reason it was ignored. Recently it was resurfaced by https://github.com/SebastianGergen again in the Slack community channels.

Doing log_metric() with anything other than ints or floats essentially break the entire system and requires you to re-install Atlas because Atlas goes into a non-recoverable state.

This happens because our Flask API is expecting a certain contract with Redis which is not being met.

We should fail more gracefully, or support numpy types.

Related: #96


Other details
This was his message:

Hi Atlas Team, one more issue that I faced today in my "on premise" atlas installation: when I make "foundations.log_metric('accuracy', accuracy)" and add a value which is somehow "invalid" (like accuracy = np.random.rand(1)[0]) it seems that the whole database crashes or something. I tried several simple "valid" values like accuracy = 0.5, and it works fine.  The result after the crash is problematic: The project overview page does not show up any more (actually it only shows up for a brief short milisecond and then disappears), and I am left with a blank GUI page in the browser. Only way to recover for me until now is re-installing :disappointed: Sorry. this description is not very clear. I hope you can get my problem :slightly_smiling_face:

when I use accuracy = float(np.random.rand(1)[0]) and foundations.log_metric('accuracy', accuracy) it works fine. So i guess that numpy makes a problem, anyways it is pretty unconvenient, that at least from my perspective I can't recover atlas after once appending a numpy based value.
@mohammedri mohammedri added the bug Something isn't working label Apr 22, 2020
@pippinlee pippinlee self-assigned this Apr 28, 2020
@mohammedri
Copy link
Author

Does this PR #130 fix this issue as well @pippinlee ?

@pippinlee
Copy link

Don't believe the colon issue was the same root cause, but @nicknagi correct me if I'm wrong.

@mohammedri
Copy link
Author

Sounds good! Did you have an expected timeline for this?

@pippinlee
Copy link

Tested out log_metric() using np.float64 types specifically, trying to reproduce previous errors using .log_metric('key', np.random.rand(1)[0]).

I was able to successfully log without error. This may have been thanks to Nick’s fix. I'll add a test to prove np.float64 support. I think we should be clearer with what types we do allow logging for though. Here’s a list of values to help us understand current state WRT numpy:

Valid
np.float64(29)

Invalid
type(np.random.rand(1)[0])
np.array([1,2,3,4])
np.ndarray([1,2,3])

Note: all of the above invalid values gracefully fail with an error like the following: TypeError: Invalid metric with key="np.ndarray([1,2,3])" of value=[[[ 2.33233293e-316 -4.3347313 ... with type <class 'numpy.ndarray'>. Value should be of type string or number, or a list of strings / numbers.

Ongoing questions:

  1. Is that error message helpful?
  2. Can we better define types we wish to support? That'll help us with these one offs "does log_metric() support type X", and we can provide this in docs to make this clearer to users.

@mohammedri
Copy link
Author

mohammedri commented May 11, 2020

Initial thoughts:

Is that error message helpful?

I think the error message is helpful as is. The last message clearly specifies it all.
TypeError: Invalid metric with key="np.ndarray([1,2,3])" of value=[[[ 2.33233293e-316 -4.3347313 ... with type <class 'numpy.ndarray'>. Value should be of type string or number, or a list of strings / numbers.

Can we better define types we wish to support?

The intention behind log metric is to support recording metrics such as precision, recall, loss, Area Under Curve, f1-score et.al. NOT to support the outputs of a task e.g. the output of a BERT model, or the output of a generative NLP model - as such it seems weird to me that we support strings and a list of strings / numbers.

On the contrary I can see - why someone would wish to view the output on the GUI.

There are two routes to go from here:

  1. We be more flexible: support strings, lists of string / numbers, np.array et. al.
  2. We stick to what log_metric is supposed to be there for and strip off support for strings / lists.

Can anyone recall why we decided to support strings?

@mohammedri
Copy link
Author

My vote is to strip support for strings / lists and use log_metrics for what it is meant to be used for

@ekhl
Copy link
Contributor

ekhl commented May 11, 2020

The decision of whether to change the behaviour of this feature (e.g. modify the types supported by log_metric) looks to be out of scope of this issue. I suggest that if this bug has been fixed we close it out and open a new issue to address that separate concern.

@pippinlee
Copy link

Works for me. Going to submit a PR to cover us for np.float64 and then we can close.

@kyledef
Copy link
Contributor

kyledef commented May 11, 2020

I think this issue also came up when Damien and others in the DS team were exploring Atlas. I suspect their use case is a valid concern for us. We should understand better what it is before we mark this as closed.

pippinlee added a commit that referenced this issue May 15, 2020
… with log_metric(), this test case ensures that is no longer the case #127
@shazraz
Copy link

shazraz commented May 15, 2020

@mohammedri & @pippinlee ,

@cindyzczhao and I were able to reproduce this error on 0.1.1 using the following code snippet

import foundations
import numpy as np
x = np.float64(5)
fooundations.log_metric('metric', x)

It broke the GUI and we had to flush redis to get things back up.

@pippinlee
Copy link

It broke the GUI and we had to flush redis to get things back up.

Thanks for the heads up. Have confirmed this is because we need to do a new release––using current release 0.1.1 still has log_metric bug.

Latest master works without error with logging np.float64(5). Have added a cypress test to confirm this.

@mohammedri
Copy link
Author

Hey @pippinlee okay to close this issue as this is fixed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants