-
Notifications
You must be signed in to change notification settings - Fork 631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add error handling prototype #2244
Conversation
94364c7
to
6841423
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the succinct prototype, I think I get the idea 👍
from sys import exc_info | ||
|
||
|
||
def _safe_function(predefined_return_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might there be cases where a predefined value won't work and we need a factory function instead? E.g. best effort propagation with NonRecordingSpan(span_context)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is a very valid point. In a previous attempt to implement this I tried to make it possible for the safety mechanism to return a value that would be the result of a certain SDK function that would receive the arguments passed to the API function so that different values could be returned depending on the actual call the end user made.
Nevertheless, there is a paradox, the code that makes this new value can also fail, and would also need a predefined value to return if that happened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevertheless, there is a paradox, the code that makes this new value can also fail, and would also need a predefined value to return if that happened.
That code would also be wrapped or we could explicitly/manually safe-guard it so should be doable I guess?
|
||
@_safe_function(0.0) | ||
def function(a: int, b: int) -> float: | ||
return _get_sdk_module("trace").function(a, b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am weary of this dynamic behavior–it looks very difficult to test or have static checks protecting the behavior. That combined with the warning mechanism, I am worried it would be easy to have a silent breakage.
As you said in the README, for this to work every SDK must keep the exact same import paths and fully-qualified symbols. This is a big design requirement and easy to accidentally mess up for SDK implementors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am weary of this dynamic behavior–it looks very difficult to test or have static checks protecting the behavior.
Sorry what can be difficult to test?
What do you mean with static checks?
That combined with the warning mechanism, I am worried it would be easy to have a silent breakage.
I think that is a valid objection but to the spec itself. It is possible to have a silent breakage by not immediately raising an exception that crashes the application, but that is what the spec requires if I understand it correctly.
As you said in the README, for this to work every SDK must keep the exact same import paths and fully-qualified symbols. This is a big design requirement and easy to accidentally mess up for SDK implementors.
Well, what we need is a mechanism that can tell the API where is the corresponding SDK object located. We are using the same qualified path of the API object to find its matching SDK object but that can be changed with a mechanism that uses an arbitrary SDK-defined mapping to make it possible for the API to find the SDK objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry what can be difficult to test?
That the SDK is correctly reproducing the exact names and import paths as the API.
What do you mean with static checks?
pylint/mypy/IDEs can check things like that your import paths are correct or that you implement an ABC correctly. In this case, it's impossible to tell these tools "this package must mirror all of the interfaces of OpenTelemetry API".
Well, what we need is a mechanism that can tell the API where is the corresponding SDK object located. We are using the same qualified path of the API object to find its matching SDK object but that can be changed with a mechanism that uses an arbitrary SDK-defined mapping to make it possible for the API to find the SDK objects.
I think a better way to do this is with a class or protocol. You could also expose the SDKs functionality as an interface which SDKs would implement and the API knows how to use:
# could also be typing.Protocol
class SDK(ABC):
@abstractmethod
def class0_factory() -> Type[Class0]:
pass
def function(a, b):
pass
def Class0() -> Class0:
_sdk.class0_factory()
def function(a, b):
_sdk.function()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pylint/mypy/IDEs can check things like that your import paths are correct or that you implement an ABC correctly. In this case, it's impossible to tell these tools "this package must mirror all of the interfaces of OpenTelemetry API".
Ok, but I never expected or intended for the static checkers to be the ones responsible of checking that the SDKs are compliant with the API.
This is a very OpenTelemetry-specific requirement, I don't think static checkers or the typing
module to be designed with this in mind, so this approach looks like forcing our design to work with a tool that was never intended to check this kind of things.
The approach here is for the API to have functions and abstract classes that the SDK has to implement. We can provide a list of the functions and classes so that the SDK implementations know what they have to implement and they can check that their SDKs implement them fully.
To summarize, I do agree with you @aabmass that it is very important for SDKs to be able to have a test case that pretty much says "the API is being fully implemented" or not. I just think that this kind of testing is very OpenTelemetry-specific and it would be necessary to force the design of this mechanism and the design of static type checkers to make this checking happen in a static type checker. It is better to provide clear documentation for SDK implementations and maybe additional functionality in the API (like a list of all the stuff that any SDK has to implement) so that every SDK can easily implement their own test case for API compliance.
The API or SDK may fail fast and cause the application to fail on | ||
initialization... | ||
|
||
*Initialization* is understood as the process of setting the SDK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think arguably some would consider creating Tracers/Instruments/Exporters/Views part of the initialization process. Could we make the fail-fast behavior configurable for users who want to make sure long-lived objects like these are created successfully?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, what do you mean with fail fast behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing on initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, but how would this work? Imagine this situation:
# application.py
set_sdk("sdk") # This can fail fast
... # Here the user does a lot of stuff including using the OpenTelemetry API
meter = create_meter(...) # This can also fail fast?
That would mean that after a lot of code has been executed something suddenly fails because creating a meter is considered initialization. This behaves like something that breaks the application code. Now, someone may argue that meters, tracers, etc are always created before doing anything else but there is no guarantee that would be the case. We could make it configurable that the creating of these objects fail fast but I see almost no value in this, the moment one of these objects is created and something goes wrong a warning will be raised and the user will know and if they want the application to crash they can run it in error mode. I don't see why we need to handle the creation of these objects in a special way.
|
||
The Python warning that is "raised" when an exception is raised in the SDK | ||
function or method can be transformed into a full exception by running the | ||
Python interpreter with the `-W error` option. This Python feature is used to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not too familiar with the warnings
module. Is there a way to automatically turn this on when running tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is an option that can be passed to Pytest, if I remember correctly, I have done that before.
safety_prototype/README.rst
Outdated
After an SDK is set, calling an API function or method will call its | ||
corresponding SDJ function or method. Any exception raised by the SDK function | ||
or method will be caught by the safety mechanism and the predefined value | ||
returned instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the API needs to provide the safety mechanism for the SDK. This would only work if the user does all instrumentation through the API which is not currently the case; instead we have the SDK as an implementation of the API. For example this should work fine:
## server.py
from opentelemetry.trace import TracerProvider
class Server:
def __init__(self, tracer_provider: TracerProvider):
self._tracer = tracer_provider.get_tracer(type(self)
def index(request: Request) -> Response:
with self._tracer.start_span("do-something"):
# ...
## main.py
from opentelemetry.sdk.trace import TracerProvider
import server
tracer_provider = TracerProvider(...)
server.Server(tracer_provider).serve()
As I understand it, the error handling mechanism would not work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it won't work here, but one of the central requirements of this prototype is that the user only uses API objects. I am aware of the fact that the user can currently use the SDK directly, I understand the argument, I just don't think it applies to this prototype.
There is additional advantage in this approach: We can definitively decide what is a breaking change and what is not. A breaking change would be:
This means that the semantic versioning schema we have been following so far would not apply to the SDK. Only the API has API proxy objects or SDK setting functions. This makes sense, it matches the first requirement of semantic versioning. |
@ocelotl In the SIG you said this would require breaking changes we can't make. Do you still need reviews on this PR? |
Yes, please, keep reviewing. We need to be compliant with the spec somehow, so we need to move forward in this direction. Maybe we need to break, but that is not as bad, we can keep supporting version 1.X and 2.X at the same time, users just have to stick with the one they are currently using. |
I think we should try to keep v1.x as long as possible–OTel has a bad reputation for churn already. If we had two incompatible API versions, wouldn't we need to maintain one version of each instrumentation for each API as well? Can we take that option off the table? What do you think of just using the decorator you proposed here in both the API and SDK to wrap key methods which shouldn't throw? |
""" | ||
|
||
|
||
def internal(function): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thought I had: If we decide to lose the predefined return value (and instead return None), this could be implemented as a class that each class (or base class for that matter) in the SDK inherits from. Something like:
class safetyClass(object):
def __getattribute__(self, name):
returned_attribute = object.__getattribute__(self, name)
if callable(returned_attribute):
return _safety(returned_attribute)
return returned_attribute
This way, it's way less of a breaking change that needs to happen all at once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but we can't always return None
. 🤷 The safety mechanism must return the same kind of object as the corresponding function or method is expected to return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you're right, but we should change the calling functions to understand that an exception occurred and the returned value is a predefined one... For example, if in your snippet someone calls the safe function(1,0)
the return value would be 0.0
, which isn't the true result of 1/0
- it is a result that there was an exception, and the calling function needs to check the return value.
And if we are choosing to do so, we might as well create some object that represents that an exception occurred, and use the option I offered with that value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calling functions do not need to check the return value, they will receive a fake, predefined value when an exception is raised. This is intentional and the way that the user can know that the value is not real is for them to run the application with the -W error
option so that warnings are turned into exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, makes sense
except Exception: # pylint: disable=broad-except | ||
exception = "".join(format_exception(*exc_info())) | ||
|
||
if exception is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I think that an expected behaviour of such a feature would be that user made exceptions would be raised, no? OpenTelemetry should be seamless for the user's application, and if it is exception-driven, this will alter the behaviour (unless -W error
is used and then OpenTelemetry originated errors will alter the behaviour)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think there may be a misunderstanding here. The mechanism proposed here is intended to guard against exceptions raised in the SDK. Any exception raised in the application code directly that is caused by the calling of a function or method outside of the OpenTelemetry API is not protected and will be raised normally.
So, you mention "user made exceptions". If you mean by that exceptions raised by code not called by functions or methods in the OpenTelemetry API, then this mechanism will not cause any issues, these "user made exceptions" will be raised normally.
If with "user made exceptions" you mean exceptions in any part of the SDK that were "intentionally coded", for example something like this:
@_safe_function(X)
def some_function(a, b):
...
if some_condition:
raise Exception("Some exception")
...
then the mechanism will also catch these exceptions. This is intentional. The spec says that any exception of this kind must be handled by the safety mechanism, regardless of how "intentional" it is or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, Maybe I am missing something, but that's not what I meant. What I meant is something where the OpenTelemetry wrapper is "safe", and the original function which it wraps raises an exception (For example, raise Http404()
in Django). In this case, the original function will raise the exception, which will raise to the OpenTelemetry Django wrapper, which will raise to the safety wrapper. Perhaps Django isn't the best example as it uses a middleware object and doesn't directly call wrapt.wrap_function_wrapper
, but the SDK does wrap user-defined functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sorry but I don't understand what's wrong here. It is intentional to guard against any kind of exception raised in the scope of the safety mechanism, regardless of who raised it or whatever intention was behind raising it. This is a specification requirement, because if we let an exception to be raised it will crash the application which must not happen. So, if your concern is that this mechanism may end up "swallowing" an user made exception, then yes, this mechanism will do that and it is intended to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. I just wanted to bring this up because it alters the behavior, so it will a known side effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is handled by the Django Framework so the developer can decide how the 404 response looks. If we do choose to address this issue because it might alter the behavior of the program, it can probably be done by parsing the trace stack, but I'm not sure it's something that we want to do because this logic might get complicated.
What do you think about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can revisit this when we go over all instrumentation to ensure we don't raise errors there. We shouldn't do anything special to handle any known exceptions differently.
We just need to ensure that our context managers (start_as_current_span
, use_span
) don't swallow exceptions raised from within the context. May be this means they shouldn't be decorated and error protection should be coded manually into their bodies or may be they can catch all exceptions but annotate the ones coming from user code so the decorator can re-raise them but that sounds unnecessarily complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but annotate the ones coming from user code so the decorator can re-raise them...
Hm, but does this mean that the re-raised exceptions will be able to crash the application?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, yes. If the application would crash without instrumentation then it should crash with instrumentation as well, right? We should ensure that:
- SDK/instrumentation code never raises exceptions.
- For instrumented code paths, we record exceptions as events if the user wants to and then re-raise.
This is how use_span
works today: https://github.com/open-telemetry/opentelemetry-python/blob/main/opentelemetry-api/src/opentelemetry/trace/__init__.py#L530-L559
As an example, if I have a function that looks like:
def my_func():
raise Exception()
and I instrument it as:
def my_func():
with tracer.start_as_current_span("span1"):
raise Exception()
then the only change that should happen is that now my program should export a span called span1
with an event which has information about the exception I'm raising. It should not swallow the exception. However, if an exception is raised inside start_as_current_span()
function or any other function it calls directly, that exception should be swallowed and never reach my code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see.
Ok, I think the current design works as you want it to work, because it will only catch exceptions raised in the execution of start_as_current_span
(not in its with
-managed context) in the example above.
Yes, we would have to and I see the issue here, it is a lot of code.
I wouldn't, I'm not sure we have an alternative yet.
This could work, I don't think having to put a decorator on every method/function is a good solution but it can work, will look into this further. Now, this is pretty much the approach I tried implementing in #2152. Keep in mind that implementing that would require applying the decorator to every single part of the API, including the "util" functions and classes that we may have there (I now think having "util" stuff in the API is a mistake, I understand at first glance it looks like the most logical place where to put it since the API is the universal dependency for all SDKs and because of that it is convenient to have "util" code there. Nevertheless, that kind of functions or classes are not part of the spec-defined API and should not be there, they now cause this problem we have now, the API has one design responsibility and it is to serve as a set of interfaces that the SDKs have to implement, not holding utilitarian code). |
5f5bea3
to
113834b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safety implementation looks great and we should totally use it. I'm a bit concerned about "SDK registration" and API trying to "protect" the SDK.
IMO, we should just use the safety mechanism in SDK directly in addition to the API. This means the SDK will provide error protection out of the box even when used directly and it can have additional methods/features not covered by the API which will be protected as well. It will also simplify all the error protection implementation on the API side. One downside is that any 3rd party SDK implementation will have to implement its own error protection but I think that is an acceptable trade off.
# This is the warning mentioned in the README file. | ||
warn(f"OpenTelemetry handled an exception:\n\n{exception}") | ||
exception = None | ||
resetwarnings() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand the behavior or warn
, only one warning of a certain kind is "raised" unless this method is called. I think the intention is not to swamp the output with similar warnings, but for development purposes I prefer to have them all being displayed, we can reconsider that later.
from sys import exc_info | ||
|
||
|
||
def _safe_function(predefined_return_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevertheless, there is a paradox, the code that makes this new value can also fail, and would also need a predefined value to return if that happened.
That code would also be wrapped or we could explicitly/manually safe-guard it so should be doable I guess?
|
||
The user only calls API functions or methods, never SDK functions or methods. | ||
|
||
Every SDK must implement every public function or method defined in the API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if they don't? Will we crash on initialization or will missing/bad methods just always return the "default" value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I am ok with crashing on initialization.
I think it is totally acceptable if we have to decorate all public functions with safeguards. I assume private ones don't need it as they get called internally by the public ones but even if we need to decorate them as well, I think that is acceptable as well. A few additional key strokes when implementing new functions shouldn't be a big deal. With some tooling (may be a pylint plugin), compliance on future code should be pretty easy as well. It seems we are focusing on trying to find a way to automatically/dynamically apply the decorator on everything instead of explicit coverage. I'm not sure if it is worth it TBH. |
Co-authored-by: Owais Lone <[email protected]>
+1 we don't need a uniform magic approach. Explicit is better. There may even be some cases where we just stick a try/except directly in the code. I also think the decorator looks good and warnings is a good mechanism for this (as long as we can rate-limit the logs it creates so they don't spam output). |
This reverts commit b769845.
I won't be working on this PR no more for the foreseeable future, closing. |
This is a safety prototype.
I am opening this draft PR not with the intention of merging this into
main
but to make it possible for other contributors to comment on this prototype.