Add error handling prototype #2244

ocelotl · 2021-10-28T09:42:25Z

This is a safety prototype.

I am opening this draft PR not with the intention of merging this into main but to make it possible for other contributors to comment on this prototype.

aabmass

Thanks for the succinct prototype, I think I get the idea 👍

aabmass · 2021-10-28T15:32:42Z

safety_prototype/opentelemetry-api/src/opentelemetry/_safety.py

+from sys import exc_info
+
+
+def _safe_function(predefined_return_value):


Might there be cases where a predefined value won't work and we need a factory function instead? E.g. best effort propagation with NonRecordingSpan(span_context)

Yes, that is a very valid point. In a previous attempt to implement this I tried to make it possible for the safety mechanism to return a value that would be the result of a certain SDK function that would receive the arguments passed to the API function so that different values could be returned depending on the actual call the end user made.

Nevertheless, there is a paradox, the code that makes this new value can also fail, and would also need a predefined value to return if that happened.

Nevertheless, there is a paradox, the code that makes this new value can also fail, and would also need a predefined value to return if that happened.

That code would also be wrapped or we could explicitly/manually safe-guard it so should be doable I guess?
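For illustration, the "wrap the factory too" idea could look roughly like the sketch below. It is not code from this PR: `_safe_function_with_factory`, `last_resort`, and the stand-in `NonRecordingSpan` class are hypothetical names. The factory receives the same arguments as the guarded call, and a constant is only used if the factory itself fails.

```python
import functools
import warnings


def _safe_function_with_factory(fallback_factory, last_resort=None):
    """Hypothetical variant of _safe_function: build the fallback from the call arguments."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except Exception as error:  # pylint: disable=broad-except
                warnings.warn(f"OpenTelemetry handled an exception: {error!r}")
            try:
                # The factory sees the same arguments as the guarded call.
                return fallback_factory(*args, **kwargs)
            except Exception as error:  # pylint: disable=broad-except
                # The factory is guarded as well; a constant is the last resort.
                warnings.warn(f"Fallback factory failed: {error!r}")
                return last_resort

        return wrapper

    return decorator


class NonRecordingSpan:
    """Stand-in for the real class; it only stores the span context."""

    def __init__(self, span_context):
        self.span_context = span_context


@_safe_function_with_factory(lambda span_context: NonRecordingSpan(span_context))
def start_span(span_context):
    raise RuntimeError("SDK failure")  # simulate a broken SDK
```

With this shape, a failing `start_span(ctx)` still propagates `ctx` through a `NonRecordingSpan`, and the paradox is bounded: at most one extra level of fallback is needed.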

aabmass · 2021-10-28T15:39:09Z

safety_prototype/opentelemetry-api/src/opentelemetry/trace/__init__.py

+
+@_safe_function(0.0)
+def function(a: int, b: int) -> float:
+    return _get_sdk_module("trace").function(a, b)


I am wary of this dynamic behavior–it looks very difficult to test or have static checks protecting the behavior. That combined with the warning mechanism, I am worried it would be easy to have a silent breakage.

As you said in the README, for this to work every SDK must keep the exact same import paths and fully-qualified symbols. This is a big design requirement and easy to accidentally mess up for SDK implementors.

I am wary of this dynamic behavior–it looks very difficult to test or have static checks protecting the behavior.

Sorry what can be difficult to test?
What do you mean with static checks?

That combined with the warning mechanism, I am worried it would be easy to have a silent breakage.

I think that is a valid objection but to the spec itself. It is possible to have a silent breakage by not immediately raising an exception that crashes the application, but that is what the spec requires if I understand it correctly.

As you said in the README, for this to work every SDK must keep the exact same import paths and fully-qualified symbols. This is a big design requirement and easy to accidentally mess up for SDK implementors.

Well, what we need is a mechanism that can tell the API where the corresponding SDK object is located. We are using the same qualified path of the API object to find its matching SDK object, but that can be changed with a mechanism that uses an arbitrary SDK-defined mapping to make it possible for the API to find the SDK objects.

Sorry what can be difficult to test?

That the SDK is correctly reproducing the exact names and import paths as the API.

What do you mean with static checks?

pylint/mypy/IDEs can check things like that your import paths are correct or that you implement an ABC correctly. In this case, it's impossible to tell these tools "this package must mirror all of the interfaces of OpenTelemetry API".

Well, what we need is a mechanism that can tell the API where the corresponding SDK object is located. We are using the same qualified path of the API object to find its matching SDK object, but that can be changed with a mechanism that uses an arbitrary SDK-defined mapping to make it possible for the API to find the SDK objects.

I think a better way to do this is with a class or protocol. You could also expose the SDKs functionality as an interface which SDKs would implement and the API knows how to use:

    # could also be typing.Protocol
    class SDK(ABC):
        @abstractmethod
        def class0_factory() -> Type[Class0]:
            pass

        def function(a, b):
            pass

    def Class0() -> Class0:
        _sdk.class0_factory()

    def function(a, b):
        _sdk.function()

pylint/mypy/IDEs can check things like that your import paths are correct or that you implement an ABC correctly. In this case, it's impossible to tell these tools "this package must mirror all of the interfaces of OpenTelemetry API".

Ok, but I never expected or intended for the static checkers to be the ones responsible for checking that the SDKs are compliant with the API.

This is a very OpenTelemetry-specific requirement. I don't think static checkers or the typing module were designed with this in mind, so this approach looks like forcing our design to work with a tool that was never intended to check this kind of thing.

The approach here is for the API to have functions and abstract classes that the SDK has to implement. We can provide a list of the functions and classes so that the SDK implementations know what they have to implement and they can check that their SDKs implement them fully.

To summarize, I do agree with you @aabmass that it is very important for SDKs to be able to have a test case that pretty much says "the API is being fully implemented" or not. I just think that this kind of testing is very OpenTelemetry-specific and it would be necessary to force the design of this mechanism and the design of static type checkers to make this checking happen in a static type checker. It is better to provide clear documentation for SDK implementations and maybe additional functionality in the API (like a list of all the stuff that any SDK has to implement) so that every SDK can easily implement their own test case for API compliance.
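As a rough illustration of the "list of all the stuff that any SDK has to implement" idea, an SDK test case could be as small as the sketch below. `REQUIRED_SDK_SYMBOLS` and `missing_sdk_symbols` are hypothetical names, not part of this prototype; the `function` and `Class0` entries just reuse the placeholder names from this thread.

```python
import types

# Hypothetical list the API could publish; the names here are illustrative only.
REQUIRED_SDK_SYMBOLS = {
    "trace": ("function", "Class0"),
}


def missing_sdk_symbols(sdk_modules):
    """Return the sorted "module.symbol" names that an SDK fails to provide."""
    missing = []
    for module_name, symbols in REQUIRED_SDK_SYMBOLS.items():
        module = sdk_modules.get(module_name)
        for symbol in symbols:
            if module is None or not hasattr(module, symbol):
                missing.append(f"{module_name}.{symbol}")
    return sorted(missing)


# A fake, incomplete SDK: it implements "function" but not "Class0".
fake_trace = types.SimpleNamespace(function=lambda a, b: a / b)
print(missing_sdk_symbols({"trace": fake_trace}))
```

An SDK's own test suite could then assert that the returned list is empty. This only checks that the names exist; checking signatures would need something like `inspect.signature` on top.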

aabmass · 2021-10-28T15:52:08Z

safety_prototype/README.rst

+    The API or SDK may fail fast and cause the application to fail on
+    initialization...
+
+*Initialization* is understood as the process of setting the SDK.


I think arguably some would consider creating Tracers/Instruments/Exporters/Views part of the initialization process. Could we make the fail-fast behavior configurable for users who want to make sure long-lived objects like these are created successfully?

Sorry, what do you mean with fail fast behavior?

Failing on initialization

Well, but how would this work? Imagine this situation:

    # application.py
    set_sdk("sdk")  # This can fail fast
    ...
    # Here the user does a lot of stuff including using the OpenTelemetry API
    meter = create_meter(...)  # This can also fail fast?

That would mean that after a lot of code has been executed something suddenly fails because creating a meter is considered initialization. This behaves like something that breaks the application code. Now, someone may argue that meters, tracers, etc. are always created before doing anything else, but there is no guarantee that would be the case. We could make it configurable for the creation of these objects to fail fast, but I see almost no value in this: the moment one of these objects is created and something goes wrong, a warning will be raised and the user will know, and if they want the application to crash they can run it in error mode. I don't see why we need to handle the creation of these objects in a special way.

aabmass · 2021-10-28T15:53:07Z

safety_prototype/README.rst

+
+The Python warning that is "raised" when an exception is raised in the SDK
+function or method can be transformed into a full exception by running the
+Python interpreter with the `-W error` option. This Python feature is used to


I'm not too familiar with the warnings module. Is there a way to automatically turn this on when running tests?

Yes, it is an option that can be passed to Pytest, if I remember correctly, I have done that before.
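For reference, the same escalation can be done from inside Python with the standard `warnings` module, which is handy for a single test. A minimal sketch follows; `guarded_call` and `call_in_error_mode` are made-up names standing in for an API call protected by the safety mechanism.

```python
import warnings


def guarded_call():
    """Stand-in for an API call whose SDK failure was converted into a warning."""
    warnings.warn("OpenTelemetry handled an exception")
    return 0.0


def call_in_error_mode():
    """Run guarded_call with warnings escalated to exceptions."""
    with warnings.catch_warnings():
        # Similar in effect to `python -W error`, but scoped to this block.
        warnings.simplefilter("error")
        try:
            guarded_call()
        except UserWarning as error:
            return f"raised: {error}"
    return "no exception"
```

For a whole test run, pytest accepts `-W error` on the command line, and the `filterwarnings = error` ini option has the same effect.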

aabmass · 2021-10-28T16:02:21Z

safety_prototype/README.rst

+After an SDK is set, calling an API function or method will call its
+corresponding SDK function or method. Any exception raised by the SDK function
+or method will be caught by the safety mechanism and the predefined value
+returned instead.


I don't think the API needs to provide the safety mechanism for the SDK. This would only work if the user does all instrumentation through the API which is not currently the case; instead we have the SDK as an implementation of the API. For example this should work fine:

    ## server.py
    from opentelemetry.trace import TracerProvider

    class Server:
        def __init__(self, tracer_provider: TracerProvider):
            self._tracer = tracer_provider.get_tracer(type(self))

        def index(self, request: Request) -> Response:
            with self._tracer.start_span("do-something"):
                # ...

    ## main.py
    from opentelemetry.sdk.trace import TracerProvider
    import server

    tracer_provider = TracerProvider(...)
    server.Server(tracer_provider).serve()

As I understand it, the error handling mechanism would not work here.

Yes, it won't work here, but one of the central requirements of this prototype is that the user only uses API objects. I am aware of the fact that the user can currently use the SDK directly, I understand the argument, I just don't think it applies to this prototype.

ocelotl · 2021-10-28T17:06:38Z

There is an additional advantage in this approach:

We can definitively decide what is a breaking change and what is not. A breaking change would be:

A removal of a public symbol or the removal of a parameter or the addition of a mandatory parameter in any of the symbols in any of the modules that contains API proxy objects (this file, for example).
An equivalent change in an SDK-setting function.

This means that the semantic versioning schema we have been following so far would not apply to the SDK. Only the API has API proxy objects or SDK setting functions. This makes sense, it matches the first requirement of semantic versioning.

aabmass · 2021-10-28T21:53:06Z

@ocelotl In the SIG you said this would require breaking changes we can't make. Do you still need reviews on this PR?

ocelotl · 2021-10-29T10:06:18Z

@ocelotl In the SIG you said this would require breaking changes we can't make. Do you still need reviews on this PR?

Yes, please, keep reviewing. We need to be compliant with the spec somehow, so we need to move forward in this direction. Maybe we need to break, but that is not as bad, we can keep supporting version 1.X and 2.X at the same time, users just have to stick with the one they are currently using.

aabmass · 2021-10-29T15:24:29Z

I think we should try to keep v1.x as long as possible–OTel has a bad reputation for churn already. If we had two incompatible API versions, wouldn't we need to maintain one version of each instrumentation for each API as well? Can we take that option off the table?

What do you think of just using the decorator you proposed here in both the API and SDK to wrap key methods which shouldn't throw?

oxeye-nikolay · 2021-10-29T15:41:56Z

safety_prototype/opentelemetry-api/src/opentelemetry/_safety.py

+    """
+
+
+    def internal(function):


A thought I had: If we decide to lose the predefined return value (and instead return None), this could be implemented as a class that each class (or base class for that matter) in the SDK inherits from. Something like:

    class safetyClass(object):
        def __getattribute__(self, name):
            returned_attribute = object.__getattribute__(self, name)
            if callable(returned_attribute):
                return _safety(returned_attribute)
            return returned_attribute

This way, it's way less of a breaking change that needs to happen all at once

Sure, but we can't always return None. 🤷 The safety mechanism must return the same kind of object as the corresponding function or method is expected to return.

I guess you're right, but we should change the calling functions to understand that an exception occurred and the returned value is a predefined one... For example, if in your snippet someone calls the safe function(1,0) the return value would be 0.0, which isn't the true result of 1/0; it only signals that there was an exception, and the calling function needs to check the return value.
And if we are choosing to do so, we might as well create some object that represents that an exception occurred, and use the option I offered with that value.

The calling functions do not need to check the return value, they will receive a fake, predefined value when an exception is raised. This is intentional and the way that the user can know that the value is not real is for them to run the application with the -W error option so that warnings are turned into exceptions.

Ok, makes sense
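To make the function(1, 0) case concrete, here is a minimal sketch of the decorator assembled from the fragments quoted in this review (`_safe_function`, `internal`, `format_exception(*exc_info())`, the `warn` message). The `wrapper` name, the use of `functools.wraps`, and the `a / b` body are assumptions made for the example; the prototype's `resetwarnings()` call is left out.

```python
import functools
from sys import exc_info
from traceback import format_exception
from warnings import warn


def _safe_function(predefined_return_value):
    """Return predefined_return_value whenever the wrapped function raises."""

    def internal(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            try:
                return function(*args, **kwargs)
            except Exception:  # pylint: disable=broad-except
                exception = "".join(format_exception(*exc_info()))
            # Reached only when an exception was caught above.
            warn(f"OpenTelemetry handled an exception:\n\n{exception}")
            return predefined_return_value

        return wrapper

    return internal


@_safe_function(0.0)
def function(a: int, b: int) -> float:
    return a / b  # stands in for the SDK call made by the prototype
```

Because `warn` is called outside the `try` block, running with `-W error` turns the warning into an exception that reaches the caller, which is how a user can tell the 0.0 is not a real result.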

oxeye-nikolay · 2021-10-29T15:44:57Z

safety_prototype/opentelemetry-api/src/opentelemetry/_safety.py

+            except Exception:  # pylint: disable=broad-except
+                exception = "".join(format_exception(*exc_info()))
+
+            if exception is not None:


In general, I think that an expected behaviour of such a feature would be that user made exceptions would be raised, no? OpenTelemetry should be seamless for the user's application, and if it is exception-driven, this will alter the behaviour (unless -W error is used and then OpenTelemetry originated errors will alter the behaviour)

Hm, I think there may be a misunderstanding here. The mechanism proposed here is intended to guard against exceptions raised in the SDK. Any exception raised in the application code directly that is caused by the calling of a function or method outside of the OpenTelemetry API is not protected and will be raised normally.

So, you mention "user made exceptions". If you mean by that exceptions raised by code not called by functions or methods in the OpenTelemetry API, then this mechanism will not cause any issues, these "user made exceptions" will be raised normally.

If with "user made exceptions" you mean exceptions in any part of the SDK that were "intentionally coded", for example something like this:

    @_safe_function(X)
    def some_function(a, b):
        ...
        if some_condition:
            raise Exception("Some exception")
        ...

then the mechanism will also catch these exceptions. This is intentional. The spec says that any exception of this kind must be handled by the safety mechanism, regardless of how "intentional" it is or not.

Well, maybe I am missing something, but that's not what I meant. What I meant is something where the OpenTelemetry wrapper is "safe", and the original function which it wraps raises an exception (for example, raise Http404() in Django). In this case, the original function will raise the exception, which will propagate to the OpenTelemetry Django wrapper, which will propagate to the safety wrapper. Perhaps Django isn't the best example as it uses a middleware object and doesn't directly call wrapt.wrap_function_wrapper, but the SDK does wrap user-defined functions.

Ok, sorry but I don't understand what's wrong here. It is intentional to guard against any kind of exception raised in the scope of the safety mechanism, regardless of who raised it or whatever intention was behind raising it. This is a specification requirement, because if we let an exception be raised it will crash the application, which must not happen. So, if your concern is that this mechanism may end up "swallowing" a user made exception, then yes, this mechanism will do that and it is intended to do so.

Alright. I just wanted to bring this up because it alters the behavior, so it will be a known side effect.

It is handled by the Django Framework so the developer can decide how the 404 response looks. If we do choose to address this issue because it might alter the behavior of the program, it can probably be done by parsing the trace stack, but I'm not sure it's something that we want to do because this logic might get complicated.
What do you think about it?

I think we can revisit this when we go over all instrumentation to ensure we don't raise errors there. We shouldn't do anything special to handle any known exceptions differently.

We just need to ensure that our context managers (start_as_current_span, use_span) don't swallow exceptions raised from within the context. Maybe this means they shouldn't be decorated and error protection should be coded manually into their bodies, or maybe they can catch all exceptions but annotate the ones coming from user code so the decorator can re-raise them, but that sounds unnecessarily complicated.

but annotate the ones coming from user code so the decorator can re-raise them...

Hm, but does this mean that the re-raised exceptions will be able to crash the application?

I think so, yes. If the application would crash without instrumentation then it should crash with instrumentation as well, right? We should ensure that:

SDK/instrumentation code never raises exceptions.

For instrumented code paths, we record exceptions as events if the user wants to and then re-raise.

This is how use_span works today: https://github.com/open-telemetry/opentelemetry-python/blob/main/opentelemetry-api/src/opentelemetry/trace/__init__.py#L530-L559

As an example, if I have a function that looks like:

    def my_func():
        raise Exception()

and I instrument it as:

    def my_func():
        with tracer.start_as_current_span("span1"):
            raise Exception()

then the only change that should happen is that now my program should export a span called span1 with an event which has information about the exception I'm raising. It should not swallow the exception. However, if an exception is raised inside start_as_current_span() function or any other function it calls directly, that exception should be swallowed and never reach my code.

Ah, I see.

Ok, I think the current design works as you want it to work, because it will only catch exceptions raised in the execution of start_as_current_span (not in its with-managed context) in the example above.

ocelotl · 2021-11-02T11:03:42Z

If we had two incompatible API versions, wouldn't we need to maintain one version of each instrumentation for each API as well?

Yes, we would have to and I see the issue here, it is a lot of code.

Can we take that option off the table?

I wouldn't, I'm not sure we have an alternative yet.

What do you think of just using the decorator you proposed here in both the API and SDK to wrap key methods which shouldn't throw?

This could work. I don't think having to put a decorator on every method/function is a good solution, but it can work; I will look into this further. This is pretty much the approach I tried implementing in #2152. Keep in mind that it would require applying the decorator to every single part of the API, including the "util" functions and classes that we may have there. (I now think having "util" stuff in the API is a mistake. I understand that at first glance it looks like the most logical place to put it, since the API is the universal dependency for all SDKs. Nevertheless, those functions and classes are not part of the spec-defined API and should not be there, and they now cause the problem we are facing. The API has one design responsibility: to serve as a set of interfaces that the SDKs have to implement, not to hold utility code.)

owais

The safety implementation looks great and we should totally use it. I'm a bit concerned about "SDK registration" and API trying to "protect" the SDK.

IMO, we should just use the safety mechanism in SDK directly in addition to the API. This means the SDK will provide error protection out of the box even when used directly and it can have additional methods/features not covered by the API which will be protected as well. It will also simplify all the error protection implementation on the API side. One downside is that any 3rd party SDK implementation will have to implement its own error protection but I think that is an acceptable trade off.

owais · 2021-11-15T11:54:34Z

safety_prototype/opentelemetry-api/src/opentelemetry/_safety.py

+                # This is the warning mentioned in the README file.
+                warn(f"OpenTelemetry handled an exception:\n\n{exception}")
+                exception = None
+                resetwarnings()


out of curiosity, why do we need this?

As far as I understand the behavior of warn, only one warning of a certain kind is "raised" unless this method is called. I think the intention is not to swamp the output with similar warnings, but for development purposes I prefer to have them all being displayed; we can reconsider that later.
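The deduplication can be observed with the standard `warnings` module alone. In the sketch below (`emit` and `count_recorded` are made-up helper names), three identical warnings are emitted from the same line: under the `default` action only the first is recorded, while the `always` action records every one. Setting an `always` filter might be an alternative to calling `resetwarnings()`, which also discards any filters the user configured.

```python
import warnings


def emit():
    # Every call warns from this same line, so the messages are deduplicated by location.
    warnings.warn("OpenTelemetry handled an exception")


def count_recorded(action, calls=3):
    """Count how many of `calls` identical warnings are recorded under a filter action."""
    with warnings.catch_warnings(record=True) as recorded:
        warnings.simplefilter(action)
        for _ in range(calls):
            emit()
    return len(recorded)
```

Comparing `count_recorded("default")` with `count_recorded("always")` shows the difference.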

owais · 2021-11-15T11:57:46Z

safety_prototype/opentelemetry-api/src/opentelemetry/_safety.py

+from sys import exc_info
+
+
+def _safe_function(predefined_return_value):


Nevertheless, there is a paradox, the code that makes this new value can also fail, and would also need a predefined value to return if that happened.

That code would also be wrapped or we could explicitly/manually safe-guard it so should be doable I guess?

owais · 2021-11-15T12:06:45Z

safety_prototype/README.rst

+
+The user only calls API functions or methods, never SDK functions or methods.
+
+Every SDK must implement every public function or method defined in the API


What happens if they don't? Will we crash on initialization or will missing/bad methods just always return the "default" value?

Good question, I am ok with crashing on initialization.

owais · 2021-11-15T12:16:43Z

This could work. I don't think having to put a decorator on every method/function is a good solution, but it can work; I will look into this further. This is pretty much the approach I tried implementing in #2152. Keep in mind that it would require applying the decorator to every single part of the API, including the "util" functions and classes that we may have there. (I now think having "util" stuff in the API is a mistake. I understand that at first glance it looks like the most logical place to put it, since the API is the universal dependency for all SDKs. Nevertheless, those functions and classes are not part of the spec-defined API and should not be there, and they now cause the problem we are facing. The API has one design responsibility: to serve as a set of interfaces that the SDKs have to implement, not to hold utility code.)

I think it is totally acceptable if we have to decorate all public functions with safeguards. I assume private ones don't need it as they get called internally by the public ones, but even if we need to decorate them as well, I think that is acceptable too. A few additional keystrokes when implementing new functions shouldn't be a big deal. With some tooling (maybe a pylint plugin), compliance on future code should be pretty easy as well.

It seems we are focusing on trying to find a way to automatically/dynamically apply the decorator on everything instead of explicit coverage. I'm not sure if it is worth it TBH.

aabmass · 2021-11-16T19:55:08Z

It seems we are focusing on trying to find a way to automatically/dynamically apply the decorator on everything instead of explicit coverage. I'm not sure if it is worth it TBH.

+1 we don't need a uniform magic approach. Explicit is better. There may even be some cases where we just stick a try/except directly in the code.

I also think the decorator looks good and warnings is a good mechanism for this (as long as we can rate-limit the logs it creates so they don't spam output).
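One possible shape for the rate limiting, sketched under the assumption that limiting per message text is good enough. `RateLimitedWarner` is a hypothetical helper, not something in this PR, and it is not thread-safe as written.

```python
import time
import warnings


class RateLimitedWarner:
    """Hypothetical helper: emit at most one warning per message per interval."""

    def __init__(self, interval_seconds=60.0, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_emitted = {}

    def warn(self, message):
        """Emit the warning and return True, or return False if it was suppressed."""
        now = self._clock()
        last = self._last_emitted.get(message)
        if last is not None and now - last < self._interval:
            return False
        self._last_emitted[message] = now
        # stacklevel=2 attributes the warning to the caller of this method.
        warnings.warn(message, stacklevel=2)
        return True
```

The decorator could call `warner.warn(...)` instead of `warn(...)` directly. Note that the interpreter's own warning filters still apply on top of this, so with the `default` action repeated messages from one location would be shown only once anyway.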

ocelotl · 2024-09-03T19:23:16Z

I won't be working on this PR anymore for the foreseeable future, closing.

ocelotl force-pushed the safety_prototype branch from 94364c7 to 6841423 Compare October 28, 2021 09:50

aabmass reviewed Oct 28, 2021

ocelotl requested a review from aabmass October 28, 2021 18:25

oxeye-nikolay reviewed Oct 29, 2021

ocelotl requested a review from oxeye-nikolay November 2, 2021 12:09

ocelotl added 9 commits November 11, 2021 15:33

Add safety prototype

53163c7

WIP

b9209cc

WIP

b610019

Fix application

0bb06d3

Put logic in base class

69f8ba1

Fix __getattribute__

15dafab

WIP

e54608d

Handle returned safe objects

c16b79c

Make it work with contextmanager

113834b

ocelotl force-pushed the safety_prototype branch from 5f5bea3 to 113834b Compare November 11, 2021 21:33

Add API class

658b612

This was referenced Nov 11, 2021

Add safeties for classes and functions #2152

Closed

Adds Aggregation and instruments as part of Metrics SDK #2234

Merged

owais reviewed Nov 15, 2021

Update safety_prototype/README.rst

70a78eb

Co-authored-by: Owais Lone <[email protected]>

This was referenced Dec 10, 2021

Refactor metrics instrument #2297

Merged

An invalid name should return a working no-op Meter implementation #1200

Closed

ocelotl changed the title ~~Add safety prototype~~ Add error handling prototype Dec 15, 2021

ocelotl mentioned this pull request Dec 15, 2021

Log or raise when creating invalid instruments in API #2143

Closed

lzchen mentioned this pull request Jan 20, 2022

Validate add/record operations for instruments #2394

Merged

ocelotl self-assigned this Jan 24, 2022

ocelotl closed this Mar 29, 2023

ocelotl reopened this Mar 29, 2023

ocelotl added 2 commits September 2, 2023 09:35

df

b769845

Revert "df"

45dc711

This reverts commit b769845.

lzchen mentioned this pull request Aug 26, 2024

Question: how to report internal instrumentation errors for troubleshooting purposes open-telemetry/opentelemetry-python-contrib#2813

Open

ocelotl closed this Sep 3, 2024
