Telemetry work #22
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,173 @@ | ||
# Adding observability in the JDBC wrapper via metrics | ||
|
||
# Introduction | ||
|
||
Amazon Aurora is a relational database service hosted in the AWS cloud. Aurora currently offers MySQL and PostgreSQL database instances. To access an Aurora database from an application, users need to set up a driver that can connect to and interact with a remote Aurora database instance. Existing drivers for MySQL and PostgreSQL work fine with Aurora instances, but they do not take advantage of any of the additional features provided by Aurora. | ||
|
||
Recently, AWS created the AWS JDBC Wrapper. The AWS JDBC Wrapper is not a driver in itself, but rather a driver enhancer: set up alongside a database driver, it adds Aurora-specific features to that driver. Examples of those features are driver failover (link to doc) and integration with IAM and AWS Secrets Manager (link). The AWS JDBC Wrapper is organized into plugins, where each wrapper feature is isolated and independent from the other features. This allows users to select only the wrapper features/plugins that their application workflow requires. | ||
|
||
# Problem | ||
|
||
Through the driver, applications interact with a remote database by sending requests. Those requests trigger a chain of execution where every one of the plugins enabled by the user is activated and executed, until the request reaches the database and returns. | ||
|
||
In its current form, the AWS JDBC Wrapper is much like a black box: during the chain of execution, there is no way to measure the individual performance of each plugin throughout the execution of a query. This means that when performance degrades while using the wrapper, troubleshooting is manual: users have to dig through application logs and inspect the behavior of each enabled plugin by hand. | ||
Since this PR aims to fix this problem and this doc will be merged in with the solution, should we replace |
||
|
||
This project aims to add internal observability of the Wrapper performance at plugin level. | ||
|
||
# Goals | ||
|
||
The goals of the project are the following: | ||
|
||
- Instrument the wrapper code in order to obtain metrics and traces at plugin level | ||
- Define observability interfaces that make the code agnostic to a specific observability library or tool | ||
- Add plugin-specific metrics to existing plugins | ||
- Implement connectors/exporters to visualize the generated observability data | ||
|
||
# Proposed solution | ||
|
||
## Definitions | ||
|
||
The following few paragraphs (lines 30-36) don't seem to be related Definitions. Could we move these up a level, right under Proposed solutions? |
||
As mentioned in the Goals section, one important aim while adding observability to the driver is to avoid tying our codebase to any vendor-specific or existing metrics library. | ||
|
||
This is not only to avoid adding a new external dependency to the wrapper, but mostly because applications that use the wrapper to interact with a database very often already have their own observability and/or monitoring implementations. | ||
|
||
When that is the case, we would ideally like wrapper-specific metrics to simply be added to the user's existing application monitoring. To achieve that, we needed to find a standard observability approach that fits most users' workflows. For this project, we have decided to follow OpenTelemetry (link) notations and definitions for our monitoring entities. | ||
|
||
OpenTelemetry aims to define vendor- and language-agnostic specifications for monitoring applications. | ||
|
||
The entities we introduce in this project are the following: Metrics and Traces. | ||
since we mention metrics before traces on this line, lets introduce them in the same order below (eg metrics section first, then traces section) |
||
|
||
### Traces | ||
|
||
We define a trace as a delimited portion of an application's execution, identified by its start and its end. A trace can cover the entire application execution or a single atomic operation. Traces can be related to each other either by hierarchy or by symbolic links. | ||
should this read |
||
|
||
Collecting libraries observe the boundaries of a trace and measure its execution time. For a better understanding of the application's behavior, different attributes can be attached to a trace, such as a success status and/or status code. | ||
|
||
In our project, traces are represented as `TelemetryContext` objects. A `TelemetryContext` object will be created/opened at every plugin invocation or execution. | ||
Suggestion: "A |
||
|
||
`TelemetryContext` objects are similar to what OpenTelemetry defines as *spans*, and also to the concept of a *trace* in AWS X-Ray, the trace visualization mechanism featured in Amazon CloudWatch. | ||
|
||
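As a rough illustration of the trace concept described in this section, the sketch below shows one way a `TelemetryContext`-like object could record its boundaries, its parent, and its attributes. Everything in it is an assumption made for illustration: the class name `SketchTelemetryContext` and the methods `setAttribute`, `setSuccess`, `getParent` and `elapsedNanos` are hypothetical and are not taken from the wrapper's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a trace: opened on construction, closed via close(). */
final class SketchTelemetryContext implements AutoCloseable {
  private final String name;
  private final SketchTelemetryContext parent; // null for a top-level context
  private final Map<String, String> attributes = new LinkedHashMap<>();
  private final long startNanos = System.nanoTime();
  private long endNanos;
  private boolean closed;
  private boolean success = true;

  SketchTelemetryContext(String name, SketchTelemetryContext parent) {
    this.name = name;
    this.parent = parent;
  }

  String getName() { return name; }

  SketchTelemetryContext getParent() { return parent; }

  Map<String, String> getAttributes() { return attributes; }

  void setAttribute(String key, String value) { attributes.put(key, value); }

  void setSuccess(boolean success) { this.success = success; }

  boolean isSuccess() { return success; }

  boolean isOpen() { return !closed; }

  /** Time between open and close, or time since open if the context is still open. */
  long elapsedNanos() {
    return (closed ? endNanos : System.nanoTime()) - startNanos;
  }

  /** Marks the end boundary of the trace; closing twice has no further effect. */
  @Override
  public void close() {
    if (!closed) {
      endNanos = System.nanoTime();
      closed = true;
    }
  }
}
```

Because the sketch implements `AutoCloseable`, a context can be opened in a try-with-resources statement so that it is closed even when the traced code throws.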
### Metrics | ||
|
||
Metrics will also be added to our observability suite for the numeric data that will be collected throughout the application execution. | ||
|
||
For metrics, we follow the same standard defined by OpenTelemetry, which consists of: | ||
- Counters | ||
Can we have the plurality of the metrics be consistent? |
||
- Gauges | ||
- Histograms | ||
|
||
Gauges will be used where the numeric data varies throughout the execution of the application, but not incrementally. An example would be a hit/miss ratio for a cache. Further information on these three entities, and on the kind of data each one suits best, can be found in the OpenTelemetry documentation (link). | ||
Is the link for OpenTelemetry documentation missing? |
||
|
||
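To make the difference between the three metric kinds more concrete, here is a minimal in-memory sketch. It is not the wrapper's implementation and does not use the OpenTelemetry library; the class names `SketchCounter`, `SketchGauge` and `SketchHistogram` and all of their methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/** Value that only ever increases, e.g. the number of plugin executions. */
final class SketchCounter {
  private final LongAdder total = new LongAdder();

  void inc() { total.increment(); }

  void add(long delta) {
    if (delta < 0) {
      throw new IllegalArgumentException("a counter can only increase");
    }
    total.add(delta);
  }

  long value() { return total.sum(); }
}

/** Point-in-time value read on demand; it may go up or down between reads. */
final class SketchGauge {
  private final LongSupplier source;

  SketchGauge(LongSupplier source) { this.source = source; }

  long read() { return source.getAsLong(); }
}

/** Keeps individual measurements so that their distribution can be inspected. */
final class SketchHistogram {
  private final List<Long> samples = new ArrayList<>();

  synchronized void record(long value) { samples.add(value); }

  synchronized int count() { return samples.size(); }

  /** Largest recorded value, or 0 if nothing has been recorded yet. */
  synchronized long max() {
    return samples.stream().mapToLong(Long::longValue).max().orElse(0L);
  }
}
```

The gauge takes a callback rather than storing a value, which reflects the idea that a gauge reports the current state of something else (a cache size, a ratio) at the moment it is read.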
### Warning | ||
|
||
It is important to note that although the entities and objects we define here are similar to, or can be mapped to, OpenTelemetry concepts, we are still creating our observability layer independently of any available library or suite. Using the OpenTelemetry libraries for Java requires writing the mapping (in this case, an interface implementation) from our definitions to the OpenTelemetry objects. | ||
This first sentence seems a bit wordy to me, can we rephrase:
|
||
|
||
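The idea of keeping the wrapper code independent of any observability library can be sketched with a small factory interface and interchangeable backends. This is only an illustration under stated assumptions: `SketchTrace`, `SketchTelemetryFactory`, `NoopTelemetryFactory`, `RecordingTelemetryFactory` and `SketchInstrumentedCode` are hypothetical names, and a backend for OpenTelemetry or X-Ray would be one more implementation of the same interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Handle for an open trace; close() marks its end and never throws a checked exception. */
interface SketchTrace extends AutoCloseable {
  @Override
  void close();
}

/** Hypothetical backend-neutral entry point that instrumented code depends on. */
interface SketchTelemetryFactory {
  SketchTrace openTrace(String name);
}

/** Backend used when telemetry is disabled: every call is a no-op. */
final class NoopTelemetryFactory implements SketchTelemetryFactory {
  @Override
  public SketchTrace openTrace(String name) {
    return () -> { };
  }
}

/** Backend that keeps an in-memory log; a real backend would call its own library here. */
final class RecordingTelemetryFactory implements SketchTelemetryFactory {
  final List<String> events = new ArrayList<>();

  @Override
  public SketchTrace openTrace(String name) {
    events.add("open " + name);
    return () -> events.add("close " + name);
  }
}

/** Code written against the interface does not change when the backend does. */
final class SketchInstrumentedCode {
  static int runQuery(SketchTelemetryFactory telemetry) {
    try (SketchTrace ignored = telemetry.openTrace("executeQuery")) {
      return 42; // placeholder for the real work
    }
  }
}
```

Calling `SketchInstrumentedCode.runQuery(new NoopTelemetryFactory())` and `runQuery(new RecordingTelemetryFactory())` runs the same instrumented code; only the second records anything.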
## Plugin-level tracing | ||
|
||
With our tracing objects and entities defined, the first step of our solution is to open a telemetry context for every wrapper execution. That way, the monitoring can show a trace for each operation executed by the wrapper. Every operation, such as `createStatement()` and `executeQuery()`, will now be individually traced. | ||
|
||
In order to achieve that, a `TelemetryContext` object will be opened inside every call to `executeWithPlugins()` in the `WrapperUtils` class. The object will be created before the execution and closed right after, tracing its execution. | ||
|
||
Then, once those traces are created, we will also open a new telemetry context for each plugin execution inside the wrapper execution. | ||
|
||
<div style="center"><img src="images/telemetry-execution.png" width="120%" height="120%"/></div> | ||
In the image there are two bars, with each bar having a label beneath. At first, I was confused by what "top level telemetry context" was referring to: the top bar, bottom bar, or a separate concept/object. Could we put the label for each bar above that bar instead of beneath? I think it will make the image clearer. |
||
|
||
The plugin execution chain is created in the `ConnectionPluginManager` class, where execution lambdas are generated and invoke the `execute()` method for every plugin subscribed to a given execution. | ||
|
||
Inside the function that generates that plugin execution chain, we will encapsulate those lambdas with `TelemetryContext` objects that will then trace every plugin execution individually. | ||
|
||
<div style="center"><img src="images/telemetry-pluginlevel.png" width="120%" height="120%"/></div> | ||
|
||
The image above displays the different telemetry contexts that are opened inside the plugin execution chain, allowing plugin-specific tracing. | ||
|
||
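The wrapping of the plugin execution chain described in this section can be sketched as follows. This is not the code of `ConnectionPluginManager` or `WrapperUtils`; the types `SketchPlugin`, `PassThroughPlugin` and `SketchPluginChain` are hypothetical, and a plain list of events stands in for a telemetry backend.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical plugin: does its own work, then delegates to the next link in the chain. */
interface SketchPlugin {
  String name();

  String execute(Supplier<String> next);
}

/** Plugin that adds no behavior of its own and simply calls the next link. */
final class PassThroughPlugin implements SketchPlugin {
  private final String name;

  PassThroughPlugin(String name) { this.name = name; }

  @Override
  public String name() { return name; }

  @Override
  public String execute(Supplier<String> next) { return next.get(); }
}

final class SketchPluginChain {
  /** Trace boundaries in the order they occur. */
  final List<String> events = new ArrayList<>();

  /** Runs body inside a telemetry context that is closed even if body throws. */
  private <T> T traced(String name, Supplier<T> body) {
    events.add("open " + name);
    try {
      return body.get();
    } finally {
      events.add("close " + name);
    }
  }

  /** Builds the chain back to front so that each plugin's "next" is the already wrapped remainder. */
  String execute(String methodName, List<SketchPlugin> plugins, Supplier<String> target) {
    Supplier<String> chain = target;
    for (int i = plugins.size() - 1; i >= 0; i--) {
      final SketchPlugin plugin = plugins.get(i);
      final Supplier<String> next = chain;
      chain = () -> traced(plugin.name(), () -> plugin.execute(next));
    }
    // Top-level context around the whole wrapper execution
    return traced(methodName, chain);
  }
}
```

In this sketch each plugin context stays open while the plugins after it run, so the contexts nest: the top-level context encloses the first plugin, which encloses the second, and so on down to the target call.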
## Plugin-specific metrics | ||
|
||
In addition to plugin-level tracing, we will also introduce specific metrics related to performance for each available plugin. The list of the available plugins that will have metrics added is the following: | ||
|
||
- Data cache plugin | ||
- Failover plugin | ||
- IAM authentication plugin | ||
- AWS Secrets Manager plugin | ||
- EFM (Enhanced Failure Monitoring) plugin | ||
|
||
### Data cache plugin | ||
|
||
Metrics: | ||
(Metric name | metric type | Unit (if applicable) | Dimensions | Description) | ||
|
||
- Execution counter\ | ||
Type: counter\ | ||
Counts the number of times the plugin has been executed | ||
|
||
- Put counter\ | ||
Type: counter\ | ||
Counts the number of times any object has been put into the cache | ||
|
||
- Get counter\ | ||
Type: counter\ | ||
Counts the number of times any object has been looked up in the cache | ||
|
||
- Hit counter\ | ||
Type: counter\ | ||
Counts the number of times an object was looked up in the cache and retrieved | ||
|
||
- Hit counter per query\ | ||
Type: counter\ | ||
Dimension: query\ | ||
Counts the number of times a given query was looked up in the cache and retrieved | ||
|
||
- Miss counter\ | ||
Type: counter\ | ||
Counts the number of times an object was looked up in the cache and not found | ||
|
||
- Miss counter per query\ | ||
Type: counter\ | ||
Dimension: query\ | ||
missing \ |
||
Counts the number of times a given query was looked up in the cache and not found | ||
|
||
- Hit/Miss ratio\ | ||
Type: gauge\ | ||
Computed periodically, at a given frequency: the number of hits as a percentage of the number of lookups | ||
(for my own understanding): what do you mean by a given frequency? What frequency are you referring to? Is the frequency the same thing as the hit/miss percentage? |
||
|
||
- Hit/Miss ratio per query\ | ||
Type: gauge\ | ||
Dimension: query\ | ||
Computed periodically, at a given frequency: for each query, the number of hits as a percentage of the number of lookups | ||
|
||
- Cache clear counter\ | ||
Type: counter\ | ||
Counts the number of times the data cache has been cleared | ||
|
||
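As an illustration of how the hit, miss and ratio metrics listed above relate to each other, the following sketch keeps global and per-query counters and derives the hit percentage from them. It is a sketch under stated assumptions, not the data cache plugin's code: `SketchCacheMetrics`, `recordLookup`, `getCount` and `hitRatioPercent` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class SketchCacheMetrics {
  private final LongAdder hits = new LongAdder();
  private final LongAdder misses = new LongAdder();
  // "query" dimension: one pair of counters per distinct SQL string
  private final Map<String, LongAdder> hitsPerQuery = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> missesPerQuery = new ConcurrentHashMap<>();

  /** Records one cache lookup for the given query and whether it was found. */
  void recordLookup(String query, boolean found) {
    (found ? hits : misses).increment();
    (found ? hitsPerQuery : missesPerQuery)
        .computeIfAbsent(query, q -> new LongAdder())
        .increment();
  }

  /** Total number of lookups ("Get counter"): every lookup is either a hit or a miss. */
  long getCount() { return hits.sum() + misses.sum(); }

  /** Hits as a percentage of all lookups; 0 when nothing has been looked up yet. */
  double hitRatioPercent() { return percent(hits.sum(), misses.sum()); }

  /** Same ratio restricted to a single query. */
  double hitRatioPercent(String query) {
    return percent(sum(hitsPerQuery, query), sum(missesPerQuery, query));
  }

  private static long sum(Map<String, LongAdder> counters, String key) {
    final LongAdder adder = counters.get(key);
    return adder == null ? 0L : adder.sum();
  }

  private static double percent(long hitCount, long missCount) {
    final long lookups = hitCount + missCount;
    return lookups == 0 ? 0.0 : 100.0 * hitCount / lookups;
  }
}
```

A gauge reporting the ratio would call `hitRatioPercent()` each time it is sampled. Note that in this sketch the per-query maps grow with the number of distinct queries, which is worth keeping in mind when a query is used as a dimension.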
### Failover plugin | ||
|
||
Metrics: | ||
(Metric name | metric type | Unit (if applicable) | Dimensions | Description) | ||
I think we could consider removing this heading here and below, it seems quite clear without it. However, if we do keep it we could consider either removing the unit/dimension part (since neither are present in any of the metrics) or changing the dimensions part to |
||
|
||
- Execution counter\ | ||
Type: counter\ | ||
Counts the number of times the plugin has been executed | ||
|
||
- Failover trigger counter\ | ||
Type: counter\ | ||
Counts the number of times failover has been triggered | ||
|
||
- Writer failover counter\ | ||
Type: counter\ | ||
Counts the number of times failover was triggered and the driver reconnected to a writer instance | ||
|
||
- Reader failover counter\ | ||
Type: counter\ | ||
Counts the number of times failover was triggered and the driver reconnected to a reader instance | ||
|
||
### EFM plugin | ||
|
||
Metrics: | ||
(Metric name | metric type | Unit (if applicable) | Dimensions | Description) | ||
|
||
- Execution counter\ | ||
Type: counter\ | ||
Counts the number of times the plugin has been executed | ||
There's a few placeholder links in this doc, just a reminder to update them with the actual links |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
/* | ||
* Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
* | ||
* Licensed under the Apache License, Version 2.0 (the "License"). | ||
* You may not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package software.amazon; | ||
|
||
import com.amazonaws.xray.AWSXRay; | ||
import com.amazonaws.xray.AWSXRayRecorderBuilder; | ||
import io.opentelemetry.api.OpenTelemetry; | ||
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator; | ||
import io.opentelemetry.context.propagation.ContextPropagators; | ||
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter; | ||
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter; | ||
import io.opentelemetry.sdk.OpenTelemetrySdk; | ||
import io.opentelemetry.sdk.metrics.SdkMeterProvider; | ||
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader; | ||
import io.opentelemetry.sdk.trace.SdkTracerProvider; | ||
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor; | ||
import java.sql.Connection; | ||
import java.sql.DriverManager; | ||
import java.sql.ResultSet; | ||
import java.sql.SQLException; | ||
import java.sql.Statement; | ||
import java.util.Properties; | ||
import java.util.concurrent.TimeUnit; | ||
import software.amazon.jdbc.PropertyDefinition; | ||
|
||
public class MetricsExample { | ||
|
||
// User configures connection properties here | ||
public static final String POSTGRESQL_CONNECTION_STRING = | ||
"jdbc:aws-wrapper:postgresql://atlas-postgres.cluster-czygpppufgy4.us-east-2.rds.amazonaws.com:5432/postgres"; | ||
private static final String USERNAME = "pgadmin"; | ||
private static final String PASSWORD = "my_password_2020"; | ||
|
||
|
||
private static final String SQL_DBLIST = "SELECT datname FROM pg_database;"; | ||
private static final String SQL_SLEEP = "select pg_sleep(120);"; | ||
private static final String SQL_TABLELIST = "select * from information_schema.tables where table_schema='public';"; | ||
|
||
private final OpenTelemetry openTelemetry; | ||
private final OtlpGrpcSpanExporter spanExporter; | ||
private final OtlpGrpcMetricExporter metricExporter; | ||
private final SdkTracerProvider tracerProvider; | ||
private final SdkMeterProvider meterProvider; | ||
|
||
public MetricsExample() { | ||
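// Fall back to the default OTLP gRPC endpoint when the variable is unset; the exporter builders reject a null endpoint. | ||
final String endpoint = System.getenv().getOrDefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"); | ||
spanExporter = OtlpGrpcSpanExporter.builder().setEndpoint(endpoint).build(); | ||
metricExporter = OtlpGrpcMetricExporter.builder().setEndpoint(endpoint).build(); | ||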
|
||
tracerProvider = SdkTracerProvider.builder().addSpanProcessor(SimpleSpanProcessor.create(spanExporter)).build(); | ||
meterProvider = SdkMeterProvider.builder() | ||
.registerMetricReader(PeriodicMetricReader.builder(metricExporter).setInterval(15, TimeUnit.SECONDS).build()) | ||
.build(); | ||
|
||
openTelemetry = OpenTelemetrySdk.builder() | ||
.setTracerProvider(tracerProvider) | ||
.setMeterProvider(meterProvider) | ||
.setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance())) | ||
.buildAndRegisterGlobal(); | ||
} | ||
|
||
public void doWork(Properties properties) throws SQLException { | ||
try (final Connection conn = DriverManager.getConnection(POSTGRESQL_CONNECTION_STRING, properties); | ||
final Statement statement = conn.createStatement(); | ||
final ResultSet rs = statement.executeQuery(SQL_SLEEP)) { | ||
System.out.println(Util.getResult(rs)); | ||
Does pg_sleep return anything here?
pg_sleep returns an empty resultset I believe |
||
} | ||
} | ||
|
||
public static void main(String[] args) throws SQLException { | ||
I get a nullpointerexception when I run the example as-is even though you have the cluster endpoints and credentials specified
|
||
final MetricsExample example = new MetricsExample(); | ||
|
||
AWSXRayRecorderBuilder builder = AWSXRayRecorderBuilder.standard(); | ||
|
||
AWSXRay.setGlobalRecorder(builder.build()); | ||
|
||
final Properties properties = new Properties(); | ||
properties.setProperty(PropertyDefinition.PLUGINS.name, "dataCache, efm, failover"); | ||
properties.setProperty(PropertyDefinition.USER.name, USERNAME); | ||
properties.setProperty(PropertyDefinition.PASSWORD.name, PASSWORD); | ||
|
||
properties.setProperty(PropertyDefinition.ENABLE_TELEMETRY.name, String.valueOf(true)); | ||
// Traces: Available values are XRAY, OTLP and NONE | ||
properties.setProperty(PropertyDefinition.TELEMETRY_TRACES_BACKEND.name, "XRAY"); | ||
// Metrics: Available values are OTLP and NONE | ||
properties.setProperty(PropertyDefinition.TELEMETRY_METRICS_BACKEND.name, "NONE"); | ||
|
||
System.out.println("-- starting metrics e2e test"); | ||
|
||
System.out.println("-- env vars"); | ||
System.out.println("AWS_REGION: " + System.getenv("AWS_REGION")); | ||
System.out.println("OTEL_METRICS_EXPORTER: " + System.getenv("OTEL_METRICS_EXPORTER")); | ||
System.out.println("OTEL_TRACES_EXPORTER: " + System.getenv("OTEL_TRACES_EXPORTER")); | ||
System.out.println("OTEL_LOGS_EXPORTER: " + System.getenv("OTEL_LOGS_EXPORTER")); | ||
System.out.println("OTEL_EXPORTER_OTLP_ENDPOINT: " + System.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")); | ||
System.out.println("OTEL_RESOURCE_ATTRIBUTES: " + System.getenv("OTEL_RESOURCE_ATTRIBUTES")); | ||
|
||
System.out.println("-- running application"); | ||
|
||
AWSXRay.beginSegment("application"); | ||
example.doWork(properties); | ||
AWSXRay.endSegment(); | ||
|
||
} | ||
} |
Should we leave the responsibility of explaining what the wrapper is to the repo home page or do you think its helpful to have here as well?