Telemetry work #22

Open
wants to merge 1 commit into base: main

290 changes: 290 additions & 0 deletions docs/design-document.md

Large diffs are not rendered by default.

Binary file added docs/images/telemetry-app.png
Binary file added docs/images/telemetry-execution.png
Binary file added docs/images/telemetry-pluginlevel.png
173 changes: 173 additions & 0 deletions docs/metrics-document.md
@@ -0,0 +1,173 @@
# Adding observability in the JDBC wrapper via metrics

# Introduction

AWS Aurora is a database service provided by AWS and hosted in the AWS cloud. Aurora currently offers MySQL-compatible and PostgreSQL-compatible database instances. To enable access from an application to an Aurora database, users need to set up a driver that can connect to and interact with a remote Aurora database instance. Existing drivers for MySQL and PostgreSQL work fine with Aurora instances, but they do not take advantage of any of the additional features provided by Aurora databases.

Recently, AWS created the AWS JDBC Wrapper. The AWS JDBC Wrapper is not a driver in itself, but rather a driver enhancer: an application that, set up alongside a database driver, enables Aurora-specific features for that driver. Examples of those features are driver failover (link to doc) and integration with IAM and AWS Secrets Manager (link). The AWS JDBC Wrapper is organized into plugins, where each wrapper feature is isolated and independent from the other features. This allows users to select which wrapper features/plugins they require in their application workflow.
**Review comment:** Should we leave the responsibility of explaining what the wrapper is to the repo home page, or do you think it's helpful to have it here as well?


# Problem

Through the driver, applications interact with a remote database by sending requests. Each request triggers a chain of execution in which every plugin enabled by the user is activated and executed in turn, until the request reaches the database and the result is returned.

In its current form, the AWS JDBC Wrapper is much like a black box: during this chain of execution, there is no way to measure the individual performance of each plugin over the course of a query. This means that if performance degrades while using the Wrapper, troubleshooting is a manual process, requiring users to dive deep into application logs and inspect the behavior of each enabled plugin by hand.
**Review comment:** Since this PR aims to fix this problem and this doc will be merged in with the solution, should we replace *In its current form...* with *Previously, the AWS...*?


This project aims to add internal observability of the Wrapper's performance at the plugin level.

# Goals

The goals of the project are the following:

- Instrument the wrapper code in order to obtain metrics and traces at the plugin level
- Define observability interfaces that make the code agnostic to a specific observability library or tool
- Add plugin-specific metrics to existing plugins
- Implement connectors/exporters to visualize the generated observability data

# Proposed solution

## Definitions

**Review comment:** The following few paragraphs (lines 30-36) don't seem to be related to Definitions. Could we move these up a level, right under Proposed solution?

As mentioned in the Goals section, one important aim is that, while adding observability to the driver, we do not tie our codebase to any vendor-specific or particular existing metrics library.

This is not only to avoid adding a new external dependency to the wrapper, but mostly because applications that use the wrapper to interact with a database very often already have their own observability and/or monitoring implementations.

When that is the case, we would ideally like wrapper-specific metrics to simply be added to the user's existing application monitoring. To achieve that, we needed to find a standard observability approach that fits most users' workflows. For this project, we have decided to follow OpenTelemetry (link) notations and definitions for our monitoring entities.

OpenTelemetry aims to define vendor- and language-agnostic specifications for how to monitor applications.

The entities we introduce in this project are the following: Metrics and Traces.
**Review comment:** Since we mention metrics before traces on this line, let's introduce them in the same order below (e.g., metrics section first, then traces section).


### Traces

We define a trace as a delimited sequence of an application's execution, defined by its start and its end. A trace can cover the entire application execution or a single atomic operation. Traces can be related either by hierarchy or by symbolic links.
**Review comment:** Should this read "by hierarchy or with symbolic links"?


Collecting libraries see the trace boundaries and measure the execution time for a particular trace. For a better understanding of application behavior, different attributes can be attached to a trace, such as a success status and/or a status code.

In our project, traces are represented as `TelemetryContext` objects. A `TelemetryContext` object will be created/opened at every plugin invocation or execution.
**Review comment:** Suggestion: "A `TelemetryContext` object will be created for each wrapper invocation and each individual plugin invocation. For more details see the section on [plugin level tracing](#plugin-level-tracing)."


`TelemetryContext` objects are similar to what OpenTelemetry defines as *spans*, and also to the concept of a *trace* in AWS X-Ray, the trace visualization mechanism featured in AWS CloudWatch.
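
To make the design concrete, the following is a minimal sketch of what such a library-agnostic tracing abstraction could look like. The interface and method names here are illustrative assumptions for this document, not the wrapper's actual API.

```java
// Minimal sketch of a library-agnostic tracing abstraction.
// All names here are illustrative assumptions, not the wrapper's actual API.
interface TelemetryContext extends AutoCloseable {
  void setSuccess(boolean success);            // e.g. attach a success status to the trace
  void setAttribute(String key, String value); // e.g. attach a status code or other attribute
  @Override
  void close();                                // closing the context ends the trace
}

interface TelemetryFactory {
  // A concrete implementation could map this to an OpenTelemetry span
  // or an AWS X-Ray (sub)segment.
  TelemetryContext openTelemetryContext(String name);
}
```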

### Metrics

Metrics will also be added to our observability suite for the numeric data that will be collected throughout the application execution.

For metrics, we follow the same standard defined by OpenTelemetry, which consists of:
- Counters

**Review comment:** Can we have the plurality of the metrics be consistent?

- Gauge
- Histogram

Gauges will be used for situations where the numeric data varies, but not incrementally, throughout the execution of the application. An example would be a hit/miss ratio for a cache. Further information on these three entities, and which kinds of data each suits best, can be found in the OpenTelemetry documentation (link). A short sketch of all three instrument kinds follows below.

**Review comment:** Is the link for the OpenTelemetry documentation missing?
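
For reference, the sketch below shows how these three instrument kinds look in the OpenTelemetry Java API, which one possible backing implementation of our metrics layer could delegate to. The instrument names used here are purely illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative use of the three OpenTelemetry instrument kinds; names are examples only.
class InstrumentKindsSketch {
  public static void main(String[] args) {
    Meter meter = GlobalOpenTelemetry.getMeter("aws-jdbc-wrapper-example");

    // Counter: a value that only goes up, e.g. the number of plugin executions.
    LongCounter executions = meter.counterBuilder("plugin.executions").setUnit("1").build();
    executions.add(1);

    // Gauge: a value sampled via callback, e.g. a cache hit/miss ratio.
    AtomicLong hits = new AtomicLong(7);
    AtomicLong lookups = new AtomicLong(10);
    meter.gaugeBuilder("cache.hit.ratio")
        .buildWithCallback(m -> m.record((double) hits.get() / Math.max(1, lookups.get())));

    // Histogram: a distribution of values, e.g. query execution time in milliseconds.
    DoubleHistogram queryTime = meter.histogramBuilder("query.time").setUnit("ms").build();
    queryTime.record(12.5);
  }
}
```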


### Warning

Although the entities and objects that we define here are similar to, or can be mapped to, OpenTelemetry concepts, we are still creating our observability layer independently of any available library or suite. Using the OpenTelemetry libraries for Java requires writing the mapping (in this case, an interface implementation) from our definitions to the OpenTelemetry objects.
**Review comment:** This first sentence seems a bit wordy to me; can we rephrase it as "Although the entities and objects we define here are similar or can be mapped to OpenTelemetry concepts, we are still..."?
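
As an illustration of what such a mapping could look like, here is a hedged sketch of an OpenTelemetry-backed implementation of the hypothetical `TelemetryContext` interface sketched earlier; it is not the actual class introduced by this PR.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;

// Hedged sketch: one possible OpenTelemetry-backed implementation of the hypothetical
// TelemetryContext interface sketched above; not the PR's actual classes.
class OpenTelemetryContext implements TelemetryContext {
  private final Span span;

  OpenTelemetryContext(String name) {
    Tracer tracer = GlobalOpenTelemetry.getTracer("aws-jdbc-wrapper-example");
    this.span = tracer.spanBuilder(name).startSpan();
  }

  @Override
  public void setSuccess(boolean success) {
    span.setStatus(success ? StatusCode.OK : StatusCode.ERROR);
  }

  @Override
  public void setAttribute(String key, String value) {
    span.setAttribute(key, value);
  }

  @Override
  public void close() {
    span.end(); // ending the span closes this trace segment
  }
}
```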


## Plugin-level tracing

Having defined our objects and entities for tracing, our solution is to first open a telemetry context for every wrapper execution. That way, the monitoring will be able to show traces for every different operation executed by the wrapper. Every operation, such as `createStatement()` and `executeQuery()`, will now be individually traced.

In order to achieve that, a `TelemetryContext` object will be opened inside every call to `executeWithPlugins()` in the `WrapperUtils` class. The object will be created before the execution and closed right after, tracing its execution.
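
As a rough sketch of that idea, using the hypothetical `TelemetryContext`/`TelemetryFactory` interfaces from the Definitions section and a deliberately simplified signature (the real `executeWithPlugins()` signature and internals differ):

```java
import java.util.concurrent.Callable;

// Hedged sketch: tracing a wrapper-level JDBC call around the plugin chain.
// The field, interfaces and the simplified signature are illustrative assumptions.
class TracedWrapperUtilsSketch {
  private final TelemetryFactory telemetryFactory;

  TracedWrapperUtilsSketch(TelemetryFactory telemetryFactory) {
    this.telemetryFactory = telemetryFactory;
  }

  <T> T executeWithPlugins(String methodName, Callable<T> pluginChain) throws Exception {
    TelemetryContext context = telemetryFactory.openTelemetryContext(methodName);
    try {
      T result = pluginChain.call(); // run the full plugin chain for this JDBC call
      context.setSuccess(true);
      return result;
    } catch (Exception e) {
      context.setSuccess(false);
      throw e;
    } finally {
      context.close(); // closing the context ends the top-level trace
    }
  }
}
```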

Then, once those traces are created, we will also open a new telemetry context for each plugin execution inside the wrapper execution.

<div style="center"><img src="images/telemetry-execution.png" width="120%" height="120%"/></div>
**Review comment:** In the image there are two bars, each with a label beneath it. At first, I was confused by what "top level telemetry context" was referring to: the top bar, the bottom bar, or a separate concept/object. Could we put the label for each bar above that bar instead of beneath? I think it will make the image clearer.


The plugin execution chain is created in the `ConnectionPluginManager` class, where execution lambdas are generated and invoke the `execute()` method for every plugin subscribed to a given execution.

Inside the function that generates that plugin execution chain, we will encapsulate those lambdas with `TelemetryContext` objects that will then trace every plugin execution individually.
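
Sketching the same idea at the plugin level, again with the hypothetical interfaces above and a deliberately simplified view of the chain-building code:

```java
import java.util.concurrent.Callable;

// Hedged sketch: each plugin's execution gets its own nested telemetry context.
// The ConnectionPluginManager internals are simplified assumptions for illustration.
class TracedPluginChainSketch {
  private final TelemetryFactory telemetryFactory;

  TracedPluginChainSketch(TelemetryFactory telemetryFactory) {
    this.telemetryFactory = telemetryFactory;
  }

  <T> Callable<T> wrap(String pluginName, Callable<T> pluginCall) {
    return () -> {
      try (TelemetryContext context = telemetryFactory.openTelemetryContext(pluginName)) {
        return pluginCall.call(); // the nested trace covers only this plugin's execution
      }
    };
  }
}
```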

<div style="center"><img src="images/telemetry-pluginlevel.png" width="120%" height="120%"/></div>

The image above displays the different telemetry contexts that are opened inside the plugin execution chain, allowing plugin specific tracing.

## Plugin-specific metrics

In addition to plugin-level tracing, we will also introduce specific metrics related to performance for each available plugin. The list of the available plugins that will have metrics added is the following:

- Data cache plugin
- Failover plugin
- IAM authentication plugin
- AWS Secrets Manager plugin
- EFM (Enhanced failure monitoring) plugin

### Data cache plugin

Metrics:
(Metric name | metric type | Unit (if applicable) | Dimensions | Description)

- Execution counter\
Type: counter\
Counts the number of times the plugin has been executed

- Put counter\
Type: counter\
Counts the number of times any object has been put into the cache

- Get counter\
Type: counter\
Counts the number of times any object has been looked up in the cache

- Hit counter\
Type: counter\
Counts the number of times an object was looked up in the cache and retrieved

- Hit counter per query\
Type: counter\
Dimension: query\
Counts the number of times a given query was looked up in the cache and retrieved (a hedged recording sketch follows this list)

- Miss counter\
Type: counter\
Counts the number of times an object was looked up in the cache and not found

- Miss counter per query\
Type: counter\
Dimension: query\
Counts the number of times each query was looked up in the cache and not found

**Review comment:** missing `\` after "Dimension: query".

- Hit/Miss ratio\
Type: percentage\
At a given sampling frequency, computes the percentage of hits over the total number of lookups
**Review comment:** (For my own understanding) What do you mean by a given frequency? What frequency are you referring to? Is the frequency the same thing as the hit/miss percentage?


- Hit/Miss ratio per query\
Type: gauge\
Dimension: query\
At a given sampling frequency, computes the percentage of hits over the total number of lookups for each query

- Cache clear counter\
Type: counter\
Counts the number of times the data cache has been cleared
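
As one hedged example of what recording some of these could look like with an OpenTelemetry-backed implementation (the instrument and attribute names here are illustrative, not the plugin's final metric names):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Hedged sketch: recording data cache hits and misses, overall and with the
// per-query dimension attached as an attribute. Names are illustrative only.
class DataCacheMetricsSketch {
  private static final AttributeKey<String> QUERY = AttributeKey.stringKey("query");

  private final LongCounter hits;
  private final LongCounter hitsPerQuery;
  private final LongCounter misses;
  private final LongCounter missesPerQuery;

  DataCacheMetricsSketch() {
    Meter meter = GlobalOpenTelemetry.getMeter("aws-jdbc-wrapper-datacache-example");
    hits = meter.counterBuilder("cache.hits").setUnit("1").build();
    hitsPerQuery = meter.counterBuilder("cache.hits.per_query").setUnit("1").build();
    misses = meter.counterBuilder("cache.misses").setUnit("1").build();
    missesPerQuery = meter.counterBuilder("cache.misses.per_query").setUnit("1").build();
  }

  // Called on every cache lookup performed by the data cache plugin.
  void recordLookup(String sql, boolean hit) {
    Attributes byQuery = Attributes.of(QUERY, sql);
    if (hit) {
      hits.add(1);
      hitsPerQuery.add(1, byQuery);   // dimension: query
    } else {
      misses.add(1);
      missesPerQuery.add(1, byQuery); // dimension: query
    }
  }
}
```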

### Failover plugin

Metrics:
(Metric name | metric type | Unit (if applicable) | Dimensions | Description)
**Review comment:** I think we could consider removing this heading here and below; it seems quite clear without it. However, if we do keep it, we could consider either removing the unit/dimension part (since neither is present in any of the metrics) or changing the dimensions part to "Dimensions (if applicable)".


- Execution counter\
Type: counter\
Counts the number of times the plugin has been executed

- Failover trigger counter\
Type: counter\
Counts the number of times failover has been triggered

- Writer failover counter\
Type: counter\
Counts the number of times failover was triggered and the driver reconnected to a writer instance

- Reader failover counter\
Type: counter\
Counts the number of times failover was triggered and the driver reconnected to a reader instance

### EFM plugin

Metrics:
(Metric name | metric type | Unit (if applicable) | Dimensions | Description)

- Execution counter\
Type: counter\
Counts the number of times the plugin has been executed
**Review comment:** There are a few placeholder links in this doc; just a reminder to update them with the actual links.

5 changes: 5 additions & 0 deletions examples/AWSDriverExample/build.gradle.kts
@@ -19,4 +19,9 @@ dependencies {
implementation("software.amazon.awssdk:rds:2.17.289")
implementation("software.amazon.awssdk:secretsmanager:2.17.285")
implementation(project(":aws-advanced-jdbc-wrapper"))
implementation("io.dropwizard.metrics:metrics-core:4.2.13")
implementation("io.opentelemetry:opentelemetry-api:1.20.1")
implementation("io.opentelemetry:opentelemetry-sdk:1.20.1")
implementation("io.opentelemetry:opentelemetry-exporter-otlp:1.20.1")
implementation("com.amazonaws:aws-xray-recorder-sdk-core:2.13.0")
}
@@ -0,0 +1,115 @@
/*
* Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License").
* You may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package software.amazon;

import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.AWSXRayRecorderBuilder;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import software.amazon.jdbc.PropertyDefinition;

public class MetricsExample {

// User configures connection properties here
public static final String POSTGRESQL_CONNECTION_STRING =
"jdbc:aws-wrapper:postgresql://atlas-postgres.cluster-czygpppufgy4.us-east-2.rds.amazonaws.com:5432/postgres";
private static final String USERNAME = "pgadmin";
private static final String PASSWORD = "my_password_2020";

private static final String SQL_DBLIST = "SELECT datname FROM pg_database;";
private static final String SQL_SLEEP = "select pg_sleep(120);";
private static final String SQL_TABLELIST = "select * from information_schema.tables where table_schema='public';";

private final OpenTelemetry openTelemetry;
private final OtlpGrpcSpanExporter spanExporter;
private final OtlpGrpcMetricExporter metricExporter;
private final SdkTracerProvider tracerProvider;
private final SdkMeterProvider meterProvider;

public MetricsExample() {
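// Note: the OTEL_EXPORTER_OTLP_ENDPOINT environment variable must be set before running
// this example; the exporter builders below throw a NullPointerException on a null endpoint.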
spanExporter = OtlpGrpcSpanExporter.builder().setEndpoint(System.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")).build();
metricExporter = OtlpGrpcMetricExporter.builder().setEndpoint(System.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")).build();

tracerProvider = SdkTracerProvider.builder().addSpanProcessor(SimpleSpanProcessor.create(spanExporter)).build();
meterProvider = SdkMeterProvider.builder()
.registerMetricReader(PeriodicMetricReader.builder(metricExporter).setInterval(15, TimeUnit.SECONDS).build())
.build();

openTelemetry = OpenTelemetrySdk.builder()
.setTracerProvider(tracerProvider)
.setMeterProvider(meterProvider)
.setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
.buildAndRegisterGlobal();
}

public void doWork(Properties properties) throws SQLException {
try (final Connection conn = DriverManager.getConnection(POSTGRESQL_CONNECTION_STRING, properties);
final Statement statement = conn.createStatement();
final ResultSet rs = statement.executeQuery(SQL_SLEEP)) {
System.out.println(Util.getResult(rs));
**Review comment:** Does pg_sleep return anything here?

**Author reply:** pg_sleep returns an empty result set, I believe.

}
}

public static void main(String[] args) throws SQLException {
**Review comment (@karenc-bq, Collaborator, Mar 6, 2023):** I get a NullPointerException when I run the example as-is, even though you have the cluster endpoints and credentials specified:

> Task :driverexample:MetricsExample.main() FAILED
Exception in thread "main" java.lang.NullPointerException: endpoint
	at java.base/java.util.Objects.requireNonNull(Objects.java:233)
	at io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporterBuilder.setEndpoint(OtlpGrpcSpanExporterBuilder.java:93)
	at software.amazon.MetricsExample.<init>(MetricsExample.java:60)
	at software.amazon.MetricsExample.main(MetricsExample.java:84)

final MetricsExample example = new MetricsExample();

AWSXRayRecorderBuilder builder = AWSXRayRecorderBuilder.standard();
AWSXRay.setGlobalRecorder(builder.build());

final Properties properties = new Properties();
properties.setProperty(PropertyDefinition.PLUGINS.name, "dataCache, efm, failover");
properties.setProperty(PropertyDefinition.USER.name, USERNAME);
properties.setProperty(PropertyDefinition.PASSWORD.name, PASSWORD);

properties.setProperty(PropertyDefinition.ENABLE_TELEMETRY.name, String.valueOf(true));
// Traces: Available values are XRAY, OTLP and NONE
properties.setProperty(PropertyDefinition.TELEMETRY_TRACES_BACKEND.name, "XRAY");
// Metrics: Available values are OTLP and NONE
properties.setProperty(PropertyDefinition.TELEMETRY_METRICS_BACKEND.name, "NONE");

System.out.println("-- starting metrics e2e test");

System.out.println("-- env vars");
System.out.println("AWS_REGION: " + System.getenv("AWS_REGION"));
System.out.println("OTEL_METRICS_EXPORTER: " + System.getenv("OTEL_METRICS_EXPORTER"));
System.out.println("OTEL_TRACES_EXPORTER: " + System.getenv("OTEL_TRACES_EXPORTER"));
System.out.println("OTEL_LOGS_EXPORTER: " + System.getenv("OTEL_LOGS_EXPORTER"));
System.out.println("OTEL_EXPORTER_OTLP_ENDPOINT: " + System.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"));
System.out.println("OTEL_RESOURCE_ATTRIBUTES: " + System.getenv("OTEL_RESOURCE_ATTRIBUTES"));

System.out.println("-- running application");

AWSXRay.beginSegment("application");
example.doWork(properties);
AWSXRay.endSegment();
}
}
4 changes: 4 additions & 0 deletions wrapper/build.gradle.kts
@@ -36,6 +36,10 @@ dependencies {
compileOnly("org.postgresql:postgresql:42.5.0")
compileOnly("org.mariadb.jdbc:mariadb-java-client:3.1.0")
compileOnly("org.osgi:org.osgi.core:4.3.0")
compileOnly("io.opentelemetry:opentelemetry-api:1.20.1")
compileOnly("io.opentelemetry:opentelemetry-exporter-otlp:1.20.1")
compileOnly("io.opentelemetry:opentelemetry-sdk-extension-autoconfigure:1.20.1-alpha")
compileOnly("com.amazonaws:aws-xray-recorder-sdk-core:2.13.0")

testImplementation("org.junit.platform:junit-platform-commons:1.9.0")
testImplementation("org.junit.platform:junit-platform-engine:1.9.0")