Merge pull request #123 from ExpediaGroup/feature/all_metrics
Codahale metrics sent to Datadog with JMX enabled
DhrubajyotiSadhu authored Feb 6, 2024
2 parents 0e975ec + 3a9baf9 commit 154f83d
Showing 4 changed files with 56 additions and 49 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,12 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [4.0.0] - 2024-02-06
### Added
- Enables JMX (Java Management Extensions) on Hadoop clients, allowing for remote monitoring and management of JVM-related metrics
### Removed
- CloudWatch metrics in favour of JMX Prometheus Exporter.
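A quick way to confirm the new remote JMX endpoint is up once the metastore starts — a minimal sketch, assuming the port 8088 and the no-auth/no-SSL flags set in `files/startup.sh` later in this diff, with the hostname as a placeholder:

```sh
# Verify the JVM is listening on the remote JMX port configured in startup.sh (8088).
ss -ltn | grep 8088

# Connect with jconsole; no credentials or SSL are needed because startup.sh sets
# -Dcom.sun.management.jmxremote.authenticate=false and -Dcom.sun.management.jmxremote.ssl=false.
jconsole <metastore-host>:8088
```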

## [3.0.17] - 2024-01-31
### Added
- Enable prometheus jmx agent when running on ECS by exporting `EXPORTER_OPTS`
86 changes: 43 additions & 43 deletions README.md
@@ -4,50 +4,50 @@
For more information please refer to the main [Apiary](https://github.com/ExpediaGroup/apiary) project page.

## Environment Variables
|Environment Variable|Required|Description|
|----|----|----|
|APIARY_S3_INVENTORY_PREFIX|No (defaults to `EntireBucketDaily`)|Prefix used by S3 Inventory when creating data in the inventory bucket.|
|APIARY_S3_INVENTORY_TABLE_FORMAT|No (defaults to `ORC`)|Format of S3 inventory data - `ORC`, `Parquet`, or `CSV`|
|APIARY_SYSTEM_SCHEMA|No (defaults to `apiary_system`)|Name for internal system database.|
|AWS_REGION|Yes|AWS region to configure various AWS clients.|
|AWS_WEB_IDENTITY_TOKEN_FILE|No|Path of the AWS Web Identity Token File for IRSA/OIDC AWS authentication.|
|DISALLOW_INCOMPATIBLE_COL_TYPE_CHANGES|No|`true`/`false` value for hive.metastore.disallow.incompatible.col.type.changes, default `true`.|
|ENABLE_GLUESYNC|No|Option to turn on GlueSync Hive Metastore listener.|
|ENABLE_HIVE_LOCK_HOUSE_KEEPER|No|Option to turn on Hive Metastore Hive Lock House Keeper.|
|ENABLE_METRICS|No|Option to enable sending Hive Metastore metrics to CloudWatch.|
|ENABLE_S3_INVENTORY|No|Option to create Hive tables on top of S3 inventory data if enabled in `apiary-data-lake`. Enabled if value is not null/empty.|
|ENABLE_S3_LOGS|No|Option to create Hive tables on top of S3 access logs data if enabled in `apiary-data-lake`. Enabled if value is not null/empty.|
|EXTERNAL_DATABASE|No|Option to enable external database mode, when specified it disables managing Hive Metastore MySQL database schema.|
|GLUE_PREFIX|No|Prefix added to Glue databases to handle database name collisions when synchronizing multiple Hive Metastores to the Glue catalog.|
|HADOOP_HEAPSIZE|No|Hive Metastore Java process heapsize.|
|HMS_AUTOGATHER_STATS|No (default is `true`)|Whether or not to create basic statistics on table/partition creation. Valid values are `true` or `false`.|
|LIMIT_PARTITION_REQUEST_NUMBER|No (default is `-1`)|To protect the cluster, this controls how many partitions can be scanned for each partitioned table. The default value "-1" means no limit. The limit on partitions does not affect metadata-only queries.|
|HIVE_METASTORE_ACCESS_MODE|No|Hive Metastore access mode, applicable values are: readwrite, readonly|
|HIVE_DB_NAMES|No|comma separated list of Hive database names, when specified Hive databases will be created and mapped to corresponding S3 buckets.|
|HIVE_METASTORE_LOG_LEVEL|No|Hive Metastore service Log4j log level.|
|HMS_MIN_THREADS|No (defaults to `200`)|Minimum size of the Hive metastore thread pool.|
|HMS_MAX_THREADS|No (defaults to `1000`)|Maximum size of the Hive metastore thread pool.|
|INSTANCE_NAME|Yes|Apiary instance name, will be used as prefix on most AWS resources to allow multiple Apiary instance deployments.|
|KAFKA_BOOTSTRAP_SERVERS|No|Kafka Bootstrap Servers to enable Kafka Metastore listener and send Metastore events to Kafka.|
|KAFKA_CLIENT_ID|No|Kafka label you define that names the Kafka producer.|
|KAFKA_COMPRESSION_TYPE|No|Kafka compression type; if none is specified no compression is enabled. Available values are gzip, lz4 and snappy.|
|KAFKA_MAX_REQUEST_SIZE|No (defaults to `1048576`)|The maximum size of a request in bytes. This setting will limit the number of record batches the producer will send in a single request to avoid sending huge requests. This is also effectively a cap on the maximum uncompressed record batch size.|
|LDAP_BASE|No|LDAP base DN used to search for user groups.|
|Environment Variable|Required| Description |
|----|----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|APIARY_S3_INVENTORY_PREFIX|No (defaults to `EntireBucketDaily`)| Prefix used by S3 Inventory when creating data in the inventory bucket. |
|APIARY_S3_INVENTORY_TABLE_FORMAT|No (defaults to `ORC`)| Format of S3 inventory data - `ORC`, `Parquet`, or `CSV` |
|APIARY_SYSTEM_SCHEMA|No (defaults to `apiary_system`)| Name for internal system database. |
|AWS_REGION|Yes| AWS region to configure various AWS clients. |
|AWS_WEB_IDENTITY_TOKEN_FILE|No| Path of the AWS Web Identity Token File for IRSA/OIDC AWS authentication. |
|DISALLOW_INCOMPATIBLE_COL_TYPE_CHANGES|No| `true`/`false` value for hive.metastore.disallow.incompatible.col.type.changes, default `true`. |
|ENABLE_GLUESYNC|No| Option to turn on GlueSync Hive Metastore listener. |
|ENABLE_HIVE_LOCK_HOUSE_KEEPER|No| Option to turn on Hive Metastore Hive Lock House Keeper. |
|ENABLE_METRICS|No| Option to enable sending Hive Metastore and JMX metrics to Prometheus. |
|ENABLE_S3_INVENTORY|No| Option to create Hive tables on top of S3 inventory data if enabled in `apiary-data-lake`. Enabled if value is not null/empty. |
|ENABLE_S3_LOGS|No| Option to create Hive tables on top of S3 access logs data if enabled in `apiary-data-lake`. Enabled if value is not null/empty. |
|EXTERNAL_DATABASE|No| Option to enable external database mode, when specified it disables managing Hive Metastore MySQL database schema. |
|GLUE_PREFIX|No| Prefix added to Glue databases to handle database name collisions when synchronizing multiple Hive Metastores to the Glue catalog. |
|HADOOP_HEAPSIZE|No| Hive Metastore Java process heapsize. |
|HMS_AUTOGATHER_STATS|No (default is `true`)| Whether or not to create basic statistics on table/partition creation. Valid values are `true` or `false`. |
|LIMIT_PARTITION_REQUEST_NUMBER|No (default is `-1`)| To protect the cluster, this controls how many partitions can be scanned for each partitioned table. The default value "-1" means no limit. The limit on partitions does not affect metadata-only queries. |
|HIVE_METASTORE_ACCESS_MODE|No| Hive Metastore access mode, applicable values are: readwrite, readonly |
|HIVE_DB_NAMES|No| comma separated list of Hive database names, when specified Hive databases will be created and mapped to corresponding S3 buckets. |
|HIVE_METASTORE_LOG_LEVEL|No| Hive Metastore service Log4j log level. |
|HMS_MIN_THREADS|No (defaults to `200`)| Minimum size of the Hive metastore thread pool. |
|HMS_MAX_THREADS|No (defaults to `1000`)| Maximum size of the Hive metastore thread pool. |
|INSTANCE_NAME|Yes| Apiary instance name, will be used as prefix on most AWS resources to allow multiple Apiary instance deployments. |
|KAFKA_BOOTSTRAP_SERVERS|No| Kafka Bootstrap Servers to enable Kafka Metastore listener and send Metastore events to Kafka. |
|KAFKA_CLIENT_ID|No| Kafka label you define that names the Kafka producer. |
|KAFKA_COMPRESSION_TYPE|No| Kafka compression type; if none is specified no compression is enabled. Available values are gzip, lz4 and snappy. |
|KAFKA_MAX_REQUEST_SIZE|No (defaults to `1048576`)| The maximum size of a request in bytes. This setting will limit the number of record batches the producer will send in a single request to avoid sending huge requests. This is also effectively a cap on the maximum uncompressed record batch size. |
|LDAP_BASE|No| LDAP base DN used to search for user groups. |
|LDAP_CA_CERT|No|Base64 encoded Certificate Authority Bundle to validate LDAP SSL connection.|
|LDAP_SECRET_ARN|No|LDAP bind DN SecretsManager secret ARN.|
|LDAP_URL|No|Active Directory URL to enable group mapping in metastore.|
|MYSQL_CONNECTION_POOL_SIZE|No (defaults to `10`)|MySQL Connection pool size for Hive Metastore. See [here](https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1181) for more info.|
|MYSQL_DB_HOST|Yes|Hive Metastore MySQL database hostname.|
|MYSQL_DB_NAME|Yes|Hive Metastore MySQL database name.|
|MYSQL_SECRET_ARN|Yes|Hive Metastore MySQL SecretsManager secret ARN.|
|MYSQL_SECRET_USERNAME_KEY|No (defaults to `username`)|Hive Metastore MySQL SecretsManager secret username key.|
|RANGER_AUDIT_DB_URL|No|Ranger audit database JDBC URL.|
|RANGER_AUDIT_SECRET_ARN|No|Ranger audit database secret ARN.|
|RANGER_AUDIT_SOLR_URL|No|Ranger Solr audit URL.|
|RANGER_POLICY_MANAGER_URL|No|Ranger admin URL from where policies will be downloaded.|
|RANGER_SERVICE_NAME|No|Ranger service name used to configure RangerAuth plugin.|
|SNS_ARN|No|The SNS topic ARN to which metadata updates will be sent.|
|TABLE_PARAM_FILTER|No|A regular expression for selecting necessary table parameters. If the value isn't set, then no table parameters are selected.|
|LDAP_SECRET_ARN|No| LDAP bind DN SecretsManager secret ARN. |
|LDAP_URL|No| Active Directory URL to enable group mapping in metastore. |
|MYSQL_CONNECTION_POOL_SIZE|No (defaults to `10`)| MySQL Connection pool size for Hive Metastore. See [here](https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1181) for more info. |
|MYSQL_DB_HOST|Yes| Hive Metastore MySQL database hostname. |
|MYSQL_DB_NAME|Yes| Hive Metastore MySQL database name. |
|MYSQL_SECRET_ARN|Yes| Hive Metastore MySQL SecretsManager secret ARN. |
|MYSQL_SECRET_USERNAME_KEY|No (defaults to `username`)| Hive Metastore MySQL SecretsManager secret username key. |
|RANGER_AUDIT_DB_URL|No| Ranger audit database JDBC URL. |
|RANGER_AUDIT_SECRET_ARN|No| Ranger audit database secret ARN. |
|RANGER_AUDIT_SOLR_URL|No| Ranger Solr audit URL. |
|RANGER_POLICY_MANAGER_URL|No| Ranger admin URL from where policies will be downloaded. |
|RANGER_SERVICE_NAME|No| Ranger service name used to configure RangerAuth plugin. |
|SNS_ARN|No| The SNS topic ARN to which metadata updates will be sent. |
|TABLE_PARAM_FILTER|No| A regular expression for selecting necessary table parameters. If the value isn't set, then no table parameters are selected. |
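For illustration, a hedged sketch of running the metastore container with metrics enabled, using only variables from the table above. The image name, hostnames and secret ARN are placeholders, and the Thrift port 9083 is the conventional Hive Metastore default rather than something defined in this repository:

```sh
# Minimal sketch: required variables plus ENABLE_METRICS.
# Port 9083 is the conventional Hive Metastore Thrift port; 8080 is the
# Prometheus JMX exporter port from files/startup.sh. Image name is a placeholder.
docker run \
  -e INSTANCE_NAME=my-apiary \
  -e AWS_REGION=us-west-2 \
  -e MYSQL_DB_HOST=metastore-db.example.com \
  -e MYSQL_DB_NAME=hive \
  -e MYSQL_SECRET_ARN=arn:aws:secretsmanager:us-west-2:111111111111:secret:hms-db \
  -e ENABLE_METRICS=true \
  -p 9083:9083 -p 8080:8080 \
  apiary-metastore:latest
```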

# Contact

2 changes: 2 additions & 0 deletions files/jmx-exporter.yaml
@@ -4,3 +4,5 @@ ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
attrNameSnakeCase: true
rules:
- pattern: ".*"
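With the catch-all rule above, every readable JMX attribute should be translated into a Prometheus metric. A quick check — a sketch that assumes the exporter agent is attached on port 8080 as configured by `EXPORTER_OPTS` in `files/startup.sh`:

```sh
# List the first few metrics exposed by the JMX Prometheus exporter;
# with pattern ".*" this includes JVM and Hive Metastore (Codahale) MBeans.
curl -s http://localhost:8080/metrics | head -n 20
```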
11 changes: 5 additions & 6 deletions files/startup.sh
@@ -83,17 +83,16 @@ if [[ -n $RANGER_AUDIT_DB_URL ]]; then
fi
fi


# Enables metrics for Hive Metastore and configures Prometheus JMX exporter for monitoring in ECS and Kubernetes environments.
if [ -n "$ENABLE_METRICS" ]; then
update_property.py hive.metastore.metrics.enabled true /etc/hive/conf/hive-site.xml
#configure to send metrics to cloudwatch when running on ECS
if [ -n "$ECS_CONTAINER_METADATA_URI" ]; then
# This line enables JMX (Java Management Extensions) on Hadoop clients, allowing for remote monitoring and management of JVM-related metrics and operations.
# source: https://docs.datadoghq.com/integrations/hive/?tab=containerized
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8088"

# enable prometheus jmx agent when running on ECS
# This line configures the JMX Prometheus exporter, enabling the collection of JMX metrics from the JVM and their exposure in Prometheus format for integration with monitoring systems.
export EXPORTER_OPTS="-javaagent:/usr/lib/apiary/jmx_prometheus_javaagent-${EXPORTER_VERSION}.jar=8080:/etc/hive/conf/jmx-exporter.yaml"
export CLOUDWATCH_NAMESPACE="${INSTANCE_NAME}-metastore"
export ECS_TASK_ID=$(wget -q -O - ${ECS_CONTAINER_METADATA_URI}/task|jq -r .TaskARN|awk -F/ '{ print $NF }')
update_property.py hive.service.metrics.class com.expediagroup.apiary.extensions.metastore.metrics.CodahaleMetrics /etc/hive/conf/hive-site.xml
fi
#enable prometheus jmx agent when running on kubernetes
if [ -n "$KUBERNETES_SERVICE_HOST" ]; then
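For reference, the `ECS_TASK_ID` line in the hunk above extracts the short task ID from the ECS task metadata endpoint; a worked sketch with a made-up ARN:

```sh
# The metadata endpoint returns JSON containing a TaskARN such as:
#   arn:aws:ecs:us-west-2:111111111111:task/my-cluster/9781c248-0edd-4cdb-9a93-f63cb662a5d3
# jq pulls out .TaskARN and awk -F/ '{ print $NF }' keeps the last "/"-separated field:
echo 'arn:aws:ecs:us-west-2:111111111111:task/my-cluster/9781c248-0edd-4cdb-9a93-f63cb662a5d3' \
  | awk -F/ '{ print $NF }'
# => 9781c248-0edd-4cdb-9a93-f63cb662a5d3
```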
