[2.2.1] - 2024-07-22 (#300)
### Added

- Optimised instance list performance. #76
- Added support for using the Unix epoch time format as the time key for the single-line text log type.

### Fixed

- Fixed an issue where the time key was missing when editing a JSON config. #296
- Fixed an issue where upgrading to v2.2.0 failed due to missing CMK permissions. #297


---------

Co-authored-by: James Ma <[email protected]>
James96315 and James Ma authored Jul 22, 2024
1 parent c6a6f75 commit 876df63
Showing 54 changed files with 1,717 additions and 2,336 deletions.
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,18 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.2.1] - 2024-07-22

### Added

- Optimised instance list performance. #76
- Added support for using the Unix epoch time format as the time key for the single-line text log type.

### Fixed

- Fixed an issue where the time key was missing when editing a JSON config. #296
- Fixed an issue where upgrading to v2.2.0 failed due to missing CMK permissions. #297

## [2.2.0] - 2024-06-20

### Added
@@ -26,7 +38,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Fixed a bug where the system could not read properties of undefined ('accountId') when the Next button was clicked without selecting an Instance Group. #236
- Fixed an issue where logs were not received when using the solution-provisioned staging bucket in Light Engine. #237
- Fixed a permissions issue in the LogMerger State Machine within Light Engine: The S3ObjectMigration Lambda failed due to insufficient KMS permissions on the analytics S3 bucket. #272
- Fixed a bug where at most 100 distributions could be displayed when creating a pipeline. #278
- Fixed a bug that prevented instances from being listed when switching accounts on the Instance Group list page. #291
- Fixed a bug where creating a Log Config of JSON type with a field of type float could not create the index template. #293


## [2.1.2] - 2024-03-19

1 change: 0 additions & 1 deletion README.md
@@ -10,7 +10,6 @@ The Centralized Logging with OpenSearch solution provides comprehensive log mana
- [Architecture](#architecture)
- [Deployment](#deployment)
- [Customization](#customization)
- [Collection of operational metrics](#collection-of-operational-metrics)


## Solution Overview
92 changes: 92 additions & 0 deletions docs/en/implementation-guide/trouble-shooting.md
@@ -186,3 +186,95 @@ echo /usr/local/openssl11/lib/ >> /etc/ld.so.conf
ldconfig
```

## I found that the OpenSearch data node's disk space was full, so I executed `DELETE index_prefix*` from the Dev Tools in OpenSearch Dashboards. After execution, the index name suffix no longer contains the time format. How do I fix this?

!!! Warning "Note"

The following operation will delete the index currently being written to, resulting in data loss.

1. Open the Centralized Logging with OpenSearch console, find the pipeline that has this issue, and choose **View details**.
2. Go to **Monitoring** > **Lambda Processor**, and click the link (starting with `/aws/lambda/CL-xxx`) under **Lambda Processor**.

![](../images/trouble-shooting/lambda-link.png)

3. Go to the **Lambda** console > **Configuration** > **Concurrency**, choose **Edit**, select **Reserve concurrency**, and set it to 0.

![](../images/trouble-shooting/lambda-configuration-concurrency.png)

![](../images/trouble-shooting/lambda-edit-concurrency.png)

4. Open OpenSearch Dashboards, go to **Dev Tools**, enter `DELETE your_index_name`, and send the request.

![](../images/trouble-shooting/aos-dev-tools.png)

![](../images/trouble-shooting/delete_index.png)

5. Enter `GET _cat/indices/your_index_name` and send the request. If the returned result has a **"status"** of 404 and a **"type"** of `index_not_found_exception`, the deletion succeeded. Otherwise, repeat step 4.

![](../images/trouble-shooting/cat_index.png)

6. Enter `POST /your_index_name/_rollover` and send the request.

7. Go to the **Lambda** console > **Configuration** > **Concurrency**, choose **Edit**, select **Reserve concurrency** and set it to the value you want, or select **Use unreserved account concurrency**, then save. The requests from steps 4 to 6 are consolidated in the sketch below.
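
For convenience, here is a minimal sketch of the Dev Tools requests from steps 4 to 6; `your_index_name` is a placeholder for the index you deleted. If you prefer the command line for steps 3 and 7, reserved concurrency can also be set with `aws lambda put-function-concurrency --function-name <processor-name> --reserved-concurrent-executions 0` and removed with `aws lambda delete-function-concurrency --function-name <processor-name>`.

```
# Step 4: delete the index that lost its time-formatted suffix
DELETE your_index_name

# Step 5: verify the deletion; expect a 404 with "index_not_found_exception"
GET _cat/indices/your_index_name

# Step 6: roll over so a new write index with a time-formatted suffix is created
POST /your_index_name/_rollover
```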

## Standard Operating Procedure for Proxy Stack Connection Problems

### When I access OpenSearch Dashboards through the proxy, the browser shows a 504 Gateway Timeout

##### Possible root causes:

a. The proxy instances keep terminating and initializing:

   i. Wrong security group

b. The instances are not terminating:

   i. VPC peering request not accepted

   ii. Peering with the wrong VPC

   iii. Route table has the wrong routes

c. In either case, check whether VPC peering is working (see the sketch below).
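
As a starting point, the following AWS CLI sketch checks the peering status and the route tables; the VPC ID is a placeholder for the VPC that hosts the proxy instances.

```
# List VPC peering connections; a healthy peering reports a status of "active"
aws ec2 describe-vpc-peering-connections \
    --query 'VpcPeeringConnections[].{Id:VpcPeeringConnectionId,Status:Status.Code,Requester:RequesterVpcInfo.VpcId,Accepter:AccepterVpcInfo.VpcId}' \
    --output table

# Inspect the routes of the proxy VPC; look for a route to the peer CIDR
# whose target is the peering connection (pcx-...)
aws ec2 describe-route-tables \
    --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
    --query 'RouteTables[].Routes' --output json
```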

### When I access OpenSearch Dashboards through the proxy, the browser shows "Site can't be reached"

![](../images/trouble-shooting/site_cannt_be_reached.png)

##### Possible root causes:

1. The Application Load Balancer is deployed inside a private subnet.

2. The proxy stack has just been redeployed; it takes at least 15 minutes for DNS servers to resolve the new Load Balancer endpoint address.


##### Solutions:

1. If the ALB is deployed in the wrong location, delete the proxy stack and create a new one.

2. Wait for about 15 minutes, then retry. You can check DNS propagation as shown below.
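
To check whether DNS has caught up after a redeployment, resolve the proxy's domain name; the hostnames below are placeholders for your own proxy domain and ALB endpoint.

```
# Resolve the custom domain configured for the proxy
nslookup clo-proxy.example.com

# Or query the ALB's DNS name directly; once it returns addresses,
# the redeployed proxy should be reachable
dig +short CL-Proxy-1234567890.us-east-1.elb.amazonaws.com
```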

## I set the log collection path to `/log_path/*.log`. What will be the impact?

!!! Warning "Note"

Normally we don't recommend using the wildcard `*` as a prefix for matching logs. If there are hundreds or even thousands of files in the directory, Fluent Bit's log collection rate will be seriously affected, so we recommend removing outdated files on a regular basis (see the sketch below).
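
A minimal housekeeping sketch, assuming log files older than seven days can be discarded; adjust the path and retention period to your environment.

```
# Delete *.log files under /log_path that have not been modified in 7 days
find /log_path -maxdepth 1 -name '*.log' -mtime +7 -delete

# Example cron entry to run the cleanup daily at 03:00
# 0 3 * * * find /log_path -maxdepth 1 -name '*.log' -mtime +7 -delete
```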

## The log file names are the same across different systems, but the log path contains the system name to differentiate them. I want to create a single pipeline to handle this. How should I set the log path?

!!! Info "Note"

#### Let's go through an example:

Suppose we have three environments: dev, staging, and prod. The log paths are `/log_path/dev/jvm.log`, `/log_path/staging/jvm.log`, and `/log_path/prod/jvm.log`. To cover all three with a single pipeline, set the log path as follows:

![](../images/trouble-shooting/log_path.png)

`/log_path/*/jvm.log`.

## In an EKS environment, I am using DaemonSet mode to collect logs, but my logs are not written to standard output. How should I configure the YAML file for deployment?

When you create a pipeline in CLO with EKS as the selected log source, the system automatically generates content in YAML format to help you create the deployment file for Fluent Bit. In that YAML, set the log path to match your files (for example `/your_log_path/`) and remove the `Parser cri_regex` line. Refer to the following screenshot for details, and the sketch after it:

![](../images/trouble-shooting/without_cri_log.png)
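
A minimal sketch of what the edited tail input might look like; the tag, path, and buffer values are placeholders, and your generated YAML will differ in the surrounding fields.

```
[INPUT]
    Name                tail
    Tag                 app.logs
    # Point Path at your application log files instead of the container stdout path
    Path                /your_log_path/*.log
    DB                  /var/fluent-bit/state/flb_app.db
    Mem_Buf_Limit       30M
    Skip_Long_Lines     On
    Refresh_Interval    10
    # The generated "Parser cri_regex" line is removed here, because these are
    # plain application logs rather than CRI-formatted container stdout logs
```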

Binary file added docs/images/trouble-shooting/aos-dev-tools.png
Binary file added docs/images/trouble-shooting/cat_index.png
Binary file added docs/images/trouble-shooting/delete_index.png
Binary file added docs/images/trouble-shooting/lambda-configuration-concurrency.png
Binary file added docs/images/trouble-shooting/lambda-edit-concurrency.png
Binary file added docs/images/trouble-shooting/lambda-link.png
Binary file added docs/images/trouble-shooting/log_path.png
Binary file added docs/images/trouble-shooting/site_cannt_be_reached.png
Binary file added docs/images/trouble-shooting/without_cri_log.png
@@ -8,7 +8,7 @@
from commonlib.logging import get_logger
from typing import List
import json
-from distutils.util import strtobool
+from commonlib.utils import strtobool
from commonlib.model import (
AppLogIngestion,
LogConfig,
@@ -22,6 +22,7 @@
GroupPlatformEnum,
)
from commonlib.dao import InstanceDao
from commonlib import AWSConnection
from commonlib.exception import APIException, ErrorCode
from jinja2 import FileSystemLoader, Environment
from flb.flb_model import FluentBitDataPipeline
@@ -53,6 +54,8 @@

instance_table_name = os.environ.get("INSTANCE_TABLE_NAME")
instance_dao = InstanceDao(table_name=instance_table_name)
conn = AWSConnection()
ssm_cli = conn.get_client("ssm", region_name=default_region)


class FluentBitDataPipelineBuilder(object):
@@ -285,6 +288,63 @@ def _get_os(self, instance_id: str) -> GroupPlatformEnum:
return ec2_instance.platformType
return GroupPlatformEnum.LINUX

    def _get_flb_params(self):
        # Read the Fluent Bit tuning parameters that the solution stores in
        # SSM Parameter Store under /{stack_prefix}/FLB/.
        log_level = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/log_level", WithDecryption=True
        )["Parameter"]["Value"]

        flush = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/flush", WithDecryption=True
        )["Parameter"]["Value"]

        mem_buf_limit = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/mem_buf_limit", WithDecryption=True
        )["Parameter"]["Value"]

        buffer_chunk_size = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/buffer_chunk_size", WithDecryption=True
        )["Parameter"]["Value"]

        buffer_max_size = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/buffer_max_size", WithDecryption=True
        )["Parameter"]["Value"]

        buffer_size = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/buffer_size", WithDecryption=True
        )["Parameter"]["Value"]

        retry_limit = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/retry_limit", WithDecryption=True
        )["Parameter"]["Value"]

        store_dir_limit_size = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/store_dir_limit_size", WithDecryption=True
        )["Parameter"]["Value"]

        storage_type = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/storage_type",
            WithDecryption=True,
        )["Parameter"]["Value"]

        storage_pause_on_chunks_overlimit = ssm_cli.get_parameter(
            Name=f"/{stack_prefix}/FLB/storage_pause_on_chunks_overlimit",
            WithDecryption=True,
        )["Parameter"]["Value"]

        flb_params = {
            "log_level": log_level,
            "flush": flush,
            "mem_buf_limit": mem_buf_limit,
            "buffer_chunk_size": buffer_chunk_size,
            "buffer_max_size": buffer_max_size,
            "buffer_size": buffer_size,
            "retry_limit": retry_limit,
            "store_dir_limit_size": store_dir_limit_size,
            "storage_type": storage_type,
            "storage_pause_on_chunks_overlimit": storage_pause_on_chunks_overlimit,
        }
        return flb_params


class InstanceFlb(Flb):
    def __init__(self, sub_account_cwl_monitor_role_arn: str = cwl_monitor_role_arn):
Expand All @@ -303,11 +363,14 @@ def build_instance_data_pipelines(self, instance_with_ingestion_list: dict):
    def get_flb_conf_content(self, content_type="parser"):
        instance_content = dict()
        if len(self._instance_flb_pipelines) > 0:
            flb_params = self._get_flb_params()
            content_template = self._template_env.get_template(f"{content_type}.conf")
            for key, value in self._instance_flb_pipelines.items():
                params = dict()
                params["flb_data_pipelines"] = value

                # Getting customized parameters from SSM
                params["ssm_params"] = flb_params
                # build cwl monitor param
                params["region"] = default_region
                params["stack_prefix"] = stack_prefix
@@ -371,6 +434,9 @@ def generate_deployment_content(self) -> str:
        template_file = f"./k8s-{self._eks_source.deploymentKind}.conf"
        k8s_template = self._template_env.get_template(template_file)
        params = dict()
        # Getting customized parameters from SSM
        params["ssm_params"] = self._get_flb_params()

        params["env"] = DeploymentEnvEnum.EKSCluster.value
        params["eks_cluster_name"] = self._eks_source.eksClusterName
        params["svc_acct_role"] = self._eks_source.logAgentRoleArn
@@ -1,7 +1,7 @@
{{placeholder}}[SERVICE]
-{{placeholder}} Flush 5
+{{placeholder}} Flush {{ssm_params.flush}}
 {{placeholder}} Daemon off
-{{placeholder}} Log_level Info
+{{placeholder}} Log_level {{ssm_params.log_level}}
{{placeholder}} Http_server On
{{placeholder}} Http_listen 0.0.0.0
{{placeholder}} Http_port 2022
@@ -59,16 +59,17 @@
{{placeholder}} DB /var/fluent-bit/state/flb_container-{{item.tag}}.db
{% endif %}
{{placeholder}} DB.locking True
-{{placeholder}} Mem_Buf_Limit {{item.mem_buf_limit}}
+{{placeholder}} Mem_Buf_Limit {{ssm_params.mem_buf_limit}}
{{placeholder}} # Since "Skip_Long_Lines" is set to "On", be sure to adjust the "Buffer_Chunk_Size","Buffer_Max_Size" according to the actual log. If the parameters are adjusted too much, the number of duplicate records will increase. If the value is too small, data will be lost.
{{placeholder}} # https://docs.fluentbit.io/manual/pipeline/inputs/tail
-{{placeholder}} Buffer_Chunk_Size 512k
-{{placeholder}} Buffer_Max_Size 5M
+{{placeholder}} Buffer_Chunk_Size {{ssm_params.buffer_chunk_size}}
+{{placeholder}} Buffer_Max_Size {{ssm_params.buffer_max_size}}
{{placeholder}} Skip_Long_Lines On
{{placeholder}} Skip_Empty_Lines On
{{placeholder}} Refresh_Interval 10
{{placeholder}} Rotate_Wait 30
-{{placeholder}} storage.type filesystem
+{{placeholder}} storage.type {{ssm_params.storage_type}}
+{{placeholder}} storage.pause_on_chunks_overlimit {{ssm_params.storage_pause_on_chunks_overlimit}}
{{placeholder}} Read_from_Head False
{{placeholder}} Path_Key file_name
{{placeholder}} Path {{item.tail.logPath}}
@@ -157,7 +158,7 @@
{{placeholder}} Merge_Log_Key log_processed
{{placeholder}} K8S-Logging.Parser On
{{placeholder}} K8S-Logging.Exclude Off
-{{placeholder}} Buffer_Size 0
+{{placeholder}} Buffer_Size {{ssm_params.buffer_size}}
{{placeholder}} Use_Kubelet True
{{placeholder}} Kubelet_Port 10250
{{placeholder}} Labels On
@@ -175,7 +176,7 @@
{% if item.parser.time_format and item.parser.time_format!='""' %}
{{placeholder}} Time_key_format %Y-%m-%dT%H:%M:%S.%LZ
{% endif %}
-{{placeholder}} Retry_Limit False
+{{placeholder}} Retry_Limit {{ssm_params.retry_limit}}
{{placeholder}} Role_arn {{item.role_arn}}
{{placeholder}}
{% elif item.output_name=='MSK' %}
@@ -192,7 +193,7 @@
{{placeholder}} rdkafka.acks -1
{{placeholder}} rdkafka.compression.type snappy
{{placeholder}} rdkafka.security.protocol plaintext
-{{placeholder}} Retry_Limit False
+{{placeholder}} Retry_Limit {{ssm_params.retry_limit}}
{{placeholder}}
{% elif item.output_name=='S3' %}
{{placeholder}}[OUTPUT]
@@ -203,6 +204,7 @@
{{placeholder}} region {{item.region_name}}
{{placeholder}} total_file_size {{item.s3.max_file_size}}M
{{placeholder}} upload_timeout {{item.s3.upload_timeout}}s
{{placeholder}} store_dir_limit_size {{ssm_params.store_dir_limit_size}}
{{placeholder}} use_put_object true
{% if item.s3.compression_type | lower == "gzip" %}
{{placeholder}} s3_key_format /{{item.s3.prefix}}/%Y-%m-%d-%H-%M-%S-$UUID.gz
@@ -218,7 +220,7 @@
{{placeholder}} json_date_format iso8601
{% endif %}
{{placeholder}} tls.verify False
-{{placeholder}} Retry_Limit False
+{{placeholder}} Retry_Limit {{ssm_params.retry_limit}}
{{placeholder}} Role_arn {{item.role_arn}}
{{placeholder}}
{% else %}
@@ -229,7 +231,7 @@
{{placeholder}} AWS_Region {{item.region_name}}
{{placeholder}} Host {{item.aos.endpoint}}
{{placeholder}} Port 443
-{{placeholder}} Retry_Limit False
+{{placeholder}} Retry_Limit {{ssm_params.retry_limit}}
{{placeholder}} AWS_Auth On
{{placeholder}} TLS On
{{placeholder}} Suppress_Type_Name On
