EC2 Ubuntu instance health check failure: SSM get correct "Online" Output response from instance, but UI shows Offline #267

Closed
owenCCY opened this issue Mar 14, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@owenCCY
Contributor

owenCCY commented Mar 14, 2024

Describe the bug

The customer is using CLO 2.1.1 (upgraded from CLO 1.0.x).
They have more than 100 instances.
They are running Ubuntu 22.04 with the Fluent Bit (flb) agent installed. SSM gets the correct "Online" output response from the instance, but the UI shows Offline.

Output:
{"fluent-bit":{"version":"1.9.10","edition":"Community","flags":["FLB_HAVE_IN_STORAGE_BACKLOG","FLB_HAVE_PARSER","FLB_HAVE_RECORD_ACCESSOR","FLB_HAVE_STREAM_PROCESSOR","FLB_HAVE_TLS","FLB_HAVE_OPENSSL","FLB_HAVE_METRICS","FLB_HAVE_AWS","FLB_HAVE_AWS_CREDENTIAL_PROCESS","FLB_HAVE_SIGNV4","FLB_HAVE_SQLDB","FLB_HAVE_METRICS","FLB_HAVE_HTTP_SERVER","FLB_HAVE_SYSTEMD","FLB_HAVE_VALGRIND","FLB_HAVE_FORK","FLB_HAVE_TIMESPEC_GET","FLB_HAVE_GMTOFF","FLB_HAVE_UNIX_SOCKET","FLB_HAVE_LIBYAML","FLB_HAVE_ATTRIBUTE_ALLOC_SIZE","FLB_HAVE_PROXY_GO","FLB_HAVE_JEMALLOC","FLB_HAVE_LIBBACKTRACE","FLB_HAVE_REGEX","FLB_HAVE_UTF8_ENCODER","FLB_HAVE_LUAJIT","FLB_HAVE_C_TLS","FLB_HAVE_ACCEPT4","FLB_HAVE_INOTIFY","FLB_HAVE_GETENTROPY","FLB_HAVE_GETENTROPY_SYS_RANDOM"]}}

Expected Behavior

Instance Online

Current Behavior

Instance Offline

Reproduction Steps

Use CLO 2.1.1, have more than one page of instances, install the agent on Ubuntu 22.04, then click "load more".

Possible Solution

No response

Additional Information/Context

No response

Solution Version

2.1.1

AWS Region. e.g., us-east-1

No response

Other information

No response

@owenCCY owenCCY added the bug Something isn't working label Mar 14, 2024
@owenCCY
Contributor Author

owenCCY commented Mar 15, 2024

Based on a check of the customer's environment, their frontend sends duplicate instance IDs to the SSM client, causing an API error:
An error occurred (DuplicateInstanceId) when calling the SendCommand operation:

In their UI, duplicate instances are listed in the frontend, so the getInstanceAgentStatus API call sends duplicate instance IDs, which causes the error above.

The solution applied for the customer:
Add dedupe code in CentralizedLogging-APIInstanceAPIInstanceAgentStatus
from: instance_list = args.get("instanceIds", list())
to: instance_list = list(set(args.get("instanceIds", list())))
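
A minimal sketch of the dedupe step above, assuming `args` is a plain dict as in the snippet. The helper name `dedupe_instance_ids` and the sample instance IDs are hypothetical and are not taken from the CLO code base. `list(set(...))` is enough to avoid the DuplicateInstanceId error; this variant uses `dict.fromkeys` instead so that the original order of the IDs is kept, which may or may not matter to the caller.

```python
def dedupe_instance_ids(args):
    """Return the instance IDs from args with duplicates removed.

    Hypothetical helper: dict.fromkeys keeps the first occurrence of
    each ID and preserves insertion order, unlike set().
    """
    return list(dict.fromkeys(args.get("instanceIds", [])))


# Sample input with repeated IDs, as a frontend listing duplicates might send.
args = {"instanceIds": ["i-0aaa", "i-0bbb", "i-0aaa", "i-0ccc", "i-0bbb"]}
instance_list = dedupe_instance_ids(args)
print(instance_list)  # ['i-0aaa', 'i-0bbb', 'i-0ccc']
```

Deduplicating on the API side only hides the symptom; the duplicate rows in the frontend list would still need a separate look.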

@owenCCY
Contributor Author

owenCCY commented Mar 15, 2024

Based on a reproduction test, we do not see the same issue in the release version (2.1.1 and above).

We will keep watching to see whether upgraded customers hit the same issue.

@owenCCY owenCCY closed this as completed Mar 15, 2024
@owenCCY owenCCY reopened this Mar 15, 2024
@Jiale-Fang

Jiale-Fang commented Jul 19, 2024

I am still encountering this issue. In the AWS Solution OpenSearch UI, the status alternates between "online" and "offline". I do see SSM run the command and successfully get the Fluent Bit information, but I am not sure why the UI cannot fetch that information.

CLO Version: 2.2.0
EC2 Operating System: Windows Server 2022


@James96315
Contributor

@Jiale-Fang, in v2.2.0 the auto-refresh interval is 3 minutes. We changed it to 10 seconds in v2.2.1, so you can upgrade to that version.

@Jiale-Fang

Jiale-Fang commented Jul 22, 2024

Thank you @James96315. Actually, my question is this: Fluent Bit is running fine on my Windows 2022 machine, but the AWS CLO console cannot reliably get the agent status. Sometimes it displays as online, and sometimes as offline. In the Systems Manager command history, I can see that the commands' output correctly returns Fluent Bit's version info.

@James96315
Contributor

@Jiale-Fang, can you upgrade CLO to the latest version? Also, what instance type are you using? If the instance type is t2.micro, the detection will be slower. If you still have questions, we can set up a time to take a look at it remotely!

@Jiale-Fang

Hi @James96315, thank you for your response. Unfortunately, I am unable to upgrade to version 2.2.1 at this time. I am currently using CLO version 2.2.0 and installing Fluent Bit on t3a.large instances. When I fetch the agent status for Windows instances, it sometimes displays as online and sometimes as offline. However, this doesn't affect my normal usage. Please feel free to contact me if you need more information. Thanks!
