Dogfood OpenEBS E2E failures by capturing useful information #144

Open
vharsh opened this issue Feb 2, 2022 · 1 comment
Labels
testing For testing related enhancements wontfix This will not be worked on

Comments

@vharsh
Member

vharsh commented Feb 2, 2022

Questions

  1. What should the goal of this tool be? Should it only point out the troubled areas, or also dump data from those areas, and can that data be trusted at face value?
  2. How many logs should the tool collect, if any? Just enough, or all of them, so that further debugging is done by grep-ing the output in an editor of choice rather than through back-and-forth commands?
  3. What should the baseline assumption for this tool be (it's turtles all the way down, so which turtle should be this tool's last one)? Is it a good idea to assume that K8s itself is healthy and managed perfectly by the admin?

Background

  • Right now we have very preliminary support for debugging cStor volumes; it would be good to think about something along the lines of debugging + creating a GitHub issue + dogfooding, etc.
  • Right now the cStor volume debugging just points to places that seem off. It would be good to plan and implement debugging in stages, i.e. narrow down the search space by pointing out what's right, what isn't & what may not be (see the sketch after this list):
    • Identify the list of things that need to be checked (is the storage engine replicated? should failing NDM agents affect this volume/pool?)
    • The K8s API server is up & healthy
    • The K8s kube-system components are up, the kubelet container (for certain setups) is up, and the node heartbeats of the concerned nodes look fine (are they alive and kicking, do they have any X-Pressure condition?)
    • Networking isn't down (important for replicated storage engines)
    • The relevant OpenEBS components are up (as identified in step 1)
  • There are some limitations to the tool; it might be hard to figure out (at first) whether the application is failing because of storage or vice versa.
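
A rough sketch, in Go with client-go, of what those staged checks could look like. The function names (checkControlPlane, checkNodes), the conditions inspected, and the kubeconfig handling are illustrative assumptions, not how openebsctl is actually structured:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// checkControlPlane verifies the API server answers and that kube-system
// pods are running before any storage-level checks are attempted.
func checkControlPlane(ctx context.Context, cs kubernetes.Interface) error {
	if _, err := cs.Discovery().ServerVersion(); err != nil {
		return fmt.Errorf("API server unreachable: %w", err)
	}
	pods, err := cs.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodSucceeded {
			fmt.Printf("kube-system pod %s is %s\n", p.Name, p.Status.Phase)
		}
	}
	return nil
}

// checkNodes reports nodes that are NotReady or under a pressure condition
// (MemoryPressure, DiskPressure, PIDPressure).
func checkNodes(ctx context.Context, cs kubernetes.Interface) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			notReady := c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue
			pressure := c.Type != corev1.NodeReady && c.Status == corev1.ConditionTrue
			if notReady || pressure {
				fmt.Printf("node %s: %s=%s\n", n.Name, c.Type, c.Status)
			}
		}
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	if err := checkControlPlane(ctx, cs); err != nil {
		panic(err)
	}
	if err := checkNodes(ctx, cs); err != nil {
		panic(err)
	}
	// Stage 3 (not shown): check only the OpenEBS components that step 1
	// identified as relevant for the affected volume/pool.
}
```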

Goals

  • OpenEBSctl can already show, in a single shot, some of the information we generally ask our community users for while interacting with them, and we plan to help them automatically create a GitHub issue via Ability to generate raise GitHub issues with required troubleshooting information. #39.
  • It might be a good idea to use the same tool to collect useful information on cluster destruction, which is likely what happens when an E2E test fails. It could replace a bunch of kubectl & shell commands (see the sketch after this list).
  • To be decided and updated
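
A minimal sketch of that "collect on E2E failure" idea: dump the logs of every container in the openebs namespace into a directory, in place of the ad-hoc kubectl/shell commands an E2E job would otherwise run. The namespace, output directory, and helper name collectLogs are assumptions for illustration:

```go
package main

import (
	"context"
	"io"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// collectLogs writes one log file per container of every pod in the given
// namespace, roughly what `kubectl logs` in a loop would produce.
func collectLogs(ctx context.Context, cs kubernetes.Interface, ns, outDir string) error {
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		return err
	}
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		for _, c := range p.Spec.Containers {
			req := cs.CoreV1().Pods(ns).GetLogs(p.Name, &corev1.PodLogOptions{Container: c.Name})
			stream, err := req.Stream(ctx)
			if err != nil {
				continue // container may not have started; skip rather than abort
			}
			f, err := os.Create(filepath.Join(outDir, p.Name+"_"+c.Name+".log"))
			if err != nil {
				stream.Close()
				return err
			}
			_, _ = io.Copy(f, stream)
			f.Close()
			stream.Close()
		}
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := collectLogs(context.Background(), cs, "openebs", "./e2e-dump"); err != nil {
		panic(err)
	}
}
```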

Pre-requisite issues for this task:

  1. Refactor describe output code such that it returns results instead of printing them out #143
  2. Ability to generate raise GitHub issues with required troubleshooting information. #39
@vharsh
Member Author

vharsh commented Feb 28, 2022

I'll have a chat with the E2E team about how useful this can become & what more enhancements can help it get there.

@Abhinandan-Purkait Abhinandan-Purkait added wontfix This will not be worked on testing For testing related enhancements labels Jun 3, 2024