Move support section (#2341)
Co-authored-by: Luca <[email protected]>
Co-authored-by: Andreas Sommer <[email protected]>
3 people authored Nov 29, 2024
1 parent 5437da5 commit e61a424
Showing 19 changed files with 287 additions and 620 deletions.
3 changes: 3 additions & 0 deletions .vale/styles/config/vocabularies/docs/accept.txt
@@ -32,6 +32,7 @@ firewalling
freely
Grafana
gsctl
Honeybadger
HTTP
IP[s]?
IPAM
@@ -52,6 +53,7 @@ NLB[s]?
onboarding
passthrough
perfectly
Planeteers
Promtail
quickly
randomly
@@ -62,6 +64,7 @@ runbook[s]?
SemVer
separately
Sigstore
SLA[s]?
Spotify
SSL
subnet
4 changes: 2 additions & 2 deletions src/config.yaml
@@ -39,9 +39,9 @@ menu:
name: Use the API
url: /vintage/use-the-api/
weight: 40
- identifier: support-training
- identifier: support
name: Support & Training
url: /vintage/support/
url: /support/
weight: 50
- identifier: changes
name: Changes and releases
4 changes: 2 additions & 2 deletions src/content/support/_index.md
@@ -1,8 +1,8 @@
---
title: Support
description: How does our support is designed making sure that customer platforms are up to date and functional 24/7.
description: How our support model works and how our operations team handles incidents.
weight: 40
last_review_date: 2024-03-07
last_review_date: 2024-11-28
owner:
- https://github.com/orgs/giantswarm/teams/sig-docs
---
119 changes: 119 additions & 0 deletions src/content/support/incident-process/index.md
@@ -0,0 +1,119 @@
---
title: Incident process
description: The process used by the Giant Swarm support team when a priority one incident is called.
weight: 100
aliases:
- /support/p1-process
menu:
principal:
parent: support
identifier: support-incident
user_questions:
- What process does Giant Swarm follow in case of critical incidents?
owner:
- https://github.com/orgs/giantswarm/teams/teddyfriends
last_review_date: 2024-11-26
---

After years of handling critical enterprise workloads in production, Giant Swarm has strengthened the incident process based on valuable learnings. This document focuses on critical incidents, called `Priority 1` (P1) incidents, although some steps may also apply to regular incidents.

Giant Swarm classifies incidents as either critical (`P1`) or routine (`P2`). Critical incidents impair a customer's production system, while routine incidents don't impact production and follow a straightforward process.

## Separation of responsibilities

It’s important to ensure that everyone involved in an incident knows their role and what's expected of them, without conflicting with others' responsibilities. Somewhat counterintuitively, a clear separation of responsibilities allows individuals more autonomy, as they don't need to constantly coordinate actions.

### Roles

At Giant Swarm, two roles are defined: `Incident Coordinators` and `Operations Engineers`.

### Incident coordinator

The `Incident Coordinator` maintains the high-level overview of the incident. Structuring the incident response, the coordinator assigns responsibilities according to need and priority. By default, the coordinator holds all positions/responsibilities not delegated. If necessary, the coordinator can remove roadblocks that prevent operations engineers from working effectively.

The coordinator is the public face of the incident response, responsible for issuing periodic updates to all involved teams—both customer teams and within Giant Swarm—acting as the bridge between customer and team. The coordinator will need to be present in the situation rooms of customers. For this reason, an [Opsgenie team](https://support.atlassian.com/opsgenie/docs/what-are-teams-in-opsgenie/) groups all members who can act as incident coordinators.

When there is a dedicated coordinator assigned to an incident, this person isn’t debugging systems but focuses on coordinating the team and managing customer communication.

### Operations engineer

The `Operations Engineer` works with the coordinator to respond to the incident and is responsible for debugging and applying changes to the system.

Our teams are on call in [`Opsgenie`](https://support.atlassian.com/opsgenie/docs/what-are-teams-in-opsgenie/) in case an incident is triggered at any point in time.

## Incident process

Inspired by the well-known [Incident Command System](https://en.wikipedia.org/wiki/Incident_Command_System) used by US firefighters, the process is adapted to manage developer platforms.

The main tenet is to have a simple process integrated with incident tooling ([Incident.io](https://incident.io/)) to simplify life for engineers. Once a critical incident is declared, the process should guide actions without needing to read instructions.

The process is broken down into these steps:

1. [Identify](#identify)
2. [Investigate](#investigate)
3. [Fixing](#fixing)
4. [Monitoring](#monitoring)
5. [Closing up](#closing-up)

### Identify

The first step is to identify the incident and understand its impact and severity. There are three possible sources:

1. Alert received pointing to an impacted production system
2. Customer reaches out via Slack
3. Customer sends an urgent email

For the first two options, the engineer [declares an incident](https://help.incident.io/en/articles/5947915-declaring-incidents) using the `Slack` shortcut directly on the alert or customer message in the communication channel. The shortcut opens a pop-up to enter the details of the incident, such as its name, whether it's a live incident or a triage, severity, summary, and the affected customer.

![Incident.io Shortcut Popup Screenshot](shortcut_screenshot.png)

If the incident comes from an urgent email, [incident.io](https://incident.io/) automatically creates a channel for the incident and notifies the person on call. The incident is created in `triage` so the Operations Engineer needs to confirm the severity of the issue before triggering the P1 process. Often, the customer provides a call link to join and confirm the problem.

__Note__: For `triage` incidents, a [decision workflow](https://incident.io/blog/using-decision-flows) is designed to help engineers decide the severity of an incident.

Once the `P1` criticality is confirmed, [incident.io](https://incident.io/) triggers a set of [workflows](https://help.incident.io/en/articles/6971329-getting-started-with-workflows) to drive the incident. These workflows include:

- `Escalation Matrix`, displaying different customer contacts to call in an emergency.
- `Role assignment`, automatically assigning the Operations Engineer role to the person reporting the incident.
- `Ping people on call`, notifying colleagues who are on call when the incident is created automatically from an urgent email.

For `P1` incidents, the first step is to build the team. Often, the engineer creating the channel isn't part of the [Incident Coordinators Group](https://giantswarm.app.opsgenie.com/teams/dashboard/f02504a3-83d4-4ea8-b55c-8c67756f9b2e/main), so escalation is needed to involve someone from that team. When creating an incident channel, [incident.io](https://incident.io/) provides a button to escalate and select the coordinator schedule from `Opsgenie`. At least a two-person team is needed to manage a critical incident (communications and operations).

![Incident.io Escalate Screenshot](escalate_screenshot.png)

### Investigate

Once the team is built, the person assigned to the `Operations Engineer` role carries on with the investigation. The incident coordinator stays in contact with the customer, via messaging or in a call, providing information to the Operations Engineer to aid the investigation.

__Note__: In exceptional cases, the person who acknowledges the alert can manage communication and fix the problem simultaneously, but in such cases, ensure the customer is aware of the measures implemented to solve the issue.

Operations Engineers focus on the investigation, but 30-minute intervals are established to update the customer on the current state. Findings are shared in the channel, and the coordinator can pin these messages to help track actions performed.

If the coordinator needs more responders, escalation to more team members is possible using [incident.io](https://incident.io/) command `/inc escalate`.

By default, every 30 minutes [incident.io](https://incident.io/) will remind the coordinator to share updates with the customer or report any progress on the incident channel.

### Fixing

After identifying the root cause, a solution is implemented to prevent further downtime for the customer's service. Often, the solution is temporary and replaced once the actual fix is rolled out to the platform. Once the cause is identified and the problem is being fixed, the coordinator updates the incident channel status to `Fixing` (using `/inc update`). The same command can be used to update the summary with any progress.
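
As a rough sketch of how these commands look in the incident channel, the following shows typical usage; the exact prompts that [incident.io](https://incident.io/) opens after each command are an assumption here, not a verbatim transcript:

```text
/inc escalate   # page additional responders, picking a schedule from Opsgenie
/inc update     # set the status (for example, Fixing) and refresh the incident summary
```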

### Monitoring

Once the fix or workaround is implemented, the coordinator communicates with the customer and moves the incident status to the monitoring phase, where the team remains on standby. The engineer monitors metrics and communication channels to confirm no regression. The incident remains in this state for a period, typically a day or two, until agreement with the customer confirms no regression.

### Closing up

Closing the incident doesn’t mean the work is done. The coordinator creates a [`Postmortem`](https://docs.giantswarm.io/support/overview/#postmortem-process) document detailing all information collected during the incident and shares it with the customer. The [incident.io](https://incident.io/) functionality allows generating a `Google` document as a postmortem, filling most parts with metadata and pinned messages gathered during the incident. The dedicated account engineer for the customer will review it and seek feedback from any incident participants.

Any remaining follow-up items are converted into GitHub tickets by the coordinator and moved to the product teams to improve service and avoid repeating mistakes.

### Diagram

The entire workflow is visualized below for better understanding, focusing on the most common scenario and excluding edge cases, as those are exceptions.

![Incident Workflow](p1_flow_diagram.jpg)

## More info

- This process takes inspiration from the [`incident commander` role](https://en.wikipedia.org/wiki/Incident_commander).
- [Incident shortcut cheatsheet](https://help.incident.io/en/articles/5948163-shortcuts-cheatsheet)
68 changes: 68 additions & 0 deletions src/content/support/overview/index.md
@@ -0,0 +1,68 @@
---
linkTitle: Overview
title: Customer support
description: The support we provide is an essential part of our offering. Here we explain various support service processes and workflows.
weight: 10
last_review_date: 2024-11-25
user_questions:
- What should I know when working with Giant Swarm's support staff?
- How is Giant Swarm organizing support?
menu:
principal:
parent: support
identifier: support-overview
owner:
- https://github.com/orgs/giantswarm/teams/team-planeteers
---

Giant Swarm has developed a custom support model to offer exceptional assistance by defining different layers of support.

## Direct support via Slack

Our first level of support involves close interactions via Slack, ensuring bi-directional feedback as quickly as possible. This channel is also used to answer a variety of questions, which can extend beyond the platform to include anything cloud-native.

The first level support is available from 08:00 to 18:00 (CET), Monday to Friday. With a distributed team across the world, questions are often answered outside these hours. Support shifts rotate across teams, focusing on channels with clear internal handovers.

In case the first line support is unable to resolve your request, it's escalated to an engineer from the team responsible for the component or application in question. This is managed through a 24-hour rotating shift.

## Project management

The shared goal is to build a developer platform together. As the expert in your company's domain, you are supported in creating a valuable experience for your developers. This collaboration involves creating a roadmap and defining the necessary milestones. At the same time, our Solution Architects have created a [training program]({{< relref "/support/training-catalog" >}}) to help you get the most out of the platform.

Every customer is assigned a dedicated `Account Engineer` who holds regular sync meetings to discuss project progress, address any blockers, or manage changes in requirements. This go-to person provides additional support and acts as a backup if the first line support is overloaded.

## Operational support

Our teams follow the `DevOps` principle `You build it, you run it`, and for that reason each part of the platform is managed by a different team.

On-call engineers monitor all alerts from environments where your workloads run. These engineers are available every day, ensuring that issues are addressed as quickly as possible, even during nights and weekends.

Currently, the mean time to acknowledge an alert is around two minutes, with incident resolution typically taking less than two hours. Not all alerts result in downtime; alerts are configured to resolve issues before they lead to actual incidents.

Additionally, you have a dedicated email address to contact the on-call engineer at any time, for cases where you notice problems that haven't been detected by monitoring.

### Fully monitored platform

The Giant Swarm platform includes a monitoring and alerting system that helps the operations team maintain Service Level Agreements (SLAs) across all customer platforms.

The monitoring stack observes the platform's underlying infrastructure, including the networking layer, DNS resolution, Kubernetes core components, cloud providers, and other targets, providing a complete view of system health.

Applications running on top of the platform, offering observability, connectivity, or security, are instrumented to expose metrics to the monitoring system, ensuring continuous operation.
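
To illustrate the kind of rule that sits behind these alerts, here is a minimal sketch in the style of a Prometheus alerting rule, assuming a Prometheus-compatible monitoring stack; the metric names, thresholds, and labels are placeholders for illustration and not the actual Giant Swarm rules:

```yaml
# Illustrative sketch only: a minimal Prometheus-style alerting rule for a core component.
groups:
  - name: kubernetes-core.rules
    rules:
      - alert: KubeAPIServerDown
        # Fires when no API server target has been successfully scraped for 5 minutes.
        expr: absent(up{job="apiserver"} == 1)
        for: 5m
        labels:
          severity: page        # routes the alert to the on-call engineer
        annotations:
          summary: "Kubernetes API server is unreachable"
          runbook: "https://example.com/runbooks/kube-apiserver-down"  # placeholder URL
```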

### Incident handling

Whenever an on-call engineer receives an alert from the monitoring platform, the incident process begins. A new Slack channel is created specifically for the issue, where information is gathered. Thanks to [incident.io](https://incident.io) automation, the ops team can quickly escalate an incident or generate a report for you (postmortem).

Treating major alerts as incidents offers many benefits: information is contained in a single channel, recurrent problems or trends can be identified to improve the support process, and there is a historical record of actions and information related to specific issues. More information about the incident process can be found in the [Giant Swarm incident process]({{< relref "/support/incident-process" >}}).

### Postmortem process

The `postmortem` culture, [created by Google](https://sre.google/sre-book/postmortem-culture/), is established to document problems correctly, find root causes, and fix them permanently across all installations. Every time an incident is closed, if it's not a false positive, a postmortem is created.

Postmortems are developed throughout the week. On Mondays, the product team meets to distribute postmortems across product teams. During each team's weekly sprint planning, a specific engineer is assigned to each postmortem. Postmortems take priority over feature development, with engineers dedicating at least one day a week to solving these problems.

A postmortem is only closed once the underlying issue is fixed and deployed to all affected environments. Additionally, postmortems often result in new or refined alerts and operational recipes to share knowledge across the operations team.

## The future

The support process is always evolving. As improvements are made, policies, processes, and tools are refined. However, the goals remain clear and steadfast: to provide seamless support to you, the customer.
84 changes: 84 additions & 0 deletions src/content/support/training-catalog/index.md
@@ -0,0 +1,84 @@
---
linkTitle: Training catalog
title: Giant Swarm training catalog
description: Overview of the trainings and workshops offered to customers to share our knowledge and best practices and answer possible follow-up questions.
weight: 20
menu:
principal:
parent: support
identifier: support-training
last_review_date: 2024-11-25
user_questions:
- Which trainings does Giant Swarm offer?
- How can I learn about Cloud Native topics?
- How can I fully utilize my Giant Swarm setup?
- How do I learn about the Giant Swarm platform?
- How do I get up to speed once I become a Giant Swarm customer?
owner:
- https://github.com/orgs/giantswarm/teams/team-planeteers
---

Giant Swarm perceives the cloud-native landscape as a journey, and it takes time and effort to get across it successfully. Based on the lessons learnt over the years of helping our customers, our teams have built a training set that guides customers to understand the principles and the tools. The training is why-focused rather than how-focused, meaning that the aim is to explain the reasons behind the tools and practices, rather than just how to use them.

### Getting Started with Giant Swarm

Giant Swarm offers a series of training sessions to help you get started with our platform. During the onboarding process, your account engineer will schedule these sessions. Our experts will present to you the following topics:

[_Team Planeteers_](https://www.giantswarm.io/about)

- Who is Giant Swarm?
- What's the support model?
- The onboarding into the web UI and the CLI - `kubectl gs`
- How does GS help platform teams serve application teams?
- What are best practices that we should aim for together?
- What's some useful "technical" knowledge? What differentiates Giant Swarm from other providers?

### Kubernetes 101

[_Team Planeteers_](https://www.giantswarm.io/about)

- High level overview of microservices
- What's `Kubernetes` and what are the reasons to use it
- Basic `Kubernetes` concepts
- Basic `Kubernetes` best practices

### GitOps 101

[_Team Honeybadger_](https://www.giantswarm.io/about)

- What's `GitOps`?
- What led us to `GitOps` in the first place?
- What are the advantages and reasons to work towards this method?
- Tooling - our recommendations
- Internal tour of how Giant Swarm automation and pipelines are set up
- An example of `GitOps` in action

### Cloud Native Security 101

[_Team Shield_](https://www.giantswarm.io/about)

- What does security mean in a cloud-native world?
- What are `Pod Security Standards`, `Network Policies`, `RBAC`, etc.?
- What needs to be considered in securing infrastructure outside of `Kubernetes` clusters?
- Tooling options for "Day 2" operations
- Our default security stack

### Monitoring and Observability 101

[_Team Atlas_](https://www.giantswarm.io/about)

- What's the advantage of observability on `Kubernetes`?
- Which tools do we provide and recommend? Why?
- How does Giant Swarm do alerting internally?
- What does the Giant Swarm monitoring stack look like?
- Monitoring infrastructure and applications

### Troubleshooting and Best Practices 101

[_Team Planeteers_](https://www.giantswarm.io/about)

- What are the most common mistakes we've seen in 10 years of experience?
- How do we make deployments Cloud Native?
- How do we ensure proper scalability?
- How do we debug issues in a `Kubernetes` cluster?
- Tips and tricks for day-to-day operations and debugging
16 changes: 0 additions & 16 deletions src/content/vintage/support/_index.md

This file was deleted.
