DevOps Handbook

Martin Lupa edited this page Mar 8, 2024 · 8 revisions

This document uses "The DevOps Handbook" by Gene Kim et al. as a guide to document the actions we have taken to support what the authors call the Three Ways. The Three Ways are a set of principles that guide the DevOps mindset and aim to improve the efficiency and efficacy of the development process. They result from applying the most trusted principles from physical manufacturing and leadership to the IT value stream.

Important

Even though there might be areas where we have not implemented a specific process yet, they will still be mentioned so future edits to this Handbook can include them.

The First Way: The Principles of Flow

Make our work visible

To organise our work and make it visible, we decided to capitalise on GitHub's built-in Projects tab, which provides an adaptable spreadsheet for tracking work and integrates with Issues and Pull Requests. This tool provides comprehensive visibility into the current state of each ticket and an easy-to-understand interface for developers and stakeholders.

DevOps Team Planning spreadsheet screenshot

A new Issue is created to document a new requirement and is automatically shown in the Project board as a "Todo" item. When a team member is assigned to the issue, it is moved into the "In Progress" state. The development phase takes place, and when the issue is closed, it is automatically moved to the "Done" status in the board. Closing an issue as "not planned" takes the ticket to the "Aborted" status.
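The status rules above can be sketched as a small function. This is only an illustration of the mapping, not the board's actual automation: the function name, its parameters, and the "not_planned" reason string are assumptions made for the example.

```python
from enum import Enum
from typing import Optional, Sequence


class Status(Enum):
    TODO = "Todo"
    IN_PROGRESS = "In Progress"
    DONE = "Done"
    ABORTED = "Aborted"


def board_status(
    assignees: Sequence[str],
    closed: bool = False,
    close_reason: Optional[str] = None,  # hypothetical value: "not_planned"
) -> Status:
    """Return the board column for an issue, following the rules described above."""
    if closed:
        # Closing as "not planned" aborts the ticket; any other close marks it done.
        return Status.ABORTED if close_reason == "not_planned" else Status.DONE
    # An open issue moves to "In Progress" as soon as someone is assigned.
    return Status.IN_PROGRESS if assignees else Status.TODO
```

For example, `board_status(["alice"])` yields `Status.IN_PROGRESS`, while an unassigned open issue stays in `Status.TODO`.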

Limit work in progress (WIP)

We have not implemented a limit on the WIP carried by each developer at a given time, but a started ticket should ideally be finished before starting a new one. This helps avoid context switching and multitasking, which are known to reduce productivity.
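If we later decide to enforce a limit, a check along the following lines could flag developers who carry too many tickets at once. This is a hypothetical sketch: the default limit of one ticket, the function name, and the (ticket, developer) input format are all assumptions, and nothing like this is in place today.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def over_wip_limit(
    in_progress: Iterable[Tuple[int, str]],  # (ticket number, developer) pairs
    limit: int = 1,  # assumed limit, not an agreed team policy
) -> Dict[str, int]:
    """Return developers whose number of "In Progress" tickets exceeds the limit."""
    counts = Counter(developer for _, developer in in_progress)
    return {developer: n for developer, n in counts.items() if n > limit}
```

The input could be built from the "In Progress" column of the Project board.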

Reduce batch sizes

The team has implicitly adopted a mindset that tries to reduce batch sizes. For example, when migrating the app from Python to ASP.NET, a first controller (AppController) was created and a basic "Index" endpoint was tested before moving on to the next endpoints.

Moreover, the chosen development flow (Git Flow) supports the use of feature branches. By splitting a big task into multiple smaller features, we allow multiple developers to work on smaller and more manageable tasks in parallel, reducing the risk entailed by developing bigger features.

New features are released manually every week through GitHub's Release feature. This could potentially be automated in the future.

Reduce the number of handoffs

Ideally, a ticket should be started and finished by the same developer(s) who were assigned to it. Reducing or eliminating handoffs helps avoid knowledge loss and unnecessary documentation.

Continually identify and elevate our constraints

  • Code deployment:

We are currently automating our deployment process. As a countermeasure to this constraint, we have two workflows, deployment (main branch) and preproduction-deployment (develop branch, used for staging), that use GitHub Actions to trigger a deployment to Digital Ocean.

  • Test setup and run:

Both previously mentioned workflows will integrate a job that runs the tests automatically when triggered.

  • Overly tight architecture:

We are introducing an ORM (object-relational mapper) on top of our database. This will decouple the database stack from the application logic and make it easier to switch stacks in the future if needed.
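The two workflows described under "Code deployment" and "Test setup and run" can be summarised as a routing rule from branch to environment. The sketch below is an illustration in Python, not the actual GitHub Actions configuration: the environment names and the function are hypothetical, and it assumes that a failing test job blocks the deployment.

```python
from typing import Callable, Optional

# Branch-to-environment mapping described above; the environment names are illustrative.
BRANCH_ENVIRONMENTS = {
    "main": "production",
    "develop": "preproduction",
}


def plan_deployment(branch: str, run_tests: Callable[[], bool]) -> Optional[str]:
    """Return the environment to deploy to, or None if nothing should be deployed."""
    environment = BRANCH_ENVIRONMENTS.get(branch)
    if environment is None:
        return None  # other branches (e.g. feature branches) do not deploy
    if not run_tests():
        return None  # assumption: failing tests block the deployment
    return environment
```

Passing the test run as a callable keeps the sketch independent of any particular test framework.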

Eliminate hardships and waste in our value stream

  • Partially done work

  • Extra processes:

Are we doing any work that does not add value to the customer?

  • Extra features:

Are we developing any feature outside of the course requirements? Are we prioritising hard requirements over extra features?

  • Task switching:

Do we feel that we need to switch tasks too often?

  • Waiting:

Do we identify any blockers? Are we communicating them in an open way so others can help us unblock them?

  • Motion:

Motion waste can be created when people who need to communicate frequently are not colocated. Handoffs also create motion waste and often require additional communication.

  • Defects:

Incorrect, missing, or unclear information, materials, or products create waste, as effort is needed to resolve these issues.

  • Nonstandard or manual work:

What other parts of the development process do we want to automate to reduce manual work?

  • Heroics:

Situations where individuals and teams are put in a position where they must perform unreasonable acts, which may become part of their daily work (e.g., nightly 2:00am problems in production).

Are we currently facing these situations? If we are, you are always welcome to reach out through our communication channels so we can reorganise the workload.

The Second Way: The Principles of Feedback

Working safely within complex systems

Do our current tests ensure that changes to the code base are safe to deploy to production?

The preproduction server is one element that we are implementing so that we have an opportunity to test changes before deploying to production.

See problems as they occur

What mechanisms do we have in place for monitoring and logging our systems in production? Do we have a mechanism to ensure feedback is incorporated into development practices?

We currently use the status page provided for all groups (http://206.81.24.116/status.html) to visualise problems and errors in our application. In the future we will add more tools to visualise them, such as Grafana for metrics, and logging tools. Digital Ocean also provides Monitoring and Resource alert tools; so far we use only the Resource alerts described below.

Screenshot of status for all groups page showing app errors by group

Digital Ocean's Droplets and Volumes do not auto-resize, but we have created two Resource alerts that will keep the development team well informed in case we need to scale up the available resources.

Digital Ocean's resource alerts screenshot
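Conceptually, a resource alert is a threshold check on a metric. The sketch below illustrates that idea only; it does not use Digital Ocean's API, and the metric names and threshold values are hypothetical, since this page does not state which metrics our two alerts watch.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ResourceAlert:
    metric: str               # hypothetical metric name, e.g. "cpu" or "disk"
    threshold_percent: float  # alert fires when usage rises above this value

    def is_triggered(self, observed_percent: float) -> bool:
        return observed_percent > self.threshold_percent


def triggered_alerts(
    alerts: Sequence[ResourceAlert], readings: Dict[str, float]
) -> List[str]:
    """Return the metrics whose current reading exceeds the alert threshold."""
    return [
        alert.metric
        for alert in alerts
        if alert.metric in readings and alert.is_triggered(readings[alert.metric])
    ]
```

A triggered alert is the signal that the team may need to scale up the Droplet or Volume manually.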

Swarm and solve problems to build new knowledge

The alerts mentioned under "See problems as they occur" trigger an email to some of the developers on the team, so they are informed in time to solve the issues.

Enable optimizing for downstream work centers

The Third Way: The Principles of Continual Learning and Experimentation

Enabling organisational learning and a safety culture

Institutionalise the improvement of daily work

Transform local discoveries into global improvements

Inject resilience patterns into our daily work

Leaders reinforce a learning culture

Triggering questions:

  1. What is our approach to fostering a culture that encourages experimentation and learning from failure?
    • This includes celebrating failures as learning opportunities.
  2. How do we allocate time and resources for learning new technologies or processes?
    • Dedicated time for exploration can be beneficial.
  3. What mechanisms do we have for sharing knowledge and learnings within the team and organization?
    • Knowledge sharing sessions, wikis, etc.
  4. How do we encourage and support contributions to open source, public speaking, and other forms of external knowledge sharing?
  5. What processes do we have for conducting post-mortems on failures or incidents?
    • Focus on blameless post-mortems to understand what happened and why.