Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: introduce nightly job for robustness test #658

Merged
merged 2 commits into from
Jan 3, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 17 additions & 0 deletions .github/workflows/robustness_nightly.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
---
name: Robustness Nightly
permissions: read-all
on:
schedule:
- cron: '25 9 * * *' # runs every day at 09:25 UTC
# workflow_dispatch enables manual testing of this job by maintainers
workflow_dispatch:

jobs:
main:
# GHA has a maximum amount of 6h execution time, we try to get done within 3h
uses: ./.github/workflows/robustness_template.yaml
with:
count: 100
testTimeout: 200m
runs-on: "['ubuntu-latest-8-cores']"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add an instance of this robustness job on ['actuated-arm64-8cpu-8gb']?

Happy for it to be done as a follow-up issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let me handle this. It might require to enable some kernel modules.

38 changes: 38 additions & 0 deletions .github/workflows/robustness_template.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
name: Reusable Robustness Workflow
on:
workflow_call:
inputs:
count:
required: true
type: number
testTimeout:
required: false
type: string
default: '30m'
runs-on:
required: false
type: string
default: "['ubuntu-latest']"
permissions: read-all

jobs:
test:
timeout-minutes: 210
runs-on: ${{ fromJson(inputs.runs-on) }}
steps:
- uses: actions/checkout@v4
- id: goversion
run: echo "goversion=$(cat .go-version)" >> "$GITHUB_OUTPUT"
- uses: actions/setup-go@v5
with:
go-version: ${{ steps.goversion.outputs.goversion }}
- name: test-robustness
run: |
set -euo pipefail

make gofail-enable

# build bbolt with failpoint
go install ./cmd/bbolt
sudo -E PATH=$PATH make ROBUSTNESS_TESTFLAGS="--count ${{ inputs.count }} --timeout ${{ inputs.testTimeout }} -failfast" test-robustness
18 changes: 5 additions & 13 deletions .github/workflows/robustness_test.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,16 +3,8 @@ on: [push, pull_request]
permissions: read-all
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- id: goversion
run: echo "goversion=$(cat .go-version)" >> "$GITHUB_OUTPUT"
- uses: actions/setup-go@v5
with:
go-version: ${{ steps.goversion.outputs.goversion }}
- run: |
make gofail-enable
# build bbolt with failpoint
go install ./cmd/bbolt
sudo -E PATH=$PATH make test-robustness
uses: ./.github/workflows/robustness_template.yaml
with:
count: 10
testTimeout: 30m
runs-on: "['ubuntu-latest-8-cores']"
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -84,4 +84,4 @@ test-failpoint:
.PHONY: test-robustness # Running robustness tests requires root permission
test-robustness:
go test -v ${TESTFLAGS} ./tests/dmflakey -test.root
go test -v ${TESTFLAGS} ./tests/robustness -test.root
go test -v ${TESTFLAGS} ${ROBUSTNESS_TESTFLAGS} ./tests/robustness -test.root
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice we don't have any graceful validation that whoever runs make test-robustness has actually run the pre-requisite make gofail-enable prior. Currently we just get cryptic error printed which folks might not understand at first glance, i.e:

=== RUN   TestRestartFromPowerFailure/fp_ext4_commit5s                                         
    powerfailure_test.go:112: start bbolt bench -work -path /tmp/TestRestartFromPowerFailurefp_ext4_commit5s797232614/002/boltdb -count=1000000000 -batch-size=5
    powerfailure_test.go:129: simulate power failure                                           
    powerfailure_test.go:134: random pick failpoint: resizeFileError                           
    powerfailure_test.go:168:                                                                  
                Error Trace:    /home/james/Documents/bbolt/tests/robustness/powerfailure_test.go:168
                                                        /home/james/Documents/bbolt/tests/robustness/powerfailure_test.go:135
                                                        /home/james/Documents/bbolt/tests/robustness/powerfailure_test.go:81
                Error:          Received unexpected error:                                     
                                Put "http://127.0.0.1:12345/resizeFileError": dial tcp 127.0.0.1:12345: connect: connection refused

In etcd-io/etcd we introduced https://github.com/etcd-io/etcd/blob/f8d5ba9a3fdff4f706f06ef0f0c0a27bb033ac53/tests/framework/integration/testing.go#L102. Perhaps we should introduce something similar here, not a blocker on this pr though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I build bbolt with failpoints. I think gofail.List doesn't work for this case. Let me think about how to handle.

5 changes: 5 additions & 0 deletions tests/dmflakey/dmflakey.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ import (
"fmt"
"os"
"os/exec"
"path"
"path/filepath"
"strings"
"time"
Expand Down Expand Up @@ -289,6 +290,10 @@ func createEmptyFSImage(imgPath string, fsType FSType) error {
return fmt.Errorf("failed to create image because %s already exists", imgPath)
}

if err := os.MkdirAll(path.Dir(imgPath), 0600); err != nil {
return fmt.Errorf("failed to ensure parent directory %s: %w", path.Dir(imgPath), err)
}

f, err := os.Create(imgPath)
if err != nil {
return fmt.Errorf("failed to create image %s: %w", imgPath, err)
Expand Down
85 changes: 80 additions & 5 deletions tests/robustness/powerfailure_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,11 @@ package robustness

import (
"bytes"
"crypto/rand"
"fmt"
"io"
"math"
"math/big"
"net/http"
"net/url"
"os"
Expand All @@ -23,9 +26,65 @@ import (
"golang.org/x/sys/unix"
)

var panicFailpoints = []string{
"beforeSyncDataPages",
"beforeSyncMetaPage",
"lackOfDiskSpace",
"mapError",
"resizeFileError",
"unmapError",
}

// TestRestartFromPowerFailure is to test data after unexpected power failure.
func TestRestartFromPowerFailure(t *testing.T) {
flakey := initFlakeyDevice(t, t.Name(), dmflakey.FSTypeEXT4, "")
for _, tc := range []struct {
name string
du time.Duration
fsMountOpt string
useFailpoint bool
}{
{
name: "fp_ext4_commit5s",
du: 5 * time.Second,
ahrtr marked this conversation as resolved.
Show resolved Hide resolved
fsMountOpt: "commit=5",
useFailpoint: true,
},
{
name: "fp_ext4_commit1s",
du: 10 * time.Second,
fsMountOpt: "commit=1",
useFailpoint: true,
},
{
name: "fp_ext4_commit1000s",
du: 10 * time.Second,
fsMountOpt: "commit=1000",
useFailpoint: true,
},
{
name: "kill_ext4_commit5s",
du: 5 * time.Second,
fsMountOpt: "commit=5",
},
{
name: "kill_ext4_commit1s",
du: 10 * time.Second,
fsMountOpt: "commit=1",
},
{
name: "kill_ext4_commit1000s",
du: 10 * time.Second,
fsMountOpt: "commit=1000",
},
} {
t.Run(tc.name, func(t *testing.T) {
doPowerFailure(t, tc.du, tc.fsMountOpt, tc.useFailpoint)
})
}
}

func doPowerFailure(t *testing.T, du time.Duration, fsMountOpt string, useFailpoint bool) {
flakey := initFlakeyDevice(t, strings.Replace(t.Name(), "/", "_", -1), dmflakey.FSTypeEXT4, fsMountOpt)
root := flakey.RootFS()

dbPath := filepath.Join(root, "boltdb")
Expand All @@ -38,6 +97,8 @@ func TestRestartFromPowerFailure(t *testing.T) {
}

logPath := filepath.Join(t.TempDir(), fmt.Sprintf("%s.log", t.Name()))
require.NoError(t, os.MkdirAll(path.Dir(logPath), 0600))

logFd, err := os.Create(logPath)
require.NoError(t, err)
defer logFd.Close()
Expand All @@ -64,10 +125,18 @@ func TestRestartFromPowerFailure(t *testing.T) {
}
}()

time.Sleep(time.Duration(time.Now().UnixNano()%5+1) * time.Second)
time.Sleep(du)
t.Logf("simulate power failure")

activeFailpoint(t, fpURL, "beforeSyncMetaPage", "panic")
if useFailpoint {
fpURL = "http://" + fpURL
targetFp := panicFailpoints[randomInt(t, math.MaxInt32)%len(panicFailpoints)]
t.Logf("random pick failpoint: %s", targetFp)
activeFailpoint(t, fpURL, targetFp, "panic")
} else {
t.Log("kill bbolt")
assert.NoError(t, cmd.Process.Kill())
}

select {
case <-time.After(10 * time.Second):
Expand All @@ -89,10 +158,10 @@ func TestRestartFromPowerFailure(t *testing.T) {

// activeFailpoint actives the failpoint by http.
func activeFailpoint(t *testing.T, targetUrl string, fpName, fpVal string) {
u, err := url.Parse("http://" + path.Join(targetUrl, fpName))
u, err := url.JoinPath(targetUrl, fpName)
require.NoError(t, err, "parse url %s", targetUrl)

req, err := http.NewRequest("PUT", u.String(), bytes.NewBuffer([]byte(fpVal)))
req, err := http.NewRequest("PUT", u, bytes.NewBuffer([]byte(fpVal)))
require.NoError(t, err)

resp, err := http.DefaultClient.Do(req)
Expand Down Expand Up @@ -192,3 +261,9 @@ func unmountAll(target string) error {
}
return fmt.Errorf("failed to umount %s: %w", target, unix.EBUSY)
}

func randomInt(t *testing.T, max int) int {
n, err := rand.Int(rand.Reader, big.NewInt(int64(max)))
assert.NoError(t, err)
return int(n.Int64())
}
Loading