Merge branch 'master' into feat/basemaps-dem
Wentao-Kuang authored May 23, 2024
2 parents d6afcff + 0da70b2 commit ba7ddf7
Showing 47 changed files with 781 additions and 814 deletions.
64 changes: 11 additions & 53 deletions .github/workflows/main.yml
@@ -13,6 +13,17 @@ jobs:
- name: Run actionlint to check workflow files
run: docker run --volume="${PWD}:/repo" --workdir=/repo actionlint -color

- name: Install Argo
run: |
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.5.5/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
./argo-linux-amd64 version
- name: Lint workflows
run: |
./argo-linux-amd64 lint --offline templates/ workflows/
deploy-prod:
runs-on: ubuntu-latest
concurrency: deploy-prod-${{ github.ref }}
@@ -85,59 +96,6 @@ jobs:
run: |
kubectl apply -f dist/
- name: Install Argo
run: |
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.4.0-rc2/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
./argo-linux-amd64 version
- name: Lint workflows
if: github.ref != 'refs/heads/master'
run: |
# Create test namespace
kubectl create namespace "$GITHUB_SHA"
# Create copy of Workflows files to change their namespaces
mkdir test
cp -r workflows/ test/workflows/
# Deploy templates in the test namespace
# Note: the templates have no default namespace so no need to modify them
kubectl apply -f templates/argo-tasks/ --namespace "$GITHUB_SHA"
# Find all workflows that have kind "WorkflowTemplate"
WORKFLOWS=$(grep -R -H '^kind: WorkflowTemplate$' test/workflows/ | cut -d ':' -f1)
# For each workflow template attempt to deploy it using kubectl
for wf in $WORKFLOWS; do
# Change namespace in files
sed -i "/^\([[:space:]]*namespace: \).*/s//\1$GITHUB_SHA/" "$wf"
kubectl apply -f "$wf" --namespace "$GITHUB_SHA"
done
# Find all cron workflows that have kind "CronWorkflow"
CRON_WORKFLOWS=$(grep -R -H '^kind: CronWorkflow$' test/workflows/ | cut -d ':' -f1)
# For each cron workflow attempt to deploy it using kubectl
for cwf in $CRON_WORKFLOWS; do
# Change namespace in files
sed -i "/^\([[:space:]]*namespace: \).*/s//\1$GITHUB_SHA/" "$cwf"
kubectl apply -f "$cwf" --namespace "$GITHUB_SHA"
done
# Finally lint the templates
./argo-linux-amd64 lint templates/ -n "$GITHUB_SHA"
./argo-linux-amd64 lint test/workflows/ -n "$GITHUB_SHA"
- name: Delete Test namespace
if: always()
run: |
# Delete the test namespace
stderr_tmp="$(mktemp --directory)/stderr"
if ! kubectl delete namespaces "$GITHUB_SHA" 2> >(tee "$stderr_tmp" >&2)
then
grep -q 'Error from server (NotFound): namespaces ".*" not found' "$stderr_tmp"
fi
- name: Deploy workflows
if: github.ref == 'refs/heads/master'
run: |
22 changes: 17 additions & 5 deletions docs/dns.configuration.md
@@ -12,10 +12,16 @@ Start a shell on the container
k exec -n :namespace -it :podName -- /bin/bash
```

Install basic dns utils `dig` `ping` `wget` and `curl`
Install basic networking utils `dig`, `ping`, `ping6`, `wget`, `nslookup`, and `curl`

```bash
apt install dnsutils iptools-ping wget curl
apt update && apt install -y dnsutils iputils-ping wget curl
```

Other useful tools may include `tracepath`, `traceroute` and `mtr`

```bash
apt update && apt install -y iputils-tracepath mtr traceroute
```

### Name resolution
@@ -69,18 +75,24 @@ Depending on the container you may have access to scripting languages.

#### NodeJS

file: index.mjs
create a new file `index.mjs`

```javascript
fetch('https://google.com').then((c) => console.log(c));

import * as dns from 'dns/promises';

await dns.resolve('google.com', 'A');
await dns.resolve('google.com', 'AAAA');
console.log(await dns.resolve('google.com', 'A'));
console.log(await dns.resolve('google.com', 'AAAA'));
```

Run the file

```bash
node --version
node index.mjs
```

## Node Local DNS

A local DNS cache, [node-local-dns](./infrastructure/components/node.local.dns.md), runs on every node. If any DNS issues occur, it is recommended to turn the DNS cache off as a first debugging step.
42 changes: 42 additions & 0 deletions docs/infrastructure/components/node.local.dns.md
@@ -0,0 +1,42 @@
# Node Local DNS

When large [argo](./argo.workflows.md) jobs are submitted, the Kubernetes cluster can scale up very quickly, which can overwhelm the CoreDNS resolvers running on the primary nodes.

To prevent this overload, a DNS cache is installed on every new node when it starts.

It is based on https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/ and greatly reduces the load on the primary DNS servers.

## Debugging DNS problems with node local DNS

If DNS problems occur while node-local-dns is running, it is recommended to turn it off first by setting the `UseNodeLocalDns = false` constant in `infra/constants.ts`.
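As a minimal sketch of that toggle, assuming the flag is exported from `infra/constants.ts` as a plain boolean (the exact shape of the file is not shown here):

```typescript
// infra/constants.ts (sketch): flag read by infra/cdk8s.ts to decide whether
// the NodeLocalDns chart is instantiated. Set to false to disable the cache.
export const UseNodeLocalDns = false;
```

After changing the constant, the cdk8s charts need to be re-synthesised and re-applied for the change to take effect.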

## Watching DNS requests

By default the DNS cache logs any external DNS request it resolves (anything not ending with `.cluster.local`). Since there can be a large number of DNS cache pods, the following command tails the logs from all of them:

```bash
kubectl logs -n kube-system --all-containers=true -f daemonset/node-local-dns --since=1m --timestamps=true --prefix=true
```

### Structured logs

`coredns` does not provide a simple way of constructing a structured log from a DNS request, but it does provide a template system that can be used to craft a JSON log line. A structured (JSON) log line can be more easily ingested into something like Elasticsearch for additional debugging.

For the current log format see `CoreFileJsonLogFormat`; below is an example log entry:

```json
{
"remoteIp": "[2406:da1c:afb:bc0b:d0e3::6]",
"remotePort": 43621,
"protocol": "udp",
"queryId": "14962",
"queryType": "A",
"queryClass": "IN",
"queryName": "logs.ap-southeast-2.amazonaws.com.",
"querySize": 51,
"dnsSecOk": "false",
"responseCode": "NOERROR",
"responseFlags": "qr,rd,ra",
"responseSize": 443
}
```
29 changes: 29 additions & 0 deletions docs/retry.md
@@ -0,0 +1,29 @@
# Default retryStrategy

The default [`retryStrategy`](https://argo-workflows.readthedocs.io/en/stable/fields/#retrystrategy) is defined at the `workflowDefaults` level in the [Argo Workflow chart configuration](https://github.com/linz/topo-workflows/blob/master/infra/charts/argo.workflows.ts). It applies to every step/task by default.

## Overriding

The default `retryStrategy` can be overridden at the workflow or template level by defining a specific `retryStrategy` there.

## Avoiding retry

For example, to avoid the default `retryStrategy` and make sure the task does not retry:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: my-wf-
spec:
entrypoint: main
templates:
- name: main
retryStrategy:
expression: 'false'
script:
image: python:alpine3.6
command: ['python']
source: |
# Do something that fails ...
```
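Conversely, a template can replace the default with a more patient policy. A sketch, with field names per the Argo `retryStrategy` reference and illustrative values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-wf-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: '5'
        retryPolicy: OnFailure # retry only on failed pods, not errored ones
        backoff:
          duration: '30s' # first retry after 30s
          factor: '2' # then 60s, 120s, ...
          maxDuration: '10m'
      script:
        image: python:alpine3.6
        command: ['python']
        source: |
          # Do something that fails ...
```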
10 changes: 5 additions & 5 deletions infra/cdk.ts
@@ -1,14 +1,14 @@
import { App } from 'aws-cdk-lib';

import { ClusterName } from './constants.js';
import { ClusterName, DefaultRegion } from './constants.js';
import { tryGetContextArns } from './eks/arn.js';
import { LinzEksCluster } from './eks/cluster.js';
import { fetchSsmParameters } from './util/ssm.js';

const app = new App();

async function main(): Promise<void> {
const accountId = app.node.tryGetContext('aws-account-id') ?? process.env['CDK_DEFAULT_ACCOUNT'];
const accountId = (app.node.tryGetContext('aws-account-id') as unknown) ?? process.env['CDK_DEFAULT_ACCOUNT'];
const maintainerRoleArns = tryGetContextArns(app.node, 'maintainer-arns');
const slackSsmConfig = await fetchSsmParameters({
slackChannelConfigurationName: '/rds/alerts/slack/channel/name',
@@ -17,12 +17,12 @@ async function main(): Promise<void> {
});

if (maintainerRoleArns == null) throw new Error('Missing context: maintainer-arns');
if (accountId == null) {
if (typeof accountId !== 'string') {
throw new Error("Missing AWS Account information, set with either '-c aws-account-id' or $CDK_DEFAULT_ACCOUNT");
}

new LinzEksCluster(app, ClusterName, {
env: { region: 'ap-southeast-2', account: accountId },
env: { region: DefaultRegion, account: accountId },
maintainerRoleArns,
slackChannelConfigurationName: slackSsmConfig.slackChannelConfigurationName,
slackWorkspaceId: slackSsmConfig.slackWorkspaceId,
@@ -32,4 +32,4 @@ async function main(): Promise<void> {
app.synth();
}

main();
void main();
47 changes: 30 additions & 17 deletions infra/cdk8s.ts
@@ -7,32 +7,45 @@ import { EventExporter } from './charts/event.exporter.js';
import { FluentBit } from './charts/fluentbit.js';
import { Karpenter, KarpenterProvisioner } from './charts/karpenter.js';
import { CoreDns } from './charts/kube-system.coredns.js';
import { CfnOutputKeys, ClusterName, ScratchBucketName, validateKeys } from './constants.js';
import { getCfnOutputs } from './util/cloud.formation.js';
import { NodeLocalDns } from './charts/kube-system.node.local.dns.js';
import { CfnOutputKeys, ClusterName, ScratchBucketName, UseNodeLocalDns, validateKeys } from './constants.js';
import { describeCluster, getCfnOutputs } from './util/cloud.formation.js';
import { fetchSsmParameters } from './util/ssm.js';

const app = new App();

async function main(): Promise<void> {
// Get cloudformation outputs
const cfnOutputs = await getCfnOutputs(ClusterName);
const [cfnOutputs, ssmConfig, clusterConfig] = await Promise.all([
getCfnOutputs(ClusterName),
fetchSsmParameters({
// Config for Cloudflared to access argo-server
tunnelId: '/eks/cloudflared/argo/tunnelId',
tunnelSecret: '/eks/cloudflared/argo/tunnelSecret',
tunnelName: '/eks/cloudflared/argo/tunnelName',
accountId: '/eks/cloudflared/argo/accountId',

// Personal access token to gain access to linz-li-bot github user
githubPat: '/eks/github/linz-li-bot/pat',

// Argo Database connection password
argoDbPassword: '/eks/argo/postgres/password',
}),
describeCluster(ClusterName),
]);
validateKeys(cfnOutputs);

const ssmConfig = await fetchSsmParameters({
// Config for Cloudflared to access argo-server
tunnelId: '/eks/cloudflared/argo/tunnelId',
tunnelSecret: '/eks/cloudflared/argo/tunnelSecret',
tunnelName: '/eks/cloudflared/argo/tunnelName',
accountId: '/eks/cloudflared/argo/accountId',

// Personal access token to gain access to linz-li-bot github user
githubPat: '/eks/github/linz-li-bot/pat',
const coredns = new CoreDns(app, 'dns', {});

// Argo Database connection password
argoDbPassword: '/eks/argo/postgres/password',
});
// Node-local DNS is very experimental in this cluster; it can and will break DNS resolution.
// If there are any issues with DNS, NodeLocalDNS should be disabled first.
if (UseNodeLocalDns) {
const ipv6Cidr = clusterConfig.kubernetesNetworkConfig?.serviceIpv6Cidr;
if (ipv6Cidr == null) throw new Error('Unable to use node-local-dns without ipv6Cidr');
const nodeLocal = new NodeLocalDns(app, 'node-local-dns', { serviceIpv6Cidr: ipv6Cidr });
nodeLocal.addDependency(coredns);
}

const coredns = new CoreDns(app, 'dns', {});
const fluentbit = new FluentBit(app, 'fluentbit', {
saName: cfnOutputs[CfnOutputKeys.FluentBitServiceAccountName],
clusterName: ClusterName,
@@ -85,4 +98,4 @@ async function main(): Promise<void> {
app.synth();
}

main();
void main();
10 changes: 7 additions & 3 deletions infra/charts/argo.workflows.ts
@@ -2,7 +2,7 @@ import { Chart, ChartProps, Duration, Helm } from 'cdk8s';
import { Secret } from 'cdk8s-plus-27';
import { Construct } from 'constructs';

import { ArgoDbName, ArgoDbUser } from '../constants.js';
import { ArgoDbName, ArgoDbUser, DefaultRegion } from '../constants.js';
import { applyDefaultLabels } from '../util/labels.js';

export interface ArgoWorkflowsProps {
@@ -65,7 +65,7 @@ export class ArgoWorkflows extends Chart {
bucket: props.tempBucketName,
keyFormat:
'{{workflow.creationTimestamp.Y}}-{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}-{{workflow.name}}/{{pod.name}}',
region: 'ap-southeast-2',
region: DefaultRegion,
endpoint: 's3.amazonaws.com',
useSDKCreds: true,
insecure: false,
@@ -130,7 +130,7 @@ export class ArgoWorkflows extends Chart {
workflowNamespaces: ['argo'],
extraArgs: [],
// FIXME: workaround for https://github.com/argoproj/argo-workflows/issues/11657
extraEnv: [{ name: 'WATCH_CONFIGMAPS', value: 'false' }],
extraEnv: [{ name: 'WATCH_CONTROLLER_SEMAPHORE_CONFIGMAPS', value: 'false' }],
persistence,
replicas: 2,
workflowDefaults: {
@@ -147,6 +147,10 @@
},
],
parallelism: 3,
/** TODO: `nodeAntiAffinity` - to retry on different node - is not working yet (https://github.com/argoproj/argo-workflows/pull/12701)
* `affinity: { nodeAntiAffinity: {} }` seems to break `karpenter`, need more investigation
*/
retryStrategy: { limit: 2 },
},
},
},
5 changes: 3 additions & 2 deletions infra/charts/fluentbit.ts
@@ -1,6 +1,7 @@
import { Chart, ChartProps, Helm } from 'cdk8s';
import { Construct } from 'constructs';

import { DefaultRegion } from '../constants.js';
import { applyDefaultLabels } from '../util/labels.js';

/**
@@ -73,9 +74,9 @@ HC_Period 5
serviceAccount: { name: props.saName, create: false },
cloudWatchLogs: {
enabled: true,
region: 'ap-southeast-2',
region: DefaultRegion,
/** Specify Cloudwatch endpoint to add a trailing `.` to force FQDN DNS request */
endpoint: 'logs.ap-southeast-2.amazonaws.com.',
endpoint: `logs.${DefaultRegion}.amazonaws.com.`,
autoCreateGroup: true,
logRetentionDays: 30,
logGroupName: `/aws/eks/${props.clusterName}/logs`,
7 changes: 5 additions & 2 deletions infra/charts/kube-system.coredns.ts
@@ -4,6 +4,9 @@ import { Construct } from 'constructs';

import { applyDefaultLabels } from '../util/labels.js';

/** Configure CoreDNS to output a JSON object for its log files */
export const CoreFileJsonLogFormat = `{"remoteIp":"{remote}","remotePort":{port},"protocol":"{proto}","queryId":"{>id}","queryType":"{type}","queryClass":"{class}","queryName":"{name}","querySize":{size},"dnsSecOk":"{>do}","responseCode":"{rcode}","responseFlags":"{>rflags}","responseSize":{rsize}}`;

/**
* This cluster is setup as dual ipv4/ipv6 where ipv4 is used for external traffic
* and ipv6 for internal traffic.
@@ -36,7 +39,7 @@ export class CoreDns extends Chart {
// FIXME: is there a better way of handling config files inside of cdk8s
Corefile: `
cluster.local:53 {
log
log . ${CoreFileJsonLogFormat}
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
@@ -53,7 +56,7 @@
}
.:53 {
log
log . ${CoreFileJsonLogFormat}
errors
health
template ANY AAAA {