
Add Swift network for RGW to HCI scenario #386

Closed

Conversation

fultonj commented Sep 3, 2024

When Ceph RGW is used, an endpoint for Swift storage is hosted not in a pod on k8s but on an EDPM node. Thus, a service hosted on an EDPM node will need to be accessed by cloud users from a separate network.

This patch adds the Swift storage network (swift) with VLAN 25 and range 172.22.0.0/24 in the HCI values example. The Swift network is configured on the HCI EDPM nodes, and an NNCP, NAD, L2Advertisement and IPAddressPool are defined so that a pod in k8s, such as the tempest pod which performs the object storage tests, can connect to it.

To make these changes, va/hci now keeps its own copy of the nncp and networking directories, since they differ (by the new network) from the generic ones in the lib directory.

Jira: https://issues.redhat.com/browse/OSPRH-6675
Depends-On: openstack-k8s-operators/ci-framework#2301
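
For orientation, the nncp kustomization fills the per-node addresses in from a network-values ConfigMap used as a kustomize replacement source. A minimal sketch of what the swift additions imply is shown below; the swift_ip key names are inferred from the data.node_0.swift_ip fieldPath in the CI error quoted later in this thread, the addresses match the generated NNCPs shown further down, and everything else in the real values file (other networks, other per-node fields) is omitted.

apiVersion: v1
kind: ConfigMap
metadata:
  name: network-values        # replacement source named in the CI error
  namespace: openstack        # namespace assumed for this sketch
data:
  node_0:
    swift_ip: 172.22.0.5      # becomes the swift VLAN address in the first NNCP
  node_1:
    swift_ip: 172.22.0.6
  node_2:
    swift_ip: 172.22.0.7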

openshift-ci bot commented Sep 3, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fultonj

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/ecc7ae499dd0462b97f23c375b46bf28

✔️ noop SUCCESS in 0s
rhoso-architecture-validate-hci FAILURE in 3m 20s

openshift-ci bot added the approved label Sep 3, 2024
fultonj changed the title from "Add object storage network to HCI scenario" to "Add Swift network for RGW to HCI scenario" Sep 3, 2024

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0f4c9701ca2e421eb71dfa238ec09a19

✔️ noop SUCCESS in 0s
rhoso-architecture-validate-hci FAILURE in 3m 16s

fultonj commented Sep 3, 2024

Regarding ❌ rhoso-architecture-validate-hci FAILURE in 3m 16s

I ran the following in my own environment and I didn't reproduce the problem.

ansible-playbook -i localhost, -c local ci/playbooks/architecture/validate-architecture.yml \
  -e "@/home/zuul/ci-framework-data/artifacts/parameters/custom-params.yml" \
  -e cifmw_architecture_repo=/home/zuul/github.com/openstack-k8s-operators/architecture/


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/805ad2339a114206ab86d7c3a3002c29

✔️ noop SUCCESS in 0s
rhoso-architecture-validate-hci FAILURE in 3m 22s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/97539585015f4394a62edc009c2db3dd

✔️ noop SUCCESS in 0s
rhoso-architecture-validate-hci FAILURE in 3m 22s

fultonj commented Sep 4, 2024

This failure from rhoso-architecture-validate-hci:

fatal: [localhost]: FAILED! => changed=true 
  cmd:
  - oc
  - kustomize
  delta: '0:00:00.144154'
  end: '2024-09-03 20:22:05.465444'
  msg: non-zero return code
  rc: 1
  start: '2024-09-03 20:22:05.321290'
  stderr: 'error: accumulating components: accumulateDirectory: "recursed accumulation of path 
''/home/zuul/src/github.com/openstack-k8s-operators/architecture/va/hci/nncp'': 
fieldPath `data.node_0.swift_ip` is 
missing for replacement source ConfigMap.[noVer].[noGrp]/network-values.[noNs]"'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

The failing job should just be running oc kustomize examples/va/hci/control-plane/nncp with my patch, as described here:

https://github.com/openstack-k8s-operators/ci-framework/blob/main/roles/kustomize_deploy/tasks/execute_step.yml#L192-L218

When I do that with these versions in my environment, it works.

[zuul@controller-0 architecture]$ oc version
Client Version: 4.16.8
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.1
Kubernetes Version: v1.29.5+58452d8
[zuul@controller-0 architecture]$ oc kustomize examples/va/hci/control-plane/nncp | grep "name: swift" -B 10 -A 5
    - description: swift vlan interface
      ipv4:
        address:
        - ip: 172.22.0.5
          prefix-length: "24"
        dhcp: false
        enabled: true
      ipv6:
        enabled: false
      mtu: 1500
      name: swift
      state: up
      type: vlan
      vlan:
        base-iface: enp6s0
        id: "25"
--
    - description: swift vlan interface
      ipv4:
        address:
        - ip: 172.22.0.6
          prefix-length: "24"
        dhcp: false
        enabled: true
      ipv6:
        enabled: false
      mtu: 1500
      name: swift
      state: up
      type: vlan
      vlan:
        base-iface: enp6s0
        id: "25"
--
    - description: swift vlan interface
      ipv4:
        address:
        - ip: 172.22.0.7
          prefix-length: "24"
        dhcp: false
        enabled: true
      ipv6:
        enabled: false
      mtu: 1500
      name: swift
      state: up
      type: vlan
      vlan:
        base-iface: enp6s0
        id: "25"
[zuul@controller-0 architecture]$ 

So I'm having trouble reproducing this error locally. Since I've added a depends-on to openstack-k8s-operators/ci-framework#2301 I might try updating ci-framework next to add debug info.

fultonj commented Sep 4, 2024

recheck


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4008a91b5d2640d68e606d99492fced6

✔️ noop SUCCESS in 0s
rhoso-architecture-validate-hci FAILURE in 3m 20s

fultonj commented Sep 4, 2024

fatal: [localhost]: FAILED! => changed=true 
  cmd:
  - oc
  - kustomize
  delta: '0:00:00.144154'
  end: '2024-09-03 20:22:05.465444'
  msg: non-zero return code
  rc: 1
  start: '2024-09-03 20:22:05.321290'
  stderr: 'error: accumulating components: accumulateDirectory: "recursed accumulation of path 
''/home/zuul/src/github.com/openstack-k8s-operators/architecture/va/hci/nncp'': 
fieldPath `data.node_0.swift_ip` is 
missing for replacement source ConfigMap.[noVer].[noGrp]/network-values.[noNs]"'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

The failing job should just be running oc kustomize examples/va/hci/control-plane/nncp with my patch, as described here:

https://github.com/openstack-k8s-operators/ci-framework/blob/main/roles/kustomize_deploy/tasks/execute_step.yml#L192-L218

When I do that with these versions in my environment, it works.

[zuul@controller-0 architecture]$ oc version
Client Version: 4.16.8
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.1
Kubernetes Version: v1.29.5+58452d8

The CI is using the same versions:

- 'Client Version: 4.16.8'
- 'Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3'

https://softwarefactory-project.io/zuul/t/rdoproject.org/build/92be298ca3ca468c88dd08b92b03d2fb


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3b6271692ffa4200ade4b8122c4f5c31

✔️ noop SUCCESS in 0s
rhoso-architecture-validate-hci FAILURE in 3m 13s

fultonj commented Sep 4, 2024

This failure from rhoso-architecture-validate-hci:

fatal: [localhost]: FAILED! => changed=true 
  cmd:
  - oc
  - kustomize
  delta: '0:00:00.144154'
  end: '2024-09-03 20:22:05.465444'
  msg: non-zero return code
  rc: 1
  start: '2024-09-03 20:22:05.321290'
  stderr: 'error: accumulating components: accumulateDirectory: "recursed accumulation of path 
''/home/zuul/src/github.com/openstack-k8s-operators/architecture/va/hci/nncp'': 
fieldPath `data.node_0.swift_ip` is 
missing for replacement source ConfigMap.[noVer].[noGrp]/network-values.[noNs]"'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

...

So I'm having trouble reproducing this error locally. Since I've added a depends-on to openstack-k8s-operators/ci-framework#2301 I might try updating ci-framework next to add debug info.

https://github.com/openstack-k8s-operators/architecture/pull/386/files#r1742876830

fmount commented Sep 4, 2024

Hi @fultonj,
thanks for working on this change.
I assume the idea of having a dedicated network for swift/rgw starts from the assumption we shouldn't propagate the external network to the EDPM nodes.
The external network might create security concerns, and in the end we can "abstract" this dedicated Swift network at the tenant level, so the VMs (I'm thinking of the SoS use case) can access the RGW service without an external/public endpoint exposed outside.
Am I reading it correctly?

fultonj commented Sep 5, 2024

> Hi @fultonj, thanks for working on this change. I assume the idea of having a dedicated network for swift/rgw starts from the assumption we shouldn't propagate the external network to the EDPM nodes. The external network might create security concerns, and in the end we can "abstract" this dedicated Swift network at the tenant level, so the VMs (I'm thinking of the SoS use case) can access the RGW service without an external/public endpoint exposed outside. Am I reading it correctly?

Right now I'm trying to get a separate network (not the storage network) for RGW so we can put an end to our workaround in CI of running Ceph on the provisioning network. It's easiest to implement this new network, named swift, as another VLAN. It could be changed to an external network if we wanted to expose the service externally but that would be done in another patch if desired.

fultonj commented Sep 5, 2024

recheck

fultonj commented Sep 9, 2024

I tested this in my environment with VA1. I ran the Tempest object storage tests with the following and they passed.

kind: Tempest
spec:
  networkAttachments:
  - swift

Waiting for a test-project run downstream before merging

cidr: 172.22.0.0/24
gateway: 172.22.0.1
name: subnet1
vlan: 25

VLAN 24 and 172.21 is the next unused range, so why VLAN 25 and 172.22 in this patch?

Because #376 is using VLAN 24 and 172.21
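
For context, the cidr/gateway/name/vlan snippet above is the subnet definition for the new network. A rough sketch of how such a subnet would sit in a NetConfig entry for swift follows, assuming the swift network is declared the same way as the other VLAN networks; the dnsDomain and allocationRanges values here are illustrative placeholders, not values taken from this patch.

apiVersion: network.openstack.org/v1beta1
kind: NetConfig
metadata:
  name: netconfig                  # name/namespace assumed for this sketch
  namespace: openstack
spec:
  networks:
  - name: swift
    dnsDomain: swift.example.com   # placeholder domain
    mtu: 1500
    subnets:
    - name: subnet1
      cidr: 172.22.0.0/24
      gateway: 172.22.0.1
      vlan: 25
      allocationRanges:            # illustrative range, not from this patch
      - start: 172.22.0.30
        end: 172.22.0.70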

abays commented Sep 10, 2024

@fultonj I think it might be possible to do this without needing to remove VA1's dependency on lib and duplicating all the lib configuration in va/hci. I came up with this example for the NNCP part: abays@a484224.

I created a diff of your patch's generated content and my version, and I think it's just a white-space or trailing characters difference. I don't see any actual value differences (just showing one NNCP here):

288,303d287
<     - description: swift vlan interface
<       ipv4:
<         address:
<         - ip: 172.22.0.7
<           prefix-length: "24"
<         dhcp: false
<         enabled: true
<       ipv6:
<         enabled: false
<       mtu: 1500
<       name: swift
<       state: up
<       type: vlan
<       vlan:
<         base-iface: enp6s0
<         id: "25"
328a313,328
>     - description: swift vlan interface
>       ipv4:
>         address:
>         - ip: 172.22.0.7
>           prefix-length: "24"
>         dhcp: false
>         enabled: true
>       ipv6:
>         enabled: false
>       mtu: 1500
>       name: swift
>       state: up
>       type: vlan
>       vlan:
>         base-iface: enp6s0
>         id: "25"

What do you think?

fultonj commented Sep 10, 2024

> @fultonj I think it might be possible to do this without needing to remove VA1's dependency on lib and duplicating all the lib configuration in va/hci. I came up with this example for the NNCP part: abays@a484224.
>
> I created a diff of your patch's generated content and my version, and I think it's just a white-space or trailing characters difference. I don't see any actual value differences (just showing one NNCP here):
>
> What do you think?

That's much nicer, though it's missing a few things I'll still need.

  1. The NetworkAttachmentDefinition for swift
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  labels:
    osp/net: swift
    osp/net-attach-def-type: standard
  name: swift
  namespace: openstack
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "swift",
      "type": "macvlan",
      "master": "swift",
      "ipam": {
        "type": "whereabouts",
        "range": "172.22.0.0/24",
        "range_start": "172.22.0.100",
        "range_end": "172.22.0.250"
      }
    }
  2. IPAddressPool
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  labels:
    osp/lb-addresses-type: standard
  name: swift
  namespace: metallb-system
spec:
  addresses:
  - 172.22.0.80-172.22.0.90
  3. L2Advertisement
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: swift
  namespace: metallb-system
spec:
  interfaces:
  - swift
  ipAddressPools:
  - swift
  4. The OpenStackDataPlaneNodeSet is missing entries for the swift network in nodeTemplate/ansible/{networks,nodes}
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: openstack-edpm
  namespace: openstack
spec:
  env:
  - name: ANSIBLE_FORCE_COLOR
    value: "True"
  networkAttachments:
  - ctlplane
  nodeTemplate:
    ansible:
      ansiblePort: 22
      ansibleUser: cloud-admin
      ansibleVars:
        edpm_ceph_hci_pre_enabled_services:
        - ceph_mon
        - ceph_mgr
        - ceph_osd
        - ceph_rgw
        - ceph_nfs
        - ceph_rgw_frontend
        - ceph_nfs_frontend
        edpm_network_config_hide_sensitive_logs: false
        edpm_network_config_os_net_config_mappings:
          edpm-compute-0:
            nic2: 6a:fe:54:3f:8a:02
          edpm-compute-1:
            nic2: 6b:fe:54:3f:8a:02
          edpm-compute-2:
            nic2: 6c:fe:54:3f:8a:02
        edpm_network_config_template: |
          ---
          {% set mtu_list = [ctlplane_mtu] %}
          {% for network in nodeset_networks %}
          {{ mtu_list.append(lookup('vars', networks_lower[network] ~ '_mtu')) }}
          {%- endfor %}
          {% set min_viable_mtu = mtu_list | max %}
          network_config:
          - type: interface
            name: nic1
            use_dhcp: true
            mtu: {{ min_viable_mtu }}
          - type: ovs_bridge
            name: {{ neutron_physical_bridge_name }}
            mtu: {{ min_viable_mtu }}
            use_dhcp: false
            dns_servers: {{ ctlplane_dns_nameservers }}
            domain: {{ dns_search_domains }}
            addresses:
            - ip_netmask: {{ ctlplane_ip }}/{{ ctlplane_cidr }}
            routes: {{ ctlplane_host_routes }}
            members:
            - type: interface
              name: nic2
              mtu: {{ min_viable_mtu }}
              # force the MAC address of the bridge to this interface
              primary: true
          {% for network in nodeset_networks %}
            - type: vlan
              mtu: {{ lookup('vars', networks_lower[network] ~ '_mtu') }}
              vlan_id: {{ lookup('vars', networks_lower[network] ~ '_vlan_id') }}
              addresses:
              - ip_netmask:
                  {{ lookup('vars', networks_lower[network] ~ '_ip') }}/{{ lookup('vars', networks_lower[network] ~ '_cidr') }}
              routes: {{ lookup('vars', networks_lower[network] ~ '_host_routes') }}
          {% endfor %}
        edpm_nodes_validation_validate_controllers_icmp: false
        edpm_nodes_validation_validate_gateway_icmp: false
        edpm_sshd_allowed_ranges:
        - 192.168.122.0/24
        edpm_sshd_configure_firewall: true
        gather_facts: false
        neutron_physical_bridge_name: br-ex
        neutron_public_interface_name: eth0
        storage_mgmt_cidr: "24"
        storage_mgmt_host_routes: []
        storage_mgmt_mtu: 9000
        storage_mgmt_vlan_id: 23
        storage_mtu: 9000
        timesync_ntp_servers:
        - hostname: pool.ntp.org
    ansibleSSHPrivateKeySecret: dataplane-ansible-ssh-private-key-secret
    extraMounts:
    - extraVolType: Ceph
      mounts:
      - mountPath: /etc/ceph
        name: ceph
        readOnly: true
      volumes:
      - name: ceph
        secret:
          secretName: ceph-conf-files
    managementNetwork: ctlplane
    networks:
    - defaultRoute: true
      name: ctlplane
      subnetName: subnet1
    - name: internalapi
      subnetName: subnet1
    - name: storage
      subnetName: subnet1
    - name: tenant
      subnetName: subnet1
    - name: swift
      subnetName: subnet1
  nodes:
    edpm-compute-0:
      ansible:
        ansibleHost: 192.168.122.100
      hostName: edpm-compute-0
      networks:
      - defaultRoute: true
        fixedIP: 192.168.122.100
        name: ctlplane
        subnetName: subnet1
      - name: internalapi
        subnetName: subnet1
      - name: storage
        subnetName: subnet1
      - name: storagemgmt
        subnetName: subnet1
      - name: tenant
        subnetName: subnet1
      - name: swift
        subnetName: subnet1
    edpm-compute-1:
      ansible:
        ansibleHost: 192.168.122.101
      hostName: edpm-compute-1
      networks:
      - defaultRoute: true
        fixedIP: 192.168.122.101
        name: ctlplane
        subnetName: subnet1
      - name: internalapi
        subnetName: subnet1
      - name: storage
        subnetName: subnet1
      - name: storagemgmt
        subnetName: subnet1
      - name: tenant
        subnetName: subnet1
      - name: swift
        subnetName: subnet1
    edpm-compute-2:
      ansible:
        ansibleHost: 192.168.122.102
      hostName: edpm-compute-2
      networks:
      - defaultRoute: true
        fixedIP: 192.168.122.102
        name: ctlplane
        subnetName: subnet1
      - name: internalapi
        subnetName: subnet1
      - name: storage
        subnetName: subnet1
      - name: storagemgmt
        subnetName: subnet1
      - name: tenant
        subnetName: subnet1
      - name: swift
        subnetName: subnet1
  preProvisioned: true
  services:
  - bootstrap
  - configure-network
  - validate-network
  - install-os
  - ceph-hci-pre
  - configure-os
  - ssh-known-hosts
  - run-os
  - reboot-os
  - install-certs
  - ceph-client
  - ovn
  - neutron-metadata
  - libvirt
  - nova-custom-ceph

abays commented Sep 10, 2024

@fultonj Yes, it's missing the other stuff. I just wanted to use NNCPs as an example approach to what might work. It's possible though that the pattern might break for the other components. But first I just wanted to see if you all thought this might be worth investigating.

fultonj commented Sep 10, 2024

> @fultonj Yes, it's missing the other stuff. I just wanted to use NNCPs as an example approach to what might work. It's possible though that the pattern might break for the other components. But first I just wanted to see if you all thought this might be worth investigating.

It is certainly worth investigating since it's much cleaner, so thank you very much! I'll try implementing this change with that approach.

@fultonj
Copy link
Contributor Author

fultonj commented Sep 10, 2024

todo:

  • streamline the kustomize setup per Andrew's example

raukadah commented:

/trigger github-experimental

abays commented Sep 17, 2024

@fultonj Here's a new version of the rework for NNCP generation that doesn't hardcode NNCP names, as we discussed yesterday: abays@462eb45. You can see it uses labels instead. I tested it and it works.
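
For readers without that branch handy, the general shape of the label-based idea can be sketched as a kustomize patch that targets NNCPs by labelSelector instead of by name. The label key/value below is illustrative, and a real implementation would still replace the per-node IP from the values file, so treat this as a sketch of the selection mechanism rather than the actual content of abays@462eb45.

# kustomization.yaml fragment (sketch): append the swift VLAN interface to
# every NNCP matching a label, instead of keeping a full copy of the nncp
# directory under va/hci.
patches:
- target:
    kind: NodeNetworkConfigurationPolicy
    labelSelector: osp/nncm-config-type=standard   # illustrative label
  patch: |-
    - op: add
      path: /spec/desiredState/interfaces/-
      value:
        description: swift vlan interface
        name: swift
        type: vlan
        state: up
        mtu: 1500
        vlan:
          base-iface: enp6s0
          id: 25
        ipv4:
          enabled: true
          dhcp: false
          address:
          - ip: 172.22.0.5        # per-node value; the real change substitutes this per node
            prefix-length: 24
        ipv6:
          enabled: false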


This change depends on a change that failed to merge.

Change openstack-k8s-operators/ci-framework#2301 is needed.

raukadah commented:

/trigger github-experimental

fultonj commented Sep 18, 2024

Obsoleted by #404

fultonj closed this Sep 18, 2024
Labels: approved, do-not-merge, github-experimental