
[fcos] Enable openshift installer to decompress xz files and upload them as page blobs to Azure #3033

Closed
wants to merge 2 commits

Conversation


@jomeier jomeier commented Jan 31, 2020

Hi,

If the coreos team converts the fcos image to a fixed-size VHD file (I think that's a good idea): how does this 8 GB image get into the Azure storage account?

If the installer must upload it, this will take very long. That doesn't seem feasible if it's necessary each time we run the installer.

Yes, it would be great if we didn't have to deal with the fcos image at all. I hope it gets to the Azure Marketplace very soon.

Greetings,

Josef

@openshift-ci-robot openshift-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 31, 2020
@openshift-ci-robot

Hi @jomeier. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


jomeier commented Jan 31, 2020

I compiled the installer and tried it out. It decompresses the fcos image and uploads it into a page blob container.

But: as explained a few times before, the upload takes far too long (at least on my PC) and eventually terminates with a timeout.


jomeier commented Jan 31, 2020

After 27 minutes Terraform stopped with this error message:

ERROR
ERROR Warning: "resource_group_name": [DEPRECATED] This field has been deprecated and is no longer used - will be removed in 2.0 of the Azure Provider
ERROR
ERROR   on ../../../tmp/openshift-install-710899921/main.tf line 142, in resource "azurerm_storage_container" "vhd":
ERROR  142: resource "azurerm_storage_container" "vhd" {
ERROR
ERROR (and 3 more similar warnings elsewhere)
ERROR
ERROR
ERROR Error: Error creating Blob "rhcos92dmz.vhd" (Container "vhd" / Account "cluster92dmz"): Error creating storage blob on Azure: Error while uploading source file "/home/sepp/.cache/openshift-installer/image_cache/fb97e8b31264f7a8cd5eda2dc86275b7": Error writing page at offset 346169344 for file "/home/sepp/.cache/openshift-installer/image_cache/fb97e8b31264f7a8cd5eda2dc86275b7": blobs.Client#PutPageUpdate: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthenticationFailed" Message="Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\nRequestId:4a368045-901e-0016-6c43-d89150000000\nTime:2020-01-31T14:31:43.0134237Z"
ERROR
ERROR   on ../../../tmp/openshift-install-710899921/main.tf line 148, in resource "azurerm_storage_blob" "rhcos_image":
ERROR  148: resource "azurerm_storage_blob" "rhcos_image" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

This looks more like a timeout to me, because on my PC it always happens after that amount of time.

Next I will try the upload directly on Azure, in a Cloud Shell. This should be faster because we are already on Azure.

@vrutkovs
Member

/retitle [fcos] Enable openshift installer to decompress xz files and upload them as page blobs to Azure

@openshift-ci-robot openshift-ci-robot changed the title Enable openshift installer to decompress xz files and upload them as page blobs to Azure [fcos] Enable openshift installer to decompress xz files and upload them as page blobs to Azure Jan 31, 2020
@vrutkovs
Member

Extra rhcos92dmz.vhd file included


jomeier commented Jan 31, 2020

Next problem:

The upload was much, much faster when I ran openshift-installer in the Azure Cloud Shell.

But:

ERROR
ERROR Warning: "resource_group_name": [DEPRECATED] This field has been deprecated and is no longer used - will be removed in 2.0 of the Azure Provider
ERROR
ERROR   on ../../../tmp/openshift-install-895652810/main.tf line 142, in resource "azurerm_storage_container" "vhd":
ERROR  142: resource "azurerm_storage_container" "vhd" {
ERROR
ERROR (and 3 more similar warnings elsewhere)
ERROR
ERROR
ERROR Error: compute.ImagesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="Disk 'rhcoswsgc0.vhd' with blob https://clusterwsgc0.blob.core.windows.net:8443/vhd/rhcoswsgc0.vhd is of Dynamic VHD type. Please retry with fixed VHD type." Target="disks"
ERROR
ERROR   on ../../../tmp/openshift-install-895652810/main.tf line 158, in resource "azurerm_image" "cluster":
ERROR  158: resource "azurerm_image" "cluster" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

The fcos VHD file seems to be dynamic, but Azure images require fixed-size VHDs. So we have to convert it to a fixed size and use an uploader that knows not to upload zero blocks. I don't know if Terraform can do this ...

I used this command line tool:

azure-vhd-utils

It only uploads what is necessary, and it outputs a progress counter on the console, which would also be nice to have.
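
For context, the trick behind such an uploader is simple: read the fixed VHD in page-aligned chunks and skip any chunk that is all zeroes, since unwritten ranges of an Azure page blob read back as zeroes anyway. A minimal sketch of that check in Go (the package and function names are mine, not from azure-vhd-utils):

package blobupload

// isZeroChunk reports whether a chunk consists entirely of zero bytes.
// An uploader can skip writing such chunks to a page blob, because
// unwritten page-blob ranges read back as zeroes anyway.
func isZeroChunk(chunk []byte) bool {
	for _, b := range chunk {
		if b != 0 {
			return false
		}
	}
	return true
}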

I have no idea how to convert a VHD file from dynamic to fixed size in Go.


jomeier commented Jan 31, 2020

Correct me if I'm wrong, but wouldn't it be a good idea if the fcos image were provided as a compressed, fixed-size VHD file by the Fedora CoreOS team? The fixed-size file contains lots of zeroes, so it should compress very well, and the step of converting the dynamic VHD file to a fixed-size VHD file wouldn't be necessary.

@sgreene570 sgreene570 left a comment


Awesome progress! Some small comments to help out. Also, I think you might have accidentally committed rhcos92dmz.vhd.

return err
}
defer writer.Close()
//defer os.Remove(dest)


Minor question: is there a reason for this?

Contributor Author

@sgreene570: No. I took it from the original code from Vadim. It may be that some code snippets aren't necessary anymore.

go.mod Outdated
@@ -100,6 +100,7 @@ require (
github.com/terraform-providers/terraform-provider-openstack v1.24.0
github.com/terraform-providers/terraform-provider-random v1.3.2-0.20191204175905-53436297444a
github.com/terraform-providers/terraform-provider-vsphere v1.14.0
github.com/ulikunitz/xz v0.5.6


Did you run go mod vendor as well?

Contributor Author

No. Will do that.

return nil
}

func decompressFileXZ(src, dest string) error {


This function is a lot like decompressFileGzip. Do you think we could combine them to save some lines? 😃
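
For what it's worth, here is one way the two could be folded together, sketched under the assumption that decompressFileGzip has the same open/wrap/copy shape as the xz variant quoted above (the shared helper decompressFile and the package name are hypothetical, not code from this PR):

package imagecache

import (
	"compress/gzip"
	"io"
	"os"

	"github.com/ulikunitz/xz"
)

// decompressFile streams src into dest through the decompressing
// reader returned by wrap, so the gzip and xz variants differ only
// in the one-line wrapper they pass in.
func decompressFile(src, dest string, wrap func(io.Reader) (io.Reader, error)) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	r, err := wrap(in)
	if err != nil {
		return err
	}

	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, r)
	return err
}

func decompressFileGzip(src, dest string) error {
	return decompressFile(src, dest, func(r io.Reader) (io.Reader, error) {
		return gzip.NewReader(r)
	})
}

func decompressFileXZ(src, dest string) error {
	return decompressFile(src, dest, func(r io.Reader) (io.Reader, error) {
		return xz.NewReader(r)
	})
}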

Contributor Author

These lines are not needed anymore because everything we need is already in pkg/tfvars/internal/cache/cache.go.

@LorbusChris
Member

/assign
/assign @vrutkovs

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lorbuschris
You can assign the PR to them by writing /assign @lorbuschris in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


jomeier commented Jan 31, 2020

The code that converts the dynamic VHD to a fixed-size VHD is committed.

The installer can now download, decompress, convert from dynamic to fixed VHD, and upload the decompressed VHD file to Azure, and a VM image can be created from it as well.

But: it doesn't yield a running VM, neither on Azure nor in Windows Hyper-V.

I'm sure it has something to do with the dynamic-to-fixed VHD conversion. I pad the file with zeroes to fulfill Azure's requirements; Azure complains if the blob file size doesn't match the size given in the VHD footer :-(
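
For reference, a fixed-size VHD is just the raw disk contents followed by a 512-byte footer, and Azure additionally expects the virtual disk size to be aligned to 1 MB. A small sketch of the resulting size math (my own helper, not code from this PR):

package vhdsize

const (
	vhdFooterSize = 512
	oneMB         = 1024 * 1024
)

// fixedVHDBlobSize returns the total blob size for a fixed VHD built
// from rawSize bytes of disk data: the raw size rounded up to a 1 MB
// boundary, plus the 512-byte footer. The footer's "current size"
// field must match the rounded raw size, or Azure rejects the image.
func fixedVHDBlobSize(rawSize int64) int64 {
	rounded := (rawSize + oneMB - 1) / oneMB * oneMB
	return rounded + vhdFooterSize
}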

If I use the Powershell command:

convert-vhd -Path fcos.vhd -VHDType Fixed -DestinationPath fcos.fixedpowershell.vhd

(would love to see its source code)

I can start a VM with Hyper-V on my Windows PC from that file, so this seems to be a good test setup for finding the problem.

I found a vhd analyzer tool: https://github.com/franciozzy/VHD-Sherlock

But it's not reporting any issues.

Maybe someone can help me with that? I'm no VHD file structure expert. I'm giving up for today.
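
For anyone who wants to dig into the footer themselves: the 512-byte structure at the end of every VHD is laid out in Microsoft's published VHD specification. Rendered as a Go struct purely as a reading aid (all multi-byte fields are big-endian on disk; this is not code from the PR):

package vhd

// vhdFooter mirrors the 512-byte footer that terminates every VHD
// file, per Microsoft's VHD specification.
type vhdFooter struct {
	Cookie             [8]byte // "conectix"
	Features           uint32
	FileFormatVersion  uint32
	DataOffset         uint64 // 0xFFFFFFFFFFFFFFFF for fixed disks
	TimeStamp          uint32
	CreatorApplication [4]byte
	CreatorVersion     uint32
	CreatorHostOS      uint32
	OriginalSize       uint64 // virtual size when the disk was created
	CurrentSize        uint64 // virtual size now; must match the blob's data size
	DiskGeometry       uint32
	DiskType           uint32 // 2 = fixed, 3 = dynamic, 4 = differencing
	Checksum           uint32 // ones' complement of the byte sum of the footer, computed with this field zeroed
	UniqueID           [16]byte
	SavedState         byte
	Reserved           [427]byte
}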


jomeier commented Feb 1, 2020

I would like to propose a different strategy, because this is getting more and more ugly :-)

I would use Terraform to create a helper VM on Azure where all this download, extract, convert, and upload work is performed. We could use standard tools (qemu-img, azcopy) for that, and things would be much faster because everything already happens on Azure.

Does this sound feasible?


jomeier commented Feb 1, 2020

#3042

In my opinion, Terraform is configured incorrectly in the installer: the basic provisioners local-exec and remote-exec don't work.

The installer can't handle the "internal-plugin" command.

These provisioners are necessary to finally get this fcos decompression, conversion, and upload code running. I'd love to get your feedback on this image upload strategy.

I don't know how to implement this without remote-exec. Who wrote the Terraform code in the installer? Maybe he or she can help here?


jomeier commented Feb 1, 2020

I found out why the built-in provisioners (remote-exec, local-exec, ...) don't work in the openshift-installer. The Terraform library spawns a subprocess from the binary it's linked into, and this subprocess gets a few command line args starting with "internal-plugin". The original Terraform binary knows this parameter; openshift-installer doesn't.

All I had to do was add a command line parameter in Cobra for "internal-plugin" and pass a few arguments through to the Terraform library function that takes care of it.
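
Roughly, such a Cobra wiring could look like the sketch below. The "internal-plugin" command name comes from the description above, but runTerraformInternalPlugin is a stand-in for the vendored Terraform entry point, not the actual PR code:

package main

import (
	"os"

	"github.com/spf13/cobra"
)

// runTerraformInternalPlugin stands in for the vendored Terraform
// library function that serves a built-in plugin; the real dispatch
// lives inside the Terraform code the installer links against.
func runTerraformInternalPlugin(args []string) int {
	// ... hand args to the Terraform-internal plugin server ...
	return 0
}

// newInternalPluginCommand registers a hidden "internal-plugin"
// subcommand. When Terraform needs a built-in provisioner such as
// remote-exec, it re-executes the current binary as
// `<binary> internal-plugin <plugin-type> <plugin-name>`; without
// this command, openshift-installer rejects those arguments.
func newInternalPluginCommand() *cobra.Command {
	return &cobra.Command{
		Use:    "internal-plugin",
		Hidden: true,
		Args:   cobra.MinimumNArgs(2),
		Run: func(cmd *cobra.Command, args []string) {
			os.Exit(runTerraformInternalPlugin(args))
		},
	}
}

func main() {
	root := &cobra.Command{Use: "openshift-install"}
	root.AddCommand(newInternalPluginCommand())
	_ = root.Execute()
}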

In my opinion this will enhance openshift-installer's Terraform capabilities a lot, and it will help to get my PoC for the image upload running.

Currently I'm proud like a king :-)

The PR will take a while ...

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 2, 2020
… converts it to a fixed image and uploads it to Azure. Everything with standard tools (AZcopy, qemu-img, ...).
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 2, 2020

jomeier commented Feb 2, 2020

What does this PR do:

  • do whatever has been done before on the infrastructure side
  • get a SAS-Token from the previously created storage account
  • create an Ubuntu VM from the Azure marketplace
  • copy a Bash script into it
  • run the Bash script
  • this Bash script does this:
    • download AZCopy (Azure tool for uploading files into Azure storage accounts)
    • download the compressed VHD image (at the moment only xz decompression is supported)
    • decompress it
    • convert it to a Raw image with qemu-img
    • calculate a size rounded up to the next MB
    • resize the image to this rounded-up size
    • convert the image to a fixed size VHD
    • upload the image to an Azure blob storage container with AZCopy and the SAS-token
  • Terraform automatically waits for the script to finish and afterwards creates the fcos VM image from the VHD file in the storage account. This is possible thanks to the 'remote-exec' provisioner, which previously didn't work with the openshift-installer. I had to create a separate commit for that; figuring out how to enable it was most of the work.
  • Everything else works as usual

This 'supporting VM' thing might look a little bit confusing at first but ...

Advantage to the previous solution:

  • no local download/upload of huge amounts of data is necessary. The current behaviour of downloading/uploading the VHD file locally doesn't work properly in environments with poor internet connections (like mine)
  • everything happens on the Azure side, where we benefit from good network throughput rates
  • the VHD conversion can be performed with "standard" tools like qemu-img and AZCopy, with no need to rebuild everything in Go on our own

Please try it out and don't hesitate to give me feedback.

Thanks a lot also to @vrutkovs and @mjudeikis for not giving up on me ;-)


jomeier commented Feb 3, 2020

This would be a temporary solution until the fcos image lands in the Azure Marketplace.

@LorbusChris
Member

@jomeier thank you for working on this!
Unfortunately I don't think this is a workaround we want to merge here. I think the proper fix would be to change coreos-assembler buildextend-azure to output a compressed fixed size VHD image (coreos/fedora-coreos-tracker#361).

coreos/fedora-coreos-tracker#148 would then just be one more step to automate.

@LorbusChris
Member

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 3, 2020

jomeier commented Feb 3, 2020

@LorbusChris:
I understand.

If the coreos team converts the fcos image (I think that's a good idea): how does this 8 GB image get into the Azure storage account?

If the installer must upload it, this will take very long. That doesn't seem feasible if it's necessary each time we run the installer.

Yes, it would be great if we didn't have to deal with the fcos image at all. I hope it gets to the Azure Marketplace very soon.

@LorbusChris
Member

I think this is a larger topic and needs to be decided on the architects' side.

You could create an enhancement proposal for the installer to support remote-exec and local-exec as a first step in https://github.com/openshift/enhancements

However, I don't think running an additional VM for this step is the general direction we want to take here (at least not by default).

Right now we want to keep the delta between the fcos and master branches small, so I'm very hesitant to accept this as-is.

xref'ing: #3042
and cc'ing @abhinavdahiya for additional comment from the installer team's side.

@LorbusChris
Member

@jomeier btw do you mind updating the first comment in this issue with the summary you wrote just above here? ^


jomeier commented Feb 3, 2020

@LorbusChris:
Let's hope that there will be some positive signal on the fcos side very soon regarding pushing the image to Azure for us.

Treat this PR as a way (even if it's hacky) to get the image automatically uploaded to the correct location during the installation process, so people can test OKD's Azure support until the fcos image is available on Azure.

I fully understand your arguments and I don't expect you to accept the PR. Everything is fine.

@jomeier jomeier closed this Feb 5, 2020