Allow configurable timeout for WaitForFileSystemAvailable #382
I wrote a small script to check the creation time of the filesystems and found that as they get bigger they take longer to create. I did my tests in eu-west-1 and I'm not sure if this is something specific to that region.
So I think it would be great if the timeout could either grow dynamically with storage capacity or be available for manual configuration.
@jonded94 and I dug a little deeper to figure out where the RequestCanceled (above) was coming from and found where it's actually happening.

Faulty request:

Working request:
We also found that we only get these errors for filesystems larger than 100TB. We have successfully created several <100TB FSx volumes. It seems that the large volumes always fail after 5 minutes of creation, and the smaller ones do not run into this RequestCanceled.
Edit: Looking at the readme: https://github.com/kubernetes-csi/external-provisioner/tree/master?tab=readme-ov-file

It seems more likely that the describe call is catching the context cancellation rather than being the cause itself. The context is probably being canceled on the CSI side, during either the csiClient.CreateVolume or Provision calls. Potentially this? The 5 minutes would line up with your observations.
We wanted to do a deeper dive into what the problem is and test when exactly the filesystem creation fails and when it succeeds.

**Unsuccessful test**

For our first test we created a 150TB volume:

**Successful test**

Now we tested with another, smaller volume and created a 90TB volume.

From what we observed in our tests, after the CSI provider timed out, the CreateVolume function is called again and, instead of checking whether a filesystem is already being created, it first checks whether the quota would be reached with a "new" filesystem.

**Workaround**

In a next test, we tried to increase all configurable timeouts, mainly for the csi-provisioner (…).

A temporary fix for us now would be to increase the quota for Lustre FSx, and in the long term it would be helpful to change the logic of how an ongoing volume request is detected so that it happens before a new volume is created.
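The ordering problem described above — checking the quota before checking for an in-flight filesystem — could be fixed by doing the idempotency lookup first. A sketch under assumed names (createVolume, the in-memory map standing in for a DescribeFileSystems lookup, and the quota parameter are all hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// creations tracks filesystems already being created, keyed by volume name.
// In the real driver this lookup would query the FSx API instead.
var (
	mu        sync.Mutex
	creations = map[string]string{}
)

var errQuotaExceeded = errors.New("file system limit exceeded")

func createVolume(name string, quotaLeft int) (string, error) {
	mu.Lock()
	defer mu.Unlock()
	// Idempotency check FIRST: a retried CreateVolume for a filesystem
	// still in CREATING state must not count against the quota again.
	if id, ok := creations[name]; ok {
		return id, nil
	}
	if quotaLeft <= 0 {
		return "", errQuotaExceeded
	}
	id := "fs-" + name // placeholder for the ID returned by CreateFileSystem
	creations[name] = id
	return id, nil
}

func main() {
	id1, _ := createVolume("pvc-1", 1)
	// Retry after the provisioner timeout, with the quota now exhausted:
	// the in-flight filesystem is reused instead of tripping the quota check.
	id2, err := createVolume("pvc-1", 0)
	fmt.Println(id1 == id2, err) // → true <nil>
}
```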
Note: It seems like the quota check is being triggered by a new create API call. Needs further investigation to confirm.
Hello, we are trying to create a StorageClass with the following conf:
And it fails. If I try to create a FS with the same parameters via the UI, it gets created in no less than 6 minutes, so the 5-minute timeout is a bit problematic. Any idea when this will be solved? Thanks 🙏
@JoseAlvarezSonos have you tried creating it using the dynamic provisioning mechanism? The 5-minute timeout shouldn't impact being able to provision a new filesystem unless an additional call would exceed the quota.
@jacobwolfaws we are using the Dynamic Provisioning with Data Repository example from this repo. That's what you mean by "dynamic provisioning mechanism", right?
yes
@jacobwolfaws in that case yes, but we see the exact same problem:
BTW I obfuscated some values.
I recommend configuring the csi-provisioner to have a longer `--timeout`.
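For reference, the external-provisioner sidecar accepts a `--timeout` flag. A container args fragment raising it might look like this (image tag and the 20m value are illustrative, not a recommendation from the maintainers):

```yaml
# csi-provisioner sidecar container (fragment, values illustrative)
- name: csi-provisioner
  image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.0
  args:
    - --csi-address=$(ADDRESS)
    - --timeout=20m   # raise the per-call timeout above the FSx creation time
    - --v=5           # higher log level helps when debugging provisioning
```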
@jacobwolfaws I read that, but the Helm chart doesn't provide a way to configure that. The args, where I assume I would need to add this, are hardcoded and limited to this. Any recommendation on that front?
In the short term, you could use kustomize or maintain your own version of the Helm chart. This is a gap we'll close before the next CSI driver release.
@jacobwolfaws several things were involved in @jon-rei's and my tests to make dynamic volume creation of sizes >80TiB possible through this driver.

Have you been able to test our PR or to debug the double creation call?
@jacobwolfaws I tried your suggestion but it didn't work. To try "an extreme" to see if that was the issue, I increased the
Logs:
Not an expert here, but the 10min timeout in
Can you also collect the logs from the provisioner and fsx-plugin containers? It would also help to increase the log level to 5 for those containers in the values.yaml file of your chart.
@jonded94 I'll try and let you know, thanks. @jacobwolfaws I enabled more logging, but I get tons of leaderelection messages, which are not that useful or interesting; the only other notable messages are these:
It's worth mentioning that I tried adding an exportPath to see if it helps, but it didn't. I also tried with a smaller S3 bucket (less than 5MB total size), and it worked in around 5-10 minutes, but the "important" bucket is around 200TB, and in production it will be around twice that size. Nonetheless, if I create the FS manually, it works, so maybe there's a parameter missing to avoid copying the whole content of the bucket, or populating the metadata, or whatever is taking that long.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
It seems that while the CreateVolume call is idempotent, the CreateFilesystem call is not (after the first CreateVolume reaches its timeout). We recently had an issue with orphaned filesystems in the k8s testing account, and I'm inclined to believe the issues are related. I believe fixing the CreateFilesystem issue / having CreateVolume check for a filesystem already being created should also fix the problems you've been seeing.
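The orphan scenario described above can be simulated in a few lines: a non-idempotent create called twice (as happens when the provisioner retries after a timeout) leaves two filesystems for one volume, while a lookup-before-create path reuses the first. All names here are hypothetical, not the driver's actual functions:

```go
package main

import "fmt"

// createFileSystem models a non-idempotent API: every call provisions a new
// filesystem, even if one for the same volume name is already CREATING.
func createFileSystem(fleet *[]string, name string) string {
	id := fmt.Sprintf("fs-%s-%d", name, len(*fleet))
	*fleet = append(*fleet, id)
	return id
}

// createVolumeIdempotent checks for an existing filesystem for this volume
// name before creating, matching the fix proposed above.
func createVolumeIdempotent(fleet *[]string, index map[string]string, name string) string {
	if id, ok := index[name]; ok {
		return id // a retry after a timeout reuses the in-flight filesystem
	}
	id := createFileSystem(fleet, name)
	index[name] = id
	return id
}

func main() {
	var fleet []string
	// Non-idempotent path: a provisioner retry orphans the first filesystem.
	createFileSystem(&fleet, "pvc-1")
	createFileSystem(&fleet, "pvc-1")
	fmt.Println(len(fleet)) // → 2

	fleet = nil
	index := map[string]string{}
	createVolumeIdempotent(&fleet, index, "pvc-1")
	createVolumeIdempotent(&fleet, index, "pvc-1")
	fmt.Println(len(fleet)) // → 1
}
```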
Is your feature request related to a problem? Please describe.
We are starting to use Scratch 2 FSx FileSystems and are doing so successfully for small ones. But when we try to create FSx volumes >50TB we always run into timeouts:
It seems that the timeout is always 5 minutes, but creating a 72TB volume for example takes ~12 minutes.
Describe the solution you'd like in detail
It would be great if this timeout could be increased or made configurable since it's hardcoded here.
Describe alternatives you've considered
Right now we are trying to create the filesystems manually and then create a PV + PVC. But we would really like to create the volumes dynamically for the pipelines we want to run.