From 41ccbff169bf292e037afae8fb181801c846da05 Mon Sep 17 00:00:00 2001 From: Drinkwater Date: Tue, 13 Aug 2024 22:54:11 -0400 Subject: [PATCH] Update 019-2024-07-24.md added transcript chore: add recording link Signed-off-by: Drinkwater --- sig-providers/meetings/019-2024-07-24.md | 230 ++++++++++++++++++++++- 1 file changed, 229 insertions(+), 1 deletion(-) diff --git a/sig-providers/meetings/019-2024-07-24.md b/sig-providers/meetings/019-2024-07-24.md index 16876ea0..63aa33e2 100644 --- a/sig-providers/meetings/019-2024-07-24.md +++ b/sig-providers/meetings/019-2024-07-24.md @@ -97,4 +97,232 @@ - The core team to monitor and validate reported issues, providing timely updates and solutions. - Encourage community members to report issues and provide feedback to improve the operator inventory and overall system stability. -## Transcript +# **Transcript** + +_This editable transcript was computer generated and might contain errors. People can also change the text after it was created._ + +Tyler Wright: All right, welcome everybody to providers monthly meeting number. + +Tyler Wright: 19 it is July 24th, 2024 during these Specialist Group for providers the group anybody that cares about anything a cost provider related comes together to discuss tooling to discuss pain points any issues related to providing just general conversation in the past have been a number of discussions that have spurred out of here that turn into working groups of terms into governance proposals a number of folks who will get an update from today include folks like jigarette and Deval. Whoa previously been working on Predator provider app. Now it is going to be a part of the open source console. And so again, there's been some updates on a monthly basis from the page 14. We'll get some more updates today out of the last + +Tyler Wright: Special interest group session there was again some updates from the paid our team about console 2.0 and we'll get some more there. Damir gave a presentation on some tooling that he's been working on for active providers. We can potentially get an update there Andrew Mello and another Community member talked about some mining projects and some of the tools he's been working on and then again, there was some open floor discussion. I think we'll also get an update from Scott who has been working on provider build improvements. And I know that a number of people from the community had tested and used the provider build scripts that he's been working on. So again, we'll talk about a little bit of a future plan there. + +Tyler Wright: If anybody has any other agenda items that they want to cover during this special interest group monthly meeting. Feel free to drop those in the agenda. I'll also remind folks that there's any active discussions. + +Jigar Patel: Hey, thanks. + +Tyler Wright: I think there's a couple of discussions related to provider. + +Jigar Patel: Tyler. So currently we are still working on adding similar experience. + +Tyler Wright: But if anybody can get involved in discussions again,… + +Jigar Patel: What console has right now? + +Tyler Wright: this is a chance to make your voice heard,… + +Jigar Patel: So what we are actually building is similarious experience on provider side. + +Tyler Wright: so I'm feel free to do so. + +Tyler Wright: Cool with that said I will hand it over to jigarette… + +Jigar Patel: So you would see the similar user interface plus with the help of Scott's new scripts that enable us to make k3s multi note provided using Scott scrapes. + +Tyler Wright: Deval to give us just an monthly update as they usually do about some of the console 2.0 updates paid or Etc. and nobody else. + +Jigar Patel: We gonna simplify a lot of things on the provider side. So yeah, we are working on it right now. So hopefully that goes smoothly and yeah, that's specifically from our side. + +Jigar Patel: If you have any questions, please let us know. We did a comprehensive testing on K3 years multinode setup found really useful and found very promising right now. + +Tyler Wright: Excellent. Thank you. Jigar. I'm gonna drop in again a link to the product and… + +Jigar Patel: So looking forward to add that into Twitter replacing the old KS and… + +Tyler Wright: engineering project board there. You can track everything from again all the work that parade tour. + +Jigar Patel: civil script so that we were running from Twitter. + +Tyler Wright: The prayer team is done in around the console 2.0 some of the things that they've been working on including,… + +Jigar Patel: So yeah, it will simplify a lot of things on provider side simplifying. + +Tyler Wright: some of the things that jigar just mentioned around testing of the provider builds process Improvement to Scott has made as well as some of the next steps that they'll be working on for console 2.0. + +Jigar Patel: Support as well from our side as well. So yeah. + +Tyler Wright: So again, you can track everything and kind of real time and leave comments and questions inside the specific projects. We'll talk a little bit more in detail about that during the steering committee meeting happening. I don't know even what date is. Yes, today's Wednesday. So that means tomorrow is Thursday. but again much appreciate all the work that Jigger Deval and everyone else is doing around console 2.0 but especially their worker on the provider side. I won't go into too much detail, but I also want to shout out jigar. I know he's a new dad. I won't Docs. The baby or when it was born, but I know it was recent and so huge congratulations to jigar. I know sleep is a luxury now, but again still a lot of work happening on the Prater provider side Council 2.0. So huge shout out to jigar. + +00:05:00 + +Tyler Wright: ette excellent All right. Any other questions on what's going on with Prater and some of the next steps integrating into console 2.0? + +Tyler Wright: cool All right, next up. I want to see if Scott wanted to give a quick update. I know jigar kind of alluded to it in terms of incorporation of the provider build scripts into console 2.0 as well as the testing that's going on around again, the improvements to the provided Bill process Scott. I just want to see if there's anything you wanted to mention to this group. I know there's some insiders here some folks that interface a lot with people in the providers build Channel Etc. So I just want to see if there's anything that you want to share at this time. + +Scott Carruthers: Yeah, really? They provide a build scripts have Completed I really haven't touched those scripts and probably going on a month and a half to two months ago. So they're just being tested and used more and more efforts. I look at people in the audience for this call. I think just about everyone is very familiar with them. So shimpa is now built a test provider using those scripts and provided some minor feedback on that. We have already implemented obviously jigarh and Deval are very familiar with the scripts of use them many times Deval actually went through and maybe this is the most significant update Deval went through + +Scott Carruthers: A list of any concerns that we had around k3s, and in the providers that are built using these scripts. So, very granular things such as if we proceed any challenges with Calico or the cni used within k3s. at CD performance and things like that. So across those lists of possible concerns, we have Solutions and feel comfortable across all those items. So obviously jigarh and Deval are very familiar with these scripts. I know that Zeke who's also with us was one of the very first community members that test these scripts and this is probably at it I'll stop my head maybe three months ago Zeke tested these scripts and had Relatively, no issues, and I know that Andy is going to start using. Them and some of our overclock managed. + +Scott Carruthers: Providers. So yeah, I think the big updates are they just starting to get implemented more and more. I think obviously the big milestone will be when the printer team implements these scripts within their a cost provider Bill process, which I think has come in very soon. Now that they've done very thorough testing. So I think that's the next Milestone and then after that obviously we'll collect Community feedback on any issues that anyone has encountered and make any tweaks necessary and then the next step after that would be to it include them as our kind of default or recommended provider Bill process. and again, I think pretty much everyone on this call was already familiar with this, but I will just briefly share if anyone is not familiar with us or wants to + +Scott Carruthers: I'm just gonna share my entire screen for a second if anyone is not familiar or wants to experiment with these more right now, kind of my working engineering documentation as always as a caution Engineers that XYZ and if you go to the Akash provider automated build through they temporary repository that these scripts live in and then step by step. A series of events that you would go through to build a provider and these in this manner and there's a lot of different options here. So I think it's pretty robust at this point where if we're doing a all-on-one provider we have that we have options if it's a GPU provider to install the necessary NVIDIA drivers and + +00:10:00 + +Scott Carruthers: runtime and plugins. We also have options to enable persistent storage. So for anyone that's not familiar if you want to play around with this, but we'll have more people experiment and give us any possible feedback. So yeah, think So just in recap the scripts have been rather stable and available for some time. I think the next Milestone is the Prater team. Implementing those as part of their build process and then we'll go from there. + +Scott Carruthers: back to Utah + +Tyler Wright: Does that thank you. Anyone have any questions about provided Bill script next steps. + +Tyler Wright: I am going to drop a link to the Akash Engineers documentation. Just if anybody from the community or anybody that's maybe having problem with the traditional way wants to use this way. It is in beta, so remind folks of that. But again, we can start to share with more folks in the community. Go ahead Zeke. + +Zeke E: This may be more of a time of an offline question. But what are the main differences between K8 and K3? And is there anything? what's the worst part about using K3 that KS has a K3 doesn't? + +Scott Carruthers: Yes, I can answer Andy has some thoughts as So. we're not aware at this point of any limitations. That obviously K3 is a more lightweight. Distribution of kubernetes. So the install process is much faster, which is one of the reasons that the provider build scripts can build the underlying cluster so quickly even in a multi-node environment. So as far as they scale down version and anything in k3's that would be included in full kubernetes or K8. I'm not aware of this. Obviously. This is part of the validation testing that we wanted to go through but as far as a kubernetes cluster, and it's + +Scott Carruthers: Necessary capabilities to serve in a cost provider as the container scheduling mechanism for a cost providers. We're not aware of anything that is not available and the lightweight 3s that is available in K8. So, yeah, it's just a lighter weight distribution and something obviously can With K3 is the appeal of it to become more and more adopted in Enterprise environments. I think in the past 3s was used significantly in lab environments and also heavily in single node all in one node provider topologies, but it's starting to be adopted in Enterprises using multi-node configurations much more significantly. So yeah, it's just a much lighter weight version of kubernetes. And obviously that speeds they Install time and management Andy. I don't know if you have any other. thoughts on those regards + +Andrey Arapov: I guess not it's what you said is it basically allows easier and faster kubernetes and deployment. Because obviously keep spray. You basically spend a little bit more time on inventory preparation and mostly on cloning the git Club get a project of keep spraying making sure you have the latest version instead. You can just run E3 has comments and just install the latest greatest and they have great process is quite similarious pretty much one common close to it. And it's basically save time and in terms of presentation, they haven't found any yet. And if you find of course will what everyone know by adopted there will be some limitations. + +Andrey Arapov: Except maybe for operational standpoint a few where this keep spraying. You have more Standalone components such as tubelette and so on you can restart particular component separately and whiskey K3 as you would need to restart KJS entirely so it could impact on the API will do it if you're starting to control playing mode. So there are smaller drawbacks, but they're just like tiny ones and + +00:15:00 + +Tyler Wright: Excellent. I say a hand Go ahead Andrew. + +Andrew Mello: Hey guys. I'm sorry. I'm a bit late. I did have a question slash feedback that I want to give at some point. Don't know where you're at in your process, but just want to let that be known. + +Tyler Wright: Excellent cool. I'll add that to agenda. We've talked about console 2.0 and some of the work there we talked about provided build scripts. The only other agenda item I had was I know damir had started working on a project last month. I just want to see if there's any updates that damir wanted to share and then enter I can kick it over to you for your feedback. Such questions. Damir. Did anything you want to share? + +Tyler Wright: Yeah, No worries. I'm sure there'll be a continued updates between these meetings and then we can check in with you again next month. cool All right, again, if anybody has any agenda items that they want to cover they haven't been covered already. Feel free to drop them in the chat, but Andrew Mello, feel free to ask your questions / comment. + +Andrew Mello: Yeah, hi. Good morning. Everyone good evening, depending on where you are. So I just wanted to bring as kind of a point of order slash little feedback sessions today on the operator inventory specifically. I'm aware that at the beginning of the year there were some major updates to it and now a few months down the road across my providers. I have noticed some really unruly Behavior with this new operator inventory and kind of the way it's built and some of the experiences I've had with it are really less than ideal. there seems to be a heavy Reliance on the kubernetes cluster kind of always being perfectly available in a perfect state. So one of the first places that I had issues with it was when the k3s cluster I use did not use a local. + +Andrew Mello: Of etcd for example and instead use the database backend. So that was the first place that I noticed. it would have trouble finding and using the actual cluster and then I would get errors inside of the only mind again operator inventory only we're talking about here where I would get errors about, + +Andrew Mello: Unable to reach the cluster exiting failing exit, it seems like whatever happened with operator inventory. There were some pretty significant changes that have impacts and outcomes that were not fully necessarily investigated. So let me give you another example of that which is when nodes go offline or are removed or are added that also can trip up the operator inventory where when it's unable to talk to one of the nodes again, it does not always populate the entire inventory. So one of the consequences of when this operator inventory gets jammed up or stalled up or the cluster hiccups or something happens to the operator inventory where it's not very happy the end result is that the status page of the provider will show no inventory at all it + +Andrew Mello: Just go available and then two brackets right as though empty. And so I wanted to just a high level. If somebody who built that or could give kind of a high level explanation of what the architecture kind of layout is because something changed significantly where now if the cluster of the node or there's one thing kind of off the operator inventory is very unstable. + +00:20:00 + +Andrew Mello: + +Andrew Mello: Across, again, I have nine different clusters that I manage. I have a bunch of different environments some with local etcd without it some with nodes that are failing some without it. Right. So I've gone through and seen over time now, so I brought up yet a third just to again let somebody talk to this now after this third point but I did post my first support issue for this over two weeks ago regarding the length of the pods being created the name length. So once again showing that it didn't necessarily go through a really deep testing in an in-depth step with the way this was built… + +Tyler Wright: excellent I would say the best course of action would be to certainly follow up with those tickets issues on to other two education. + +Andrew Mello: but when the new operator inventory Hardware Discovery pods are being generated that name is so long that it's more than the 63 character limit on the Pod name. + +Tyler Wright: You just described here in this call as well and try to be a specific as possible. But that I believe there was already discussion around the character length in our six support call that just happened an hour ago. + +Andrew Mello: So that was my support ticket I open + +Tyler Wright: I think there's going to be investigation from the core team on next steps there and then we'll follow up with the ticket. But again, I think those go ahead. + +Andrew Mello: But honestly, there's two or three more tickets coming behind that I can document already one where it's failing when the etcd leader changes now, so that was the one I got two days ago where the leader of the cluster changed and that was enough to kill the entire operator inventory, which I found very very surprising. So those are my concerns I would love if somebody could tell me what is the expected operation and behavior? And why was it updated in this way? And then how do we get to a point of stability with it?'re kind some of these edge cases that I described are gonna get handled because the end results is instability right providers are depending on the status of the cluster, right if nodes are up down that operator inventory seems very fragile right now. So that's my first foray into operator inventory. Yeah. + +Andrew Mello: right Yeah, just scaling back right pulling back from that single ticket. + +Scott Carruthers: Yeah, so it was actually our care that did the primary code on this. + +Andrew Mello: I wanted to bring this to you… + +Scott Carruthers: So again,… + +Andrew Mello: Scott and… + +Scott Carruthers: I would suggest some… + +Andrew Mello: I think it was Scott who did the updates to the operating room I don't know I understand you guys want to do more complex stuff with gpus or… + +Scott Carruthers: what Tyler is suggesting that so we have a number of providers on this call. I and… + +Andrew Mello: the way you touch them… + +Scott Carruthers: edge cases are great and… + +Andrew Mello: but it looks like it's opening a listening service on some I think it's Port 801 or… + +Scott Carruthers: we want to discover edge cases and they maybe wouldn't even classify them as cases regardless. There's a number of providers on that. + +Andrew Mello: something. + +Scott Carruthers: I think they're has been relatively obviously we deal with them as they come up. + +Andrew Mello: and then if that listening service becomes and unavailable the whole thing goes and… + +Scott Carruthers: So I'm trying to think before you opened a issue on the hostname limitation. + +Andrew Mello: kind of as they say star T the bed it dies it falls off and… + +Scott Carruthers: I'm not sure the last time that we've had an issue opened against the inventory operator. + +Andrew Mello: and it doesn't reset itself. + +Scott Carruthers: So to answer your question generally so obviously the driver behind. + +Andrew Mello: So if it gets into an error state it just kind of stays there… + +Andrew Mello: until you pop the operator inventory pod, so + +Scott Carruthers: so the inventory operator was pre-existent,… + +Andrew Mello: I have not had a good time with it and… + +Scott Carruthers: but we initially Labeled entitled feature Discovery came about as we were doing provider incentives and… + +Andrew Mello: I did want to raise that in a call here because it's a much bigger issue than just that character limit. I've now seen at least three others that will just kill it by normal,… + +00:25:00 + +Scott Carruthers: we need the ability to have a constant inspection of gpus that are hosted on a providers in provider worker nodes so that we set up a grpc endpoint that allows I know that the cloud most team now is doing indexing against that so again that gives us the ability… + +Andrew Mello: cluster things that happen. + +Scott Carruthers: if we're incentivizing a provider we can provider our often necessary via that grpc on point. So theoretically we can pull every + +Andrew Mello: It may be Scott or Andrey. Just give me a high level of what the new operator and why it's written that way. But the listening port and setting up like a server client thing like what's going on there because I get it just came out of nowhere this spring. So I'm just kind of curious What is the expected behavior of all that why is it done that way? + +Andrew Mello: Wow, okay. + +Scott Carruthers: I come up but relatively haven't heard or seen of any of criticality in the last several months. + +Andrew Mello: Okay, that was very helpful to understand. I understand who built it and a bit more the context as I shared in my feedback. I think it needs significantly more testing specifically around managing state of the nodes and what's available because I mean you can have nodes that are not even running on a cluster and it's still trying to do Hardware discovery on them. Right the kind of the concept of round how it's maintaining that open port and connection and Reporting back what's available it needs more work. So I will file those bugs. I did want to raise it the bigger issue and kind of the high level before I filed those bugs to you guys. So you're kind of aware of what's coming. But yeah, that's my specific concern is that it's not battle tested yet and it's such a big change and the consequence is your provider does not bid and it does not show. + +Andrew Mello: So any inventory it will just show, two brackets for available and nothing inside of them. And that's what I've been seeing now for a few months and anyone's bring it up. So thank you so much Scott for filling in the detail there. Appreciate it. + +Scott Carruthers: Yeah, yeah, no problem. So again, when we have issues opened there will actually be an update on the issue regarding the hostname length later today and some validations that I've gone through for that. So, yeah, you'll have an update on that issue very shortly. I think by the end of the day today and some testing that I've done against it and if you open an issue for the STD leader change and we'll definitely validate and investigate that as well. + +Andrew Mello: yeah, and like I said, because of the way it's built right with that at CD leader change I kind of said it once so I just say it a second time before I finish here, which is + +Andrew Mello: the way that this monitoring is built is that if the cluster has any state change right? It's like to itself then that breaks the ability to monitor that colon 8001 or whatever Port that's always expecting a query and then it's kind of stupid. It doesn't do anything. It just breaks after that and exits out. It doesn't have the right condition set for the cluster, basically restarting itself, which it will do sometimes especially if you restart let's say, a leader Etc. So it needs a lot more robust testing when it expects to constantly have available one of these ports right on all the notes. So that's what I've been seeing now at scale over time and that again, if you're just running maybe one cluster if you're not gonna see this stuff and again, so that is + +00:30:00 + +Andrew Mello: Back, thank you guys so much appreciate the time. + +Tyler Wright: Thank you, Andrew and as you see these issues is always great to put them in the support repo. That's all issues where it's a core product go and then again members of the core team will look at those validate those so much appreciate the time that you put in there and thank you for bringing it up here. This is what these calls are for. We talked about updates. We talked about new tooling, but we also talk about issues related to provide her and then have discussions. So, thank you. Any other agenda items that anybody wants to cover here related to their providers or anything else provide a related on the Akash Network? + +Tyler Wright: If not, then again, we'll look for some updates potentially from damir at the next providers. If not before there are a couple of teams that have reached out to me to do some demos at the August. + +Tyler Wright: Providers meeting I'll follow up with some more details and Discord from those teams that are trying to again build microservices, I guess on top to support a cost providers. I let them explain themselves, but I know they've been reaching out to supervising the ecosystem for some feedback and testing again. We' Andrew will look to follow up and create some issues related to some of the subjects. I've… + +Andrey Arapov: Thank you. + +Andrew Mello: + +Tyler Wright: I've read goodbye. He talked about around… + +Scott Carruthers: Thanks, everyone. + +Tyler Wright: + +Tyler Wright: around the inventory operator. + +Jigar Patel: Thanks. + +Tyler Wright: And then again, we'll get some continued updates over the next month from jigarette and Deval and all the work that they're doing. If there's nothing else then again, feel free to take some time back. If anybody has any questions, feel free to drop them in the sick providers Channel and we will talk again next month. Thank you all very much for the time today much appreciate the conversation. + +Meeting ended after 00:32:28 👋