diff --git a/sig-support/meetings/037-2024-09-18.md b/sig-support/meetings/037-2024-09-18.md new file mode 100644 index 00000000..c6b4c0b3 --- /dev/null +++ b/sig-support/meetings/037-2024-09-18.md @@ -0,0 +1,251 @@ +# Akash Network - Support Special Interest Group (SIG) - Meeting #37 + +## Agenda +- Opening Remarks and Overview +- Triage and Discussion on Akash Network Issues +- Console Issues Update +- Discussion on Bugs and Upcoming Issues +- Community and Discord Behavior Updates + +## Meeting Details +- Date: September 18, 2024 +- Time: 07:00 GMT-7 +- [Recording](https://evdg7ndlhwtfacwpgitrps3uw2lsgh5wz3vj4iiow5d5mewqya3a.arweave.net/JUZvtGs9plAKzzInF8t0tpcjH7bO6p4hDrdH1hLQwDY) +- [Transcript](#transcript) + +## Participants +- Tyler Wright +- Scott Carruthers +- Andrew Mello +- Damir Simpovic +- Hiroyuki Kumazawa +- Maxime Beauchamp +- Rodri R + +### Opening Remarks +- **Tyler Wright** welcomed attendees to the Support Special Interest Group (SIG) meeting focused on Akash network support issues. +- Mentioned open invitation for any additional agenda items from participants, with a special focus on support and community interactions on Discord and other channels. + +### Triage and Discussion on Akash Network Issues +- **Scott Carruthers** presented a screen share, beginning with triaging current issues in the Akash Network support repository: + - **Internal Tracking Item:** Assigned to **Andy and Devolve** from the core team. Label updated for internal logging investigation; clarified it’s focused on internal workloads and Akash console. + - **Shell Access Issue:** Resolved the shell access issue related to the provider pod restart. Updated labels and clarified it doesn’t resolve the issue of lease events dropping over time, which needs further investigation. + - **UDP Protocol Misconception:** Issue raised by **Zeke** on UDP support availability for Akash deployments. After testing, it was confirmed that UDP is supported alongside TCP. Specific issue found with using the same port for both protocols in game server deployments, which requires further work on Kubernetes to support named field differentiation. + +### Console Issues Update +- **Tyler Wright** inquired if there were any significant console-related issues that needed discussion. + - **Maxime Beauchamp** mentioned minor issues but none urgent for the current meeting. + - **Andrew Mello** raised a discrepancy in the bit engine script’s calculations on Helm charts versus the Akash console, which affects deployment CLI versus console. + +### Discussion on Bugs and Upcoming Issues +- **Andrew Mello** introduced three critical bugs found during testing: + - **Timeout Limit for Large Containers:** Five-minute timeout for deployment completion may not be sufficient for large containers (e.g., AI models), particularly on high-speed connections. Clarified it might not be a network parameter; **Scott** and others to investigate further. + - **Control Plane Bidding Issue:** Control plane nodes are bidding when they shouldn’t, affecting deployment accuracy and network performance. Scott noted mechanisms to address node exclusion in inventory operators, but this needs further investigation. + - **CPU Parameter Rollback:** Observed rollback from **512 CPUs** to **384 CPUs**, possibly unintentional. Andrew plans to create an issue to revert this limit back. + +### Community and Discord Behavior Updates +- **Tyler Wright** opened the floor for community behavior issues, particularly around Discord spam and bot activity: + - **Damir Simpovic** observed spam behavior involving bots posting links after conversational exchanges. + - **Rodri R** reported recurring issues with new users creating spam-like chats, submitting tickets and support requests, and occasionally including screenshots. + - **Tyler Wright** confirmed action items for refining the blacklist of restricted words and addressing spam through moderator tools and the Akash Insiders program. + +**Closing Remarks:** +- **Tyler Wright** encouraged all participants to stay engaged in SIG activities, emphasizing collaboration through GitHub and Discord channels. +- Expressed gratitude for everyone’s contributions, especially in community support and repository management, and wished everyone a productive week. + +## Action Items +- **Andrew Mello**: Create formal issues for the bugs discussed, including timeout limits, CPU parameter rollback, and control plane bidding issues. +- **Scott Carruthers**: Investigate the feasibility of updating the Akash network’s SDL to support simultaneous TCP and UDP on the same port. +- **Tyler Wright**: Review and update Discord’s blacklist and moderation settings, based on the observed issues and participant feedback. +- All participants were encouraged to submit any additional insights or issues in Discord channels and GitHub between meetings. + +# **Transcript** + +Tyler Wright: Hello, welcome everybody to SIG support. It is September 18th, 2024. The SIG support is a special interest group that focuses on all things on the cost. That's covers everything from Any issues related to the support repo. Some folks can talk about anything related to the console repo. I see there's a couple of representative from the console team here. And then again, we take some time towards the end to talk about support and discord and other channels. So if anybody has anything that they want to add as an agenda item, feel free to drop it in to the chat. We'll make sure we're discuss it at some point today. + +Tyler Wright: Right now, I'm gonna kick it over to Scott, who will get us started with triaching, any issues that are waiting trios and we can talk about anything and support repo. So Scott, I kick it over to you. + +Scott Carruthers: All right, thanks, Ty I will Share out my screen. + +Scott Carruthers: Okay, so I think this is actually going to be a pretty quick triage process today. So if we look at the Current issues on the Akash network support repo. We could see during this and I'll open up for discussions as well. If anybody wants to talk about any other issues but we initially just go through and make sure that nothing is in waiting tree action assignment. So I reviewed These prior to our call. And I think this is going to be a pretty quick process. And again, we'll open it up for questions. Afterwards are all open up for questions as we go through each of these as so, Start from the bottom here. this is actually just an internal tracking item. That is already assigned to Andy and Devolve from the core team. So this is just simply a matter of it's not assigned any label so I'm gonna take the + +Scott Carruthers: Awaiting triage off. And I'll put this in the Provider o, label. so again pretty simple one and this was already assigned that this is already active investigation for logging options and some things that are internal team is doing. This isn't necessarily a logging solution for community providers. This is just a logging investigation for some internal workloads like a cash console and things that were investigating better logging options. So again, pretty simple one and basically just need to have the Correct. Label associated. + +Scott Carruthers: So, I'll go back to the List and bring up a waiting triage. This is a I'm gonna go out of order here a little bit because I'll discuss a little bit further. The other one I think is pretty simple. This is one that Max and I discuss weight last week. So recently, we have cells issues of the Shell access to a deployment breaking, if a provider pod is restarted. So that Disruption to Shell Access. It should be resolved and I think there was a belief that I was also going to solve lease events dropping off after time. + +Scott Carruthers: Which I don't think this is a necessarily associated with a provider pod restart, but for some reason, over time, Kubernetes events are no longer available through a class queries. So I thought that was solved again. When we investigated the shell break issues that this is still outstanding. So Max, open an issue. Myself and Archer to this issue. + +Scott Carruthers: And I will also put this one, take it out of a waiting triage and also put this one. And they. Repo Akash label. So I think this is a pretty straightforward issue that we're probably most are Already aware of. + +Scott Carruthers: Just reading this update that Andy put in as well. But anyways, so obviously this needs further investigation to find out why lease events are not available through a cash queries. After a period of time, any questions or does anybody want to discuss this further? + + +### 00:05:00 + +Scott Carruthers: So like I said pretty well known issue and then Lastly, there's a issues that I'm pretty familiar with. So this issue. + +Scott Carruthers: Right now. So Zeke brought this to my attention last week of there was an initial belief when he opened this issue that UDP support was not available for a cost deployments. So I walked through that was Zeke and prove that UDP is definitely available in a cost appointments by changing the defaults TCP protocol in the SDL. and then there was a belief that TCP and UDP were not supported simultaneously and I proved that that wasn't the case either and showed deployments that were using Both TCP and UDP protocols, that we're working perfectly fine. + +Scott Carruthers: So, we went through a little bit of testing for what is issues. He was actually encountering and the issue was that for a video game server. So this is kind of a unusual circumstance. So you have a poor which I believe in this case was a video game server like a Minecraft server or I don't think it's actually minecraft. It was a different video game server. The utilize port 777 both for UDP and TCP. So that's kind of unusual in an application that you would expose. the same port for both TCP and UDP, but for game servers, this somewhat prominent because you want to + +Scott Carruthers: Make the mental load on video game, users a little less. So if they have to put in a custom port, they only want to have to update a single port and that would update both the underlying poor usage of TCP and UDP on the back end. So again not something that we would typically encounter TCP and UDP being used on the same port number. But for video games, that can be someone prominent. So I did some testing of this and this is indeed an issue. If you try to Expose a TCP on UDP port. So again, if we use seven seven as an example on an SDL and try to expose that, be a TCP and UDP the Kubernetes node part will actually be created for TCP. So I did some testing of this And stuff. + +Scott Carruthers: Possible in Kubernetes simultaneously on a deployment, both TCP and UDP using the same port number. But when you do that and starting to get a little bit deep into this. But in case anyone is interested, if you do this via a Straight Kubernetes, deployments. Not a cost appointment, but just a straight Kubernetes deployment. You definitely can use the same port number for TCP and UDP. But to differentiate those for underlying IP tables and other Linux and Kubernetes artifacts is necessary to use a name fields within the service and currently a cost doesn't support that Name field. So Anyways, that's a long-witted explanation of it currently is an issue and it's not possible to expose the same pork via TCP and UDP and the cost deployment. And it looks like we would have to add + +Scott Carruthers: Something into the SDL, it's particularly under the service to allow a name differentiator, that then allows Kubernetes to create the underlying artifacts to support that. So, again, I don't think it's a very prominent issue of and again, went through full validation that it is actually in an issue. So I think a somewhat where use case but something that we certainly need to solve for as there are a number of video game developers that are interested in deploying on a cache. So, with all that being said, I'll take this out of a waiting triage. And this also goes into the provider repo, and I'll, again assign myself and Archer to this. + +Scott Carruthers: Any questions about my explanation of this issue on this matter? + +Scott Carruthers: okay, if not and we go back and look at a waiting triage, we now have engineering assignment for all issues and sound like there was any questions about those specific issues or the latest issues open and they repos in the support repository. So with that I'll open it up. Are there Any desired discussions about other issues that I haven't covered. + +Scott Carruthers: Okay, maybe some will come up as we continue the conversation but with that, I think Triaging is complete tower. So I'll hand it back to you. + +Tyler Wright: Thank you very much Scott, Again, if anybody has anything that they wanted to talk about specifically, feel free to drop in and chat. One of the things that we also like to do, I think Max is here as well. I just wanted to see if there's anything console related on the issues front. I don't know if there's any good first issues right now but try to put you on the spot Max. Is there anything that you see on console issues front? That is worth talking about during the six support call. + + +### 00:10:00 + +Maxime Beauchamp: The bunch of issues, but I don't think any issues were talking about here. unless someone else has an issues that they want to talk about, + +Tyler Wright: Does anyone have any issues that they want to talk about? That's pertains to the cash console. + +Tyler Wright: All I see a hand Go ahead, Andrew. + +Andrew Mello: So I was just looking here, One of my issues does cover the console so figured I'd bring it up now. So I was noticing when doing a cost mining testing at scale that I was digging into some of the numbers a little more specifically. And what I found was a discrepancy between the bit engine script on the helm charts and the Akash console + +Andrew Mello: For the specific value of the total amount of time in a month, when it came to, the calculation of uakt per block. So I didn't file this yet. The three bugs I'm going to talk about today. I haven't filed yet. I wanted to bring them up here a little bit in advance. So what this causes right is kind of a discrepancy between what a user would see when doing a deployment with the CLI versus what they're gonna see on the console. + +Andrew Mello: Since they don't use the same number. So in the bug that I filed Maxime, I'll point you at these two lines of code and basically someone needs to decide which one is going to be the right number. Because right now, console uses a number and the bit engine script uses a number. So one of them has to stay one of the best to go. So I think it's time to square that off because again, it creates a discrepancy from CLI to console. And again, as kind of misunderstood, why we're using different numbers for that. So that was my only little akash console thing that I found in my books for today + +Maxime Beauchamp: Did you create an issue? + +Tyler Wright: + +Andrew Mello: I said that the three bugs I'm going to mention today. I wanted to discuss here first before I made issues about,… + +Tyler Wright: I'm not yet,… + +Andrew Mello: so I was bringing them up. + +Tyler Wright: I think his next step is going to create an issue. Before I move on, because I think the last portion of the six support call on the agenda is we see if any insiders anybody in the community has anything that they've seen in discord elsewhere, Andrew, I know that you mentioned having two other bugs that you want to discuss here before you formally create issues. Do you want to take time now to talk about the other two before we jump into the support stuff? + +Andrew Mello: Sure let's knock it out. that's kind of a high level one here that I want to bring up because I don't know where the best place to address this. I think it's a network parameter but What I've noticed. And I'm sure all of you have probably noticed in the last six months or so since ai workloads have taken off, that these docker containers are really big. We're talking multiple gigabytes. + +Andrew Mello: Sometimes even approaching 10 gigabytes. And what I found on the providers is that our timeout of five minutes, for a deployment to go from pending to deployed before it gets killed by the provider, that is not enough time anymore on a gigabit connection. With one of these large AI images or, model that's baked into an image again, it's just not enough time to download it and have it running and extracted within that time limit. So I don't know if that's a network parameter, that we need to vote on chain about to just change from five minutes to 15 or if it's more fancy. Does anyone know? + + +### 00:15:00 + +Damir Simpovic: I'm pretty sure that some of the providers. That are Overfox managed. Sometimes the Model, download and the Image downloads takes 45 minutes and it doesn't get killed, not sure. + +Andrew Mello: No, we're talking about different issues. I'm talking about the size of containers and extra large containers and now with LMS and all kinds of + +Andrew Mello: Packages inside of these containers. They are growing very large. once they're running and pulling models. I'm talking about the containers themselves with the Nvidia CUDA with, all the other packaging. They're now multiple gigabytes and on providers that have a Gigabit connection, That can quickly surpass the five minutes. So, who should I inquire with about this? Because I feel like This is a network parameter thing, just like the limits on CPUs memory disk that kind of the thing. It's not provider specific I believe. So, again, that's why I'm bringing it up on the call wanted a little more direction before I file a bug about this So, if anybody knows, if not, I'll keep poking around to figure out the answer. + +Scott Carruthers: Yeah, so I haven't so yeah just go ahead and create the issues so I highly doubtful that this is a network parameter, but I would have to take a quick look. But I actually haven't,… + +Andrew Mello: Okay. + +Scott Carruthers: yeah, yeah. Just go ahead and open the issue and we can investigate. I don't think that it's a network parameter. So I actually wasn't aware that there was a Restriction on. and the image download time, I + +Andrew Mello: Yeah, you got flight five minutes for the container to be running, right to go from pending to running before it turns out. it won't just sit on pending on a provider Ever. In the initial,… + +Scott Carruthers: Yeah. + +Andrew Mello: look women. + +Scott Carruthers: Just Okay, yeah, I mean the bottom line is, we need to invest here. So the reason I think that's interesting is that and… + +Andrew Mello: Okay. + +Scott Carruthers: so, Simply you might have some experience with us. I thought we've had experiences where we've had image downloads that. Take much longer than five minutes without Provoking initia. But + +Damir Simpovic: Yeah, I'm pretty sure we did. We do. + +Andrew Mello: It's initial now. + +Scott Carruthers: yeah. + +Andrew Mello: Yeah so I'll document it and that also + +Scott Carruthers: Yeah, what we have to get into some of the semantics or the best So that's the reason I want. So Andrew, I'm not at all like downplaying if you're encountering this YouTube. + +Andrew Mello: Yeah. + +Scott Carruthers: We'll definitely look into it. the interesting thing is shampa and others for some of the AI workloads that we've deployed and they've been in the weeds of this a little bit more than I. So, I wanted ship has experience, but I know that I've seen conversation around this where we've had images that take much longer than five minutes to download and it hasn't provoked an issue. So again, I'm not disputing that you've run into this, we just need to take the specifics because Again, we've had experiences where that hasn't provoked an issues. So we'll have to see what they Specific are behind it. + +Andrew Mello: Okay, I bring up the network parameter thing because another one had changed that nobody documented and I don't know why it was actually one of the ones that I had implemented, which was the 512 CPUs. And I'll actually be another bug. I'll add today because there's no reason we should have rolled that back to 384. I don't know why we support less CPUs now than we did before. And also it doesn't make sense because you can have two CPUs in a system, that are up to 512 threads easily. So I'm gonna take a look at both of those. See if it's a network parameter for the timing as well. We'll figure that out and then the final bug I had here was + +Andrew Mello: the one I mentioned to you Scott in private but just wanted to bring it public that the providers control planes are bidding on the network. So you can see this in the event logs, when you deploy to provider, that's full, they're control plane is bidding showing that they have resources available, your deployment gets, deployed on their provider. And then in the event logs, you see immediately that the node that, you're trying to deploy on is tainted and then the deployment fails at five minutes automatically. So, + + +### 00:20:00 + +Andrew Mello: that is a rather concerning one because that one actually has network performance impacts. It also has impacts of the inventory and availability of inventory being incorrect, right? We're showing when a provider is full, they're still bidding and they shouldn't be. So that I would say, out of all of my bugs, is the one with the highest priority. And then, I would say, Definitely I'm out on containers, spawning on gigabit connections. And then the last one there about + +Andrew Mello: the different values on the bit engine and the console and pointing Max to that. And again, I don't know. The reason I brought up the values today really was, because I don't know who makes the decision on that, because Akash has their value and their chart and you guys are gonna have yours. So you guys gotta decide which one to use their I would imagine it will be the bid engine one because then all the providers don't have to update. So that is my bug find for the last few weeks. + +Scott Carruthers: Yeah, and the control plane. So obviously we're talking about control planes that are Dedicated control plane nodes that aren't worker. Nodes are if they work. + +Andrew Mello: So if they followed the setup of Akash, right? you're setting control planes as tainted, As not supposed to be bidding and all of that but the provider is still seeing those as available. + +Scott Carruthers: So what I was going to say, so obviously this is a matter when the control point is not also serving as a worker node and they're at our mechanisms within the inventory operator to + +Scott Carruthers: The term is escaping out so I'll just say So I don't think that's actually what it is and they inventory, operator mechanics. But anyways, let's just say There's a mechanism with an operator to exclude nodes from inventory for this exact purpose. So if you have control plane only nodes, we should be able to when the in Operator, is deployed data specify nodes that we want to exclude from inventory. And when the inventory operator with Feature Discovery was first released, it was noticed that mechanism for exclusion was not working. So it's definitely a known issue. I don't believe that. There's actually an issue and the support repo. So for some new attention, Andrew, if you could open an issue so we can look at this on the latest version. So I think you're already intending to do that. But again, it's kind of a known issue but it hasn't got any attention recently. So, + +Andrew Mello: Yeah, and it's the kind of issue that would slip through the cracks until a provider is full. It's not something you notice until you're at edge cases, right? Because everything looks great until you're at, out of 90 CPUs available, and it's still bidding, but then you're deployments aren't working, right. The problem is it creates a full failure for the deployment. So, that's why it's kind of definitely one. We want to get some eyes on. + +Scott Carruthers: Yeah, okay, yeah, understood. So yeah, yeah, if you can open an issue for that. Then, yeah, that like I said, I don't believe this received any attention recently even though it's kind of a known issue. + +Andrew Mello: Yeah, but besides that I mean those are the majors that I found after doing some pretty extensive heavy hitting testing with a cosh mining, filling up providers and again that that's stuck out like a sore thumb. So we'll get it going. Thank you again for the attention on it. + +Tyler Wright: Just again, as an action item, the issues that Andrew's brought up today, we kind of talked about over our last few minutes, Andrew will create formal issues. So that we continue to track those beyond this meeting. We can make comments in between six support meetings and kind of look to get these results to figure out what next steps are. But Andrew, thanks for bringing those up again, here on the call and looking forward to creating those issues and following up. + +Andrew Mello: Yeah, we'll do. And glad you guys have some context now for them. + +Tyler Wright: Yes, much appreciated. All right. We've gone through the support repo we've gone through, talk about giving people opportunity to talk about anything at Cross Console related. + + +### 00:25:00 + +Tyler Wright: Andrew brought up some issues or bugs rather than he's found that he's going to turn into issues. I'm following call. The only thing I wanted to talk about again, if anybody has anything else, they want to talk about, feel free to drop it in the chats. But usually, during these six support calls, I reach out to rodri or other members of the Akash insiders slash vanguards to see if there's anything in the community in discord and telegram and other channels that might be worth discussing, that might be worth talking about with again this group. So, anything related to user experience activity in discord, etc. Feel free to talk about this now, to me, I see your hand Feel free to go ahead. + +Damir Simpovic: I noticed that the ticket guys are back, so they might have figured out. A way to go around the filter. So yeah. That's something that we need to. that's all I have. + +Tyler Wright: Yeah, I've seen some people drop some suggestions and insiders and we continue to implement them. But go ahead. Roger + +Rodri R: The same. I was just gonna add that just this morning, there was this couple of new users, that started creating a chat between them. we all with the usually do and the mirror said, they dropped in several tickets and different support things, even one of them even created a screenshot and uploaded it. I banned the users, I thought that we wouldn't get that support ticket was already in the Blacklisted. Words, Word List. I'm not sure why they're going to. + +Tyler Wright: Yeah, I'll check on that and get back to you but thank you for bringing that up. + +Damir Simpovic: the results also a notice that somebody's using books, like, Two or three bots talking to each other on this course. kind of like a normal conversation. + +Damir Simpovic: Yeah. I timed one of them out. Then the other one continues talking. As if. The other one was still there. Was pretty funny, but why I'm bringing it up is because they + +Damir Simpovic: they occasionally drop links. they're relatively harmless to watch just the conversation but after 10, or 15 messages, they drop a link where they, I don't know, found a how to get 8,000 for free, just by clicking on it, stuff like that, people scans. So Yeah. Need to be aware of those. + +Tyler Wright: All If you all can continue to drop messages in the insiders channel again, because Insiders program is an application and acceptance program that is like a caution ambassadors. There's a sign up forum on the website but again I know Roger and Jamaica both insiders. If you can continue to share some of these messages in the chat Adam I and others can continue to look and refine the modification to a mod tools. Go ahead. Roger + +Tyler Wright: You might be unmuted if you're talking. + +Rodri R: I was just saying that there were also some airdrop messages last week and I believe yesterday too. I don't know if that word has also been that listed apparently not because they typed it. + +Tyler Wright: Yeah, again I'll Look at the settings. I'm not really sure what changed because many of these words were blacklisted for a long time. So, let me see if something happened with discord or saw or elsewhere and we'll make sure that some of those basic words I've support ticket, airdrop etc, are blacklist. And then, feel free to drop in the other messages or tendencies that you see. And the cost insiders discord channel, we'll continue to refine the modules. + + +### 00:30:00 + +Rodri R: For sure and it was working because I once got banned from the server for Typing Support Ticket just when you all created the blacklist. So yeah, it used to work. + +Tyler Wright: Yeah, that's what I figured and they were gone for a long time but we'll make sure I follow up. Is there anything else tendency wise or behavior wise, and discord or elsewhere? There might be worth noting here with a folks that get people support in various chats, as well as just members of the core engineering team. + +Tyler Wright: Is there anything else support related to anybody wants to talk about before I let y'all go. + +Tyler Wright: All right, we got a number of action items that will look out for some new issues. That'll be created again. I know the console team is going to continue to create some issues for folks in the community to contribute to. Again, there's a number of opportunities to get involved. Please look at the various repos, ask questions and individual issues, leave comments etc. Much appreciate everyone's time and energy much, appreciate Andrew for bringing these bugs to attention today. Again, we look for those issues coming real shortly. And again hope everyone has a great rest of your day Thanks as always for leading us through on the treehouse process. Hope everyone has a great day, feel free to drop any comments in various channels and discord. Six support. If you don't know where to drop, it related to any of these issues. And again, we'll continue working in between meetings and discord and github, etc. But thanks again for All your hard work, hope everyone has a great rest of the day and prayers for the week. + +Scott Carruthers: Thanks everyone. + +Tyler Wright: Bye. + + +### Meeting ended after 00:32:15 👋