From 74f870fd43a3b2cef49d736efd9bc194a9c02c8b Mon Sep 17 00:00:00 2001 From: George Appiah <48031636+iamGeorgePro@users.noreply.github.com> Date: Tue, 27 Aug 2024 08:26:24 +0000 Subject: [PATCH] Create meeting notes for Sig-Support #36 (#671) * Create meeting notes for Sig-Support #36 Signed-off-by: George Appiah <48031636+iamGeorgePro@users.noreply.github.com> * Update 036-2024-08-21.md added transcript Chore: Add recording link Signed-off-by: Drinkwater * Update 036-2024-08-21.md added recording link Signed-off-by: Drinkwater --------- Signed-off-by: George Appiah <48031636+iamGeorgePro@users.noreply.github.com> Signed-off-by: Drinkwater Co-authored-by: Drinkwater --- sig-support/meetings/036-2024-08-21.md | 318 +++++++++++++++++++++++++ 1 file changed, 318 insertions(+) create mode 100644 sig-support/meetings/036-2024-08-21.md diff --git a/sig-support/meetings/036-2024-08-21.md b/sig-support/meetings/036-2024-08-21.md new file mode 100644 index 00000000..307142ba --- /dev/null +++ b/sig-support/meetings/036-2024-08-21.md @@ -0,0 +1,318 @@ +# Akash Network - Support Special Interest Group (SIG) - Meeting #36 + +## Agenda + +- Review issues awaiting triage in the support repo +- Discuss any open issues or support-related topics +- Update on console-related issues +- Community support and Discord updates +- Any other support-related items + +## Meeting Details + +- Date: August 21, 2024 +- Time: 07:00 GMT-7 +- [Recording](https://wrx26ds6t7sofpahkzy5iphq7eqtfu3jln6xf5idxmrr2url4m5q.arweave.net/tG-vDl6f5OK8B1Zx1Dzw-SEy02lbfXL1A7sjHVIr4zs) +- [Transcript](#transcript) + +## Participants + +- Tyler Wright +- Scott Carruthers +- Andrey Arapov +- Maxime Beauchamp +- Rodri R +- Zeke E +- Other attendees (not explicitly mentioned in discussion) + +## Meeting Notes + +### Review of Issues Awaiting Triage + +- Scott Carruthers led the review of issues awaiting triage in the support repo + +- Packet forwarding issue: + - Opened by Tyler Wright + - Awaiting Cosmos SDK 47 upgrade + - Classified as P2 priority + +- Lack of visibility for deployers on exit codes of failed deployments: + - Opened by Andrey Arapov + - Assigned to provider repository + - Classified as P2 priority + - Considered a good feature for users to see reasons for deployment exits + +- GPU device plugin issues with inventory operator: + - Opened by Andrey Arapov + - Assigned to provider repository + - Issue occurs when NVIDIA GPU device plugin marks devices as unhealthy + - Mainly observed on Oblivious H100 PCIe provider + - May be related to virtualized environments + - Recovery methods discussed: bouncing NVIDIA device plugin, rebooting node + +### Console-related Issues + +- Maxime Beauchamp brought up issue #227 regarding exposing manifests via API: + - Proposal to add an endpoint to fetch manifests + - Would solve issues with local storage and device changes + - Potential to add more metadata to manifests (e.g., deployment name) + - Discussion on progress of storing manifests on-chain (still in idea phase) + +### Community Support and Discord Updates + +- Increase in Discord scammer activity: + - Rodri R noted an increase in scammers engaging in channels and DMing users + - Tyler Wright acknowledged the issue and mentioned it would be discussed further in Insider office hours + +- General increase in Discord activity: + - More people joining provider and deployment channels + - Possibly due to recent events or increased awareness + +### Snapshot Service Update + +- Zeke E provided an update on the new snapshot service: + - Fully in production at snapshots.akash.network + - Pull request submitted for website documentation + - Requested feedback from users + - Will provide snapshots for upcoming Cosmos SDK 47 testnet + +### Cosmos SDK 47 Testnet Update + +- Tyler Wright provided an update on the Cosmos SDK 47 testnet: + - Currently on hold due to focus on provider-related issues and active workload concerns + - Code is mostly complete with some initial internal testing done + - Plan to create a Discord channel for testnet testing + - Looking for volunteers with extra infrastructure for testnet participation + +### Additional Notes + +- Scott Carruthers highlighted the importance of the new hourly snapshot service for providers, noting it will significantly reduce sync times +- Andrey Arapov made a recent announcement in Provider channels about updating to Provider Services version 0.6.4 + +## Action Items +- Tyler Wright to create a Discord channel for Cosmos SDK 47 testnet testing. +- Tyler Wright to follow up with the website team on merging the snapshot service URL. +- Core engineering team to continue work on provider-related issues and active workload concerns. +- Community members to test and provide feedback on the new snapshot service. +- Interested parties to look out for updates on Cosmos SDK 47 testnet testing. +- Discuss Discord scammer activity further in Insider office hours. + +# **Transcript** + +Tyler Wright: Alright, welcome everybody to Sig support. This is now a monthly meeting. August 21st 2024 During the special interest group for support the group goes over a number of things related to the core repos and oftentimes a console repo really the goal of the support is to have members of the Court engineering team go over any issues that are awaiting triage in The support repo is we're all issues related to the core code base for our costs the blockchain ETC live. So again members of the core engineering team + +Tyler Wright: go over any issues of waiting triage people on the call can talk about any open issues and again ask any questions about open issues if they want to get involved in helping to solve some of the again, this is a place to ask questions get involved talk about your skill set Etc oftentimes again, we cover issues related to console but I know some members that team may not be here again, if anybody has any issues related to the Akash console, there's a console repo as well as a Discord Channel. I know that the team behind console does a great job of staying up to date and staying very active in that repo. + +Tyler Wright: After we go over any issues awaiting triage, I usually see if there's anybody from the Akash insiders vanguards or anybody in the general community that maybe has some themes related to support that they see a Discord telegram or other facing channels. And then if anybody has any other comments questions concerns related to issues in the core code base or console again, feel free to drop those inside the chat and we can make sure that they get discussed. + +Tyler Wright: Without further Ado. I'll hand it over to Scott who will lead us through issues related to again the support repo and any issues we're waiting To Scott take it away. + +Scott Carruthers: All right. Thanks I share my screen. + +Scott Carruthers: Okay, so everyone able to see my screen in the open issues? so prior to the call,… + +Tyler Wright: Yes. + +Scott Carruthers: I was reviewing open issues specifically those awaiting triage, so I currently just have the Issues with that filter awaiting tree Ash pulled up. So as we can see we have three issues to review. So first one Tyler the packet forwarding I know that this is a issues that you opened. This is a waiting day Cosmos SDK. 47 upgrade before we can continue. Do you want me to just leave this in awaiting treehouse for the moment or do you want me to take it out of waiting triage and Do further assignment? + +Scott Carruthers: Do you want me just leave it as a yummy classified as a P2 for now? Okay. + +Scott Carruthers: Okay, so we have that one checked off and either and again, I reviewed these issues and it's already familiar with these but reviewed. These two issues prior to the call. both of these waiting triage. Once that are remaining our both opened by when a record team members Andy who may want to just speak further. So we don't have to Deep dive into them now, but if you have any comment I'll take so first of all, they lack of visibility for a deployer on the Akash Network to get the exit code of a failed appointment. So I think that's something that we've discussed recently and obviously would be + +Scott Carruthers: a great knowledge point to add to the network. So if a pod access because of out of memory or something of that nature, they deployer can get some feedback. So We're with that. Prelude anything for that you want to I'll take this out of a waiting triage and assign it into the provider repository. I'm doing that. Do you have any further comments or anything further on this that you want to discuss? + + +### 00:05:00 + +Andrey Arapov: No, it's good. Go ahead. Thank you. + +Scott Carruthers: Okay. Sure. + +Scott Carruthers: What would you prefer the classification on this a P2 P1? + +Andrey Arapov: It's definitely like Peter or something lower just like extra functionality for the users to see the reason the deployments are exiting with just kind of Good feature to her. So I'll be doing good. + +Scott Carruthers: Yeah. Yeah,… + +Andrey Arapov: Thank you. + +Scott Carruthers: absolutely. Okay, I'm gonna assign both our turn myself to this. I've classified as a P2 and then they provide a repository. + +Scott Carruthers: So now going back to open issues and essentially with the top so I don't even need to filter so again, so this is a matter that when they + +Scott Carruthers: the provider has Nvidia and the DP device plug-in issues on the inventory operator is not picking up on the fact that there are issues with that plug-in. So again, I'll assign this to the provider repository and as I'm going through the labeling assignment anything further you want to discuss on this issue. + +Andrey Arapov: My little shortly. It's basically when you have some issues with GPU and that module automatically marks the device for GPU devices are unhealthy. This event is quite rare so far. We have only served it on the oblivious h100 PSE provider. They're working on it to fix it and likely the issues related to the fact that provider is running in Puma virtualized environment and passing through the PSE h100 device through it as someone reported the same having the same. + +Andrey Arapov: Let it has seen the same issues on the Media Forum they started so in this installs we often see this kind of things are happening and that's how I actually noticed this issues at all. I just was watching no other was reporting HTTP so available and after querying it again after the GPS market on health by this model and assuming the price is still reported into Chris are available. But in fact it was like We'll say one GP was available and… + +Scott Carruthers: Okay, sure and decide I noticed in the issues. + +Andrey Arapov: wasn't picking it up. So yeah, because + +Scott Carruthers: You said that rarely you can recover from this occurrence by bouncing the mvdp device plugin. if that doesn't work? how have you been able to recover from this situation? + +Andrey Arapov: I + +Scott Carruthers: Otherwise by bouncing the inventory operator? + +Andrey Arapov: am it's like let me think. So basically the nvp device plugging is the one that is responsible for marking at gpus and healthy or detecting them at all. And when it detects the GPU you can actually keep cattle common describe notes or describe notes specific note if you're wondering and then you can see the amount of resources that com/gpu resource and we can see the amount of them and basically what an Android operator does he means I think information from there actually so kind of needs to, keep updating this information either from an old or from this. + +Andrey Arapov: Plugin directly, but in order to recover from it. First of all, we need to detecting this issue. So normally you would query the provider status on point and you would see it https are available, but the only way how to objective. Basically it's a chain of actions, which is someone is deploying the EGU workload on that particular note and then you notice You cannot access your department. We don't see your deployment is ever becoming ready. And when you are asking provider to check the provider side and Humanities point a few you want is that your board is basically stuck for our pending and what's happening in conversation that it cannot spawn because it needs more GPS but reports less GPS available as you can very good describe mode. + +Andrey Arapov: But then you see providers to reports all of them already and that's where this synchronization is at and to recover from it. It depends how bad presentation is. So They're two ways to recover it only Simply bouncing this invisibility model and hope that it actually doesn't Mark the GPU. That is unhealthy again. Sometimes it works. Sometimes it's not it depends how bad situation is and if it still marks the gpus unhealthy, you would have to reboot the note entirely simply on a simply unloading and video and reloading and video model. It's possible to do I want to talk for it if you're wondering but even that doesn't help in India semi reset command to gpus doesn't help low level so + + +### 00:10:00 + +Andrey Arapov: Yeah, and it's another factor that says in favor of the tissue is related to the visualization. So something has to do with the qmi or the way how the egpu has been passed through x-ray. + +Scott Carruthers: Yeah. + +Andrey Arapov: It's a good point here discussing it right now something to highlight. Yeah. and once you done this and the note is back up again, you can basically see old reviews are present normally which again highlights that it's issue likely is the firmware or the way how it's being passed through the key in my because physically is always back after the book. And there is another thing issues opened for this kind of issue for the inventory operator. I think with missing. + +Andrey Arapov: Was it? + +Andrey Arapov: I think it was 240. I think that a couple of issues similar to this but they're different so you need to Detective. In life mode and the other issues is that when you finish the inorder book that nvdp and with the device plugging in it takes some time to initialize about, three to five minutes and an entry operator starts earlier. And where is the information earlier and sometimes you would see the provider is supporting zero to use are available. in fact, if you roll out to start inventory operator deployment, it will correctly detect it. So I think this kind of precious are connected so it just needs to I believe frequently all the information from the note. + +Scott Carruthers: Yeah, I sense and good to go through that inside personally. Haven't hit the fdp device in issues. good to hear your series of events. So if restarting the mvdp vice plugin doesn't work reboot in the note preview booting the node entirely as prevent to cure the issues. So yeah good to walk through that and I guess there's any providers that on the call or listen to the recording that haven't encountered that yet. So yeah. + +Scott Carruthers: Yeah. + +Scott Carruthers: Yeah, okay. Yeah, thanks for that review and I'll make sense. So that brings us through the issues that were previously awaiting triage. Does anyone have Questions about any of those issues that you want to Deep dive in a little further or possibly any questions about issues that we haven't covered so far. + +Scott Carruthers: Okay, if there's nothing else I'll hand it back to Utah. + +Tyler Wright: Excellent. Thank you very much Scott again. I just want to see if there's anything else from the support repo that anybody wants to discuss. + +Tyler Wright: I know there's a representative from the consult team, one of the original developers of clouds, which is our console. So always want to shout out Max. Does anyone have any issues related to the console repo that they want to discuss are any issues and console that they want to discuss at this time? Of your hand up. Go ahead rodri. + + +### 00:15:00 + +Tyler Wright: Yeah, I've noticed this too. We've talked to some insiders and we try to tighten up on some of the mod words. But to be honest some of this behavior. I'm not really sure what we're gonna do. I think we'll talk about a little bit more about it during The Insider office hours, again, they cost insiders is a special group of ambassadors that works on volunteer basis to provide support in Discord a number of other channels both non-technical support. So again, we'll take some time during the entire office hours to discuss this a little bit more and see if there's anybody that has any specific strategies. I know that Max dropped an issue prior to that. Is there any other? + +Tyler Wright: Items on the support side that I've seen really a lot more people in providers channel is also and also unemployment's channel. So I think again whether they're coming off of events or just general awareness there seems to be a fair amount more people in Discord. I've heard from a I've heard from some Partners recently as well that it getting involved in Discord. There's + +Maxime Beauchamp: Thanks Yeah, I talked about this issues with Arthur and I think I just want to bring it up to Scott as well out of you wherever it but yeah, It's probably not that big of a change to implement, but yes to expose the manifests. to the API so the way that we like query the logs are the events with the certificate. We would add an extra point to get the Manifest. So that way instead of storing it in the local storage like we do in console. + +Maxime Beauchamp: And if you change device or browser, it doesn't follow you and connect your wallet. if we had an endpoint to this we would just fetch it and then that problem would be solved and also. Arthur was talking about + +Maxime Beauchamp: Art, there is aware. Just wanted to put a little bit more light on the issues. + + +### 00:20:00 + +Scott Carruthers: Yeah, thanks for reviewing that Max and completely makes sense. as you were described in that I now have a little bit for their Insight. So we've talked at different points about story and manifests on chain. So someone could to ease the friction on deployment to a new provider or obviously there could be automation built around that as well. But the current focus of + +Scott Carruthers: Having a manifestive API so that's the part that wasn't following initially is why that helps if life for automations schemes and past conversations where we're going to have the manifests possibly on chain or stored elsewhere so that in automation you could redeploy to another provider easily. So I wasn't sure how exposing the manifests who are provider API would solve that issue issues that we're currently trying to solve is instead of storing it The sdl and local storage. So if someone brings up console on another device console want to have access to it that use case makes complete sense. I'm glad you went over that and yeah, I'm a little clearer now and as you said, I know the Archer is aware of this issues. So I think we're Contemplating how we're going to move forward. + +Maxime Beauchamp: Cool. Thanks Scott. + +Tyler Wright: It looks like even as quick as recently as yesterday. There's been some reference of this issue on the provider side. So thank you again Max for bringing this up and talking about a little bit more. Is there any other action that you want done or I guess I've looked at the issue and you've mentioned a number of the items at the top of it. But it is there any other specific functionality that maybe is not an issues 227 that you want to make sure it's very clear for the core engineering team. I know you already talked to Archer and obviously just recently brought it back up to Scott. But is there anything that can be added to that issue just to make sure that everything you're looking to be solved is sold or is it pretty much up to date? + +Maxime Beauchamp: They just pretty much up today. There's an image to it. Unless some Scott you think. + +Tyler Wright: . + +Maxime Beauchamp: It's not clear enough, but I think it's pretty straightforward. Yeah. + +Scott Carruthers: Yeah, no. Yeah, especially with the further conversation. I'm definitely Claire so I think we're good. + +Maxime Beauchamp: Is there any progress today and storing the Manifest unchain, or is it still just an idea? + +Maxime Beauchamp: Yeah, no worries. Just wanted to ask. + +Tyler Wright: I thought two hands up. I don't know what order. + +Tyler Wright: Perfect. + + +### 00:25:00 + +Tyler Wright: Excellent. Thank you Roger. + +Rodri R: The get with my turn now. + +Tyler Wright: Go ahead. + +Rodri R: Look Andrew said and Scott and Maxim makes it great idea to store the Manifest unchain. Not sure if this has anything to do with this, but just two weeks ago. I learned I deployed for my new address and console didn't ask me to Create a new manifest. Is this something that was added recently because before in Cloud moves, I remember we had to sign the first grade. I mean, I would have to create a manifest and they would deploy Kepler or Wale whatever and I would have to pay for that and now I just deployed without creating. I'm a manifest. + +Rodri R: Asking because Melton the others know that it's not necessary anymore. Am I correct? + +Maxime Beauchamp: Wait a minute. I'm not sure what exactly you're referring to and you're saying there's no need for a manifest. + +Rodri R: Yeah, like I said, I recently created a new wallet two weeks ago. I deployed from console and it did not ask me to sign up any manifest. It was a new wallet. + +Rodri R: So I was kind of surprised because I thought it was a new feature. And I didn't have to sign manifest before on Club moves. + +Maxime Beauchamp: Like the SEO you mean? + +Rodri R: No if you remember in Cloud mode Before even deploying you had to pay I don't know 0.0001 akt to create a manifest. And now he created it automatically for me in console. + +Maxime Beauchamp: I mean the certificate. + +Rodri R: Okay, it's different. Yeah certificate. Yeah, my bad. + +Rodri R: Okay, my Yeah is the certificate so this is different. + +Tyler Wright: awesome All right, again, we've talked about any issues that are awaiting triage and we went over some of those issues from the support repo. + +Tyler Wright: We've talked a little bit about inside and saw if there's any questions for anything console related Max brought up an issue that will have an effect for client side. Many users issues issues number 227. We're just discussing that we talked to members of the Insiders about just against support in Discord and some things that we've been noticing over the last couple of weeks. I want to see if there's anything else anybody wants to discuss related to support again. This used to be a weekly meeting where we change it to monthly just because there's again many things being worked on we have the product and Engineering roadmap. And then again, if anybody has any questions between these monthly meetings we can discuss them in the Sig support Channel or people can leave comments under a specific issue and again that can get some traction. You see a hand Go ahead Zeke. + +Zeke E: Yeah, I got two things first thing. I just wanted to give an update on the snapshot service. So it's fully in production. Now, you can access it at snapshots.cost.network. I did a pull request in the GitHub repo for the website for the documentation such as waiting to get approved. So that way we can start using that more widely. If you have any questions or feedback if you use it, I'd appreciate it as we continue to scale it up. And then just is there any update regarding the cosmos sdk47 testnet? + + +### 00:30:00 + +Tyler Wright: As of right now there is no update the hope is that over the next couple of days to a week. We will get that started. I know we've said that previously but there's just been a couple of other issues. I think you could see some of the commits that have been going on from the core engineering team, but there's some other provider related issues and some other issues related to active workloads that the core engineering team has been focusing on over the last two weeks and that has put next steps in terms of the cosmos SDK 47 upgrade on hold temporarily again, we're in a point where mostly we're code complete. There's been some initial testing done internally that's like automated testing and to Zeke's point. We're starting to gather some folks together that are technical that can help us. + +Tyler Wright: Test out the cosmos SDK 47 migration Scott and I were actually talking about a plan yesterday. So again, in terms of next steps, I'll start to put together a group. I know there's some people on this call and some people in the community that showed interest and initial testing. I'll create a specific Discord channel for that testing and then again over the next couple of days. we'll share some instructions and next steps. There are some validators and some providers that again if they had some extra infrastructure, we would love to leverage for the testnet. But again, I've reached out to folks there, but please look out for updates in various channels. And again, I appreciate everyone understanding and being patient as again our core team of + +Tyler Wright: Three and a half Engineers on the core engineering side has been working relentlessly on a number of different fronts. So I've been trying to limit Scott and Archer and the rest of teams sleep schedule,… + +Zeke E: And then just one other thing I will. + +Tyler Wright: but apparently they need to sleep to live. So we're working as quickly as we can. + +Zeke E: I will be perfect. I will be providing snapshots for the testnet once that gets started up. + +Tyler Wright: So I go ahead Zeke. + +Tyler Wright: Yeah, that's gonna be instrumental and I think that would be the snapshot that gets used for those that are helping to test things and participate in that testnet. So thank you Zeke's for all the work that you've been doing over the last couple of months to provide it again additional snapshot service. That stacked out by the core engineering team. Obviously, we have the poker Chief snapshot service and number of other avenues but having a snapshot service control by kind of overclock labs in the core engineering team that has some of more + +Tyler Wright: more often Cadence around snapshots is going to be very helpful, especially when core engineers and other members of the community are setting things up testing things Etc. So appreciate all the work you've been doing there Zeke. + +Tyler Wright: I'll follow up. I've already followed up with the website team on just the URL making sure that gets merged. But again, thank you for all the work there. + +Tyler Wright: Any other support related items? + +Scott Carruthers: but I was just gonna agree with Zeke the fact that we have snapshots available every hour. + +Tyler Wright: Go ahead Scott. + +Scott Carruthers: So I'm sure there's probably a number of people on the call that can understand the frustration or just challenges with you trying to build a New provider especially with the scripting that I created. I think a majority of the time waiting for the provider to come up was waiting for the dedicated notice zinc. So that process was actually longer than building in a cash provider. So having them. Available every hour and be able to sink Within. 30 seconds is obviously going to be huge going forward. and that's probably good update is I'm not sure that everyone on this call was aware of this new. Snapchat service that we have created Zeke doing all the work on the back end, but obviously that's going to be very important going forward. + + +### 00:35:00 + +Tyler Wright: Excellent any other discussion topics that anybody wants to talk about here related to support? + +Tyler Wright: if not again, please continue to follow the support repo for all things related to the core code base follow the console repo for all things going on relate to the cash console much appreciate all the Insiders vanguards members of the community here on the call and listening in later and just involved in continuing to provide support to existing users teams and projects and companies are trying to integrate with the cash again much appreciate all that efforts. Just a reminder the Sig providers monthly meeting is going to be starting in about 22 minutes or 21 minutes at this point. We'll get an update from Jigger and the ball and all the work that they've been doing related to console 2.0 and integrating crador into the Akash console experience. If anybody has any other topics related to provider again, please feel free to join that call and we can talk about those. + +Tyler Wright: Those in more detail and about 20 minutes. But again, thank you everybody for their time today. Make sure this recording and transcript is available shortly. But again another great call. Thank you all and we'll continue to update in between meetings in specific tickets or again. Look out for some announcements in Discord elsewhere. I believe actually Andrey correct me from wrong, but you did put an announcement Provider channels as recently as today with updates for folks on how to update to provide a Services version 0.6.4. So again, if anybody has any questions there what we'll talk about in more detailed and providers, but again, it also touches support in looking at those updated instructions, so + +Tyler Wright: Look out for the provider announcements and again much appreciate everyone's time and energy today. Hope to see many of you and sit providers the steering committee that originally tomorrow is now scheduled again for August 29. So again, no steering committee tomorrow. There will be a steering committee as normal on August 29th at the same time next Thursday, but hope to see you all around the community and thank you all for all that you do. Have a great rest of your day. + +Zeke E: Thanks everybody. + +Tyler Wright: Hope to see many of you and stick providers. + +Rodri R: See you in a bit. + +Scott Carruthers: Thanks. + +oo o: live long and prosper + + +### Meeting ended after 00:51:13 👋