[Mandatory] Upgrading machines to latest firmware and driver in preparation for open source (again) #4200
With the exception of 3 machines, all dev bare metal machines and requested VMs have the latest driver built, and all GS bare metal machines have the latest firmware. Cloud will be doing rolling GS firmware upgrades for all VMs throughout this week. This weekend or at night, I will be doing one last piece of maintenance, which has to do with KMD installation, and upgrading those 3 machines that I missed. After that, all CI BMs and VMs should have the newest stuff. So running on pipelines means you're running on the latest firmware + drivers, which is what users will be doing. Thanks. cc: @tenstorrent-metal/developers @tenstorrent-metal/external-developers
Recently, SysEng has completely revamped their asset delivery methods to be public and like an actual software product. We highly encourage everyone to check it out on their own time.
However, this means users have things to do. The target versions are:
- tt-kmd (driver): 1.26
- Grayskull firmware: fw_pack-80.4.0.0_acec1267.tar.gz
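If you're not sure what you're currently running, a quick check along these lines can tell you whether you're already on the new driver. The module name `tenstorrent` is an assumption on my part; verify against the tt-kmd README.

```shell
# Hedged sketch: check the currently installed Tenstorrent driver version.
# Assumes the tt-kmd kernel module is named "tenstorrent"; adjust for your setup.
modinfo tenstorrent 2>/dev/null | grep -i '^version' || echo "tt-kmd not installed"
dkms status 2>/dev/null | grep -i tenstorrent || echo "no tenstorrent DKMS module registered"
```

If the reported version is below 1.26, you're on the list for an upgrade.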
Is our software stack stable on this new FW / drivers?
To my current knowledge, I have not seen anything suspect on main. We ran all pipelines on new drivers + FW before merging in changes to the current CI.
Let everyone know if you see something unexpected that you believe is coming from firmware / driver changes.
What's the current state of CI machines?
All bare metal CI machines and ~40% of VMs have been upgraded to the latest driver. Among upgraded VMs, all VMs which have Grayskull also have the newest, public firmware.
With the exception of 3 virtual machines, I have taken all machines with older drivers / FW out of service. This means that, except for 3 machines, all our CI has been upgraded.
Once things are quiet and seem stable, we will then upgrade all machines. We have the option of rolling back by taking the current upgraded pool out of service and re-enabling the old, non-upgraded machines. Recall that bare metal machines are easily downgradeable.
What do I have to do?
If you do not use cloud VMs, all you need to do is be aware that this is going on.
By December 6th, 4pm Eastern time in North America (Toronto/New York), you must send me your machine's info.
I apologize for the short notice, but with how fast and urgent things are moving with this Project Grayskull thing, I can't name a single thing that has been simple or convenient.
If you use a cloud VM and you don't submit your machine's info to me, tough luck. You will have to upgrade your machine yourself. Continue reading in that case.
I will be upgrading the BMs on the spreadsheet myself.
If I wanted to do this myself, how do I install the newest versions?
The various pages are located here:
tt-kmd: https://github.com/tenstorrent/tt-kmd
tt-flash: https://github.com/tenstorrent/tt-flash
tt-firmware-gs: https://github.com/tenstorrent/tt-firmware-gs
Please install the versions noted above at the top. Again, note you only need to upgrade flash on Grayskull. If you have Wormhole, you only need to upgrade tt-kmd.

Also, for those who are very familiar with tt-flash, note that its installation method has changed materially: you now specify the path to the firmware blob you want to install. I believe this was an excellent decision by SysEng. Please refer to its instructions.

I strongly recommend that you install tt-flash into a Python virtual environment, as opposed to the system Python.

I know some people may ask about a convenience script for installing everything to the newest versions. However, I currently recommend against it, because I don't want to put anything confusing into the repository for customers. Recall that SysEng now fully owns their software artifact distribution and installation methods, and those things could change at any time. In fact, our README now just points to their stuff. I believe deferring this to SysEng is the way to go. I'm open to further discussion on this.
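For the virtual environment route, something along these lines should work. The pip install source and the exact tt-flash invocation below are assumptions on my part; the tt-flash README is authoritative on the supported command and flags.

```shell
# Hedged sketch: install tt-flash into an isolated venv and point it at the GS firmware bundle.
# The pip source and the flash command/flags are assumptions; follow the tt-flash README.
python3 -m venv ~/.venvs/tt-flash          # isolated env, not the system Python
. ~/.venvs/tt-flash/bin/activate
pip install git+https://github.com/tenstorrent/tt-flash.git
# New-style invocation: you pass the path to the firmware blob explicitly.
tt-flash flash --fw-tar ~/fw_pack-80.4.0.0_acec1267.tar.gz
```

The venv keeps tt-flash and its dependencies from colliding with anything else installed on a shared machine, which matters on CI hosts.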
Is there a difference in installation method between cloud VMs and BMs?
If you're installing a new flash - yes, you require a host reboot whenever you upgrade GS firmware.
For BMs, you just do a sudo reboot like normal. For VMs, you must file a cloud ticket to reboot the host.
If you're installing the driver only, then you just need a reboot or DKMS reload on either arch.
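As a sketch of the driver-only path without a full reboot, a DKMS reinstall plus module reload might look like the following. The module name `tenstorrent` and the DKMS module spec `tenstorrent/1.26` are assumptions; check the tt-kmd README for the exact names.

```shell
# Hedged sketch: pick up a new tt-kmd via DKMS without a full host reboot.
# Module and version names are assumptions; verify against the tt-kmd README.
sudo dkms install tenstorrent/1.26     # build + install the new driver for the running kernel
sudo modprobe -r tenstorrent           # unload the old module (fails if devices are in use)
sudo modprobe tenstorrent              # load the freshly installed module
```

If the unload fails because a process is still holding the device, stop that process (or just reboot) and try again.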
What about dirt boxes and lab machines (Toronto or Santa Clara)?
They also need these upgrades.
I will be treating the lab machine upgrades as a lower-priority item, as we discussed in today's OSS meeting. Unless otherwise explicitly told by both @jvasilje and @davorchap, lab machines will have to take a backseat this week. I may re-visit next week, but no guarantees.
Dirtbox machines are not my responsibility. You're free to ask me questions and for help. However, I will be doing nothing to actually upgrade them.
What about tt-smi?

Right now, we tell external customers to use the publicly-tested and legally-vetted version of tt-smi. The public tt-smi is available here: https://github.com/tenstorrent/tt-smi

For developers, I recommend people continue using the tt-smi that we have installed on all our machines. I'll send out another announcement once we figure out the best way to distribute the new tt-smi.

I'm having trouble figuring out the best way to do the above. This is mainly because they changed tt-smi to no longer be a single binary but to be an executable Python package. I believe this is also an excellent decision by SysEng, but it throws a bit of a kink in our tt-smi delivery method to developers. If you would like to be involved in planning, let me know and I'd be happy to discuss.

cc: @tenstorrent-metal/external-developers @tenstorrent-metal/developers
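P.S. For anyone who wants to try the public package route themselves in the meantime, a minimal sketch follows. The pip install source is an assumption on my part; the tt-smi README is the supported reference.

```shell
# Hedged sketch: install the public tt-smi as an executable Python package in a venv.
# The pip source is an assumption; follow the tt-smi README for the supported method.
python3 -m venv ~/.venvs/tt-smi
. ~/.venvs/tt-smi/bin/activate
pip install git+https://github.com/tenstorrent/tt-smi.git
tt-smi                                 # an entry point installed by the package, not a single binary
```

This is exactly the "executable Python package" shape mentioned above: the `tt-smi` command is a console entry point provided by the package, so a venv per machine is the natural delivery unit.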