[Mandatory] Upgrading machines to latest firmware and driver in preparation for open source (again) #4200
With the exception of 3 machines, all dev bare metal machines and requested VMs have the latest driver built, and all GS bare metal machines have the latest firmware. Cloud will be doing rolling GS firmware upgrades for all VMs throughout this week. This weekend or at night, I will be doing one last piece of maintenance, which has to do with KMD installation, and upgrading those 3 machines that I missed. After that, all CI BMs and VMs should have the newest stuff. So running on pipelines means you're running on the latest firmware + drivers, which is what users will be doing. Thanks. cc: @tenstorrent-metal/developers @tenstorrent-metal/external-developers
Recently, SysEng has completely revamped their asset delivery methods to be public and like an actual software product. We highly encourage everyone to check it out on their own time.
However, this means users have things to do. The target versions are:
- tt-kmd (driver): 1.26
- Grayskull firmware: fw_pack-80.4.0.0_acec1267.tar.gz
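If you're not sure what you're currently running, a quick check along these lines can tell you whether you're already on the new driver. The module name `tenstorrent` is an assumption on my part; verify against the tt-kmd README.

```shell
# Hedged sketch: check the currently installed Tenstorrent driver version.
# Assumes the tt-kmd kernel module is named "tenstorrent"; adjust for your setup.
modinfo tenstorrent 2>/dev/null | grep -i '^version' || echo "tt-kmd not installed"
dkms status 2>/dev/null | grep -i tenstorrent || echo "no tenstorrent DKMS module registered"
```

If the reported version is below 1.26, you're on the list for an upgrade.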
Is our software stack stable on this new FW / drivers?
To my current knowledge, I have not seen anything suspect on main. We ran all pipelines on new drivers + FW before merging in changes to the current CI.
Let everyone know if you see something unexpected that you believe is coming from firmware / driver changes.
What's the current state of CI machines?
All bare metal CI machines and ~40% of VMs have been upgraded to the latest driver. Among upgraded VMs, all VMs which have Grayskull also have the newest, public firmware.
With the exception of 3 virtual machines, I have taken all machines with older drivers / FW out of service. This means that, except for 3 machines, all our CI has been upgraded.
Once things are quiet and seem stable, we will then upgrade all machines. We have the option of rolling back by taking the current upgraded pool out of service and re-enabling the old, non-upgraded machines. Recall that bare metal machines are easily downgradeable.
What do I have to do?
If you do not use cloud VMs, all you need to do is be aware that this is going on.
By December 6th, 4pm Eastern time in North America (Toronto/New York), you must send me your machine's info.
I apologize for the short notice, but with how fast and urgent things are moving with this Project Grayskull thing, I can't name a single thing that has been simple or convenient.
If you use a cloud VM and you don't submit your machine's info to me, tough luck. You will have to upgrade your machine yourself. Continue reading in that case.
I will be upgrading the BMs on the spreadsheet myself.
If I wanted to do this myself, how do I install the newest versions?
The various pages are located here:
tt-kmd: https://github.com/tenstorrent/tt-kmd
tt-flash: https://github.com/tenstorrent/tt-flash
tt-firmware-gs: https://github.com/tenstorrent/tt-firmware-gs
Please install the versions noted above at the top. Again, note you only need to upgrade flash on Grayskull. If you have Wormhole, you only need to upgrade tt-kmd.

Also, for those who are very familiar with tt-flash, note that its installation method has changed materially: you now specify the path to the firmware blob you want to install. I believe this was an excellent decision by SysEng. Please refer to its instructions.

I strongly recommend that you install tt-flash into a Python virtual environment, as opposed to the system Python.

I know some people may ask about a convenience script for installing everything to the newest versions. However, I currently recommend against it, because I don't want to put anything confusing into the repository for customers. Recall that SysEng now fully owns their software artifact distribution and installation methods, and those things could change at any time. In fact, our README now just points to their stuff. I believe deferring this to SysEng is the way to go. I'm open to further discussion on this.
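For the virtual environment route, something along these lines should work. The pip install source and the exact tt-flash invocation below are assumptions on my part; the tt-flash README is authoritative on the supported command and flags.

```shell
# Hedged sketch: install tt-flash into an isolated venv and point it at the GS firmware bundle.
# The pip source and the flash command/flags are assumptions; follow the tt-flash README.
python3 -m venv ~/.venvs/tt-flash          # isolated env, not the system Python
. ~/.venvs/tt-flash/bin/activate
pip install git+https://github.com/tenstorrent/tt-flash.git
# New-style invocation: you pass the path to the firmware blob explicitly.
tt-flash flash --fw-tar ~/fw_pack-80.4.0.0_acec1267.tar.gz
```

The venv keeps tt-flash and its dependencies from colliding with anything else installed on a shared machine, which matters on CI hosts.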
Is there a difference in installation method between cloud VMs and BMs?
If you're installing a new flash - yes, you require a host reboot whenever you upgrade GS firmware.
For BMs, you just do a sudo reboot like normal. For VMs, you must file a cloud ticket to reboot the host.
If you're installing the driver only, then you just need a reboot or DKMS reload on either arch.
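As a sketch of the driver-only path without a full reboot, a DKMS reinstall plus module reload might look like the following. The module name `tenstorrent` and the DKMS module spec `tenstorrent/1.26` are assumptions; check the tt-kmd README for the exact names.

```shell
# Hedged sketch: pick up a new tt-kmd via DKMS without a full host reboot.
# Module and version names are assumptions; verify against the tt-kmd README.
sudo dkms install tenstorrent/1.26     # build + install the new driver for the running kernel
sudo modprobe -r tenstorrent           # unload the old module (fails if devices are in use)
sudo modprobe tenstorrent              # load the freshly installed module
```

If the unload fails because a process is still holding the device, stop that process (or just reboot) and try again.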
What about dirt boxes and lab machines (Toronto or Santa Clara)?
They also need these upgrades.
I will be treating the lab machine upgrades as a lower-priority item, as we discussed in today's OSS meeting. Unless otherwise explicitly told by both @jvasilje and @davorchap, lab machines will have to take a backseat this week. I may re-visit next week, but no guarantees.
Dirtbox machines are not my responsibility. You're free to ask me questions and for help. However, I will be doing nothing to actually upgrade them.
What about tt-smi?

Right now, we tell external customers to use the publicly-tested and legally-vetted version of tt-smi. The public tt-smi is available here: https://github.com/tenstorrent/tt-smi

For developers, I recommend people continue using the tt-smi that we have installed on all our machines. I'll send out another announcement once we figure out the best way to distribute the new tt-smi.

I'm having trouble figuring out the best way to do the above. This is mainly because they changed tt-smi to no longer be a single binary but to be an executable Python package. I believe this is also an excellent decision by SysEng, but it throws a bit of a kink in our tt-smi delivery method to developers. If you would like to be involved in planning, let me know and I'd be happy to discuss.

cc: @tenstorrent-metal/external-developers @tenstorrent-metal/developers
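P.S. For anyone who wants to try the public package route themselves in the meantime, a minimal sketch follows. The pip install source is an assumption on my part; the tt-smi README is the supported reference.

```shell
# Hedged sketch: install the public tt-smi as an executable Python package in a venv.
# The pip source is an assumption; follow the tt-smi README for the supported method.
python3 -m venv ~/.venvs/tt-smi
. ~/.venvs/tt-smi/bin/activate
pip install git+https://github.com/tenstorrent/tt-smi.git
tt-smi                                 # an entry point installed by the package, not a single binary
```

This is exactly the "executable Python package" shape mentioned above: the `tt-smi` command is a console entry point provided by the package, so a venv per machine is the natural delivery unit.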