Add Galaxy support. #9068

Merged

ubcheema merged 1 commit into main from galaxy/main

Jun 11, 2024

Contributor

ubcheema commented Jun 3, 2024

#0: Enable metal on galaxy.
#8305: add Galaxy cluster apis
#8305: cleanup, add print
#8450: Establish tunnels originating from an mmio device. Determine the remote chips as well as their order on the tunnel.
#8452: add tests for tg pipeline
#0: patch for tg workflows.
#8450: Add tables for tunnel dispatch workers with build settings.
Populate build settings for tunnel kernels.
Launch FD2 kernels based on information in tunnel device dispatch worker map.
Enable 4 devices per hugepage/channel
#0: disable hanging/failing tests for Galaxy
#0: skip using channels 3 and 7, which use hugepage channel 3. This (4th) hugepage is not a full 1 GB in size; 256 MB of it is taken up by syseng tools.
#0: re-enable Galaxy sharded tests, reduce one test runtime for Galaxy
#0: fix cluster init for Galaxy
#8953: Fix hardcoding of queue sizes in tests.
#8450: Fix compute grid selection for N150. An N150 can be a standalone system or part of a TG system. On TG, the compute grid for an N150 differs from that of a standalone N150.
#0: Reduce prefetch q entries to account for Galaxy CQ size.

ubcheema requested review from pgkeller and tt-asaigal as code owners

June 3, 2024 18:35

ubcheema force-pushed the galaxy/main branch from 135e8ee to d79350e Compare

June 3, 2024 19:13

ubcheema temporarily deployed to dev

June 3, 2024 19:21

— with

GitHub Actions Inactive

ubcheema temporarily deployed to dev

June 3, 2024 19:26

— with

GitHub Actions Inactive

ubcheema temporarily deployed to production

June 3, 2024 19:47

— with

GitHub Actions Inactive

ubcheema force-pushed the galaxy/main branch from d79350e to 07a9d42 Compare

June 3, 2024 21:17

ubcheema temporarily deployed to dev

June 3, 2024 21:18

— with

GitHub Actions Inactive

ubcheema temporarily deployed to dev

June 3, 2024 21:22

— with

GitHub Actions Inactive

ubcheema temporarily deployed to production

June 3, 2024 21:39

— with

GitHub Actions Inactive

pgkeller approved these changes

Contributor

pgkeller left a comment

I don't see anything of major concern. There are a few items that we should address in a follow-on commit (if not with this commit). Eventually I think we should pull a bunch of the CQ config code in device.cpp out and put it in the CQ code, though we need to think about the architecture of device/CQ a bit more.

tt_metal/impl/device/device.cpp

    
                      }
                      log_debug(tt::LogMetal, "Setting up {} Arguments", magic_enum::enum_name((tt::tt_metal::DispatchWorkerType)dwv));
                      switch(dwv) {
                          case PREFETCH:

Contributor

pgkeller Jun 4, 2024

seems these names should be scoped?
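
One way to address the scoping question, as a minimal sketch: declaring the worker types as an `enum class` keeps enumerators such as `PREFETCH` out of the enclosing namespace. The enumerator list, the values, and the `to_string` and `index_of` helpers below are illustrative assumptions, not the contents of `dispatch_core_manager.hpp` (where the excerpt shows `COUNT = 12`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical scoped version of the dispatch worker enum. Only the names
// PREFETCH, US_TUNNELER_REMOTE, DEMUX_D and DISPATCH_D appear in the diff
// excerpts; their values and ordering here are assumptions.
enum class DispatchWorkerType : std::uint32_t {
    PREFETCH = 0,
    US_TUNNELER_REMOTE,
    DEMUX_D,
    DISPATCH_D,
    COUNT
};

// Readable name for logging; every use site must now spell out the scope.
constexpr std::string_view to_string(DispatchWorkerType t) {
    switch (t) {
        case DispatchWorkerType::PREFETCH:           return "PREFETCH";
        case DispatchWorkerType::US_TUNNELER_REMOTE: return "US_TUNNELER_REMOTE";
        case DispatchWorkerType::DEMUX_D:            return "DEMUX_D";
        case DispatchWorkerType::DISPATCH_D:         return "DISPATCH_D";
        case DispatchWorkerType::COUNT:              return "COUNT";
    }
    return "UNKNOWN";
}

// Scoped enums do not convert implicitly to integers, so table lookups
// indexed by worker type need an explicit conversion like this one.
inline std::size_t index_of(DispatchWorkerType t) {
    assert(t != DispatchWorkerType::COUNT && "COUNT is not a valid worker type");
    return static_cast<std::size_t>(t);
}
```

The trade-off is that any container indexed by worker type has to go through an explicit conversion such as `index_of`, which is more verbose but makes accidental mixing with plain integers a compile error.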

tt_metal/impl/device/device.cpp

    
                                  settings.downstream_cores.push_back(mux_settings.worker_physical_core);
                                  settings.compile_args.resize(23);
                                  auto& compile_args = settings.compile_args;
                                  compile_args[0]  = downstream_cb_base;

Contributor

pgkeller Jun 4, 2024

as is is good enough for now, eventually we should name the arg indices

Contributor Author

ubcheema Jun 4, 2024

I agree, the indices should be named, and both the host code and the device kernel code should access the compile args through named indices.
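
A minimal sketch of what named indices could look like, assuming a small header included by both host and kernel code. The namespace `example_compile_arg`, the placeholder enumerator, and the `make_compile_args` helper are hypothetical; only index 0 (`downstream_cb_base`) and the size 23 come from the excerpt above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shared header: one name per compile-arg slot.
namespace example_compile_arg {
enum Index : std::size_t {
    DOWNSTREAM_CB_BASE = 0,  // taken from the excerpt: compile_args[0]
    PLACEHOLDER_ARG_1  = 1,  // placeholder; the real slot meanings are not shown
    // ... remaining slots would be named here ...
    COUNT = 23               // matches settings.compile_args.resize(23)
};
}  // namespace example_compile_arg

// Host side: size the vector from COUNT and fill slots by name, so adding or
// reordering a slot only requires touching the enum.
inline std::vector<std::uint32_t> make_compile_args(std::uint32_t downstream_cb_base) {
    std::vector<std::uint32_t> args(static_cast<std::size_t>(example_compile_arg::COUNT), 0u);
    args[example_compile_arg::DOWNSTREAM_CB_BASE] = downstream_cb_base;
    assert(args.size() == example_compile_arg::COUNT);
    return args;
}
```

On the kernel side, the same enumerators could replace the literal indices wherever a compile-time argument is read, so the host and device views of the layout cannot drift apart silently.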

tt_metal/impl/device/device.cpp

    
                              compile_args[24] = packet_switch_4B_pack(0xB1, 0xB2, 0xB3, 0xB4); // 24: packetized input dest id
                              break;
                          }
                          case US_TUNNELER_REMOTE:

Contributor

pgkeller Jun 4, 2024

US ?

Contributor Author

ubcheema Jun 4, 2024

Up Stream Tunneler.
It's a tunneler running on an inner device in the tunnel, and it is connected to the next tunneled device going away from the host.
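
To illustrate the "going away from host" direction, here is a rough sketch that models a tunnel as a list of chip ids ordered by distance from the host, with the MMIO device first. The `Tunnel` struct, the `chip_id_t` alias, and `next_away_from_host` are hypothetical names for illustration and are not the data structures this PR adds.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using chip_id_t = int;  // assumed alias for this sketch

// Chips on one tunnel, ordered by distance from the host.
// chips[0] is the MMIO device; later entries are the remote (tunneled) chips.
struct Tunnel {
    std::vector<chip_id_t> chips;
};

// Returns the chip that a tunneler on `chip` would connect to in the direction
// away from the host, or nullopt if `chip` is last on the tunnel or not on it.
inline std::optional<chip_id_t> next_away_from_host(const Tunnel& tunnel, chip_id_t chip) {
    assert(!tunnel.chips.empty() && "a tunnel starts at an MMIO device");
    for (std::size_t i = 0; i + 1 < tunnel.chips.size(); ++i) {
        if (tunnel.chips[i] == chip) {
            return tunnel.chips[i + 1];
        }
    }
    return std::nullopt;
}
```

With a tunnel of chips `{0, 5, 6}`, the inner device 5 would connect onward to device 6, while device 6, being last, has no further hop.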

tt_metal/impl/device/device.cpp

    
                              auto &compile_args = demux_d_settings.compile_args;
                              compile_args.resize(30);
                              compile_args[0] = 0xB1; // 0: endpoint_id_start_index

Contributor

pgkeller Jun 4, 2024

as part of the next cleanup pass, this and other constants should be named and declared in one location I think...

tt_metal/impl/device/device.cpp

    
                              auto demux_d_settings = std::get<1>(device_worker_variants[DEMUX_D][0]);
                              auto dispatch_d_settings = std::get<1>(device_worker_variants[DISPATCH_D][0]);
                              TT_ASSERT(num_prefetchers == demux_d_settings.semaphores.size(), "Demux D does not have required number of semaphores for Prefetcher D. Exptected = {}. Fount = {}", num_prefetchers, demux_d_settings.semaphores.size());

Contributor

pgkeller Jun 4, 2024

Fount typo

tt_metal/impl/dispatch/dispatch_core_manager.hpp

+                  COUNT = 12
+              };
+              struct worker_build_settings_t{

Contributor

pgkeller Jun 4, 2024

quibble: not a fan of this name. dispatch_worker_build_settings_t ? or dispatch_settings_t? or cq?

tt_metal/llrt/tt_cluster.cpp

@@ -583,6 +691,71 @@ void Cluster::initialize_ethernet_sockets() {
                   }
               }
+              void Cluster::reserve_ethernet_cores_for_tunneling() {

Contributor

pgkeller Jun 4, 2024

A few high-level comments in front of the bigger routines would help with understanding the flow: what do you have to search through? What criteria are being looked for?


          #0: Enable metal on galaxy.

e06be02

#8305: add Galaxy cluster apis
#8305: cleanup, add print
#8450: Establish tunnels originating from an mmio device. Determine the remote chips as well as their order on the tunnel.
#8452: add tests for tg pipeline
#0:    patch for tg workflows.
#8450: Add tables for tunnel dispatch workers with build settings.
       Populate build settings for tunnel kernels.
       Launch FD2 kernels based on information in tunnel device dispatch worker map.
       Enable 4 devices per hugepage/channel
#0:    disable hanging/failing tests for Galaxy
#0:    skip using channels 3 and 7, which use hugepage channel 3. This (4th) hugepage is not a full 1 GB in size; 256 MB of it is taken up by syseng tools.
#0:    re-enable Galaxy sharded tests, reduce one test runtime for Galaxy
#0:    fix cluster init for Galaxy
#8953: Fix hardcoding of queue sizes in tests.
#8450: Fix compute grid selection for N150. An N150 can be a standalone system or part of a TG system. On TG, the compute grid for an N150 differs from that of a standalone N150.
#0:    Reduce prefetch q entries to account for Galaxy CQ size.
#0:    galaxy mesh return any available device
#0:    Fix device mesh close for Galaxy
#8450: Update Galaxy device creation.

ubcheema force-pushed the galaxy/main branch from 07a9d42 to e06be02 Compare

June 11, 2024 21:11

ubcheema temporarily deployed to dev

June 11, 2024 21:22

— with

GitHub Actions Inactive

ubcheema temporarily deployed to dev

June 11, 2024 21:31

— with

GitHub Actions Inactive

ubcheema temporarily deployed to production

June 11, 2024 21:44

— with

GitHub Actions Inactive

ubcheema merged commit d35ea9d into main

74 checks passed

ubcheema added a commit that referenced this pull request


          #8450: Cleanup items pending from PR #9068

59f9618

ubcheema added a commit that referenced this pull request


          #8450: Cleanup items pending from PR #9068

d03657c

ubcheema added a commit that referenced this pull request


          #8450: Cleanup items pending from PR #9068

4178ddd
