Add Galaxy support. #9068

Merged
merged 1 commit into from
Jun 11, 2024

@ubcheema ubcheema commented Jun 3, 2024

#0: Enable metal on galaxy.
#8305: add Galaxy cluster apis
#8305: cleanup, add print
#8450: Establish tunnels originating from an mmio device. Determine the remote chips as well as their order on the tunnel.
#8452: add tests for tg pipeline
#0: patch for tg workflows.
#8450: Add tables for tunnel dispatch workers with build settings.
Populate build settings for tunnel kernels.
Launch FD2 kernels based on information in tunnel device dispatch worker map.
Enable 4 devices per hugepage/channel
#0: disable hanging/failing tests for Galaxy
#0: skip using channels 3 and 7, which use hugepage channel 3. This (4th) hugepage is not a full 1 GB in size; 256 MB of it is taken up by syseng tools.
#0: re-enable Galaxy sharded tests, reduce one test runtime for Galaxy
#0: fix cluster init for Galaxy
#8953: Fix hardcoding of queue sizes in tests.
#8450: Fix compute grid selection for N150. N150 can be a standalone system or part of a TG system. On TG, the compute grid for N150 is different from that of a standalone N150.
#0: Reduce prefetch q entries to account for Galaxy CQ size.
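
The channel-skipping note above can be sketched as follows. This is a minimal illustration, not the code in this PR: the note only says that channels 3 and 7 use hugepage channel 3, so the modulo-4 mapping is an assumption that happens to be consistent with that statement, and every name below (`hugepage_channel_for`, `is_channel_usable`, the `k...` constants) is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// All names and the modulo mapping below are assumptions for illustration.
constexpr std::uint32_t kNumHugepageChannels = 4;
constexpr std::uint32_t kPartialHugepageChannel = 3;          // the 4th hugepage
constexpr std::uint64_t kHugepageBytes = 1ull << 30;          // 1 GB
constexpr std::uint64_t kSysengReservedBytes = 256ull << 20;  // 256 MB used by syseng tools

// Assumed mapping: channels 3 and 7 both land on hugepage channel 3.
constexpr std::uint32_t hugepage_channel_for(std::uint32_t channel) {
    return channel % kNumHugepageChannels;
}

// A channel is skipped when it maps to the hugepage that is not a full 1 GB.
constexpr bool is_channel_usable(std::uint32_t channel) {
    return hugepage_channel_for(channel) != kPartialHugepageChannel;
}

// Space left on the partial hugepage: 1 GB - 256 MB = 768 MB.
constexpr std::uint64_t kPartialHugepageUsableBytes = kHugepageBytes - kSysengReservedBytes;
```

Under these assumptions, `is_channel_usable(3)` and `is_channel_usable(7)` are false, while channels 0 to 2 and 4 to 6 remain usable.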

@pgkeller pgkeller left a comment

I don't see anything of major concern. There are a few items that we should address in a follow-on commit (if not with this commit). Eventually I think we should pull a bunch of the CQ config code out of device.cpp and put it in the CQ code, though we need to think about the device/CQ architecture a bit more.

}
log_debug(tt::LogMetal, "Setting up {} Arguments", magic_enum::enum_name((tt::tt_metal::DispatchWorkerType)dwv));
switch(dwv) {
case PREFETCH:
Contributor

seems these names should be scoped?
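
One way to scope these names is an `enum class`, sketched below under assumptions. Only the enumerator names `PREFETCH`, `US_TUNNELER_REMOTE`, `DEMUX_D` and `DISPATCH_D` come from the snippets in this PR; the numeric values are placeholders, the list is incomplete, and `to_string` is a hypothetical helper rather than anything in tt-metal.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical scoped version of the dispatch worker enum.
// Values are placeholders chosen for this sketch only.
enum class DispatchWorkerType : std::uint32_t {
    PREFETCH = 0,
    US_TUNNELER_REMOTE = 1,
    DEMUX_D = 2,
    DISPATCH_D = 3,
};

// Hypothetical helper: with a scoped enum, every case label must be
// qualified, so the names no longer leak into the enclosing namespace.
constexpr std::string_view to_string(DispatchWorkerType t) {
    switch (t) {
        case DispatchWorkerType::PREFETCH: return "PREFETCH";
        case DispatchWorkerType::US_TUNNELER_REMOTE: return "US_TUNNELER_REMOTE";
        case DispatchWorkerType::DEMUX_D: return "DEMUX_D";
        case DispatchWorkerType::DISPATCH_D: return "DISPATCH_D";
    }
    return "UNKNOWN";
}
```

The trade-off is that a scoped enum no longer converts implicitly to an integer, so code that indexes a container with it, as in `device_worker_variants[DEMUX_D]`, would need an explicit `static_cast`.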

settings.downstream_cores.push_back(mux_settings.worker_physical_core);
settings.compile_args.resize(23);
auto& compile_args = settings.compile_args;
compile_args[0] = downstream_cb_base;
Contributor

As is, this is good enough for now; eventually we should name the arg indices.

Contributor Author

I agree: the indices should be named, and both the host code and the device kernel code should access the compile args through named indices.
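
A minimal sketch of what named indices could look like, assuming a small header that both the host code and the kernel include. The namespace `dispatch_args`, the placeholder entries and `make_compile_args` are all hypothetical; only slot 0 holding `downstream_cb_base` is taken from the snippet under review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shared header: one name per compile-arg slot.
namespace dispatch_args {
enum Index : std::uint32_t {
    DOWNSTREAM_CB_BASE = 0,  // slot 0 in the snippet under review
    PLACEHOLDER_ARG_1 = 1,   // invented for this sketch
    PLACEHOLDER_ARG_2 = 2,   // invented for this sketch
    COUNT                    // keep last: sizes the argument vector
};
}  // namespace dispatch_args

// Host side: size the vector from COUNT and fill slots by name, so that
// adding an argument cannot leave resize() and the indices out of sync.
inline std::vector<std::uint32_t> make_compile_args(std::uint32_t downstream_cb_base) {
    std::vector<std::uint32_t> compile_args(static_cast<std::size_t>(dispatch_args::COUNT), 0u);
    compile_args[dispatch_args::DOWNSTREAM_CB_BASE] = downstream_cb_base;
    return compile_args;
}
```

With this shape, the device kernel would read the same slot through `dispatch_args::DOWNSTREAM_CB_BASE` instead of a literal `0`, and the trailing `// 0: ...` comments become unnecessary.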

compile_args[24] = packet_switch_4B_pack(0xB1, 0xB2, 0xB3, 0xB4); // 24: packetized input dest id
break;
}
case US_TUNNELER_REMOTE:
Contributor

US ?

Contributor Author

Upstream (US) Tunneler.
It's a tunneler running on an inner device in the tunnel, connected to the next tunneled device going away from the host.

auto &compile_args = demux_d_settings.compile_args;
compile_args.resize(30);

compile_args[0] = 0xB1; // 0: endpoint_id_start_index
Contributor

as part of the next cleanup pass, this and other constants should be named and declared in one location I think...
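
As a rough sketch of that cleanup, the raw literals could live in one place. The namespace `endpoint_ids`, its member names and the `nth` helper are hypothetical, and treating `0xB1` through `0xB4` as a single contiguous id family is an assumption based only on the two snippets shown in this review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical central home for the ids that appear as raw literals
// (0xB1..0xB4) in the snippets under review.
namespace endpoint_ids {
constexpr std::uint32_t START_INDEX = 0xB1;
constexpr std::uint32_t COUNT = 4;  // 0xB1, 0xB2, 0xB3, 0xB4

// Id of the n-th endpoint, counting from START_INDEX.
constexpr std::uint32_t nth(std::uint32_t n) { return START_INDEX + n; }
}  // namespace endpoint_ids

// Compile-time check that the last named id matches the last literal.
static_assert(endpoint_ids::nth(endpoint_ids::COUNT - 1) == 0xB4,
              "last endpoint id should match the literal it replaces");
```

A line such as `compile_args[0] = 0xB1;` could then be written as `compile_args[0] = endpoint_ids::START_INDEX;`.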

auto demux_d_settings = std::get<1>(device_worker_variants[DEMUX_D][0]);
auto dispatch_d_settings = std::get<1>(device_worker_variants[DISPATCH_D][0]);

TT_ASSERT(num_prefetchers == demux_d_settings.semaphores.size(), "Demux D does not have required number of semaphores for Prefetcher D. Exptected = {}. Fount = {}", num_prefetchers, demux_d_settings.semaphores.size());
Contributor

Fount typo

COUNT = 12
};

struct worker_build_settings_t{
Contributor

quibble: not a fan of this name. dispatch_worker_build_settings_t ? or dispatch_settings_t? or cq?

@@ -583,6 +691,71 @@ void Cluster::initialize_ethernet_sockets() {
}
}

void Cluster::reserve_ethernet_cores_for_tunneling() {
Contributor

A few high-level comments in front of the bigger routines would help with understanding the flow: what do you have to search through, and what criteria are being looked for?

#8305: add Galaxy cluster apis
#8305: cleanup, add print
#8450: Establish tunnels originating from an mmio device. Determine the remote chips as well as their order on the tunnel.
#8452: add tests for tg pipeline
#0:    patch for tg workflows.
#8450: Add tables for tunnel dispatch workers with build settings.
       Populate build settings for tunnel kernels.
       Launch FD2 kernels based on information in tunnel device dispatch worker map.
       Enable 4 devices per hugepage/channel
#0:    disable hanging/failing tests for Galaxy
#0:    skip using channels 3 and 7, which use hugepage channel 3. This (4th) hugepage is not a full 1 GB in size; 256 MB of it is taken up by syseng tools.
#0:    re-enable Galaxy sharded tests, reduce one test runtime for Galaxy
#0:    fix cluster init for Galaxy
#8953: Fix hardcoding of queue sizes in tests.
#8450: Fix compute grid selection for N150. N150 can be a standalone system or part of a TG system. On TG, the compute grid for N150 is different from that of a standalone N150.
#0:    Reduce prefetch q entries to account for Galaxy CQ size.
#0:    galaxy mesh return any available device
#0:    Fix device mesh close for Galaxy
#8450: Update Galaxy device creation.
@ubcheema ubcheema merged commit d35ea9d into main Jun 11, 2024
74 checks passed
ubcheema added a commit that referenced this pull request Jul 9, 2024
2 participants