
#3900: Add prod support for batch and channels #4933

Closed
wants to merge 2 commits into from

Conversation

ruthreshx
Contributor

Added batch and channel support for the Prod op.

Working on H & W.


(tt_input, tt_output, torch_input) = get_tensors(input_shape, output_shape, device)

torch_output = torch.sum(torch_input, dims, True)
Contributor
Why is this correct? Isn't it supposed to be torch.prod? https://pytorch.org/docs/stable/generated/torch.prod.html

Contributor

@muthutt muthutt left a comment

some early comments

Contributor

@muthutt muthutt left a comment

LGTM

#include "tt_eager/tt_dnn/op_library/prod/kernels/utils.hpp"

inline uint32_t get_read_tile_id(uint32_t tile_id, uint32_t dim, uint32_t input_tile_offset, uint32_t HtWt) {
    return (dim == 0) ? (tile_id) : (tile_id / HtWt * input_tile_offset) + (tile_id % HtWt);
}
Contributor

In general, we want to avoid multiplications (especially for GS) and especially avoid divisions, since they're not performant on the riscs. You should try to refactor your code to avoid this, especially since you're calling this every loop iteration. It could be refactored to use addition and an if check.
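A hedged, self-contained sketch of that refactor (the struct and all names are illustrative, not from this PR): for consecutive tile ids, the div/mod pair can be replaced by two counters that are advanced with addition and a single compare.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: track (tile_id / HtWt * input_tile_offset) and
// (tile_id % HtWt) incrementally instead of recomputing them with a
// division and a modulo on every loop iteration.
struct TileIdTracker {
    uint32_t HtWt;
    uint32_t input_tile_offset;
    uint32_t within_hw = 0;   // equals tile_id % HtWt, tracked incrementally
    uint32_t block_base = 0;  // equals (tile_id / HtWt) * input_tile_offset

    // Returns the read tile id for the next consecutive tile_id,
    // using only addition and one compare.
    uint32_t next() {
        uint32_t read_tile_id = block_base + within_hw;
        within_hw++;
        if (within_hw == HtWt) {  // crossed into the next block
            within_hw = 0;
            block_base += input_tile_offset;
        }
        return read_tile_id;
    }
};
```

Calling next() once per tile in the reader loop produces the same sequence as the div/mod formula for the dim != 0 case.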

Comment on lines +49 to +53
if (input_is_dram) {
    noc_async_read_tile(read_tile_id, dram_input_addrg, l1_write_addr_in0);
} else {
    noc_async_read_tile(read_tile_id, l1_input_addrg, l1_write_addr_in0);
}
Contributor

I would use a compile time arg for the dram flag so you create a single addr gen with the correct template, instead of paying for this cost at runtime in a loop
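For reference, a rough sketch of that suggestion, assuming the usual tt-metal dataflow helpers (get_compile_time_arg_val and the InterleavedAddrGenFast template); the argument index and variable names are illustrative, not this PR's code:

```cpp
// Read the DRAM/L1 flag at compile time so a single address generator is
// instantiated with the right template parameter, instead of branching on
// input_is_dram inside the read loop at runtime.
constexpr bool input_is_dram = get_compile_time_arg_val(0) == 1;

const InterleavedAddrGenFast<input_is_dram> input_addrg = {
    .bank_base_address = input_addr,
    .page_size = tile_bytes,
    .data_format = data_format,
};

// Inside the loop, one unconditional call:
noc_async_read_tile(read_tile_id, input_addrg, l1_write_addr_in0);
```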

Comment on lines +9 to +16
void fill_cb_with_value(uint32_t cb_id, uint32_t value, int32_t num_of_elems = 1024) {
    cb_reserve_back(cb_id, 1);
    auto ptr = reinterpret_cast<uint16_t *>(get_write_ptr(cb_id));
    for (int j = 0; j < num_of_elems; j++) {
        ptr[j] = uint16_t(value >> 16);
    }
    cb_push_back(cb_id, 1);
}
Contributor

Any reason why we use int for num_of_elems/the loop?

For pointers to L1 memory you should create them as
volatile tt_l1_ptr std::uint16_t* ptr = (volatile tt_l1_ptr uint16_t*)(get_write_ptr(cb_id));
This is to avoid a potential hang due to a HW bug.

You can also store/use the u16 value like you do in later code:
const auto u16_value = uint16_t(value >> 16);


// mask_h
// first tile ptr
auto mask_h_ptr = reinterpret_cast<uint16_t *>(get_write_ptr(cb_mask_h_w));
Contributor

Same comment on l1 ptr as above

Contributor

The generate_mask_h_w function is not used. Hence, I'm removing it.

Comment on lines +45 to +51
uint32_t h = 0;
for (; h < mask_h_0; h++) {
    mask_h_ptr[h * 16 + w] = u16_one;
}
for (; h < 16; h++) {
    mask_h_ptr[h * 16 + w] = u16_zero;
}
Contributor

Avoid using multiplies
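As an illustrative, self-contained sketch of that suggestion (the function name and the 16x16 face size are assumptions for the example), the h * 16 + w index can be kept in a running offset that only ever sees additions:

```cpp
#include <cassert>
#include <cstdint>

// Fill one column (w) of a 16x16 mask face: rows [0, mask_h_0) get u16_one,
// the remaining rows get u16_zero. A running offset replaces the per-row
// multiply h * 16.
void fill_mask_column(uint16_t *mask_ptr, uint32_t w, uint32_t mask_h_0,
                      uint16_t u16_one, uint16_t u16_zero) {
    uint32_t idx = w;                 // row 0, column w
    for (uint32_t h = 0; h < mask_h_0; h++) {
        mask_ptr[idx] = u16_one;
        idx += 16;                    // advance one row: replaces h * 16 + w
    }
    for (uint32_t h = mask_h_0; h < 16; h++) {
        mask_ptr[idx] = u16_zero;
        idx += 16;
    }
}
```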

tt_eager/tt_dnn/op_library/prod/prod_nc_op.cpp Outdated Show resolved Hide resolved
Comment on lines +43 to +49
Tensor prod(
    const Tensor &input,
    const Tensor &output,
    std::vector<int64_t> &dims,
    const MemoryConfig &mem_config = operation::DEFAULT_OUTPUT_MEMORY_CONFIG);
Contributor

This function should not take a mem_config since it already takes output

Contributor Author

mem_config is required for the create_output_tensor function.

Contributor

So this is used for the intermediate mem config then, right? I think if you want to keep this arg you should rename it in the binding, since it is currently labelled output_mem_config, but your output tensor already has a mem_config that is used for the final output.

tt_eager/tt_dnn/op_library/prod/prod_nc_op.cpp Outdated Show resolved Hide resolved
Comment on lines +21 to +22
const auto& input = inputs.at(0);
const auto& output = inputs.at(1);
Contributor

I think you should assert for rank 4 tensors here since you are hardcoding indices like 2, 3 elsewhere.
Otherwise you should update it to support tensors other than rank 4 if possible.
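A hypothetical sketch of such a check (TT_FATAL is tt-metal's usual validation macro; the shape accessor and message here are assumptions, not this PR's code):

```cpp
// Guard the hardcoded dim indices (2, 3) used elsewhere in the op.
TT_FATAL(input.get_legacy_shape().rank() == 4, "Prod op currently supports only rank-4 tensors");
```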

Contributor Author

Added the rank 4 assert in the code accordingly.

@ruthreshx ruthreshx force-pushed the ruthresh/prod_op branch 2 times, most recently from d1fff01 to 9006951 Compare January 31, 2024 08:37
@ruthreshx
Contributor Author

Hi @tt-aho,
Addressed the comments.
Regarding the L1 ptr and avoiding the multiplies:
I took reference from moreh sum to add the N & C support for prod, and replicated the same approach here.

Comment on lines +11 to +12
ALWI void ACQ() { acquire_dst(tt::DstMode::Half); }
ALWI void REL() { release_dst(tt::DstMode::Half); }
Contributor

Switch to new APIs:

  • tile_regs_commit, tile_regs_release
  • tile_regs_acquire, tile_regs_wait

Docs: tt_metal/include/compute_kernel_api/reg_api.h
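For reference, a hedged sketch of the newer sequence (following reg_api.h; the math and pack calls shown are placeholders, not this PR's code):

```cpp
tile_regs_acquire();   // MATH thread: wait for DST registers to be free
// ... issue math ops that write into DST, e.g. mul_tiles(...) ...
tile_regs_commit();    // MATH thread: mark DST ready for the packer

tile_regs_wait();      // PACK thread: wait for committed DST registers
// ... pack_tile(...) results into the output CB ...
tile_regs_release();   // PACK thread: hand DST back to math
```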

Comment on lines +47 to +64
@pytest.mark.parametrize(
    "dims",
    (
        [
            0,
        ],
        [
            1,
        ],
        [
            2,
        ],
        [
            3,
        ],
    ),
    ids=["0", "1", "2", "3"],
)
Contributor

fix this

Comment on lines +9 to +16
void mask_tile_in_reader(uint32_t l1_addr, uint32_t mask_w = 32, uint32_t mask_h = 32) {
    union {
        float f;
        uint32_t u;
    } zero;
    zero.f = 0.0f;
    auto ptr = reinterpret_cast<uint16_t *>(l1_addr);
    for (uint32_t h = 0; h < 16; h++) {
Contributor

do we have to do this in the kernel?

Contributor

@VirdhatchaniKN VirdhatchaniKN Mar 20, 2024

File removed as HW support has been given using NC support

Comment on lines +73 to +79
if (scaler != 0) {
    auto ptr = reinterpret_cast<uint16_t *>(get_write_ptr(cb_id_in2));
    for (int j = 0; j < 1024; j++) ptr[j] = uint16_t(0);

    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 16; j++) ptr[k * 256 + j] = uint16_t(scaler >> 16);
}
Contributor

Do we have to do this in kernel?

@muthutt
Contributor

muthutt commented Feb 9, 2024 via email

Filling CBs is done in the kernel by others, including MOREH, in recent code. I don't see anything objectionable. @TT-BrianLiu

@TT-BrianLiu
Contributor

We should look into better ways of lowering constants like this. Convs do this with small sharded tensors. If you want to leave it like this, please document it better and clean it up (for example, what are those hardcoded numbers, and why are the loop indices ints?).
