forked from pytorch/torchrec
implementation of fbgemm op - regroup_keyed_tensor (pytorch#2128)
Summary:
Pull Request resolved: pytorch#2128

# context
* current production uses `fbgemm.permute_pooled_embs_auto_grad` for `KT.regroup`.
* It has several downsides:
  a) it needs to perform a `torch.cat` operation, costing memory and time
  b) it only supports "no duplicates" in the grouping; otherwise it falls back to a slower PyTorch-native implementation
* the new implementation uses `fbgemm.permute_multi_embedding` for the same function:
  a) it doesn't need `torch.cat`, so it saves memory and time
  b) it supports "duplicates" in the grouping without sacrificing performance

# benchmark results
* stats sheet

|item|baseline|new function|delta perf (%)|notes|
|---|---|---|---|---|
|**runtime**|5.2 ms|2.7 ms|48%|w/o dups|
|**memory**|1.5 K|1.0 K|33%|w/o dups|
|**runtime**|12.3 ms|2.7 ms|78%|w/ dups|
|**memory**|1.0 K|1.0 K|0%|w/ dups|

* log output
```
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 13.1 ms | Memory (P90): 1011.0
permute_multi_embs | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 1011.0
KeyedTensor_regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.2 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 12.3 ms | Memory (P90): 1011.0
permute_multi_embs_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 1011.0
KeyedTensor_regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 12.0 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 11.4 ms | Memory (P90): 1011.0
```
* CPU results are very interesting
```
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 0.4 ms | Memory (P90): 0.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 0.7 ms | Memory (P90): 0.0
[prod] KTRegroupAsDict | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 0.6 ms | Memory (P90): 0.0
```

Differential Revision: D58649553
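
For readers unfamiliar with what "regroup" computes, here is a minimal pure-Python sketch of the semantics only, assuming each keyed tensor is a `[batch, total_dim]` matrix whose columns are split by per-key lengths. It does not call `fbgemm.permute_multi_embedding` and says nothing about how that op is implemented; the name `regroup_reference` and the nested-list representation are hypothetical, chosen for illustration. It shows that a key may appear in more than one output group (the "duplicates" case) without any special handling.

```python
from typing import Dict, List, Sequence, Tuple

Row = List[float]
Matrix = List[Row]  # shape [batch, total_dim], stored row by row


def regroup_reference(
    values: Sequence[Matrix],
    keys: Sequence[Sequence[str]],
    lengths: Sequence[Sequence[int]],
    groups: Sequence[Sequence[str]],
) -> List[Matrix]:
    """Hypothetical reference for regroup semantics (not the fbgemm op)."""
    # Map each key to (source tensor index, start column, end column).
    location: Dict[str, Tuple[int, int, int]] = {}
    for src, (ks, ls) in enumerate(zip(keys, lengths)):
        start = 0
        for k, n in zip(ks, ls):
            location[k] = (src, start, start + n)
            start += n

    batch = len(values[0])
    out: List[Matrix] = []
    for group in groups:
        # Build one output matrix per group by copying column slices in order.
        rows: Matrix = [[] for _ in range(batch)]
        for k in group:
            src, lo, hi = location[k]
            for b in range(batch):
                rows[b].extend(values[src][b][lo:hi])
        out.append(rows)
    return out


# Two keyed inputs with batch size 2.
kt0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # keys: f1 (dim 1), f2 (dim 2)
kt1 = [[7.0, 8.0], [9.0, 10.0]]  # key: f3 (dim 2)
result = regroup_reference(
    values=[kt0, kt1],
    keys=[["f1", "f2"], ["f3"]],
    lengths=[[1, 2], [2]],
    groups=[["f1", "f3"], ["f2", "f1"]],  # f1 is used twice (a duplicate)
)
```

Each output group is written directly from the source columns, which is the property the commit message attributes to the new path: no intermediate concatenation of all inputs is needed before permuting.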
1 parent 04c8076, commit 1dc3dde
Showing 2 changed files with 236 additions and 2 deletions.