Code object compression via bundling #1374
base: develop
Conversation
Force-pushed from 275cfda to 471de13
Basics look good; I assume all rocBLAS tests passed.
    if os.name == "nt":
        # On Windows, the objectFiles list on the command line (including
        # spaces) exceeds the 8191-character limit, so use a response file.
        # Note: the response file must live in asmDir, since clang expands
        # '@clangArgs.txt' relative to the working directory (cwd=asmDir);
        # writing it to '/tmp' would break the lookup (and '/tmp' generally
        # does not exist on Windows).
        responseFile = os.path.join(asmDir, 'clangArgs.txt')
        with open(responseFile, 'wt') as file:
            file.write(" ".join(objFiles))
            file.flush()

        args = [globalParameters['AssemblerPath'], '-target', 'amdgcn-amd-amdhsa', '-o', coFileRaw, '@clangArgs.txt']
        subprocess.check_call(args, cwd=asmDir)
    else:
        numObjFiles = len(objFiles)
        maxObjFiles = 10000

        if numObjFiles > maxObjFiles:
            batchedObjFiles = [objFiles[i:i+maxObjFiles] for i in range(0, numObjFiles, maxObjFiles)]
            batchSize = int(math.ceil(numObjFiles / maxObjFiles))
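The non-Windows branch above splits the object file list into batches of at most 10000 entries. The chunking logic can be sketched in isolation as follows (the helper name is illustrative, not part of Tensile):

```python
def batch_obj_files(obj_files, max_per_batch=10000):
    """Split a list of object files into chunks of at most max_per_batch,
    mirroring the list-slicing comprehension in the snippet above."""
    return [obj_files[i:i + max_per_batch]
            for i in range(0, len(obj_files), max_per_batch)]

# Example: 25 files with a batch limit of 10 yields batches of 10, 10, 5.
batches = batch_obj_files([f"kernel{i}.o" for i in range(25)], max_per_batch=10)
```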
Can you consider doing what I added for windows (response file) instead of processing in batches? Is there a limit of 10000 with the linker?
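For reference, the response-file approach suggested here is platform-independent and can be sketched as below. This is a hedged sketch, not Tensile's implementation: the helper name and file name are illustrative, and the final clang invocation is returned but not executed.

```python
import os

def build_response_file_args(assembler, obj_files, co_file, work_dir):
    """Write the object-file list to a response file in work_dir and return
    an argument vector referencing it via clang's @file syntax.
    Illustrative sketch; names do not come from Tensile."""
    response_path = os.path.join(work_dir, "clangArgs.txt")
    with open(response_path, "wt") as f:
        f.write(" ".join(obj_files))
    # clang expands '@clangArgs.txt' relative to the working directory,
    # so the caller should run the command with cwd=work_dir, e.g.:
    #   subprocess.check_call(args, cwd=work_dir)
    return [assembler, "-target", "amdgcn-amd-amdhsa",
            "-o", co_file, "@clangArgs.txt"]
```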
This batching strategy was carried over from TensileLite, but the source PR doesn't explain the motivation beyond your interpretation: at some point an error occurred because too many arguments were passed to the compiler.
I spent some time trying to find documentation on this 10000-input limit but couldn't find anything meaningful, so I left the implementation intact. @KKyang I believe that you implemented this feature, could you provide more context?
#415 this?
That's correct.
Force-pushed from 1458f44 to d8b5337
Might be worth zstd-compressing the msgpack files too; they're pretty compressible. Here's an untested attempt at decompression support in case it's helpful:

diff --git a/Tensile/Source/lib/source/msgpack/MessagePack.cpp b/Tensile/Source/lib/source/msgpack/MessagePack.cpp
index de97929c..dbc397e0 100644
--- a/Tensile/Source/lib/source/msgpack/MessagePack.cpp
+++ b/Tensile/Source/lib/source/msgpack/MessagePack.cpp
@@ -28,6 +28,8 @@
#include <Tensile/msgpack/Loading.hpp>
+#include <zstd.h>
+
#include <fstream>
namespace Tensile
@@ -86,6 +88,34 @@ namespace Tensile
return nullptr;
}
+ // Check if the file is zstd compressed
+ char magic[4];
+ in.read(magic, 4);
+ bool isCompressed = (in.gcount() == 4 && magic[0] == '\x28' && magic[1] == '\xB5' && magic[2] == '\x2F' && magic[3] == '\xFD');
+ // Reset file pointer to the beginning
+ in.seekg(0, std::ios::beg);
+
+ if (isCompressed) {
+ // Decompress zstd file
+ std::vector<char> compressedData((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
+
+ size_t decompressedSize = ZSTD_getFrameContentSize(compressedData.data(), compressedData.size());
+ if (decompressedSize == ZSTD_CONTENTSIZE_ERROR || decompressedSize == ZSTD_CONTENTSIZE_UNKNOWN) {
+ if(Debug::Instance().printDataInit())
+ std::cout << "Error: Unable to determine decompressed size for " << filename << std::endl;
+ return nullptr;
+ }
+
+ std::vector<char> decompressedData(decompressedSize);
+ size_t dSize = ZSTD_decompress(decompressedData.data(), decompressedSize, compressedData.data(), compressedData.size());
+ if (ZSTD_isError(dSize)) {
+ if(Debug::Instance().printDataInit())
+ std::cout << "Error: ZSTD decompression failed for " << filename << std::endl;
+ return nullptr;
+ }
+
+ msgpack::unpack(result, decompressedData.data(), dSize);
+ } else {
msgpack::unpacker unp;
bool finished_parsing;
constexpr size_t buffer_size = 1 << 19;
@@ -109,6 +139,7 @@ namespace Tensile
return nullptr;
}
+ }
}
catch(std::runtime_error const& exc)
{
@LunNova Thanks for the code snippet. I've had this idea as well and plan to implement it. However, for the scope of this PR we'll keep it to code object files and add .dat file compression in another PR.
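The C++ sketch above distinguishes compressed from plain msgpack files by the 4-byte Zstandard frame magic number (bytes 0x28 0xB5 0x2F 0xFD, i.e. 0xFD2FB528 stored little-endian, per RFC 8878). The same detection check as a small standalone Python helper (the function name is illustrative, not part of Tensile):

```python
# Zstandard frame magic number as it appears on disk (little-endian),
# per RFC 8878, section 3.1.1.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def is_zstd_frame(data: bytes) -> bool:
    """Return True if the buffer begins with the zstd frame magic,
    matching the in.read(magic, 4) check in the C++ diff above."""
    return data[:4] == ZSTD_MAGIC
```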
Summary:
This PR adds a compression layer to all final code objects, thereby generating smaller libraries at the expense of build time. Includes minor refactoring.
Outcomes:
getAssemblyCodeObjectFiles has been renamed to buildAssemblyCodeObjectFiles to match the naming of the source kernel functions.

Testing and Environment:
Docker: Ubuntu 24.04, ROCm 6.4 RC stack, AMD clang version 18.0.0, AMD clang-offload-bundler version 18.0.0
Tested with hipBLASLt bench and test clients