-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-32017 New compressing splitter implementation #18738
HPCC-32017 New compressing splitter implementation #18738
Conversation
Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32017 Jirabot Action Result: |
65d377e
to
13194ad
Compare
There are a couple of cores generated in Regression Setup:
|
13194ad
to
6a7b96f
Compare
thorlcr/thorutil/thbuf.cpp
Outdated
{ | ||
Owned<IFileIO> iFileIO = iFile->open(IFOread); | ||
Owned<ISerialInputStream> in = createSerialInputStream(iFileIO); | ||
Owned<IBufferedSerialInputStream> inputStream = createBufferedInputStream(in, readAheadSize, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be the preferred block size from the storage plane
thorlcr/thorutil/thbuf.cpp
Outdated
const char *options = nullptr; | ||
Owned<ICompressor> compressor = compressHandler->getCompressor(options); | ||
Owned<ISerialOutputStream> compressed = createCompressingOutputStream(outputStream, compressor); | ||
constexpr size32_t compressedSize = 0x100000; // JCSMORE configurable/what is ideal size? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compressionBlockSize - should pass the same constant to the decompressor
thorlcr/thorutil/thbuf.cpp
Outdated
ActPrintLog(&activity, "Spilling to temp storage [file = %s]", iFile->queryFilename()); | ||
Owned<IFileIO> io = iFile->open(IFOcreate); | ||
Owned<ISerialOutputStream> out = createSerialOutputStream(io); | ||
outputStream.setown(createBufferedOutputStream(out, writeAheadSize)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be from the plane the preferred block size for the temp plane
thorlcr/thorutil/thbuf.cpp
Outdated
} | ||
bool checkWriteAhead(rowcount_t &outputRowsAvailable) | ||
{ | ||
CriticalBlock b(crit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to code as:
if (totalRowsWritten == outputRowsAvailable)
{
CriticalBlock b(crit);
if (totalRowsWritten == outputRowsAvailable)
{
with atomic variables for totalRowsWritten
thorlcr/thorutil/thbuf.cpp
Outdated
} | ||
bool getRowsInMem(Rows &outputRows, rowcount_t &outputRowsAvailable) | ||
{ | ||
CriticalBlock b(crit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also block other threads. Avoiding that will be a bit tricky - but a circular buffer would make it easier.
Reallocation also makes it tricky
thorlcr/thorutil/thbuf.cpp
Outdated
size32_t sz = thorRowMemoryFootprint(serializer, row); | ||
rows.emplace_back(row, sz); | ||
rowsMemUsage += sz; | ||
if (rowsMemUsage >= maxRowMem) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to use a different condition here. I.e. read in 256K chunks until you get to 1MB
Maybe 64 rows or 128K whichever is hit first
1c31c3b
to
be4d061
Compare
@ghalliday - I think ready for full review (though I've left a 'test' commit in for now). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesmith some initial comments. I will reread again
thorlcr/thorutil/thbuf.cpp
Outdated
} | ||
virtual void endNested(size32_t sizePos) override | ||
{ | ||
size32_t nestedSize = nestedSizes.back(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic will not work for datasets nested within datasets. I need to add a tell() function to the output (that was coming in my next iteration anyway). You then need something along the lines of
size32_t beginNested(count)
{
output->suspend(sizeof(size32_t));
if (nesting++ == 0)
outerNestingOffset = output->tell();
return output->tell()-startNesting;
}
void endNested(size32_t delta)
{
size32_t patchedLength = output->tell() + delta - outerNestingOffset;
outputStream->resume(sizeof(size32_t), &patchedLength);
nesting--;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let me know when you push your update.
thorlcr/thorutil/thbuf.cpp
Outdated
return rowBuilder.finalizeRowClear(sz); | ||
} | ||
public: | ||
rowcount_t lastKnownWritten = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing that it is not with the other member variables. Should probably be private with an accessor function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. Have added get/set methods.
thorlcr/thorutil/thbuf.cpp
Outdated
memsize_t rowsMemUsage = 0; | ||
std::atomic<rowcount_t> totalInputRowsRead = 0; // not used until spilling begins, represents count of all rows read | ||
rowcount_t inMemTotalRows = 0; // whilst in memory, represents count of all rows seen | ||
CriticalSection crit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this protecting? Worth giving it a more descriptive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have renamed/added comment.
thorlcr/thorutil/thbuf.cpp
Outdated
} | ||
void createOutputStream() | ||
{ | ||
// NB: this flushes existing unread rows from memory. Called once, when spilling starts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it does anything with unread rows. It affects what happens next to rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stale comment, from when it did. Changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesmith some initial comments. I will reread again
} | ||
} | ||
} | ||
virtual void stop() override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop() should indicate it is not going to read any more rows (e.g., set lastKnownWritten to -1), otherwise an output that only reads 100 records and then stops will cause another output to spill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does, via call to owner.outputStopped. It does it that way because needs to do so within readAheadCS.
Have added comment
thorlcr/thorutil/thbuf.cpp
Outdated
return rowBuilder.finalizeRowClear(sz); | ||
} | ||
public: | ||
rowcount_t lastKnownWritten = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lastKnownAvailable would be a better name to cover both in-memory and spilled rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, renamed.
aebec19
to
2f0827d
Compare
thorlcr/thorutil/thbuf.cpp
Outdated
eof = true; | ||
} | ||
}; | ||
class COutputStreamSerializer : public CSimpleInterfaceOf<IRowSerializerTarget> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to same place as CThorStreamDeserializerSource - so it can be reused in the unit tests?
c8ac137
to
28cfe90
Compare
thorlcr/thorutil/thbuf.hpp
Outdated
struct SharedRowStreamReaderOptions | ||
{ | ||
offset_t storageBlockSize = 256 * 1024; // block size of read/write streams | ||
memsize_t compressionBlockSize = 256 * 1024; // compression buffer size of read streams |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this should be 1024 as it is elsewhere.
thorlcr/thorutil/thbuf.hpp
Outdated
struct SharedRowStreamReaderOptions | ||
{ | ||
offset_t storageBlockSize = 256 * 1024; // block size of read/write streams | ||
memsize_t compressionBlockSize = 256 * 1024; // compression buffer size of read streams |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this should be 1024 as it is elsewhere.
thorlcr/thorutil/thbuf.hpp
Outdated
memsize_t inMemMaxMem = 2000 * 1024; // before spilling begins. | ||
memsize_t inMemReadAheadGranularity = 128 * 1024; // granularity (K) of read ahead | ||
rowcount_t inMemReadAheadGranularityRows = 64; // granularity (rows) of read ahead. NB: whichever granularity is hit first | ||
offset_t spillWriteAheadSize = 4000 * 1024; // once spilling, maximum size to write ahead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to be this high (as long as it is slightly less than a multiple of compressionBlockSize), but I'm not sure what effect varying it will have. Might be worth experimenting.
Signed-off-by: Jake Smith <[email protected]>
0db080a
to
2116ab4
Compare
fcb0440
into
hpcc-systems:candidate-9.6.x
Type of change:
Checklist:
Smoketest:
Testing: