WIP: Bulk V2 #2568

keith-ratcliffe · 2024-09-17T21:05:17Z

No description provided.

ivakegg · 2024-09-18T14:09:27Z

properties/dev.properties

@@ -41,7 +41,7 @@ LIVE_CHILD_MAP_MAX_MEMORY_MB=1024
 BULK_CHILD_REDUCE_MAX_MEMORY_MB=2048
 LIVE_CHILD_REDUCE_MAX_MEMORY_MB=1024

-BULK_INGEST_DATA_TYPES=shardStats
+BULK_INGEST_DATA_TYPES=shardStats,wikipedia,mycsv,myjson
 LIVE_INGEST_DATA_TYPES=wikipedia,mycsv,myjson


If you are moving those types to bulk, then remove them from live.

IIRC, my goal here was just to signal to the user that they can now use either live or bulk for the 3 test data types...although, I guess it really doesn't matter in the end, since both live and bulk types here get dumped into the ingest.data.types config list, and that list gets deduped in TypeRegistry
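To illustrate why listing a type in both variables is harmless, here is a minimal sketch of merging the two comma-separated lists into one deduplicated set. It is not the actual TypeRegistry code; the class name `IngestTypeLists` and the method `merge` are hypothetical, and the only behavior assumed from the comment above is that both lists end up in one list and duplicates are dropped.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of DataWave.
final class IngestTypeLists {
    private IngestTypeLists() {}

    // Merges the bulk and live type lists, dropping duplicates while
    // keeping first-seen order. Null or blank entries are ignored.
    static Set<String> merge(String bulkTypes, String liveTypes) {
        Set<String> merged = new LinkedHashSet<>();
        for (String list : new String[] {bulkTypes, liveTypes}) {
            if (list == null) {
                continue;
            }
            for (String name : list.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    merged.add(trimmed);
                }
            }
        }
        return merged;
    }
}
```

With the values from the diff, `IngestTypeLists.merge("shardStats,wikipedia,mycsv,myjson", "wikipedia,mycsv,myjson")` yields four distinct types, so the overlap between the two variables has no effect on the merged list.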

Makes me wonder, do we really need to maintain separate variables for these? So far, I haven't found a case where the distinction matters to our code

Anyway, the live flag maker still polls the datatypeName dirs in hdfs, same as always. And the bulk flag maker is now configured to run, polling the new datatypeName-bulk dirs (created in quickstart's install-ingest.sh above)

These draft changes build on NationalSecurityAgency#2568 with the following differences.

* Compute bulkv2 load plans using new, unreleased APIs in accumulo PR 4898.
* The table splits are loaded at the beginning of writing to rfiles instead of at the end. Not sure about the overall implications of this change on memory use in reducers. The load plan could be computed after the rfile is closed using a new API in 4898 if deferring the loading of tablet splits is desired.
* Switches to using accumulo public APIs for writing rfiles instead of internal accumulo methods. Well, public once they are actually released.
* The algorithm to compute the load plan does less work per key/value. Should be roughly constant time vs log(N).
* Adds a simple SortedList class. The reason this was added is that this code does binary searches on lists, but it was not certain those lists were actually sorted. If a list was not sorted, binary search would not throw an exception but could lead to incorrect load plans and lost data. The new SortedList class ensures lists are sorted and allows this assurance to travel around in the code. Maybe this change should be its own PR.
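The SortedList idea can be sketched as a small wrapper that verifies ordering once, at construction, so that later binary searches can rely on it. This is an illustrative guess at the shape of such a class, not the code from the PR: the factory names `fromSorted` and `fromUnsorted` and the method `tabletIndexFor` are assumptions, and the lookup assumes Accumulo's convention that a tablet's end row is inclusive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; the real SortedList in the PR may differ.
final class SortedList<T extends Comparable<? super T>> {
    private final List<T> items;

    private SortedList(List<T> items) {
        this.items = items;
    }

    // Wraps a list that the caller claims is sorted, failing fast if not.
    static <T extends Comparable<? super T>> SortedList<T> fromSorted(List<T> input) {
        for (int i = 1; i < input.size(); i++) {
            if (input.get(i - 1).compareTo(input.get(i)) > 0) {
                throw new IllegalArgumentException("list is not sorted at index " + i);
            }
        }
        return new SortedList<>(Collections.unmodifiableList(new ArrayList<>(input)));
    }

    // Copies and sorts an arbitrary list before wrapping it.
    static <T extends Comparable<? super T>> SortedList<T> fromUnsorted(List<T> input) {
        List<T> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return new SortedList<>(Collections.unmodifiableList(copy));
    }

    // Read-only view; ordering is guaranteed by construction.
    List<T> get() {
        return items;
    }

    // Treats the items as split points and returns the index of the tablet
    // that would hold the row: tablet i covers (split[i-1], split[i]], and
    // the last tablet (index == size) covers everything after the last split.
    int tabletIndexFor(T row) {
        int pos = Collections.binarySearch(items, row);
        return pos >= 0 ? pos : -(pos + 1);
    }
}
```

Because the only ways to obtain an instance are the two factories, any method that accepts a `SortedList` parameter gets the sortedness guarantee from the type itself, which is the "assurance travels around in the code" property described above.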

keith-ratcliffe force-pushed the wip/BulkV2 branch 4 times, most recently from 393627d to 6cb73d2 Compare September 17, 2024 23:08

ivakegg reviewed Sep 18, 2024

keith-ratcliffe force-pushed the wip/BulkV2 branch 10 times, most recently from a711f5c to 7c11abd Compare September 20, 2024 15:33

WIP: Bulk V2

8c7ed6a

keith-ratcliffe force-pushed the wip/BulkV2 branch from a2430be to 8c7ed6a Compare October 1, 2024 15:51

keith-turner mentioned this pull request Oct 1, 2024

WIP adapt DW PR#2568 to use accumulo PR#4898 #2582

Draft

Test to see if this change results in NPEs being thrown

32c9b62
