This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Added an option for choosing absolute (cumulative) read/write throughput; handled an issue with parallelism while writing. #66

Open
wants to merge 7 commits into master

Conversation


@LalithSrinivas commented Jun 11, 2020

What?
Add support for DynamoDB batch writes using an absolute RCU and WCU limit set by the Spark job, in addition to the existing target-capacity-based batch writes. We also add an option to increase or decrease the parallelism while writing data to DynamoDB by setting the number of partitions. A usage sketch is shown below.
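The following is a minimal, illustrative usage sketch, assuming the generic Spark data source read/write path that spark-dynamodb exposes (format("dynamodb") with a tableName option). The table name, input path and numeric values are placeholders; the option names absRead, absWrite, maxRetries and numInputDFPartitions come from this PR.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("dynamo-abs-throughput-example").getOrCreate()
val df: DataFrame = spark.read.parquet("s3://some-bucket/some-data") // placeholder input

// Write with an absolute (cumulative) WCU limit, a bounded retry count
// and an explicit partition count for the input DataFrame.
df.repartition(8)
  .write
  .format("dynamodb")
  .option("tableName", "SomeTable")       // placeholder table name
  .option("absWrite", "400")              // cumulative WCU limit for the whole job (this PR)
  .option("maxRetries", "5")              // cap on retries of unprocessed batch items (this PR)
  .option("numInputDFPartitions", "8")    // number of partitions of the input DataFrame (this PR)
  .save()

// Read with an absolute (cumulative) RCU limit.
val dynamoDf = spark.read
  .format("dynamodb")
  .option("tableName", "SomeTable")
  .option("absRead", "200")               // cumulative RCU limit for the scan (this PR)
  .load()
```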
Why?
Adding support for absolute RCU and WCU gives the user more flexibility when running the job. It also removes the manual work of calculating the target-capacity percentage from the provisioned RCU and WCU.
Other changes
We also noticed that retries for unprocessed DynamoDB write items can loop indefinitely, with no hard limit. We add an option for the user to cap the retries using the maxRetries parameter.
How

  1. By passing the "absWrite" and "absRead" options, the user can define cumulative write and read throughput.
    This is achieved by reading the (lowercased) "absread" and "abswrite" keys from the parameters map. If the map contains these keys, their values are used as the cumulative throughput limits. Otherwise they default to -1 and, as before, the throughput is fetched from the table properties (lines from 69, TableConnector.scala). See the first sketch after this list.

  2. If a user-defined data frame is distributed over more (or fewer) partitions than the value of the defaultParallelism parameter, the Write Capacity Units will be overshot (or undershot). To handle this, the parallelism factor of the user-defined data frame must be identified, which is the number of running tasks: (number of stages) * (number of partitions of the DF). Since the write runs as a single stage, the number of tasks equals the number of partitions of the DF, hence the new numInputDFPartitions parameter (lines 57, 85, TableConnector.scala). See the first sketch after this list.

  3. The infinite recursion case in the handleBatchWriteResponse method is handled by adding a maximum-retries constraint, which the user can set via the maxRetries parameter. See the second sketch after this list.
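First, a simplified sketch of the throughput resolution described in points 1 and 2. This is not the actual TableConnector.scala code; the method names and defaults are illustrative assumptions, while the parameter names ("absread", "abswrite", numInputDFPartitions, defaultParallelism, targetCapacity) mirror the PR description (Spark lowercases option keys, hence the lowercase lookups).

```scala
// Illustrative sketch only -- not the actual TableConnector.scala implementation.

// Point 1: prefer the absolute limits when the user supplied them, otherwise fall
// back to the table's provisioned throughput scaled by the target capacity.
def resolveThroughput(parameters: Map[String, String],
                      provisionedRead: Double,
                      provisionedWrite: Double,
                      targetCapacity: Double): (Double, Double) = {
  val absRead = parameters.get("absread").map(_.toDouble).getOrElse(-1.0)
  val absWrite = parameters.get("abswrite").map(_.toDouble).getOrElse(-1.0)

  val readThroughput = if (absRead > 0) absRead else provisionedRead * targetCapacity
  val writeThroughput = if (absWrite > 0) absWrite else provisionedWrite * targetCapacity
  (readThroughput, writeThroughput)
}

// Point 2: a write of a user-defined DataFrame runs as a single stage with one task
// per partition, so the per-task rate is the cumulative limit divided by the number
// of input partitions rather than by defaultParallelism.
def perTaskWriteLimit(cumulativeWriteThroughput: Double,
                      numInputDFPartitions: Int,
                      defaultParallelism: Int): Double = {
  val writeParallelism = if (numInputDFPartitions > 0) numInputDFPartitions else defaultParallelism
  cumulativeWriteThroughput / writeParallelism
}
```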
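Second, a sketch of the bounded retry described in point 3, assuming the AWS SDK v1 DynamoDB document API that spark-dynamodb builds on. The method signature is simplified and the rate limiting that the real handleBatchWriteResponse applies before each retry is omitted.

```scala
import scala.annotation.tailrec
import scala.collection.JavaConverters._
import com.amazonaws.services.dynamodbv2.document.{BatchWriteItemOutcome, DynamoDB}

// Illustrative sketch only -- the real method also applies a rate limiter per retry.
@tailrec
def handleBatchWriteResponse(client: DynamoDB, maxRetries: Int)
                            (response: BatchWriteItemOutcome): Unit = {
  val unprocessed = response.getUnprocessedItems
  if (unprocessed != null && !unprocessed.isEmpty) {
    if (maxRetries <= 0) {
      // Out of retries: report how many items were left unprocessed
      // (the real code logs this at info level).
      val remaining = unprocessed.values.asScala.map(_.size).sum
      println(s"Max retries reached; $remaining items left unprocessed")
    } else {
      // Re-submit only the unprocessed items and recurse with one retry fewer.
      val retryResponse = client.batchWriteItemUnprocessed(unprocessed)
      handleBatchWriteResponse(client, maxRetries - 1)(retryResponse)
    }
  }
}
```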

LalithSrinivas added 7 commits June 11, 2020 20:39
1. Added an option to use a user-defined write/read limit by passing the "absRead" and "absWrite" parameters as keys and the required limits as values.
2. Resolved a case of an infinite loop in the "handleBatchWriteResponse" method by adding a limit on the maximum retries of unprocessed data. It can be passed as a parameter by the user under the name "maxRetries".
3. Fixed an issue with writeLimit when the data frame is user-defined (in which case the parallelism should not be used as the number of parallel tasks; the number of tasks itself should). Added the "numInputDFPartitions" parameter for this, since number of tasks = (number of stages) * (number of partitions of the DF), and here the number of stages is 1.
Added a logger info message to let the user know the number of unprocessed items when the maximum retries is reached.
@LalithSrinivas LalithSrinivas marked this pull request as ready for review June 15, 2020 04:59
@cosmincatalin cosmincatalin requested a review from jacobfi July 10, 2020 11:18