Very bad read performance for pretty small DynamoDB table #65
Comments
Hi, could you try calling `cache()` on the Dynamo table's dataframe and see if that makes a difference? Also, I think reading 200 KB with 5 read capacity should take around 5 seconds.
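For reference, that rough estimate follows from DynamoDB's capacity-unit arithmetic: one RCU covers one strongly consistent 4 KB read per second, or two eventually consistent reads. A quick back-of-the-envelope sketch:

```python
# Back-of-the-envelope scan time for a 200 KB table with 5 provisioned RCU.
# 1 RCU = one strongly consistent 4 KB read/s, or two eventually consistent reads/s.
table_size_kb = 200
rcu = 5

eventually_consistent_kb_per_s = rcu * 8   # 40 KB/s
strongly_consistent_kb_per_s = rcu * 4     # 20 KB/s

print(table_size_kb / eventually_consistent_kb_per_s)  # ~5 s
print(table_size_kb / strongly_consistent_kb_per_s)    # ~10 s
```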
Please note that I'm talking about a Structured Streaming project and thus need to update the Dynamo dataframe in every micro-batch. I need the latest data from the Dynamo table, so caching is not feasible for my use case. Anyway, this would possibly only mitigate the original problem.
Yes, sorry for not explaining - I know it's not a fix and a workaround at best.
Ok, I got it! I tried the following code snippet:
Result was:
As you can see, both queries took nearly the same amount of time (about 15 seconds). What can we conclude from that? That the problem is not at the Spark planning layer? Any other ideas on how to proceed?
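The snippet itself is not shown above, but a timing comparison of that kind can be sketched as follows; here the two queries are assumed, purely for illustration, to be a plain count and a count on a cached copy of the DynamoDB dataframe:

```python
import time

# Hypothetical comparison; dynamo_df is assumed to be the DataFrame loaded
# via spark-dynamodb. Neither query comes from the original snippet.
start = time.time()
dynamo_df.count()
print("plain count took", time.time() - start, "s")

cached_df = dynamo_df.cache()
cached_df.count()                      # first action still reads from DynamoDB
start = time.time()
cached_df.count()                      # served from the Spark cache
print("cached count took", time.time() - start, "s")
```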
Hi, if this is indeed the case for you, then a workaround right now would be to set the option manually. Let me know if this works. It might also just work today if the table description is up-to-date.
Yes, it seems you are right! I also saw rather random behaviour in read performance. Now that I have hard-coded the option, I will play around a little bit with different values and use it for now. Thank you for providing this workaround!
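For reference, a hedged sketch of what hard-coding the read throughput could look like in PySpark, assuming the option in question is spark-dynamodb's `throughput` setting (the table name and value are placeholders):

```python
# Sketch only: override the read throughput the connector would otherwise derive
# from the DynamoDB table description. The `throughput` option is an assumption
# based on the spark-dynamodb documentation; the table name is a placeholder.
dynamo_df = (
    spark.read
    .option("tableName", "my_table")
    .option("throughput", "5")   # hard-coded read capacity units to budget for
    .format("dynamodb")
    .load()
)
```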
Hi there, changing that option worked for us as well. We also had a second problem where we had to set another option.
@jacobfi When you say "DynamoDB table description", what are you referring to? Is this the output of the DynamoDB `DescribeTable` call? Is the problem that DynamoDB's table description is only updated periodically?
Yes, the library uses the `DescribeTable` output (the table description) to plan the read.
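One way to see what the table description currently reports is a direct `DescribeTable` call with boto3; note that `ItemCount` and `TableSizeBytes` are only refreshed periodically (roughly every six hours), so they can lag behind the actual table contents. The table name and region below are placeholders:

```python
import boto3

# Inspect the table description that the connector bases its planning on.
client = boto3.client("dynamodb", region_name="eu-west-1")           # placeholder region
description = client.describe_table(TableName="my_table")["Table"]   # placeholder name

print("ItemCount:        ", description.get("ItemCount"))
print("TableSizeBytes:   ", description.get("TableSizeBytes"))
print("ReadCapacityUnits:", description.get("ProvisionedThroughput", {}).get("ReadCapacityUnits"))
```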
I want to read a very small DynamoDB table (about 6,500 items, 200 KB in total) into my Spark Structured Streaming job in every micro-batch. I use PySpark with Spark version 2.4.4 and spark-dynamodb version 1.0.4. The DynamoDB table has a provisioned read capacity of 5.
My code looks as follows:
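A minimal spark-dynamodb read along the lines described would look roughly like this (the table name and package coordinates are placeholders, not the original code):

```python
# Submitted with the connector on the classpath, e.g.:
#   pyspark --packages com.audienceproject:spark-dynamodb_2.11:1.0.4
# Minimal sketch of the read; "my_table" is a placeholder.
dynamo_df = (
    spark.read
    .option("tableName", "my_table")
    .format("dynamodb")
    .load()
)
dynamo_df.count()   # force a full scan of the table
```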
I am seeing very slow read performance: it takes multiple seconds, sometimes up to a few minutes, to read those few items from Dynamo:
I also noticed that only a small portion of the provisioned read capacity is consumed on every read:
It seems to be random how much read capacity is used; sometimes even less is consumed. But even with a read capacity of 1 or so, reading ~6,500 items from such a small DynamoDB table should be much faster.
I also tried some configurations like:
.option("filterPushdown", False)
.option("readPartitions", <1, 2, 4, 8, ...>)
.option("targetCapacity", 1)
with no effect at all. I noticed that with readPartitions of e.g. 8 it's a little bit faster (about 20 seconds), but still not fast enough in my opinion. And I think that such a small number of items should be readable with a single partition in a reasonable amount of time.
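For completeness, a sketch of how those reader options combine on a single read (all values and the table name are just examples):

```python
# Example only: the options listed above chained onto one spark-dynamodb read.
dynamo_df = (
    spark.read
    .option("tableName", "my_table")     # placeholder
    .option("filterPushdown", False)
    .option("readPartitions", 8)
    .option("targetCapacity", 1)
    .format("dynamodb")
    .load()
)
```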
Any ideas what I'm doing wrong? Any advice? Thanks in advance!