This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

DynamoDB load always uses full scan instead of the specified global secondary index in Python #101

Open
anujpareek opened this issue Jul 9, 2021 · 0 comments


@anujpareek

anujpareek commented Jul 9, 2021

In Python I'm trying to read a DynamoDB table through a global secondary index, with a provided schema and filters. It's a very large table, and a full table scan takes approximately 4 hours, so we created a global secondary index to improve performance. However, we're not sure whether this library supports reading from an index, or perhaps we're using it incorrectly. Currently we use the following code, which always does a full scan. I tried adding the commented-out line to use the index, but that didn't work, and I couldn't find any examples of this.

    # Parentheses instead of backslash continuations, since a comment line
    # would otherwise break the chain.
    dynamo_df = (
        spark.read.schema(table_schema)
        .option("tableName", "table")
        # .option("indexName", "x-y-global-secondary-index")
        .option("region", region)
        .option("throughput", 2500)
        .format("dynamodb")
        .load()
    )

    filtered_df = dynamo_df.filter((dynamo_df.x == x) & (dynamo_df.y > y))
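
For context, what we're ultimately after is the equivalent of a DynamoDB Query against the index, which avoids touching the base table at all. Here is a minimal boto3 sketch of that access pattern, assuming "x" is the index's partition key and "y" its sort key (boto3 is shown only to illustrate the intent, not as a proposed workaround):

    import boto3
    from boto3.dynamodb.conditions import Key

    # Assumption: "x" is the GSI's hash key and "y" its range key.
    dynamodb = boto3.resource("dynamodb", region_name=region)
    table = dynamodb.Table("table")

    # Query the index directly instead of scanning the whole table.
    response = table.query(
        IndexName="x-y-global-secondary-index",
        KeyConditionExpression=Key("x").eq(x) & Key("y").gt(y),
    )
    items = response["Items"]

Is there a supported way to get the same index-targeted read through this connector?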

Appreciate the help!
