
Error when loading a small amount of data on a large cluster #95

Open
kubicaj opened this issue Mar 1, 2021 · 0 comments

kubicaj commented Mar 1, 2021

Hi,

I have a DynamoDB table which contains only a small amount of data (25 rows).
I use the following spark-dynamodb library:

        <dependency>
            <groupId>com.audienceproject</groupId>
            <artifactId>spark-dynamodb_2.11</artifactId>
            <version>1.0.4</version>
        </dependency>

I use the following very simple code to load the DynamoDB table:

# Read the whole DynamoDB table into a Spark DataFrame and print the first rows
dynamo_db_df = self.spark_session.read.option("tableName", "my-sample-table").format("dynamodb").load()
dynamo_db_df.show()

I run the job in two environments (a rough sketch of the two job definitions follows the list):

  1. Small cluster: AWS Glue job where worker_type = Standard (50 GB disk and 2 executors)
  2. Large cluster: AWS Glue job where worker_type = G.1X, num_workers = 32 (32 executors, each with 4 vCPUs and 16 GB of memory)
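
For reference, this is roughly how the two job definitions could look with boto3. The job names, IAM role, script location, jar path and Glue version below are placeholders, not the exact values I use:

import boto3

glue = boto3.client("glue")

# Small cluster: Standard workers (each with a 50 GB disk and 2 executors)
glue.create_job(
    Name="load-dynamodb-small",  # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",  # placeholder role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/load_dynamodb.py"},
    DefaultArguments={"--extra-jars": "s3://my-bucket/jars/spark-dynamodb_2.11-1.0.4.jar"},
    WorkerType="Standard",
    NumberOfWorkers=2,
    GlueVersion="2.0",  # placeholder version
)

# Large cluster: 32 G.1X workers (each with 4 vCPUs and 16 GB of memory)
glue.create_job(
    Name="load-dynamodb-large",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/load_dynamodb.py"},
    DefaultArguments={"--extra-jars": "s3://my-bucket/jars/spark-dynamodb_2.11-1.0.4.jar"},
    WorkerType="G.1X",
    NumberOfWorkers=32,
    GlueVersion="2.0",
)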

When I run it on the small cluster, the result looks fine:

+---+--------+-------------+-----+---------+--------------------+--------------------+
| Id|OrderNum|        Title|Price|     ISBN|        BookMetadata|             Authors|
+---+--------+-------------+-----+---------+--------------------+--------------------+
| 22|      22|Book 22 Title|  644|900917-22|[true, [Editor-22...|[[M, [USA, Author...|
| 18|      18|Book 18 Title|   97|377399-18|[true, [Editor-18...|[[M, [USA, Author...|
| 16|      16|Book 16 Title|  383|224276-16|[true, [Editor-16...|[[F, [USA, Author...|
|  2|       2| Book 2 Title|   73| 371411-2|[true, [Editor-2,...|[[F, [USA, Author...|
| 13|      13|Book 13 Title|  431|911648-13|[true, [Editor-13...|[[F, [USA, Author...|
|  8|       8| Book 8 Title|  521| 770005-8|[true, [Editor-8,...|[[F, [USA, Author...|
|  9|       9| Book 9 Title|  838| 915353-9|[true, [Editor-9,...|[[F, [USA, Author...|
|  1|       1| Book 1 Title|  782| 637081-1|[true, [Editor-1,...|[[M, [USA, Author...|
|  6|       6| Book 6 Title|  604|  33246-6|[true, [Editor-6,...|[[F, [USA, Author...|
| 24|      24|Book 24 Title|  826|370799-24|[true, [Editor-24...|[[M, [USA, Author...|
|  5|       5| Book 5 Title|  726| 503009-5|[true, [Editor-5,...|[[M, [USA, Author...|
|  4|       4| Book 4 Title|  172| 229720-4|[true, [Editor-4,...|[[M, [USA, Author...|
| 23|      23|Book 23 Title|  574|876365-23|[true, [Editor-23...|[[M, [USA, Author...|
| 19|      19|Book 19 Title|  694|574785-19|[true, [Editor-19...|[[M, [USA, Author...|
|  7|       7| Book 7 Title|  418| 732692-7|[true, [Editor-7,...|[[F, [USA, Author...|
| 11|      11|Book 11 Title|  360|582662-11|[true, [Editor-11...|[[M, [USA, Author...|
|  3|       3| Book 3 Title|  401| 722245-3|[true, [Editor-3,...|[[M, [USA, Author...|
| 20|      20|Book 20 Title|  185|464982-20|[true, [Editor-20...|[[F, [USA, Author...|
| 21|      21|Book 21 Title|  271|685657-21|[true, [Editor-21...|[[F, [USA, Author...|
| 25|      25|Book 25 Title|  688|521779-25|[true, [Editor-25...|[[M, [USA, Author...|
+---+--------+-------------+-----+---------+--------------------+--------------------+
only showing top 20 rows

When I run it on the large cluster, the row count is correct but the data is empty:

++
||
++
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
++
only showing top 20 rows

But when I load a table which has several GBs and millions of rows, everything looks fine.
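
In case it helps with reproducing, below is a sketch of the same read with the number of read partitions pinned to 1 via the readPartitions option listed in the README. Whether that option is the right knob here is only my assumption; I have not verified that it changes the behaviour on the large cluster:

# Sketch: same read, but forcing the scan into a single read partition.
# "readPartitions" is taken from the spark-dynamodb README; that it affects
# the empty-column behaviour on the large cluster is an unverified assumption.
dynamo_db_df = (
    self.spark_session.read
    .format("dynamodb")
    .option("tableName", "my-sample-table")
    .option("readPartitions", "1")
    .load()
)
dynamo_db_df.show()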

Could you please check it?
