
Error when loading a small amount of data on a large cluster #95

Open
kubicaj opened this issue Mar 1, 2021 · 0 comments

kubicaj commented Mar 1, 2021

Hi,

I have a DynamoDB table which contains only a small amount of data (25 rows).
I use the following spark-dynamodb library:

        <dependency>
            <groupId>com.audienceproject</groupId>
            <artifactId>spark-dynamodb_2.11</artifactId>
            <version>1.0.4</version>
        </dependency>

I use the following very simple code to load the DynamoDB table:

# Read the whole DynamoDB table into a Spark DataFrame and print the first rows
dynamo_db_df = self.spark_session.read.option("tableName", "my-sample-table").format("dynamodb").load()
dynamo_db_df.show()

I run the job in two environments (a rough sketch of the two job definitions follows the list):

  1. Small cluster: AWS Glue job where worker_type = Standard (50 GB disk and 2 executors)
  2. Large cluster: AWS Glue job where worker_type = G.1X, num_workers = 32 (32 executors, each with 4 vCPUs and 16 GB of memory)
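
For reference, this is roughly how the two job definitions could look with boto3. The job names, IAM role, script location, jar path and Glue version below are placeholders, not the exact values I use:

import boto3

glue = boto3.client("glue")

# Small cluster: Standard workers (each with a 50 GB disk and 2 executors)
glue.create_job(
    Name="load-dynamodb-small",  # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",  # placeholder role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/load_dynamodb.py"},
    DefaultArguments={"--extra-jars": "s3://my-bucket/jars/spark-dynamodb_2.11-1.0.4.jar"},
    WorkerType="Standard",
    NumberOfWorkers=2,
    GlueVersion="2.0",  # placeholder version
)

# Large cluster: 32 G.1X workers (each with 4 vCPUs and 16 GB of memory)
glue.create_job(
    Name="load-dynamodb-large",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/load_dynamodb.py"},
    DefaultArguments={"--extra-jars": "s3://my-bucket/jars/spark-dynamodb_2.11-1.0.4.jar"},
    WorkerType="G.1X",
    NumberOfWorkers=32,
    GlueVersion="2.0",
)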

When I run it on the small cluster, the result looks fine:

+---+--------+-------------+-----+---------+--------------------+--------------------+
| Id|OrderNum|        Title|Price|     ISBN|        BookMetadata|             Authors|
+---+--------+-------------+-----+---------+--------------------+--------------------+
| 22|      22|Book 22 Title|  644|900917-22|[true, [Editor-22...|[[M, [USA, Author...|
| 18|      18|Book 18 Title|   97|377399-18|[true, [Editor-18...|[[M, [USA, Author...|
| 16|      16|Book 16 Title|  383|224276-16|[true, [Editor-16...|[[F, [USA, Author...|
|  2|       2| Book 2 Title|   73| 371411-2|[true, [Editor-2,...|[[F, [USA, Author...|
| 13|      13|Book 13 Title|  431|911648-13|[true, [Editor-13...|[[F, [USA, Author...|
|  8|       8| Book 8 Title|  521| 770005-8|[true, [Editor-8,...|[[F, [USA, Author...|
|  9|       9| Book 9 Title|  838| 915353-9|[true, [Editor-9,...|[[F, [USA, Author...|
|  1|       1| Book 1 Title|  782| 637081-1|[true, [Editor-1,...|[[M, [USA, Author...|
|  6|       6| Book 6 Title|  604|  33246-6|[true, [Editor-6,...|[[F, [USA, Author...|
| 24|      24|Book 24 Title|  826|370799-24|[true, [Editor-24...|[[M, [USA, Author...|
|  5|       5| Book 5 Title|  726| 503009-5|[true, [Editor-5,...|[[M, [USA, Author...|
|  4|       4| Book 4 Title|  172| 229720-4|[true, [Editor-4,...|[[M, [USA, Author...|
| 23|      23|Book 23 Title|  574|876365-23|[true, [Editor-23...|[[M, [USA, Author...|
| 19|      19|Book 19 Title|  694|574785-19|[true, [Editor-19...|[[M, [USA, Author...|
|  7|       7| Book 7 Title|  418| 732692-7|[true, [Editor-7,...|[[F, [USA, Author...|
| 11|      11|Book 11 Title|  360|582662-11|[true, [Editor-11...|[[M, [USA, Author...|
|  3|       3| Book 3 Title|  401| 722245-3|[true, [Editor-3,...|[[M, [USA, Author...|
| 20|      20|Book 20 Title|  185|464982-20|[true, [Editor-20...|[[F, [USA, Author...|
| 21|      21|Book 21 Title|  271|685657-21|[true, [Editor-21...|[[F, [USA, Author...|
| 25|      25|Book 25 Title|  688|521779-25|[true, [Editor-25...|[[M, [USA, Author...|
+---+--------+-------------+-----+---------+--------------------+--------------------+
only showing top 20 rows

When I run it on the large cluster, the row count is correct but the data is empty:

++
||
++
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
++
only showing top 20 rows

But when I load a table which has several GBs and millions of rows, everything looks fine.
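
In case it helps with reproducing, below is a sketch of the same read with the number of read partitions pinned to 1 via the readPartitions option listed in the README. Whether that option is the right knob here is only my assumption; I have not verified that it changes the behaviour on the large cluster:

# Sketch: same read, but forcing the scan into a single read partition.
# "readPartitions" is taken from the spark-dynamodb README; that it affects
# the empty-column behaviour on the large cluster is an unverified assumption.
dynamo_db_df = (
    self.spark_session.read
    .format("dynamodb")
    .option("tableName", "my-sample-table")
    .option("readPartitions", "1")
    .load()
)
dynamo_db_df.show()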

Could you please check it?
