Guide/readme/example for using with AWS Glue ETL job #82
First off: I have never used AWS at all, so I have no experience with any of the tools you mention. In addition, my Python and Spark knowledge is negligible; I simply use different tools. Now, this library was written in Java and explicitly hooks only into specific Hadoop APIs, which are also used by Spark. See: https://github.com/nielsbasjes/splittablegzip/blob/main/README-Spark.md So to use this you will need:
This is all I have for you. If you have figured it out, I'm willing to add your insights to the documentation.
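For reference, a minimal PySpark sketch of the configuration the Spark README describes; the input path, split size, and session setup are placeholder assumptions rather than values from this thread:

```python
from pyspark.sql import SparkSession

# Sketch only: the paths and sizes below are placeholders, and in Glue the jar
# itself has to reach the cluster some other way (e.g. the --extra-jars
# job argument discussed later in this thread).
spark = (
    SparkSession.builder
    # Register the splittable gzip codec with the underlying Hadoop layer.
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    # Optional: cap partition size so one large .gz file is split across tasks.
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    .getOrCreate()
)

# Placeholder input path.
df = spark.read.csv("s3://my-bucket/input/big-file.csv.gz", header=True)
df.show(5)
```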
Thanks for the response. I think I am good on your bullets 1 and 3 within my scripts (yes, using PySpark). But on item 2, I'm struggling with the following: AWS Glue requires passing an --extra-jars flag and an S3 path to the "jars." I'm developing these scripts in Python using PySpark, and I use Windows, so I'm not familiar with Java, "jars," or even Maven at all. My assumption is that Maven is to Java as "pip" is to Python. I don't think Glue will install from the Maven repo, so I think I need to download the "jar" files to my S3 path and just point to them. I see the Java in your repo but am not sure how to determine what I need to satisfy AWS Glue's --extra-jars option. Does that make sense?
Further, from the AWS docs:
Again, thank you for your help.
So, even further reducing my question, I think all I need is to get some of these files (but which?) onto my S3 and add the extra-jars path. Again, I don't have Maven installed on my Windows machine and have zero Java experience. Do I need them all? https://repo1.maven.org/maven2/nl/basjes/hadoop/splittablegzip/1.3/
Maven is just a tool to manage the build process of a Java-based system. A jar is nothing more than a file with some "ready to run" Java code. The maven.org site is just a site where you can reliably download such jar files. For your project you only need to download this single file (splittablegzip-1.3.jar) and put it on S3.
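For anyone wiring this into Glue from Python, a hedged boto3 sketch of a job definition that points --extra-jars at that jar on S3; the job name, role, script location, and bucket are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# All names and S3 paths below are placeholders for illustration.
glue.create_job(
    Name="split-gzip-demo",
    Role="MyGlueServiceRole",
    GlueVersion="3.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Point Glue at the single jar downloaded from repo1.maven.org.
        "--extra-jars": "s3://my-bucket/jars/splittablegzip-1.3.jar",
    },
)
```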
Thank you. I was making it much harder than necessary. Am testing now and will report back.
Ok, I added these two parameters to my job definition:
I then added this to my script:
In my logs, I see the job start and report the passed args, and it reports my show statement from my lazy load. But when it goes to write the resulting large file, the job shuts down with a series of warnings and then a shutdown failure (in reverse order):
I don't think I've changed anything else in my script, which previously ran successfully (but very slowly, due to the large gz files). Now it seems not to run. I suspect:
OK, reporting back that I commented out the changes above and the script runs fine, but with everything loaded on one executor, no parallelization, and slow! So, something about the write statement causes the job to fail, even though I'm using the same write.
I have tried with and without the maxRecordsPerFile option.
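For context, a sketch of the kind of write being described; the output path, partition count, and record cap are illustrative guesses, not the actual script:

```python
# Illustrative only: path and numbers are placeholders.
(df.repartition(32)                        # spread the work across executors
   .write
   .option("maxRecordsPerFile", 1000000)   # the optional cap mentioned above
   .mode("overwrite")
   .parquet("s3://my-bucket/output/"))
```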
Thinking about this some more: https://issues.apache.org/jira/browse/SPARK-29102 Or: AWS Glue has a capability to install Python libraries using pip, the equivalent of Maven, but I don't see a similar capability to kick off a Maven install; the only route is to pass the jar file using --extra-jars as above. However, I am seeing some things where, for example, Glue can be configured for a Delta Lake using spark.conf settings. I don't use spark-submit for an AWS Glue job, so I can't pass the --packages arg. Hoping you see something in this madness!
This looks promising. Again, I see I need extra jars with a pointer to the jar file on S3; no problem there. But in the config statement, I can pass what your guide says to pass to --packages. Again, though, I don't see how the two resolve.
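One hedged way the two might meet without spark-submit is Glue's --conf job parameter: --extra-jars supplies the jar while --conf carries the codec setting the guide describes. AWS documents --conf as used internally, so this is an experiment rather than a guaranteed path, and the paths are placeholders:

```python
# Hedged sketch: job arguments combining the jar with the codec setting.
# AWS marks --conf as internal, so treat this as an experiment.
default_arguments = {
    "--extra-jars": "s3://my-bucket/jars/splittablegzip-1.3.jar",
    "--conf": ("spark.hadoop.io.compression.codecs="
               "nl.basjes.hadoop.io.compress.SplittableGzipCodec"),
}
```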
OK, I think I'm getting very close, but the job is still failing on my read statement with:
On startup, my Spark session seems to initialize properly, including recognition of my jar files directory:
The S3 pointer on the --extra-jars flag is where I have uploaded splittablegzip-1.3.jar. I then attempt to set the config but get an error:
And when I run the read statement, I get the error above:
So it's got to be something wrong with my second conf.set statement above...
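If the failing call is a runtime spark.conf.set(...), one workaround sometimes used in PySpark is to set the property on the Hadoop configuration object directly; _jsc is an internal handle, so this is a hedged sketch rather than a supported API, and the read path is a placeholder:

```python
# Hedged workaround: some settings cannot be changed with spark.conf.set()
# once the session exists, but the Hadoop configuration can still be updated.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "io.compression.codecs",
    "nl.basjes.hadoop.io.compress.SplittableGzipCodec",
)

# Placeholder input path.
df = spark.read.csv("s3://my-bucket/input/big-file.csv.gz", header=True)
```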
Related post asking for help on Stack Overflow.
I wonder if you could make suggestions on how to use this in an AWS Glue job. My method does not involve using spark-submit, but rather creating job definitions and running jobs using boto3 tools.
When I try to use this in my script, I get:
pyspark.sql.utils.IllegalArgumentException: Compression codec nl.basjes.hadoop.io.compress.SplittableGzipCodec not found.
I have tried passing --conf nl.basjes.hadoop.io.compress.SplittableGzipCodec, --packages nl.basjes.hadoop.io.compress.SplittableGzipCodec, and other methods as args to the job, to no avail. I think I must need to put a copy of the codec on S3 and point to it with --extra-files or another arg?
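For reference, a hedged sketch of a read statement that names the codec directly on the reader; whether reader-level options reach the Hadoop layer depends on the Spark version, the path is a placeholder, and the jar still has to be on the cluster (e.g. via --extra-jars) or the "codec not found" error above will persist:

```python
# Sketch only: the codec class is the one from the error message above;
# the input path is a placeholder.
df = (spark.read
      .option("io.compression.codecs",
              "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
      .csv("s3://my-bucket/input/big-file.csv.gz", header=True))
```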