How to access data in S3 from a Flyte spark task running locally? #3229
xshen8888 started this conversation in Deployment Tips & Tricks
When executing a Flyte workflow locally on a developer's machine, a Spark task that reads data from AWS S3 needs extra configuration that is not documented by the Flyte community.
The Spark task code looks like this:

```python
import flytekit

# Use the Spark session that Flyte injects into the task's execution context.
spark = flytekit.current_context().spark_session
spark_df = spark.read.parquet("s3a://bucket/key_to_parquet_data")
```
The solution is to:

a) Add the following Spark properties to the Flyte task's spark_conf section (only needed when you run the Spark task locally), e.g.:

```python
@task(
    task_config=Spark(
        spark_conf={
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:?.?.?",
            "spark.hadoop.fs.s3a.access.key": "xxx",
            "spark.hadoop.fs.s3a.secret.key": "yyy",
            "spark.hadoop.fs.s3a.session.token": "zzz",
        }
    )
)
```

Note: the hadoop-aws version must match what your PySpark version expects. A complete, runnable sketch follows below.
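For reference, here is a minimal end-to-end sketch of option (a). It assumes flytekit and flytekitplugins-spark are installed; the task and workflow names, the S3 path, and the credentials are placeholders, and the hadoop-aws version shown is simply the one used in option (b) below.

```python
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
            "spark.hadoop.fs.s3a.access.key": "xxx",
            "spark.hadoop.fs.s3a.secret.key": "yyy",
            "spark.hadoop.fs.s3a.session.token": "zzz",
        }
    )
)
def count_rows() -> int:
    # Read a Parquet dataset from S3 and return its row count.
    spark = flytekit.current_context().spark_session
    spark_df = spark.read.parquet("s3a://bucket/key_to_parquet_data")
    return spark_df.count()


@workflow
def my_wf() -> int:
    return count_rows()


if __name__ == "__main__":
    # Calling the workflow as a plain Python function runs it locally.
    print(my_wf())
```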
b) Alternatively, to avoid adding the properties above to every Spark task, create a conf folder in your Flyte venv's PySpark installation and add a file spark-defaults.conf there with the following content (a sketch for locating that folder follows this block), e.g.:

```
spark.jars.packages               org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.access.key    xxx
spark.hadoop.fs.s3a.secret.key    yyy
spark.hadoop.fs.s3a.session.token zzz
```
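A minimal sketch for locating that folder, assuming PySpark is installed in the active virtualenv:

```python
import os
import pyspark

# A pip-installed PySpark treats its package directory as SPARK_HOME,
# so spark-defaults.conf goes into a conf/ folder under that directory.
conf_dir = os.path.join(os.path.dirname(pyspark.__file__), "conf")
print(conf_dir)  # create this folder if missing and place spark-defaults.conf here
```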
c) Create an OS environment variable for your terminal session:

```bash
export SPARK_LOCAL_IP="127.0.0.1"
```

Alternatively, add the variable to your bash or zsh profile, or set it from Python as sketched below.
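A sketch of that Python alternative: setting the variable with os.environ should work as long as it happens before the Spark session (and its driver JVM) is created, though the shell export remains the simplest option.

```python
import os

# Assumption: this runs before the Spark session is created, e.g. at the very
# top of the workflow module, so the Spark driver inherits the variable.
os.environ.setdefault("SPARK_LOCAL_IP", "127.0.0.1")
```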