How to access data in S3 from a Flyte spark task running locally? #3229
xshen8888 started this conversation in Deployment Tips & Tricks
When executing a Flyte workflow locally on a developer's machine, a Spark task that reads data from AWS S3 needs extra configuration that is not documented by the Flyte community.
The Spark task code looks like this:

```python
import flytekit

# Use the Spark session that Flyte injects into the task's execution context.
spark = flytekit.current_context().spark_session
spark_df = spark.read.parquet("s3a://bucket/key_to_parquet_data")
```
The solution is to:

a) Add the following Spark properties to the Flyte task's spark_conf section (only needed when you run the Spark task locally), e.g.:

```python
@task(
    task_config=Spark(
        spark_conf={
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:?.?.?",
            "spark.hadoop.fs.s3a.access.key": "xxx",
            "spark.hadoop.fs.s3a.secret.key": "yyy",
            "spark.hadoop.fs.s3a.session.token": "zzz",
        }
    )
)
```

Note: the hadoop-aws version must match what your PySpark version expects. A complete, runnable sketch follows below.
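For reference, here is a minimal end-to-end sketch of option (a). It assumes flytekit and flytekitplugins-spark are installed; the task and workflow names, the S3 path, and the credentials are placeholders, and the hadoop-aws version shown is simply the one used in option (b) below.

```python
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(
        spark_conf={
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2",
            "spark.hadoop.fs.s3a.access.key": "xxx",
            "spark.hadoop.fs.s3a.secret.key": "yyy",
            "spark.hadoop.fs.s3a.session.token": "zzz",
        }
    )
)
def count_rows() -> int:
    # Read a Parquet dataset from S3 and return its row count.
    spark = flytekit.current_context().spark_session
    spark_df = spark.read.parquet("s3a://bucket/key_to_parquet_data")
    return spark_df.count()


@workflow
def my_wf() -> int:
    return count_rows()


if __name__ == "__main__":
    # Calling the workflow as a plain Python function runs it locally.
    print(my_wf())
```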
b) Alternatively, to avoid adding the properties above to every Spark task, create a conf folder in your Flyte venv's PySpark installation and add a file spark-defaults.conf there with the following content (a sketch for locating that folder follows this block), e.g.:

```
spark.jars.packages               org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.access.key    xxx
spark.hadoop.fs.s3a.secret.key    yyy
spark.hadoop.fs.s3a.session.token zzz
```
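A minimal sketch for locating that folder, assuming PySpark is installed in the active virtualenv:

```python
import os
import pyspark

# A pip-installed PySpark treats its package directory as SPARK_HOME,
# so spark-defaults.conf goes into a conf/ folder under that directory.
conf_dir = os.path.join(os.path.dirname(pyspark.__file__), "conf")
print(conf_dir)  # create this folder if missing and place spark-defaults.conf here
```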
c) Create an OS environment variable for your terminal session:

```bash
export SPARK_LOCAL_IP="127.0.0.1"
```

Alternatively, add the variable to your bash or zsh profile, or set it from Python as sketched below.
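A sketch of that Python alternative: setting the variable with os.environ should work as long as it happens before the Spark session (and its driver JVM) is created, though the shell export remains the simplest option.

```python
import os

# Assumption: this runs before the Spark session is created, e.g. at the very
# top of the workflow module, so the Spark driver inherits the variable.
os.environ.setdefault("SPARK_LOCAL_IP", "127.0.0.1")
```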