Thanks for your work, but I wasn't able to run this plugin successfully. The query finished after the first stage and returned an empty DataFrame without any error.
Code to reproduce:
I installed Spark in a Docker image based on python:3.8-bullseye with openjdk_version="17", like this:
ARG scala_version="2.12"

ENV APACHE_SPARK_VERSION="3.3.0" \
    HADOOP_VERSION="3" \
    SPARK_HOME=/usr/local/spark \
    SPARK_OPTS="--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info" \
    PATH="${PATH}:${SPARK_HOME}/bin"

WORKDIR /tmp

RUN wget -q "https://archive.apache.org/dist/spark/spark-${APACHE_SPARK_VERSION}/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" && \
    tar xzf "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" -C /usr/local --owner root --group root --no-same-owner && \
    rm "spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" && \
    ln -s "/usr/local/spark-${APACHE_SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" $SPARK_HOME

WORKDIR /usr/local

# to read s3a
RUN wget -P "${SPARK_HOME}/jars" https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar && \
    wget -P "${SPARK_HOME}/jars" https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar

RUN wget -P "${SPARK_HOME}/jars" https://github.com/IBM/spark-s3-shuffle/releases/download/v0.5/spark-s3-shuffle_${scala_version}-${APACHE_SPARK_VERSION}_0.5.jar

# Add a link in the before_notebook hook in order to automatically source PYTHONPATH
RUN mkdir -p /usr/local/bin/before-notebook.d && \
    ln -s "${SPARK_HOME}/sbin/spark-config.sh" /usr/local/bin/before-notebook.d/spark-config.sh

# Fix Spark installation for Java 11 and the Apache Arrow library
# see: https://github.com/apache/spark/pull/27356, https://spark.apache.org/docs/latest/#downloading
RUN cp -p "${SPARK_HOME}/conf/spark-defaults.conf.template" "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo $'\n\
spark.driver.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true\n\
spark.executor.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true\n\
spark.driver.memory 200g\n\
spark.kryoserializer.buffer.max 2047\n\
spark.sql.shuffle.partitions 300\n\
spark.sql.execution.arrow.pyspark.fallback.enabled true\n\
spark.driver.maxResultSize 120g' >> "${SPARK_HOME}/conf/spark-defaults.conf"

RUN pip install pyspark
I then configured Spark in Python and tried to execute a heavy query.
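The actual Python configuration block didn't survive here; a minimal sketch of what enabling spark-s3-shuffle from PySpark might look like follows. The bucket name, credentials, and endpoint are placeholders, and the shuffle-manager class name is taken from the plugin's README and should be verified against the release actually used:

from pyspark.sql import SparkSession

# Hypothetical reconstruction, not the reporter's original code.
spark = (
    SparkSession.builder
    .appName("s3-shuffle-repro")
    # Route shuffle data through the plugin instead of local disk
    # (class name per the spark-s3-shuffle README; verify for v0.5).
    .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.S3ShuffleManager")
    # Placeholder bucket; this key is referenced later in this issue.
    .config("spark.shuffle.s3.rootDir", "s3a://my-bucket/shuffle")
    # S3A credentials and endpoint (placeholders).
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.endpoint", "<S3_ENDPOINT>")
    .getOrCreate()
)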
It finished within a couple of minutes with an empty result, despite the fact that without spark-s3-shuffle the same query runs through many stages in about an hour and returns a massive DataFrame. The spark.shuffle.s3.rootDir was filled with a couple of GBs of data, but I would have expected much more.
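As a sanity check (a sketch, not from the original report), the effective shuffle settings can be read back from the running session to confirm the plugin was actually picked up:

# Read back the resolved shuffle settings from the live session
# ("spark" is the SparkSession above); an unset spark.shuffle.manager
# would mean the plugin was never applied.
for key in ("spark.shuffle.manager", "spark.shuffle.s3.rootDir"):
    print(key, "=", spark.conf.get(key, "<not set>"))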
Do you have any thoughts on what I can do to make it work?
Thanks in advance!