
Python API Issues #1581

Closed
ivanthewebber opened this issue Sep 5, 2024 · 8 comments

Comments

@ivanthewebber

Expected behavior

Instructions in the docs should succeed without errors on the latest versions. I have been unable to initialize Sedona-Spark via the Python API; I think the docs need to be updated, or there are errors in the most recent versions.

I installed Sedona and PySpark (with Hadoop) from PyPI and have a Java 11 JDK and Scala 2.12 on my machine. I also tried installing Spark from a direct download, and I have tried manually downloading the jars as well.

I want to initialize the session like follows:

import pyspark
import pyspark.version
import pyspark.sql
import sedona
import sedona.spark

def get_sedona_spark(spark_version=pyspark.version.__version__, scala_version='2.12', sedona_version=sedona.version, geotools_version='28.2') -> pyspark.sql.SparkSession:
    """
    Get the Sedona Spark context.

    We use the newest version, so Sedona's methods will expect lon-lat order.
    """

    if spark_version.count('.') > 1:
        spark_version = '.'.join(spark_version.split('.')[:2])

    builder: pyspark.sql.SparkSession.Builder = sedona.spark.SedonaContext.builder()
    spark = builder\
        .config(
            'spark.jars.packages',
            f'org.apache.sedona:sedona-spark-{spark_version}_{scala_version}:{sedona_version},' +
            f'org.datasyslab:geotools-wrapper:{sedona_version}-{geotools_version}'
        ).config(
            'spark.jars.repositories',
            'https://artifacts.unidata.ucar.edu/repository/unidata-all'
        ).getOrCreate()

    return sedona.spark.SedonaContext.create(spark)

if __name__ == "__main__":
    get_sedona_spark()

Ideally, like the quickstarts for Spark/Flink, there would be simple steps to run a simple word-count-style program.
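For reference, the Maven coordinate passed to spark.jars.packages is assembled from the Spark major.minor version, the Scala version, and the Sedona version. A stand-alone sketch of that string logic (the helper name sedona_packages is hypothetical; it mirrors the snippet above):

```python
def sedona_packages(spark_version: str, scala_version: str,
                    sedona_version: str, geotools_version: str) -> str:
    # Keep only the major.minor part of the Spark version ("3.5.2" -> "3.5").
    spark_mm = ".".join(spark_version.split(".")[:2])
    return (
        f"org.apache.sedona:sedona-spark-{spark_mm}_{scala_version}:{sedona_version},"
        f"org.datasyslab:geotools-wrapper:{sedona_version}-{geotools_version}"
    )

print(sedona_packages("3.5.2", "2.12", "1.6.1", "28.2"))
# -> org.apache.sedona:sedona-spark-3.5_2.12:1.6.1,org.datasyslab:geotools-wrapper:1.6.1-28.2
```

If this string is malformed (for example, if sedona_version renders as a module object rather than a version string), the dependency download fails before the session starts.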

Actual behavior

Various errors. I've tried many variations and recommended fixes from Stack Overflow but haven't made much progress.

I get errors like the following: PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

Steps to reproduce the problem

# create new env
python -m venv env
./env/Scripts/Activate.ps1
python -m pip install --upgrade pip
pip install --upgrade apache-sedona[spark] pyspark

# I tried setting the Spark home to a few different options, but
# if I'm reading the docs right, when installing from PyPI I shouldn't need to
# $env:SPARK_HOME = "venv/.../pyspark"

# attempt to initialize session (see above)
python test.py

Settings

Sedona version = 1.6.1, 1.5

Apache Spark version = 3.5.2, 3.5.1, 3.4

API type = Python

Scala version = 2.12

JRE version = 11

Python version = 3.12

Environment = Local


github-actions bot commented Sep 5, 2024

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

@Kontinuation
Member

I cannot reproduce this problem using the script and runtime settings you described. Can you verify that PySpark works without Sedona by running a simple non-spatial test job, for instance spark.range(0, 10).count()?

PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number. may indicate that Spark itself is not working, which is a prerequisite for sedona-spark to function properly.

@ivanthewebber
Author

I am attaching the full output; sorry for not including that initially. I confirmed that (Py)Spark is working fine.
test2_logs.txt

I noticed that when I run in a notebook, I get a different error: TypeError: 'JavaPackage' object is not callable. Reading older issues, it sounds like this is usually related to missing jars or notebook configuration problems.

As a side note, have you thought about adding Ray with GeoPandas to your benchmarks? Ray is faster than Spark on several benchmarks (generally skewed toward ML use cases), so I think it's something you'd want to keep an eye on. The feature parity wouldn't be exact, but it's interesting.

@jiayuasu
Member

The attached logs show the key issue is Exception in thread "main" java.lang.RuntimeException: [download failed: com.google.j2objc#j2objc-annotations;1.1!j2objc-annotations.jar]

Does your laptop have internet access? If not, it will need it. If it does and you still have this issue, please put this jar (https://mvnrepository.com/artifact/com.google.j2objc/j2objc-annotations/1.1) in SPARK_HOME/jars
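Maven Central lays artifacts out by group/artifact/version, so the direct download URL for the missing jar can be built from its coordinates. A small sketch (the helper name maven_central_url is hypothetical; repo1.maven.org is the standard Maven Central host):

```python
def maven_central_url(group: str, artifact: str, version: str) -> str:
    # Maven Central layout: dots in the groupId become path separators.
    return ("https://repo1.maven.org/maven2/"
            f"{group.replace('.', '/')}/{artifact}/{version}/"
            f"{artifact}-{version}.jar")

url = maven_central_url("com.google.j2objc", "j2objc-annotations", "1.1")
print(url)
# -> https://repo1.maven.org/maven2/com/google/j2objc/j2objc-annotations/1.1/j2objc-annotations-1.1.jar
```

The jar can then be fetched from that URL (e.g. with a browser or curl) and dropped into SPARK_HOME/jars.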

Since you are using Windows, why not just use Sedona's Docker image to get started?

@ivanthewebber
Author

Yes, I have internet access, but a firewall could be interfering.

Thanks, that's a good idea. If you already have a link to the Dockerfiles, please share it; I will need to build them myself for work.

@ivanthewebber
Author

With the jar downloaded I am making progress, but it seems Sedona needs winutils.exe just to create the context, unlike PySpark, which only needs it for writing. It would be great if Sedona, like PySpark, didn't need winutils.exe to run simple logic, so developers could do some unit testing or data inspection.

Also, I am a little confused: according to the PySpark installation docs, installing from PyPI is supposed to include Hadoop (I do see a few jars with Hadoop in the name), and I thought winutils.exe was just a small piece of the Hadoop distribution. So I wonder whether the Hadoop that is supposed to come with the PyPI pyspark is missing, unfindable, or still missing classes.
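For completeness, the usual Windows workaround is to point HADOOP_HOME at a directory whose bin folder contains winutils.exe before the session is created. A minimal sketch, with a hypothetical local path:

```python
import os

# Hypothetical location; place winutils.exe in C:\hadoop\bin first.
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")
hadoop_bin = os.path.join(os.environ["HADOOP_HOME"], "bin")
if hadoop_bin not in os.environ.get("PATH", ""):
    # Prepend so Hadoop's native tools are found before anything else.
    os.environ["PATH"] = hadoop_bin + os.pathsep + os.environ.get("PATH", "")
```

These lines must run before the first SparkSession is created, since Hadoop resolves winutils.exe at JVM startup.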

You'll probably determine this is ready to close, but I would appreciate any answers to my questions.

@jiayuasu
Member

@ivanthewebber Sedona is an open-source project. The source code of its Dockerfiles is here: https://github.com/apache/sedona/tree/master/docker

@jiayuasu
Member

Instruction about how to build the dockerfile is here: https://sedona.apache.org/latest/setup/docker/#how-to-build
