
Python API Issues #1581

Closed
ivanthewebber opened this issue Sep 5, 2024 · 8 comments

Comments

@ivanthewebber

Expected behavior

Instructions in the docs should succeed without errors on the latest versions. I have been unable to initialize Sedona-Spark via the Python API; I think the docs need to be updated, or there are errors in the most recent versions.

I installed Sedona and PySpark (with Hadoop) from PyPI and have a Java 11 JDK and Scala 2.12 on my machine. I also tried installing Spark from a direct download, and I have tried manually downloading the jars as well.

I want to initialize the session like follows:

import pyspark
import pyspark.version
import pyspark.sql
import sedona
import sedona.spark

def get_sedona_spark(spark_version=pyspark.version.__version__, scala_version='2.12', sedona_version=sedona.version, geotools_version='28.2') -> pyspark.sql.SparkSession:
    """
    Get the Sedona Spark context.

    We use the newest version, so Sedona's methods will expect lon-lat order.
    """

    if spark_version.count('.') > 1:
        spark_version = '.'.join(spark_version.split('.')[:2])

    builder: pyspark.sql.SparkSession.Builder = sedona.spark.SedonaContext.builder()
    spark = builder\
        .config(
            'spark.jars.packages',
            f'org.apache.sedona:sedona-spark-{spark_version}_{scala_version}:{sedona_version},' +
            f'org.datasyslab:geotools-wrapper:{sedona_version}-{geotools_version}'
        ).config(
            'spark.jars.repositories',
            'https://artifacts.unidata.ucar.edu/repository/unidata-all'
        ).getOrCreate()

    return sedona.spark.SedonaContext.create(spark)

if __name__ == "__main__":
    get_sedona_spark()

Ideally, like the quickstarts for Spark/Flink, there would be simple steps to run a simple word-count-style program.
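For reference, the Maven coordinate passed to spark.jars.packages is assembled from the Spark major.minor version, the Scala version, and the Sedona version. A stand-alone sketch of that string logic (the helper name sedona_packages is hypothetical; it mirrors the snippet above):

```python
def sedona_packages(spark_version: str, scala_version: str,
                    sedona_version: str, geotools_version: str) -> str:
    # Keep only the major.minor part of the Spark version ("3.5.2" -> "3.5").
    spark_mm = ".".join(spark_version.split(".")[:2])
    return (
        f"org.apache.sedona:sedona-spark-{spark_mm}_{scala_version}:{sedona_version},"
        f"org.datasyslab:geotools-wrapper:{sedona_version}-{geotools_version}"
    )

print(sedona_packages("3.5.2", "2.12", "1.6.1", "28.2"))
# -> org.apache.sedona:sedona-spark-3.5_2.12:1.6.1,org.datasyslab:geotools-wrapper:1.6.1-28.2
```

If this string is malformed (for example, if sedona_version renders as a module object rather than a version string), the dependency download fails before the session starts.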

Actual behavior

Various errors. I've tried many variations and recommended fixes from Stack Overflow but haven't made much progress.

I get errors like the following: PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

Steps to reproduce the problem

# create new env
python -m venv env
./env/Scripts/Activate.ps1
python -m pip install --upgrade pip
pip install --upgrade apache-sedona[spark] pyspark

# I tried setting the Spark home to a few different options, but
# if I'm reading the docs right, when installing from PyPI I shouldn't need to
# $env:SPARK_HOME = "venv/.../pyspark"

# attempt to initialize session (see above)
python test.py

Settings

Sedona version = 1.6.1, 1.5

Apache Spark version = 3.5.2, 3.5.1, 3.4

API type = Python

Scala version = 2.12

JRE version = 11

Python version = 3.12

Environment = Local


github-actions bot commented Sep 5, 2024

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

@Kontinuation
Member

I cannot reproduce this problem using the script and runtime settings you described. Can you verify that PySpark works without Sedona by running a simple non-spatial test job, for instance spark.range(0, 10).count()?

PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number. may indicate that Spark itself is not working, which is a prerequisite for sedona-spark to function properly.

@ivanthewebber
Author

I am attaching the full output; sorry for not including that initially. I confirmed that (Py)Spark is working fine.
test2_logs.txt

I noticed that when I run in a notebook, I get a different error: TypeError: 'JavaPackage' object is not callable. Reading older issues, it sounds like this is usually related to missing jars or notebook configuration problems.

As a side note, have you thought about adding Ray with GeoPandas to your benchmarks? Ray is faster than Spark on several benchmarks (generally skewed toward ML use cases), so I think it's something you'd want to keep an eye on. The feature parity wouldn't be exact, but it's interesting.

@jiayuasu
Member

The attached logs show the key issue is Exception in thread "main" java.lang.RuntimeException: [download failed: com.google.j2objc#j2objc-annotations;1.1!j2objc-annotations.jar]

Does your laptop have internet access? If not, it will need it. If it does and you still have this issue, please put this jar (https://mvnrepository.com/artifact/com.google.j2objc/j2objc-annotations/1.1) in SPARK_HOME/jars
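Maven Central lays artifacts out by group/artifact/version, so the direct download URL for the missing jar can be built from its coordinates. A small sketch (the helper name maven_central_url is hypothetical; repo1.maven.org is the standard Maven Central host):

```python
def maven_central_url(group: str, artifact: str, version: str) -> str:
    # Maven Central layout: dots in the groupId become path separators.
    return ("https://repo1.maven.org/maven2/"
            f"{group.replace('.', '/')}/{artifact}/{version}/"
            f"{artifact}-{version}.jar")

url = maven_central_url("com.google.j2objc", "j2objc-annotations", "1.1")
print(url)
# -> https://repo1.maven.org/maven2/com/google/j2objc/j2objc-annotations/1.1/j2objc-annotations-1.1.jar
```

The jar can then be fetched from that URL (e.g. with a browser or curl) and dropped into SPARK_HOME/jars.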

Since you are using Windows, why not just use Sedona's Docker image to get started?

@ivanthewebber
Author

Yes, I have internet access, but a firewall could be interfering.

Thanks, that's a good idea. If you already have a link to the Dockerfiles, please share it; I will need to build them myself for work.

@ivanthewebber
Author

With the jar downloaded I am making progress, but it seems Sedona needs winutils.exe just to create the context, unlike PySpark, which only needs it for writing. It would be great if Sedona, like PySpark, didn't need winutils.exe to run simple logic, so developers could do some unit testing or data inspection.

Also, I am a little confused: according to the PySpark installation docs, installing from PyPI is supposed to include Hadoop (I do see a few jars with Hadoop in the name), and I thought winutils.exe was just a small piece of the Hadoop distribution. So I wonder whether the Hadoop that is supposed to come with the PyPI pyspark is missing, unfindable, or still missing classes.
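For completeness, the usual Windows workaround is to point HADOOP_HOME at a directory whose bin folder contains winutils.exe before the session is created. A minimal sketch, with a hypothetical local path:

```python
import os

# Hypothetical location; place winutils.exe in C:\hadoop\bin first.
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")
hadoop_bin = os.path.join(os.environ["HADOOP_HOME"], "bin")
if hadoop_bin not in os.environ.get("PATH", ""):
    # Prepend so Hadoop's native tools are found before anything else.
    os.environ["PATH"] = hadoop_bin + os.pathsep + os.environ.get("PATH", "")
```

These lines must run before the first SparkSession is created, since Hadoop resolves winutils.exe at JVM startup.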

You'll probably determine this is ready to close, but I would appreciate any answers to my questions.

@jiayuasu
Member

@ivanthewebber Sedona is an open-source project. The source code of its Dockerfiles is here: https://github.com/apache/sedona/tree/master/docker

@jiayuasu
Member

Instruction about how to build the dockerfile is here: https://sedona.apache.org/latest/setup/docker/#how-to-build
