Python API Issues #1581
Comments
Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.
I cannot reproduce this problem using the script and runtime settings you described. Can you verify that PySpark works without Sedona by running a simple non-spatial test job?
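A minimal sketch of such a non-spatial test job follows. It is not the maintainer's original snippet, which is not preserved in this thread; the function name `run_smoke_test`, the app name, and the choice of summing a small range are illustrative assumptions. It uses only the standard `SparkSession` API and no Sedona classes, so a failure here would point at the Spark or Java setup rather than at Sedona.

```python
# Hypothetical smoke test: names and the workload are illustrative assumptions.
EXPECTED_SUM = sum(range(100))  # what the Spark job below should return


def run_smoke_test(app_name="pyspark-smoke-test"):
    """Start a local Spark session, sum the integers 0..99, and return the sum."""
    # Imported lazily so this module can be loaded even where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")  # single local worker thread, no cluster needed
        .appName(app_name)
        .getOrCreate()
    )
    try:
        # spark.range(100) produces one column named "id" holding 0..99.
        row = spark.range(100).groupBy().sum("id").first()
        return row[0]
    finally:
        # Always stop the session so the JVM gateway process is released.
        spark.stop()
```

Calling `run_smoke_test()` and comparing the result with `EXPECTED_SUM` is one way to confirm that the JVM gateway starts and a trivial job completes.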
I am attaching the full output; sorry for not including it initially. I confirmed that (Py)Spark itself works fine. I noticed that when I run in a notebook, I get a different error:
As a side note, have you considered adding Ray with GeoPandas to your benchmarks? Ray is faster than Spark on several benchmarks (generally skewed towards ML use cases), so I think it's worth keeping an eye on. The feature parity wouldn't be exact, but it would be an interesting comparison.
The attached logs show the key issue is:
Does your laptop have internet access? If not, it needs internet access. If it does and you still have this issue, please place this jar (https://mvnrepository.com/artifact/com.google.j2objc/j2objc-annotations/1.1) in SPARK_HOME/jars.
Since you are using Windows, why not just use Sedona's Docker image to get started?
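To check whether the jar actually ended up where Spark looks for it, a small standard-library sketch like the following could help. The helper name `find_jar` is hypothetical, and the file name `j2objc-annotations-1.1.jar` is an assumption based on the usual Maven `artifactId-version.jar` naming. For a pip-installed PySpark, SPARK_HOME is typically the `pyspark` package directory itself, which is also an assumption worth verifying on your machine.

```python
from pathlib import Path

# Assumed file name, following the usual Maven "artifactId-version.jar" pattern.
JAR_NAME = "j2objc-annotations-1.1.jar"


def find_jar(spark_home, jar_name=JAR_NAME):
    """Return the path to jar_name inside <spark_home>/jars, or None if absent."""
    candidate = Path(spark_home) / "jars" / jar_name
    return candidate if candidate.is_file() else None
```

For example, `find_jar(os.environ["SPARK_HOME"])` (after `import os`) returns `None` when the jar has not been copied into place.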
Yes, I have internet access, but a firewall could be interfering. Thanks, that's a good idea. If you already have a link to the Dockerfiles, please share it; I will need to build them myself for work.
With the jar downloaded I am making progress, but it seems Sedona needs
Also, I am a little confused: according to the PySpark installation docs, installing from Python is supposed to include Hadoop (I see a few jars with it in the name), and I thought winutils.exe was just a subset of the Hadoop jars. So I wonder whether the Hadoop that is supposed to come with the PyPI pyspark package is missing, cannot be found, or lacks some classes.
You'll probably decide this is ready to close, but I would appreciate answers to my questions.
@ivanthewebber Sedona is an open-source project. The source for its Dockerfiles is here: https://github.com/apache/sedona/tree/master/docker
Instructions for building the Docker image are here: https://sedona.apache.org/latest/setup/docker/#how-to-build
Expected behavior
Following the instructions in the docs with the latest versions should succeed without errors. I have been unable to initialize Sedona-Spark for the Python API. I think either the docs need to be updated or there are errors in the most recent versions.
I installed Sedona and PySpark (with Hadoop) from PyPI and have a Java 11 JDK and Scala 2.12 on my computer. I also tried installing Spark from a direct download, and I have tried manually downloading the jars as well.
I want to initialize the session as follows:
Ideally, as in the quickstarts for Spark and Flink, there would be simple steps to run a basic word count program.
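For orientation, here is a hedged sketch of what a Sedona session initialization could look like for the versions listed in this report. It is not the snippet from the original report, which is not preserved here. The helper names `sedona_packages` and `create_sedona_session`, the Maven coordinates, and the GeoTools wrapper version `1.6.1-28.2` are assumptions modeled on the pattern shown in the Sedona documentation; check them against the release you actually install.

```python
def sedona_packages(spark_version="3.5", scala_version="2.12",
                    sedona_version="1.6.1", geotools_version="1.6.1-28.2"):
    """Build a spark.jars.packages value (coordinates are assumptions)."""
    return ",".join([
        f"org.apache.sedona:sedona-spark-{spark_version}_{scala_version}:{sedona_version}",
        f"org.datasyslab:geotools-wrapper:{geotools_version}",
    ])


def create_sedona_session():
    """Create a local Sedona-enabled Spark session (needs apache-sedona and pyspark)."""
    # Imported lazily so the helper above works without Sedona installed.
    from sedona.spark import SedonaContext

    config = (
        SedonaContext.builder()
        .master("local[*]")
        # Spark resolves these coordinates at startup, which needs network access.
        .config("spark.jars.packages", sedona_packages())
        .getOrCreate()
    )
    # Registers Sedona's spatial types and SQL functions on the session.
    return SedonaContext.create(config)
```

Because `spark.jars.packages` triggers a dependency download when the session starts, a blocked network or firewall can make this step fail before any Python code runs against Spark.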
Actual behavior
Various errors. I've tried many variations and recommended fixes from Stack Overflow but haven't made much progress.
I get errors like the following:
PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
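This error generally means the JVM process that PySpark launches exited before it could report back, which can have several causes (a missing or misconfigured Java installation is one common one; a failed dependency download at startup is another). The following standard-library sketch collects a few environment facts that may help narrow it down. The function name `java_launch_report` and the choice of fields are illustrative assumptions, not part of PySpark or Sedona.

```python
import os
import shutil


def java_launch_report(environ=None):
    """Collect environment facts relevant to launching the JVM from PySpark."""
    env = os.environ if environ is None else environ
    java_home = env.get("JAVA_HOME")
    return {
        "JAVA_HOME": java_home,
        # True only if JAVA_HOME is set and points at an existing directory.
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        # Full path of the java executable found on PATH, or None.
        "java_on_path": shutil.which("java", path=env.get("PATH")),
        "SPARK_HOME": env.get("SPARK_HOME"),
    }
```

Printing `java_launch_report()` before creating the session shows whether Java is discoverable at all; it does not diagnose dependency resolution problems, which show up in the Spark launcher output instead.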
Steps to reproduce the problem
Settings
Sedona version = 1.6.1, 1.5
Apache Spark version = 3.5.2, 3.5.1, 3.4
API type = Python
Scala version = 2.12
JRE version = 1.11
Python version = 3.12
Environment = Local