ShapefileReader with Unity Catalogue on Databricks #1531

Closed
JimShady opened this issue Jul 30, 2024 · 14 comments · Fixed by #1553

Comments

@JimShady
Contributor

Running apache-sedona 1.6.0 on Databricks Runtime 14.3.

Tested the Python ShapefileReader with a Unity Catalogue volume on Databricks and it fails.

path = "/Volumes/prod_sandbox/su_jim/volume/tests/Albania 2-digit postcode areas 2023.cpg"
shapefile = ShapefileReader.readToGeometryRDD(sc, path )

"Cannot access the UC volume from this location"

Geopandas works fine.

import geopandas as gpd
gpd.read_file("/Volumes/prod_sandbox/su_jim/volume/tests/Albania 2-digit postcode areas 2023.cpg")

Thanks.

@furqaankhan
Contributor

furqaankhan commented Jul 30, 2024

Adding the following config when you create your sedona object, and prepending dbfs:/ to the path, should make it work:

sedona = (
    SedonaContext.builder()
    # ... other config ...
    .config("spark.databricks.unityCatalog.volumes.enabled", "true")
    # ... other config ...
    .getOrCreate()
)
           
sc = sedona.sparkContext
path = "dbfs:/Volumes/prod_sandbox/su_jim/volume/tests/Albania 2-digit postcode areas 2023.cpg"
shapefile = ShapefileReader.readToGeometryRDD(sc, path)

@JimShady
Contributor Author

Ah I didn't realize that was possible. I'll give it a go, thanks.

@JimShady
Contributor Author

JimShady commented Jul 30, 2024

I did this:

import os
from pathlib import Path
from pyspark.sql.types import StringType

from sedona.spark import *

sedona = SedonaContext.builder().config("spark.databricks.unityCatalog.volumes.enabled","true").getOrCreate()

sc = sedona.sparkContext

Then this

ShapefileReader.readToGeometryRDD(sc, "dbfs:/Volumes/prod_sandbox/su_jim/volume/tests/Albania 2-digit postcode areas 2023.shp")

But I get this error:

TypeError: 'JavaPackage' object is not callable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <command-3459905630890875>, line 1
----> 1 ShapefileReader.readToGeometryRDD(sc, "dbfs:/Volumes/prod_sandbox/su_jim/volume/tests/Albania 2-digit postcode areas 2023.shp")

File /databricks/python/lib/python3.10/site-packages/sedona/core/formatMapper/shapefileParser/shape_file_reader.py:40, in ShapefileReader.readToGeometryRDD(cls, sc, inputPath)
     38 jvm = sc._jvm
     39 jsc = sc._jsc
---> 40 srdd = jvm.ShapefileReader.readToGeometryRDD(
     41     jsc,
     42     inputPath
     43 )
     44 spatial_rdd = SpatialRDD(sc=sc)
     46 spatial_rdd.set_srdd(srdd)

TypeError: 'JavaPackage' object is not callable

@JimShady JimShady reopened this Jul 30, 2024
@jiayuasu
Member

jiayuasu commented Jul 30, 2024

@JimShady The correct way to create SedonaContext on Databricks Python (since v1.4.1) is

from sedona.spark import *

sedona = SedonaContext.create(spark)
sedona.conf.set("spark.databricks.unityCatalog.volumes.enabled", "true")
sc = sedona.sparkContext

The code from @furqaankhan is for Sedona on OSS Spark.

@Kontinuation
Member

The path should point to the directory containing the shapefiles, not to the .cpg file. GeoPandas may be more tolerant of the path.
Reference: https://sedona.apache.org/1.6.0/tutorial/rdd/#from-shapefile
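
For anyone starting from a path to a single component file, as in the original report, a small helper can derive the directory that ShapefileReader expects. This is only a minimal sketch: `shapefile_directory` and the extension list are illustrative names, not part of Sedona, and it does nothing about Unity Catalog access itself.

```python
import posixpath

# Assumption: the components mentioned in this thread (.shp, .cpg) plus the
# other usual shapefile sidecars. Extend as needed.
SHAPEFILE_EXTENSIONS = {".shp", ".shx", ".dbf", ".prj", ".cpg"}


def shapefile_directory(path: str) -> str:
    """Return the directory to pass to the reader for a given path.

    If `path` names a single shapefile component, return its parent directory.
    Otherwise assume it is already a directory and drop any trailing slash.
    """
    _, ext = posixpath.splitext(path)
    if ext.lower() in SHAPEFILE_EXTENSIONS:
        return posixpath.dirname(path)
    return path.rstrip("/")
```

The result can then be passed along as `ShapefileReader.readToGeometryRDD(sc, shapefile_directory(path))`.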

@JimShady
Contributor Author

Yes, pointing to the .cpg file was a typo. I meant the .shp file.

It's frustrating that each shapefile needs to be in its own directory for this to work. I've 100s that I use regularly. I just want to point at the file, not a folder. Could this be considered for future improvements to the package @jiayuasu ?

@JimShady
Contributor Author

I cannot edit this variable.

sedona.conf.set("spark.databricks.unityCatalog.volumes.enabled", "true")

But it seems ok as it's correct anyway.

image

However I still fail to read a shapefile:

import shutil  # os and Path are imported in the earlier snippet

filename = "Albania 2-digit postcode areas 2023.shp"
temp_folder = "/Volumes/prod_sandbox/su_jim/volume/tests/temp"
os.makedirs(temp_folder, exist_ok=True)

# This takes the files I want to read and puts them into their own temp folder so that I can then point the reader at it.

folder_path = "/Volumes/prod_sandbox/su_jim/volume/tests/"
file_list = [os.path.join(folder_path, file) for file in os.listdir(folder_path)]
filtered_files = [file for file in file_list if Path(filename).stem in os.path.basename(file)]

for file in filtered_files:
    source_path = file
    destination_path = os.path.join(temp_folder, os.path.basename(file))
    shutil.copy(source_path, destination_path)

ShapefileReader.readToGeometryRDD(sc, "dbfs:" + temp_folder)

IllegalArgumentException: Cannot access the UC Volume path from this location. Path was /Volumes/prod_sandbox/su_jim/volume/tests/temp
File , line 14
11 destination_path = os.path.join(temp_folder, os.path.basename(file))
12 shutil.copy(source_path, destination_path)
---> 14 ShapefileReader.readToGeometryRDD(sc, "dbfs:" + temp_folder)
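
The staging step above can be sketched a little more defensively. The version below is an illustration only: `stage_shapefile` is a hypothetical helper, it matches files on the exact stem rather than a substring (so a neighbour such as `areas_old.shp` is not copied along with `areas.shp`), and it does not address the `IllegalArgumentException` itself, which concerns access to the volume rather than how the files are laid out. The demo runs on throwaway files with made-up names.

```python
import shutil
import tempfile
from pathlib import Path


def stage_shapefile(shp_path, staging_root):
    """Copy one shapefile and its sidecar files into a directory of their own.

    Files are matched on the exact stem. Files with a double extension
    (for example "areas.shp.xml") have a different stem and are skipped.
    Returns the directory that now holds only this shapefile's components.
    """
    shp_path = Path(shp_path)
    target_dir = Path(staging_root) / shp_path.stem
    target_dir.mkdir(parents=True, exist_ok=True)
    for candidate in shp_path.parent.iterdir():
        if candidate.is_file() and candidate.stem == shp_path.stem:
            shutil.copy(candidate, target_dir / candidate.name)
    return target_dir


# Demo on temporary files (names are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    for name in ("areas.shp", "areas.dbf", "areas.cpg", "areas_old.shp"):
        (src / name).touch()
    staged = stage_shapefile(src / "areas.shp", Path(tmp) / "staged")
    staged_names = sorted(p.name for p in staged.iterdir())
```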

@jiayuasu jiayuasu linked a pull request Aug 24, 2024 that will close this issue
@jiayuasu jiayuasu closed this as not planned Aug 24, 2024
@Kontinuation
Member

The newly added shapefile datasource works on Unity Catalog volumes (tested on DBR 15.4 LTS). You'll be able to read shapefiles in Unity Catalog volumes using:

path = "/Volumes/catalog_name/schema_name/volume_name/shapefile_directory"
df = sedona.read.format("shapefile").load(path)

@JimShady
Contributor Author

JimShady commented Sep 2, 2024

Thank you for the update @Kontinuation . I do still wish that we could point at a file instead of a folder, but this helps at least.

@Kontinuation
Member

> Thank you for the update @Kontinuation . I do still wish that we could point at a file instead of a folder, but this helps at least.

It supports paths pointing to .shp files. sedona.read.format("shapefile").load("/path/to/somefile.shp") also works.
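
Since the data source accepts individual .shp paths, loading many shapefiles that share one folder could look roughly like the sketch below. `load_shapefiles` is a hypothetical helper, not a Sedona API; the only Sedona call it makes is the `read.format("shapefile").load(path)` shown above, and `session` is assumed to be the `sedona` object from earlier in this thread.

```python
import posixpath


def load_shapefiles(session, shp_paths):
    """Load each .shp path with the "shapefile" data source.

    Returns a dict mapping each file's stem to whatever `.load()` returns.
    Raises ValueError for paths that do not end in .shp.
    """
    frames = {}
    for path in shp_paths:
        stem, ext = posixpath.splitext(posixpath.basename(path))
        if ext.lower() != ".shp":
            raise ValueError(f"expected a .shp path, got: {path}")
        frames[stem] = session.read.format("shapefile").load(path)
    return frames
```

Usage would be along the lines of `frames = load_shapefiles(sedona, ["/path/to/somefile.shp", "/path/to/other.shp"])`, where the second path is a placeholder.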

@JimShady
Contributor Author

JimShady commented Sep 2, 2024

Oh that's amazing !!!! Thanks so much.

@JimShady
Contributor Author

JimShady commented Sep 3, 2024

Hi @Kontinuation . Just wondering - is this feature in 1.6.1 , or will it be in 1.7.0 (which I don't think is released yet). Thanks.

@Kontinuation
Member

Kontinuation commented Sep 3, 2024

It will be in 1.7.0. You can try it out using the jars built by GitHub Actions (see Artifacts at the bottom): https://github.com/apache/sedona/actions/runs/10674801446
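
A notebook that has to run against more than one Sedona version could gate the new reader on the version number. This is a rough sketch under assumptions: `has_shapefile_source` is a hypothetical helper, it compares only the leading numeric parts of the version string, and it assumes the version you pass in reflects the Sedona jars actually on the cluster.

```python
def has_shapefile_source(version: str) -> bool:
    """Return True if `version` is 1.7.0 or later.

    Only the leading digits of the first three dot-separated parts are
    compared, so a suffix such as "-SNAPSHOT" is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= (1, 7, 0)
```

One way to obtain the version string is `importlib.metadata.version("apache-sedona")`, which reports the installed Python package rather than the jar.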

@JimShady
Contributor Author

JimShady commented Sep 3, 2024

Thanks


4 participants