Implemented turf-clusters-dbscan with spatial data structure (fix #2492) #2497

Merged
9 commits merged into Turfjs:master on Nov 25, 2023

Conversation

TadaTeruki
Contributor

@TadaTeruki TadaTeruki commented Sep 22, 2023

  • Use a meaningful title for the pull request. Include the name of the package modified.
  • Have read How To Contribute.
  • Run npm test at the sub modules where changes have occurred.
  • Run npm run lint to ensure code style at the turf module level.

Faster DBSCAN implementation for turf-clusters-dbscan.

  • Use spatial index for region query
  • Remove unmaintained external packages
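
For readers unfamiliar with why the region query dominates the cost, here is a minimal DBSCAN sketch with a pluggable `regionQuery` callback. It is an illustration only, not the code in this PR: the function names, the `-1` noise label, and the brute-force planar query are all assumptions made for the example. Swapping the brute-force callback for a spatial-index lookup is the kind of change described above.

```javascript
// Minimal DBSCAN sketch (hypothetical names, not the PR's implementation).
// `regionQuery(i)` must return the indices of all points within the search
// radius of point i, including i itself.
function dbscan(pointCount, regionQuery, minPoints) {
  const NOISE = -1;
  const labels = new Array(pointCount).fill(undefined); // undefined = unvisited
  let clusterId = 0;

  for (let i = 0; i < pointCount; i++) {
    if (labels[i] !== undefined) continue;
    const neighbors = regionQuery(i);
    if (neighbors.length < minPoints) {
      labels[i] = NOISE; // may be relabelled as a border point later
      continue;
    }
    labels[i] = clusterId;
    const queue = neighbors.filter((j) => j !== i);
    while (queue.length > 0) {
      const j = queue.pop();
      if (labels[j] === NOISE) labels[j] = clusterId; // border point joins the cluster
      if (labels[j] !== undefined) continue; // already assigned
      labels[j] = clusterId;
      const jNeighbors = regionQuery(j);
      if (jNeighbors.length >= minPoints) {
        for (const k of jNeighbors) queue.push(k); // j is a core point, keep expanding
      }
    }
    clusterId++;
  }
  return labels;
}

// Brute-force region query on planar points: O(n) per call, O(n^2) overall.
function bruteForceQuery(points, radius) {
  return (i) =>
    points
      .map((_, j) => j)
      .filter(
        (j) =>
          Math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= radius
      );
}

const pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]];
const labels = dbscan(pts.length, bruteForceQuery(pts, 1.5), 2);
console.log(labels); // → [0, 0, 0, 1, 1, -1]
```

Because `dbscan` only sees indices, the distance metric and the index structure stay entirely inside the callback.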

Performance

fiji: 1.144ms
many-points: 8.363ms
noise: 0.18ms
points-with-properties: 0.158ms
points1: 0.156ms
points2: 0.208ms
fiji x 94,308 ops/sec ±0.41% (96 runs sampled)
many-points x 360 ops/sec ±0.46% (91 runs sampled)
noise x 15,181 ops/sec ±0.32% (95 runs sampled)
points-with-properties x 92,153 ops/sec ±0.25% (96 runs sampled)
points1 x 15,769 ops/sec ±0.21% (97 runs sampled)
points2 x 11,746 ops/sec ±3.14% (84 runs sampled)

Environment

  • CPU: 11th Gen Intel i7-11390H (8) @ 5.000GHz
  • Memory: 16GB (15741MiB)

Fixes #2492

@TadaTeruki
Contributor Author

TadaTeruki commented Sep 22, 2023

This uses RBush as the spatial data structure so as not to add new dependencies. It is fast enough.

According to the RBush README, KDBush is faster for indexing points, which is what DBSCAN needs. However, KDBush does not support CommonJS yet and would add a new dependency.
Which is preferred?

@TadaTeruki TadaTeruki changed the title Faster turf-clusters-dbscan (fix #2492) Implemented turf-clusters-dbscan with spatial data structure (fix #2492) Sep 22, 2023
@TadaTeruki
Contributor Author

TadaTeruki commented Sep 22, 2023

I have another question:

My implementation passes npm test, but when I tested it separately I found that the results are not always the same as those of the previous implementation. The difference is most obvious for points near the poles.

In the previous implementation, the points are treated as two-dimensional (longitude as X and latitude as Y). This calculation looks incorrect, because a fixed maxDistance does not correspond to a constant length in that 2D space.

My implementation fixes this problem, but the results will change.
What should I do?
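
To illustrate the concern about treating longitude and latitude as planar X and Y, the sketch below compares the ground length of one degree of longitude at the equator and at 89° latitude. It uses a hand-written haversine formula rather than any Turf function, and the Earth radius constant is an assumed mean value.

```javascript
// Assumed mean Earth radius in kilometers (not taken from the PR).
const EARTH_RADIUS_KM = 6371.0088;

// Great-circle distance between two [lon, lat] positions given in degrees.
function haversineKm([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// One degree of longitude, measured on the ground at two latitudes.
const atEquator = haversineKm([0, 0], [1, 0]);
const nearPole = haversineKm([0, 89], [1, 89]);

console.log(atEquator.toFixed(1), "km per degree of longitude at the equator");
console.log(nearPole.toFixed(1), "km per degree of longitude at 89° latitude");
```

The first value is about 111 km and the second about 1.9 km, so a fixed difference in degrees corresponds to very different ground distances depending on latitude.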

Resources

My code for testing accuracy

Code

const turf = require('@turf/turf');
// My implementation
const dbscan = require('turf-clusters-dbscan').default;
// Original implementation
const turfdb = require('@turf/clusters-dbscan').default;

function generateRandomPoints(count) {
    return turf.randomPoint(count, {bbox: [-180, -90, 180, 90]});
}

function testDBSCAN(points) {
    const count = points.features.length;
    const clustered1 = dbscan(points, 0.1*Math.sqrt(count), {units: 'kilometers', minPoints: 2});
    const clustered2 = turfdb(points, 0.1*Math.sqrt(count), {units: 'kilometers', minPoints: 2});

    // sort clustered1.features[index] by lat
    clustered1.features.sort((a,b) => {
        return a.geometry.coordinates[1] - b.geometry.coordinates[1];
    });
    // sort clustered2.features[index] by lat
    clustered2.features.sort((a,b) => {
        return a.geometry.coordinates[1] - b.geometry.coordinates[1];
    });

    var clustermap = {};
    const incorrect = (index) => {
        const cluster1 = clustered1.features[index];
        const cluster2 = clustered2.features[index];
        console.log("Incorrect!");
        console.log("Result: ", cluster1);
        console.log("Expected: ", cluster2);
        console.log("-----------");
    }
    clustered1.features.forEach((_, index) => {
        const cluster1 = clustered1.features[index].properties;
        const cluster2 = clustered2.features[index].properties;
        if((!cluster1.cluster && cluster2.cluster) || (cluster1.cluster && !cluster2.cluster)) {
            incorrect(index);
        }
        if(cluster1.cluster && cluster2.cluster) {
            if(clustermap[cluster1.cluster]) {
                if(clustermap[cluster1.cluster] != cluster2.cluster) {
                    incorrect(index);
                }
            } else {
                clustermap[cluster1.cluster] = cluster2.cluster;
            }
        }
        if(cluster1.dbscan && cluster2.dbscan) {
            if(cluster1.dbscan != cluster2.dbscan) {
                incorrect(index);
            }
        }
    });
}

const N = [100, 1000, 2000];

for (const n of N) {
    console.log(`Testing DBSCAN accuracy for N = ${n}...`);
    const randomPoints = generateRandomPoints(n);
    testDBSCAN(randomPoints);
    console.log('--------------------------');
}

Result

Testing DBSCAN accuracy for N = 100...
--------------------------
Testing DBSCAN accuracy for N = 1000...
--------------------------
Testing DBSCAN accuracy for N = 2000...
Incorrect!
Result:  {
  type: 'Feature',
  properties: { dbscan: 'core', cluster: 0 },
  geometry: {
    type: 'Point',
    coordinates: [ -172.23841889389155, -89.97795925584309 ]
  }
}
Expected:  {
  type: 'Feature',
  properties: { dbscan: 'noise' },
  geometry: {
    type: 'Point',
    coordinates: [ -172.23841889389155, -89.97795925584309 ]
  }
}
-----------
Incorrect!
Result:  {
  type: 'Feature',
  properties: { dbscan: 'core', cluster: 1 },
  geometry: {
    type: 'Point',
    coordinates: [ 63.8281943048008, -88.98541665690192 ]
  }
}
Expected:  {
  type: 'Feature',
  properties: { cluster: 0, dbscan: 'core' },
  geometry: {
    type: 'Point',
    coordinates: [ 63.8281943048008, -88.98541665690192 ]
  }
}
-----------
Incorrect!
Result:  {
  type: 'Feature',
  properties: { dbscan: 'core', cluster: 1 },
  geometry: {
    type: 'Point',
    coordinates: [ 64.41809667006919, -88.9542553110876 ]
  }
}
Expected:  {
  type: 'Feature',
  properties: { cluster: 0, dbscan: 'core' },
  geometry: {
    type: 'Point',
    coordinates: [ 64.41809667006919, -88.9542553110876 ]
  }
}
-----------
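
One caveat when reading the reports above: in JavaScript `0` is falsy, so a check such as `cluster1.cluster && !cluster2.cluster` treats cluster id 0 like a missing cluster. The second and third reports differ only in cluster id (1 versus 0), which may be a renumbering rather than a real disagreement. A minimal sketch of a comparison that tests explicitly for `undefined` and accepts any one-to-one renaming of cluster ids follows; the function name and the label representation are hypothetical, chosen for this example.

```javascript
// Compare two labelings up to a one-to-one renaming of cluster ids.
// Each label is a non-negative integer cluster id, or undefined for noise.
function sameClustering(labelsA, labelsB) {
  if (labelsA.length !== labelsB.length) return false;
  const aToB = new Map();
  const bToA = new Map();
  for (let i = 0; i < labelsA.length; i++) {
    const a = labelsA[i];
    const b = labelsB[i];
    const aIsNoise = a === undefined; // explicit check, so id 0 is not mistaken for noise
    const bIsNoise = b === undefined;
    if (aIsNoise !== bIsNoise) return false;
    if (aIsNoise) continue;
    // The mapping between ids must be consistent in both directions.
    if (aToB.has(a) && aToB.get(a) !== b) return false;
    if (bToA.has(b) && bToA.get(b) !== a) return false;
    aToB.set(a, b);
    bToA.set(b, a);
  }
  return true;
}

// Same grouping, different numbering: treated as equal.
console.log(sameClustering([0, 0, 1, undefined], [1, 1, 0, undefined]));
// Different grouping: treated as different.
console.log(sameClustering([0, 0, 1], [0, 1, 1]));
```

With the features sorted identically, the label arrays could be built with something like `features.map((f) => f.properties.cluster)`.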
Alternative implementation of `regionQuery` that does not change the results of the original implementation
// Function to find neighbors of a point within a given distance
const regionQuery = (index: number): IndexedPoint[] => {
  const point = points.features[index];
  const [x, y] = point.geometry.coordinates;

  const minY = Math.max(y - latDistanceInDegrees, -90.0);
  const maxY = Math.min(y + latDistanceInDegrees, 90.0);

  const lonDistanceInDegrees = (function () {
    // The bounding box straddles the equator (minY < 0 < maxY): use the unscaled distance
    if (minY < 0 && maxY > 0) {
      return latDistanceInDegrees;
    }
    if (Math.abs(minY) < Math.abs(maxY)) {
      return latDistanceInDegrees / Math.cos(degreesToRadians(maxY));
    } else {
      return latDistanceInDegrees / Math.cos(degreesToRadians(minY));
    }
  })();

  const minX = Math.max(x - lonDistanceInDegrees, -360.0);
  const maxX = Math.min(x + lonDistanceInDegrees, 360.0);

  // Calculate the bounding box for the region query
  const baseBbox = { minX, minY, maxX, maxY };
  
  const neighbors = tree.search(baseBbox)
    .filter((neighbor) => {
      const neighborIndex = (neighbor as IndexedPoint).index;
      const neighborPoint = points.features[neighborIndex];
      const dist = distance(point, neighborPoint);
      return dist < latDistanceInDegrees;
  });

  return neighbors as IndexedPoint[];
};
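
The bounding-box step can be sketched on its own as follows. This is a simplified illustration under stated assumptions, not the code above: `searchBbox` is a hypothetical name, `radiusDeg` is assumed to be the search radius already converted to degrees, the box is approximate, and wrapping across the antimeridian is ignored. It differs from the snippet in two ways: it always scales by the cosine of the box edge farthest from the equator, and it caps the longitude half-width at 180 degrees.

```javascript
const degreesToRadians = (deg) => (deg * Math.PI) / 180;

// Approximate search box, in degrees, around [lon, lat] for an angular
// radius `radiusDeg`. Ignores antimeridian wrapping (assumption).
function searchBbox([lon, lat], radiusDeg) {
  const minY = Math.max(lat - radiusDeg, -90);
  const maxY = Math.min(lat + radiusDeg, 90);

  // cos(latitude) is smallest at the edge farthest from the equator, so
  // that edge needs the widest longitude span.
  const farthestLat = Math.max(Math.abs(minY), Math.abs(maxY));
  const cosLat = Math.cos(degreesToRadians(farthestLat));

  // Near a pole cos(latitude) approaches 0 and the quotient explodes,
  // so cap the half-width at 180 degrees of longitude.
  const halfWidth = cosLat > 0 ? Math.min(radiusDeg / cosLat, 180) : 180;

  return { minX: lon - halfWidth, minY, maxX: lon + halfWidth, maxY };
}

console.log(searchBbox([10, 0], 1)); // slightly wider than 1 degree each side
console.log(searchBbox([0, 60], 1)); // roughly 2 degrees each side
console.log(searchBbox([0, 89.9], 1)); // capped at 180 degrees each side
```

The box is only a coarse prefilter: candidates it returns still need an exact distance check, as the `filter` call in the snippet above does.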

@TadaTeruki TadaTeruki marked this pull request as ready for review September 22, 2023 03:00
@smallsaucepan
Member

Thanks for your PR @TadaTeruki

@mfedderly and @twelch will know more about this module than I do. Perhaps they have some questions for you?

If I understand the above though, your new implementation passes all the existing tests. However you've found some places near the poles we were not testing, and your implementation gives different (more accurate?) results than the old implementation.

Do you have some example locations you can attach where the different implementations give different results?

@TadaTeruki
Contributor Author

TadaTeruki commented Oct 5, 2023

Thank you for checking my pull request. @smallsaucepan

I ran more tests to find examples that give different results.
In the end, I found that my comment at #2497 (comment) was incorrect.

The differences were caused by bugs in my implementation.
When I tried to adjust the search range of the spatial index by latitude, the search range became excessively large near the poles. As a result, certain locations were processed multiple times, which produced incorrect clustering.

Fundamentally, the two implementations give the same results.
I apologize for making statements based on assumptions without adequate verification. I will submit a new commit soon.

@smallsaucepan
Member

Thanks for the detailed explanation @TadaTeruki and for your efforts with this 👍

@TadaTeruki
Contributor Author

I'm done. Please check it.

@smallsaucepan
Member

Hi @TadaTeruki. There is another change you'll need to make before we can merge this. Looking at package.json the dependencies don't include rbush which your implementation now uses. This probably worked ok in development because other packages use rbush and it was already installed elsewhere. If someone installs @turf/clusters-dbscan only though, rbush isn't installed and the module can't be found.

See turf-unkink-polygon for an example of how rbush is included.

Similarly, you can remove the density-clustering and @types/density-clustering dependencies at the same time, as they are no longer required. Let us know when you've pushed this and we can take another look. Also reach out if you have any questions.

@smallsaucepan
Member

If you would like to test deploying to a local registry for yourself, the steps are outlined in the wiki - https://github.com/Turfjs/turf/wiki/Contributing#deploy-to-a-local-node-registry

@TadaTeruki
Contributor Author

Sorry for being late. Is there anything else I need to do?

* Import 'RBush' for spatial indexing

* Removed 'dbscan-clustering'

* Reimplemented DBSCAN for performance

* Test
* Write more comments
* Smaller bounding box for region query with RBush
* Use Bulk-Insertion for adding data to RBush tree

* Slightly improved performance of region query
 - remove unused packages
 - add rbush
@smallsaucepan
Member

No need to apologise @TadaTeruki. Appreciate your help. I'll take a look and hopefully we can get this merged soon!

@smallsaucepan smallsaucepan merged commit 407619b into Turfjs:master Nov 25, 2023
3 checks passed
@smallsaucepan
Member

Thanks for all your work on this @TadaTeruki 👍

Development

Successfully merging this pull request may close these issues.

turf-clusters-dbscan looks very slow
2 participants