GDS - RandomWalk - Unable to load NODE #337

Open
Mintactus opened this issue Nov 20, 2024 · 6 comments
Labels
BUG Something isn't working

Comments

@Mintactus

Neo4j 5.25.1
GDS 2.12
GDS Python Client 1.12

The randomWalk algo doesn't load my sourceNode, details below:

My in-memory GDS graph was built from a pandas DataFrame using the gds construct method, so it does not and will not exist on disk; it is intended for in-memory analysis only.

Here is the content of the in-memory graph, extracted with gds.graph.nodeProperty.stream:

                 nodeId  propertyValue  nodeLabels
0   6335695024714629015       -0.00003
1    531768015437695177        0.00009
2   3558886278460545694       -0.00012
3   7960371801618416072       -0.00006
4    688712822280937494        0.00009
5   6445645390101772454        0.00000
6   4640442843099832304       -0.00006
7   6026970582286088324        0.00006
8   5356341080109221825        0.00003
9   1843909622001289035        0.00006
10  5984421542275516993       -0.00009
11  1113611838033320553       -0.00003
12  4162479979561917907        0.00003

When I try to run randomWalk:

    sourceNode = self.markov_chain_nodes['nodeId'].last()  # this outputs a signed int64
    random_walk_config = {
        'sourceNodes': [sourceNode],
        'walkLength': FUTURE_SIZE,
        'walksPerNode': 1,
        'relationshipWeightProperty': 'transition probability',
        'concurrency': 4
    }
    future = self.gds.randomWalk.stream(self.graph, **random_walk_config)

I get this error: {message: Failed to invoke procedure gds.randomWalk.stream: Caused by: org.neo4j.internal.kernel.api.exceptions.EntityNotFoundException: Unable to load NODE 4162479979561917907.}.

But the node ID 4162479979561917907 clearly exists in the in-memory graph.

I read that I'm supposed to use gds.find_node_id to match the sourceNode, but this is an in-memory graph only and doesn't need to become an on-disk graph. Having to create an on-disk graph just to make it work doesn't make any sense to me.

This might also be considered as a feature request then...

Thanks for your support :)

@Mintactus Mintactus added the BUG Something isn't working label Nov 20, 2024
@IoannisPanagiotas
Contributor

IoannisPanagiotas commented Nov 21, 2024

Hi @Mintactus ,

I have looked into your issue. I can confirm there is a bug in randomWalk when working with graphs that are not backed by a database. We have applied a fix which should be out in the next GDS release, but I am not sure when that is going to be.

In the meantime, as a workaround, I would suggest the following:

Instead of running randomWalk through the GDS Python client, you can use the Neo4j Python client and call a Cypher query directly. There are instructions at https://neo4j.com/docs/python-manual/current/ for how to do this.

The Cypher query that you need is the following, where X is
sourceNode = self.markov_chain_nodes['nodeId'].last()

 CALL gds.randomWalk.stream(
  'myGraph',
  {
    sourceNodes: X,
    walkLength: 3,
    walksPerNode: 1,
    randomSeed: 42,
    concurrency: 1
  }
)
YIELD nodeIds

I believe that execute_query in the page I shared should work.

This should work as it avoids doing the faulty computation. Let us know if you need any help in running that query.
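A minimal sketch of this workaround in Python, assuming a `driver` object created with the Neo4j Python driver as described in the linked manual. The helper names `build_random_walk_params` and `stream_random_walks` are hypothetical and only for illustration; the query yields only `nodeIds`, as in the Cypher above, and passes the source node as a query parameter instead of pasting it into the string.

```python
# Cypher from the workaround above, with the literal values turned into parameters.
RANDOM_WALK_QUERY = """
CALL gds.randomWalk.stream(
  $graph_name,
  {
    sourceNodes: $source_nodes,
    walkLength: $walk_length,
    walksPerNode: $walks_per_node
  }
)
YIELD nodeIds
RETURN nodeIds
"""


def build_random_walk_params(graph_name, source_node_ids, walk_length, walks_per_node=1):
    """Build the parameter map; int() turns numpy/pandas integers into plain Python ints."""
    return {
        "graph_name": graph_name,
        "source_nodes": [int(node_id) for node_id in source_node_ids],
        "walk_length": int(walk_length),
        "walks_per_node": int(walks_per_node),
    }


def stream_random_walks(driver, graph_name, source_node_ids, walk_length,
                        walks_per_node=1, database=None):
    """Run the query through driver.execute_query and return one list of node IDs per walk."""
    params = build_random_walk_params(graph_name, source_node_ids, walk_length, walks_per_node)
    result = driver.execute_query(RANDOM_WALK_QUERY, parameters_=params, database_=database)
    return [list(record["nodeIds"]) for record in result.records]


# Usage (URI and credentials are placeholders, not values from this thread):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
#     walks = stream_random_walks(driver, "myGraph", [sourceNode], walk_length=3)
```

Each returned walk is a plain list of the node IDs used in the in-memory graph.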

@FlorentinD
Contributor

You can also still use the GDS client:

gds.run_cypher("""CALL gds.randomWalk.stream(
  'myGraph',
  {
    sourceNodes: X,
    walkLength: 3,
    walksPerNode: 1,
    randomSeed: 42,
    concurrency: 1
  }
)
YIELD nodeIds
""") 

@Mintactus
Author

Thank you guys,

@IoannisPanagiotas
@FlorentinD

I'm glad to know I wasn't crazy. I have used it for a while, and on this one I couldn't explain what I was doing wrong.

Amazing support

@Mintactus
Author

Mintactus commented Nov 22, 2024

I did some deeper testing and investigation.

If I'm right, a graph created using the construct method (a graph that does not exist on disk) uses the nodeId values provided in the DataFrame as the actual node IDs, which should be usable as sourceNodes inside an algorithm. This seems to be right based on the picture provided.

As suggested, I tried the above using only the Cypher statement in the browser instead of the GDS Python Client randomWalk method, but GDS is still not able to locate the node ID. So it seems the problem is not coming from the GDS Python Client but rather from GDS itself not being able to locate a node ID in a graph that does not exist on disk.

To reproduce the issue, you basically build an in-memory graph from a DataFrame using the construct method, then try to run the randomWalk algorithm using Cypher with any sourceNode in it. It fails.

Unless I missed something in the docs, this behavior obliges the developer to:

-Export the in-memory graph into a new database (it has to be a new one; you can't use the one the gds object initiated its connection with)
-Create a new gds object linked to this new database
-Create a new native in-memory projection from this new database
-Then run the algorithm on this new projection

That is a huge workaround, and it makes in-memory graphs drastically less exciting to use.
But thanks for your support, hopefully a patched version will come out soon :)

[Image: nodeIdProblem]

@IoannisPanagiotas
Contributor

IoannisPanagiotas commented Nov 23, 2024

@Mintactus

Please remove 'path' from the YIELD clause, as in the query we shared above!
The bug is contained in that part, because it relies on having a Neo4j graph. The query should run normally after that.

Best.

@Mintactus
Author

Mintactus commented Nov 25, 2024

Thanks for your support.

I will match the IDs returned by randomWalk against the output of gds.graph.nodeProperty.stream to find the nodes and their properties involved in the walk. I will still be in line to test the new patch, as essential path information cannot be retrieved once the path is removed from the algorithm's output. So for now, the complete workaround is still to export to a new database, create a new GDS object, and create a new projection.

Unless there is something else you want to add, I might close the ticket soon.

Thanks again, I will test the patched version when it's out.
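The matching step described above could be sketched as follows. This is an illustration only: `annotate_walks` is a hypothetical helper, the property values are a subset of the nodeProperty.stream output shown earlier in this thread, and the walk itself is made up. With a pandas DataFrame `df` holding that output, the lookup dictionary could be built with `dict(zip(df["nodeId"], df["propertyValue"]))`.

```python
def annotate_walks(walks, node_properties):
    """Pair each node ID in every walk with its property value (None if the ID is unknown)."""
    return [
        [(node_id, node_properties.get(node_id)) for node_id in walk]
        for walk in walks
    ]


# Subset of the node properties streamed from the in-memory graph.
node_properties = {
    4162479979561917907: 0.00003,
    6335695024714629015: -0.00003,
    531768015437695177: 0.00009,
}

# Hypothetical walk starting at the source node; real walks come from the YIELD nodeIds column.
walks = [[4162479979561917907, 531768015437695177, 6335695024714629015]]

annotated = annotate_walks(walks, node_properties)
for node_id, value in annotated[0]:
    print(node_id, value)
```

This recovers the property of each visited node, but not the relationships traversed between them, which is the path information that remains unavailable.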


3 participants