
Question about planning ... failures? #63

Open
rpgoldman opened this issue Aug 5, 2024 · 23 comments

Comments

@rpgoldman
Contributor

I took the Jupyter notebook from the repo (which is, I think, the same as the Python script in the README) and modified it slightly (printing the URL and the response from Celery) to test my PaaS server. I'm getting a very odd result:

Service URL is http://lacrosse-0:5001/package/lama-first/solve
Computing...
{'result': {'call': 'timeout 20 planutils run lama-first -- domain problem',
            'output': {},
            'output_type': 'generic',
            'stderr': '',
            'stdout': ''},
 'status': 'ok'}

So no plan, but no error output, either, and celery seems to be claiming my request is no longer PENDING.

Is this what I should expect if there's a time out?

Is there a log I should be looking at to diagnose this?

@rpgoldman
Contributor Author

Same result when I use the code from README.md.

@rpgoldman
Contributor Author

If it helps, here is what happened when I ran the following cell in the notebook:

# An example of the Package API: send a task and retrieve plans with an adaptor.
# Assumes `ip` and `req_body` are defined in earlier notebook cells.
import time
from pprint import pprint

import requests

def test_planner_with_adaptor(package_name, req_body, service="solve"):
    service_url = "http://" + ip + ":5001/package/" + package_name + "/" + service
    solve_request = requests.post(service_url, json=req_body).json()
    result_url = "http://" + ip + ":5001" + solve_request['result']
    celery_result = requests.post(result_url, json={"adaptor": "planning_editor_adaptor"})
    print('Computing...')
    while celery_result.json().get("status", "") == 'PENDING':
        time.sleep(2)
        celery_result = requests.post(result_url, json={"adaptor": "planning_editor_adaptor"})
    pprint(celery_result.json())

test_planner_with_adaptor("lama-first", req_body)

I got back:

Computing...
{'plans': [], 'status': 'ok'}

@haz
Contributor

haz commented Aug 6, 2024

Hrmz...there should be a log in there, ya. Are you able to see anything on the monitor? Same base URL, with the port :5555.

@rpgoldman
Contributor Author

Hm. The server is up (lacrosse-0:8010, where lacrosse-0 is the local hostname). But lacrosse-0:5555 does not respond. Logging into the server and trying curl lacrosse-0:5555, curl http://lacrosse-0:5555 and the same with localhost gives only "Couldn't connect to server."

And looking at a list of my processes, this does not look good:

2603926 ?        00:00:00 planutils <defunct>
2603927 ?        00:00:00 python3 <defunct>
2603928 ?        00:00:00 run <defunct>
2603930 ?        00:00:01 singularity <defunct>
2604046 ?        00:00:02 apptainer <defunct>
2604667 ?        00:00:00 python3 <defunct>
2604668 ?        00:00:00 run <defunct>
2604669 ?        00:00:00 planutils <defunct>
2604670 ?        00:00:00 python3 <defunct>
2604671 ?        00:00:00 run <defunct>
2604673 ?        00:00:18 singularity <defunct>
2605301 ?        00:00:00 python3 <defunct>
2605302 ?        00:00:00 run <defunct>
2605303 ?        00:00:00 planutils <defunct>
2605304 ?        00:00:00 python3 <defunct>
2605305 ?        00:00:00 run <defunct>
2605307 ?        00:00:13 singularity <defunct>
2614612 ?        00:00:00 python3 <defunct>
2614613 ?        00:00:00 run <defunct>
2614614 ?        00:00:00 planutils <defunct>
2614615 ?        00:00:00 python3 <defunct>
2614616 ?        00:00:00 run <defunct>
2614618 ?        00:00:06 singularity <defunct>
2615156 ?        00:00:00 python3 <defunct>
2615157 ?        00:00:00 run <defunct>
2615158 ?        00:00:00 planutils <defunct>
2615159 ?        00:00:00 python3 <defunct>
2615160 ?        00:00:00 run <defunct>
2615162 ?        00:00:05 singularity <defunct>
2618062 ?        00:00:00 python3 <defunct>
2618063 ?        00:00:00 run <defunct>
2618064 ?        00:00:00 planutils <defunct>
2618065 ?        00:00:00 python3 <defunct>
2618066 ?        00:00:00 run <defunct>
2618068 ?        00:00:05 singularity <defunct>
2658068 ?        00:00:00 python3 <defunct>
2658069 ?        00:00:00 run <defunct>
2658070 ?        00:00:00 planutils <defunct>
2658071 ?        00:00:00 python3 <defunct>
2658072 ?        00:00:00 run <defunct>
2658074 ?        00:00:06 singularity <defunct>
2662833 ?        00:00:00 python3 <defunct>
2662834 ?        00:00:00 run <defunct>
2662835 ?        00:00:00 planutils <defunct>
2662836 ?        00:00:00 python3 <defunct>
2662837 ?        00:00:00 run <defunct>
2662839 ?        00:00:07 singularity <defunct>
2663568 ?        00:00:00 python3 <defunct>
2663569 ?        00:00:00 run <defunct>
2663570 ?        00:00:00 planutils <defunct>
2663571 ?        00:00:00 python3 <defunct>
2663572 ?        00:00:00 run <defunct>
2663574 ?        00:00:16 singularity <defunct>
2675256 ?        00:00:00 python3 <defunct>
2675257 ?        00:00:00 run <defunct>
2675258 ?        00:00:00 planutils <defunct>
2675259 ?        00:00:00 python3 <defunct>
2675260 ?        00:00:00 run <defunct>
2675262 ?        00:00:06 singularity <defunct>
2675468 ?        00:00:00 python3 <defunct>
2675469 ?        00:00:00 run <defunct>
2675470 ?        00:00:00 planutils <defunct>
2675471 ?        00:00:00 python3 <defunct>
2675517 ?        00:00:00 run <defunct>
2675519 ?        00:00:11 singularity <defunct>

Looks like these aren't getting reaped? But also, I thought all the action was inside the docker containers? Or is PaaS running singularity on the host machine?

@rpgoldman
Contributor Author

Listing the running containers, I don't see anything that looks like the monitor (but I don't really know how to read a docker-compose.yml file, so I could be wrong):

CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS                                                  NAMES
7b06fca38c82   server_web                      "gunicorn --bind 0.0…"   18 hours ago   Up 18 hours   0.0.0.0:5001->5001/tcp, :::5001->5001/tcp              server_web_1
2ba8eb63c435   server_worker                   "celery -A tasks wor…"   18 hours ago   Up 18 hours                                                          server_worker_1
f50b05015dec   server_frontend                 "gunicorn --bind 0.0…"   18 hours ago   Up 18 hours   0.0.0.0:8010->80/tcp, :::8010->80/tcp                  server_frontend_1
1ad1b3635b9f   mysql:8.0                       "docker-entrypoint.s…"   18 hours ago   Up 18 hours   0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   server_mysql_1
1eaf3fabc1e8   redis                           "docker-entrypoint.s…"   18 hours ago   Up 18 hours   6379/tcp                                               server_redis_1

@haz
Contributor

haz commented Aug 6, 2024

Indeed, the monitor isn't there. It's all done through the Docker containers, but they're bridged on the virtual network. You may be facing a version mismatch similar to #65. Can you try running my version? It's in #62, waiting for review. Added bonus: the frontend port is now a parameter in the .env file.

@rpgoldman
Contributor Author

Aha! I tried docker compose logs and it looks like the lama-first runs claim to have succeeded, but somehow the responses are not coming back. For example:

server_worker_1    | [2024-08-06 00:03:18,655: INFO/MainProcess] Task tasks.run.package[9493497a-6ab4-4b55-a6cf-85e27d6d7843] received
server_worker_1    | [2024-08-06 00:03:38,874: INFO/ForkPoolWorker-16] Task tasks.run.package[9493497a-6ab4-4b55-a6cf-85e27d6d7843] succeeded in 20.21391299692914s: ({'stdout': '', 'stderr': '', 'call': 'timeout 20 planutils run lama-first -- domain problem', 'output': {}, 'output_type': 'generic'}, {'domain': {'value': '
server_worker_1    | (define (domain BLOCKS)
server_worker_1    |    (:requirements :strips)
server_worker_1    |    (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x) )
server_worker_1    |    (:action pick-up :parameters (?x) :precondition (and (clear ?x) (ontable ?x) (handempty)) :effect (and (not (ontable ?x)) (not (clear ?x)) (not (handempty)) (holding ?x)))
server_worker_1    |    (:action put-down :parameters (?x) :precondition (holding ?x)
server_worker_1    |             :effect (and (not (holding ?x)) (clear ?x) (handempty) (ontable ?x)))
server_worker_1    |    (:action stack :parameters (?x ?y) :precondition (and (holding ?x) (clear ?y))
server_worker_1    |             :effect (and (not (holding ?x)) (not (clear ?y)) (clear ?x) (handempty) (on ?x ?y)))
server_worker_1    |    (:action unstack :parameters (?x ?y) :precondition (and (on ?x ?y) (clear ?x) (handempty))
server_worker_1    |             :effect (and (holding ?x) (clear ?y) (not (clear ?x)) (not (handempty)) (not (on ?x ?y)))))
server_worker_1    | ', 'type': 'file'}, 'problem': {'value...', ...}})

Given that the timeout value in the call (20s) is less than the runtime reported (20.21391299692914s), is there any chance that the server is misinterpreting a timeout as successful plan generation?
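For reference, GNU timeout exits with status 124 when it kills the wrapped command, so checking the return code (rather than only capturing stdout/stderr) would distinguish a timeout from a genuinely silent success. A minimal standalone sketch, not the PaaS worker code:

```python
import subprocess

# Minimal sketch, not the PaaS worker code: GNU `timeout` exits with
# status 124 when it kills the wrapped command, and with the command's
# own exit status otherwise.
TIME_LIMIT = 1  # seconds; the real call above uses 20

res = subprocess.run(
    ["timeout", str(TIME_LIMIT), "sleep", str(TIME_LIMIT + 2)],
    capture_output=True, text=True,
)
timed_out = res.returncode == 124
print(timed_out)  # True: the inner `sleep` ran past the limit
```

If the worker only forwards stdout/stderr, a timeout and a silent success are indistinguishable, which would match the empty response above.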

@haz
Contributor

haz commented Aug 6, 2024

It could be, ya. That'd be my hunch as well... @nirlipo : any recollection on what the status code is meant to be on timeout? I'll start poking around to find it in the meantime...

@rpgoldman
Contributor Author

OK, one (last?) update. Is there any chance the monitor has failed?

server_monitor_1   |     from celery.bin.base import Command
server_monitor_1   | ImportError: cannot import name 'Command' from 'celery.bin.base' (/usr/local/lib/python3.10/dist-packages/celery/bin/base.py)

There is no server_monitor in the image list I posted above. I see that this might be #65

@haz
Contributor

haz commented Aug 6, 2024

Yep -- can you try my branch? Should hopefully fix it.

@haz
Contributor

haz commented Aug 6, 2024

So the celery process gets the time limit + 10 (just to make sure the service ends appropriately):

call = f"timeout {TIME_LIMIT} planutils run {call}"

If this is triggered, then this would have been the response:

return {"stdout":"Request Time Out", "stderr":"", "call":call, "output":{},"output_type":output_file["type"]},arguments

The planner itself is given the exact limit:

call = f"timeout {TIME_LIMIT} planutils run {call}"

So the question is what is in the stdout/stderr when the planner completes:

result={"stdout":res.stdout, "stderr":res.stderr, "call":call, "output":output,"output_type":output_file["type"]}

Weird that there would be nothing (the first response you posted). I don't think it's the planning-editor-adapter you want to use.
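Putting the two response shapes quoted above side by side, a client could classify a result like this. This is only a sketch; the field names and the "Request Time Out" marker are taken from the snippets above, and nothing here is actual PaaS client code:

```python
# Sketch of client-side classification based on the two response shapes
# quoted above; not the actual PaaS client code.
def classify(result: dict) -> str:
    if result.get("stdout") == "Request Time Out":
        return "timeout"      # the worker's timeout branch fired
    if result.get("stdout") or result.get("output"):
        return "completed"    # the planner produced something
    return "empty"            # the puzzling case reported in this issue

print(classify({"stdout": "Request Time Out", "stderr": "", "output": {}}))  # timeout
print(classify({"stdout": "", "stderr": "", "output": {}}))                  # empty
```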

@rpgoldman
Contributor Author

Should I expect the plan to be in "stdout" or in "output"?

Also, a side issue that I will raise in a separate issue: when one is running rootless and starts a privileged container, it's easy to end up with files created in mounted directories with UIDs that don't exist on the host machine, and thus cannot be removed by the host machine's owner, e.g., error checking context: no permission to read from '/home/rpg/src/planning-as-a-service/server/db_data/#ib_16384_0.dblwr'. That file's UID and GID are both 328532966 on the host machine.

If one must do this, there's some way to sync the uids. I will talk to my colleagues and ask them for the trick. But, of course, if one can avoid running as root while writing files in shared directories, that's great.

@haz
Contributor

haz commented Aug 6, 2024

Should I expect the plan to be in "stdout" or in "output"?

Without an adapter specified, I think it's in result/stdout. At least that's my read of the code. If you specify an adapter, it will try to rework the json response according to what the adapter specifies.

But, of course, if one can avoid running as root while writing files in shared directories, that's great.

This issue has plagued me across many projects. Would love to know an answer to how to treat it more appropriately!

@rpgoldman
Contributor Author

This issue has plagued me across many projects. Would love to know an answer to how to treat it more appropriately!

See #66

We have a local approach to syncing file ownership, but it involves a fair amount of shell scripting to sync UIDs between the host and the Dockerfile, where the Dockerfile imports a shell script (realuser.sh) and runs it on startup. There seems to be a bunch of black art involved. I will see what I can learn and pass on anything I find in #66.

@rpgoldman
Contributor Author

Sorry to be so difficult. I grabbed your new version @haz and now I see this in docker compose logs:

server_web_1       | [2024-08-06 17:26:19,274] ERROR in app: Exception on /package/lama-first/solve [POST]
server_web_1       | Traceback (most recent call last):
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2447, in wsgi_app
server_web_1       |     response = self.full_dispatch_request()
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1952, in full_dispatch_request
server_web_1       |     rv = self.handle_user_exception(e)
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask_cors/extension.py", line 165, in wrapped_function
server_web_1       |     return cors_after_request(app.make_response(f(*args, **kwargs)))
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1821, in handle_user_exception

Unfortunately, there's no indication of what the user exception e is.

In the client Jupyter notebook, I see this

{'Error': 'This Planutils package is not configured correctly'}

That's the value of solve_request in:

package_name = "lama-first"
service = "solve"
service_url="http://"+ip+":5001/package/"+package_name+"/"+service
solve_request = requests.post(service_url, json=req_body,headers=headers).json()
solve_request

I'm at a bit of a dead end here, since when I search the AI-Planning group on GitHub for "This Planutils package" I find nothing.

@haz
Contributor

haz commented Aug 6, 2024

Aye, this is the kind of catch-all, and more info really should be logged. Usually it occurs when I try calling a planner that's not installed (or it doesn't have that privileged status to run).

# For any unexpected error, this error message will return

Note that the error parameter isn't touched at all :P. I can add it in tonight (on my branch), and at least get better logging.

If you have a domain/problem pairing you're willing to share, I can make sure I get things to a workable state on my side.

@rpgoldman
Contributor Author

I was just using the example in the Jupyter notebook in the planning-as-a-service repo. I will attach my version of that here.

Planning-as-service.ipynb.txt

@rpgoldman
Contributor Author

Aye, this is the kind of catch-all, and more info really should be logged. Usually it occurs when I try calling a planner that's not installed (or it doesn't have that privileged status to run).

# For any unexpected error, this error message will return

Note that the error parameter isn't touched at all :P. I can add it in tonight (on my branch), and at least get better logging.

Oh, yes. It might be reasonable to put the exception in there (presumably one can translate arbitrary exceptions into JSON?) and possibly even stash the backtrace...
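For what it's worth, the standard library covers both pieces. A sketch of serializing an arbitrary exception plus its backtrace into JSON (none of this is PaaS code):

```python
import json
import traceback

# Sketch: turn an arbitrary exception into a JSON-safe payload,
# including the backtrace, using only the standard library.
def exception_to_json(exc: Exception) -> str:
    return json.dumps({
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    })

try:
    raise ValueError("planner misconfigured")
except ValueError as err:
    payload = exception_to_json(err)

print(json.loads(payload)["type"])  # ValueError
```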

@haz
Contributor

haz commented Aug 6, 2024

It may be server internal stuff, and perhaps not the wisest to send back...it fires on any 500 error. But at present it's not even in the server logs, and that should be fixed!

@rpgoldman
Contributor Author

Are the server logs what I see in docker compose logs? Or are there files somewhere I should be examining?

@rpgoldman
Contributor Author

It may be server internal stuff, and perhaps not the wisest to send back...it fires on any 500 error. But at present it's not even in the server logs, and that should be fixed!

Random thought: could you catch any errors raised by the planners themselves and give them a specific type? Then you could send along exceptions of that type and their backtraces, and ignore miscellaneous errors that are not of type PlanSolverError (or whatever). I don't know enough about the server's architecture to know whether that's feasible.

@haz
Contributor

haz commented Aug 7, 2024

The problem with that one is that every planner has its own custom way of running and failing. I think one thing we could do is have planner authors implement their own way to run it so that the result is a JSON output. Then we just echo that JSON back to the user who invoked it.

E.g., the Fast Downward driver (written in Python) could have a top-level try-catch that dumps a JSON with failure info when a failure occurs. That also makes it easy for arbitrary planning software to be hosted (e.g., domain generators, model acquisition techniques, VAL, etc.). It just needs to be on the individual package maintainer's side, rather than something cross-cutting for all the solvers.
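A sketch of that per-package convention: the driver's entry point catches everything and returns a JSON-shaped result the server can echo back verbatim. Here run_planner and main are hypothetical stand-ins, not any planner's real entry point:

```python
import traceback

# Sketch of the per-package convention proposed above; `run_planner`
# is a hypothetical stand-in for a package's real entry point.
def run_planner(domain: str, problem: str) -> str:
    raise RuntimeError("translator failed")  # simulate a planner failure

def main(domain: str, problem: str) -> dict:
    try:
        return {"status": "ok", "result": run_planner(domain, problem)}
    except Exception as exc:
        return {
            "status": "error",
            "error": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }

result = main("domain.pddl", "problem.pddl")
print(result["status"])  # error
```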

@haz
Contributor

haz commented Aug 8, 2024

Minor change for better error handling: haz#2
