
Question about planning ... failures? #63

Open
rpgoldman opened this issue Aug 5, 2024 · 23 comments

Comments

@rpgoldman
Contributor

I took the Jupyter notebook from the repo (which is, I think, the same as the Python script in the README) and modified it slightly (printing the URL and the response from Celery) to test my PaaS server. I'm getting a very odd result:

Service URL is http://lacrosse-0:5001/package/lama-first/solve
Computing...
{'result': {'call': 'timeout 20 planutils run lama-first -- domain problem',
            'output': {},
            'output_type': 'generic',
            'stderr': '',
            'stdout': ''},
 'status': 'ok'}

So no plan, but no error output, either, and celery seems to be claiming my request is no longer PENDING.

Is this what I should expect if there's a time out?

Is there a log I should be looking at to diagnose this?

@rpgoldman
Contributor Author

Same result when I use the code from README.md.

@rpgoldman
Contributor Author

If it helps, here is what happened when I ran the following cell in the notebook:

# An example of the Package API: send a task and retrieve plans with an adaptor.
# Assumes `ip` and `req_body` are defined in earlier notebook cells.
import time
from pprint import pprint

import requests

def test_planner_with_adaptor(package_name, req_body, service="solve"):
    service_url = "http://" + ip + ":5001/package/" + package_name + "/" + service
    solve_request = requests.post(service_url, json=req_body).json()
    result_url = "http://" + ip + ":5001" + solve_request['result']
    celery_result = requests.post(result_url, json={"adaptor": "planning_editor_adaptor"})
    print('Computing...')
    while celery_result.json().get("status", "") == 'PENDING':
        time.sleep(2)
        celery_result = requests.post(result_url, json={"adaptor": "planning_editor_adaptor"})
    pprint(celery_result.json())

test_planner_with_adaptor("lama-first", req_body)

I got back:

Computing...
{'plans': [], 'status': 'ok'}

@haz
Contributor

haz commented Aug 6, 2024

Hrmz...there should be a log in there, ya. Are you able to see anything on the monitor? Same base URL, with the port :5555.

@rpgoldman
Contributor Author

Hm. The server is up (lacrosse-0:8010, where lacrosse-0 is the local hostname). But lacrosse-0:5555 does not respond. Logging into the server and trying curl lacrosse-0:5555, curl http://lacrosse-0:5555 and the same with localhost gives only "Couldn't connect to server."

And looking at a list of my processes, this does not look good:

2603926 ?        00:00:00 planutils <defunct>
2603927 ?        00:00:00 python3 <defunct>
2603928 ?        00:00:00 run <defunct>
2603930 ?        00:00:01 singularity <defunct>
2604046 ?        00:00:02 apptainer <defunct>
2604667 ?        00:00:00 python3 <defunct>
2604668 ?        00:00:00 run <defunct>
2604669 ?        00:00:00 planutils <defunct>
2604670 ?        00:00:00 python3 <defunct>
2604671 ?        00:00:00 run <defunct>
2604673 ?        00:00:18 singularity <defunct>
2605301 ?        00:00:00 python3 <defunct>
2605302 ?        00:00:00 run <defunct>
2605303 ?        00:00:00 planutils <defunct>
2605304 ?        00:00:00 python3 <defunct>
2605305 ?        00:00:00 run <defunct>
2605307 ?        00:00:13 singularity <defunct>
2614612 ?        00:00:00 python3 <defunct>
2614613 ?        00:00:00 run <defunct>
2614614 ?        00:00:00 planutils <defunct>
2614615 ?        00:00:00 python3 <defunct>
2614616 ?        00:00:00 run <defunct>
2614618 ?        00:00:06 singularity <defunct>
2615156 ?        00:00:00 python3 <defunct>
2615157 ?        00:00:00 run <defunct>
2615158 ?        00:00:00 planutils <defunct>
2615159 ?        00:00:00 python3 <defunct>
2615160 ?        00:00:00 run <defunct>
2615162 ?        00:00:05 singularity <defunct>
2618062 ?        00:00:00 python3 <defunct>
2618063 ?        00:00:00 run <defunct>
2618064 ?        00:00:00 planutils <defunct>
2618065 ?        00:00:00 python3 <defunct>
2618066 ?        00:00:00 run <defunct>
2618068 ?        00:00:05 singularity <defunct>
2658068 ?        00:00:00 python3 <defunct>
2658069 ?        00:00:00 run <defunct>
2658070 ?        00:00:00 planutils <defunct>
2658071 ?        00:00:00 python3 <defunct>
2658072 ?        00:00:00 run <defunct>
2658074 ?        00:00:06 singularity <defunct>
2662833 ?        00:00:00 python3 <defunct>
2662834 ?        00:00:00 run <defunct>
2662835 ?        00:00:00 planutils <defunct>
2662836 ?        00:00:00 python3 <defunct>
2662837 ?        00:00:00 run <defunct>
2662839 ?        00:00:07 singularity <defunct>
2663568 ?        00:00:00 python3 <defunct>
2663569 ?        00:00:00 run <defunct>
2663570 ?        00:00:00 planutils <defunct>
2663571 ?        00:00:00 python3 <defunct>
2663572 ?        00:00:00 run <defunct>
2663574 ?        00:00:16 singularity <defunct>
2675256 ?        00:00:00 python3 <defunct>
2675257 ?        00:00:00 run <defunct>
2675258 ?        00:00:00 planutils <defunct>
2675259 ?        00:00:00 python3 <defunct>
2675260 ?        00:00:00 run <defunct>
2675262 ?        00:00:06 singularity <defunct>
2675468 ?        00:00:00 python3 <defunct>
2675469 ?        00:00:00 run <defunct>
2675470 ?        00:00:00 planutils <defunct>
2675471 ?        00:00:00 python3 <defunct>
2675517 ?        00:00:00 run <defunct>
2675519 ?        00:00:11 singularity <defunct>

Looks like these aren't getting reaped? But also, I thought all the action was inside the docker containers? Or is PaaS running singularity on the host machine?

@rpgoldman
Contributor Author

Listing the running containers, I don't see anything that looks like the monitor (but I don't really know how to read a docker-compose.yml file, so I could be wrong):

CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS        PORTS                                                  NAMES
7b06fca38c82   server_web                      "gunicorn --bind 0.0…"   18 hours ago   Up 18 hours   0.0.0.0:5001->5001/tcp, :::5001->5001/tcp              server_web_1
2ba8eb63c435   server_worker                   "celery -A tasks wor…"   18 hours ago   Up 18 hours                                                          server_worker_1
f50b05015dec   server_frontend                 "gunicorn --bind 0.0…"   18 hours ago   Up 18 hours   0.0.0.0:8010->80/tcp, :::8010->80/tcp                  server_frontend_1
1ad1b3635b9f   mysql:8.0                       "docker-entrypoint.s…"   18 hours ago   Up 18 hours   0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp   server_mysql_1
1eaf3fabc1e8   redis                           "docker-entrypoint.s…"   18 hours ago   Up 18 hours   6379/tcp                                               server_redis_1

@haz
Contributor

haz commented Aug 6, 2024

Indeed, the monitor isn't there. It's all done through the Docker containers, but they're bridged on the virtual network. You may be facing a version mismatch similar to #65. Can you try running my version? It's in #62, waiting for review. Added bonus: the frontend port is now a parameter in the .env file.

@rpgoldman
Contributor Author

Aha! I tried docker compose logs and it looks like the lama-first runs claim to have succeeded, but somehow the responses are not coming back. For example:

server_worker_1    | [2024-08-06 00:03:18,655: INFO/MainProcess] Task tasks.run.package[9493497a-6ab4-4b55-a6cf-85e27d6d7843] received
server_worker_1    | [2024-08-06 00:03:38,874: INFO/ForkPoolWorker-16] Task tasks.run.package[9493497a-6ab4-4b55-a6cf-85e27d6d7843] succeeded in 20.21391299692914s: ({'stdout': '', 'stderr': '', 'call': 'timeout 20 planutils run lama-first -- domain problem', 'output': {}, 'output_type': 'generic'}, {'domain': {'value': '
server_worker_1    | (define (domain BLOCKS)
server_worker_1    |    (:requirements :strips)
server_worker_1    |    (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x) )
server_worker_1    |    (:action pick-up :parameters (?x) :precondition (and (clear ?x) (ontable ?x) (handempty)) :effect (and (not (ontable ?x)) (not (clear ?x)) (not (handempty)) (holding ?x)))
server_worker_1    |    (:action put-down :parameters (?x) :precondition (holding ?x)
server_worker_1    |             :effect (and (not (holding ?x)) (clear ?x) (handempty) (ontable ?x)))
server_worker_1    |    (:action stack :parameters (?x ?y) :precondition (and (holding ?x) (clear ?y))
server_worker_1    |             :effect (and (not (holding ?x)) (not (clear ?y)) (clear ?x) (handempty) (on ?x ?y)))
server_worker_1    |    (:action unstack :parameters (?x ?y) :precondition (and (on ?x ?y) (clear ?x) (handempty))
server_worker_1    |             :effect (and (holding ?x) (clear ?y) (not (clear ?x)) (not (handempty)) (not (on ?x ?y)))))
server_worker_1    | ', 'type': 'file'}, 'problem': {'value...', ...}})

Given that the timeout value in the call (20s) is less than the runtime reported (20.21391299692914s), is there any chance that the server is misinterpreting a timeout as successful plan generation?
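For reference, GNU timeout exits with status 124 when it kills the wrapped command, so checking the return code (rather than only capturing stdout/stderr) would distinguish a timeout from a genuinely silent success. A minimal standalone sketch, not the PaaS worker code:

```python
import subprocess

# Minimal sketch, not the PaaS worker code: GNU `timeout` exits with
# status 124 when it kills the wrapped command, and with the command's
# own exit status otherwise.
TIME_LIMIT = 1  # seconds; the real call above uses 20

res = subprocess.run(
    ["timeout", str(TIME_LIMIT), "sleep", str(TIME_LIMIT + 2)],
    capture_output=True, text=True,
)
timed_out = res.returncode == 124
print(timed_out)  # True: the inner `sleep` ran past the limit
```

If the worker only forwards stdout/stderr, a timeout and a silent success are indistinguishable, which would match the empty response above.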

@haz
Contributor

haz commented Aug 6, 2024

It could be, ya. That'd be my hunch as well... @nirlipo : any recollection on what the status code is meant to be on timeout? I'll start poking around to find it in the meantime...

@rpgoldman
Contributor Author

OK, one (last?) update. Is there any chance the monitor has failed?

server_monitor_1   |     from celery.bin.base import Command
server_monitor_1   | ImportError: cannot import name 'Command' from 'celery.bin.base' (/usr/local/lib/python3.10/dist-packages/celery/bin/base.py)

There is no server_monitor in the image list I posted above. I see that this might be #65

@haz
Contributor

haz commented Aug 6, 2024

Yep -- can you try my branch? Should hopefully fix it.

@haz
Contributor

haz commented Aug 6, 2024

So the celery process gets the time limit + 10 (just to make sure the service ends appropriately):

call = f"timeout {TIME_LIMIT} planutils run {call}"

If this is triggered, then this would have been the response:

return {"stdout":"Request Time Out", "stderr":"", "call":call, "output":{},"output_type":output_file["type"]},arguments

The planner itself is given the exact limit:

call = f"timeout {TIME_LIMIT} planutils run {call}"

So the question is what is in the stdout/stderr when the planner completes:

result={"stdout":res.stdout, "stderr":res.stderr, "call":call, "output":output,"output_type":output_file["type"]}

Weird that there would be nothing (the first response you posted). I don't think it's the planning-editor-adapter you want to use.
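Putting the two response shapes quoted above side by side, a client could classify a result like this. This is only a sketch; the field names and the "Request Time Out" marker are taken from the snippets above, and nothing here is actual PaaS client code:

```python
# Sketch of client-side classification based on the two response shapes
# quoted above; not the actual PaaS client code.
def classify(result: dict) -> str:
    if result.get("stdout") == "Request Time Out":
        return "timeout"      # the worker's timeout branch fired
    if result.get("stdout") or result.get("output"):
        return "completed"    # the planner produced something
    return "empty"            # the puzzling case reported in this issue

print(classify({"stdout": "Request Time Out", "stderr": "", "output": {}}))  # timeout
print(classify({"stdout": "", "stderr": "", "output": {}}))                  # empty
```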

@rpgoldman
Contributor Author

Should I expect the plan to be in "stdout" or in "output"?

Also, a side issue that I will raise in a separate issue: when one is running rootless and starts a privileged container, it's easy to end up with files created in mounted directories with UIDs that don't exist on the host machine, and thus cannot be removed by the host machine's owner, e.g., error checking context: no permission to read from '/home/rpg/src/planning-as-a-service/server/db_data/#ib_16384_0.dblwr'. That file's UID and GID are both 328532966 on the host machine.

If one must do this, there's some way to sync the uids. I will talk to my colleagues and ask them for the trick. But, of course, if one can avoid running as root while writing files in shared directories, that's great.

@haz
Contributor

haz commented Aug 6, 2024

Should I expect the plan to be in "stdout" or in "output"?

Without an adapter specified, I think it's in result/stdout. At least that's my read of the code. If you specify an adapter, it will try to rework the json response according to what the adapter specifies.

But, of course, if one can avoid running as root while writing files in shared directories, that's great.

This issue has plagued me across many projects. Would love to know an answer to how to treat it more appropriately!

@rpgoldman
Contributor Author

This issue has plagued me across many projects. Would love to know an answer to how to treat it more appropriately!

See #66

We have a local approach to syncing file ownership, but it involves a fair amount of shell scripting to sync UIDs between the host and the Dockerfile, where the Dockerfile imports a shell script (realuser.sh) and runs it on startup. There seems to be a bunch of black art involved. I will see what I can learn and pass on anything I find in #66.

@rpgoldman
Contributor Author

Sorry to be so difficult. I grabbed your new version @haz and now I see this in docker compose logs:

server_web_1       | [2024-08-06 17:26:19,274] ERROR in app: Exception on /package/lama-first/solve [POST]
server_web_1       | Traceback (most recent call last):
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2447, in wsgi_app
server_web_1       |     response = self.full_dispatch_request()
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1952, in full_dispatch_request
server_web_1       |     rv = self.handle_user_exception(e)
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask_cors/extension.py", line 165, in wrapped_function
server_web_1       |     return cors_after_request(app.make_response(f(*args, **kwargs)))
server_web_1       |   File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1821, in handle_user_exception

Unfortunately, there's no indication of what the user exception e is.

In the client Jupyter notebook, I see this

{'Error': 'This Planutils package is not configured correctly'}

That's the value of solve_request in:

package_name = "lama-first"
service = "solve"
service_url="http://"+ip+":5001/package/"+package_name+"/"+service
solve_request = requests.post(service_url, json=req_body,headers=headers).json()
solve_request

I'm at a bit of a dead end here, since when I search the AI-Planning group on GitHub for "This Planutils package" I find nothing.

@haz
Contributor

haz commented Aug 6, 2024

Aye, this is the kind of catch-all, and more info really should be logged. Usually it occurs when I try calling a planner that's not installed (or it doesn't have that privileged status to run).

# For any unexpected error, this error message will return

Note that the error parameter isn't touched at all :P. I can add it in tonight (on my branch), and at least get better logging.

If you have a domain/problem pairing you're willing to share, I can make sure I get things to a workable state on my side.

@rpgoldman
Contributor Author

I was just using the example in the Jupyter notebook in the planning-as-a-service repo. I will attach my version of that here.

Planning-as-service.ipynb.txt

@rpgoldman
Contributor Author

Aye, this is the kind of catch-all, and more info really should be logged. Usually it occurs when I try calling a planner that's not installed (or it doesn't have that privileged status to run).

# For any unexpected error, this error message will return

Note that the error parameter isn't touched at all :P. I can add it in tonight (on my branch), and at least get better logging.

Oh, yes. It might be reasonable to put the exception in there (presumably one can translate arbitrary exceptions into JSON?) and possibly even stash the backtrace...
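For what it's worth, the standard library covers both pieces. A sketch of serializing an arbitrary exception plus its backtrace into JSON (none of this is PaaS code):

```python
import json
import traceback

# Sketch: turn an arbitrary exception into a JSON-safe payload,
# including the backtrace, using only the standard library.
def exception_to_json(exc: Exception) -> str:
    return json.dumps({
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    })

try:
    raise ValueError("planner misconfigured")
except ValueError as err:
    payload = exception_to_json(err)

print(json.loads(payload)["type"])  # ValueError
```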

@haz
Contributor

haz commented Aug 6, 2024

It may be server internal stuff, and perhaps not the wisest to send back...it fires on any 500 error. But at present it's not even in the server logs, and that should be fixed!

@rpgoldman
Contributor Author

Are the server logs what I see in docker compose logs? Or are there files somewhere I should be examining?

@rpgoldman
Contributor Author

It may be server internal stuff, and perhaps not the wisest to send back...it fires on any 500 error. But at present it's not even in the server logs, and that should be fixed!

Random thought: could you catch any errors raised by the planners themselves and give them a specific type? Then you could send along exceptions of that type and their backtraces, and ignore miscellaneous errors that are not of type PlanSolverError (or whatever). I don't know enough about the server's architecture to know whether that's feasible.

@haz
Contributor

haz commented Aug 7, 2024

The problem with that one is that every planner has its own custom way of running and failing. I think one thing we could do is have planner authors implement their own way to run it so that the result is a JSON output. Then we just echo that JSON back to the user who invoked it.

E.g., the Fast Downward driver (written in Python) could have a top-level try-catch that dumps a JSON with failure info when a failure occurs. That also makes it easy for arbitrary planning software to be hosted (e.g., domain generators, model acquisition techniques, VAL, etc.). It just needs to be on the individual package maintainer's side, rather than something cross-cutting for all the solvers.
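A sketch of that per-package convention: the driver's entry point catches everything and returns a JSON-shaped result the server can echo back verbatim. Here run_planner and main are hypothetical stand-ins, not any planner's real entry point:

```python
import traceback

# Sketch of the per-package convention proposed above; `run_planner`
# is a hypothetical stand-in for a package's real entry point.
def run_planner(domain: str, problem: str) -> str:
    raise RuntimeError("translator failed")  # simulate a planner failure

def main(domain: str, problem: str) -> dict:
    try:
        return {"status": "ok", "result": run_planner(domain, problem)}
    except Exception as exc:
        return {
            "status": "error",
            "error": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }

result = main("domain.pddl", "problem.pddl")
print(result["status"])  # error
```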

@haz
Contributor

haz commented Aug 8, 2024

Minor change for better error handling: haz#2
