Metadata files generated with RayTaskRunner #16009

Open
dqueruel-fy opened this issue Nov 13, 2024 · 6 comments · May be fixed by #16022
Labels
bug Something isn't working

Comments

@dqueruel-fy

Bug summary

Issue description

I don't know if this is a bug or intended behavior, but some metadata files are generated each time I run my flows locally. That's annoying because the files are generated in my source directory (or wherever I run the flows/tasks from). I'd like more information on what these files are and whether they can be generated somewhere else or, ideally, not generated at all.

It generates files with names like 89e55eaee58e8ce3567e87801196d9d5 in the folder from which I call the Python script (see below), with the following content:

{
    "metadata": {
        "storage_key": "/Users/<path to my local source dir>/89e55eaee58e8ce3567e87801196d9d5",
        "expiration": null,
        "serializer": {
            "type": "pickle",
            "picklelib": "cloudpickle",
            "picklelib_version": null
        },
        "prefect_version": "3.1.2",
        "storage_block_id": null
    },
    "result": "gAVLAS4=\n"
}
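
For readers wondering what the "result" field holds: judging from the "serializer" block, it looks like a base64-encoded pickle of the return value. The sketch below decodes the exact string from the file above using only the standard library. It assumes the payload is plain base64 over pickle bytes, which this sample is consistent with; it is not Prefect's own loading code, and `decode_result` is a hypothetical helper name.

```python
import base64
import json
import pickle

# A trimmed copy of the file shown above; only the fields used here are kept.
# "\\n" in this Python literal becomes the JSON escape "\n" (a newline).
raw_file = '{"metadata": {"serializer": {"type": "pickle"}}, "result": "gAVLAS4=\\n"}'


def decode_result(file_text: str):
    """Return the Python object stored in a pickle-serialized result file.

    Hypothetical helper for illustration. Only unpickle files you trust:
    unpickling can execute arbitrary code.
    """
    payload = json.loads(file_text)["result"]
    # b64decode discards the trailing newline by default (validate=False).
    return pickle.loads(base64.b64decode(payload))


decoded = decode_result(raw_file)
print(decoded)
```

Run on the sample above, this prints 1, which matches the value returned by taskA in the script below.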

The minimal reproducible Python script is:

from prefect import flow, task
from prefect_ray import RayTaskRunner

@task(log_prints=True, persist_result=True)
def taskA():
    print("Task A")
    return 1

@flow(log_prints=True, persist_result=True, task_runner=RayTaskRunner)
def myFlow():
    print("In my flow")
    taskA.submit().wait()
    return 0

myFlow()

Version info

Version:             3.1.2
API version:         0.8.4
Python version:      3.11.9
Git commit:          02b99f0a
Built:               Tue, Nov 12, 2024 1:38 PM
OS/Arch:             darwin/arm64
Profile:             local
Server type:         server
Pydantic version:    2.8.2
Integrations:
  prefect-ray:       0.4.2

Additional context

Some notes:

  • These files are not generated when I remove the RayTaskRunner or when I set persist_result to False.
  • I first saw these files being generated when I upgraded prefect from 3.0.0rc14 to 3.1.1 in my code base, and I reproduced it in this minimal example.
  • I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help
  • [Screenshot: the flow and task run from the minimal Python script]
dqueruel-fy added the bug label Nov 13, 2024
@cicdw
Member

cicdw commented Nov 13, 2024

Hey @dqueruel-fy - those files are a consequence of persisting task and flow results.

I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help

This setting takes effect at workflow runtime, so setting it on the server has no effect (all server configuration is prefixed with PREFECT_SERVER_). If you set it within the process in which your workflows execute, you should see the desired behavior.

For more information, check out the documentation on results and settings:

@zzstoatzz
Collaborator

zzstoatzz commented Nov 13, 2024

hi @dqueruel-fy - yes this sounds like expected behavior, that metadata is your serialized result

» PREFECT_LOCAL_STORAGE_PATH=/tmp/result ipython

In [1]: from prefect import task

In [2]: @task(persist_result=True)
   ...: def f():
   ...:     return 42
   ...:

In [3]: f()
16:35:23.491 | INFO    | Task run 'f' - Finished in state Completed()
Out[3]: 42

In [4]: !ls /tmp/result
109c10d275731f842f4b08dd51b397aa

when you say

I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help

... was about to type the same as @cicdw above, nevermind 🙂

@dqueruel-fy
Author

@zzstoatzz @cicdw thanks for your quick answers! I understand that the files need to be generated, but I don't understand why they are generated in my code base. My mention of PREFECT_LOCAL_STORAGE_PATH was probably misleading: I meant that both the default value and /tmp/result produce the same issue (files generated in my source directory).

These files are generated in the directory from which I call the Python scripts.

Have you tried running my minimal Python script from, say, ~/Download? If you see the same behavior as I do, new files will be generated in ~/Download.

I guess the expected behavior is for these files to be generated in PREFECT_LOCAL_STORAGE_PATH, not in the directory from which I call the script, right?

@cicdw
Member

cicdw commented Nov 14, 2024

Ah, I think I have a suspicion about what's going on! A few details (some of which are repetitive, just for completeness' sake):

  • whenever PREFECT_LOCAL_STORAGE_PATH is not set (and when there is no default storage block either), the default storage location for results is the present working directory as you've seen
  • this setting must be set on the client that executes the workflow to take effect
  • Ray uses multiple processes (or machines, but it sounds like you are running Ray locally on one machine) for distributing work
  • setting this as an environment variable in one runtime but not in the runtime of the Ray workers will cause any tasks executed on the workers to not pick up the setting

If my suspicion is correct that you are only setting this setting on the "parent" process that executes the flow and not on the Ray workers, the easiest solution is probably to use a .env file or prefect.toml file to persist this setting across all processes started in that directory.
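
The last two bullets describe ordinary process behavior: a process only sees the environment variables it was started with. The sketch below is a generic standard-library illustration of that mechanism, not a reproduction of how Ray launches its workers; `child_sees` is a hypothetical helper, and whether this mechanism is what actually causes the files in this issue is only the suspicion stated above.

```python
import os
import subprocess
import sys

VAR = "PREFECT_LOCAL_STORAGE_PATH"
# Code run by the child process: print the variable, or a marker if it is absent.
CHILD_CODE = f"import os; print(os.environ.get({VAR!r}, '<unset>'))"


def child_sees(env: dict) -> str:
    """Start a fresh Python process with the given environment and
    return what it sees for VAR (hypothetical helper for illustration)."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Environment of a "parent" process that has the setting.
parent_env = {**os.environ, VAR: "/tmp/result"}
# A child started with the parent's environment inherits the setting.
inherited = child_sees(parent_env)

# A child started with an environment lacking the variable never sees it,
# regardless of what the parent has set.
stripped_env = {k: v for k, v in parent_env.items() if k != VAR}
missing = child_sees(stripped_env)

print(inherited)
print(missing)
```

A worker in the second situation would fall back to whatever default applies in its own process.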

@dqueruel-fy
Author

Thanks @cicdw for your insight!

So I've tested again, this time using prefect config set PREFECT_LOCAL_STORAGE_PATH.
The resulting server settings are:

% prefect config view
🚀 you are connected to:
http://127.0.0.1:4200
PREFECT_PROFILE=<profile>
PREFECT_API_URL='http://127.0.0.1:4200/api' (from profile)
PREFECT_LOCAL_STORAGE_PATH='/tmp/test' (from profile)

And when running my example script, I still get one file generated in /tmp/test (the flow's?) and one in my current working directory (the task's?).

I've also tried providing the env var to the RayTaskRunner like this, but that didn't help:

@flow(log_prints=True, persist_result=True, task_runner=RayTaskRunner(init_kwargs={"runtime_env": {"env_vars":{"PREFECT_LOCAL_STORAGE_PATH": "/tmp/test"}}}))
def myFlow():
   ...

Could you provide more information on how to use the .env or prefect.toml files, please?

@zzstoatzz
Collaborator

zzstoatzz commented Nov 14, 2024

hi @dqueruel-fy - I am looking into this now (this seems like a bug).

it looks like when in ray, the task is unable to discover the parent context's result store and falls back to a default, relative path
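
A relative fallback path would explain why the files follow the directory the script is launched from: the operating system resolves relative paths against the process's current working directory. The sketch below shows only that resolution rule with the standard library. It is not Prefect's result-store code, `where_would_it_land` is a hypothetical helper, and the temporary directory stands in for a source directory.

```python
import os
import tempfile
from pathlib import Path


def where_would_it_land(storage_path: str, working_dir: str) -> Path:
    """Resolve storage_path as a process running in working_dir would
    (hypothetical helper for illustration)."""
    previous = os.getcwd()
    os.chdir(working_dir)
    try:
        return Path(storage_path).resolve()
    finally:
        # Always restore the original working directory.
        os.chdir(previous)


key = "89e55eaee58e8ce3567e87801196d9d5"  # file name from the original report

with tempfile.TemporaryDirectory() as source_dir:
    # A bare, relative key lands in whatever directory the process runs in.
    relative = where_would_it_land(key, source_dir)
    in_source_dir = relative.parent == Path(source_dir).resolve()

    # An absolute storage path is unaffected by the working directory.
    absolute_input = os.path.join(tempfile.gettempdir(), "result", key)
    absolute = where_would_it_land(absolute_input, source_dir)

print(in_source_dir)
print(absolute.parent.name)
```

The first check prints True: the relative key resolves inside the stand-in source directory, while the absolute path stays under its "result" folder no matter where the process runs.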

will update hopefully soon!

I don't think prefect.toml helps in the context of this issue, but if you're generally curious I'd check this out.

@zzstoatzz zzstoatzz linked a pull request Nov 14, 2024 that will close this issue