JobController submits random jobs #238

Open
cote3804 opened this issue Jan 12, 2025 · 5 comments

@cote3804

Hi again @gpetretto and other developers,

I'm experiencing a bug where, after submitting a flow with one job using the following code,

from jobflow_remote import submit_flow
from jobflow import Flow
from atomate2.jdftx.jobs.core import LatticeMinMaker
from atomate2.jdftx.sets.core import LatticeMinSetGenerator
from pymatgen.core.structure import Structure
import pymatgen.io.jdftx.jdftxinfile_master_format
from qtoolkit.core.data_objects import QResources

bulk_iro2 = Structure.from_file("POSCAR_test")

# settings compliant with RPA
maker = LatticeMinMaker(
    input_set_generator=LatticeMinSetGenerator(
        user_settings={
            "lattice-minimize": {"nIterations": 100},
            "latt-move-scale": {"s0": 1, "s1": 1, "s2": 1},
            "elec-smearing": {
                "smearingType": "MP1",
                "smearingWidth": 0.1,
                },
            "van-der-waals": None,
        },
        calc_type="bulk",
        pseudopotentials="SG15",
    ),
)
bulk_relax_job = maker.make(bulk_iro2)
flow = Flow([bulk_relax_job])
resources = QResources(
    job_name="input_settings_test",
    nodes=1,
    account="XXXXX",
    threads_per_process=32,
    processes_per_node=4,
    time_limit=30*60, # seconds
    queue_name="debug",
    scheduler_kwargs={
        "partition": "debug",
        "constraint": "gpu",
        "ntasks_per_node": 4,
        "qverbatim": "#SBATCH --gpus-per-task=1 \n#SBATCH --gpu-bind=none"
    },
)

submit_flow(flow, project="IrO2", worker="perlmutter", resources=resources)

jobflow-remote will submit old flows that previously finished. The screenshot below shows that after running the script above, 6 other jobs were added.

[screenshot: job list showing the six extra jobs]

The main log file reports that 3 flows were added, each with 3 jobs:

2025-01-12 09:49:06,831 [INFO] ID 44391 jobflow_remote.jobs.jobcontroller: Appended flow (300e9145-c3c4-43f0-a4dd-077b2f996291) with jobs: ('ffef9795-18c7-44c6-9710-cf442cdf0410', 'f9f23b05-da3c-44d6-9219-4dcaeb70aada', 'a6bd526c-09c7-4f15-a08d-8b9355b3198a')
2025-01-12 09:49:07,719 [INFO] ID 44391 jobflow_remote.jobs.jobcontroller: Appended flow (fe7c1015-e273-420f-95dc-6ad7e10f488c) with jobs: ('e8b1902b-a91f-4432-a6ab-b73b6bb3f8e9', 'c149f53d-0df5-45d5-9be3-d5518bd143e5', '3326469f-be3d-400a-bcbe-961ad954ffa0')
2025-01-12 09:49:08,589 [INFO] ID 44391 jobflow_remote.jobs.jobcontroller: Appended flow (4de6c118-688f-4c67-b921-f1a9cad4b47e) with jobs: ('db8afaf7-7cd8-4200-9ece-ad1c9a02a811', 'f1d29046-23b9-4f7e-9998-e2765e4491d5', '2d935fbd-47d2-4d91-89f2-ffbfc363046d')

The jobs being added are jobs that I ran yesterday.
Any idea what's happening here? I've attempted restarting the runner to no avail. I've also gone through the code up through

try:
    self.flows.insert_one(flow_doc)
    self.jobs.insert_many(job_dicts)

and the issue is not here, as only one flow is submitted at this point.
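
One way to sanity-check this further, assuming direct pymongo access to the queue database and the default flows/jobs collection names (the connection details below are placeholders), is to compare the collection counts before and after running the submission script:

from pymongo import MongoClient

# Placeholder connection details; use the queue store settings from the
# project configuration instead.
db = MongoClient("mongodb://localhost:27017")["jobflow_remote_queue"]

# Snapshot the collection sizes, run the submission script, then compare.
print("flows:", db["flows"].count_documents({}))
print("jobs:", db["jobs"].count_documents({}))

If a single submission of a one-job flow raises the counts by more than one flow and one job, the extra documents are being inserted somewhere outside this code path.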

I'll note that the test_molecule_jobs above are using jobflow's replace functionality, and all of the incorrectly submitted flows are using a batch worker.

Thanks!

@gpetretto

Hi @cote3804,
thanks for reporting this. Let me summarize the points to check whether I got everything correctly:

  • you first submitted a Flow to the batch worker (let's call it Flow1). This flow made use of the replace functionality. Flow1 completed correctly.
  • at a later time you submitted the flow with the LatticeMinMaker job. During the submission procedure only that flow was added to the DB.
  • The runner resubmits the Jobs belonging to Flow1.

Is this a good summary of what is happening? In particular, is it correct that the Jobs that are submitted to the batch runner (e.g. those with db_id 257-262 in your screenshot) are the same jobs that were already present in the DB, and just went back to the READY state? (or some other state?) From the description it is not 100% clear if these jobs were already COMPLETED and were just resubmitted, or if entirely new jobs have been added to the DB.
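
(As an illustration of the kind of check I mean, here is a minimal pymongo sketch, assuming direct access to the queue database and the default collection and field names; the connection details are placeholders:)

from pymongo import MongoClient

# Placeholder connection details; use the project's queue store settings.
jobs = MongoClient("mongodb://localhost:27017")["jobflow_remote_queue"]["jobs"]

# List every job named "test_molecule_job" with its db_id, uuid+index and
# state, to see whether the new entries reuse an old uuid or are brand new.
query = {"job.name": "test_molecule_job"}
projection = {"db_id": 1, "uuid": 1, "index": 1, "state": 1}
for doc in jobs.find(query, projection).sort("db_id", 1):
    print(doc["db_id"], doc["uuid"], doc["index"], doc["state"])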

I tried to quickly replicate this with a simple example, but the problem did not show up. So I will need more details from your side:

  • Can you confirm that this was not happening in the beginning and only started happening at some point? Is there any other difference from what you were doing before?
  • does this keep happening? i.e., if you submit a new Flow, do these jobs keep being resubmitted?
  • when the batch jobs are completed, are there any leftover files in the batch folders of the worker?
  • Are the jobs that are resubmitted only those that are dynamically generated by the "replace" procedure, or are all the jobs of Flow1 resubmitted?

Thanks

@cote3804

Hi @gpetretto

Yes, your summary is correct. Jobs 257-262 are completely new from what I can tell. The jobs that preceded them reached a FAILED or COMPLETED state:
[screenshot: earlier jobs, including 248 and 249, in FAILED/COMPLETED states]
In the screenshot you can see that 249 and 248 are also labeled test_molecule_job, but that they completed. Those jobs are seemingly being resubmitted as new jobs with db_id 257-259.

Responding to your bulleted questions:

  • Correct, this was not happening at the beginning. Things were working properly when I had a batch worker running jobs. The problem does seem to coincide with introducing the replace jobs, even though some of the jobs that are being resubmitted erroneously (db_id 260 in the original screenshot) do not use replace in the flow.
  • Yes, these jobs keep getting resubmitted and it seems like the list of resubmitted jobs keeps growing with each flow I submit. I'll note as well that when this started occurring, it seemed to be duplicating jobs. So I would submit one test_molecule_job in a flow, and two would be added to the database. It isn't clear to me if the same job was being duplicated or if it was resubmitting an old test_molecule_job. Unfortunately, the JobDocs are identical so I can't verify if a new job was being duplicated or an old one was being resubmitted.
  • There are currently 4 files in the jobs_handle_dir/running dir. When I ask for the job info associated with those files, I get:
    [screenshot of the job info output]
    Furthermore, all of the files in this directory correspond to `surface_ionic_min` jobs, whereas some of the erroneously submitted jobs are `test_molecule_job`s.
  • All of Flow1 is resubmitted. Flow1 contains one job that gets replaced by two, both of which are added to the db and executed successfully.

Let me know if you need me to run any tests or provide more detail.

@gpetretto

Hi @cote3804,
thanks a lot for all the additional details. I must admit that I don't see any obvious way in which this could happen.
Let me ask you a few more questions:

  • just to be sure, if the runner is shut down and you submit a flow, do you still get duplicated jobs? Or is it just at the moment you start the runner that the duplicates appear?
  • if you just stop and restart the runner, is there any additional job that gets submitted? Or does this happen only when you explicitly submit some new job?
  • From what I see in the screenshot, it seems that the duplicated jobs have different db_id and different uuid+index among them. Can you check if you have any overlap of uuids (maybe with the original job that is being duplicated)? This may help understand whether an entirely new Flow is being created or whether it is "cloning" a previously existing one in the DB, although this should be prevented by the indexes in the DB. (See the sketch after this list for one way to check.)
  • Do you have the same behaviour even if you submit a very simple Flow? For example:
     from jobflow import Flow
     from jobflow_remote import submit_flow
     from jobflow_remote.testing import add
    
     j = add(1, 5)
     flow = Flow([j])
     submit_flow(flow, worker="local_shell_worker")
  • are the files in the jobs_handle_dir/running folder related to the jobs that are being duplicated? Or are there even files for jobs that are not being duplicated?
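
As a minimal sketch of the uuid-overlap check mentioned above (assuming direct pymongo access to the queue database and the default collection and field names; the connection details are placeholders):

from pymongo import MongoClient

# Placeholder connection details; point this at the queue store configured
# for the jobflow-remote project.
jobs = MongoClient("mongodb://localhost:27017")["jobflow_remote_queue"]["jobs"]

# Group jobs by (uuid, index) and flag any pair that occurs more than once:
# that would mean an existing job document was "cloned" under a new db_id,
# which the unique index on uuid+index should normally prevent.
pipeline = [
    {"$group": {
        "_id": {"uuid": "$uuid", "index": "$index"},
        "count": {"$sum": 1},
        "db_ids": {"$addToSet": "$db_id"},
    }},
    {"$match": {"count": {"$gt": 1}}},
]
for doc in jobs.aggregate(pipeline):
    print(doc["_id"], "->", doc["db_ids"])

An empty result would indicate that the duplicates are entirely new documents with fresh uuids, rather than copies of existing ones.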

Ultimately, if nothing comes up, would you be available to send me a dump of your DB (e.g. with the jf backup create command)?

@cote3804

Hi @gpetretto

I was able to test a few things related to this issue.

  • It is indeed new jobs that are being duplicated rather than old jobs being resubmitted. I created a new flow this morning and submitted it; it ended up being duplicated, and the old test_molecule_job was erroneously added as well.
  • If the runner is shut down, the erroneous extra jobs are still added in the READY state
  • Stopping and restarting the runner has no effect on the creation of jobs from what I can tell.
  • The example flow you sent with the add job was indeed duplicated.

I'm happy to send you my database state. Is there an email you'd prefer me to send it to?
I'm debating resetting my database since it's still entirely test data. Do you want me to keep it around for more testing and create a new project with a fresh database instead?

@gpetretto

Hi @cote3804. Sorry for the delay and thanks for all the tests.

I had also received a notification with a message reporting that the issue seemed to be present only when submitting the job through VS Code, but I don't see that message here on GitHub. Does this mean that was not the problem?
Since the duplication happens even if the runner is not active, I was in fact wondering whether the problem could come from previous submission scripts that are mistakenly executed again when you run a new submission script. Could this be the case?
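
If that turns out to be the cause, a generic Python safeguard (nothing specific to jobflow-remote) is to keep flow construction free of side effects and guard the submission call, so that importing or re-running the file from an editor session cannot resubmit by accident. A sketch, reusing the add example from above:

from jobflow import Flow
from jobflow_remote import submit_flow
from jobflow_remote.testing import add


def build_flow() -> Flow:
    # Build the flow without submitting it, so importing this module
    # has no side effects.
    return Flow([add(1, 5)])


if __name__ == "__main__":
    # Submit only when this file is executed directly, not when another
    # script or an interactive session imports it.
    submit_flow(build_flow(), worker="local_shell_worker")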
