Script interface instead of cli tool. BREAKING #23

Merged 21 commits on Oct 3, 2023
3 changes: 2 additions & 1 deletion .github/workflows/CI.yml
@@ -21,7 +21,8 @@ jobs:
fail-fast: false
matrix:
version:
- '1.8'
- '1.9'
- '~1.10.0-0'
- 'nightly'
os:
- ubuntu-latest
1 change: 0 additions & 1 deletion Comonicon.toml

This file was deleted.

11 changes: 5 additions & 6 deletions Project.toml
@@ -1,26 +1,25 @@
name = "MEDYANSimRunner"
uuid = "b58a3b99-22e3-44d1-b5ea-258f082a6fe8"
authors = ["nhz2 <[email protected]>"]
version = "0.3.0"
version = "0.4.0"

[deps]
Comonicon = "863f3e99-da2a-4334-8734-de3dacbe5542"
ArgCheck = "dce04be8-c92d-5529-be00-80e4d2c0e197"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
DeepDiffs = "ab62b9b5-e342-54a8-a765-a90f495de1a6"
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
FileWatching = "7b1f6079-737a-58dc-b8bc-7a2ca5c1b5ee"
InteractiveUtils = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
LoggingExtras = "e6f89c97-d47a-5376-807f-9c37f3926c36"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
SHA = "ea8e919c-243c-51af-8825-aaa63cd721ce"
SmallZarrGroups = "d423b6e5-1c84-4ae2-8d2d-b903aee15ac7"
TOML = "fa267f1f-6049-4f14-aa54-33bafae1ed76"

[compat]
Comonicon = "1"
DeepDiffs = "1"
JSON3 = "1"
LoggingExtras = "1"
SmallZarrGroups = "0.5"
SmallZarrGroups = "0.5, 0.6"
julia = "1.8"
199 changes: 49 additions & 150 deletions README.md
@@ -4,97 +4,66 @@

Manage long running restartable MEDYAN.jl simulations.

Simulations run using code stored in an `input` directory and write outputs to an `output` directory.
Simulations run using julia code in a `main.jl` script and write outputs to an `output` directory.

Inspired by how build scripts work in https://github.com/JuliaPackaging/BinaryBuilder.jl

## Installation
First install and run Julia https://julialang.org/downloads/

Then in Julia install this repo as a regular Julia package.
```julia
import Pkg

Pkg.add("https://github.com/medyan-dev/MEDYANSimRunner.jl")
```

This will add a `medyansimrunner` to `~/.julia/bin`, so add `~/.julia/bin` to your PATH.

Run:
```sh
medyansimrunner -h
Pkg.add("MEDYANSimRunner")
```
To see the help.

## Example
Run the following in the root of this project.
Run the following in the root of this repo.
```sh
cd test/examples/good
medyansimrunner run input output 1
julia --project=test -e 'using Pkg; pkg"dev ."; pkg"instantiate";'
JULIA_LOAD_PATH="@" julia --project=test --startup-file=no test/example/main.jl --out=test/output --batch=1 --continue
```
This will run the example simulation in `test/examples/good/input` with job index `"1"` and store the output in `test/examples/good/output/1`.

The `job_idx` string gets passed to the `setup` function in `main.jl`.
This will run the 1st batch of the example simulation in `test/example/main.jl`
with the `test/` environment and store the output in `test/output/`.

The `job_idx` is hashed and set as the default RNG seed right before `setup` is called.
The output directory will be created if it doesn't already exist.

Any backslash in the job index will be replaced with a "/".
If the `"--batch=<job index>"` option is not included, all jobs specified in `main.jl` will be run.

The job index must be valid utf-8.

Job index must not be empty.
### `main.jl` script

Each part of job index when split by "/" must not contain any of the following characters:
This file contains the julia functions used when running the simulation.
These functions can modify the input state variable, but in general should return the state.
These functions can also use the default random number generator; its state will automatically be saved and loaded.

At the end of `main.jl` there should be the lines:
```julia
[ ',', '\r', '\n', '\0', '*', '|', ':', '<', '>', '?', '"',]
```

Each part must not end or start in a period or dot.

The output directory will be created if it doesn't already exist.

The job index string can be loaded from a line of a file.

For example, to run a job with a index in the third line of file `jobnames.txt` use:

```sh
medyansimrunner run input output jobnames.txt 3
if abspath(PROGRAM_FILE) == @__FILE__
MEDYANSimRunner.run_sim(ARGS; jobs, setup, loop, load_snapshot, save_snapshot, done)
end
```


## input kwargs

- `step-timeout`: the maximum amount of time in seconds each step is allowed to take before the job is killed, defaults to infinity.

- `max-steps`: the maximum number of steps a job is allowed to take before the job is killed.

- `startup-timeout`: the maximum amount of time in seconds to load everything and run the first loop, defaults to infinity.

- `max-snapshot-MB`: the maximum amount of hard drive space each snapshot is allowed to use in megabytes.

## `input` directory

The input directory must contain a `main.jl` file, a `Manifest.toml`, and a `Project.toml`.

The input directory will be the working directory of the simulation and can include other data needed for the simulation, including an `Artifacts.toml`

The input directory should not be mutated during or after a simulation.

### `main.jl` file

This file contains the julia functions used when running the simulation.
These functions can modify any input state variables, but in general should return the state.
These functions can also use the default random number generator, this will automatically saved and loaded.
These lines run the simulation when `main.jl` is called as a Julia script.
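
The `abspath(PROGRAM_FILE) == @__FILE__` check is a common Julia idiom for telling a direct script run apart from an `include`. A minimal sketch of how it behaves, assuming a hypothetical stand-in `run_everything` in place of `MEDYANSimRunner.run_sim` (the stand-in and its return value are for illustration only and are not part of the package):

```julia
# Hypothetical stand-in for MEDYANSimRunner.run_sim, so this sketch
# runs without the package installed.
function run_everything(args::Vector{String})
    println("running with arguments: ", args)
    return length(args)
end

# True only when this file is the script passed to `julia`,
# false when it is `include`d from the REPL or from another file.
is_script() = abspath(PROGRAM_FILE) == @__FILE__

if is_script()
    run_everything(ARGS)
end
```

With this guard, the functions in the file can be loaded with `include` for interactive testing without starting a simulation.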

#### Standard input parameters.
- `step::Int`: starts out at 0 after setup and is auto incremented right after every `loop`.

#### `setup(job_idx::String; kwargs...) -> header_dict, state`
Return the header dictionary to be written as the `header.json` file in output.
#### `jobs::Vector{String}`
A vector of jobs to run. Each job represents one variant of the simulation that can be run.
This is useful if many simulations need to be run in parallel. The `"--batch=<job index>"` argument
can be used to pick just one job to run.

The selected `job` string gets passed to the `setup` function in `main.jl`.
The `job` string is also used to seed the default RNG right before `setup` is called.
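
As a rough illustration of the seeding described here, the sketch below reuses the expression from the README's main loop pseudo code, `Random.seed!(collect(reinterpret(UInt64, sha256(job))))`. The helper name `job_seed` is hypothetical, and this is not necessarily how the package implements seeding internally.

```julia
using Random
using SHA

# Hypothetical helper: derive a reproducible seed from the job string,
# following the expression shown in the README's main loop pseudo code.
job_seed(job::AbstractString) = collect(reinterpret(UInt64, sha256(job)))

Random.seed!(job_seed("1"))
a = rand(UInt64)

Random.seed!(job_seed("1"))
b = rand(UInt64)

Random.seed!(job_seed("2"))
c = rand(UInt64)

println(a == b)  # same job string, so the random stream repeats
println(a == c)  # different job strings are expected to give different streams
```

Because the seed depends only on the job string, rerunning the same job should start `setup` from the same RNG state.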

#### `setup(job::String; kwargs...) -> header_dict, state`
Return the header dictionary to be written as the `header.json` file in output trajectory.
Also return the state that gets passed on to `loop` and the state that gets passed to `save_snapshot` and `load_snapshot`.

`job_idx::String`: The job index. This is used for multi job simulations.
`job::String`: The job. This is used for multi job simulations.

#### `save_snapshot(step::Int, state; kwargs...)::SmallZarrGroups.ZGroup`
#### `save_snapshot(step::Int, state; kwargs...) -> group::SmallZarrGroups.ZGroup`
Return the state of the system as a `SmallZarrGroups.ZGroup`
This function should not mutate `state`

@@ -116,20 +85,14 @@ This function should not mutate `state`
Return the state that gets passed to `save_snapshot`



### `Manifest.toml` and `Project.toml`

These contain the julia environment used when running the simulation.
These must contain SmallZarrGroups, JSON3, and LoggingExtras, because these are required for saving data.

### Main loop pseudo code

```
activate and instantiate the environment
include("main.jl")
create output directory based on job_idx if it doesn't exist
Random.seed!(collect(reinterpret(UInt64, sha256(job_idx))))
job_header, state = setup(job_idx)
create output directory based on job if it doesn't exist
Random.seed!(collect(reinterpret(UInt64, sha256(job))))
job_header, state = setup(job)
save job_header
step = 0
SmallZarrGroups.save_dir(snapshot_zip_file, save_snapshot(step, state))
@@ -149,87 +112,23 @@ end

## `output` directory

The output directory has an `out$job_idx` subdirectory for job `job_idx`'s output.

Each out subdirectory has the following files.

### `info.log`
Any logs, warnings, and errors generated by the simulation are saved in this file.
The output directory has a subdirectory for each job's output.
The job string is the name of the subdirectory.

### `warn.log`
Any warnings, and errors generated by the simulation are saved in this file.
Each job's output subdirectory has the following files.

### `error.log`
Any errors generated by the simulation are saved in this file.
### `logs/<timestamp_randomstring>/{info|warn|error}.log`
Any logs, warnings, and errors generated by the simulation are saved in these files.

### `header.json`
### `traj/header.json`
A description of the system.

### `list.txt`
Data describing the saved snapshots, and if the simulation is done or errored, or needs to be continued.

The last element in each line is the sha256 of the line, not including the last comma space, and hash value.


The first line is.
```
version = 1, job_idx = 1, input_tree_hash = 5a936e..., 54bf8d69288...
```
- `version`: version of the info.txt format.
- `job_idx`: index of the job.
- `input_tree_hash`: hash of input directory calculated with [`my_tree_hash`](src/treehash.jl)

The second line is:
```
header_sha256 = 2cf934..., 312f788...
```
- `header_sha256`: hash of header.json.
Or:
```
Error starting job, 8d69288...
```

After these lines each of the next lines correspond to a saved snapshot.

These have the format:
```
yyyy-mm-dd HH:MM:SS, step number, nthreads, julia versioninfo, rng state, snapshot sha256, line sha256
```

`snapshot sha256` is the sha256 of the snapshot zip file.

The final line explains how the simulation ended it can be one of the following:
```
Error starting job, line sha256
```

```
Error running job, line sha256
```

```
Error startup_timeout of $startup_timeout seconds reached, line sha256
```

```
Error step_timeout of $step_timeout seconds reached, line sha256
```

```
Error max_steps of $max_steps steps reached, line sha256
```

```
Error max_snapshot_MB of $max_snapshot_MB MB reached, line sha256
```

```
Done, line sha256
```

See the log files for more details and error messages.

### `traj/snap<step>.zarr.zip`
Contains `snap$i.zarr.zip` files where `i` is the step of the simulation.
The state returned by `setup` is stored in `snap0.zarr.zip`
The user data is stored in the `"snap"` sub group. The root group contains
some metadata used by `MEDYANSimRunner`.

### `snapshots` subdirectory
Contains `snapshot$i.zarr.zip` files where `i` is the step of the simulation.
The state returned by `setup` is stored in `snapshot0.zarr.zip`
### `traj/footer.json`
This is created to show a trajectory is complete.
It contains some metadata about the trajectory.
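
The layout above can be sketched as a small path helper. `job_paths` and `is_complete` are hypothetical names used for illustration and are not part of the MEDYANSimRunner API. The sketch assumes only the file names listed in this README, and it treats the presence of `traj/footer.json` as the completion marker.

```julia
# Hypothetical helper: expected file locations for one job, following the
# output layout described in the README. Not part of MEDYANSimRunner's API.
function job_paths(out::AbstractString, job::AbstractString)
    jobdir = joinpath(out, job)
    traj = joinpath(jobdir, "traj")
    return (
        logs = joinpath(jobdir, "logs"),
        header = joinpath(traj, "header.json"),
        footer = joinpath(traj, "footer.json"),
        snapshot = step -> joinpath(traj, "snap$(step).zarr.zip"),
    )
end

# A trajectory is treated as complete once `traj/footer.json` exists.
is_complete(out, job) = isfile(job_paths(out, job).footer)

paths = job_paths("test/output", "1")
println(paths.snapshot(0))
println(is_complete("test/output", "1"))
```

A check like this could be used to decide which jobs still need to be run or continued.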
2 changes: 0 additions & 2 deletions deps/build.jl

This file was deleted.
