are there rules programmers need to know about tasks deleting and writing into the output directories?
reevaluate the worker shutdown rules or at least lengthen the worker timeout. [Workaround: The write-json task now waits for parca output just so its worker node won’t time out waiting for more work to do.] Maybe workers should stick around while runs (or their run) are running, or maybe we should just auto-start more of them on demand and make them start up faster.
auto-launch and shutdown workers, probably mediated by Gaia. Either accept a worker count from the workflow builder for speed and adjustability or react to the queue length to be more automatic. [Low priority since the WCM builder now does it.]
pulling a docker image is slow so periodically save a new GCE Image in the disk image family containing current docker images: `sudo su -l sisyphus && docker pull python:2.7.16 && docker pull gcr.io/allen-discovery-center-mcovert/wcm-runtime`, reap some old images, do `sudo apt-get upgrade`, and trim the journalctl history. Then save the image as “sisyphus-v3” (etc.) in the “sisyphus-worker” family.
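A hedged sketch of the image-save step (the disk name, zone, and project below are placeholders; only the image name and the “sisyphus-worker” family come from the item above). Run it after the refresh steps, with the worker VM stopped so the boot disk is quiescent:

```python
import subprocess

def save_worker_image(version="sisyphus-v3",
                      source_disk="sisyphus-worker-template",   # placeholder disk name
                      zone="us-west1-b",                         # placeholder zone
                      project="allen-discovery-center-mcovert"):
    """Save the refreshed boot disk as a new image in the sisyphus-worker family."""
    subprocess.run([
        "gcloud", "compute", "images", "create", version,
        "--project", project,
        "--source-disk", source_disk,
        "--source-disk-zone", zone,
        "--family", "sisyphus-worker",
    ], check=True)
```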
pulling a docker image is slow so consider having idle workers pre-pull the image for an upcoming task, at least when all remaining commands in the workflow use the same image
delete the gateway machine?
use a docker image version tag? how to feed it to the worker launcher?
sisyphus
(Ryan) handle task cancel when container hasn’t started yet (kill the container once it loads, if it does?)
(Ryan: investigating) have docker output files as the user who starts sisyphus
should task/perform-task! also call docker/stop! ?
fill out the error handling and recovery, e.g. don’t use deprecated GCS methods that can’t retry
make sisyphus-handle-rabbit use try/finally to ensure it ACKs to rabbit and resets its task state on exception?
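A minimal sketch of the intended pattern, in Python with pika as a stand-in (sisyphus itself is Clojure, and perform_task, log_task_error, and reset_task_state are hypothetical helpers): always ACK and reset task state, even when the task throws.

```python
import pika

def handle_delivery(channel, method, properties, body):
    try:
        perform_task(body)          # hypothetical task runner
    except Exception:
        log_task_error(body)        # hypothetical error logger
    finally:
        # ACK and reset happen whether the task succeeded or raised.
        channel.basic_ack(delivery_tag=method.delivery_tag)
        reset_task_state()          # hypothetical state reset
```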
create intermediate directories in GCS for nested keys (needed for gcsfuse; not needed for the web console browser)
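A minimal sketch, assuming the google-cloud-storage Python client (bucket_name and key are placeholders): create zero-byte “directory” objects for each parent prefix so gcsfuse can see nested keys without --implicit-dirs.

```python
from google.cloud import storage

def ensure_parent_dirs(bucket_name, key):
    """Create placeholder objects for every parent prefix of `key`."""
    bucket = storage.Client().bucket(bucket_name)
    parts = key.split("/")[:-1]                 # e.g. "a/b/c.txt" -> ["a", "b"]
    for i in range(1, len(parts) + 1):
        prefix = "/".join(parts[:i]) + "/"
        blob = bucket.blob(prefix)
        if not blob.exists():
            blob.upload_from_string(b"", content_type="application/x-directory")
```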
support relative local paths within the container (or reject them up front with a clear error message rather than failing obscurely) (the workflow builder rejects them early)
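A sketch of the up-front check (the error wording is illustrative): fail fast on relative container paths instead of letting the task fail obscurely later.

```python
import posixpath

def check_container_path(path):
    """Reject relative container paths with a clear message."""
    if not posixpath.isabs(path):
        raise ValueError(
            "Container path %r must be absolute, e.g. /wcEcoli/out/..." % path)
```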
gaia
don’t start each workflow with scary looking log entry “CANCELING (:every :step :in :the :new :workflow)”
remember when each run begins and tell the elapsed time in the “WORKFLOW COMPLETE” log entry
get status info: what are the status enum values? distinguish tasks waiting for workers vs tasks handed off to workers. indicate which worker is running which task.
wcm.py: provide a means to upload directory trees to storage from client [is gcsfuse sufficient? it needs the --implicit-dirs flag until we make sisyphus create the dirs]
wcm.py: enable the ability to do parts of the workflow (parca, sim, analysis), as well as all at once: you can set --run_analysis=0 or --generations=0. is that sufficient? to queue up sims and analysis but block them requires some trick like making them depend on a file that nothing writes, then manually creating the files, then calling Gaia.run()? or a web UI.
standardize gaia API
unify the gaia client and wcEcoli worker-launch code
enable multiple classes of workers with GCE instance templates, separate rabbit queues, and workflow (steps or commands?) to specify needed classes
clean up each workflow run when done
make workers launch quicker. is it quicker to launch a VM from a snapshot or an instance template than an image? easier to resize?
add an API method to list the current workflow names
logs (for now, the logs are the UI)
when looking at gaia workflow status from the client, provide means to filter tasks/data etc
figure out how to filter viewing by workflow name or user or task. Which LogEntry fields are most useful for that? log name? log “tag” label? .setOperation() [ID, producer]? setSourceLocation() esp. for stack traces? gaia and sisyphus now set several of these
log a sequence of docker output lines as a batch (mainly an optimization but it could reduce log interleaving between tasks)
support JsonPayload in the stackdriver API for logging structured records?
also try stackdriver debugger, load trace & profiler, dashboards
when the app in-container prints a stack trace, get that into one log entry for readability (even if it’s a heuristic aggregation) and set the log level to severe
design Gaia and Sisyphus logs to be more informative, less cluttered, easier to filter, and easier to read
clearly label the action or purpose for each log entry
(jerry) streamline or strip out JSON data, UUIDs, and such except where it’s definitely useful for debugging
errors
return the error info (e.g. there’s no storage bucket named “robin1”) in parseable JSON rather than causing a json-decoder-error decoding the server’s response
need more error detection & reporting
test what happens when things go wrong. does it emit helpful error messages? can it do self-repair?
exceptions
bad input: expunge a non-existent name, expunge an expunged name, commands and steps missing needed fields, steps referring to missing commands, …
cancellation at each possible juncture
a server goes down at each possible juncture
optimization
how come it takes (at least sometimes) many minutes for workers to start picking up tasks?
tasks run very slowly. do we need VMs with faster CPUs? more RAM? more cores? GPUs? larger disk?
optimization: reuse a running docker container when the previous task requested the same image
for apps with their own worker node requirements [also an optimization?]: a separate set of nodes for each workflow
documentation
document all the GCE VM setup factors: machine type? boot disk size? OS? Identity and API access? additional access scopes? software installation and configuration? startup script? metadata?
setting the “sisyphus” service account when configuring the GCE instance works, which obviates all the activate-service-account steps
document how to create the gaia and sisyphus VM images
document how to restart and monitor the gaia and sisyphus servers
features
unit tests
implement nightly builds and PR builds
web UI: show a graph of your current workflow’s steps, click on a step to see its inputs, outputs, log, and which inputs are available; show the workers and what tasks each one is running
tools to simplify and speed up the dev cycle
use the server DNS names within the cloud rather than hardwired IP addresses
remove webserver state viewing (what webserver?)
Sisyphus created empty directories rather than storing archive files for WCM task outputs e.g. sisyphus/data/jerry/20190628.204402/kb/
Sisyphus created directories for failed tasks e.g. sisyphus/data/jerry/20190628.204402/plotOut/
pass an array of CLI tokens to Docker so the client doesn’t have to do complex shell quoting (jerry put quoting into the WCM workflow as a temporary workaround) (maybe drop the unused && and > features)
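An illustration of the idea (sisyphus itself drives Docker from Clojure; `ls` stands in for the container command): passing argv as a token list sidesteps shell quoting entirely.

```python
import subprocess

# Fragile: the client must quote/escape the whole command line for a shell.
subprocess.run("ls -l 'out dir with spaces'", shell=True)

# Robust: each argv token is passed through verbatim; no quoting needed.
subprocess.run(["ls", "-l", "out dir with spaces"])
```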
flow.trigger('sisyphus') gave a json error
Sisyphus wrote outputs to GCS after some failed tasks, so retrying the same task names won’t start
WCM output .tgz archives aren’t getting stored in GCS; only directory entries are stored
clear output directories between task runs
ensure that running a Command always begins without previous output files even if it reuses an open docker container
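A minimal sketch of the cleanup step for the two items above (output_dir is a placeholder): wipe and recreate the task’s output directory so a reused container never starts with stale files.

```python
import os
import shutil

def reset_output_dir(output_dir):
    """Remove any previous outputs and start with an empty directory."""
    shutil.rmtree(output_dir, ignore_errors=True)
    os.makedirs(output_dir)
```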
make a Gaia client pip and add it to the wcEcoli requirements, or something
the sisyphus VM needs more disk space -> now 200GB, 2 CPUs, 7.5 GB RAM
why do the worker VMs print “*** System restart required ***” when you ssh in? -> the VM image needed rebooting to install updates
give processes and data keys their own namespace
the Simulation task failed trying to delete the output directory:
Device or resource busy: 'wcEcoli/out/wf/wildtype_000000/000000/generation_000000/000000/simOut'
arrange secure access to the Gaia API over the internet
probably need worker nodes with more RAM and disk space; maybe configurable
replace any yaml.load() calls with yaml.safe_load()
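For example (workflow.yaml is a placeholder file name):

```python
import yaml

# Before: yaml.load() can instantiate arbitrary Python objects from the document.
# config = yaml.load(open("workflow.yaml"))

# After: safe_load() only builds plain data (dicts, lists, strings, numbers).
with open("workflow.yaml") as f:
    config = yaml.safe_load(f)
```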
remote uploading to Gaia; ability to post a workflow directly from your desktop
remote log monitoring via flow.listen()
give the sisyphus service account permissions to write to logs
ideally, make a single log entry for a stack traceback
support stackdriver logging and filtering: sisyphus
pick an easier way to tunnel to kafka than adding to /etc/hosts (Cloud IAP? ifconfig alias? HOSTALIASES? dynamic port forwarding? VPN?) OR obviate it with stackdriver logging
[^C out of flow.listen() should not print a bunch of clutter in ipython]
store archive with .tgz suffix OR store the directory of files instead of an archive
the namespace should be independent of the bucket name
put commands in namespace
“gaia-base bash[8924]: WARNING: Illegal reflective access by io.netty.util.internal.ReflectionUtil (file:/home/gaia/.m2/repository/io/netty/netty-all/4.1.11.Final/netty-all-4.1.11.Final.jar) to constructor java.nio.DirectByteBuffer(long,int); Please consider reporting this to the maintainers of io.netty.util.internal.ReflectionUtil; All illegal access operations will be denied in a future release”
the log output comes out in batches of lines with many minutes between them
update Gaia.launch(): There’s no ../../script/launch-sisyphus.sh in the pip, and it should launch all the servers in one gcloud call like the wcEcoli version does now
a parca task never got picked up by a worker
adding workers made everything stop running: with 3 WCM workers, one of them waited, one ran the write-metadata task then timed out, and the third ran parca. later, I stopped listen(), started 3 more workers, then started listen() again, and everything got very stuck: listen() printed nothing, the gaia log only printed Kafka messages about partitions, ^C in listen() printed the usual stack-trace clutter but wouldn’t quit, further ^C got no response, ^D printed “Do you really want to exit ([y]/n)?” but wouldn’t exit, and a final ^C exited.
log a message when a workflow run stops running and indicate whether all tasks completed successfully
clearly label the error messages via log/severe! or log/exception!
perhaps flow.listen() should tune in at the start of the run or from where listen left off
call the stackdriver API instead of java.util.logging (multiple benefits)
logging the app in-container: avoid extra quoting and escaping: textPayload: “INFO sisyphus: (“log” {:line ” File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code”}) ”
set log message levels
(Ryan: validating) provide some means to find out what keys the workflow is waiting on (for debugging)
(Ryan: validating) worker nodes need to be robust to task failures
(Ryan) adjust Docker calls if possible to deliver log entries in smaller batches
support task cancellation
put each workflow run in its own namespace such as WCM_jerry_20190716.021305, pass the namespace name in each sisyphus task, and log it in each gaia & sisyphus log entry for filtering
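A small sketch of deriving the namespace (the user argument and WCM prefix are assumptions taken from the example name above):

```python
from datetime import datetime

def run_namespace(user, prefix="WCM"):
    """Derive one namespace per workflow run for tagging tasks and log entries."""
    timestamp = datetime.utcnow().strftime("%Y%m%d.%H%M%S")
    return "{}_{}_{}".format(prefix, user, timestamp)   # e.g. WCM_jerry_20190716.021305
```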
(jerry) support stackdriver logging and filtering: gaia
(jerry) add the “instance_id” and “zone” labels to gce_instance monitored resources
(jerry) have gaia pass the task name to sisyphus and use it with sisyphus log/tag
(jerry) when the docker app run returns an error code, don’t re-log the same output lines (since that’s confusing to read) and set the severity level to error
store sisyphus id in logs
(Ryan: validating) sisyphus gets in bad state with rabbit when a task fails
(Ryan: validating) sometimes the WCM WF runs Parca then doesn’t continue on to run the following tasks
(Ryan: validating) the queue needs to be robust to task failures; don’t rerun them unless that has a reasonable chance of working and there’s a max number of retries; the rabbit interaction is failing on error in sisyphus
(Ryan: investigating) Gaia.trigger() doesn’t start the workflow unless workers are good and ready
(Ryan: investigating) is it necessary to have running workers before flow.trigger() will work?
wcm.py: show what is going to be run, then accept confirmation (with option to force) -> run it with --dump to write workflow-commands.json and workflow-steps.json instead of sending them to the Gaia workflow server. you can then look them over and either manually upload those files via the gaia client or do wcm.py again without the arg
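A minimal sketch of the --dump path (the commands and steps objects are assumed to be JSON-serializable; only the two file names come from the item above): write the workflow to disk for review instead of posting it to Gaia.

```python
import json

def dump_workflow(commands, steps):
    """Write the built workflow to JSON files instead of sending it to the server."""
    with open("workflow-commands.json", "w") as f:
        json.dump(commands, f, indent=2)
    with open("workflow-steps.json", "w") as f:
        json.dump(steps, f, indent=2)
```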
(Ryan) RENAME EVERYTHING
key -> name
root -> workflow
process -> step
command -> command
merge -> merge
halt -> halt
trigger -> run
expire -> expire
??? -> ???
(jerry) remove kafka-based logging from gaia client, ssh-tunnel.sh, sisyphus, and gaia
(jerry) make the “task complete” log entry responsive to whether the task succeeded (“task succeeded” or “task failed”) so it doesn’t mislead people with “task complete” on failure
(jerry) log a clear message when a workflow completes or stalls
clarify logs for worker termination vs. step termination vs. step completion
add argument checking assertions, #type: type specs, and docstrings
preserve indentation whitespace in Logs Viewer? -> leading whitespace doesn’t show up in the collapsed view, e.g. parca’s output “wrote\n\t /wcEcoli/out/wf/kb/rawData.cPickle …”
sisyphus-log: {u'status': u'log', u'line': u'Fitting RNA synthesis probabilities.', u'id': u'cbb31409-3bc9-4811-94d0-97a0f6bfa3b5'} should be more like “worker sisyphus-b: Fitting RNA synthesis probabilities.”
(Ryan) sometimes rabbit messages are not received by sisyphus
gaia client should check arg types before sending a request to the server
retry docker pull on com.spotify.docker.client.exceptions.DockerException: java.util.concurrent.ExecutionException: javax.ws.rs.ProcessingException: java.io.IOException: Connection refused
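A pattern sketch only, in Python (the real failure is the Java DockerException above, inside sisyphus, which would catch its own exception type): retry the pull a few times with a delay before giving up on “Connection refused”.

```python
import time

def pull_with_retries(pull, retries=3, delay=5.0):
    """Call pull() up to `retries` times; the real code would catch the Docker
    client's connection exception rather than bare Exception."""
    for attempt in range(retries):
        try:
            return pull()                       # e.g. a closure that runs the docker pull
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))   # linear backoff between attempts
```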
auto-flush at :notice level and above, or manually flush after logging STEP COMPLETED?
(jerry) perform-task! should catch exceptions inside the task-specific tag to better tag the “task-error” log entry
don’t start each workflow with scary looking log entries “CANCELING ” followed by “CANCELING (:every :step :in :the :workflow)”
store a persistent log of high level info plus error messages, esp. for CI runs; or save stdout+stderr from all steps; or dump logs for a single run in bucket alongside results
write a step-by-step how-to document for lab members
document how to make a Compute Engine monitoring chart for worker node CPU usage: on the GCP dashboard, add a chart with Metric instance/cpu/utilization and Filter metric.labels.instance_name = starts_with(“sisyphus”); consider adding more metrics such as instance/disk/read_bytes_count, grouped by project_id and aggregated by sum