-
When running 'plotman plot' and 'plotman archive' at the same time, it errors out after some time (140 s to 1780 s):
/proc/6620 is the actively running plotting process:
Because this happens randomly, the same segment of code has evidently run successfully several times before failing. A quick search online suggests the Python scripts might not conform to standard syntax. I doubt this, as these scripts have been in use for over a year. I know very little Python. It feels more like some library or supporting Python package is missing or needs updating.
-
We make an effort to catch psutil exceptions and respond sensibly. Looks like there may be some left to deal with. How did you install plotman? The goal being to know what version/branch/commit you have. I'm not sure about the initial issue description. It sounds like it is starting a plot, waiting, then starting another plot. Is that wrong? Or maybe that part is right but the archive isn't actually moving plots off tmp?
-
This is an odd setup. The Windows host runs a Machinaris Docker container and an Ubuntu VM via WSL v1. The plotman in use is not your official plotman but the one from Machinaris.
The error in the last comment is something new; I have only seen it recently, since I started running both services at the same time. Before, I was running just one, either plot or archive. When plotting in the VM with just plotman plot, it waits the global_stagger_m time and then starts a plot regardless of whether the last one has finished. In the Machinaris Docker container, the same command with the same plotman.yaml waits the global_stagger_m time but then continues to wait, because the previous plot has not yet reached stage 5. The Machinaris behavior is the expected one. Machinaris is really just another Linux VM in a container. As I am trying to replicate in the Ubuntu VM what plotman needs to run, it feels like I am missing something, a lib or package. #962 has the full VM setup.
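The two behaviors described above can be framed as two outcomes of one start condition. The following is a toy model only, not plotman's scheduler: the function name, its parameters, and the idea of a single numeric stage gate are assumptions made for illustration. Only `global_stagger_m` and the "stage 5" threshold come from this thread.

```python
def may_start_new_plot(minutes_since_last_start, global_stagger_m,
                       visible_job_stages, required_stage=5):
    """Toy start condition: stagger elapsed AND every visible job is far enough along.

    visible_job_stages: stages of the plotting jobs the scheduler can see
    (hypothetical representation, one integer per job).
    """
    stagger_elapsed = minutes_since_last_start >= global_stagger_m
    jobs_far_enough = all(stage >= required_stage for stage in visible_job_stages)
    return stagger_elapsed and jobs_far_enough


# Behavior reported for Machinaris: stagger elapsed, but the running plot is still early.
waits = may_start_new_plot(4, 3, visible_job_stages=[2])

# Behavior reported for the WSL VM is what this model yields when no job is visible.
starts = may_start_new_plot(4, 3, visible_job_stages=[])
```

In this model an empty job list is one situation that reproduces the VM's behavior. Whether that is what actually happens here is not established by this thread.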
-
I appreciate any and all help provided. I found part of the problem. In my Ubuntu (WSL) there is a symlink in the path to the log files. Because proc.open_files() reports real paths, it fails to match logroot against any open file. Below is a quick fix in job.py -> __init__. It should really be implemented where the log dirs are read in.

    logroot = os.path.realpath(logroot)

#363 has some of the solution. For whatever reason, proc.open_files() does not capture the log file. list_fds does capture the log file, usually as number 0 or 1.

job.py

    import errno
    ...
    def list_fds(procId):
        # List process currently open FDs and their target
        # Source: https://stackoverflow.com/a/24803353
        ret = []
        base = '/proc/' + str(procId) + '/fd'
        for num in os.listdir(base):
            path = None
            if os.path.exists(os.path.join(base, num)):
                try:
                    path = os.readlink(os.path.join(base, num))
                except OSError as err:
                    # Last FD is always the "listdir" one (which may be closed)
                    if err.errno != errno.ENOENT and err.errno != errno.errorcode[9]:
                        raise
            ret.append(path)
        return ret

job.py -> __init__

    # Find logfile (whatever file is open under the log root). The
    # file may be open more than once, e.g. for STDOUT and STDERR.
    if sys.platform.startswith('linux'):
        for f in list_fds(self.proc.pid):
            if logroot in f:
                if self.logfile:
                    assert self.logfile == f
                else:
                    self.logfile = f
                break
    else:
        # Below is the original code
        for f in self.proc.open_files():
            if logroot in f.path:
                if self.logfile:
                    assert self.logfile == f.path
                else:
                    self.logfile = f.path
                break

The random bug occurs when proc.open_files() or list_fds() attempts os.readlink(path) and the tmp file is either closed or non-existent. I have tried to compensate with:

    if os.path.exists(os.path.join(base, num)):

and

    if err.errno != errno.ENOENT and err.errno != errno.errorcode[9]:
        raise

Neither has stopped the "OSError: [Errno 9] Bad file descriptor:" error. plotman status is fixed:
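One possible reason the errno guard does not suppress the error: `errno.errorcode` maps numeric codes to their names, so `errno.errorcode[9]` is the string 'EBADF'. Comparing it with the integer `err.errno` is always unequal, so the condition reduces to `err.errno != errno.ENOENT` and an Errno 9 is re-raised. Below is a minimal sketch that compares against `errno.EBADF` instead and also guards the `os.listdir` call in case the process exits mid-scan. The name `list_fd_targets` is hypothetical, and whether this covers the failure seen on WSL 1 is untested.

```python
import errno
import os

# Error codes taken to mean "this FD (or the whole process) went away while scanning".
_VANISHED = (errno.ENOENT, errno.EBADF, errno.ESRCH)


def list_fd_targets(pid):
    """Return the symlink targets of the FDs currently open in process `pid`.

    Linux-only: reads /proc/<pid>/fd. FDs that disappear during the scan are skipped.
    """
    base = os.path.join('/proc', str(pid), 'fd')
    try:
        names = os.listdir(base)
    except OSError as err:
        if err.errno in _VANISHED:
            return []
        raise
    targets = []
    for name in names:
        try:
            targets.append(os.readlink(os.path.join(base, name)))
        except OSError as err:
            # Compare against the integer constants, not errno.errorcode[...] strings.
            if err.errno not in _VANISHED:
                raise
    return targets
```

Because unreadable entries are skipped rather than appended as None, a caller can run a check such as `logroot in f` on every returned item without hitting a TypeError.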
The main issue of this discussion remains. When global_stagger_m times out, plotman starts a new plot regardless of whether the last one has reached stage 5 or is done. In Machinaris, plotman continues to wait for the first plot to reach stage 5 or finish. global_stagger_m is set to 3.
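The symlink mismatch described earlier in this comment can be reproduced in isolation. This is a sketch, assuming the match is done on path strings as in the `logroot in f.path` check quoted above. The helper `is_under` is hypothetical and not part of plotman, and it uses `os.path.commonpath` on resolved paths rather than the substring test.

```python
import os
import tempfile


def is_under(path, root):
    """True if `path` is `root` itself or lies inside it, after resolving symlinks."""
    real_path = os.path.realpath(path)
    real_root = os.path.realpath(root)
    return os.path.commonpath([real_path, real_root]) == real_root


# Demonstration with a throwaway directory reached through a symlink.
with tempfile.TemporaryDirectory() as scratch:
    storage = os.path.join(scratch, 'storage')
    os.mkdir(storage)
    link_logs = os.path.join(scratch, 'logs')    # symlinked spelling of the log root
    os.symlink(storage, link_logs)

    # The resolved path, as an open-files listing would report it.
    log_file = os.path.join(os.path.realpath(storage), 'plot.log')
    open(log_file, 'w').close()

    naive_match = link_logs in log_file             # substring test on raw strings
    resolved_match = is_under(log_file, link_logs)  # compare after realpath
```

Here `naive_match` is False and `resolved_match` is True, which mirrors why resolving `logroot` once, where the log directories are read in, makes the later comparison independent of how the path was spelled in the config.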
-
This is a duplicate discussion from Machinaris repo.
#962 details how I have gotten 'plotman archive' working in a Linux VM on Windows.
'plotman plot' starts fine, but as soon as global_stagger_m times out, it starts another plot. Left unattended, the tmp dir fills up and all plotting fails. 'plotman archive' also detects archive jobs, but seems to have no issues.
Machinaris, using the same plotman.yaml, knows there are active plotting jobs and, after global_stagger_m times out, waits until a plotting job gets to stage 5.
While I have used symlinks to make the Linux VM see everything the same way as the Docker container, there are still differences.
Any ideas what is missing? I am not sure what to try next.