Plotman plot not seeing running jobs in Ubuntu in WSL v1 on Windows #964

zorner · 2022-11-17T18:37:28Z

zorner
Nov 17, 2022

This is a duplicate discussion from Machinaris repo.

#962 details how I have gotten 'plotman archive' working in a Linux VM on Windows.

'plotman plot' will start fine, but as soon as the global_stagger_m times out, it starts another plot. Left unattended, the tmp dir fills up and all plotting fails. 'plotman archive' also detects archive jobs, but seems to have no issues.

Machinaris using the same plotman.yaml knows there are active plotting jobs and waits after the global_stagger_m times out until a plotting job gets to stage 5.

While I have made the Linux VM see everything the same as the docker container using sym links, there are still differences.

Machinaris does not have a /root/.config dir, so plotman.yaml cannot be in /root/.config. It is in /root/.chia/plotman.
VM has /root/.config/plotman.yaml as a sym link to /root/.chia/plotman/plotman.yaml.
VM comes with python 3.8. Machinaris uses 3.10. I have tried 3.10 and 3.9 in the VM, but did not make a difference.
VM may not have python3.??-distutils and python3.??-dev installed. I have tried with them, but no change.
VM only has the Chia CLI. Machinaris may have the Chia GUI version.

Any ideas what is missing? I am not sure what to try next.

zorner · 2022-11-17T18:38:14Z

zorner
Nov 17, 2022
Author

When trying to run both 'plotman plot' and 'plotman archive' at the same time, after some time (140s - 1780s) it errors out:

Traceback (most recent call last):
  File "/opt/chia/venv/bin/plotman", line 8, in <module>
    sys.exit(main())
  File "/opt/chia/venv/lib/python3.8/site-packages/plotman/plotman.py", line 281, in main
    (started, msg) = manager.maybe_start_new_plot(
  File "/opt/chia/venv/lib/python3.8/site-packages/plotman/manager.py", line 118, in maybe_start_new_plot
    jobs = job.Job.get_running_jobs(log_cfg.plots)
  File "/opt/chia/venv/lib/python3.8/site-packages/plotman/job.py", line 187, in get_running_jobs
    job = cls(
  File "/opt/chia/venv/lib/python3.8/site-packages/plotman/job.py", line 215, in __init__
    for f in self.proc.open_files():
  File "/opt/chia/venv/lib/python3.8/site-packages/psutil/__init__.py", line 1142, in open_files
    return self._proc.open_files()
  File "/opt/chia/venv/lib/python3.8/site-packages/psutil/_pslinux.py", line 1645, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/chia/venv/lib/python3.8/site-packages/psutil/_pslinux.py", line 2193, in open_files
    path = readlink(file)
  File "/opt/chia/venv/lib/python3.8/site-packages/psutil/_pslinux.py", line 210, in readlink
    path = os.readlink(path)
OSError: [Errno 9] Bad file descriptor: '/proc/6620/fd/174'

[1]-  Exit 1                  plotman plot
[2]+  Exit 1                  plotman archive

/proc/6620 is the actively running plotting process:

root@Plotter2-Chia:/mnt/c/Users/Cryptnoid Pennies# ps -x | grep "plotman\|rsync\|chia_plot"
 6620 ?        SNsl 535:52 chia_plot -n 1 -r 24 -u 256 -t /plotting1/ -d /plotting1/ -2 /ramdrive/ -v 256 -K 1 -f a3f7e719a508879c16144b615e76aeab1c2bea66b47b690aea33c9e5fed591ad4476b573f2e3dddfe84e923adeae6449 -c xch1ekslymyyfrrv50nptterxuxgmwvw0vc8xz848sf3gzfhse7e8thqmjlqj4
 8220 tty2     S      0:00 grep --color=auto plotman\|rsync\|chia_plot

As this happens randomly, it suggests this same segment of code has run successfully several times before having an issue.

A quick search online suggests the python scripts might not be conforming to standard syntax. I doubt this, as these scripts have been in use for over a year. I know very little python.

It feels more like some lib or some python support programs are missing or need updating.

0 replies

altendky · 2022-11-17T19:45:44Z

altendky
Nov 17, 2022
Collaborator

We make an effort to catch psutil exceptions and respond sensibly. Looks like there may be some left to deal with. How did you install plotman? The goal being to know what version/branch/commit you have.

I'm not sure about the initial issue description. It sounds like it is starting a plot, waiting, then starting another plot. Is that wrong? Or maybe that part is right but the archive isn't actually moving plots off tmp?

0 replies

zorner · 2022-11-17T21:06:01Z

zorner
Nov 17, 2022
Author

This is an odd setup. The Windows host has a Machinaris Docker container and an Ubuntu VM via WSL v1. The plotman is not your official plotman, but the Machinaris one.

(venv)# pip install --force-reinstall git+https://github.com/guydavis/plotman@development

root@Plotter1-Chia:/mnt/c/Users/Cryptnoid Pennies# plotman version
plotman 0.5.3+dev

The error in the last comment, something new, I have only recently seen as I try to run both services at the same time. Before I was just doing one, either plot or archive.

When plotting in the VM with just plotman plot, it would wait the global_stagger_m time and then start a plot regardless if the last had finished. In the Machinaris docker container, the same command using the same plotman.yaml would wait the global_stagger_m time, but then would continue to wait as the previous plot had not made it to stage 5. The Machinaris behavior is the expected one.

Machinaris is really just another Linux VM in a container. As I am trying to replicate what plotman needs run in the Ubuntu VM, it feels like I am missing something, a lib or package. #962 has the full VM setup.

4 replies

zorner Nov 17, 2022
Author

This is an odd setup. The Windows host has a Machinaris Docker container and an Ubuntu VM via WSL v1. The plotman is not your official plotman, but the Machinaris one.

(venv)# pip install --force-reinstall git+https://github.com/guydavis/plotman@development

root@Plotter1-Chia:/mnt/c/Users/Cryptnoid Pennies# plotman version
plotman 0.5.3+dev

The error in the last comment, something new, I have only recently seen as I try to run both services at the same time. Before I was just doing one, either plot or archive.

When plotting in the VM with just plotman plot, it would wait the global_stagger_m time and then start a plot regardless if the last plot had finished. In the Machinaris docker container, the same command using the same plotman.yaml would wait the global_stagger_m time, but then would continue to wait as the previous plot had not made it to stage 5. The Machinaris behavior is the expected one.

Machinaris is really just another Linux VM in a container. As I am trying to replicate what plotman needs run in the Ubuntu VM, it feels like I am missing something, a lib or package. #962 has the full VM setup.

altendky Nov 17, 2022
Collaborator

I suspect there are some details here about the setup that aren't quite right. An Ubuntu VM inside WSL? Like you installed VirtualBox inside of WSL to run a VM inside of WSL inside of Windows? Or do you just mean directly in WSL?

Anyways, if plotman is waiting for the stagger then it seems to be able to tell that the plotting process is still running. Otherwise it would start another plotting process 20 seconds later. But, it would start another plot after the global stagger without waiting for the plotting process to satisfy phase staggers if it weren't able to find the plotting log file, I think... If you run plotman status you should be able to see what it sees in terms of phase progress. You can also check and make sure the plot logs are getting created and written to.
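The reasoning above can be pictured with a toy model. This is only a sketch and not plotman's actual code: `ToyJob`, `may_start_new_plot`, and the `(5, 0)` threshold are hypothetical names and values chosen to mirror the "stage 5" behavior described in this thread, and it assumes that a job whose phase cannot be read (shown as `?:?` by plotman status) is simply left out of the phase check.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyJob:
    # (major, minor) phase, or None when the log file could not be found,
    # which is what "?:?" in the status output stands for in this model.
    phase: Optional[Tuple[int, int]]


def may_start_new_plot(
    jobs: List[ToyJob],
    seconds_since_last_start: float,
    global_stagger_s: float,
    stagger_phase: Tuple[int, int] = (5, 0),
) -> bool:
    # First gate: the global stagger timer must have expired.
    if seconds_since_last_start < global_stagger_s:
        return False
    # Second gate: every job with a known phase must be far enough along.
    # Jobs with an unknown phase are skipped (assumption of this sketch).
    known = [job.phase for job in jobs if job.phase is not None]
    return all(phase >= stagger_phase for phase in known)


# A running job whose phase is known and early blocks a new plot.
blocked = may_start_new_plot([ToyJob(phase=(1, 2))], 200, 180)
# The same job with an unreadable phase no longer blocks it.
allowed = may_start_new_plot([ToyJob(phase=None)], 200, 180)
```

Under that assumption, a job that is detected but has no readable log behaves exactly like the symptom reported here: the timer is honored, the phase limit is not.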

zorner Nov 18, 2022
Author

Ubuntu is running directly on WSL.

The plotting logs are being created as expected.

plotman status on Ubuntu WSL v1

root@Plotter1-Chia:/mnt/c/Users/Cryptnoid Pennies# plotman status
plot id   plotter   k   tmp   dst   wall   phase   tmp     pid   stat     mem   user    sys   io
           madmax   0               0:38     ?:?     0   27393    SLP   9.0Gi   7:52   1:06   0s

Total jobs: 1
Jobs in : 1

Updated at: Thu Nov 17 19:41:48 2022

plotman status on Machinaris

root@Plotter1-Chia:/chia-blockchain# plotman status
 plot id   plotter    k           tmp           dst   wall   phase   tmp     pid   stat     mem   user    sys   io
4c873de4    madmax   32   /plotting3/   /plotting3/   0:03     1:2    3G   11739    SLP   7.9Gi   0:06   0:02   0s

Total jobs: 1
Jobs in /plotting3/: 1

Updated at: Thu Nov 17 19:49:35 2022

This is starting to look like the same issue as #363. There it was suggested that WSL v1 was the issue.

plotman status on Ubuntu WSL v2

root@Plotter1-Chia:/mnt/c/Users/Cryptnoid Pennies# plotman status
plot id   plotter   k   tmp   dst   wall   phase   tmp   pid   stat     mem   user    sys   io
           madmax   0               0:16     ?:?     0   155    SLP   9.9Gi   0:49   0:05   0s

Total jobs: 1
Jobs in : 1

Updated at: Fri Nov 18 01:20:46 2022

#363 mentions an error message of 'Found plotting process PID XXX, but could not find logfile in its open files:' I am not seeing that message in plotman.log.

There also was a question about sym links.

plotman.yaml - logging

logging:
        # DO NOT CHANGE THESE IN-CONTAINER PATHS USED BY MACHINARIS!
        plots: /root/.chia/plotman/logs
        transfers: /root/.chia/plotman/logs/archiving
        application: /root/.chia/plotman/logs/plotman.log

/root/.chia in Ubuntu is a sym link, whereas Machinaris has mounted the dir.

ls -la on Ubuntu

root@Plotter1-Chia:/mnt/c/Users/Cryptnoid Pennies# ls -la /root/
total 12
drwx------ 1 root root  512 Nov 17 11:40 .
drwxr-xr-x 1 root root  512 Nov 15 04:36 ..
-rw------- 1 root root 5412 Nov 16 18:27 .bash_history
-rw-r--r-- 1 root root 3106 Dec  5  2019 .bashrc
drwxr-xr-x 1 root root  512 Nov 14 17:20 .cache
lrwxrwxrwx 1 root root   43 Nov 14 17:09 .chia -> '/mnt/c/Users/Cryptnoid Pennies/.machinaris/'
drwx------ 1 root root  512 Nov 15 03:35 .chia_keys
drwxr-xr-x 1 root root  512 Nov 14 17:05 .config
drwxr-xr-x 1 root root  512 Nov 14 16:53 .docker
drwxr-xr-x 1 root root  512 Nov 14 17:13 .local
-rw-r--r-- 1 root root    0 Nov 16 18:29 .motd_shown
-rw-r--r-- 1 root root  161 Dec  5  2019 .profile
drwx------ 1 root root  512 Nov 17 18:08 .ssh

ls -la on Machinaris

root@Plotter1-Chia:/chia-blockchain# ls -la /root/
total 44
drwx------ 1 root root 4096 Nov 15 21:19 .
drwxr-xr-x 1 root root 4096 Nov 11 07:14 ..
-rw------- 1 root root  995 Nov 16 18:28 .bash_history
-rw-r--r-- 1 root root 3235 Nov 11 07:14 .bashrc
-rw-r--r-- 1 root root 3106 Oct 15  2021 .bashrc.bak
drwxr-xr-x 4 root root 4096 Nov 11 07:14 .cache
drwxrwxrwx 1 root root  512 Oct 26 09:41 .chia
lrwxrwxrwx 1 root root   22 Nov 11 07:01 .chia_keys -> /root/.chia/.chia_keys
-rw------- 1 root root   20 Nov 15 21:19 .lesshst
-rw-r--r-- 1 root root  161 Jul  9  2019 .profile
drwx------ 2 root root 4096 Nov 15 03:18 .ssh
-rw-r--r-- 1 root root  290 Nov  4 11:52 .wget-hsts

mount on Machinaris

root@Plotter1-Chia:/chia-blockchain# mount
...
C:\ on /root/.chia type 9p (rw,noatime,dirsync,aname=drvfs;path=C:\;uid=0;gid=0;metadata;symlinkroot=/mnt/host,mmap,access=client,msize=65536,trans=fd,rfd=8,wfd=8)
...

Tried removing the sym link. Could not mount a dir on another dir (WSL limitation). Instead replaced it with mkdir /root/.chia/plotman and copied plotman.yaml.

root@Plotter1-Chia:/mnt/c/Users/Cryptnoid Pennies# plotman status
plot id   plotter   k   tmp   dst   wall   phase   tmp   pid   stat     mem   user    sys   io
           madmax   0               0:02     ?:?     0   116    SLP   9.9Gi   0:51   0:06   0s

Total jobs: 1
Jobs in : 1

Updated at: Fri Nov 18 03:40:21 2022

My guess at this point is that, as the WSL Linux kernels (the ones from the MS store) are not full kernels, they are missing something psutil needs. Maybe a module that can be loaded.

altendky Nov 18, 2022
Collaborator

Might be worth running the psutil tests in WSL to see if they work. Probably not, given giampaolo/psutil#1251. But I think the linked issue was reporting better success in WSL2. Also, as I recall, disk performance is better in WSL2 than in WSL1 if you mount the disk directly into WSL2 instead of going through whatever mapping of Windows-mounted disks. Why is this not an option?

If I make time for "real work" on plotman, odds are it will be to add Windows support, not WSL1 support.

zorner · 2022-11-22T04:25:39Z

zorner
Nov 22, 2022
Author

I appreciate any and all help provided.

I found part of the problem.

In my Ubuntu (WSL) there is a sym link in the path to the log files. Because proc.open_files() reports real paths, it fails to match logroot to any open files. Below is a quick fix in job.py -> __init__. It should really be implemented where the log dirs are read in.

logroot = os.path.realpath(logroot)
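A small self-contained illustration of why the sym link breaks the substring match. The directory names `real_logs` and `link_logs` are hypothetical stand-ins for the Windows-side directory and the symlinked path from plotman.yaml, and `os.path.realpath(f.name)` is used here only to stand in for the resolved path that an open-files listing would report.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    base = os.path.realpath(base)
    real_logs = os.path.join(base, "real_logs")  # actual directory
    link_logs = os.path.join(base, "link_logs")  # configured, symlinked path
    os.mkdir(real_logs)
    os.symlink(real_logs, link_logs)

    # Open the log through the symlinked path, as the plotter would.
    with open(os.path.join(link_logs, "example.plot.log"), "w") as f:
        # Stand-in for the resolved path reported for the open file.
        reported_path = os.path.realpath(f.name)

        # The configured path never appears in the resolved path...
        matched_raw = link_logs in reported_path
        # ...but the resolved log root does.
        matched_resolved = os.path.realpath(link_logs) in reported_path
```

Here `matched_raw` comes out false and `matched_resolved` true, which is the same mismatch the one-line realpath fix works around.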

#363 has some of the solution. For whatever reason, proc.open_files() does not capture the log file. list_fds does capture the log file, usually number 0 or 1.

job.py

import errno
...
def list_fds(procId):
    # List process currently open FDs and their target
    # Source: https://stackoverflow.com/a/24803353

    ret = []
    base = '/proc/' + str(procId) + '/fd'
    for num in os.listdir(base):
        path = None
        if os.path.exists(os.path.join(base, num)):
            try:
                path = os.readlink(os.path.join(base, num))
            except OSError as err:
                # Last FD is always the "listdir" one (which may be closed)
                if err.errno != errno.ENOENT and err.errno != errno.errorcode[9]:
                    raise
        ret.append(path)
    return ret

job.py -> __init__

        # Find logfile (whatever file is open under the log root).  The
        # file may be open more than once, e.g. for STDOUT and STDERR.
        if sys.platform.startswith('linux'):
            for f in list_fds(self.proc.pid):
                if logroot in f:
                    if self.logfile:
                        assert self.logfile == f
                    else:
                        self.logfile = f
                    break
        else:
            # Below is the original code
            for f in self.proc.open_files():
                if logroot in f.path:
                    if self.logfile:
                        assert self.logfile == f.path
                    else:
                        self.logfile = f.path
                    break

The random bug occurs when proc.open_files() or list_fds() attempts os.readlink(path) and the tmp file is either closed or non-existent.

I have tried to compensate with:

        if os.path.exists(os.path.join(base, num)):

and

                if err.errno != errno.ENOENT and err.errno != errno.errorcode[9]:
                    raise

Neither has stopped the "OSError: [Errno 9] Bad file descriptor:" error.
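One likely reason the errno check does not help: errno.errorcode[9] is the string 'EBADF', while err.errno is the integer 9, so that comparison is never equal and the exception is always re-raised. The os.path.exists() check is also still a race, since the descriptor can close between the check and the readlink. Below is a minimal sketch of the same idea comparing against errno.EBADF instead. The name `list_fd_targets` and the `proc_root` parameter are made up for illustration, it skips unreadable entries rather than appending None (a None entry would make `logroot in f` raise a TypeError), and it has not been tried on WSL v1.

```python
import errno
import os


def list_fd_targets(pid, proc_root="/proc"):
    """Return the readlink targets of a process's open file descriptors."""
    base = os.path.join(proc_root, str(pid), "fd")
    targets = []
    try:
        names = os.listdir(base)
    except FileNotFoundError:
        # The process (or /proc itself) is not there; report nothing open.
        return targets
    for name in names:
        try:
            targets.append(os.readlink(os.path.join(base, name)))
        except OSError as err:
            # A descriptor may be closed between listdir() and readlink().
            # Compare integer codes; errno.errorcode[...] holds strings.
            if err.errno not in (errno.ENOENT, errno.EBADF):
                raise
    return targets
```

With a list of plain strings coming back, the `if logroot in f` loop in __init__ can stay as it is.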

plotman status is fixed:

root@Plotter1-Chia:/opt/chia/venv/lib/python3.8/site-packages/plotman# plotman status
 plot id   plotter    k           tmp           dst   wall   phase   tmp     pid   stat      mem   user    sys   io
f9f067ac    madmax   32   /plotting3/   /plotting3/   0:31     3:5   75G   28575    SLP   12.5Gi   7:16   0:55   0s

Total jobs: 1
Jobs in /plotting3/: 1

Updated at: Mon Nov 21 21:21:21 2022

The main issue of this discussion remains. When global_stagger_m times out, it starts plotting a new plot regardless of whether the last one has reached stage 5 or is done. In Machinaris, plotman continues to wait for the first plot to reach stage 5 or finish.

global_stagger_m set to 3

...sleeping 20 s: stagger (145s/180s)
...sleeping 20 s: stagger (165s/180s)
Starting plot job: chia_plot -n 1 -k 32 -r 24 -u 256 -x 8444 -t /plotting3/ -d /plotting3/ -2 /ramdrive/ -v 256 -K 1 -f a3f7e719a508879c16144b615e76aeab1c2bea66b47b690aea33c9e5fed591ad4476b573f2e3dddfe84e923adeae6449 -c xch1ekslymyyfrrv50nptterxuxgmwvw0vc8xz848sf3gzfhse7e8thqmjlqj4 ; logging to /root/.chia/plotman/logs/2022-11-21T23_09_43.477299-05_00.plot.log

root@Plotter1-Chia:/opt/chia/venv/lib/python3.8/site-packages/plotman# plotman status
 plot id   plotter    k           tmp           dst   wall   phase   tmp     pid   stat     mem   user    sys   io
388865af    madmax   32   /plotting3/   /plotting3/    55s     1:2    6G   32233    SLP   8.0Gi   0:12   0:02   0s

Total jobs: 1
Jobs in /plotting3/: 1

Updated at: Mon Nov 21 23:07:33 2022
root@Plotter1-Chia:/opt/chia/venv/lib/python3.8/site-packages/plotman# ps -x | grep "plotman \| chia_plot \| rsync"
32233 ?        SNsl  23:23 chia_plot -n 1 -k 32 -r 24 -u 256 -x 8444 -t /plotting3/ -d /plotting3/ -2 /ramdrive/ -v 256 -K 1 -f a3f7e719a508879c16144b615e76aeab1c2bea66b47b690aea33c9e5fed591ad4476b573f2e3dddfe84e923adeae6449 -c xch1ekslymyyfrrv50nptterxuxgmwvw0vc8xz848sf3gzfhse7e8thqmjlqj4
32353 tty1     S      0:00 /opt/chia/venv/bin/python3.8 /opt/chia/venv/bin/plotman plot
32360 tty2     S      0:00 grep --color=auto plotman \| chia_plot \| rsync
root@Plotter1-Chia:/opt/chia/venv/lib/python3.8/site-packages/plotman# plotman status
 plot id   plotter    k           tmp           dst   wall   phase   tmp     pid   stat     mem   user    sys   io
388865af    madmax   32   /plotting3/   /plotting3/   0:02     1:3   25G   32233    SLP   9.9Gi   0:43   0:05   0s

Total jobs: 1
Jobs in /plotting3/: 1

Updated at: Mon Nov 21 23:09:26 2022
root@Plotter1-Chia:/opt/chia/venv/lib/python3.8/site-packages/plotman# plotman status
 plot id   plotter    k           tmp           dst   wall   phase   tmp     pid   stat      mem   user    sys   io
ff26f3ec    madmax   32   /plotting3/   /plotting3/    15s     1:1     0   32449    RUN    2.6Gi    39s    21s   0s
388865af    madmax   32   /plotting3/   /plotting3/   0:03     1:3   31G   32233    SLP   10.0Gi   0:52   0:06   0s

Total jobs: 2
Jobs in /plotting3/: 2

Updated at: Mon Nov 21 23:09:58 2022

1 reply

altendky Nov 22, 2022
Collaborator

As far as compensating for psutil not working, have you looked upstream in psutil at how to fix that well? The reason to use libraries like psutil is exactly to encapsulate these platform details in one place instead of spreading them across every application.

The tracebacks sound like just another place we need to handle the race condition of requesting some info from psutil that may get out of date before we use it. There should be some existing examples of exception suppression around this. Any interest in making a PR? If not, I might manage to find some time for this, we'll see.
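A rough sketch of what that kind of suppression could look like around the open_files() call from the traceback. The helper name `open_file_paths` and its `ignored` parameter are hypothetical, and plotman's existing handling may well be shaped differently. The helper only relies on the object having an open_files() method whose results carry a .path attribute, so it runs without psutil installed; with psutil one would probably pass something like `ignored=(OSError, psutil.NoSuchProcess, psutil.AccessDenied)`.

```python
def open_file_paths(proc, ignored=(OSError,)):
    """Return paths of a process's open files, or [] if the lookup races.

    `proc` is expected to behave like psutil.Process: open_files() returns
    objects with a .path attribute. The process or one of its descriptors
    can disappear mid-call, so the listed exceptions are treated as
    "nothing found this time" and the caller simply retries on its next pass.
    """
    try:
        return [f.path for f in proc.open_files()]
    except ignored:
        return []
```

Returning an empty list means a job may briefly show no log file, which seems preferable to the whole 'plotman plot' loop exiting.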
