How can I setup Visdom on a remote server using slurm? #828

Open
neuronphysics opened this issue Dec 12, 2021 · 1 comment


neuronphysics commented Dec 12, 2021

I want to use Visdom to visualize the results of a deep learning model that I have been training on a remote cluster. First, do I need a special command-line option when connecting to the cluster via ssh in order to see the Visdom plots?

In my Slurm script I used the following command:
python -u script.py --visdom_server "http://ncc1.clients.dur.ac.uk" --visdom_port 8098
and in my Python script:

#Plotting on remote server
import visdom
cfg = {"server": "ncc1.clients.dur.ac.uk",
       "port": 8098}
vis = visdom.Visdom('http://' + cfg["server"], port = cfg["port"])

win = None

def update_viz(epoch, loss, title):
    global win

    if win is None:
        title = title

        win = viz.line(
            X=np.array([epoch]),
            Y=np.array([loss]),
            win=title,
            opts=dict(
                title=title,
                fillarea=True
            )
        )
    else:
        viz.line(
            X=np.array([epoch]),
            Y=np.array([loss]),
            win=win,
            update='append'
        )

I got this error:

requests.exceptions.InvalidURL: Failed to parse: http://http::8098/env/main
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Visdom python client failed to establish socket to get messages from the server. This feature is optional and can be disabled by initializing Visdom with `use_incoming_socket=False`, which will prevent waiting for this request to timeout.
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
script.py:41: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  params['w'].append(nn.Parameter(torch.tensor(Normal(torch.zeros(n_in, n_out), std * torch.ones(n_in, n_out)).rsample(), requires_grad=True, device=device)))
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
script.py:42: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  params['b'].append(nn.Parameter(torch.tensor(torch.mul(bias_init, torch.ones([n_out,])), requires_grad=True, device=device)))
script.py:292: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.exp(torch.lgamma(torch.tensor(a, dtype=torch.float, requires_grad=True).to(device=local_device)) + torch.lgamma(torch.tensor(b, dtype=torch.float, requires_grad=True).to(device=local_device)) - torch.lgamma(torch.tensor(a+b, dtype=torch.float, requires_grad=True).to(device=local_device)))
script.py:679: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /opt/conda/conda-bld/pytorch_1631630815121/work/torch/csrc/utils/python_arg_parser.cpp:1025.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
Port could not be cast to integer value as ':8098'
on_close() takes 1 positional argument but 3 were given
Traceback (most recent call last):
  File "script.py", line 871, in <module>
    update_viz(epoch, elbo2.item(),' Loss by Epoch')
  File "script.py", line 736, in update_viz
    win = viz.line(
NameError: name 'viz' is not defined

How can I run my plotting script on a remote server? Is there any way to do this? Thanks.

JackUrb (Contributor) commented Dec 13, 2021

Hi @neuronphysics, one way to manage this kind of setup is with an ssh tunnel, so that your script can still log to localhost at the port you tunnel. A tunnel isn't required to get a remote server working, but it does make the setup behave as if the server and the plotting script were running on the same machine.
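As a rough illustration of the tunnel idea, the sketch below only builds (and prints) an ssh local port-forward command; it does not run it. The login string `user@cluster.example.org` and the helper name `tunnel_command` are hypothetical placeholders, and it assumes a Visdom server is already listening on port 8098 on the machine you ssh into. On a Slurm cluster the server may live on a different node, which is what the `target_host` argument is for.

```python
import shlex


def tunnel_command(login, port=8098, target_host="localhost"):
    """Build an ssh local port-forward command (hypothetical helper).

    -N : do not run a remote command, just forward ports
    -L : forward local <port> to <target_host>:<port> as seen from the ssh host
    """
    forward = "{0}:{1}:{0}".format(port, target_host)
    return ["ssh", "-N", "-L", forward, login]


# Placeholder login; replace with your own cluster account and host.
cmd = tunnel_command("user@cluster.example.org")
print(shlex.join(cmd))
```

With such a tunnel open, pointing a browser or a Visdom client at localhost on the forwarded port should reach the remote server, under the assumptions stated above.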

That being said, it seems something isn't quite right with your underlying setup:

Failed to parse: http://http::8098/env/main

You can see here how we parse the incoming domain and configuration details:

parsed_url = urlparse(server)
if not parsed_url.scheme:
    parsed_url = urlparse('http://{}'.format(server))
self.server_base_name = parsed_url.netloc
self.server = urlunparse((parsed_url.scheme,
                          parsed_url.netloc, '', '', '', ''))
self.endpoint = endpoint
self.port = port
# preprocess base_url
self.base_url = base_url if base_url != "/" else ""
assert self.base_url == '' or self.base_url.startswith('/'), \
    'base_url should start with /'
assert self.base_url == '' or not self.base_url.endswith('/'), \
    'base_url should not end with / as it is appended automatically'

It might be worthwhile to add some print statements to understand why we end up with http://http::8098/env/main as the final address, rather than the http://ncc1.clients.dur.ac.uk:8098/env/main you may expect.
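One way to investigate without touching the library is to replay the quoted parsing logic with the standard library alone. The sketch below is a guess at the cause, not a confirmed diagnosis: it assumes the server string reaches the client with the scheme already present (as in the `--visdom_server "http://ncc1.clients.dur.ac.uk"` argument) and then has `'http://'` prepended again. The helper names `normalize_server` and `env_url` are made up for this example, and the final URL layout is inferred from the error message rather than taken from the Visdom source.

```python
from urllib.parse import urlparse, urlunparse


def normalize_server(server):
    """Replay the scheme/netloc handling from the quoted snippet."""
    parsed_url = urlparse(server)
    if not parsed_url.scheme:
        parsed_url = urlparse('http://{}'.format(server))
    return urlunparse((parsed_url.scheme, parsed_url.netloc, '', '', '', ''))


def env_url(server, port, env="main"):
    # Assumed layout "<server>:<port>/env/<env>", inferred from the error text.
    return "{}:{}/env/{}".format(normalize_server(server), port, env)


# Scheme prepended twice: the netloc becomes "http:" and the real host is lost.
doubled = 'http://' + "http://ncc1.clients.dur.ac.uk"
print(env_url(doubled, 8098))

# Scheme present exactly once: the host survives.
print(env_url("http://ncc1.clients.dur.ac.uk", 8098))
```

If this matches what your script does, passing the server string with the scheme exactly once should give the expected address. Separately, the `NameError` at the end of the traceback comes from the client being assigned to `vis` while `update_viz` calls `viz.line`; using a single name for both resolves that part.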
