
[Help Request] traci #2139

Closed
knightcalvert opened this issue Jan 19, 2024 · 11 comments · Fixed by #2140
Labels
help wanted Extra attention is needed

Comments

@knightcalvert

High Level Description

I noticed I have the same problem as #2127, so I updated to the latest SMARTS version, but the problems still exist.
Problem 1:

Could not connect to TraCI server at localhost:59573 [Errno 111] Connection refused
 Retrying in 0.05 seconds
Could not connect to TraCI server at localhost:59573 [Errno 111] Connection refused
 Retrying in 0.05 seconds
Could not connect to TraCI server at localhost:59573 [Errno 111] Connection refused
 Retrying in 0.05 seconds

In the beginning, TraCI tried to connect to different ports. However, after running for 10 hours, TraCI only retried the same port with constant failure, so my code got stuck and I had to rerun it.

Problem 2:
This problem is like #2127, caused by

  File "./smart_master/SMARTS/smarts/core/sumo_traffic_simulation.py", line 239, in _initialize_traci_conn
    self._traci_conn.setOrder(0)
TypeError: 'NoneType' object is not callable

So my SMARTS can't reset successfully. I updated the code, but this problem still occurs occasionally.

Also, can I turn off this TraCI warning? 90% of my console output is TraCI warnings, and I can't find the info I really need. Thank you very much.

Version

The latest version

Operating System

Ubuntu

Problems

No response

@knightcalvert knightcalvert added the help wanted Extra attention is needed label Jan 19, 2024
@Gamenot
Collaborator

Gamenot commented Jan 19, 2024

Hello @knightcalvert, we are currently using this method to acquire port numbers for SUMO.

I did a bit of digging into SUMO port creation and shutdown. I think perhaps this line killing SUMO is preventing the destructor from being called and SUMO's used port from being cleaned up:

self._sumo_proc.kill()

I'll try tomorrow to see if I can apply a cleaner shutdown without blocking on sumo closing.
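
For illustration, a cleaner shutdown along those lines could look roughly like this sketch (a hedged example, not the actual sumo_traffic_simulation.py code):

    import subprocess

    def shutdown_sumo(sumo_proc: subprocess.Popen, timeout: float = 5.0):
        # Ask SUMO to exit cleanly, but still guarantee the process is reaped.
        if sumo_proc.poll() is not None:
            return  # SUMO already exited
        sumo_proc.terminate()  # let SUMO run its shutdown path and release its TraCI port
        try:
            sumo_proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            sumo_proc.kill()   # hard fallback so no SUMO process is left running
            sumo_proc.wait()   # reap it so it cannot linger as a zombie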

I will also squash the messages. They are related to one of SUMO's methods that uses print for warnings:

https://github.com/eclipse-sumo/sumo/blob/56aceb87d847397941936c28934f9097e0c03f98/tools/traci/main.py#L87-L106
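
As a workaround sketch only (not the change committed in SMARTS), those print-based retry warnings can also be silenced on the caller's side by redirecting stdout around the connect call:

    import contextlib
    import io

    import traci  # SUMO's TraCI client

    def quiet_connect(port: int, host: str = "localhost"):
        # traci.connect() reports "Could not connect ... Retrying" via print(),
        # so swallowing stdout for the duration of the call hides those lines.
        with contextlib.redirect_stdout(io.StringIO()):
            return traci.connect(port=port, host=host)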

@Gamenot
Collaborator

Gamenot commented Jan 19, 2024

@knightcalvert SUMO logging should now be squashed: 16665d5

@Gamenot
Collaborator

Gamenot commented Jan 19, 2024

An update: kill() was intended to prevent zombie SUMO processes; without it, some SUMO zombie processes can be left behind. I am trying to find a solution that closes SUMO processes gracefully but does not leave behind zombie processes.

@Gamenot
Collaborator

Gamenot commented Jan 20, 2024

#2140 is intended to fix the issue. From current stress testing, kill() does not leave behind zombie processes or lost ports. Processes exiting without closing the SUMO process will cause zombie processes, mainly in the case of an exception or closing a process without calling SMARTS.destroy() inside it. I am going to run a second set of stress tests to see if reusing a single process causes a problem.

@knightcalvert One thing to note: even after this, if you are using SMARTS directly, make sure that each SMARTS instance calls destroy() at the end of its use to guarantee cleanup of resources. If using the gym-style environment, close() is necessary.
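
For illustration, the cleanup pattern could look roughly like this (a hedged sketch; `make_smarts_env` is a hypothetical stand-in for however you construct your gym-style environment):

    def run_episodes(make_smarts_env, n_episodes: int = 10):
        env = make_smarts_env()
        try:
            for _ in range(n_episodes):
                env.reset()
                # ... step the environment with your agent's actions ...
        finally:
            # Runs even on exceptions, so the underlying SUMO/TraCI resources
            # are released; with the core API this is where SMARTS.destroy()
            # would go instead.
            env.close()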

@knightcalvert
Author

After updating the code, it helps; the useless output is gone.
But maybe because of my parallel runs, I still got:

[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]

And my code is using close():

    def close(self):
        self.base_env.close()

I'm not familiar with TraCI or ports, and am just curious: if one port could not connect after many tries, it seems useless to keep trying that port. Can I try connecting to another one?

@Gamenot
Collaborator

Gamenot commented Jan 22, 2024

After updating the code, it helps; the useless output is gone.
But maybe because of my parallel runs, I still got:

[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]

I think I understand the issue better now. It does not seem to be about the number of ports, but about the SUMO server and SMARTS somehow not pairing.

My only thought is that somehow this is happening:

[diagram: sumo_problem]

Once a connection is established, SMARTS does not check to see if its instance of the TraCI server (SUMO) is still alive. So, I think this might be the result of an extremely low-probability race condition where a SMARTS instance manages to connect to a different TraCI server by bad luck, then locks out the actual owner of that TraCI server.

We get ports by asking the OS for a random free port (out of 64512 standard ports), so the chance of ports colliding is very low but possible.
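
In essence the port suggestion works like the sketch below (simplified, not the exact SMARTS utility); the small race window exists because the socket is closed before SUMO actually binds the port, so two processes can very occasionally be handed the same number.

    import socket

    def suggest_free_port() -> int:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("localhost", 0))   # port 0 asks the OS for any unused port
            return s.getsockname()[1]  # free right now, but unbound again once we return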

And my code is using close():

    def close(self):
        self.base_env.close()

I'm not familiar with TraCI or ports, and am just curious: if one port could not connect after many tries, it seems useless to keep trying that port. Can I try connecting to another one?

It was assumed that it would connect or retry with a different port. I will put in a patch that will reattempt with a different port while I think of a way to gracefully handle the root cause.
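
Conceptually the retry behaviour would be something like this sketch (a hedged example only, not the patch itself; `start_sumo` is a hypothetical helper that launches SUMO with --remote-port set):

    from traci.exceptions import FatalTraCIError
    import traci

    def connect_with_port_rotation(start_sumo, max_port_attempts: int = 3):
        for _ in range(max_port_attempts):
            port = suggest_free_port()  # see the port sketch above
            proc = start_sumo(port)
            try:
                return traci.connect(port=port, numRetries=10), proc
            except FatalTraCIError:
                proc.kill()   # abandon this port and try a fresh one
                proc.wait()
        raise RuntimeError("Could not establish a TraCI connection on any port")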

Problem 2:

Does this still happen?

Gamenot added a commit that referenced this issue Jan 22, 2024
@Gamenot
Collaborator

Gamenot commented Jan 22, 2024

I'm not familiar with TraCI or ports, and am just curious: if one port could not connect after many tries, it seems useless to keep trying that port. Can I try connecting to another one?

I have attempted a patch 0930736 for that now in #2140. It retries with a different port and saves the TraCI server on a stolen connection to avoid interrupting a different instance.

I will need to do a follow-up fix.

@knightcalvert
Author

My last run got stuck at 401027 episodes. As far as I know, problem 2 has not happened again. I roughly understand what you are saying; it seems like if I reduce the number of parallel runs, the chance of port collisions will reduce too?
Thank you very much for your constant attention.

@Gamenot
Collaborator

Gamenot commented Jan 23, 2024

My last run got stuck at 401027 episodes. As far as I know, problem 2 has not happened again. I roughly understand what you are saying; it seems like if I reduce the number of parallel runs, the chance of port collisions will reduce too? Thank you very much for your constant attention.

Honestly, it would reduce the chances but not completely prevent it.

I am pursuing a different solution that uses a centralised server to prevent port collisions (at least between SUMO instances). As of 94da02f it looks like this:

## console 1 (or in background OR on remote machine)
# Run the centralized sumo port management server.
# Use `export SMARTS_SUMO_CENTRAL_PORT=62232` or `--port 62232`
$ python -m smarts.core.utils.centralized_traci_server
## console 2
## Set environment variable to switch to the server.
$ export SMARTS_SUMO_TRACI_SERVE_MODE=central
## Not required, but may be set explicitly
# export SMARTS_SUMO_CENTRAL_HOST=localhost
# export SMARTS_SUMO_CENTRAL_PORT=62232
## do run
$ python experiment.py

It works as-is right now, but when I get it working better I will likely integrate the server generation into the main process and set it as the default behaviour. I think I will also eventually use the server as a pool of SUMO processes, which may also speed up training somewhat.

@Gamenot
Collaborator

Gamenot commented Jan 25, 2024

The newest change 25789e9 resulted in no disconnects and no port collisions against 60k instances and 32 parallel experiments.

@Apoorvgarg-creator

Apoorvgarg-creator commented Sep 7, 2024

Hi @Gamenot, I am still facing the TraCI server issue.

I ran the command python -m smarts.core.utils.centralized_traci_server in one console.

Then, in a second console, I ran
export SMARTS_SUMO_TRACI_SERVE_MODE=central
and also ran the SMARTS example code.
I can only see red cars.

I get the same error, like this:

[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
