Stuck in group construction. #231
Comments
Hi Steamgjk, thank you for trying out Derecho. I checked your configuration files and found two issues. The major one is the json_layout configuration: your current layout needs 6 nodes to start the service, which is why the system keeps waiting after you start three. If you don't mind the Bar and Foo subgroups overlapping, you can do the following:
The above setting forces the Foo and Bar subgroups to overlap. The minor issue is the provider setting. As libfabric is deprecating the
Hi @songweijia, it still does not seem to work on my 3-VM cluster. I have updated the 3 cfg files to address the 2 issues (see the attached zip file), but it is still stuck there.
I just realized that you were using So, you can either try with 6 nodes, or you can use my suggestion to the json layout with |
@songweijia I then checked by commenting out code step by step, and I noticed that the Foo part is okay: after I comment out the Bar part (derecho/src/applications/demos/simple_replicated_objects_json.cpp, lines 95 to 129 in 724a1db),
the cluster runs and I can see the printed logs "Node says...". Then I commented out the Foo part (derecho/src/applications/demos/simple_replicated_objects_json.cpp, lines 59 to 93 in 724a1db)
and kept only the Bar part. This time the problem came back: after launching node-0 and node-1, I launched node-2, and then node-1 crashed. Next, I also commented out derecho/src/applications/demos/simple_replicated_objects_json.cpp, lines 109 to 128 in 724a1db,
so that only node-0 does void_future and node-1 and node-2 do not read; in that case the 3 nodes are fine. But if I only comment out derecho/src/applications/demos/simple_replicated_objects_json.cpp, lines 119 to 128 in 724a1db,
so that node-0 does void_future and node-1 reads, then node-1 still crashes after the three nodes finish constructing the group.
I am trying to run simple_replicated_objects, so I created 3 VMs in Google Cloud.
I am using the derecho.cfg in the demos/json_cfgs path. Here are my modifications:
(1) Change local ip to the corresponding VM's IP; local_ids are 0, 1, and 2 respectively
(2) Change leader ip to my VM-0 ip (local_id=0 is the leader)
(3) provider = sockets
(4) domain = ens4 (This is the NIC name of all VMs)
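For reference, the four changes above would look roughly like the fragment below for the leader (VM-0). This is only a sketch: the IP addresses are placeholders, and the section and key names (`[DERECHO]`, `leader_ip`, `local_id`, `local_ip`, `[RDMA]`) are assumed from the sample derecho.cfg rather than copied from the attached files. All other settings are left at the sample's defaults and omitted here.

```ini
[DERECHO]
# (2) every node points at VM-0, which has local_id = 0 and acts as leader
# placeholder address, replace with VM-0's real IP
leader_ip = 10.128.0.2
# (1) per-node values: 0, 1, 2 on the three VMs, each with its own IP
local_id = 0
# placeholder address, replace with this VM's real IP
local_ip = 10.128.0.2

[RDMA]
# (3) libfabric provider
provider = sockets
# (4) NIC name, the same on all three VMs
domain = ens4
```

On VM-1 and VM-2 only `local_id` and `local_ip` would differ under these assumptions; `leader_ip`, `provider`, and `domain` stay the same.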
However, after I launch the 3 VMs, the system gets stuck constructing the groups. Below is the leader VM's console log.
We can see that the other 2 VMs have successfully connected to the leader, so the IP-related settings should be correct. I am not sure what is wrong with the config (I suspect the json_layout, but I am not sure).
I attach the three cfg files for reference, and would really appreciate it if you could provide some help. Thanks!
derecho-0(leader).cfg.txt
derecho-1.cfg.txt
derecho-2.cfg.txt