thread.join() stuck in swarm.py #522

Open
jvilinsky opened this issue Jun 12, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@jvilinsky

Hi, when I run more than one Crazyflie, startup sometimes works and sometimes gets stuck on line 259, in the parallel_safe function in swarm.py. It seems as if the threads get stuck for some reason, so .join() never returns.

Function for reference:

    def parallel_safe(self, func, args_dict=None):
        """
        Execute a function for all Crazyflies in the swarm, in parallel.
        One thread per Crazyflie is started to execute the function. The
        threads are joined at the end and if one or more of the threads raised
        an exception this function will also raise an exception.

        For a more detailed description of the arguments, see `sequential()`

        :param func: The function to execute
        :param args_dict: Parameters to pass to the function
        """
        threads = []
        reporter = self.Reporter()

        for uri, scf in self._cfs.items():
            args = [func, reporter] + \
                self._process_args_dict(scf, uri, args_dict)

            thread = Thread(target=self._thread_function_wrapper, args=args)
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

        if reporter.is_error_reported():
            first_error = reporter.errors[0]
            raise Exception('One or more threads raised an exception when '
                            'executing parallel task') from first_error
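
One way to find out which thread never finishes is to join against a deadline and then list the threads that are still alive. The sketch below is not cflib code: `join_all_with_timeout`, the thread names, and the stand-in workers are hypothetical, and it only illustrates the diagnostic idea using the standard `threading` module.

```python
import threading
import time


def join_all_with_timeout(threads, timeout_s):
    """Join every thread against one shared deadline.

    Returns the names of the threads that are still running afterwards.
    (Hypothetical helper, not part of cflib.)
    """
    deadline = time.monotonic() + timeout_s
    stuck = []
    for thread in threads:
        remaining = max(0.0, deadline - time.monotonic())
        # join() with a timeout returns once the time is up,
        # even if the thread has not finished.
        thread.join(remaining)
        if thread.is_alive():
            stuck.append(thread.name)
    return stuck


# Stand-in workers: one finishes quickly, one waits on an event that is never set.
never_set = threading.Event()
workers = [
    threading.Thread(target=time.sleep, args=(0.05,), name="cf-ok", daemon=True),
    threading.Thread(target=never_set.wait, name="cf-stuck", daemon=True),
]
for worker in workers:
    worker.start()

stuck_names = join_all_with_timeout(workers, timeout_s=0.5)
print("threads still running:", stuck_names)
```

Naming each thread after its URI (an assumption here, the original code does not name its threads) would make the output point directly at the Crazyflie whose connection hangs.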
@knmcguire
Collaborator

So, this seems to be an issue in the cflib backend of Crazyswarm2? This code is not part of the Crazyswarm2 codebase itself.

@jvilinsky
Author

jvilinsky commented Jun 14, 2024

Yes, sorry, it is used in crazyflie_server.py on line 213. It gets stuck right before the creation of servers and subscriptions.

    # Now all crazyflies are initialized, open links!
    try:
        self.swarm.open_links()
    except Exception as e:
        # Close node if one of the Crazyflies can not be found
        self.get_logger().info("Error!: One or more Crazyflies can not be found. ")
        self.get_logger().info("Check if you got the right URIs, if they are turned on" +
                               " or if your script have proper access to a Crazyradio PA")
        exit()

Thanks for the quick reply!

@knmcguire
Collaborator

So there aren't any error messages? For example, one saying that it is not able to connect to one of the URIs?

@jvilinsky
Author

No error messages, which is why I'm so stuck. I narrowed it down to a problem with threading, as mentioned before: Thread.join() from the threading module blocks until the target thread finishes, so if a thread never finishes, join() never returns. It seems like there might be an infinite loop somewhere. It could also be getting stuck at self._connect_event.wait() in the open_link function that the threads run (line 90 of SyncCrazyflie.py in cflib.crazyflie).
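
To check whether a thread really is parked in an untimed `Event.wait()`, the stack of every live thread can be dumped from the main thread. The following is a minimal sketch using only the standard library; `connect_event`, the thread name `open-link`, and `stacks_by_thread_name` are stand-ins chosen for illustration and are not taken from cflib.

```python
import sys
import threading
import traceback


def stacks_by_thread_name():
    """Return {thread name: formatted stack} for every live thread."""
    frames = sys._current_frames()  # maps thread ident -> current frame
    report = {}
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is not None:
            report[thread.name] = "".join(traceback.format_stack(frame))
    return report


# Stand-in for a connection event that never fires.
connect_event = threading.Event()

# With a timeout, wait() returns False instead of blocking forever.
connected = connect_event.wait(timeout=0.1)
print("connected:", connected)

# A thread blocked in an untimed wait(), as suspected above.
blocked = threading.Thread(target=connect_event.wait, name="open-link", daemon=True)
blocked.start()
blocked.join(0.2)  # give the thread time to reach the wait

report = stacks_by_thread_name()
print(report["open-link"])  # the stack ends inside threading's wait()

connect_event.set()  # release the stand-in thread
blocked.join(1.0)
```

If the dump for a hung run shows the worker threads sitting in `wait()`, that would support the theory that the connect event is never set, rather than an infinite loop.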

@knmcguire
Collaborator

Unfortunately, I'm not able to recreate your issue... In general, it takes time for the Crazyradio to download all the parameter/log TOCs from the Crazyflies before it reports that it is fully connected, and that time is multiplied by the number of Crazyflies you connect to.

But I've never seen it get stuck before. Which OS are you running it from? Python threading is messy, and if you were running this from a VM with limited resources, I would expect some issues.

@jvilinsky
Author

That makes sense. The setup I'm using is:
OS: Ubuntu 22.04.4 LTS x86_64
CPU: 12th Gen Intel i7-12700K
GPU: NVIDIA GeForce RTX 3070 Ti
Memory: 5159MiB / 31878MiB

@knmcguire
Collaborator

Thanks for sharing the information!

This seems like a very capable computer... so I don't think that is the issue.
Which version of Python do you have installed, and which version of the cflib do you have?

@jvilinsky
Author

I have Python version 3.10.12 and, I think, cflib version 0.1.25.1.

@knmcguire
Collaborator

Alright.. that's also exactly what I have.

Unfortunately we can't recreate it at the moment, so the best option for now is just to restart the server, however ugly that solution is. I haven't seen this happen in the CI either, so perhaps some timing issues are causing this as well.

I'll keep it open here so that others can pitch in and let us know if they experience the same issue.

@knmcguire added the bug label Jul 23, 2024