
[Unitrace] Tool always aborts with an assertion error in 'UniTracer::Create' when profiling Python scripts #64

Open
xunsongh opened this issue May 20, 2024 · 4 comments

@xunsongh
I built the unitrace tool on a PVC machine with driver agama-ci-devel-hotfix-821.36, using the default configuration without MPI support, and then tried to run it on a simple Python script. It always aborts with an assertion error in UniTracer::Create.

Here are the commands I used to run the successfully built unitrace tool:

./unitrace -h python ./simple.py
./unitrace --chrome-kernel-logging --chrome-dnn-logging --chrome-ccl-logging python ./simple.py

I also tried other options, but all of them failed with the same assertion error:

python: /home/gta/pti-gpu/tools/unitrace/src/tracer.h:50: static UniTracer* UniTracer::Create(const TraceOptions&): Assertion `status == ZE_RESULT_SUCCESS' failed.
Aborted (core dumped)

My test case is as simple as it could be:

if __name__ == '__main__':
    a = 1

Could you please help check why the unitrace tool crashes on such a simple case, which is not even related to SYCL or L0?

@Sarbojit2019
Contributor

Are you able to run L0 applications successfully? Most probably UniTracer::Create is failing because an L0 call fails. The failure happens during initialization of the tool, where it interacts with L0, so it does not really matter what your app is doing :). A few things I would suggest trying:

  1. See whether any SYCL or L0 app that exercises L0 APIs runs fine in the same environment.
  2. Try to build Unitrace in the environment where you want to run it. In the past I have seen people build the tool in one environment and then run it in a different one, which caused the tool to fail.
  3. Try to find out from the assert which L0 API is failing, and collect the error number.

BTW, any chance you could try a different machine to verify the behavior?
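
One way to act on suggestions 1 and 3 without a separate L0 test app is to call zeInit directly from the same Python interpreter that unitrace fails on. The sketch below is only a minimal check, assuming the Level Zero loader is installed under the name libze_loader.so.1; that name and the helper try_ze_init are assumptions, not something confirmed in this thread, and the call that fails inside UniTracer::Create may be a different one. It returns the raw ze_result_t value, where 0 is ZE_RESULT_SUCCESS.

```python
import ctypes


def try_ze_init(loader_name="libze_loader.so.1"):
    """Call zeInit(0) through the Level Zero loader.

    Returns the ze_result_t value as an int (0 means ZE_RESULT_SUCCESS),
    or None if the loader library cannot be loaded at all.
    The default loader_name is an assumption about the local install.
    """
    try:
        loader = ctypes.CDLL(loader_name)
    except OSError:
        # The shared library was not found or could not be loaded.
        return None
    loader.zeInit.argtypes = [ctypes.c_uint32]  # ze_init_flags_t
    loader.zeInit.restype = ctypes.c_uint32     # ze_result_t
    return int(loader.zeInit(0))


if __name__ == "__main__":
    result = try_ze_init()
    if result is None:
        print("Level Zero loader could not be loaded")
    else:
        print(f"zeInit returned 0x{result:08x}")
```

Running this with the conda env's python, and again with the system python, may show whether L0 initialization itself behaves differently inside the conda environment.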

@xunsongh
Author

xunsongh commented Jun 7, 2024

Thank you for your guidance. Here are my replies to your suggestions:

  1. I can use the unitrace tool to trace all of those C++ executables; it only fails on this simple Python case.
  2. Of course. I built, ran, and tested many cases within a clean environment set up by conda.
  3. Sorry, I don't have the knowledge to track down the failing L0 API. In gdb's backtrace the top frames show as '??' without any useful information.

I had only one PVC machine available, which is where I found this issue, and unfortunately that machine broke several days ago.

@Sarbojit2019
Contributor

Regarding your response to item 1, I doubt that this is related to the Python app. Judging by the failure point, it appears to happen at the very beginning. Let's connect internally to look at the setup and the failure.

@zma2
Contributor

zma2 commented Jun 24, 2024

@xunsongh Please check the version of libstdc++.so in your conda env. If it is lower than 6.0.30, you need to upgrade it to at least 6.0.30.
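
A quick way to check this is to look at the versioned libstdc++ file names inside the environment. The sketch below assumes a Linux conda env whose libraries live under $CONDA_PREFIX/lib and that the library file follows the usual libstdc++.so.X.Y.Z naming; the helper names are hypothetical.

```python
import glob
import os
import re

MIN_VERSION = (6, 0, 30)  # minimum version suggested in this thread


def parse_libstdcxx_version(filename):
    """Extract (major, minor, patch) from a name like 'libstdc++.so.6.0.30'.

    Returns None for names without a full three-part version suffix.
    """
    match = re.search(r"libstdc\+\+\.so\.(\d+)\.(\d+)\.(\d+)$", filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def find_libstdcxx_versions(lib_dir):
    """Return the sorted versions of all libstdc++.so.X.Y.Z files in lib_dir."""
    pattern = os.path.join(glob.escape(lib_dir), "libstdc++.so.*")
    versions = {
        parse_libstdcxx_version(os.path.basename(path))
        for path in glob.glob(pattern)
    }
    versions.discard(None)  # drop symlink names such as libstdc++.so.6
    return sorted(versions)


if __name__ == "__main__":
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix is None:
        print("CONDA_PREFIX is not set; activate the conda env first")
    else:
        for version in find_libstdcxx_versions(os.path.join(prefix, "lib")):
            status = "ok" if version >= MIN_VERSION else "too old"
            print(".".join(map(str, version)), status)
```

If every version listed is reported as too old, upgrading the environment's libstdc++ is the next step.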
