
target/riscv: Ensure to handle all triggered a halt events #1171

Open
lz-bro wants to merge 1 commit into riscv from handle-all-trigger-halt
Conversation

Contributor
@lz-bro lz-bro commented Nov 20, 2024

If all current halted states are due to a halt group, then a new "triggered a halt" event has occurred.
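For readers skimming the thread: the change boils down to a check of the following shape in the "halted && running" path of riscv_openocd_poll(). This is a condensed sketch assembled from the diff fragments quoted in the review below - the riscv_get_register() helper, the local declarations, and the exact placement are assumptions, not the literal patch.

/* Condensed sketch (not the literal patch) of the new check in the
 * "halted && running" branch of riscv_openocd_poll(). It counts how many
 * of the halted harts stopped only because of their halt group; helper
 * names like riscv_get_register() are assumed from the riscv target code. */
unsigned int cause_groups = 0;
struct target_list *entry;
foreach_smp_target(entry, targets) {
    struct target *t = entry->target;
    if (t->state != TARGET_HALTED)
        continue;
    riscv_reg_t dcsr;
    if (riscv_get_register(t, &dcsr, GDB_REGNO_DCSR) != ERROR_OK)
        return ERROR_FAIL;
    if (get_field(dcsr, CSR_DCSR_CAUSE) == CSR_DCSR_CAUSE_GROUP)
        cause_groups++;
    else
        break;  /* halted for another reason - stop counting */
}
/* If every halted hart halted only because of the halt group, the group is
 * still converging: poll again later so the true halt reason (e.g. a
 * semihosting ebreak) can be discovered once the originating hart reports. */
if (halted == cause_groups)
    return ERROR_OK;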

@lz-bro lz-bro force-pushed the handle-all-trigger-halt branch from 7652128 to 21d836a Compare November 20, 2024 12:34
else
break;
}
}
Collaborator

This loop seems weird. What is its purpose? DCSR is writable, so we could occasionally trick the debugger into a wrong conclusion.

Halt groups are an optional feature, and I'm quite confused that we don't check for them.

Could you please provide a test scenario to reproduce your issue? Is it possible to use spike to model it, or do you need specific HW?

Contributor Author
@lz-bro lz-bro Nov 20, 2024

This bug was discovered when I was testing semihosting on our hardware with SMP enabled. It fails when the following happens: all of the currently observed halted states are due to a halt group, while the other harts' recorded state is still running. In fact, one hart had halted, which caused the other harts to halt because of the halt group.

} else if (halted && running) {
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);
riscv_halt(target);
} else {

If there is such a halted hart but its recorded status is running, riscv_semihosting is never processed:
if (halt_reason == RISCV_HALT_EBREAK) {
int retval;
/* Detect if this EBREAK is a semihosting request. If so, handle it. */
switch (riscv_semihosting(target, &retval)) {
case SEMIHOSTING_NONE:
break;
case SEMIHOSTING_WAITING:
/* This hart should remain halted. */
*next_action = RPH_REMAIN_HALTED;
break;
case SEMIHOSTING_HANDLED:
/* This hart should be resumed, along with any other
* harts that halted due to haltgroups. */
*next_action = RPH_RESUME;
return ERROR_OK;
case SEMIHOSTING_ERROR:
return retval;
}
}

Contributor Author

@aap-sc I think this is a bug; could you provide some suggestions?

Collaborator

@lz-bro I'm still trying to understand your reasoning and what the issue is exactly (the situation is still not quite obvious to me). It will take a couple of days - I'll ask additional questions if necessary.

Collaborator

I believe I have figured out the code thanks to the detailed description that @zqb-all has provided:

If the harts are just in the process of halting due to being members of the halt group, we should wait until they finish halting, so that the true halt reason - for instance, a semihosting request - can be discovered and then handled correctly.

@JanMatCodasip
Collaborator

@lz-bro I am afraid I have not understood what case this merge request addresses, not even after reading the commit description and the discussion so far.

Please could you provide a very clear description in the commit message? Doing so will help:

  • Code reviewers to understand your changes (and have them merged)
  • Other users of OpenOCD source code - both current and future.

Thank you.

@zqb-all
Contributor

zqb-all commented Nov 29, 2024

Let me try to explain this issue.
Suppose we have two cores, core0 and core1, belonging to one SMP group.
OpenOCD constantly calls riscv_openocd_poll to check their status. (A condensed sketch of the dispatch logic these cases walk through is at the end of this comment.)

When OpenOCD calls riscv_openocd_poll on core0:

A. If no hardware state change occurs, the sequence is:
A.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
A.2. riscv_poll_hart checks core0. core0's status is running, without change. next_action=RPH_NONE, running++
A.3. riscv_poll_hart checks core1. core1's status is running, without change. next_action=RPH_NONE, running++
A.4. Finally, should_remain_halt=0, should_resume=0, halted=0, and running=2; nothing happens.

B. If core0 hits a software breakpoint on the hardware, one possible sequence is:
B.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
B.2. riscv_poll_hart checks core0. core0's status is running, without change. next_action=RPH_NONE, running++
[Between B.2 and B.3, core0 hits the breakpoint and becomes halted, and core1 also halts immediately due to the hardware haltgroup]
B.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The reason is RISCV_HALT_GROUP. This results in next_action=RPH_NONE, halted++
B.4. Finally, should_remain_halt=0, should_resume=0, halted=1, running=1. We enter the else if (halted && running) branch and call riscv_halt to halt all cores in the SMP group. In the process, core0's recorded state is also corrected to halted, so next time riscv_openocd_poll will consider core0's status unchanged.
There is no problem in this case.

Let's re-assume that core0/core1 are both running and consider case C.

C. If core0 hits a semihosting ebreak on the hardware, one possible sequence is:
C.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
C.2. riscv_poll_hart checks core0. core0's status is running, without change. next_action=RPH_NONE, running++
[Between C.2 and C.3, core0 hits the semihosting ebreak and halts, and core1 also halts immediately due to the hardware haltgroup]
C.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The reason is RISCV_HALT_GROUP. This results in next_action=RPH_NONE, halted++
C.4. Finally, should_remain_halt=0, should_resume=0, halted=1, running=1. We enter the else if (halted && running) branch and call riscv_halt to halt all cores in the SMP group. In the process, core0's recorded state is also corrected to halted, so next time riscv_openocd_poll will consider core0's status unchanged.
Now here's the problem: since core0's status did not change, the next poll will not enter the if (halt_reason == RISCV_HALT_EBREAK) branch, so OpenOCD never realizes that core0 actually hit a semihosting ebreak and will not handle it. core0 and core1 will remain halted.

Let's re-assume that core0/core1 are both running and consider case D.

D. If core0 hits a semihosting ebreak on the hardware, but the timing is earlier than in case C, one possible sequence is:
D.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
[Between D.1 and D.2, core0 hits the semihosting ebreak and halts, and core1 also halts immediately due to the hardware haltgroup]
D.2. riscv_poll_hart checks core0 and finds that core0 has changed from running to halted. Because halt_reason is RISCV_HALT_EBREAK, semihosting is further checked and processed; after that, one possible result is next_action=RPH_RESUME, should_resume++
D.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The reason is RISCV_HALT_GROUP. This results in next_action=RPH_NONE, halted++
D.4. Finally, should_remain_halt=0, should_resume=1, halted=1, running=0. We enter the else if (should_resume) branch, call riscv_resume, and eventually core0 and core1 return to the running state.
There is no problem in this case.

Thank you.
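To map these cases onto the code, the accounting they walk through looks roughly like this. This is a heavily simplified sketch: the counter and action names follow the excerpts above, while the enum name riscv_next_action and the exact riscv_poll_hart() signature are assumptions.

/* Heavily simplified sketch of the per-SMP-group accounting that cases A-D
 * walk through. Counter and action names follow the excerpts above; the
 * enum name and the riscv_poll_hart() signature are assumptions. */
unsigned int should_remain_halt = 0, should_resume = 0;
unsigned int halted = 0, running = 0;
struct target_list *entry;
foreach_smp_target(entry, targets) {
    struct target *t = entry->target;
    enum riscv_next_action next_action;  /* hypothetical enum name */
    if (riscv_poll_hart(t, &next_action) != ERROR_OK)
        return ERROR_FAIL;
    if (next_action == RPH_REMAIN_HALTED)
        should_remain_halt++;            /* e.g. semihosting still waiting */
    else if (next_action == RPH_RESUME)
        should_resume++;                 /* e.g. semihosting handled (case D) */
    else if (t->state == TARGET_HALTED)
        halted++;                        /* RPH_NONE and hart is halted */
    else
        running++;                       /* RPH_NONE and hart is running */
}
if (should_remain_halt)
    ;                                    /* keep the group halted */
else if (should_resume)
    riscv_resume(target, true, 0, 0, 0, false);  /* case D: resume all */
else if (halted && running)
    /* Cases B and C: the group is out of sync, so halt everyone. The bug is
     * that this path also corrects core0's recorded state to halted, so a
     * semihosting ebreak seen in case C is never examined on the next poll. */
    riscv_halt(target);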

@zqb-all
Contributor

zqb-all commented Nov 30, 2024

Things are a bit complicated.
The software's poll function takes time to run, and a hart may hit an ebreak (breakpoint or semihosting) at any time. The other harts in the same SMP group may then halt immediately (by a hardware haltgroup) or keep running temporarily (for example, when a hart is controlled by another DM and there is no hardware support for synchronous halting).
We expect the poll function to eventually adjust the whole SMP group to halted or running. But before the software completes this processing, the temporarily running harts may also hit an ebreak, making the situation even more complicated.

@zqb-all
Contributor

zqb-all commented Dec 12, 2024

@JanMatCodasip @aap-sc Does my description help you understand what the issue is?

@JanMatCodasip
Collaborator

@zqb-all Thank you for describing the situation in more detail. It will take me some time to get back to it.

Collaborator
@JanMatCodasip JanMatCodasip left a comment

Hi @zqb-all, thank you for the clear and accurate description of the situation that you're trying to address. It helped me to understand the code change. I am also sorry for replying late.

Overall this fix looks good and I have posted some comments.

@@ -3790,6 +3791,22 @@ int riscv_openocd_poll(struct target *target)
LOG_TARGET_DEBUG(target, "resume all");
riscv_resume(target, true, 0, 0, 0, false);
} else if (halted && running) {
foreach_smp_target(entry, targets)
Collaborator

Suggested change
foreach_smp_target(entry, targets)
LOG_TARGET_DEBUG(target, "SMP group is in inconsistent state: %u halted, %u running", halted, running);
/* The SMP group is in an inconsistent state - some harts in the group have halted
 * whereas others are running. The reasons for that (and the corresponding
 * OpenOCD actions) could be:
 * 1) The targets are in the process of halting due to halt groups
 *    but not all of them have halted yet --> wait a moment and then poll again so that
 *    the halt reason of every hart can be accurately determined (e.g. semihosting).
 * 2) The targets do not support halt groups --> OpenOCD must halt
 *    the remaining harts by a standard halt request.
 * 3) The hart states got out of sync for some other unknown reason (a problem?) -->
 *    same as the previous case - try to halt the harts by a standard halt request
 *    to get them back in sync. */
/* Detect if the harts are just in the process of halting due to a halt group */
foreach_smp_target(entry, targets)

Contributor Author

Addressed

if (get_field(dcsr, CSR_DCSR_CAUSE) == CSR_DCSR_CAUSE_GROUP)
cause_groups++;
else
break;
Collaborator

Suggested change
break;
/* This hart has halted due to something other than a halt group.
 * Don't continue checking the rest - exit early. */
break;

Contributor Author

Addressed

Comment on lines 3807 to 3808
if (halted == cause_groups)
return ERROR_OK;
Collaborator

Suggested change
if (halted == cause_groups)
	return ERROR_OK;
if (halted == cause_groups) {
	LOG_TARGET_DEBUG(target, "The harts appear to be just in the process of halting due to a halt group. Giving them more time - will poll their state later.");
	return ERROR_OK;
}

@@ -3790,6 +3791,22 @@ int riscv_openocd_poll(struct target *target)
LOG_TARGET_DEBUG(target, "resume all");
riscv_resume(target, true, 0, 0, 0, false);
} else if (halted && running) {
foreach_smp_target(entry, targets)
Collaborator

I wonder if this loop can be skipped if haltgroup_supported is equal to false.
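One way to honor that would be a guard up front, sketched below as a helper; dm_supports_halt_groups() is a hypothetical predicate (OpenOCD would need to record the capability during examination), and the helper's shape is illustrative only.

/* Sketch: return true iff every halted hart in the SMP group halted only
 * because of its halt group. When halt groups are unsupported we return
 * false right away, so the caller falls through to a plain riscv_halt(). */
static bool smp_halted_by_group_only(struct target *target,
		struct target_list *targets)
{
	if (!dm_supports_halt_groups(target))  /* hypothetical capability check */
		return false;

	struct target_list *entry;
	foreach_smp_target(entry, targets) {
		struct target *t = entry->target;
		if (t->state != TARGET_HALTED)
			continue;
		riscv_reg_t dcsr;
		if (riscv_get_register(t, &dcsr, GDB_REGNO_DCSR) != ERROR_OK)
			return false;
		if (get_field(dcsr, CSR_DCSR_CAUSE) != CSR_DCSR_CAUSE_GROUP)
			return false;
	}
	return true;
}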

Comment on lines 3810 to 3833
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);
Collaborator

Suggested change
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);
/* Halting the whole SMP group to bring it in sync. */
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);

Contributor Author

Addressed

Comment on lines 3807 to 3808
if (halted == cause_groups)
return ERROR_OK;
Collaborator

(1) I would recommend keeping a counter of how many times this situation has occurred (a sketch follows at the end of this comment). If it reaches a certain limit, OpenOCD should:

  • stop waiting for the harts to halt,
  • print an error to the user, and
  • try to halt all the harts directly (riscv_halt()).

(2) Instead of returning from riscv_openocd_poll() and waiting for the next poll interval, it would perhaps be better to create a loop inside this function. This would allow OpenOCD to immediately re-poll the hart state and react quickly.

  • @aap-sc @zqb-all What is your opinion on that?
  • If you are in favor of it but don't feel like implementing it in this PR, consider adding a TODO comment.
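A minimal sketch of suggestion (1), written as a fragment that would replace the early return; the limit value, the counter's storage, and the log wording are all assumptions.

/* Hypothetical cap on how many consecutive polls may find the group still
 * "in the process of halting" before we stop waiting. The value is arbitrary. */
#define HALT_GROUP_SETTLE_POLL_LIMIT 8

static unsigned int settle_polls;  /* would live in per-DM or per-group state */

if (halted == cause_groups) {
	if (++settle_polls < HALT_GROUP_SETTLE_POLL_LIMIT)
		return ERROR_OK;  /* give the halt group more time */
	LOG_TARGET_ERROR(target, "Harts did not finish halting as a group after %u polls; halting them directly.", settle_polls);
	settle_polls = 0;
	riscv_halt(target);   /* fall back to a standard halt request */
} else {
	settle_polls = 0;     /* group consistent again - reset the counter */
}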

Contributor

create a loop inside this function. This would allow OpenOCD to immediately re-poll the hart state and react quickly.

Sounds good; this helps the SMP harts finish their state adjustment faster.

@lz-bro lz-bro force-pushed the handle-all-trigger-halt branch 2 times, most recently from 252e9cf to 8193431 Compare January 22, 2025 09:55
If the harts are just in the process of halting due to
a halt group, poll again so that the halt reason of every
hart can be accurately determined (e.g. semihosting).
@lz-bro lz-bro force-pushed the handle-all-trigger-halt branch from 8193431 to 0affedb Compare January 23, 2025 01:21