
target/riscv: Ensure to handle all triggered a halt events #1171

Open
lz-bro wants to merge 1 commit into riscv from handle-all-trigger-halt
Conversation

Contributor
@lz-bro lz-bro commented Nov 20, 2024

If all current halted states are due to a halt group, then a new "triggered a halt" event has occurred.
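For readers skimming the thread: the change boils down to a check of the following shape in the "halted && running" path of riscv_openocd_poll(). This is a condensed sketch assembled from the diff fragments quoted in the review below - the riscv_get_register() helper, the local declarations, and the exact placement are assumptions, not the literal patch.

/* Condensed sketch (not the literal patch) of the new check in the
 * "halted && running" branch of riscv_openocd_poll(). It counts how many
 * of the halted harts stopped only because of their halt group; helper
 * names like riscv_get_register() are assumed from the riscv target code. */
unsigned int cause_groups = 0;
struct target_list *entry;
foreach_smp_target(entry, targets) {
    struct target *t = entry->target;
    if (t->state != TARGET_HALTED)
        continue;
    riscv_reg_t dcsr;
    if (riscv_get_register(t, &dcsr, GDB_REGNO_DCSR) != ERROR_OK)
        return ERROR_FAIL;
    if (get_field(dcsr, CSR_DCSR_CAUSE) == CSR_DCSR_CAUSE_GROUP)
        cause_groups++;
    else
        break;  /* halted for another reason - stop counting */
}
/* If every halted hart halted only because of the halt group, the group is
 * still converging: poll again later so the true halt reason (e.g. a
 * semihosting ebreak) can be discovered once the originating hart reports. */
if (halted == cause_groups)
    return ERROR_OK;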

@lz-bro lz-bro force-pushed the handle-all-trigger-halt branch from 7652128 to 21d836a Compare November 20, 2024 12:34
else
break;
}
}
Collaborator

This loop seems weird. What is its purpose? DCSR is writable, so we could occasionally trick the debugger into a wrong conclusion.

Halt groups are an optional feature, and I'm quite confused that we don't check for them.

Could you please provide a test scenario to reproduce your issue? Is it possible to use spike to model it, or do you need specific HW?

Contributor Author
@lz-bro lz-bro Nov 20, 2024

This bug was discovered when I was testing semihosting on our hardware with SMP enabled. It fails when the following happens: all of the currently observed halted states are due to a halt group, while the other harts' recorded state is still running. In fact, one hart had halted, which caused the other harts to halt because of the halt group.

} else if (halted && running) {
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);
riscv_halt(target);
} else {

If there is such a halted hart but its recorded status is running, riscv_semihosting is never processed:
if (halt_reason == RISCV_HALT_EBREAK) {
int retval;
/* Detect if this EBREAK is a semihosting request. If so, handle it. */
switch (riscv_semihosting(target, &retval)) {
case SEMIHOSTING_NONE:
break;
case SEMIHOSTING_WAITING:
/* This hart should remain halted. */
*next_action = RPH_REMAIN_HALTED;
break;
case SEMIHOSTING_HANDLED:
/* This hart should be resumed, along with any other
* harts that halted due to haltgroups. */
*next_action = RPH_RESUME;
return ERROR_OK;
case SEMIHOSTING_ERROR:
return retval;
}
}

Contributor Author

@aap-sc I think this is a bug; could you provide some suggestions?

Collaborator

@lz-bro I'm still trying to understand your reasoning and what the issue is exactly (the situation is still not quite obvious to me). It will take a couple of days - I'll ask additional questions if necessary.

Collaborator

I believe I have figured out the code thanks to the detailed description that @zqb-all has provided:

If the harts are just in the process of halting due to being members of the halt group, we should wait until they finish halting, so that the true halt reason - for instance, a semihosting request - can be discovered and then handled correctly.

@JanMatCodasip
Collaborator

@lz-bro I am afraid I have not understood what case this merge request addresses, not even after reading the commit description and the discussion so far.

Please could you provide a very clear description in the commit message? Doing so will help:

  • Code reviewers to understand your changes (and have them merged)
  • Other users of OpenOCD source code - both current and future.

Thank you.

@zqb-all
Contributor

zqb-all commented Nov 29, 2024

Let me try to explain this issue.
Suppose we have two cores, core0 and core1, belonging to one SMP group.
OpenOCD constantly calls riscv_openocd_poll to check their status. (A condensed sketch of the dispatch logic these cases walk through is at the end of this comment.)

When OpenOCD calls riscv_openocd_poll on core0:

A. If no hardware state change occurs, the sequence is:
A.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
A.2. riscv_poll_hart checks core0. core0's status is running, without change. next_action=RPH_NONE, running++
A.3. riscv_poll_hart checks core1. core1's status is running, without change. next_action=RPH_NONE, running++
A.4. Finally, should_remain_halt=0, should_resume=0, halted=0, and running=2; nothing happens.

B. If core0 hits a software breakpoint on the hardware, one possible sequence is:
B.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
B.2. riscv_poll_hart checks core0. core0's status is running, without change. next_action=RPH_NONE, running++
[Between B.2 and B.3, core0 hits the breakpoint and becomes halted, and core1 also halts immediately due to the hardware haltgroup]
B.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The reason is RISCV_HALT_GROUP. This results in next_action=RPH_NONE, halted++
B.4. Finally, should_remain_halt=0, should_resume=0, halted=1, running=1. We enter the else if (halted && running) branch and call riscv_halt to halt all cores in the SMP group. In the process, core0's recorded state is also corrected to halted, so next time riscv_openocd_poll will consider core0's status unchanged.
There is no problem in this case.

Let's re-assume that core0/core1 are both running and consider case C.

C. If core0 hits a semihosting ebreak on the hardware, one possible sequence is:
C.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
C.2. riscv_poll_hart checks core0. core0's status is running, without change. next_action=RPH_NONE, running++
[Between C.2 and C.3, core0 hits the semihosting ebreak and halts, and core1 also halts immediately due to the hardware haltgroup]
C.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The reason is RISCV_HALT_GROUP. This results in next_action=RPH_NONE, halted++
C.4. Finally, should_remain_halt=0, should_resume=0, halted=1, running=1. We enter the else if (halted && running) branch and call riscv_halt to halt all cores in the SMP group. In the process, core0's recorded state is also corrected to halted, so next time riscv_openocd_poll will consider core0's status unchanged.
Now here's the problem: since core0's status did not change, the next poll will not enter the if (halt_reason == RISCV_HALT_EBREAK) branch, so OpenOCD never realizes that core0 actually hit a semihosting ebreak and will not handle it. core0 and core1 will remain halted.

Let's re-assume that core0/core1 are both running and consider case D.

D. If core0 hits a semihosting ebreak on the hardware, but the timing is earlier than in case C, one possible sequence is:
D.1. Check that core0 belongs to SMP and start checking the status of all SMP cores.
[Between D.1 and D.2, core0 hits the semihosting ebreak and halts, and core1 also halts immediately due to the hardware haltgroup]
D.2. riscv_poll_hart checks core0 and finds that core0 has changed from running to halted. Because halt_reason is RISCV_HALT_EBREAK, semihosting is further checked and processed; after that, one possible result is next_action=RPH_RESUME, should_resume++
D.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The reason is RISCV_HALT_GROUP. This results in next_action=RPH_NONE, halted++
D.4. Finally, should_remain_halt=0, should_resume=1, halted=1, running=0. We enter the else if (should_resume) branch, call riscv_resume, and eventually core0 and core1 return to the running state.
There is no problem in this case.

Thank you.
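To map these cases onto the code, the accounting they walk through looks roughly like this. This is a heavily simplified sketch: the counter and action names follow the excerpts above, while the enum name riscv_next_action and the exact riscv_poll_hart() signature are assumptions.

/* Heavily simplified sketch of the per-SMP-group accounting that cases A-D
 * walk through. Counter and action names follow the excerpts above; the
 * enum name and the riscv_poll_hart() signature are assumptions. */
unsigned int should_remain_halt = 0, should_resume = 0;
unsigned int halted = 0, running = 0;
struct target_list *entry;
foreach_smp_target(entry, targets) {
    struct target *t = entry->target;
    enum riscv_next_action next_action;  /* hypothetical enum name */
    if (riscv_poll_hart(t, &next_action) != ERROR_OK)
        return ERROR_FAIL;
    if (next_action == RPH_REMAIN_HALTED)
        should_remain_halt++;            /* e.g. semihosting still waiting */
    else if (next_action == RPH_RESUME)
        should_resume++;                 /* e.g. semihosting handled (case D) */
    else if (t->state == TARGET_HALTED)
        halted++;                        /* RPH_NONE and hart is halted */
    else
        running++;                       /* RPH_NONE and hart is running */
}
if (should_remain_halt)
    ;                                    /* keep the group halted */
else if (should_resume)
    riscv_resume(target, true, 0, 0, 0, false);  /* case D: resume all */
else if (halted && running)
    /* Cases B and C: the group is out of sync, so halt everyone. The bug is
     * that this path also corrects core0's recorded state to halted, so a
     * semihosting ebreak seen in case C is never examined on the next poll. */
    riscv_halt(target);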

@zqb-all
Contributor

zqb-all commented Nov 30, 2024

Things are a bit complicated.
The software's poll function takes time to run, and a hart may hit an ebreak (breakpoint or semihosting) at any time. The other harts in the same SMP group may then halt immediately (by a hardware haltgroup) or keep running temporarily (for example, when a hart is controlled by another DM and there is no hardware support for synchronous halting).
We expect the poll function to eventually adjust the whole SMP group to halted or running. But before the software completes this processing, the temporarily running harts may also hit an ebreak, making the situation even more complicated.

@zqb-all
Contributor

zqb-all commented Dec 12, 2024

@JanMatCodasip @aap-sc Does my description help you understand what the issue is?

@JanMatCodasip
Collaborator

@zqb-all Thank you for describing the situation in more detail. It will take me some time to get back to it.

Collaborator
@JanMatCodasip JanMatCodasip left a comment

Hi @zqb-all, thank you for the clear and accurate description of the situation that you're trying to address. It helped me to understand the code change. I am also sorry for replying late.

Overall this fix looks good and I have posted some comments.

@@ -3790,6 +3791,22 @@ int riscv_openocd_poll(struct target *target)
LOG_TARGET_DEBUG(target, "resume all");
riscv_resume(target, true, 0, 0, 0, false);
} else if (halted && running) {
foreach_smp_target(entry, targets)
Collaborator

Suggested change
foreach_smp_target(entry, targets)
LOG_TARGET_DEBUG(target, "SMP group is in inconsistent state: %u halted, %u running", halted, running);
/* The SMP group is in an inconsistent state - some harts in the group have halted
 * whereas others are running. The reasons for that (and the corresponding
 * OpenOCD actions) could be:
 * 1) The targets are in the process of halting due to halt groups
 *    but not all of them have halted yet --> wait a moment and then poll again so that
 *    the halt reason of every hart can be accurately determined (e.g. semihosting).
 * 2) The targets do not support halt groups --> OpenOCD must halt
 *    the remaining harts by a standard halt request.
 * 3) The hart states got out of sync for some other unknown reason (a problem?) -->
 *    same as the previous case - try to halt the harts by a standard halt request
 *    to get them back in sync. */
/* Detect if the harts are just in the process of halting due to a halt group */
foreach_smp_target(entry, targets)

Contributor Author

Addressed

if (get_field(dcsr, CSR_DCSR_CAUSE) == CSR_DCSR_CAUSE_GROUP)
cause_groups++;
else
break;
Collaborator

Suggested change
break;
/* This hart has halted due to something other than a halt group.
 * Don't continue checking the rest - exit early. */
break;

Contributor Author

Addressed

Comment on lines 3807 to 3808
if (halted == cause_groups)
return ERROR_OK;
Collaborator

Suggested change
if (halted == cause_groups)
	return ERROR_OK;
if (halted == cause_groups) {
	LOG_TARGET_DEBUG(target, "The harts appear to be just in the process of halting due to a halt group. Giving them more time - will poll their state later.");
	return ERROR_OK;
}

@@ -3790,6 +3791,22 @@ int riscv_openocd_poll(struct target *target)
LOG_TARGET_DEBUG(target, "resume all");
riscv_resume(target, true, 0, 0, 0, false);
} else if (halted && running) {
foreach_smp_target(entry, targets)
Collaborator

I wonder if this loop can be skipped if haltgroup_supported is equal to false.
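One way to honor that would be a guard up front, sketched below as a helper; dm_supports_halt_groups() is a hypothetical predicate (OpenOCD would need to record the capability during examination), and the helper's shape is illustrative only.

/* Sketch: return true iff every halted hart in the SMP group halted only
 * because of its halt group. When halt groups are unsupported we return
 * false right away, so the caller falls through to a plain riscv_halt(). */
static bool smp_halted_by_group_only(struct target *target,
		struct target_list *targets)
{
	if (!dm_supports_halt_groups(target))  /* hypothetical capability check */
		return false;

	struct target_list *entry;
	foreach_smp_target(entry, targets) {
		struct target *t = entry->target;
		if (t->state != TARGET_HALTED)
			continue;
		riscv_reg_t dcsr;
		if (riscv_get_register(t, &dcsr, GDB_REGNO_DCSR) != ERROR_OK)
			return false;
		if (get_field(dcsr, CSR_DCSR_CAUSE) != CSR_DCSR_CAUSE_GROUP)
			return false;
	}
	return true;
}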

Comment on lines 3810 to 3833
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);
Collaborator

Suggested change
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);
/* Halting the whole SMP group to bring it in sync. */
LOG_TARGET_DEBUG(target, "halt all; halted=%d",
halted);

Contributor Author

Addressed

Comment on lines 3807 to 3808
if (halted == cause_groups)
return ERROR_OK;
Collaborator

(1) I would recommend keeping a counter of how many times this situation has occurred (a sketch follows at the end of this comment). If it reaches a certain limit, OpenOCD should:

  • stop waiting for the harts to halt,
  • print an error to the user, and
  • try to halt all the harts directly (riscv_halt()).

(2) Instead of returning from riscv_openocd_poll() and waiting for the next poll interval, it would perhaps be better to create a loop inside this function. This would allow OpenOCD to immediately re-poll the hart state and react quickly.

  • @aap-sc @zqb-all What is your opinion on that?
  • If you are in favor of it but don't feel like implementing it in this PR, consider adding a TODO comment.
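A minimal sketch of suggestion (1), written as a fragment that would replace the early return; the limit value, the counter's storage, and the log wording are all assumptions.

/* Hypothetical cap on how many consecutive polls may find the group still
 * "in the process of halting" before we stop waiting. The value is arbitrary. */
#define HALT_GROUP_SETTLE_POLL_LIMIT 8

static unsigned int settle_polls;  /* would live in per-DM or per-group state */

if (halted == cause_groups) {
	if (++settle_polls < HALT_GROUP_SETTLE_POLL_LIMIT)
		return ERROR_OK;  /* give the halt group more time */
	LOG_TARGET_ERROR(target, "Harts did not finish halting as a group after %u polls; halting them directly.", settle_polls);
	settle_polls = 0;
	riscv_halt(target);   /* fall back to a standard halt request */
} else {
	settle_polls = 0;     /* group consistent again - reset the counter */
}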

Contributor

create a loop inside this function. This would allow OpenOCD to immediately re-poll the hart state and react quickly.

Sounds good; this helps the SMP harts finish their state adjustment faster.

@lz-bro lz-bro force-pushed the handle-all-trigger-halt branch 2 times, most recently from 252e9cf to 8193431 Compare January 22, 2025 09:55
If the harts are just in the process of halting due to
a halt group, poll again so that the halt reason of every
hart can be accurately determined (e.g. semihosting).
@lz-bro lz-bro force-pushed the handle-all-trigger-halt branch from 8193431 to 0affedb Compare January 23, 2025 01:21