opal/hmi: Workaround Power9 hw logic bug for couple of TFMR TB errors.
Add a workaround for a HW logic bug in Power9 where TB residue and HDEC
parity errors cleared by one thread aren't visible to the other threads of
the same core. The TB residue and HDEC parity errors are reported through
TFMR bits 45 and 26 respectively. If any thread of the core clears TFMR
bits 26 and 45, only thread 0 sees the errors as cleared; threads 1, 2
and 3 still see them as set. This causes TB error recovery to fail for
TB residue and HDEC parity errors. TFMR is a per-core register, so any
change made by one thread should be visible to the other threads of the
same core.
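
The bit numbers above can be turned into masks as in the minimal sketch below. It assumes IBM (big-endian) bit numbering, where bit 0 is the most significant bit of the 64-bit register. The helper `ppc_bit()` and the `EX_` names are illustrative only and are not taken from this commit; bit 41 (TB valid) is mentioned later in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: IBM (big-endian) bit numbering, so bit 0 is the most
 * significant bit of the 64-bit register and bit 63 the least.
 */
static inline uint64_t ppc_bit(unsigned int bit)
{
	assert(bit < 64);
	return UINT64_C(0x8000000000000000) >> bit;
}

/* Illustrative names; the bit positions come from the commit message. */
#define EX_TFMR_HDEC_PARITY_ERR	ppc_bit(26)
#define EX_TFMR_TB_VALID	ppc_bit(41)
#define EX_TFMR_TB_RESIDUE_ERR	ppc_bit(45)
```

Under that assumption, bit 45 corresponds to `1ULL << 18` and bit 26 to `1ULL << 37` in conventional least-significant-bit-first terms.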

On a TB residue error (TFMR bit 45), the TB goes into an invalid state.
Hence, avoid handling/clearing the TB residue error if the TB is valid and
running. Use TFMR bit 41 to check the validity of the TB state.
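
The rule in this paragraph reduces to a two-bit predicate. The sketch below is an illustration only: the function and macro names are hypothetical, and the masks assume IBM bit numbering (bit 0 is the most significant bit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical masks, assuming bit 0 is the most significant bit. */
#define EX_TFMR_TB_VALID	(UINT64_C(1) << (63 - 41))
#define EX_TFMR_TB_RESIDUE_ERR	(UINT64_C(1) << (63 - 45))

/*
 * Returns true only when a residue error is flagged and the TB is not
 * valid. A residue bit seen together with a valid TB is treated as a
 * stale view of the error and is left alone.
 */
static inline bool ex_residue_needs_tb_reset(uint64_t tfmr)
{
	bool residue = (tfmr & EX_TFMR_TB_RESIDUE_ERR) != 0;
	bool tb_valid = (tfmr & EX_TFMR_TB_VALID) != 0;

	return residue && !tb_valid;
}

/* Self-check covering the four combinations of the two bits. */
static inline void ex_residue_self_check(void)
{
	assert(!ex_residue_needs_tb_reset(0));
	assert(!ex_residue_needs_tb_reset(EX_TFMR_TB_VALID));
	assert(ex_residue_needs_tb_reset(EX_TFMR_TB_RESIDUE_ERR));
	assert(!ex_residue_needs_tb_reset(EX_TFMR_TB_RESIDUE_ERR |
					  EX_TFMR_TB_VALID));
}
```

Only the third combination, residue error set with the TB not valid, would lead to a TB reset under this rule.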

For HDEC parity error (TFMR bit 26), check for other errors in the TFMR
register and skip the pre-recovery for the HDEC parity error. If TFMR has
any other TB error bits set along with the HDEC parity error, we can safely
ignore handling of the HDEC parity error. Also, while clearing the HDEC
parity error bit from TFMR, allow only thread 0 to clear it.
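
The two HDEC rules above can be sketched as a pair of small helpers. This is a simplified model rather than the patch itself: the names are hypothetical, the "other errors" mask is passed in by the caller instead of being spelled out, and the `is_p9` flag stands in for the processor-generation check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask, assuming bit 0 is the most significant bit. */
#define EX_TFMR_HDEC_PARITY_ERR	(UINT64_C(1) << (63 - 26))

/* Rule 1: skip pre-recovery when any other TB error bit is also set. */
static inline bool ex_skip_pre_recovery(uint64_t tfmr, uint64_t other_errors)
{
	return (tfmr & other_errors) != 0;
}

/*
 * Rule 2: on Power9, only thread 0 keeps the HDEC parity error bit in
 * its local copy of TFMR, so only thread 0 goes on to clear it. This
 * changes the local value only; no register is written here.
 */
static inline uint64_t ex_filter_hdec(uint64_t tfmr, int thread_id, bool is_p9)
{
	assert(thread_id >= 0);
	if (is_p9 && thread_id != 0)
		tfmr &= ~EX_TFMR_HDEC_PARITY_ERR;
	return tfmr;
}
```

For example, with a local TFMR copy that has only the HDEC parity bit set, thread 0 keeps the bit while threads 1 to 3 on Power9 see it masked out and therefore skip the clearing path.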

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Stewart Smith <[email protected]>
maheshsal authored and stewartsmith committed Oct 23, 2017
1 parent d1bb483 commit 00f2540
Showing 2 changed files with 54 additions and 1 deletion.
28 changes: 27 additions & 1 deletion core/hmi.c
@@ -179,6 +179,14 @@
/* Number of iterations for the various timeouts */
#define TIMEOUT_LOOPS 20000000

/* TFMR other errors. (other than bit 26 and 45) */
#define SPR_TFMR_OTHER_ERRORS \
(SPR_TFMR_TBST_CORRUPT | SPR_TFMR_TB_MISSING_SYNC | \
SPR_TFMR_TB_MISSING_STEP | SPR_TFMR_FW_CONTROL_ERR | \
SPR_TFMR_PURR_PARITY_ERR | SPR_TFMR_SPURR_PARITY_ERR | \
SPR_TFMR_DEC_PARITY_ERR | SPR_TFMR_TFMR_CORRUPT | \
SPR_TFMR_CHIP_TOD_INTERRUPT)

static const struct core_xstop_bit_info {
uint8_t bit; /* CORE FIR bit number */
enum OpalHMI_CoreXstopReason reason;
@@ -654,7 +662,12 @@ static void wait_for_cleanup_complete(void)
*/
static void timer_facility_do_cleanup(uint64_t tfmr)
{
-if (tfmr & SPR_TFMR_TB_RESIDUE_ERR) {
+/*
+* Workaround for HW logic bug in Power9. Do not reset the
+* TB register if TB is valid and running.
+*/
+if ((tfmr & SPR_TFMR_TB_RESIDUE_ERR) && !(tfmr & SPR_TFMR_TB_VALID)) {
+
/* Reset the TB register to clear the dirty data. */
mtspr(SPR_TBWU, 0);
mtspr(SPR_TBWL, 0);
@@ -840,6 +853,19 @@ static void pre_recovery_cleanup_p9(void)
return;
}

/*
* Due to a HW logic bug in p9, TFMR bits 26 and 45 stay set
* once a TB residue or HDEC error has occurred for the first time.
* Hence, for an HMI on subsequent TB errors, add an additional check
* as a workaround to identify the validity of the errors and decide
* whether pre-recovery is required or not. Exit pre-recovery if other
* TB errors are also present in TFMR.
*/
if (tfmr & SPR_TFMR_OTHER_ERRORS) {
unlock(&hmi_lock);
return;
}

/*
* First thread on the core ?
* if yes, setup the hmi cleanup state to !DONE
27 changes: 27 additions & 0 deletions hw/chiptod.c
@@ -1478,6 +1478,7 @@ int chiptod_recover_tb_errors(void)
{
uint64_t tfmr;
int rc = -1;
int thread_id;

if (chiptod_primary < 0)
return 0;
@@ -1502,6 +1503,17 @@ int chiptod_recover_tb_errors(void)
/* Get fresh copy of TFMR */
tfmr = mfspr(SPR_TFMR);

/*
* Workaround for HW logic bug in Power9
* Even after clearing TB residue error by one thread it does not
* get reflected to other threads on same core.
* Check if TB is already valid and skip the checking of TB errors.
*/

if ((proc_gen == proc_gen_p9) && (tfmr & SPR_TFMR_TB_RESIDUE_ERR)
&& (tfmr & SPR_TFMR_TB_VALID))
goto skip_tb_error_clear;

/*
* Check for TB errors.
* On Sync check error, bit 44 of TFMR is set. Check for it and
@@ -1525,6 +1537,7 @@ int chiptod_recover_tb_errors(void)
}
}

skip_tb_error_clear:
/*
* Check for TOD sync check error.
* On TOD errors, bit 51 of TFMR is set. If this bit is on then we
@@ -1558,6 +1571,20 @@ int chiptod_recover_tb_errors(void)
rc = 1;
}

/*
* Workaround for HW logic bug in Power9.
* In the ideal case (without the HW bug) only one thread from the core
* would have fallen through to tfmr_recover_non_tb_errors() to clear
* the HDEC parity error in TFMR.
*
* Hence, to achieve the same behavior, allow only thread 0 to clear the
* HDEC parity error. For the rest of the threads, just reset the bit in
* the local copy so that they do not fall through to
* tfmr_recover_non_tb_errors().
*/
thread_id = cpu_get_thread_index(this_cpu());
if ((proc_gen == proc_gen_p9) && thread_id)
tfmr &= ~SPR_TFMR_HDEC_PARITY_ERROR;

/*
* Now that TB is running, check for TFMR non-TB errors.
*/