RAS Get State Extension
RAS State
Reliability, Availability, Serviceability (RAS) is a standard mechanism for reporting HW errors. In L0 Sysman, these errors are reported through the RAS API and organized into correctable and uncorrectable errors. The errors can further be grouped by category so that the source of an error is easily identifiable.
This extension provides an extensible interface for discovering these errors. A separate function lets users clear the error counters. This functionality extends, and is intended to eventually replace, the existing mechanism based on zesRasGetState.
To that end, this extension also adds a new enumeration for RAS error categories. The list of error categories includes the existing ones (refer to the Sysman Programming Guide) as well as additional ones defined here. The additional error categories are listed below:
Error category | Correctable errors | Uncorrectable errors |
---|---|---|
Memory errors | Number of ECC correctable errors that have occurred in memory (GDDR/HBM). | Number of ECC uncorrectable errors that have occurred in memory (GDDR/HBM). |
Scale errors | Number of correctable errors that have occurred in scale IP. | Number of ECC uncorrectable errors that have occurred in scale IP. |
L3 fabric errors | Number of ECC correctable errors that have occurred in L3 fabric. | Number of ECC uncorrectable errors that have occurred in L3 fabric. |
The following pseudo-code demonstrates a sequence for querying the number of error categories supported by a platform and for obtaining the error counters for these categories.
    // Query for number of error categories supported by platform
    uint32_t rasCategoryCount = 0;
    zesRasGetStateExp(rasHandle, &rasCategoryCount, nullptr);

    zes_ras_state_exp_t* rasStates = (zes_ras_state_exp_t*) allocate(rasCategoryCount * sizeof(zes_ras_state_exp_t));

    // Gather error states
    zesRasGetStateExp(rasHandle, &rasCategoryCount, rasStates);

    // Print error details (rasStates[i] is a struct, so members are accessed with '.')
    for (uint32_t i = 0; i < rasCategoryCount; i++) {
        output(" Error category: %d, Error count: %llu\n", rasStates[i].category, rasStates[i].errorCounter);
    }

    // Clear error counter for specific category, for example PROGRAMMING_ERRORS
    zesRasClearStateExp(rasHandle, ZES_RAS_ERROR_CATEGORY_EXP_PROGRAMMING_ERRORS);
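The pseudo-code above does not check return values and needs a real device to run. As a rough illustration of the same count-then-fill sequence with basic error handling, the sketch below replaces the driver with a small in-memory mock so it builds without the Level Zero loader. Every `Mock*` type, the `mockRas*` functions, and the `queryAllStates` helper are hypothetical stand-ins invented for this example, not part of the API. The mock's handling of a zero or oversized count is an assumption modelled on the sequence shown above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins so the sketch builds without Level Zero headers.
// They are NOT the real type or enumerator definitions.
enum MockCategory : uint32_t { MOCK_CAT_MEMORY = 0, MOCK_CAT_SCALE = 1, MOCK_CAT_L3FABRIC = 2 };
enum MockResult { MOCK_SUCCESS = 0, MOCK_ERROR_INVALID_ARGUMENT = 1 };

struct MockRasState {
    MockCategory category;
    uint64_t errorCounter;
};

// Fake device holding one counter per supported category.
struct MockRasDevice {
    std::vector<MockRasState> counters{
        {MOCK_CAT_MEMORY, 3}, {MOCK_CAT_SCALE, 0}, {MOCK_CAT_L3FABRIC, 7}};
};

// Assumed convention: a zero count or null buffer asks for the number of
// categories; otherwise at most *pCount entries are written and *pCount is
// updated to the number actually written.
MockResult mockRasGetStateExp(const MockRasDevice& dev, uint32_t* pCount, MockRasState* pState) {
    if (pCount == nullptr) return MOCK_ERROR_INVALID_ARGUMENT;
    const uint32_t available = static_cast<uint32_t>(dev.counters.size());
    if (*pCount == 0 || pState == nullptr) {
        *pCount = available;
        return MOCK_SUCCESS;
    }
    if (*pCount > available) *pCount = available;
    for (uint32_t i = 0; i < *pCount; ++i) pState[i] = dev.counters[i];
    return MOCK_SUCCESS;
}

// Resets the counter of a single category; fails if the category is unknown.
MockResult mockRasClearStateExp(MockRasDevice& dev, MockCategory category) {
    for (MockRasState& s : dev.counters) {
        if (s.category == category) {
            s.errorCounter = 0;
            return MOCK_SUCCESS;
        }
    }
    return MOCK_ERROR_INVALID_ARGUMENT;
}

// Caller side: first call sizes the buffer, second call fills it.
// Returns false as soon as either call reports a failure.
bool queryAllStates(const MockRasDevice& dev, std::vector<MockRasState>& out) {
    uint32_t count = 0;
    if (mockRasGetStateExp(dev, &count, nullptr) != MOCK_SUCCESS) return false;
    out.assign(count, MockRasState{});
    if (count == 0) return true;  // nothing to fetch
    if (mockRasGetStateExp(dev, &count, out.data()) != MOCK_SUCCESS) return false;
    assert(count <= out.size());
    out.resize(count);  // the callee may report fewer entries than requested
    return true;
}
```

A caller would construct a `MockRasDevice`, pass it to `queryAllStates` with an empty vector, and iterate over the result. Using `std::vector` instead of a manual allocation keeps the buffer sized to the reported count and frees it automatically on every return path.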