5 Fault management service components

According to the TMN scheme (see M.3200 [3] and M.3400 [4]), the general requirements on fault management described in the previous clauses are achieved by means of management service components. This clause contains a more detailed description of the requirements and defines the related management service components.

5.1 Alarm surveillance

This service component requires that the OS and the operator have a consistent and up-to-date view of the current operating condition and quality of service of the managed network elements. For efficient and accurate fault management of a PLMN, it is also essential to achieve early detection of faults so that they can be corrected before significant effects have been noticed by the end-user.

In support of this, the Alarm Surveillance functions are used to monitor and interrogate NEs about faults, defects and anomalies. This results in the following requirements:

a) All detected faults, defects and anomalies in the NE shall be reported to the OS (for each case which matches the reporting conditions set by the OS). The NE/OS shall therefore support the sending/reception of unsolicited event reports notifying such events.

Whenever possible, the NE should generate a single notification for a single fault. When a single fault results in the failure of other functionalities, the NE should filter these "dependency faults". Such filtering is internal to the NE, and is not standardised in the present document.

There are a number of possible mechanisms by which the NE can detect faults, defects and anomalies. Depending on the implementation within the NE, several sources of information may be available. The most obvious examples are:

– hardware/firmware detectors: co-operating or supervising units which continuously check the correctness of analogue or digital signals (sense points, transmission error detectors);

– software detectors: to detect run time software errors as well as equipment errors;

– performance detectors: to monitor the performance of the NE and generate a quality of service alarm notification if the performance is outside the normal range.

Some of these detectors are manageable and some are not, depending on the nature of their implementation. Furthermore, some of the manageable aspects of these detection functions (such as managing of thresholds for quality of service alarms) are the subject of standardisation in the present document (these are described below), whilst others are outside of the scope.

b) The forwarding of alarm reports through the Q3 interface shall be manageable both in terms of filtering (which reports shall be forwarded and which shall be discarded) as well as in terms of report destination(s). The alarm reporting is based on the model described in X.734 [10], i.e. the result of the fault detection process shall be an alarm notification sent to the Event pre-processing which may generate a potential event report that is sent to all the existing EFDs. The EFD is a managed object which receives the potential alarm reports and determines which event reports are to be forwarded and which are to be discarded. For the forwarded reports it also determines the destination, the time frame and the forwarding mode (i.e. confirmed or non-confirmed).

c) The NE shall be able to log alarm information as alarm records and support later retrieval of the logged alarm records. The logging functionality is based on the model described in X.735 [11] and can be used for any type of event information, including the alarm information. According to this model, when a fault is detected, an alarm notification is also sent to the Log pre-processing which may generate a potential log report that is sent to all the existing logs. The Log is a managed object which receives the potential log reports and determines which of them are stored as log records and which are discarded.

d) The NE shall be able to provide information to the OS about all the current outstanding alarm conditions in the NE. The outstanding alarm information may be reported in a summary on demand or periodically to the OS. This functionality is based on Q.821 [15] and can be used to obtain a view of the NE's current alarm condition on demand. This functionality can also be used to align alarm information between the OS and the NE, for instance after an interruption of communication between them (e.g. link failure, OS restart, NE restart), without waiting for the forwarding of all the events which occurred during the failure.

To support these alarm surveillance requirements over the Q3 interface, a number of management functions are defined in more detail in clause 6. These functions are grouped as follows:

– Threshold Management functions (X.721 [6], X.738 [24], X.739 [25], GSM 12.11);

The set of threshold management functions is optional.

– Alarm Reporting functions (M.3400 [4]/Q.821 [15]/X.733 [9]/X.734 [10]/GSM 12.00 [18]);

The set of alarm reporting functions is mandatory.

– Log control functions (M.3400 [4]/Q.821 [15]/X.735 [11]);

The set of log control functions is optional.

– Alarm Summary functions (M.3400 [4]/Q.821 [15]).

The set of alarm summary functions is optional.

Alarm surveillance also has implications for the operating practices for co-ordination whenever multiple OSs are involved in management operations. Such practices are the responsibility of the operator and are not standardised in the present document.

5.1.1 The model

The model adopted for Alarm Surveillance is depicted in figure 1. In this figure, the circles represent managed objects, stacks of circles represent sets of managed objects, and the names of the MOC(s) are reported in the middle of the circles.

The leftmost stack of managed objects in figure 1 represents the managed objects of the NE that can generate alarm notifications (for example, in the case of a BSS, instances of the GSM 12.20 MOCs and their subclasses).

The solid arrows represent the information flow and the dotted ones the control flow. Only the flows related to the Alarm Surveillance are represented and are described subsequently.

The "Event/Log pre-processing" is an NE internal function (implementation dependent) and is not subject to specification in the present document.

Figure 1: Alarm surveillance management model

The Alarm Surveillance model is a combination of other models (or parts of models) specified by ITU and ETSI for the various functionalities that compose this service. These consist of

– Fault detection functions which are achieved by

– detectors, which are not shown in figure 1: if a detector is not manageable, it can be considered embedded in the managed object representing the monitored resource; if it is manageable, it needs to be represented by an instance of a specific MOC, which is outside the scope of the present document;

– the Threshold Manager instances, which scan observed managed objects and generate notifications when counters or gauges cross threshold levels.

– Alarm reporting functions which are achieved by

– the above fault detectors, including the generation of the alarm notification;

– the Event pre-processing, which may produce a potential alarm report;

– the EFD(s) for the final step of the reporting (discrimination, report forwarding, and EFD management). For more details, see subclause 5.1.3, ITU-T X.733 [9] and ITU-T X.734 [10].

– Log control functions which are achieved by

– the Log pre-processing, which may produce a potential log report;

– the Log(s) for the final step of the logging (discrimination, log record formatting and storing, log record retrieving and log management). For more details, see subclause 5.1.4, ITU-T X.733 [9], ITU-T X.735 [11] and GSM 12.00 [18].

– Alarm summary functions which are achieved by

– the Management Operations Schedule managed objects, which are used to program the reporting time;

– the Current Alarm Summary Control managed objects, which are used to collect the information on the outstanding alarm conditions from the managed objects and to produce the Potential Alarm Summary report toward the EFD or as a response to an operator request. For more details, see subclause 5.1.5 and ITU-T Q.821 [15].

5.1.2 Threshold Management

In the present document, the only form of fault detection which is standardised is the performance and equipment function monitoring by means of threshold management.

5.1.2.1 General

Threshold management may be used within the NE to generate defined alarm notifications as a result of a value change crossing the threshold level of an observed counter or gauge. The principles outlined are based on the recommendations of ITU-T X.721 [6], X.738 [24] and X.739 [25]. Note that counters and gauges, in the context of threshold management in the present document, refer to the counters and gauges defined for Performance Measurements in GSM 12.04 [20] or elsewhere.

NOTE: The counter or gauge could also be the result of a more complex post-processing operation in a future GSM Phase 2+ version of the present document, i.e. an arithmetic expression (or formula) combining a number of counters or gauges. This latter type of counter or gauge could be referred to as a hybrid counter or gauge.

The use of such threshold mechanisms allows the NE, for example, to

– detect faults, defects and anomalies (which may also be detected by other fault detection mechanisms);

– provide an early notification of a probable problem;

– provide information from the NE which for other reasons is of importance to the NE manager.

The notifications generated by the threshold manager shall contain information pertinent to the context in which the alarm was triggered. The notifications shall contain a reference to the counter or gauge, the threshold level, and the severity value and probable cause associated with the counter or gauge.

Figure 2: Inter-relationship of counters and gauges with thresholds

Figure 2 depicts the relationship between the threshold manager and the counters/gauges in the observed MOs/attributes, which is a one-to-many relationship except when offset levels are used (see below). The threshold manager contains only one threshold attribute, but this attribute may consist of several sets of characteristics which define the behaviour of the threshold. There may also be more than one instance of the threshold manager created in the NE, and the present document does not prevent different threshold manager instances from observing the same MOs/attributes.

It shall be possible to have changeable threshold levels related to a counter or gauge threshold, for which an alarm notification is generated when the respective value crosses each threshold level, depending on whether notification generation has been enabled or disabled. For every gauge threshold it shall also be possible to define hysteresis levels.

It shall be possible to set and read the characteristics (e.g. perceived severity, notification generation) of counter and gauge thresholds. It shall not be possible to modify the characteristics without first deactivating that particular threshold manager instance.

In case any of the scanned attributes is missing or out of range, a threshold alarm notification with probable cause "Configuration or customisation error"/"Variable out of range" shall be generated.

The characteristics defined for one threshold attribute shall be equally applied to all observed MOs/attributes by the threshold manager instance containing that threshold.

Two basic types of thresholds are considered: counter thresholds and gauge thresholds, which are described in the following subclauses.

5.1.2.2 Counter Thresholds

A counter threshold shall have the characteristic that a notification is triggered when the value of a counter crosses one of the threshold levels of the counter threshold, if notifications are enabled. Each threshold level is related to a defined notification.

Figure 3: Example counter threshold behaviour

An offset level may also be associated with a counter threshold level; this association may be cancelled (de-associated) at any time. Whenever the counter threshold is triggered by the counter crossing a threshold level, the threshold level itself is incremented by the offset level, so that a further notification is triggered when the newly incremented threshold is crossed. The effect of the offset level mechanism is that notifications can be generated at a large number of regularly spaced threshold levels, without having to specify each individual comparison level directly.

If an offset level is defined for a counter threshold (offset value > 0), the threshold manager instance may only be associated with one observed counter attribute in one MO instance.

If it is necessary to modify a counter threshold, then it shall first be deactivated. If on reactivation of the counter threshold it is detected that a modification had been made to the counter threshold, then this will automatically result in an alarm clear notification being generated for any outstanding alarm condition for the counter threshold.

On activation or reactivation of the counter threshold, no notifications are generated for any threshold levels related to the history of the counter. Next, in the case where an offset is defined and the comparison level is going to be changed, the comparison level must be recalculated such that the new comparison level is equal to the value of the initial comparison level plus a multiple of the offset level value (where the new comparison level is greater than the counter value by less than one offset level value). This calculation is algebraically defined as follows:

CL_initial + OL * (n-1) < Counter Value <= CL_new = CL_initial + OL * n

where CL_initial = initial comparison level, OL = offset level, and CL_new = new comparison level.
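As an informative illustration only, the following Python sketch applies the formula above to recompute the comparison level on reactivation; the helper name and the numeric values are hypothetical and not part of any normative definition.

```python
# Illustrative recalculation of the comparison level on reactivation, following
# the formula above (hypothetical helper; not part of the GDMO definitions).
import math


def recalculate_comparison_level(cl_initial: int, offset_level: int, counter_value: int) -> int:
    # Smallest CL_initial + n*OL such that
    # CL_initial + OL*(n-1) < counter_value <= CL_initial + OL*n.
    n = math.ceil((counter_value - cl_initial) / offset_level)
    return cl_initial + n * offset_level


# Worked example: initial level 100, offset 50, counter already at 230 on reactivation:
# n = ceil(130/50) = 3, so the new comparison level becomes 100 + 3*50 = 250.
assert recalculate_comparison_level(100, 50, 230) == 250
```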

When offset levels are used, the threshold level wraps around when it exceeds the maximum value of the counter. As a result, after being incremented by the offset level, the threshold level is less than or equal to the offset level value.

Further, whenever the counter to which the counter threshold is associated is reset, then the threshold level is also reset to the initial threshold value.

5.1.2.3 Gauge thresholds

Where applied, a gauge is directly related to a single gauge threshold. The gauge threshold has the characteristic that a notification shall be triggered when the value of the gauge crosses the threshold value, if notifications are enabled.

Figure 4: Example gauge threshold behaviour

If it is necessary to modify a gauge threshold, then it shall first be deactivated. Further, if on reactivation of the gauge threshold it is detected that a modification had been made to the gauge threshold, then this will automatically result in an alarm clear notification being generated for any outstanding alarm condition for the gauge threshold. On activation and reactivation, the gauge threshold will generate alarm notifications for each threshold level that is met. Therefore, an "alarm on" notification is generated for each notifyHigh comparison level associated with an "alarm on" notification and which is below the gauge value. Similarly, an "alarm on" notification is generated for each notifyLow comparison level associated with an "alarm on" notification and which is above the gauge value.

In order to avoid repeatedly triggering defined event notifications around a particular threshold level, a hysteresis mechanism may also be provided by defining threshold levels in pairs (high levels and low levels), and within the range of these two threshold levels (i.e. the hysteresis interval) no notifications are triggered.

A gauge threshold may have a set of zero or more entries defining the threshold levels associated with notifications. Each member in this set consists of two submembers (a notifyHigh value and a notifyLow value), each together with an on/off switch for the generation of the notification at that level. A notifyHigh’s threshold value shall be greater than or equal to the notifyLow’s threshold value.

A notification that a notifyHigh level has been reached is generated whenever the gauge value reaches or crosses above the notifyHigh level in a positive direction. A subsequent similar crossing of the notifyHigh level does not generate a further notification until after the gauge value is equal to or less than the corresponding notifyLow value.

Similarly, a notification that a notifyLow level has been reached is generated whenever the gauge value reaches or crosses below the notifyLow level in a negative direction. A subsequent similar crossing of the notifyLow level does not generate a further notification until after the gauge value is equal to or greater than the corresponding notifyHigh value.

For each pair of notifyHigh and notifyLow threshold levels, one of them shall generate an alarm notification, and the other shall generate an alarm clear notification. This means that the alarm clear notification may be generated either at the notifyHigh value or at the notifyLow value. The alarm notification shall always be generated before the alarm clear notification.
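As an informative illustration only, the following Python sketch shows the notifyHigh/notifyLow hysteresis behaviour described above for a single pair of threshold levels; the class and attribute names are hypothetical and do not reproduce the normative X.721/X.738 gauge-threshold definitions.

```python
# Illustrative sketch of one notifyHigh/notifyLow pair with hysteresis
# (hypothetical names; not the normative gauge-threshold attributes).

class GaugeThresholdPair:
    def __init__(self, notify_high: float, notify_low: float):
        assert notify_high >= notify_low
        self.notify_high = notify_high
        self.notify_low = notify_low
        self.high_armed = True   # a notifyHigh notification may be generated
        self.low_armed = True    # a notifyLow notification may be generated

    def observe(self, value: float):
        """Return 'notifyHigh', 'notifyLow' or None for this gauge observation."""
        if self.high_armed and value >= self.notify_high:
            self.high_armed = False      # re-armed only once the value falls to notifyLow
            self.low_armed = True
            return "notifyHigh"          # e.g. the alarm notification
        if self.low_armed and value <= self.notify_low:
            self.low_armed = False       # re-armed only once the value rises to notifyHigh
            self.high_armed = True
            return "notifyLow"           # e.g. the alarm clear notification
        return None


# Example: levels 80/60; repeated oscillation around 80 triggers only one
# notification until the gauge has fallen back to the notifyLow level.
pair = GaugeThresholdPair(notify_high=80, notify_low=60)
print([pair.observe(v) for v in (70, 85, 78, 90, 55, 82)])
# -> [None, 'notifyHigh', None, None, 'notifyLow', 'notifyHigh']
```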

5.1.2.4 Threshold Manager

In the context of GSM 12.11, the threshold management mechanism is modelled by means of the Threshold Manager MOC, as depicted in figure 1.

The threshold manager is derived from the Scanner MOC in ITU-T X.739 [25] and it is able to scan (sample) a number of counters or gauges in a number of managed objects over time. It retrieves the data directly from the monitored MOs and thereby reduces the number of support objects involved in the process. For each scanned counter and gauge attribute, the threshold manager scans the attribute a number of times within the scheduled scanning period, performs comparison with the defined threshold values and generates a threshold alarm or alarm clear notification whenever the conditions defined above are met. In case any of the scanned attributes is missing or out of range, a threshold alarm notification with probable cause "Configuration or customisation error"/"Variable out of range" will be generated.

The threshold manager can scope the set of managed objects that are eligible to be included in the scanning, and it may select managed objects using filtering criteria (similar to the concept of scoping and filtering as described in ITU-T X.710 | ISO/IEC 9595). Alternatively, it can use an explicit list of managed object instances for scanning. The threshold manager observes the values of the specified attributes of each selected managed object, and applies the same characteristics (comparison levels etc.) to all of the attributes.

However, if an offset level is associated with a counter threshold, only one managed object and counter may be connected to the threshold manager instance.

The scanning and retention of observations is done according to the granularity period and scheduling attributes inherited from the scanner managed object class.

The attributes of a threshold manager instance (except the administrative state) may only be modified if thresholding is deactivated, i.e. the administrative state is "locked".

The attributes and conditional packages of the threshold manager, both the inherited ones from the scanner MOC and the ones defined herein, are described in the subclauses below.

5.1.2.4.1 Threshold manager attributes and conditional packages
5.1.2.4.1.1 Inherited from the ITU-T X.739 Scanner MOC

The scanner MOC has the following attributes:

a) Scanner Id:

This attribute contains a value used to identify an instance of the scanner managed object class (i.e. the scanner Id is used for naming).

b) Operational state:

This attribute is defined in CCITT Rec. X.731 | ISO/IEC 10164‑2.

c) Administrative state

This attribute is defined in CCITT Rec. X.731 | ISO/IEC 10164‑2.

d) Granularity period:

This attribute contains the granularity period indicating the time between scans.

If the granularity period is zero (which should be avoided), then the threshold manager samples the counter or gauge as frequently as the system allows without being overloaded.

The scanner MOC has the following conditional packages:

– Availability status package (present if the scanner can be scheduled);

– Duration package (present if the managed object function is scheduled to start at a specified time and stop at either a specified time or function continuously);

– Daily scheduling package (present if both the weekly scheduling package and external scheduler packages are not present in an instance and daily scheduling is required);

– Weekly scheduling package (present if both the daily scheduling package and external scheduler packages are not present in an instance and weekly scheduling is required);

– External scheduling package (present if both the daily scheduling package and weekly scheduling packages are not present in an instance and a reference to an external scheduler is required);

– Period synchronization package (present if clock synchronization for the granularity period is required. If this package is not present then clock synchronization is a local matter.);

– Create delete notifications package (present if notification of managed object creation and deletion events is required);

– Attribute value change notification package (present if notification of attribute value change events is required); and

– State change notification package (present if notification of state change events is required).

5.1.2.4.1.2 Additional attributes and conditional packages defined for Threshold Manager in 12.11

Attribute defined in addition to the inherited ones:

– Scan attribute identifier list: A set of attribute identifiers. The attribute identifier specifies an attribute of any ASN.1 type, which is observed by the threshold manager. The scan attribute identifier list may be empty;

Conditional packages defined in addition to the inherited ones:

– Scoped selection package (present if required and the managed object instance selection package is not present);

– Managed object instance selection package (present if required and the scoped selection package is not present);

– CounterThresholdPackage: Contains the counter threshold characteristics and is present if counter threshold is required and gaugeThresholdPackage is not present. For a more detailed description of the counter threshold characteristics, please refer to subclause 6.1.1.

– GaugeThresholdPackage: Contains the gauge threshold characteristics and is present if gauge threshold is required and counterThresholdPackage is not present. For a more detailed description of the gauge threshold characteristics, please refer to subclause 6.1.1.

Either the scoped selection package or the managed object instance selection package shall be present in a threshold manager.

Further description of the above listed packages is provided below.

5.1.2.4.1.3 Behaviour of the threshold manager

The following default values and semantics are associated with the attributes of the threshold manager:

The semantics of the scheduling packages (inherited from the scanner MOC) are defined in ITU-T Rec. X.734 [10].

The semantics of the scanning filter attribute (part of the scoped selection package) are the same as those of the filter defined in ITU-T X.720 [8]. The default value is true.

If the managed object instance selection package is specified, the threshold manager observes the managed object instances specified in the object list.

If the scoped selection package is specified, the threshold manager uses the managed object identified in the base managed object attribute and checks all the managed objects within the levels indicated by the scope attribute by applying the criteria in the scanning filter attribute. The scoping and filtering are applied for each scan to select the managed objects to be observed. Managed objects that pass the selection criteria are observed in the scan.
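As an informative illustration only, the following Python sketch shows how scoping and filtering could be applied at each scan to select the managed objects to be observed; the containment tree, object names and filter used are hypothetical simplifications and are not the CMIS scoping/filtering service itself.

```python
# Hypothetical containment tree: each managed object carries a class and the
# observed counter; children are listed per parent name (illustrative only).
mos = {
    "bts-1": {"class": "bts", "droppedCalls": 3},
    "trx-1": {"class": "trx", "droppedCalls": 0},
    "trx-2": {"class": "trx", "droppedCalls": 7},
}
children = {"bts-1": ["trx-1", "trx-2"], "trx-1": [], "trx-2": []}


def select_for_scan(base: str, scope_levels: int, scanning_filter) -> list:
    """Objects within 'scope_levels' below 'base' that satisfy the scanning filter."""
    selected, frontier = [], [(base, 0)]
    while frontier:
        name, depth = frontier.pop()
        if scanning_filter(mos[name]):
            selected.append(name)
        if depth < scope_levels:
            frontier.extend((child, depth + 1) for child in children[name])
    return selected


# Example: at each scan, observe all TRX objects one level below bts-1.
print(select_for_scan("bts-1", 1, lambda mo: mo["class"] == "trx"))
# -> ['trx-2', 'trx-1']  (the selection order is not significant here)
```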

5.1.2.4.1.4 Operations for the Threshold Manager

It shall be possible to perform the following operations for the Threshold Manager:

– Create Thresholding

The thresholding mechanism and its characteristics are defined for the specified counter(s)/gauge(s), by creating a new instance of the thresholdManager MOC.

– Get Thresholding

The characteristics of a thresholdManager instance are retrieved.

– Set Thresholding

The characteristics of the specified thresholdManager instance are modified. It is only possible to modify any attributes when thresholding has been deactivated for that particular counter threshold or gauge threshold.

– Remove Thresholding

The thresholding mechanism is no longer defined for the specified counter/gauge (i.e. thresholding is not performed and the thresholdManager instance is deleted).

– Activate/deactivate Thresholding

The thresholding mechanism for the specified counter threshold or gauge threshold is switched on or off. (Note that in the event of deactivating the notifications for a thresholding mechanism or the thresholding itself, the defined characteristics remain in place).
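As an informative illustration only, the following Python sketch shows the lifecycle implied by the operations listed above, in particular the rule that the characteristics may only be modified while thresholding is deactivated; the object and attribute names are hypothetical, and the assumption that a newly created instance starts deactivated is made only for the purpose of the sketch.

```python
# Illustrative lifecycle of a thresholdManager instance (hypothetical local object;
# the real operations are CMIS create/get/set/delete over the Q3 interface).

class ThresholdManagerInstance:
    def __init__(self, **characteristics):
        self.characteristics = characteristics
        self.administrative_state = "locked"      # assumed initial state (sketch assumption)

    def activate(self):
        self.administrative_state = "unlocked"    # thresholding switched on

    def deactivate(self):
        self.administrative_state = "locked"      # the defined characteristics remain in place

    def set(self, **changes):
        # Modification is only allowed while thresholding is deactivated.
        if self.administrative_state != "locked":
            raise RuntimeError("deactivate thresholding before modifying its characteristics")
        self.characteristics.update(changes)


# Example: to change the comparison levels, deactivate, set, then reactivate.
tm = ThresholdManagerInstance(comparisonLevels=[100, 200], perceivedSeverity="minor")
tm.activate()
tm.deactivate()
tm.set(comparisonLevels=[150, 300])
tm.activate()
```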

For completeness, the present document also defines the thresholdAlarmRecord MOC, which specifies the format for logging of threshold alarm notifications (see Annex A.1.1.2).

5.1.3 Alarm Reporting

Event Reporting identifies the standard mechanism to be used in the NE for the generation of event notifications, pre-processing of the event notifications, discrimination of potential event reports, and the formatting and forwarding of the event reports. Within the context of the present document, event reporting is used for reporting of alarms.

In addition to the requirements identified in clauses 5 and 6, the generation of alarm notifications for the GSM NE shall be performed according to ITU-T X.733 [9], X.721 [6] and GSM 12.20 [22]. For the semantics and complete definition of the attributes and parameters, please refer to these specifications.

According to the adopted specifications, the information contained in the alarm report (apart from the standard CMIS parameters) shall be:

– Probable cause (M)
– Specific problems (U)
– Perceived severity (M)
– Backed-up status (U)
– Back-up object (C)
– Trend indication (U)
– Threshold information (C)
– Notification identifier (U)
– Correlated notifications (U)
– State change definition (U)
– Monitored attributes (U)
– Proposed repair actions (U)
– Additional text (U)
– Additional information (U)

Where M stands for "Mandatory", U stands for "User optional" and C stands for "Conditional".
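As an informative illustration only, the following Python sketch assembles an alarm report from the field list above, always including the mandatory fields and adding the user-optional and conditional ones only when supplied; the field names and the dictionary representation are hypothetical, and the normative parameter definitions remain those of ITU-T X.733 [9].

```python
# Illustrative assembly of an alarm report from the field list above
# (hypothetical dict representation; the normative encoding is CMIS/X.733).

MANDATORY = ("probableCause", "perceivedSeverity")


def build_alarm_report(managed_object: str, **fields) -> dict:
    """Mandatory fields are always required; other fields only when supplied."""
    missing = [f for f in MANDATORY if f not in fields]
    if missing:
        raise ValueError(f"mandatory alarm report fields missing: {missing}")
    report = {"managedObjectInstance": managed_object}
    report.update({k: v for k, v in fields.items() if v is not None})
    return report


# Example: a quality of service alarm carrying threshold information (a conditional field).
report = build_alarm_report(
    "bts-1/trx-2",
    probableCause="thresholdCrossed",
    perceivedSeverity="minor",
    thresholdInformation={"observedValue": 230, "thresholdLevel": 200},
    additionalText="dropped-call counter crossed its threshold",
)
```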

The pre-processing of the alarm notifications is not subject to standardization in the present document.

The discrimination of potential alarm reports and forwarding of alarm reports shall be performed by means of the managed object class "Event Forwarding Discriminator" (EFD) defined in ITU-T X.734 [10] and ITU-T X.721 [6]. For the semantics and complete definition of the attributes and parameters, please refer to these specifications.

According to ITU-T X.734 [10], the functionalities of the event reporting management (which here are used for alarm reporting) are:

– Initiation of event forwarding

– Termination of event forwarding

– Suspension of event forwarding

– Resumption of event forwarding

– Modification of event forwarding conditions

– Retrieval of event forwarding conditions.
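As an informative illustration only, the following Python sketch shows the basic discrimination and forwarding role of an EFD: potential alarm reports that satisfy the discriminator are forwarded to the configured destinations, and the others are discarded. The predicate-based discriminator and the names used are hypothetical simplifications of the normative X.734 [10]/X.721 [6] definitions.

```python
# Illustrative sketch of EFD-style discrimination and forwarding
# (hypothetical names; not the normative X.734 EFD definition).

from dataclasses import dataclass
from typing import Callable, List


def send_report(destination: str, report: dict, confirmed: bool) -> None:
    # Placeholder for the actual CMIS event report service.
    print(f"forward to {destination} (confirmed={confirmed}): {report}")


@dataclass
class EventForwardingDiscriminator:
    """Simplified EFD: filters potential reports and forwards the rest."""
    discriminator: Callable[[dict], bool]      # which reports pass
    destinations: List[str]                    # where passing reports are sent
    confirmed: bool = False                    # forwarding mode

    def handle(self, potential_report: dict) -> None:
        if not self.discriminator(potential_report):
            return  # discarded
        for dest in self.destinations:
            send_report(dest, potential_report, confirmed=self.confirmed)


# Example: forward only critical and major alarms to one OS.
efd = EventForwardingDiscriminator(
    discriminator=lambda r: r.get("perceivedSeverity") in ("critical", "major"),
    destinations=["os-1"],
)
efd.handle({"probableCause": "equipmentMalfunction", "perceivedSeverity": "major"})
```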

Alarm notifications are extremely important for the management of the NE. If notifications are lost or delayed, this may affect the operator's ability to effectively manage the system. Breakdowns in communication between the NE and a remote OS shall be accounted for. Therefore the NE shall provide a local storage capability, where notifications can be accessed and retrieved by the managing system.

The output queue from the NE to the OS is currently being studied by ITU-T SG4, and it is expected that the analysis of the output queue will be addressed in a future GSM Phase 2+ version of the present document.

5.1.4 Log control

Logging identifies the standard mechanism used in the NE for the log pre-processing of event notifications, the discrimination of potential log reports, the formatting and storing of the log records in one or more logs, and the retrieval of the log records by the OS from the NE. Within the context of the present document, this functionality is used for logging alarm notifications.

The logging of alarm notifications in the NE shall be performed by means of the two managed object classes "Log" and "Alarm Record" which are defined in ITU-T X.735 [11], ITU-T X.733 [9] and ITU-T X.721 [6]. For the semantics and complete definition of the attributes and parameters, please refer to these specifications.

According to the adopted specifications the functionalities for the log control are:

– Retrieval of alarm records from the Log;

– Deletion of alarm records in the Log;

– Initiation of alarm logging;

– Termination of alarm logging;

– Suspension of alarm logging;

– Resumption of alarm logging;

– Scheduling of alarm logging;

– Modification of logging conditions;

– Retrieval of logging conditions.

In addition to the alarm record retrieval mechanism mentioned above, GSM 12.00 [18] annex B provides a definition for transfer of selected log records from a log instance to the OS as a file. This functionality shall be controlled through the managed object class "simpleFileTransferControl". The functionality required is:

– Bulk transfer of alarm records from the Log (transfer of data from the NE to OS requested by OS).
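As an informative illustration only, the following Python sketch shows the Log behaviour described above: potential log reports that satisfy the discriminator are stored as records and can later be retrieved; the in-memory representation, the wrap behaviour and the names used are hypothetical simplifications of the normative X.735 [11]/X.721 [6] definitions.

```python
# Illustrative sketch of Log discrimination, storage and retrieval
# (hypothetical in-memory log; the normative Log and alarmRecord MOCs are in X.735/X.721).

import itertools


class Log:
    def __init__(self, discriminator, max_records: int = 1000):
        self.discriminator = discriminator           # which potential log reports are stored
        self.max_records = max_records
        self.records = []
        self._ids = itertools.count(1)

    def handle(self, potential_log_report: dict) -> None:
        if not self.discriminator(potential_log_report):
            return                                   # discarded
        record = dict(potential_log_report, logRecordId=next(self._ids))
        self.records.append(record)
        del self.records[:-self.max_records]         # simplistic wrap behaviour (an assumption)

    def retrieve(self, predicate=lambda r: True) -> list:
        """Retrieval of logged records, e.g. by severity or time."""
        return [r for r in self.records if predicate(r)]


# Example: log only alarm notifications, then retrieve the critical ones.
log = Log(discriminator=lambda r: r.get("eventType") == "alarm")
log.handle({"eventType": "alarm", "perceivedSeverity": "critical"})
log.handle({"eventType": "stateChange"})
print(log.retrieve(lambda r: r["perceivedSeverity"] == "critical"))
```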

5.1.5 Alarm summary

The alarm summary functions allow an NE to report a summary of the outstanding alarm conditions of all or selected managed objects to the OS. They provide the following facilities:

– Definition of the alarm summary criteria;

– Requesting of the alarm summary criteria;

– Reporting of the alarm summary (on demand or scheduled);

– Scheduling of a current alarm summary (optional).

By setting the criteria for the generation of current alarm summary reports, the operator can determine whether outstanding alarm condition information on objects is included in a current alarm summary report. The following criteria shall be supported:

– managed objects with outstanding alarm conditions;

– perceived severity;

– alarm status;

– probable cause (optional).

The alarm summary report is sent from the NE to the OS and contains information about those outstanding alarm conditions that match the alarm summary criteria. The information for scheduled reports shall include:

– identification of the object instances with outstanding alarm conditions;

– alarm status;

– perceived severity;

– probable cause.

The presence of alarm status, perceived severity and probable cause in unscheduled reports is dependent on whether they have been requested.

The model adopted for the alarm summary is based on the one defined in ITU-T Q.821 [15]. To provide the alarm summary functions, the OS and the NE shall be able to manage the current alarm summary control object class and, optionally, the management operations schedule object class. The former provides the criteria for the generation of current alarm summary reports, while the latter allows an alarm summary to be scheduled at a specified time or periodically.

Information regarding a managed object shall be included in a current alarm summary report if:

– The managed object is included in the object list (which describes a list of object instances).

– The managed object has an alarm status that is present in the alarm status list (this list describes criteria for inclusion in the current alarm summary report and consists of a set of possible alarm status values; see ITU-T M.3100 [2]).

– The managed object has an alarm with a perceived severity and probable cause matching members of the perceived severity list and probable cause list.

If the object list is empty, then the criteria in the current alarm summary control shall be applied to all the objects of the NE. If any of the other criteria are empty then they are not used in selecting objects that will appear in the current alarm summary report.

The alarm summary provides information about all the objects selected and matching the selection criteria.
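As an informative illustration only, the following Python sketch shows how the selection criteria described above could be evaluated for a managed object, with an empty criterion matching every object; the attribute names are hypothetical, and the normative object class definitions remain those of ITU-T Q.821 [15].

```python
# Illustrative evaluation of the current alarm summary selection criteria
# (hypothetical attribute names; the normative object classes are in Q.821).

def matches_summary_criteria(mo: dict, object_list, alarm_status_list,
                             severity_list, probable_cause_list) -> bool:
    """An empty criterion is not used for selection (i.e. it matches every object)."""
    def match(criterion, value):
        return not criterion or value in criterion

    return (match(object_list, mo["instance"])
            and match(alarm_status_list, mo["alarmStatus"])
            and any(match(severity_list, alarm["perceivedSeverity"])
                    and match(probable_cause_list, alarm["probableCause"])
                    for alarm in mo["outstandingAlarms"]))


# Example: summarise active critical/major alarms on any object (empty object list).
mo = {"instance": "bts-1", "alarmStatus": "activeReportable",
      "outstandingAlarms": [{"perceivedSeverity": "major",
                             "probableCause": "equipmentMalfunction"}]}
print(matches_summary_criteria(mo, object_list=[],
                               alarm_status_list=["activeReportable"],
                               severity_list=["critical", "major"],
                               probable_cause_list=[]))
# -> True
```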

5.2 Fault localisation

This service component requires that the NE shall have the capability to provide all the necessary information to the OS in order to localise the faults that may occur in the NE itself.

In the process of localising faults, the first piece of information is provided by the alarm surveillance service component: after fault detection, an alarm notification is generated and, if the corresponding potential report is not discarded, an alarm report, which should contain sufficient information to localise the fault, is forwarded to the OS.

In order to support localisation wherever possible (as mentioned in subclause 5.1, item a)), the NE should generate a single notification for a single fault.

In case of ambiguity in the localisation, the operator can request from the NE (in case this optional feature is supported by the NE):

– the execution of diagnostic tests

– retrieval of some log records to have a clear view of the events that occurred before the failure

– other information like the current configuration, the value of some measurements, the value of some attributes, etc.

Testing may also be needed to verify the fault if the localisation process is initiated due to, for example, customer complaints instead of an alarm report.

The detailed fault localisation process is closely related to the NE's internal architecture as well as the operator's maintenance and operating procedures, and is therefore not subject to standardisation in the present document.

The resolution of the localisation should be down to one LRU (Least Replaceable Unit) for the majority of the faults. When the NE cannot localise the fault to one LRU, it should, as far as possible, indicate a restricted number of LRUs, ordered according to the probability of being faulty.

5.3 Fault correction

This service component requires capabilities that are used in various phases of the fault management:

1) After a fault detection, the NE shall be able to evaluate the effect of the fault on the telecommunication services and autonomously take recovery actions in order to minimise service degradation.

2) Once the faulty unit(s) has been replaced or repaired, it shall be possible from the OS to put the previously faulty unit(s) back into service so that normal operation is restored. This transition should be done in such a way that the currently provided telecommunication services are not, or only minimally, disturbed.

3) At any time the NE shall be able to perform recovery actions if requested by the operator. The operator may have several reasons to require such actions; e.g. he has deduced a faulty condition by analysing and correlating alarm reports, or he wants to verify that the NE is capable of performing the recovery actions (proactive maintenance).

The recovery actions that the NE performs (autonomously or on demand) in case of fault depend on the nature and severity of the faults, on the hardware and software capabilities of the NE and on the current configuration of the NE.

The faults are distinguished in two categories: software faults and hardware faults.

In the case of software faults, depending on the severity of the fault, the recovery actions may be: system initialisations (at different levels), activation of a backup software load, activation of a fallback software load, download of a software unit, etc.

In the case of hardware faults, the recovery actions depend on the existence and type of redundant (back-up) resources.

If the faulty resource has no redundancy, the recovery actions shall be:

a) Isolate and remove from service the faulty resource so that it cannot disturb other working resources.

b) Remove from service the physical and functional resources (if any) which are dependent on the faulty one. This prevents the propagation of the fault effects to other fault-free resources.

c) Adjust the Operational State and Status attributes of the faulty managed object and the affected managed objects, in a consistent way, reflecting the new situation.

d) Generate and forward (if possible) the reports to inform the OS about all the changes performed.

If the faulty resource has redundancy, the NE shall perform actions a), c) and d) above and, in addition, the recovery sequence which is specific to that type of redundancy.

In the NE, the redundancy of some resources may be provided to achieve fault tolerance and to improve the system availability. There exist several types of redundancy (e.g. hot standby, cold standby, duplex, symmetric/asymmetric, N plus one or N plus K redundancy, etc.) and for each one, in case of failure, there is a specific sequence of actions to be performed. The present document specifies the management (the monitoring and control) of the redundancies, but does not define the specific recovery sequences of the redundancy types.

The NE shall provide the OS with the capability to monitor and control any redundancy of the NE. The control of a redundancy (which means the capability to trigger a change-over or a change-back) from the OS can be performed by means of the state management services or by means of specific actions.

When the state management services are used, the transitions are triggered by locking/unlocking one of the objects participating in the redundancy.

In the case of a failure of a resource represented by a managed object providing service, the recovery sequence shall start immediately; before or during the change-over, a temporary and limited loss of service shall be acceptable. In the case of a management command, the NE should perform the change-over without degradation of the telecommunication services.

5.3.1 The model

The model adopted supports the redundancy management part of the fault correction service component. The model is imported as a subset of the ITU-T G.774.03 recommendation [23] – SDH Management of Multiplex-Section Protection for the Network Element View. The parts relevant for redundancy management in GSM 12.11 are introduced in subclause 5.2.1 of G.774.03, and the necessary managed object classes, protectionGroup and protectionUnit, are subsequently defined in subclauses 6.3 and 6.4 and related definitions. They are also depicted and exemplified in annexes A and B of G.774.03.

5.4 Testing

This service component provides capabilities that can be used in different phases of the fault management; therefore it can also be seen as a ‘support’ to other service components. For example:

– when a fault has been detected and if the information provided through the alarm report is not sufficient to localise the faulty resource, tests can be executed to localise the fault (Fault Localisation service component);

– during normal operation of the NE, tests can be executed for the purpose of detecting faults (Alarm Surveillance service component);

– once a faulty resource has been repaired or replaced, before it is restored to service, tests can be executed on that resource to be sure that it is fault free (Fault correction service component).

However, regardless of the context in which testing is used, its target is always the same: to verify whether a physical or functional resource of the system performs properly and, if it is found to be faulty, to provide all the information needed to help the operator localise and correct the fault.

Testing is an activity that involves the operator, the managing system (the OS) and the managed system (the NE). Generally the operator requests the execution of tests from the OS and the managed NE autonomously executes the tests without any further support from the operator.

In some cases, the operator may request that only a test bed is set up (e.g. establish special internal connections, provide access test points, etc.). The operator can then perform the real tests which may require some manual support to handle external test equipment. Since the "local maintenance" and the "inter NE testing" are out of the scope of the present document, this aspect of the testing will not be treated any further.

The requirements for the testing service component are based on ITU-T X.745 [13], where the testing description and definitions are specified.

5.4.1 The model

The model adopted for the test management is specified in ITU-T X.745 [13] and depicted in figure 5. According to this model, the execution of a test involves two entities: a manager that initiates the test (test conductor), and an agent that executes the test (test performer). In this case, the test conductor resides on the OS while the test performer resides on the managed NE. They communicate with each other via the Q3 interface.

Figure 5: Testing Management Model

The model requires that, in the managed NE, there is at least one object which has the functionality to receive and to respond to the test requests coming from the manager; this functionality is named TARR functionality (where TARR stands for Test Action Request Receiver).

The model also requires that, in the managed NE, for each test execution there is at least one MORT, which is the managed object whose functionality is requested to be tested; in some cases, there may also be one or more associated objects (AO) which participate in the test execution although they are not the primary target of the test. These two conditions imply that the resources which are to be individually tested should each be represented by a managed object.

The model distinguishes between two types of tests: uncontrolled and controlled tests. It is recommended to use the uncontrolled tests (that cannot be monitored or controlled) to model those tests that run very fast and can provide the test results very quickly. For the tests that may take a long time (minutes), it is preferable to use the controlled tests so that, during the execution time, the operator can perform some management activity on the tests, like the monitoring of the test state, test suspend, test resume, etc.

For controlled tests, Test Objects (TOs) are created in the NE as a result of the test request.

5.4.2 Testing requirements

In order to have a flexible, efficient and powerful testing service, the managed NE shall provide the following capabilities to the managing system.

The NE should provide a set of tests which homogeneously cover all the physical and functional parts of the system. Every possible fault occurring on every part of the NE should be covered by at least one test.

The tests should localise the faults as precisely as possible. For the majority of the faults, the localisation should be down to one LRU.

The NE shall have the ability to provide, to the OS, the list of the supported tests (both controlled and uncontrolled tests) and all the related information.

In the NE, the received test requests shall be checked to ensure that the test execution does not produce any uncontrolled and undesired effect on the telecommunication services currently provided by the NE itself. These acceptance checks depend on the type of test (intrusive or non-intrusive), on the current state of the resources to be tested (MORTs), on the current state of other involved resources necessary to set up the test environment (AO), on the current state of the test objects (TO), on the availability of the test infrastructures etc.

The non-intrusive tests can be run independently of the state of the MORTs and therefore do not require any preceding state change of the MORTs.

The intrusive tests can be run only if the MORTs are in the administrative state "locked" and/or in the operational state "disabled". The operator may use the state management services to change the administrative state of the MORTs; the change from unlocked to locked can be graceful, using the transient "shutting down" state.

Depending on the result of the tests, the operational state of the MORTs can change from enabled to disabled if some tests do not pass, or from disabled to enabled if all the tests pass. In the first case, when the tests detect a fault, they shall generate an alarm notification using the alarm surveillance services. In the second case, when the MORTs are returned to the enabled state, they shall generate an alarm notification with severity "cleared" and forward it to the EFD.

It is also possible that some tests fail because of minor faults, so it may be convenient to leave the MORT in service (enabled) instead of removing it from service. In this case, the availability status "degraded" shall be used to remind the operator that the MORT, although enabled, is not in perfect condition and has some minor trouble that requires correction.

When a MORT changes its operational state as a consequence of a test execution, the NE shall automatically and consistently change the state of all the other related managed objects, whose operational state is dependent on the MORT’s operational state.

If the NE provides the capability to execute controlled tests, then it shall be possible to:

– suspend and resume the tests

– monitor the processing of the tests through the state attribute of the TOs

– terminate the tests

– get the test results from the TO when they are provided as attributes.

When a controlled test is suspended, the TO is put in "suspended" state while the involved resources (the AOs) may be released, depending on the specific characteristics of the TO. When the test is resumed, the TO itself determines at what point in the test life-cycle the test will be resumed.
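As an informative illustration only, the following Python sketch shows the monitoring and control of a controlled test through a Test Object; the state names are deliberately simplified and hypothetical and do not reproduce the normative X.745 [13] state model.

```python
# Minimal sketch of controlled-test monitoring and control via a Test Object
# (state names simplified and hypothetical; the normative state model is in X.745).

class TestObject:
    def __init__(self, mort: str):
        self.mort = mort
        self.state = "testing"
        self.results = {}

    def suspend(self):
        if self.state == "testing":
            self.state = "suspended"     # associated objects may be released meanwhile

    def resume(self):
        if self.state == "suspended":
            self.state = "testing"       # the TO decides where in its life-cycle to resume

    def terminate(self):
        self.state = "terminated"


# Example: the operator monitors the state attribute and retminates the test.
to = TestObject(mort="trx-2")
to.suspend()
to.resume()
to.terminate()
print(to.state)   # -> terminated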

If the NE provides the capability to schedule the test execution within a time window, then it shall be possible to set up the boundaries of the time window with a start time and a stop time. The start time is the earliest time at which the test performer can start the test execution (the actual starting time depends on the current conditions of the NE during the time window). The stop time is the latest time at which the test execution should be completed; the actual completion time depends on when the test was actually started and should normally fall before the stop time. The NE may also provide the capability to schedule the time to perform the initialisation of the tests and the time window within which to perform the real test execution. The test performer shall provide complete information to the OS about the actual initialisation time and execution time, together with the test results. If the NE cannot perform the test within the time window, the OS shall be informed.
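As an informative illustration only, the following Python sketch shows a possible check of the time window boundaries described above; the helper name and the times used are hypothetical.

```python
# Illustrative check of a test-execution time window (hypothetical helper;
# the actual scheduling attributes are defined in X.745).

from datetime import datetime, timedelta


def may_start_test(now: datetime, start_time: datetime, stop_time: datetime,
                   expected_duration: timedelta) -> bool:
    """The test may start no earlier than start_time and must be able to
    complete no later than stop_time; otherwise the OS has to be informed."""
    return start_time <= now and now + expected_duration <= stop_time


now = datetime(2024, 1, 1, 2, 30)
print(may_start_test(now,
                     start_time=datetime(2024, 1, 1, 2, 0),
                     stop_time=datetime(2024, 1, 1, 4, 0),
                     expected_duration=timedelta(minutes=20)))
# -> True
```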

The results of the tests are made available by the NE as attribute values of the TO(s) and/or returned via notifications issued by the TO(s). The latter method is possible only for controlled tests. The test results shall contain all the information necessary to localise the faults, and may propose repair actions for the faults, if any. If for any reason a test is prematurely terminated, the partial results collected so far may be reported to the OS.

In the case of uncontrolled tests, the results are reported in the reply(ies) to the test request.

It is expected that the analysis of "cyclic tests" will be addressed in a future GSM Phase 2+ version of the present document.

5.4.3 Test Categories

The test categories are defined in ITU-T X.737 [12] to group and to classify all the tests commonly applied to telecommunication systems, the tests of each group having the same characteristics from a functional and management point of view.

Some of the aspects of testing that are common to all the test categories are specified by ITU-T X.745 [13]. In ITU-T X.737 [12], specific managed object classes (one for each test category) are derived from the TO MOC and characterised with additional packages to model the characteristics specific to each test category. The test categories adopted for the present document are briefly described below.

5.4.3.1 Resource Self Test

This test category is used to characterise those tests that verify the ability of single resources (physical or functional) to correctly perform their allotted functions. For this test category there is only one MORT which represents the resource to be tested. This test category shall not have any AO.

5.4.3.2 Resource Boundary Test

This test category is used to characterise those tests that check the physical resources by observing them from their boundary. To test these physical resources there are one or more Points of Control and Observation (PCOs) from which they are stimulated by means of electrical signals, and observed in their reactions, measuring other electrical signals.

For this test category, there shall be one MORT that represents the physical resource under test, and one or more AOs which represent the PCOs. In this case, the AOs are usually objects specifically designed for testing purposes.

5.4.3.3 Connection Test

This test category is used to characterise those tests that verify the capability of a communication path (real or virtual) to support the desired service or level of functionality. In case it is possible to establish different types of connections on the same communication path, the test can be arranged as a sequence of different ‘exercises’, each one dedicated to a type of connection.

For this test category, there is one MORT that represents the communication path under test, and two AOs (or more in the case of a multi-type connection) which represent the resources used to verify the connection. According to the scope restrictions of the present document, both the MORT and the AOs always belong to the same NE; therefore, there is no special requirement for the AOs to agree on the details of the exercises to be performed.

The tests of this category shall be organised taking into account that the MORT is the communication path. Therefore when an exercise fails and the fault cannot be localised to the MORT (because the fault could also be on one of the AOs), then the exercise should be repeated with different AOs.

5.4.3.4 Data Integrity Test

The purpose of this test category is to verify the capability of a resource to correctly exchange data with other resources.

For this test category there is one MORT which represents the resource under test, and one or more AOs which co-operate to carry out the test.

Usually, during the test execution, the MORT transmits data to an AO which, upon reception, reflects the data back to the MORT; the MORT then verifies that the data has been correctly received. In the case of one-way transmission, the receiving AO does not reflect the data back to the MORT, but instead verifies the correctness of the received data itself.

Also in this case, when a single exercise cannot localise the fault to the MORT, the exercise should be repeated with another AO.

5.4.3.5 Loopback Test

The purpose of this test category is to verify that data can be sent and received through a communication path which is composed of one or more resources, each one represented by a managed object. A loopback may be implemented in a variety of ways, for example by physical loop connection or by echoing data received.

For this test category the MORTs are the MOs of the communication path which is under test. Not all the MOs of the communication path are necessarily under test; some of them could be involved as AOs, especially for the objects that represent the loopback points.

In some cases it is desirable to be able to specify that a loopback be set somewhere within a MORT. This may occur when a resource is modelled as a single MO but is actually complex enough to allow loopbacks to occur in several places within the MORT. For these cases the AOs may be present at locations that will test only a part of the MORT.

To allow the invocation of loopback tests, the configuration in which the MORTs and the AOs are to be placed needs to be defined.

The manager may request a single loopback test by involving only one AO to operate the loopback point, or a multiple loopback test, involving a set of AOs. In this latter case, the order in which the AOs are operated is not defined.

5.4.3.6 Protocol Integrity Test

The purpose of this test category is to verify whether the MORT can conduct proper protocol interactions with a specified AO.

It is planned that the analysis of the protocol integrity test will be addressed in a future GSM Phase 2+ version of the present document.

5.4.3.7 Test-Infrastructure Test

The purpose of this test category is to verify the ability of the NE to initiate tests, return result reports and, for the controlled tests, respond to monitoring and control actions.

For this test category the MORT is the MO which has the TARR functionality while the TO is a "null test" whose purpose is solely to verify the correct behaviour of the test performer.