Analyzing maintenance log data to predict system failures

Cyber-Physical Systems (CPS) are often very complex and require tight interaction between hardware and software. Like almost any software system, a CPS generates different kinds of logs of the activities it performs, covering correct operations, warnings, errors, and so on. These logs are frequently specific to individual subsystems and are generated independently. They contain a wealth of information that can be extracted and analyzed in different ways to understand how a single subsystem behaves and even to learn about the behavior of the overall system. In particular, the generated logs make it possible to:

  1. Analyze the behavior of a single subsystem by looking at the data it generates, independently of the other subsystems;
  2. Analyze the overall behavior of the system by looking at the correlations among the data generated by the different subsystems.
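
The two kinds of analysis can be illustrated with a minimal sketch. It assumes that log entries have already been reduced to (subsystem, time bucket, severity) tuples; the subsystem names "pump" and "valve", the sample records, and the helper functions are hypothetical and serve only as an illustration. The first function looks at one subsystem in isolation, while the second measures how the error counts of two subsystems move together.

```python
from collections import Counter
from math import sqrt

# Hypothetical log records: (subsystem, time bucket in minutes, severity).
RECORDS = [
    ("pump", 0, "ERROR"), ("pump", 0, "INFO"), ("pump", 1, "ERROR"),
    ("pump", 1, "ERROR"), ("pump", 2, "INFO"), ("pump", 3, "ERROR"),
    ("valve", 0, "INFO"), ("valve", 1, "ERROR"), ("valve", 1, "ERROR"),
    ("valve", 2, "INFO"), ("valve", 3, "ERROR"),
]


def error_series(records, subsystem, n_buckets):
    """Per-subsystem view: count ERROR entries in each time bucket."""
    counts = Counter(
        bucket for name, bucket, severity in records
        if name == subsystem and severity == "ERROR"
    )
    return [counts.get(bucket, 0) for bucket in range(n_buckets)]


def pearson(xs, ys):
    """Cross-subsystem view: Pearson correlation of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (std_x * std_y)


pump_errors = error_series(RECORDS, "pump", 4)
valve_errors = error_series(RECORDS, "valve", 4)
correlation = pearson(pump_errors, valve_errors)
```

A high correlation between the two series only suggests that the subsystems tend to report errors in the same time buckets; it does not by itself show that one causes the other.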

Such data are very useful for understanding the behavior of a system and are often used for post-mortem analysis after a failure. However, they can also provide a more comprehensive picture of how the system behaves, through real-time analysis that continuously monitors the different subsystems and their interactions. In particular, it is possible to focus on preventing failures through predictive maintenance triggered by specific analyses.

Predicting system failures by analyzing log files is possible, but the quality of such predictions depends strictly on some characteristics of those files. Three characteristics are particularly important: data generation frequency, information details, and history.

The data generation frequency needs to match the prediction time and the time required to take proper actions. For example, if we need to detect a failure and take proper action within a few minutes, we need data generated at a high frequency (e.g., on the scale of seconds) and we cannot use data generated at a low one (e.g., on the scale of hours). This requirement affects both the ability to make predictions and their usefulness for implementing proper maintenance actions.
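
As a rough illustration of this constraint, the following sketch checks whether a given logging interval leaves enough samples to observe before an action must start. The function name, its parameters, and the threshold of ten samples are assumptions chosen for the example, not values taken from a specific system.

```python
def usable_for_prediction(log_interval_s, reaction_budget_s, action_time_s,
                          min_samples=10):
    """Return True if enough log samples arrive before action must begin.

    log_interval_s:    seconds between two consecutive log entries
    reaction_budget_s: total time available to detect and react to a failure
    action_time_s:     time needed to carry out the maintenance action
    min_samples:       assumed minimum number of samples a model needs
    """
    observation_window_s = reaction_budget_s - action_time_s
    if observation_window_s <= 0:
        return False  # no time left to observe anything
    return observation_window_s / log_interval_s >= min_samples


# Five-minute budget, two minutes needed to act, one entry every 5 seconds.
fast_logs_ok = usable_for_prediction(5, 300, 120)
# Same budget, but one entry per hour.
slow_logs_ok = usable_for_prediction(3600, 300, 120)
```

With these illustrative numbers, the 5-second logs leave 36 samples in the observation window and pass the check, while the hourly logs do not.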

The information details provided need to include proper granularity and meaningful messages. In particular, it is important to get detailed information about errors, warnings, operations performed, status of the system, etc. The specific details required are tightly connected to the specific predictions that are needed. Moreover, the finer the granularity of the information, the higher the chances of being able to create a proper prediction model.
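
To make the idea of granularity concrete, the sketch below turns a raw log line into structured fields that a prediction model could use. The line layout (ISO timestamp, subsystem, severity, error code, free-text message) is a hypothetical format assumed for the example; real CPS logs will differ and need their own pattern.

```python
import re
from datetime import datetime

# Hypothetical line layout: "<ISO timestamp> <subsystem> <level> <code> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<subsystem>\w+) (?P<level>INFO|WARNING|ERROR) "
    r"(?P<code>[A-Z]\d{3}) (?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.fromisoformat(record["ts"])
    return record


record = parse_line("2024-01-15T10:30:00 pump ERROR E042 pressure above limit")
unparsed = parse_line("something went wrong")
```

A line such as "something went wrong" carries no timestamp, subsystem, or code, so it cannot be parsed and contributes little to a model; this is the kind of low-detail entry that limits what can be predicted.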

High-quality data history is required to build proper prediction models. However, just having a large dataset is not enough. Historical data need to be representative of the operating environment and include all the cases that may happen during operations. In particular, both the log entries and the actual behavior of the system must be recorded in order to create a reliable model of reality.
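
One way to combine log entries with the actual behavior of the system is to label each window of log data according to whether a recorded failure followed it. The sketch below assumes that failure times are known from maintenance records and that all times are expressed in seconds; the function name, the window length, and the horizon are hypothetical choices made for the example.

```python
def label_windows(window_starts, failure_times, window_len, horizon):
    """Label each log window 1 if a failure occurs within `horizon` after it ends."""
    labels = []
    for start in window_starts:
        end = start + window_len
        failed_soon = any(end <= t < end + horizon for t in failure_times)
        labels.append(int(failed_soon))
    return labels


# Four one-minute windows, one recorded failure at t = 250 s,
# and a prediction horizon of two minutes.
labels = label_windows([0, 60, 120, 180], [250], window_len=60, horizon=120)
positive_share = sum(labels) / len(labels)
```

Checking the share of positive labels is a simple first test of representativeness: a history with almost no failure cases, or with only one kind of failure, is unlikely to support a reliable model regardless of its size.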

The requirements described are just a first step towards the definition of a proper predictive maintenance model, but they are essential. Moreover, the approaches and algorithms need to be selected based on the specific system and its operating conditions.