When analysing sensor data, you are typically confronted with different challenges relating to data quality. Here, we show you how these challenges can be dealt with and how we derive some initial insights from cleaned data via exploration techniques such as clustering.
Nowadays, especially with the advent of the Internet of Things (IoT), large quantities of sensor data are collected. Small sensors can be easily installed, on multipurpose industrial vehicles for instance, in order to measure a vast range of parameters. The collected data can serve many purposes, e.g. to predict system maintenance. However, when analysing it, you are typically confronted with different challenges relating to data quality, e.g. unrealistic or missing values, outliers, correlations and other typical and atypical obstacles. The aim of this article is to show how these challenges can be dealt with and how we derive some initial insights from cleaned data via exploration techniques such as clustering.
Within the MANTIS project, Sirris is developing a general methodology that can be used to explore sensor data from a fleet of industrial assets. The main goal of the methodology is to profile asset usages, i.e. define separate groups of usages that share common characteristics. This can help experts to identify potential problems, which are not visually observable, when the resulting profiles are compared with the expected behaviour of the assets and when anomalies are detected.
In this article, we will describe the methodology of asset usage profiling for proactive maintenance prediction. The data used in this article is confidential and anonymised; we therefore cannot describe it in detail. It mainly consists of duration and resource consumption as well as a range of parameters measured via different sensors. For our analysis, we used Jupyter Notebook with appropriate libraries such as pandas, scipy and scikit-learn.
Data can be polluted: it is collected from different sources and can contain duplicates, wrong values, empty values and outliers, all of which should be considered carefully. The natural first step is therefore to conduct an initial exploration of the data and to prepare a single reference dataset for advanced analysis, first by cleaning the data using visual and statistical methods, then by selecting the attributes you wish to work with further.
In our example dataset, we find negative or zero-resource consumption, a situation that is obviously impossible, as shown in Figure 1. In our case, since there are few outliers of this type, we simply remove them from the dataset.
Figure 1 Zero or negative consumption
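As a minimal sketch of this cleaning step, the pandas snippet below drops records whose consumption is zero or negative. The table and its column names (usage_id, duration, consumption) are made up for illustration, since the real data is confidential.

```python
import pandas as pd

# Hypothetical usage records; column names and values are assumptions.
usages = pd.DataFrame({
    "usage_id": [1, 2, 3, 4, 5],
    "duration": [120.0, 300.0, 45.0, 600.0, 90.0],
    "consumption": [10.5, 0.0, -3.2, 48.0, 7.1],
})

# Consumption must be strictly positive; zero or negative values are invalid.
invalid = usages["consumption"] <= 0
print(f"Removing {invalid.sum()} of {len(usages)} records")

# Keep only the valid records and renumber the rows.
cleaned = usages.loc[~invalid].reset_index(drop=True)
```

Counting the invalid records before removing them makes it easy to confirm that only a small share of the data is affected.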
Another possible example is that of an erroneous date in the data. For example, dates may be too old compared to the rest of your dataset; future dates can even exist. Your decision to maintain, fix or remove wrong instances can depend on many factors, such as how big your dataset is, whether an erroneous date is very important at the current stage, etc. In our case, we maintain these instances since, at this moment, the date is not important for analysis and the percentage of this subset is very low.
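Suspicious dates can be flagged rather than removed, so that the decision can be revisited later. The sketch below assumes a hypothetical start column and an arbitrary plausibility window; both would have to be adapted to the dataset at hand.

```python
import pandas as pd

# Hypothetical usage start dates; the column name is an assumption.
usages = pd.DataFrame({
    "usage_id": [1, 2, 3, 4],
    "start": pd.to_datetime(
        ["2016-03-01", "1970-01-01", "2016-05-12", "2099-12-31"]
    ),
})

# Assumed plausibility window for this illustration only.
earliest = pd.Timestamp("2010-01-01")
latest = pd.Timestamp("2017-12-31")

# Flag dates outside the window instead of dropping the records.
usages["date_suspect"] = ~usages["start"].between(earliest, latest)

# Share of suspicious records, useful for deciding whether to keep them.
share = usages["date_suspect"].mean()
print(f"{share:.0%} of records have a suspicious date")
```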
Outliers are extreme values that deviate markedly from other observations and also need to be dealt with carefully. They can be detected both visually and by statistical means. Sometimes we can simply remove them; sometimes we want to analyse them thoroughly. Visualising the data directly reveals some potential outliers; see the point in the upper right-hand corner of Figure 2. In our case, such high values for duration and consumption are impossible, as shown in Figure 3. Since this is the first record for this type of asset, it may have been entered manually for test purposes; we consequently choose to remove it.
Figure 2 Visual check for outliers
Figure 3 Impossible data
In Figure 4, we can see a positive linear correlation between consumption and duration, which is to be expected. However, we may still find some outliers using the 3-sigma rule. This rule states that, for a normal distribution, approximately 99.7 percent of observations lie within 3 standard deviations of the mean. By Chebyshev’s inequality, even for non-normally distributed data, at least 8/9 of observations (approximately 88.9 percent) fall within the 3-sigma interval. Thus, we consider observations beyond 3 sigma as outliers.
Figure 4 Data after cleaning
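The bound for non-normally distributed data follows directly from Chebyshev's inequality with k = 3, where X is the observed quantity with mean \mu and standard deviation \sigma:

```latex
P\left(|X - \mu| \ge k\sigma\right) \le \frac{1}{k^{2}}
\quad\Longrightarrow\quad
P\left(|X - \mu| < 3\sigma\right) \ge 1 - \frac{1}{3^{2}} = \frac{8}{9} \approx 0.889
```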
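In Figure 5, we see that our normalised data is approximately normally distributed: it is centred around 0, with most values lying between -2 and 2. This means that the 3-sigma rule, which assumes normality, will give us more reliable results. Note that the data must be normalised (standardised to zero mean and unit standard deviation) before the rule can be applied as a simple threshold of 3.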
Figure 5 Distribution of normalised consumption/s
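The normalisation and the 3-sigma check can be sketched as follows. The data is synthetic, the column name consumption_rate is an assumption, and three extreme values are planted on purpose so that the rule has something to find.

```python
import numpy as np
import pandas as pd

# Synthetic consumption rates; values and column name are assumptions.
rng = np.random.default_rng(0)
rate = rng.normal(loc=0.08, scale=0.01, size=500)
rate[:3] = [0.20, 0.15, 0.01]  # planted extreme values
df = pd.DataFrame({"consumption_rate": rate})

# Standardise to zero mean and unit standard deviation (z-scores).
mean = df["consumption_rate"].mean()
std = df["consumption_rate"].std()
df["z"] = (df["consumption_rate"] - mean) / std

# 3-sigma rule: flag observations more than 3 standard deviations from the mean.
df["outlier"] = df["z"].abs() > 3
print(df["outlier"].sum(), "observations flagged")
```

Flagged observations are candidates for review with a domain expert rather than automatic deletions.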
Results are shown in Figure 6. The reason for such a significant deviation from the average in consumption and duration of certain usages is to be discussed with a domain expert. One instance with very low consumption for a long duration raises particular questions (Figure 7).
Figure 6 3-sigma rule applied to normalised data
Figure 7 Very low consumption for its duration
Advanced data exploration
As previously stated, we are looking to profile asset usages in order to identify abnormal behaviour and therefore, along with duration and resource consumption, we also need to investigate the operational sensor data for each asset. This requires us to define groups of usages that share common characteristics; however, before doing so, we need to select a representative subset of data with the right sensors.
From the preliminary analysis, we observed that the number of sensors can differ between the assets and even between usages for the same asset. Therefore, for later modelling we need to select only usages that always contain the same sensors, because training a model requires vectors of the same length. To achieve this, we can use the following approach, as illustrated in Figure 8.
Figure 8 Selecting sensors
Each asset has a number of sensors that can differ from usage to usage, i.e. some modules can be removed or installed on the asset. Thus, we need to check the presence of these sensors across the whole dataset. Then, we select all sensors that are present in at least a certain percentage of usages, e.g. 95 percent, across the whole dataset. Let’s assume our dataset contains 17 sensors that are present in at least 95 percent of all usages. We select these sensors and discard those with lower presence percentages. This way, we create a vector of sensors of length 17. Since we decided to include sensors if they are 95 percent present, a limited number of usages may still be selected although they do not contain some of the selected sensors, i.e. you introduce gaps, which are marked in yellow in the figure. To fix these gaps, you can either discard these usages or impute values for the missing sensors. Imputation can be complex, as you need to know what these sensors mean and how they are configured. In our case, these details are anonymised and these usages are consequently discarded. You may need to lower your presence percentage criterion in order to keep a sufficiently representative dataset for further analysis.
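The selection procedure can be sketched with pandas as below. The table is a toy example with hypothetical sensor names (s1 to s3), where a missing value means the sensor was absent for that usage, and the threshold is lowered from 95 percent to 75 percent only because the example is so small.

```python
import numpy as np
import pandas as pd

# Hypothetical usages x sensors table; NaN means the sensor was absent.
readings = pd.DataFrame(
    {
        "s1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "s2": [0.5, np.nan, 0.7, 0.8, 0.9],
        "s3": [np.nan, np.nan, np.nan, 1.0, np.nan],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)

threshold = 0.75  # assumed; the article uses 0.95 on the real dataset

# Share of usages in which each sensor is present.
presence = readings.notna().mean()

# Keep sensors whose presence reaches the threshold.
selected = presence[presence >= threshold].index.tolist()

# Discard the usages that still have gaps in the selected sensors.
subset = readings[selected].dropna()
print(selected, subset.shape)
```

Replacing dropna() with an imputation step would be the alternative mentioned above, but that requires knowledge of what each sensor measures.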
After the optimal subset is selected, we check the correlation of the remaining sensors. We do this because we want to remove redundant information and to simplify and speed up our calculations. Plotting a heatmap is a good way of visualising correlation. We do this for the remaining sensors as shown in Figure 9.
Figure 9 Sensor correlation heatmap
In our case, we have 17 sensors from which we select only 7 uncorrelated sensors and plot a scatter matrix, a second visualisation technique which allows us to view more details on the data. Refer to Figure 10.
Figure 10 Scatterplot matrix of uncorrelated sensors
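One common heuristic for reducing a set of sensors to an uncorrelated subset is to drop one sensor from every highly correlated pair. The sketch below illustrates this on synthetic data; it is not necessarily the procedure used on the real dataset, and the cut-off of 0.9 is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic sensors: s2 nearly duplicates s1, s4 nearly mirrors s3.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = rng.normal(size=200)
sensors = pd.DataFrame({
    "s1": a,
    "s2": 2 * a + rng.normal(scale=0.01, size=200),
    "s3": b,
    "s4": -b + rng.normal(scale=0.01, size=200),
})

# Absolute Pearson correlation between every pair of sensors.
corr = sensors.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

limit = 0.9  # assumed correlation cut-off
to_drop = [col for col in upper.columns if (upper[col] > limit).any()]
uncorrelated = sensors.drop(columns=to_drop)
print("Dropped:", to_drop)
```

The absolute value is used so that strongly negative correlations count as redundant too.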
Based on the selected sensors, we now try to characterise different usages for each asset, i.e. we can group usages across the assets based on their sensor values and, in this way, derive a profile for each group. To do this, we first apply hierarchical clustering to group the usages and plot the resulting dendrogram. Hierarchical clustering helps to identify the inner structure of the data and the dendrogram is a binary tree representation of the clustering result. Refer to Figure 11.
Figure 11 Dendrogram
On this graph, below distance 2 we see smaller clusters that are grouping ever closer to each other. Hence, we decide to split the data into 5 different clusters. You can also use silhouette analysis for selecting the best number of clusters.
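A minimal sketch of hierarchical clustering combined with silhouette analysis, using scipy and scikit-learn, is shown below. The data is synthetic with four planted groups, and Ward linkage is an assumed choice, since the linkage method is not specified here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic usages: four well-separated groups of 25 points each.
rng = np.random.default_rng(3)
centres = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X = np.vstack([c + rng.normal(scale=0.4, size=(25, 2)) for c in centres])

# Scale the features so that no sensor dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# Build the cluster tree; Ward linkage is an assumption.
# scipy.cluster.hierarchy.dendrogram(Z) would draw it with matplotlib.
Z = linkage(X_scaled, method="ward")

# Cut the tree into k clusters for several k and score each cut.
scores = {}
for k in range(2, 8):
    labels_k = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = silhouette_score(X_scaled, labels_k)

# The k with the highest mean silhouette is the suggested number of clusters.
best_k = max(scores, key=scores.get)
labels = fcluster(Z, t=best_k, criterion="maxclust")
print("Suggested number of clusters:", best_k)
```

The silhouette score only suggests a number of clusters. Inspecting the dendrogram, as done above, remains a useful sanity check.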
To interpret the clusters, we also want to visualise them. However, 7 sensors mean 7 dimensions, which cannot be plotted directly, so we apply Principal Component Analysis (PCA) to reduce the number of dimensions to 2. This allows us to visualise the clustering result, as shown in Figure 12. In a good clustering the clusters are reasonably well separated, i.e. points of the same colour lie close to one another and are not mixed too much with other colours, and this is what we see in the figure.
Figure 12 PCA plot
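The projection to two dimensions can be sketched with scikit-learn as follows. The 7-dimensional input is random synthetic data standing in for the confidential sensor values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 usages described by 7 sensors.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 7))

# Standardise first so that every sensor contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

# Share of the total variance retained by the 2-D projection.
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by 2 components: {explained:.1%}")

# With matplotlib, the clusters could then be drawn with, for example,
# plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels)
```

The explained variance ratio indicates how much information is lost in the projection, which matters when judging cluster separation from a 2-D plot.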
After the clustering is complete, we can characterise usages. This can be done using different strategies. The simple method consists in taking the mean of the sensor values for each cluster (i.e. we calculate a centroid) to define a representative usage.
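Computing such centroids is a one-line group-by in pandas. The sensor values and cluster labels below are invented for illustration.

```python
import pandas as pd

# Hypothetical sensor values and cluster labels for five usages.
sensors = pd.DataFrame({
    "s1": [1.0, 1.2, 5.0, 5.4, 9.0],
    "s2": [0.1, 0.3, 2.0, 2.2, 4.0],
})
labels = pd.Series([1, 1, 2, 2, 3], name="cluster")

# The centroid of a cluster is the mean of each sensor over its usages.
centroids = sensors.groupby(labels).mean()
print(centroids)
```

Each row of centroids can be read as a representative usage for that cluster.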
The last step involves validating the clusters. We can cross-check clustering with the consumption/duration of usages. For instance, we may expect all outliers to fall within one specific cluster, or expect some other more or less obvious patterns, hence rendering our clusters meaningful. In Figure 13 below, we can observe that the 5 clusters, i.e. 5 types of usages, correspond, to an extent but not entirely, to consumption/duration behaviour. We can see purple spots at the bottom and green spots at the top.
Figure 13 Relationship between clusters and consumption/duration
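Besides the visual check, the clusters can be cross-tabulated with consumption/duration statistics. The sketch below assumes a hypothetical table that already holds a cluster label and a 3-sigma outlier flag per usage; all values are invented.

```python
import pandas as pd

# Hypothetical usages with cluster labels and an outlier flag.
usages = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "duration": [100.0, 120.0, 300.0, 320.0, 50.0, 60.0],
    "consumption": [10.0, 12.0, 45.0, 50.0, 2.0, 30.0],
    "outlier": [False, False, False, False, False, True],
})

# Consumption per unit of duration for each usage.
usages["rate"] = usages["consumption"] / usages["duration"]

# Per-cluster size, mean consumption rate and number of flagged outliers.
summary = usages.groupby("cluster").agg(
    n=("rate", "size"),
    mean_rate=("rate", "mean"),
    outliers=("outlier", "sum"),
)
print(summary)
```

If outliers concentrate in one cluster, or the mean rates differ clearly between clusters, that supports the interpretation of the clusters as distinct usage types.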
At this stage, some interesting outliers have been detected in the consumption/duration relationship, which can be examined in light of the purposes the assets were used for. We have also found clusters that represent typical usages according to the data. Validation of the results can be improved by integrating additional data, such as maintenance data, into the analysis. Furthermore, the results can be validated, and conclusions drawn with confidence, by the domain experts from Ilias Solutions, the industrial partner we are supporting in exploiting their data.