Background Pixel Classification for Motion Detection in Video Image Sequences

P. Gil-Jiménez, S. Maldonado-Bascón, R. Gil-Pita, and H. Gómez-Moreno

Dpto. de Teoría de la Señal y Comunicaciones, Universidad de Alcalá, 28871 Alcalá de Henares (Madrid), Spain
{pedro.gil,saturnino.maldonado,roberto.gil,hilario.gomez}@uah.es

Abstract. The main objective of motion detection algorithms for video surveillance applications is to minimize the false alarm probability while keeping the probability of detection as high as possible. Many motion detection systems fail when the noise in a specific zone is high: the false detection probability increases, and the system can no longer detect motion in those zones. In this paper we present an alternative scheme that addresses this problem using the classification capacity of a neural network.

1 Introduction

A video surveillance system arises when a special need for security or safety exists in some place. The first generation of such systems consisted of many video cameras and a video monitor, operated by people. A system working that way has two important drawbacks. First, the amount of information a person can process is extremely low, and their efficiency decreases with time. Second, when video images of a public place are processed by humans, the people involved can feel that their privacy is being violated [1].

In this context, the second generation of video surveillance systems is based on digital signal processing, especially on artificial intelligence and image processing [1]. In these systems only the important events are stored, saving a large amount of storage capacity that would otherwise be wasted. These systems can also improve the working conditions of the video surveillance operators, releasing them from the more monotonous tasks, and are better accepted by the general public.

2 Motion detection

When motion detection is needed in a video surveillance system, several methods can be implemented.
The easiest way to develop such a system is to compute the absolute value of the difference between the current frame and a reference frame, and then threshold the difference matrix. Pixels whose values are higher than an arbitrarily chosen threshold are assumed to be part of an object moving in the scene, while those whose values are smaller belong to the background.
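As an illustrative sketch (not part of the original system; the array shapes, the function name and the threshold value are our assumptions), this frame-differencing scheme can be written in Python with NumPy as follows:

```python
import numpy as np

def motion_mask(frame, reference, th=25):
    """Mark pixels whose absolute difference from the reference
    frame exceeds the arbitrarily chosen threshold th."""
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    return diff > th  # True where motion is assumed

# Toy example: a flat background with one brighter moving patch.
background = np.full((4, 4), 100, dtype=np.uint8)
current = background.copy()
current[1:3, 1:3] = 180  # simulated moving object (4 pixels)

mask = motion_mask(current, background)
```

The cast to a signed type before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would produce when the current pixel is darker than the reference.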
The reference frame can be either the previous frame [2] or an estimation of the background computed over the last processed frames [3] [4].

M(i, j) = |I(i, j) − B(i, j)| > th    (1)

The critical point of this kind of system is the value of the threshold, which must be applied to the whole image. If the threshold were small, the false alarm probability would increase, mainly due to the noise picked up by the camera and the noise generated by a non-static background, e.g. a tree branch moving on a windy day. On the other hand, when the threshold is high, the miss probability increases, because the absolute difference between the background gray level and the object gray level is likely to be lower than the threshold. Moreover, in real video images the noise level and the illumination conditions can be different in each part of the frame.

To solve this problem, many schemes implement a statistical module that tries to determine the characteristics of every pixel of the image over time. To achieve this, the algorithm establishes a training period during which, supposedly, there is no motion at all in the sequence, and all the variance of every pixel is due to noise. By computing the pixel statistics it is possible to infer the behavior of each part of the image, and thus to decide whether the next sample of a particular pixel corresponds again to the background, if its value is close to the one expected according to the computed statistics, or belongs to a new object moving across the image. The most common algorithms used in such schemes assume that the probability density function (p.d.f.) is Gaussian [5]: after computing the mean (µ) and the standard deviation (σ), the next pixel is considered background if its value lies within the interval [µ − kσ, µ + kσ], and motion otherwise, where k is an arbitrary constant that must be chosen by the user.
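A minimal sketch of this per-pixel Gaussian model (the function names, the value of k, and the eps floor for noise-free pixels are our assumptions, not taken from the paper):

```python
import numpy as np

def fit_background(training_frames):
    """Per-pixel mean and standard deviation estimated over a
    supposedly motion-free training period (grayscale frames)."""
    stack = np.stack(training_frames).astype(np.float64)
    return stack.mean(axis=0), stack.std(axis=0)

def classify(frame, mu, sigma, k=3.0, eps=1.0):
    """A pixel is background if its value lies in [mu - k*sigma,
    mu + k*sigma]; eps keeps the interval non-degenerate."""
    s = np.maximum(sigma, eps)
    return np.abs(frame - mu) > k * s  # True = motion

rng = np.random.default_rng(0)
# 50 training frames: background level 100 with mild Gaussian noise.
train = [100 + rng.normal(0, 2, (4, 4)) for _ in range(50)]
mu, sigma = fit_background(train)

test_frame = 100 + rng.normal(0, 2, (4, 4))
test_frame[2, 2] = 160  # object pixel far outside the interval
motion = classify(test_frame, mu, sigma)
```

A pixel at gray level 160 over a background with mean near 100 and a standard deviation of about 2 falls well outside the k·σ interval, so it is flagged as motion.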
Although this method works correctly if the pixel has a low noise level, when the background generates a high noise level, due, for example, to a tree branch on a windy day, or to a high-density road where the car motion is not of interest to the system, the variance of that zone will be very high, so the gray-level interval in which a new sample can be considered motion is very small, increasing the miss probability in those zones.

Furthermore, if at a particular moment a sudden illumination change occurs without any other kind of motion in the scene, the described algorithms normally fail: when the absolute difference or the standard deviation interval check is performed, they usually mark the pixel as a motion pixel when it should have been marked as background. If the illumination change affected the whole scene, almost all of the image would be marked as motion.

3 Background pixel classification

The proposed method tries to classify each zone of the image into one of the following groups, according to its behavior over the last processed frames. Furthermore, once the algorithm has succeeded in classifying every zone, the decision about whether the next sample belongs again to the background, or is part of a new moving object, has its own rules, depending on the group into which the zone was classified.

3.1 Description of the background behavior groups

According to the most typical sequences used in video surveillance systems, we have decided to create the following background behavior groups, although the system is not restricted to this set, and new groups could be created in future work. The groups have been chosen mainly by observing the most problematic pixel behaviors in typical motion detection scenes.

Static background: This group corresponds to those pixels whose values are almost constant along the sequence. In real-world images, these are the zones that are not affected by background motion or fast illumination changes. The decision about the existence of motion in this group is made by computing the mean and standard deviation of the pixel; the next sample is considered background again if its value lies within the interval [µ − kσ, µ + kσ], and motion otherwise. Figure 1 shows an example of the behavior of a static background pixel.

Fig. 1. Static pixel example

To handle sudden illumination changes, the algorithm classifies each pixel belonging to this group into one of these two subgroups:

Static background with high texture information: This subgroup corresponds to those static background pixels whose neighborhood carries enough texture information to compute the angle difference between two consecutive vectors, each made up of the referred pixel and its neighborhood pixels in the same frame. Computing the angle difference (NCCF), instead of the Euclidean difference [6], is more robust in the presence of sudden illumination changes.
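As a sketch (the function and variable names, patch size and sample values are our assumptions), the angle comparison between the two neighborhood vectors can be computed as their normalized cross-correlation, i.e. the cosine of the angle between them:

```python
import numpy as np

def nccf(patch_current, patch_background):
    """Cosine of the angle between the two neighborhood vectors:
    close to 1 when they differ only by a positive gain factor."""
    a = patch_current.ravel().astype(np.float64)
    b = patch_background.ravel().astype(np.float64)
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom

background_patch = np.array([[10., 50., 20.],
                             [80., 40., 30.],
                             [25., 60., 15.]])

# Global illumination change: same texture, 1.5x brighter.
brighter = 1.5 * background_patch
# A genuinely different patch (an object covering the neighborhood).
occluded = np.full((3, 3), 40.0)
```

The NCCF of the brighter patch against the background stays at 1, since a multiplicative illumination change does not alter the angle between the vectors, while the flat occluding patch yields a noticeably lower value.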
NCCF = ( Σ_{i,j} I(i, j) B(i, j) ) / √( Σ_{i,j} I²(i, j) · Σ_{i,j} B²(i, j) )    (2)

Static background without texture information: This subgroup corresponds to those static background pixels whose neighborhood carries scarce texture information, so illumination-change errors must be detected via high-level algorithms.

Noisy background: This group corresponds to those pixels whose values, and the values of their neighborhood pixels, are essentially random over time, and whose variance is too high to allow motion estimation by thresholding the absolute difference between the current frame and the reference image. This behavior can come, for example, from tree leaves or from a water surface. To decide the existence of motion in the next frame, we can assume that the spatial variance of a new object moving through a zone of this kind is lower than the spatial variance of the background. So, for the pixels belonging to this group, the motion estimation can be achieved by computing the spatial variance of the new frame. Figure 2 shows an example of the behavior of a noisy background pixel.

Fig. 2. Noisy pixel example

Impulsive background: In many other situations the background pixel gray level is practically stationary over time, but randomly its value changes suddenly and then returns to its original value, in an impulsive-like manner. This behavior can correspond, for instance, to a road where, randomly, a car appears in the scene in one frame and disappears in the next. Since we are not interested in this kind of event, but in much rarer ones, the system can interpret these zones as impulsive-background zones, so it will not generate motion alarms there for this kind of event, unless a special event happens, such as a car stopping on that road for a considerable time. The alarm will be triggered if the pixel value remains away from its stationary value for a number of samples. Figure 3 shows an example of the behavior of an impulsive background pixel.

Fig. 3. Impulsive pixel example

4 Experimental results

Fig. 4. First image of the experimental sequence
A sequence of images was chosen to test the proposed scheme. The sequence was taken with a standard video camera from an exterior scene, with an interval of 0.5 seconds between samples. Figure 4 is the first image of this sequence, and figure 5 is image number 64, on which the positions of the pixels shown in figures 1 (static pixel [x=60, y=130]), 2 (noisy pixel [x=326, y=87]) and 3 (impulsive pixel [x=304, y=196]) are marked.

As a training set we took 20 pixels belonging to each group (static, noisy and impulsive), each observed over the whole length of the sequence, and a validation set comprising another 20 pixels for each group. The pixels for the noisy group were taken mainly from the tree line, the pixels for the impulsive group from the road, and the pixels for the static group from the rest of the image. The test set is composed of the whole image.

For the experiment we designed a multilayer perceptron [7] [8]. The inputs to the neural network are the samples of a pixel, together with the values of its corresponding DCT, to give robustness against temporal shifts. The multilayer perceptron has 5 neurons in its hidden layer and 3 outputs, corresponding to the 3 background pixel groups (static, noisy and impulsive). Training was early-stopped using the validation set, for generalization purposes.

Fig. 5. Image 64 of the experimental sequence

The background pixel classification image has been divided into 2 different images (figures 6 and 7) for clarity. Figure 6 shows the static vs non-static classification results after repeating the experiment 10 times and taking the best result over the validation set, for regularization purposes. Black pixels correspond to the static background group, while white ones correspond to the non-static background group. As we can see, the static background pixels are mainly those belonging to the grass surface, the mountain and the sky, while the pixels belonging to the tree line and the roads have been marked as non-static background pixels.

Fig. 6. Static vs non-static classification

Figure 7 shows the noisy vs impulsive classification results obtained in the last experiment. White pixels correspond to the noisy background group and black pixels to the impulsive background group. Gray pixels are those belonging to the static background group shown in figure 6. The pixels corresponding to the tree line have been marked by the neural network as noisy pixels, while the pixels from the roads have been marked as impulsive pixels.

Fig. 7. Noisy vs impulsive classification

5 Conclusion and future work

Motion detection is one of the main tasks of a video surveillance system. Normally, the detection is done by comparing the current frame with a previously stored reference. In this paper we have proposed an alternative method to obtain the reference, which can improve the results of the system when working in adverse conditions, especially by reducing the false alarm probability and the miss probability in such cases. Instead of thresholding the difference between the current frame and the reference, as in the typical scheme, we first classify each zone of the image depending on its observed behavior, and then perform the motion detection according to this classification.

The main drawback of the proposed method is the high computational cost of classifying every pixel, so we have had to work with relatively small images. However, since the background behavior is supposed not to change permanently, the classification can be done from time to time, reducing the computational requirements of the system. Our future work aims to reduce the amount of signals that have to be classified, using spectral analysis and image and video compression functions, thereby reducing the computational cost and allowing the system to work in real time, as this is the basic requirement for this kind of system.

References

1. Foresti, G.L., Mähönen, P., Regazzoni, C.S.: Multimedia Video-Based Surveillance Systems. Kluwer Academic Publishers (2000)
2. di Stefano, L., Neri, G., Viarani, E.: Analysis of pixel-level algorithms for video surveillance applications. In: CIAP01. (2001) 541–546
3. Gao, D., Zhou, J.: Adaptive background estimation for real-time traffic monitoring. In: IEEE. (2001) 330–333
4. Ridder, C., Munkelt, O., Kirchner, H.: Adaptive background estimation and foreground detection using Kalman filtering. In: ICAM. (1995) 193–199
5. Ren, Y., Chua, C., Ho, Y.: Motion detection with non-stationary background. In: CIAP01. (2001) 78–83
6. Dawson-Howe, K.: Active surveillance using dynamic background subtraction. In: TR. (1996)
7. Haykin, S.: Neural Networks. A Comprehensive Foundation. Second edn. Prentice Hall Inc. (1999)
8. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press Inc. (1995)