Colour Based People Search in Surveillance

Ian Dashorst (5730007)
Bachelor thesis, Credits: 9 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: M. Worring
Intelligent Systems Lab Amsterdam, University of Amsterdam
Science Park 904, 1098 XH Amsterdam

June 24th, 2011

Keywords: colour descriptor, light intensity invariance, object recognition, image search, surveillance, colour indexing

Abstract

This paper investigates how to find a person in surveillance footage using a colour-based search. The research consists of two parts. First, the words for the different colours are studied and mapped to a representation for each separate colour. Second, a search method is developed to find the best matches per colour. For the mapping, a group of people was asked to name the colour of the upper body in images of a real-world dataset. The search is done on two different datasets that are a good representation of real-world situations: first the VIPeR dataset is investigated, and then the outcome is compared to a real-world dataset. The results indicate that using a colour description to find initial images of persons in surveillance data is a promising solution to this problem.

1. Introduction

Closed-circuit television (CCTV) is an important tool for crime prevention and investigation. The amount of CCTV footage being recorded is increasing rapidly, and it takes the police a lot of time to use the footage to track down witnesses or suspects. As footage is increasingly recorded digitally, it becomes possible to automate the analysis. One major application of camera surveillance is post-incident crime investigation, which has three major steps: detecting the persons in each of the cameras, tracking the persons within a single camera, and finding instances of these persons in other cameras (Metternich et al., 2010). A lot of research has already been done on these steps for the case where an initial image of the person is available, but what can be done when there is no such image? If there is a description of the person, typically stating characteristics of clothing, that description can serve as the starting point. The most prominent feature, both on the cameras and in the description, is the colour of the clothing. This paper therefore focuses on the colour description.

Developing a method for matching such a description to the data in real-world surveillance comes with major difficulties. In a laboratory setting it is possible to control the different variations in an image, but in a real-world situation this is often not possible. The large variations in lighting conditions, shading and viewpoint greatly affect the appearance of a person. Changes in light source directly affect the colour of any object, and shading affects the intensity or saturation of a colour. A different viewpoint can also change the colours of the clothing: someone with a blue shirt might be wearing a red backpack. Seen from the front, the main colour is the blue of the shirt, but seen from the back, red is the dominant colour.
Therefore, a system that is going to be used in a real-world application for automatic processing of surveillance footage needs to be robust to these variations. Distinctive colours in clothing are expected to yield the best results and to be the most useful for detection. These colours are described most reliably by observers and are scarcer on the street than a colour like black. The method for finding the initial image of a person introduced in this paper deals with the described difficulties of the real world.

With these things in mind, this paper proposes and evaluates a method for finding the initial image using a colour-based approach. For this approach a good colour descriptor needs to be developed, which can then be used to search the surveillance data. To search an image using the colour description, the colour names first need to be learned. Once these colour names are known, the best matches for a colour are ranked. This shortens the time it takes to find an initial image compared to a manual search. The method would be an addition to existing tools for analyzing surveillance data.

In section 2 the related work will be discussed and the approach of the research will be described in section 3. Section 4 will elaborate on the experimental setup and the results will be presented in sections 5 and 6.

2. Related Work

Many current systems use computer vision and image processing techniques for automated person search in surveillance systems (Borsboom, 2011; Metternich et al., 2010; Gray & Tao, 2008; Gray et al., 2007). These systems get good results, but they are not suitable for finding the initial image. This section will first discuss what has been done on automated processing and then elaborate on how this can be expanded. Gray et al. (2007) tackle the problem of viewpoint invariance and have composed a challenging dataset for testing. However, they evaluate their method by recognition on this dataset.
Metternich et al. (2010) test their method on a more real-world situation and focus on post incident investigation.

Metternich et al. and Gray et al. both assume in their searches that a candidate person is available to start the search with. Such persons could be identified by using a textual description and searching on the given criteria. As said in the introduction, the most prominent feature on cameras is the colour of the clothes. Park et al. (2006) describe such a search engine for surveillance systems. They argue that this search engine should be able to narrow down the candidates for finding a person, and that a human operator should then choose the final target. The narrowing down can be done using low-level queries with primitive features ("Show all subjects wearing a blue shirt"). They do not, however, include a sophisticated way of defining the colours of the clothing and only use a hue histogram with ten bins, each bin representing one colour. In real-world situations the colour might not be so easily defined.

Van de Weijer et al. (2007) learn the colour names from real-world images instead. This approach seems more viable in a real-world setting, and they show that their method improves learning colour names significantly compared to a controlled setup, which is the general way to map colours. In such a controlled setup, multiple test subjects are asked to label hundreds of colour chips within a well-defined experimental setting (Van de Weijer et al., 2007). The colours are chosen from a preselected set of colour names. From this labelled set of colour chips the mapping from RGB values to colour names is derived. The real-world images they use are obtained from Google Images and labelled by way of text analysis. Van de Weijer et al. argue against learning colour names in a different setup than the one in which they are applied. In this case, therefore, the approach for colour naming from real-world images should be applied to images of persons instead of images from Google Images. This will be elaborated in the next section.

3. Approach

3.1. Method outline

This section proposes a method for getting from a description of a person to a list of candidates that match the description, based on colour. This project does not deal with the actual person detection in camera footage: the images used here have already undergone person detection, for which decent techniques are already available. There are three main steps in reporting back candidates that match a colour description (figure 1).

First, the colour in the images and the learned colours must be represented in a universal way. The colour representation should be invariant to location, viewpoint, illumination and quality of the images, because of the issues in real-world situations described in the introduction. Van de Sande et al. (2008) give an overview of colour features and their invariance properties. The hue histogram seems to be the most viable candidate, as it is invariant to light intensity change and shift. A drawback, however, is that the hue becomes unstable around the grey axis. Van de Weijer et al. (2006) apply an error analysis to the hue to resolve this: the hue histogram is made more robust by weighing each sample of the hue by its saturation. A different method for defining the colours black, grey and white is still needed, as this method does not discern them. The hue histogram is not invariant to light colour change and shift. This might not be a problem for finding a specific image with a specific colour: the focus is on the description of colours by humans, who might perceive colours differently under different lighting, in which case invariance to light colour change or shift is not wanted. To correct the limitation that occurs with achromatic and low-saturated colours, Grana et al. (2007) have proposed the Enhanced HSV Histogram for automatic annotation of video clips using HSV colour space analysis.
It has shown improvement on the quality of image retrieval (Borghesani et al., 2009). This way of constructing the histogram will be used for describing the colour space.
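The saturation weighting described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function name `weighted_hue_histogram`, the bin count of 16 and the pixel format (RGB tuples with channels in [0, 1]) are assumptions made for illustration only.

```python
import colorsys


def weighted_hue_histogram(pixels, n_bins=16):
    """Hue histogram in which every pixel votes with its saturation.

    pixels: iterable of (r, g, b) tuples with channels in [0, 1].
    n_bins: number of hue bins (16 is an illustrative choice).
    Returns a list of n_bins weights normalised to sum to 1, or all
    zeros when every pixel is achromatic.
    """
    hist = [0.0] * n_bins
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r, g, b)
        # Hue lies in [0, 1); the clamp guards against rounding up to n_bins.
        index = min(int(h * n_bins), n_bins - 1)
        # Grey pixels have s == 0 and therefore contribute nothing,
        # which is what makes the histogram stable near the grey axis.
        hist[index] += s
    total = sum(hist)
    return [w / total for w in hist] if total > 0 else hist


# One red, one grey and one blue pixel: only red and blue carry weight.
example = weighted_hue_histogram([(1, 0, 0), (0.5, 0.5, 0.5), (0, 0, 1)])
```

Because grey, black and white pixels receive a weight of zero, a histogram like this cannot describe them, which is the limitation the enhanced HSV histogram addresses.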

Figure 1: Steps needed to get from a colour description to match candidates (histogram definition, learning colour names, matching)

Secondly, the colour names need to be learned. Van de Weijer et al. (2007) used the 11 basic colour terms of the English language: black, blue, brown, grey, green, orange, pink, purple, red, white and yellow. These colour names will be used. Because learning colour names from real-world images performs better for detection in a real-world setting (Van de Weijer et al., 2007), the colour names will be learned from the viewpoint invariant pedestrian recognition (VIPeR) dataset (figure 2), which was composed by Gray et al. (2007). The dataset contains 632 pedestrian image pairs from arbitrary viewpoints. These images simulate a real-world situation because of the variations in viewpoint, illumination, location and image quality.

Figure 2: Example from VIPeR dataset

The colours of the clothes can be roughly divided into three parts: the head, the upper body and the lower body (Park et al., 2006). Here the focus lies solely on the upper body, but the method proposed in this paper can also be applied to the head or the lower body. To select only the upper body, a mask is used to filter out the background and the legs of the person. Borsboom (2011) showed that a simple mask is already very useful for filtering the background in person matching. The mask used here is a simple oval at the rough location of the upper body.

The last step in getting from a description of a person to possible candidates that suit the description is matching. The description of the person ("A person with a red jacket") can be represented in the query as a single colour (red in this case), as we focus solely on the upper body.
This query should return a list of possible candidates that match the description and the list should be ranked according to their match value, which represents the amount of similarity between the candidate and the representation of the colour in the query. The match value is generated by matching the histograms of every image in the dataset to the histogram that represents the colour in the query.
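The last two ideas of this outline, the oval mask and the ranked candidate list, can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: the ellipse centre and radii in `oval_mask` are invented defaults, the function names are hypothetical, and the match value is assumed to be the normalised histogram intersection of Swain and Ballard (1991) that section 3.4 names.

```python
def oval_mask(height, width, cy=0.35, cx=0.5, ry=0.2, rx=0.3):
    """Boolean mask (list of rows) that is True inside an ellipse.

    The centre (cy, cx) and radii (ry, rx) are fractions of the image
    size. The defaults are illustrative guesses for the upper body,
    not values taken from the thesis.
    """
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Normalised offset of the pixel centre from the ellipse centre.
            dy = ((y + 0.5) / height - cy) / ry
            dx = ((x + 0.5) / width - cx) / rx
            row.append(dx * dx + dy * dy <= 1.0)
        mask.append(row)
    return mask


def intersection(image_hist, colour_hist):
    """Histogram intersection, normalised by the colour (model) histogram."""
    total = sum(colour_hist)
    if total == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(image_hist, colour_hist)) / total


def rank_candidates(image_hists, colour_hist):
    """Return (name, match value) pairs sorted from best to worst match.

    image_hists: dict mapping an image name to its histogram.
    colour_hist: histogram representing the queried colour.
    """
    scored = [(name, intersection(hist, colour_hist))
              for name, hist in image_hists.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In use, only the pixels where the mask is True would contribute to an image histogram, and a query such as "red" would call `rank_candidates` with the histogram learned for red.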

3.2. Histogram

The histogram used to represent the colours in the individual images is the enhanced HSV histogram presented by Grana et al. (2007). A function of the H, S and V values yields the index of the bin; it is parameterised by the quantization levels devoted to every colour channel. Grana et al. propose adding bins to the histogram that contain all the achromatic and dark colours. These bins correspond to the grey levels, from black to white, and their number is chosen for convenience. The achromatic colours are selected by imposing a threshold on the S and V coordinates respectively. The value of this threshold has been set empirically: Grana et al. report a value that provides a good trade-off between colour loss and matching of similar dark or achromatic colours, and that value is used initially.

However, as opposed to matching, it is not necessary to subdivide the chromatic colours according to saturation or value, as the only description used is the colour itself. Such a subdivision would only be useful for describing a colour more precisely, for example as light green or dark blue. The achromatic and dark selection can then be used for the grey values from black up to white, and the chromatic region can be defined by hue alone. This forms the following rules for region selection (chromatic, achromatic or dark) and for creating the histogram.

Table 1. Rules for the HSV regions selection.

Condition        Region
S < threshold    achromatic
V < threshold    dark
otherwise        chromatic

The quantization levels of the histograms are fixed. A higher number of bins gives a more accurate representation of the colour, but as the colours of the images vary in detail within a colour group, the matching value would be lower. The key is thus to find a good balance between the accuracy of the representation and the matching performance.

3.3. Learning colour names

Learning the colour names will be done on the VIPeR dataset.
This dataset provides a good real-world simulation, as explained before, and has a reasonable variation in colours of the upper body. The images in the VIPeR dataset are not labelled, however, so it is necessary to research how people perceive the colours in the images and to label them accordingly. The setup of the questionnaire used for this is described in section 4. The images are then labelled according to the results of the questionnaire, and the histograms built from the labelled images define those colours.

3.4. Matching

When the 11 colour names have been learned and defined by histograms, these histograms are used to identify the colour in other images. For that, a histogram of the image is made and matched to the colour histograms using Histogram Intersection (Swain and Ballard, 1991). The ranking is done by calculating a match value:

$$ H(I, M) = \frac{\sum_{j=1}^{n} \min(I_j, M_j)}{\sum_{j=1}^{n} M_j} $$

Given a pair of histograms, $I$ and $M$, each containing $n$ bins, the intersection of the histograms is normalized to obtain a fractional match value between 0 and 1. This might not be an accurate matching, however, as it does not take into account the similarity between colours that are close to each other in the HSV colour space, nor the colour naming in the real world. To match the colour perception of humans it might also be necessary to rank according to the neighbouring colours. The match value can then be written as:

$$ R(I, M) = w_1 \, H(I, M) + w_2 \, H(N(I), N(M)) $$

Here, for a pair of histograms $I$ and $M$ containing $n$ bins, the rank is a weighted combination of the intersection between $I$ and $M$ and the intersection of the neighbours of $I$ and $M$, written $N(I)$ and $N(M)$. In general the first weight $w_1$ should be higher than $w_2$, as the exact match is more important. This way of obtaining the match value ensures that colours closer in the spectrum to the colour in the query receive a higher match value than colours that are further apart.

4. Experimental setup

4.1. Questionnaire

A number of volunteers were asked to name the main colour of the upper body of the persons in the images of the VIPeR dataset. Only the first set of the image pairs is used in the questionnaire. The questionnaire is shown in figure 3. It shows the image on the left, a drop-down list containing the 11 basic colour terms, and a text field for remarks. The subjects were asked to select, out of the 11 colour terms, the colour that best fits the upper body of the person in the image, and to write any remarks about the colour in the text field. Every test subject classified 50 images this way, selected from the 632 images (the first images of the image pairs) in the VIPeR dataset. Images that were judged by fewer than five different persons are not used for the labelling of colour names.

Figure 3: Questionnaire

4.2. Evaluation

Evaluation is needed to show that learning the 11 colour terms with this questionnaire works. The preferred evaluation is to check whether every image that is returned contains the colour asked for, and whether all images that contain that colour are returned. This reveals an inherent problem, however. As only images in the first set of the VIPeR dataset are labelled, the second set cannot be used for the evaluation.

It is, however, possible to evaluate the performance of predicting the colour name of a new image. This gives an indication of how well the colour names are learned. To test this, one of the labelled images is taken out of the dataset to be the image whose colour name is predicted. The colour names are trained from the remaining labelled images, and the colour assigned to the image is the one with the best match value. This is done for all labelled images. A high accuracy of the colour naming indicates that the colours are learned correctly.

Testing a query on the VIPeR dataset is done on the second images of the image pairs, which have been kept separate from the training set. Queries for different colours are run on this set. Firstly, it is important to see whether every image that is returned contains the colour asked for, and secondly whether all images that contain that colour are returned.

5. Results of the questionnaire

In total 63 sets of answers were returned, containing colour labels on 300 different images. After removing the images that received fewer than five answers, 250 remained. Of these, 77 got the same colour assigned by all test subjects. The images were categorized according to the majority of the answers: whenever one colour got 50% or more of the votes, the category of that image is that colour. After this categorization 31 images remained unlabelled. These are mainly images where the persons are wearing striped clothes, or where the colour is either very dark or very light and can be interpreted in different ways. Fewer than ten images were sorted into each of yellow, orange and purple, so these colours may be underrepresented. The result of the categorization can be found in Appendix A.

A few things stand out from the answers and the histograms of the colours made accordingly.
In the answers a few combinations of colours occur very often, and it seems that these are difficult to distinguish from each other when the colour has to be described in one of the 11 colour terms. For example, green and grey are seen in the same image by different people. The same holds for blue and any of black, purple, grey or green; for grey and either black or white; and for red and any of orange, pink or brown. The first thing that is noticeable in the trained histograms is that, for black and white, with the threshold chosen according to the results of previous research (Grana et al., 2007), some red and blue colours are still in these selections, as shown in figure 4. This suggests choosing a higher threshold value to represent the real world more accurately. However, this might cause the histograms of the other colours to lose their defining values as well. The effect of varying the threshold on precision and recall will be shown in section 6. The second thing that stands out is that the histogram of the images labelled green does not have a clear peak anywhere, but is spread out over the range between red and blue and has a lot of dark values. This could be a sign that the colour green will not be found accurately in a query.
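The categorization rule used in this section (at least five answers per image, and a colour needs 50% or more of the votes) can be sketched as below. The function name `label_image` is hypothetical, and the handling of an exact tie between the two most frequent colours is an assumption, as the text does not say what happens in that case.

```python
from collections import Counter


def label_image(votes, min_votes=5, majority=0.5):
    """Return the majority colour of an image, or None if it stays unlabelled.

    votes: list of colour names given by the test subjects.
    min_votes: images with fewer answers than this are not used.
    majority: share of the votes a colour needs (50% or more).
    """
    if len(votes) < min_votes:
        return None
    counts = Counter(votes).most_common(2)
    top_colour, top_count = counts[0]
    # Assumption: an exact tie between the two leading colours is
    # treated as unlabelled, since no single colour wins.
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    return top_colour if top_count / len(votes) >= majority else None


# Three of five subjects say red, so the image is categorized as red.
label = label_image(["red", "red", "red", "orange", "orange"])
```

An image with votes spread over red, orange, pink and brown would return None here, matching the unlabelled images described above.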

Figure 4: Histograms of colours labelled black (left), white (right) and green (below)

6. Results of testing

6.1. VIPeR dataset

The histograms resulting from the questionnaire show that the threshold value used might not be optimal for representing the way people perceive colours in the real world. Figure 5 shows this effect. With a high threshold value, the search returns images with more vivid colours and the error in the images found decreases. However, this also results in some images not being found.

Figure 5a: queries for purple and red

Figure 5b: queries for purple and red

Because of the differences in the results returned for the query, the evaluation on predicting the colour names is done for various threshold values between 0 and 1. The graphs in Appendix B show the complete results of predicting the colour names. The accuracy of naming all colours does not rise above 40%. This is not surprising, as it involves naming every colour, including the colours that are harder to name and that show great variation in human naming, such as black, white and brown. More interesting are the results for the more distinctive colours. Of these, pink shows the worst precision, which can be explained by the fact that pink is very close to red in terms of hue values. The other four colours, however, show a high precision for certain threshold levels. This shows that different threshold values should be chosen for detecting different colours in order to optimize the search results. Table 2 shows the optimal threshold value for each colour.

Table 2: Optimal threshold values per colour, based on precision and recall of the colour naming evaluation.

Colour    Threshold
Black     0.15
Blue      0.55
Brown     0.10
Green     0.10
Grey      0.30
Orange    0.50
Pink      0.40
Purple    0.55
Red       0.30
White     0.05
Yellow    0.25

Although some of these results are based on a very small number of images (pink, purple, orange and yellow), they still indicate that, for the distinctive colours, learning colour names from real-world images yields good results when these colour names are then used for querying. Appendix C gives the top 20 results for querying each colour name with the optimal threshold value. It is left to the reader to judge the quality of these results, as there is no good method for evaluating them apart from having several people annotate them.
It is possible to say, however, that the results indicate that the search for an image with an upper body of a certain colour (blue, orange, pink, purple or red) will be shorter than a manual search through the footage. Querying for black also returns satisfying results, but as this colour is not as descriptive as the others, it might be less useful. More research still has to be done to find which threshold value is best in particular cases. In a system like this one, user feedback may improve the results for the different tasks.
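The evaluation behind these numbers, predicting the colour name of each labelled image from the remaining ones and then computing precision and recall per colour, can be sketched as follows. Several details are assumptions and not taken from the thesis: a colour is represented by the average of the histograms labelled with it, the match value is the normalised histogram intersection, and all function names are hypothetical.

```python
from collections import defaultdict


def intersection(image_hist, colour_hist):
    """Histogram intersection, normalised by the colour histogram."""
    total = sum(colour_hist)
    if total == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(image_hist, colour_hist)) / total


def mean_histograms(samples):
    """Average the histograms per label (an assumed training rule)."""
    grouped = defaultdict(list)
    for hist, label in samples:
        grouped[label].append(hist)
    return {label: [sum(column) / len(hists) for column in zip(*hists)]
            for label, hists in grouped.items()}


def leave_one_out(samples):
    """Predict each sample's label from all other samples.

    samples: list of (histogram, label) pairs, at least two of them.
    Returns a list of (true label, predicted label) pairs.
    """
    results = []
    for i, (hist, label) in enumerate(samples):
        models = mean_histograms(samples[:i] + samples[i + 1:])
        predicted = max(models, key=lambda name: intersection(hist, models[name]))
        results.append((label, predicted))
    return results


def precision_recall(results, colour):
    """Precision and recall of predicting one colour name."""
    hits = sum(1 for true, pred in results if true == colour and pred == colour)
    predicted = sum(1 for _true, pred in results if pred == colour)
    actual = sum(1 for true, _pred in results if true == colour)
    precision = hits / predicted if predicted else 0.0
    recall = hits / actual if actual else 0.0
    return precision, recall
```

Repeating this procedure with histograms built for different threshold values, and keeping the value with the best precision and recall per colour, would give a table of the same shape as Table 2.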

6.2. Real-world dataset

It is now interesting to see how the system performs in a real-world surveillance setting. For this, the recordings used by Metternich et al. (2010) serve as test material. This footage was recorded with the assistance of the Dutch police. Metternich et al. obtained a ground truth by manually labelling the positions of four persons who were asked to walk around in the area under surveillance. These four persons are shown in figure 6. The images used here are the results of their application of the person detection system. The drawback of those images, however, is that they contain a lot of false positives for persons. For these four persons, the queries to shorten the search time for initial images are red, grey, black and red. The previous observations show that a good result can be expected only for the fourth image. The results of the queries for red on footage from four different cameras are shown in Appendix D.

Figure 6: Ground-truth in real-world dataset

The results clearly show that the number of false positives is quite large. The fourth image in the results for the first track, however, already shows the person we were looking for with this query. The results on the second track are not as good as those on the first, and our ground-truth image is not found in the first 50 images, but a lot of persons with red upper-body clothing are still shown. To improve these results, a better person detection method should be used on the surveillance data. Moreover, as the search uses only a simple static mask and not every person is in the middle of the image, not every person is evaluated on the colour of the upper body. The results for tracks 3 and 4 contain far fewer false positives for persons. In the results for track 3, the ground-truth image we were looking for appears seven times in the top 50 images (images 17, 18, 20, 21, 24, 25 and 46).
In track 4 the ground truth is found five times in the top 50 images (images 12, 27, 34, 38 and 43). These results are a good indication of the potential of the method: when it is used, the search for an initial image should take a lot less time on average, even though the person detection method is not very accurate yet.

7. Discussion

There is still room for improvement in the proposed method. For example, the questionnaire was filled in on different computers with different monitors in order to reach a large group of volunteers. This causes some unwanted variation in the colours shown to the test group. To obtain the most accurate mapping, labelling should be done before the persons are recorded on camera. Such labelling ensures that the description used to define the colour histograms matches the description of the colours of the person that is searched for. A second problem was the amount of data for some of the distinctive colours. Ideally every colour would be defined by the same number of images. The last point is the need for a better evaluation method. Because of a lack of descriptions in the evaluation set, it was not possible to do a comprehensive analysis of the results of the queries.

A next step in using descriptions of persons to find initial images could be to research whether it is possible to define more colours than the 11 used here. These could be descriptions of saturation, or refinements of the spectrum such as cyan or beige. Besides that, pattern recognition for a stripe or dot pattern on a shirt could be a useful addition to the detection. Other features of a person could also be investigated.

8. Conclusion

In this paper a method has been proposed for using a colour description of the upper body to find a person in surveillance footage. Training was done on the first set of images in the VIPeR dataset, which was labelled by a group of volunteers, and testing was done on the second set of images in the VIPeR dataset and on a real-world surveillance dataset. The results show that the proposed method has a high precision for distinctive colours but, as expected, is not very useful for the more achromatic and dark colours. The results further show that the optimal threshold value differs between colours. For real-world application, using a description of a person proves a promising solution for finding initial images in surveillance data.

Literature

Borghesani, D.; Grana, C.; Cucchiara, R. (2009) Color features performance comparison for image retrieval. 15th International Conference on Image Analysis and Processing, pp. 902-910.
Borsboom, S. (2011) Person Matching under Large Changes in Viewpoint and Lighting. M.Sc. Thesis Artificial Intelligence, University of Amsterdam.
Grana, C.; Vezzani, R.; Cucchiara, R. (2007) Enhancing HSV Histograms with Achromatic Points Detection for Video Retrieval. Proceedings of ACM International Conference on Image and Video Retrieval (CIVR), pp. 302-308.
Gray, D.; Brennan, S.; Tao, H. (2007) Evaluating Appearance Models for Recognition, Reacquisition, and Tracking. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS).
Metternich, M.J.; Worring, M.; Smeulders, A.W.M. (2010) Color based tracing in real-life surveillance data. Transactions on Data Hiding and Multimedia Security V, LNCS 6010, pp. 18-33.
Park, U.; Jain, A.; Kitahara, I.; Kogure, K.; Hagita, N. (2006) ViSE: Visual Search Engine Using Multiple Networked Cameras. Pattern Recognition, IEEE International, pp. 1204-1207.
Sande, K.E.A. van de; Gevers, T.; Snoek, C.G.M. (2008) A comparison of color features for visual concept classification. CIVR '08: Proceedings of the 2008 international conference on Content-based image and video retrieval, pp. 141-150.
Swain, M.J.; Ballard, D.H. (1991) Color Indexing. International Journal of Computer Vision, Vol. 7-1, pp. 11-32.
Weijer, J. van de; Schmid, C.; Verbeek, J. (2007) Learning Color Names from Real-World Images. IEEE Conference on Computer Vision & Pattern Recognition.

Appendix A: Categorized images

(Image panels per category: Black, Blue, Grey, Green, Orange, Pink, Purple, Red, White, Yellow, Unlabelled)

Appendix B: Precision and recall on predicting colour names

(Graph panels: Total, Black, Blue, Brown, Green, Grey, Orange, Pink, Purple, Red, Yellow, White)

Appendix C: Top 20 returns for colour queries on the VIPeR dataset

(Image panels per query: Black, Blue, Brown, Green, Grey, Orange, Pink, Purple, Red, White, Yellow. For purple, no more than 7 reliable results were returned.)

Appendix D: Top returns for colour queries on real-world surveillance data

(Image panels: Camera 1, Camera 2, Camera 3, Camera 4)