Evaluating the stability of SIFT keypoints across cameras
Max Van Kleek
Agent-based Intelligent Reactive Environments
MIT CSAIL

ABSTRACT

Object identification using Scale-Invariant Feature Transform (SIFT) keypoints requires stable keypoint signatures that can be reliably reproduced across variations in lighting, viewing angle, and camera imaging characteristics. While reproducibility under various lighting conditions and viewing angles has been demonstrated elsewhere, this work performs an empirical evaluation of the reproducibility of SIFT keypoints, and the stability of their corresponding feature signatures, across a variety of camera configurations within a controlled lighting and scene arrangement.

INTRODUCTION

The general problem of object recognition has been an important goal of the machine vision research field since its inception. Over the past decade, the progressive development of lower-cost, higher-performing computers that require less power and occupy less physical space, along with advances that have yielded lower-cost and smaller cameras, has made it feasible to implement vision-based systems and embed them nearly everywhere, providing people practical assistance on a wider variety of tasks than ever previously feasible. At the same time, these applications have placed new demands on vision algorithms: to perform robustly, and to recognize a larger variety of objects in a wider variety of lighting and scene conditions with the same or greater reliability than previously demonstrated, using a greater variety of imaging technologies. Similar exponential advances in communications technology, in particular low-cost, high-bandwidth wireless networking, have made it possible for applications spanning multiple physically distributed computers or distributed resources to more easily coordinate and solve particular tasks.
As these applications become increasingly pervasive, a certain robustness to variations in imaging devices will also become a necessity. For example, vision-enabled applications intended for home use or for mobile handsets will need to use whatever commodity image-capture hardware is already available on users' PCs or cell phones, respectively. Likewise, distributed vision applications may need to pull image data from multiple heterogeneous imaging devices. This paper presents an evaluation of a popular object recognition technique across a number of popular commodity image-capture devices that are readily available today.

BACKGROUND

One particular technique that has recently gained attention, due to successful demonstrations of recognizing both object instances [] and object classes [], is the use of robust local feature descriptors. Techniques involving local features first identify a small set of salient keypoints that are likely to capture interesting information about an object in an image, analyze the statistics around these keypoints, and associate the resulting set with a particular object. The object may then be identified in any new image by locating and matching up keypoints corresponding to those previously associated with objects. In particular, the Scale-Invariant Feature Transform (SIFT), proposed by David Lowe, selects candidate keypoints by searching images at multiple scales for points that are likely to be highly localizable, and then labels each keypoint with a signature derived from the gradients around the keypoint. By associating sets of these keypoints and accompanying signatures with an object, the object can later be identified by merely finding the same corresponding set of keypoints and features in a new image.
The single most important characteristic of these keypoints and signatures for reliable object recognition is their reproducibility across the image variations that are likely to affect how an object is perceived. That is, the criterion for choosing a particular keypoint should relate to how likely it is to be detected and consistently identified in future images under variations in lighting, viewing angle and object orientation, lens distortion, and image noise. Lowe empirically analyzed the sensitivity of SIFT keypoints to object rotation, particularly off-axis to the image plane, in a recent paper []. Mikolajczyk et al. examined SIFT keypoint performance across changes in lighting direction and intensity []. However, little prior work has examined keypoint reproducibility across variations in cameras, whether in physical imaging characteristics (such as lens, aperture, and imager configuration) or in performance characteristics of the imager (such as sensitivity, resolution, and noise level). This work attempts to provide initial insight into SIFT performance across a range
of cameras through a series of experiments.

Table. Cameras used and their specifications:
Model | Detector Type | Interface and Price
Logitech QuickCam Express | CMOS, 5x88 | USB, -bit color ($5 USD)
Logitech QuickCam Pro | CCD, 6x8 | USB, -bit color ($5 USD)
Sony EVI-D | CCD, NTSC | S-Video, digitized using a Pinnacle Micro DV capture device ($959 USD)
Nikon Coolpix 99 | CCD, 8x56 | Digital still camera ($5 USD)

EXPERIMENT

Setup

To determine the reproducibility of SIFT keypoints and their associated signatures across variations in camera type, four cameras were selected that varied widely in type, approximate market price, and specifications. As can be seen in Table, the cameras ranged from two widely available low-end USB webcams, to an analog S-Video NTSC video camera digitized into DV, to a high-end consumer-grade digital still camera. Images were either captured directly at x resolution (for the webcams), or downsampled after being captured at the device's native resolution. Each camera was mounted on a tripod in an identical fashion at a fixed location in a room. Five incandescent lights were placed behind the camera, which was pointed at the subject at a distance of feet (for the human subject) or feet (for the toy robot). Pictures were taken against a white wall in the laboratory. For each of the four cameras and the two conditions (human and toy robot), 6 images were taken: of the background without a subject (for background subtraction), with the subject/object facing front in the center of the image, and with the subject/object facing approximately 5 and 5 degrees away from frontal-parallel to the left of the camera, and approximately 5 and 5 degrees away from frontal-parallel to the right of the camera, respectively. Samples of the frontal-parallel pose for each camera are shown in Figure.
Procedure

Prior to SIFT keypoint detection, images from each camera were batch-loaded and converted to greyscale by averaging the red, green, and blue intensities of each pixel. Background images were separated from the rest. Contrast stretching was then performed uniformly across the whole set, by taking the minimum and maximum pixel intensities across all images and scaling the intermediate values to occupy the whole range.

Figure. Front pose for human subject using (clockwise from the top-left): QuickCam Express, QuickCam Pro, Nikon Coolpix, and Sony NTSC.
Figure. Background elimination (left); mask generated by dilation with a disk of r = 5 pixels (right).

Background elimination

To compute a background model, the mean and variance of each pixel were computed across the background images. To segment foreground from background, each pixel in the remaining images of the set was then compared against a normal distribution centered at the corresponding mean and with the corresponding variance of that pixel in the model. Pixels that were less likely than a threshold value (ɛ = .) under this model were labeled as foreground; otherwise the pixel was labeled as background. An example of running this algorithm on an image can be seen in Figure. As can be seen in the image on the left, our background elimination procedure occasionally left holes where the foreground approached the color of the white wall. To fill in these occasional mistakes, the mask was dilated with a disk structuring element of radius 5 pixels. This effectively completed the regions for all of our test images, as can be seen in the resulting mask on the right.

Keypoint identification

Candidate keypoints were selected by searching a difference-of-Gaussian (DoG) pyramid an octave at a time, selecting local maxima with an intensity above a threshold (t = .8).
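The per-pixel background model and mask dilation described above can be sketched in NumPy as follows (the function names and the ɛ default are illustrative, not the paper's; the wrap-around dilation via np.roll is a simplification that a real implementation would replace with padded shifts):

```python
import numpy as np

def background_mask(bg_stack, img, eps=1e-3):
    """Per-pixel Gaussian background model: a pixel is foreground when its
    likelihood under N(mean, var) of the background images falls below eps.
    bg_stack: (K, H, W) stack of background frames; img: (H, W)."""
    mu = bg_stack.mean(axis=0)
    var = bg_stack.var(axis=0) + 1e-6          # avoid division by zero
    lik = np.exp(-0.5 * (img - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return lik < eps                           # True = foreground

def dilate(mask, r=5):
    """Dilate a boolean mask with a disk structuring element of radius r,
    by OR-ing the mask shifted over every offset inside the disk.
    Note: np.roll wraps at the borders (fine for a sketch)."""
    out = np.zeros_like(mask)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy * dy + dx * dx <= r * r:
                out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out
```

Dilating with a disk is what fills the holes left where the foreground matched the white wall, at the cost of slightly growing the foreground region.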
To filter out edges, the Hessian was computed from image gradients at each candidate keypoint, and the candidate was rejected if the ratio of the eigenvalues of the Hessian was too large (R = ). Candidates that fell outside the foreground mask (computed
from the background model, above) were also eliminated, to preserve only keypoints pertaining to the foreground.

SIFT feature extraction

SIFT features were computed by first finding the keypoint orientation, defined as the dominant gradient over a window (size = 6x6) centered at the keypoint. All possible gradient orientations were discretized into a set of N orientations (N = 6). The gradient magnitudes of pixels within the window were weighted by a 2D Gaussian of covariance σ (where σ was assigned half the width of the descriptor window), centered over the keypoint, and summed. The orientation assigned to the keypoint was the direction of this overall weighted sum, discretized to the nearest bin. After the orientation was assigned for a particular keypoint, that keypoint's window was divided into q equally sized regions. A weighted histogram count of orientations was performed for each of these regions separately, taking the keypoint orientation as the reference orientation. The resulting histogram counts were then concatenated to form the SIFT feature vector of the keypoint: a single large vector of counts of size N·q.

Keypoint matching

After the keypoints and feature vectors were computed for two images, we determined how many keypoints were reproduced from one image to the next by matching SIFT feature vectors between the images. For each keypoint in the first image, A, we found the SIFT keypoint in the second image, B, whose feature vector was most similar (in the Euclidean-distance sense) to that of the keypoint in A, and assigned B's label to the keypoint in A. These keypoints were then labeled and plotted with their associated histograms, and compared visually, as can be seen in Figure. To see how keypoints in B corresponded to those back in A, the process was then repeated in the reverse direction, taking every keypoint in B and finding its nearest neighbor in A.
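The feature extraction and matching steps above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the window is a single patch, the sub-regions are laid out as a q x q grid, and the default bin counts are illustrative.

```python
import numpy as np

def sift_like_descriptor(patch, n_bins=8, q=2):
    """Simplified SIFT-style descriptor for one keypoint window: find the
    dominant Gaussian-weighted gradient orientation, then build one
    orientation histogram per sub-region, expressed relative to that
    dominant orientation, and concatenate them (length n_bins * q * q)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = w / 2.0                      # half the descriptor window width
    gauss = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2))
    wts = mag * gauss                    # Gaussian-weighted gradient magnitudes
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    # dominant orientation: weighted vote over the whole window
    dom = np.bincount(bins.ravel(), weights=wts.ravel(), minlength=n_bins).argmax()
    rel = (bins - dom) % n_bins          # orientations relative to the dominant one
    feats = []
    for rows in np.array_split(np.arange(h), q):
        for cols in np.array_split(np.arange(w), q):
            feats.append(np.bincount(rel[np.ix_(rows, cols)].ravel(),
                                     weights=wts[np.ix_(rows, cols)].ravel(),
                                     minlength=n_bins))
    return np.concatenate(feats)

def match(desc_a, desc_b):
    """For each feature vector in A (rows of desc_a), return the index of
    the nearest feature vector in B under Euclidean distance."""
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Expressing the per-region histograms relative to the dominant orientation is what gives the descriptor its rotation invariance; the matching step is exactly the nearest-neighbor label assignment described above, run in each direction.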
RESULTS

In order to evaluate the performance of our version of the algorithm, it was run on pairs of images (which we will call A and B) from our set. Each pair produced four images: keypoints in A with their original labels, keypoints in A labeled with their closest counterparts from B, keypoints in B with their original labels, and keypoints in B labeled with their closest counterparts from A. The keypoints in A with their original labelings were then compared with the keypoints in B labeled with their closest counterparts in A, and vice versa. This output was manually analyzed for five different counts: (1) the number of keypoints detected in A, (2) the number of keypoints detected in B, (3) the intersection of the keypoints detected in A with those detected in B, (4) the number of keypoints from A that were detected in B and correctly labeled, and (5) the number of keypoints from B that were detected in A and correctly labeled. These correspond to columns -7 of Table. To establish a baseline for how well our algorithm functioned, we first studied the reproducibility of keypoints among pairs of images taken with the same camera. The results for each camera can be seen in the top half of Table. We then studied cross-camera reproducibility for certain combinations of cameras, as shown in the second section of the table. The experiments were then repeated with a larger number of gradient bins, this time N = 6, yielding a total SIFT vector size of 8. Results from this second set of experiments are available in Table. The off-axis rotation images (of both 5 and 5 degrees) yielded no keypoint reproducibility, so labeling performance could not be compared. Due to time constraints, further analysis of the off-axis images was left for future work.

DISCUSSION

We must be careful about drawing any broad conclusions about the performance of SIFT from the results of our experiment, for a number of reasons.
First, due to time and resource constraints, the number of images we were able to experiment with per camera was extremely small. Second, we used only two subjects: a human form wearing black clothing, whose posture was slightly different each time, and a bright plastic toy robot, which remained perfectly rigid and motionless between conditions. This is not, by any means, representative of the broad potential uses of local feature descriptors. Third, there are a large number of parameters to set in the SIFT algorithm. Although we attempted to set all parameters to sensible values, either by using the same values described by Lowe or by guessing, we were not able, due to time constraints, to tune each parameter to explore how performance would be affected. These parameters include the keypoint window size, the Gaussian weighting covariance, the number of gradient bins, the DoG pyramid step size, the number of DoG pyramid levels, the edge detection threshold, and the initial image blurring amount. Finally, as will be described under future work, we did not have time to evaluate several of the enhancements to the algorithm suggested by Lowe. However, if we compare the performance of our naive implementation of SIFT across our experimental conditions, we may come up with several hypotheses that could serve as useful predictive heuristics for practically implementing object recognition systems using SIFT. Our first observation was that keypoint detection and label matching performed very differently. Label matching was consistently higher (measured as the number of matches divided by the number of keypoints reproduced) in same-camera experiments than between different cameras. However, this trend was not observed for keypoint reproduction alone, which worked approximately equally well in the within-camera and between-camera conditions.
In fact, particularly for the series of experiments with N = 6, between-camera keypoint reproduction was generally higher than within-camera reproduction! However, since the number of images is so small, this is not a significant result. On the within-camera human task with N =, an average of 9% of keypoints were reproduced within camera, and out of those reproduced keypoints, on average 8% were recovered. Between cameras, 9% of the original keypoints were recovered, and out of those an average of % were properly labeled. For N = 6, within-camera matching yielded 9.% correct assignment, whereas between-camera matching was significantly worse, at .%. In the same-camera robot condition, with N = 6, an average of 6.% of keypoints were recovered, and out of those 7% were properly labeled. However, in the between-camera task, reproducibility was much lower at %, out of which an average of only 6% were properly labeled.

Figure. Output of algorithm comparing Nikon to Sony NTSC, showing keypoints of A, matched keypoints from A into B, and associated SIFT histograms (N = 6).
Figure. Toy robot detection results comparing the Sony NTSC camera and matched keypoints from the QuickCam Pro.
Table. Correspondence results for Human task, N = bins/quadrant (6 element SIFT). Columns: image A device and image, image B device and image, keypoints in A, keypoints in B, keypoints in both A and B, keypoints from A correctly labeled in B, keypoints from B correctly labeled in A.
Table. Correspondence results for Human task, N = 6 bins/quadrant (8 element SIFT).
Table. Correspondence results for Robot task, N = 6 bins/quadrant (8 element SIFT).

The differences between the human and robot experiments are likely due to a number of factors. First, since the robot was completely rigid, unlike the human, its appearance changed little between shots with the same camera. This likely explains the extremely high reproducibility rate and labeling accuracy in the within-camera experiment. It was somewhat perplexing that neither the labelings nor the reproducibility rate between the identical-looking shots from the same camera was %, as we might expect. Whether this was due to an imperceptible shifting of the camera, or some similarly imperceptible difference in how the camera imaged the two instances, there was no way to tell within our experiment. Between cameras, however, we saw a decrease in both reproducibility and labeling compared to the human condition. There are several explanations for why labeling is more challenging in the robot condition. First, there are many more keypoints, so it is inherently more challenging (less likely) to randomly choose the right keypoint. More significant, though, is the effect of an abundance of keypoints with similar statistics: many keypoints correspond to locations on the robot that look like other locations, such as the arms and the knees.
Another difference between the robot task and the human task was that the robot task was significantly positively impacted by changing the number of gradient bins from N = to N = 6. (There was no correct labeling on the robot task with N =.) By contrast, performance on reproducing keypoints for the human subject fared slightly worse with the larger N.

FUTURE WORK

The simple implementation of SIFT we used for these experiments did not contain any of the keypoint management recommended by Lowe []. Namely, for each keypoint detected, this implementation naively created a new SIFT keypoint with an accompanying signature, without comparing the similarity of the new keypoint to existing keypoints in our set. For objects that are likely to have recurring, similar-looking regions (for example, the robot's knees), these keypoints should all be assigned the same label. Since the robot had many regions of similar appearance, this is the most likely cause of the low reproducibility observed in the between-camera robot condition. This sort of merging could be done during detection, or by clustering the SIFT vectors post hoc, using an algorithm such as k-means. The other recommendation made by Lowe dealt with choosing the primary orientation of the keypoint. If the initial weighted voting for the primary orientation was close to a tie, Lowe's improvement creates multiple keypoints with each of the orientations that yielded the tie. This indeed seems like a wise choice, to reduce the susceptibility of the choice of keypoint direction to noise.

CONCLUSION

This study demonstrated that although SIFT keypoint reproducibility and signature correspondence were sensitive to camera variations, correspondence was still possible. Due to the generally low reproducibility of keypoints, it may be necessary for applications to build robust models and rely on many redundant keypoints.
Lowe has demonstrated that object recognition for certain applications may be possible by identifying as few as keypoints in an image. Another interesting result was that there was no clear winner with respect to which single camera performed best in our experiment. The CMOS-sensor-based camera, the QuickCam Express, consistently underperformed the rest, which were CCD-based cameras. Since most digital cameras embedded in mobile devices are CMOS-based, this may have important implications for the use of SIFT-based vision algorithms on such devices.
ACKNOWLEDGEMENTS

I would like to thank Bill Freeman, Xiaoxu Ma, and my colleagues for an extremely fun and memorable class.

RESOURCES

The code and data set for this paper may be downloaded at emax/6.869/scift. Please contact the author with any questions or comments.

REFERENCES

1. S. Helmer and D. G. Lowe. Object recognition with many local features. Workshop on Generative Model Based Vision, 2004.
2. D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
3. K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. Computer Vision and Pattern Recognition, 2003.
More informationReal- Time Computer Vision and Robotics Using Analog VLSI Circuits
750 Koch, Bair, Harris, Horiuchi, Hsu and Luo Real- Time Computer Vision and Robotics Using Analog VLSI Circuits Christof Koch Wyeth Bair John. Harris Timothy Horiuchi Andrew Hsu Jin Luo Computation and
More informationMIT CSAIL Advances in Computer Vision Fall Problem Set 6: Anaglyph Camera Obscura
MIT CSAIL 6.869 Advances in Computer Vision Fall 2013 Problem Set 6: Anaglyph Camera Obscura Posted: Tuesday, October 8, 2013 Due: Thursday, October 17, 2013 You should submit a hard copy of your work
More informationForensic Framework. Attributing and Authenticating Evidence. Forensic Framework. Attribution. Forensic source identification
Attributing and Authenticating Evidence Forensic Framework Collection Identify and collect digital evidence selective acquisition? cloud storage? Generate data subset for examination? Examination of evidence
More informationImaging Particle Analysis: The Importance of Image Quality
Imaging Particle Analysis: The Importance of Image Quality Lew Brown Technical Director Fluid Imaging Technologies, Inc. Abstract: Imaging particle analysis systems can derive much more information about
More informationDigital Image Processing
Digital Image Processing Part 2: Image Enhancement Digital Image Processing Course Introduction in the Spatial Domain Lecture AASS Learning Systems Lab, Teknik Room T26 achim.lilienthal@tech.oru.se Course
More informationMethod to acquire regions of fruit, branch and leaf from image of red apple in orchard
Modern Physics Letters B Vol. 31, Nos. 19 21 (2017) 1740039 (7 pages) c World Scientific Publishing Company DOI: 10.1142/S0217984917400395 Method to acquire regions of fruit, branch and leaf from image
More informationComparing Computer-predicted Fixations to Human Gaze
Comparing Computer-predicted Fixations to Human Gaze Yanxiang Wu School of Computing Clemson University yanxiaw@clemson.edu Andrew T Duchowski School of Computing Clemson University andrewd@cs.clemson.edu
More informationOptical Performance of Nikon F-Mount Lenses. Landon Carter May 11, Measurement and Instrumentation
Optical Performance of Nikon F-Mount Lenses Landon Carter May 11, 2016 2.671 Measurement and Instrumentation Abstract In photographic systems, lenses are one of the most important pieces of the system
More informationBiometrics Final Project Report
Andres Uribe au2158 Introduction Biometrics Final Project Report Coin Counter The main objective for the project was to build a program that could count the coins money value in a picture. The work was
More informationInternational Journal of Innovative Research in Engineering Science and Technology APRIL 2018 ISSN X
HIGH DYNAMIC RANGE OF MULTISPECTRAL ACQUISITION USING SPATIAL IMAGES 1 M.Kavitha, M.Tech., 2 N.Kannan, M.E., and 3 S.Dharanya, M.E., 1 Assistant Professor/ CSE, Dhirajlal Gandhi College of Technology,
More informationA Spatial Mean and Median Filter For Noise Removal in Digital Images
A Spatial Mean and Median Filter For Noise Removal in Digital Images N.Rajesh Kumar 1, J.Uday Kumar 2 Associate Professor, Dept. of ECE, Jaya Prakash Narayan College of Engineering, Mahabubnagar, Telangana,
More informationAn Autonomous Vehicle Navigation System using Panoramic Machine Vision Techniques
An Autonomous Vehicle Navigation System using Panoramic Machine Vision Techniques Kevin Rushant, Department of Computer Science, University of Sheffield, GB. email: krusha@dcs.shef.ac.uk Libor Spacek,
More informationAutomatics Vehicle License Plate Recognition using MATLAB
Automatics Vehicle License Plate Recognition using MATLAB Alhamzawi Hussein Ali mezher Faculty of Informatics/University of Debrecen Kassai ut 26, 4028 Debrecen, Hungary. Abstract - The objective of this
More informationFinding people in repeated shots of the same scene
Finding people in repeated shots of the same scene Josef Sivic C. Lawrence Zitnick Richard Szeliski University of Oxford Microsoft Research Abstract The goal of this work is to find all occurrences of
More informationWavelet-based Image Splicing Forgery Detection
Wavelet-based Image Splicing Forgery Detection 1 Tulsi Thakur M.Tech (CSE) Student, Department of Computer Technology, basiltulsi@gmail.com 2 Dr. Kavita Singh Head & Associate Professor, Department of
More informationComputer Vision Based Chess Playing Capabilities for the Baxter Humanoid Robot
International Conference on Control, Robotics, and Automation 2016 Computer Vision Based Chess Playing Capabilities for the Baxter Humanoid Robot Andrew Tzer-Yeu Chen, Kevin I-Kai Wang {andrew.chen, kevin.wang}@auckland.ac.nz
More informationPhotographing Long Scenes with Multiviewpoint
Photographing Long Scenes with Multiviewpoint Panoramas A. Agarwala, M. Agrawala, M. Cohen, D. Salesin, R. Szeliski Presenter: Stacy Hsueh Discussant: VasilyVolkov Motivation Want an image that shows an
More information37 Game Theory. Bebe b1 b2 b3. a Abe a a A Two-Person Zero-Sum Game
37 Game Theory Game theory is one of the most interesting topics of discrete mathematics. The principal theorem of game theory is sublime and wonderful. We will merely assume this theorem and use it to
More informationA Real Time Static & Dynamic Hand Gesture Recognition System
International Journal of Engineering Inventions e-issn: 2278-7461, p-issn: 2319-6491 Volume 4, Issue 12 [Aug. 2015] PP: 93-98 A Real Time Static & Dynamic Hand Gesture Recognition System N. Subhash Chandra
More informationNON UNIFORM BACKGROUND REMOVAL FOR PARTICLE ANALYSIS BASED ON MORPHOLOGICAL STRUCTURING ELEMENT:
IJCE January-June 2012, Volume 4, Number 1 pp. 59 67 NON UNIFORM BACKGROUND REMOVAL FOR PARTICLE ANALYSIS BASED ON MORPHOLOGICAL STRUCTURING ELEMENT: A COMPARATIVE STUDY Prabhdeep Singh1 & A. K. Garg2
More informationGraz University of Technology (Austria)
Graz University of Technology (Austria) I am in charge of the Vision Based Measurement Group at Graz University of Technology. The research group is focused on two main areas: Object Category Recognition
More informationNonuniform multi level crossing for signal reconstruction
6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven
More informationDigital Image Processing 3/e
Laboratory Projects for Digital Image Processing 3/e by Gonzalez and Woods 2008 Prentice Hall Upper Saddle River, NJ 07458 USA www.imageprocessingplace.com The following sample laboratory projects are
More informationIncuCyte ZOOM Fluorescent Processing Overview
IncuCyte ZOOM Fluorescent Processing Overview The IncuCyte ZOOM offers users the ability to acquire HD phase as well as dual wavelength fluorescent images of living cells producing multiplexed data that
More informationUM-Based Image Enhancement in Low-Light Situations
UM-Based Image Enhancement in Low-Light Situations SHWU-HUEY YEN * CHUN-HSIEN LIN HWEI-JEN LIN JUI-CHEN CHIEN Department of Computer Science and Information Engineering Tamkang University, 151 Ying-chuan
More informationRELEASING APERTURE FILTER CONSTRAINTS
RELEASING APERTURE FILTER CONSTRAINTS Jakub Chlapinski 1, Stephen Marshall 2 1 Department of Microelectronics and Computer Science, Technical University of Lodz, ul. Zeromskiego 116, 90-924 Lodz, Poland
More informationDESIGN & DEVELOPMENT OF COLOR MATCHING ALGORITHM FOR IMAGE RETRIEVAL USING HISTOGRAM AND SEGMENTATION TECHNIQUES
International Journal of Information Technology and Knowledge Management July-December 2011, Volume 4, No. 2, pp. 585-589 DESIGN & DEVELOPMENT OF COLOR MATCHING ALGORITHM FOR IMAGE RETRIEVAL USING HISTOGRAM
More informationReliable Classification of Partially Occluded Coins
Reliable Classification of Partially Occluded Coins e-mail: L.J.P. van der Maaten P.J. Boon MICC, Universiteit Maastricht P.O. Box 616, 6200 MD Maastricht, The Netherlands telephone: (+31)43-3883901 fax:
More informationA Sorting Image Sensor: An Example of Massively Parallel Intensity to Time Processing for Low Latency Computational Sensors
Proceedings of the 1996 IEEE International Conference on Robotics and Automation Minneapolis, Minnesota April 1996 A Sorting Image Sensor: An Example of Massively Parallel Intensity to Time Processing
More informationTiny ImageNet Challenge Investigating the Scaling of Inception Layers for Reduced Scale Classification Problems
Tiny ImageNet Challenge Investigating the Scaling of Inception Layers for Reduced Scale Classification Problems Emeric Stéphane Boigné eboigne@stanford.edu Jan Felix Heyse heyse@stanford.edu Abstract Scaling
More informationMaine Day in May. 54 Chapter 2: Painterly Techniques for Non-Painters
Maine Day in May 54 Chapter 2: Painterly Techniques for Non-Painters Simplifying a Photograph to Achieve a Hand-Rendered Result Excerpted from Beyond Digital Photography: Transforming Photos into Fine
More informationEC-433 Digital Image Processing
EC-433 Digital Image Processing Lecture 2 Digital Image Fundamentals Dr. Arslan Shaukat 1 Fundamental Steps in DIP Image Acquisition An image is captured by a sensor (such as a monochrome or color TV camera)
More informationMICA at ImageClef 2013 Plant Identification Task
MICA at ImageClef 2013 Plant Identification Task Thi-Lan LE, Ngoc-Hai PHAM International Research Institute MICA UMI2954 HUST Thi-Lan.LE@mica.edu.vn, Ngoc-Hai.Pham@mica.edu.vn I. Introduction In the framework
More informationThe ultimate camera. Computational Photography. Creating the ultimate camera. The ultimate camera. What does it do?
Computational Photography The ultimate camera What does it do? Image from Durand & Freeman s MIT Course on Computational Photography Today s reading Szeliski Chapter 9 The ultimate camera Infinite resolution
More informationA Study of Slanted-Edge MTF Stability and Repeatability
A Study of Slanted-Edge MTF Stability and Repeatability Jackson K.M. Roland Imatest LLC, 2995 Wilderness Place Suite 103, Boulder, CO, USA ABSTRACT The slanted-edge method of measuring the spatial frequency
More informationImage Stabilization System on a Camera Module with Image Composition
Image Stabilization System on a Camera Module with Image Composition Yu-Mau Lin, Chiou-Shann Fuh Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan,
More informationAutomatic Locating the Centromere on Human Chromosome Pictures
Automatic Locating the Centromere on Human Chromosome Pictures M. Moradi Electrical and Computer Engineering Department, Faculty of Engineering, University of Tehran, Tehran, Iran moradi@iranbme.net S.
More informationManifesting a Blackboard Image Restore and Mosaic using Multifeature Registration Algorithm
Manifesting a Blackboard Image Restore and Mosaic using Multifeature Registration Algorithm Priyanka Virendrasinh Jadeja 1, Dr. Dhaval R. Bhojani 2 1 Department of Electronics and Communication Engineering,
More informationCHAPTER 4 LOCATING THE CENTER OF THE OPTIC DISC AND MACULA
90 CHAPTER 4 LOCATING THE CENTER OF THE OPTIC DISC AND MACULA The objective in this chapter is to locate the centre and boundary of OD and macula in retinal images. In Diabetic Retinopathy, location of
More informationRecognizing Panoramas
Recognizing Panoramas Kevin Luo Stanford University 450 Serra Mall, Stanford, CA 94305 kluo8128@stanford.edu Abstract This project concerns the topic of panorama stitching. Given a set of overlapping photos,
More informationColour Profiling Using Multiple Colour Spaces
Colour Profiling Using Multiple Colour Spaces Nicola Duffy and Gerard Lacey Computer Vision and Robotics Group, Trinity College, Dublin.Ireland duffynn@cs.tcd.ie Abstract This paper presents an original
More informationFast, Robust Colour Vision for the Monash Humanoid Andrew Price Geoff Taylor Lindsay Kleeman
Fast, Robust Colour Vision for the Monash Humanoid Andrew Price Geoff Taylor Lindsay Kleeman Intelligent Robotics Research Centre Monash University Clayton 3168, Australia andrew.price@eng.monash.edu.au
More informationImage analysis. CS/CME/BioE/Biophys/BMI 279 Oct. 31 and Nov. 2, 2017 Ron Dror
Image analysis CS/CME/BioE/Biophys/BMI 279 Oct. 31 and Nov. 2, 2017 Ron Dror 1 Outline Images in molecular and cellular biology Reducing image noise Mean and Gaussian filters Frequency domain interpretation
More informationLane Detection in Automotive
Lane Detection in Automotive Contents Introduction... 2 Image Processing... 2 Reading an image... 3 RGB to Gray... 3 Mean and Gaussian filtering... 5 Defining our Region of Interest... 6 BirdsEyeView Transformation...
More informationA TWO-PART PREDICTIVE CODER FOR MULTITASK SIGNAL COMPRESSION. Scott Deeann Chen and Pierre Moulin
A TWO-PART PREDICTIVE CODER FOR MULTITASK SIGNAL COMPRESSION Scott Deeann Chen and Pierre Moulin University of Illinois at Urbana-Champaign Department of Electrical and Computer Engineering 5 North Mathews
More informationChapter 2 Transformation Invariant Image Recognition Using Multilayer Perceptron 2.1 Introduction
Chapter 2 Transformation Invariant Image Recognition Using Multilayer Perceptron 2.1 Introduction A multilayer perceptron (MLP) [52, 53] comprises an input layer, any number of hidden layers and an output
More informationCSC 320 H1S CSC320 Exam Study Guide (Last updated: April 2, 2015) Winter 2015
Question 1. Suppose you have an image I that contains an image of a left eye (the image is detailed enough that it makes a difference that it s the left eye). Write pseudocode to find other left eyes in
More informationCounting Sugar Crystals using Image Processing Techniques
Counting Sugar Crystals using Image Processing Techniques Bill Seota, Netshiunda Emmanuel, GodsGift Uzor, Risuna Nkolele, Precious Makganoto, David Merand, Andrew Paskaramoorthy, Nouralden, Lucky Daniel
More informationLocalization (Position Estimation) Problem in WSN
Localization (Position Estimation) Problem in WSN [1] Convex Position Estimation in Wireless Sensor Networks by L. Doherty, K.S.J. Pister, and L.E. Ghaoui [2] Semidefinite Programming for Ad Hoc Wireless
More informationCSE 564: Scientific Visualization
CSE 564: Scientific Visualization Lecture 5: Image Processing Klaus Mueller Stony Brook University Computer Science Department Klaus Mueller, Stony Brook 2003 Image Processing Definitions Purpose: - enhance
More informationMultiresolution Analysis of Connectivity
Multiresolution Analysis of Connectivity Atul Sajjanhar 1, Guojun Lu 2, Dengsheng Zhang 2, Tian Qi 3 1 School of Information Technology Deakin University 221 Burwood Highway Burwood, VIC 3125 Australia
More informationMATHEMATICAL MODELS Vol. I - Measurements in Mathematical Modeling and Data Processing - William Moran and Barbara La Scala
MEASUREMENTS IN MATEMATICAL MODELING AND DATA PROCESSING William Moran and University of Melbourne, Australia Keywords detection theory, estimation theory, signal processing, hypothesis testing Contents.
More informationPhotometry. Variable Star Photometry
Variable Star Photometry Photometry One of the most basic of astronomical analysis is photometry, or the monitoring of the light output of an astronomical object. Many stars, be they in binaries, interacting,
More informationEileen Donelan. What s in my Camera Bag? Minimum Camera Macro Lens Cable Release Tripod
Close Up Photography Creating Artistic Floral Images Eileen Donelan Equipment Choices for Close Up Work What s in my Camera Bag? Minimum Camera Macro Lens Cable Release Tripod Additional Light Reflector
More informationImage Enhancement. DD2423 Image Analysis and Computer Vision. Computational Vision and Active Perception School of Computer Science and Communication
Image Enhancement DD2423 Image Analysis and Computer Vision Mårten Björkman Computational Vision and Active Perception School of Computer Science and Communication November 15, 2013 Mårten Björkman (CVAP)
More informationTED TED. τfac τpt. A intensity. B intensity A facilitation voltage Vfac. A direction voltage Vright. A output current Iout. Vfac. Vright. Vleft.
Real-Time Analog VLSI Sensors for 2-D Direction of Motion Rainer A. Deutschmann ;2, Charles M. Higgins 2 and Christof Koch 2 Technische Universitat, Munchen 2 California Institute of Technology Pasadena,
More informationProject 4 Results http://www.cs.brown.edu/courses/cs129/results/proj4/jcmace/ http://www.cs.brown.edu/courses/cs129/results/proj4/damoreno/ http://www.cs.brown.edu/courses/csci1290/results/proj4/huag/
More informationDetection of Out-Of-Focus Digital Photographs
Detection of Out-Of-Focus Digital Photographs Suk Hwan Lim, Jonathan en, Peng Wu Imaging Systems Laboratory HP Laboratories Palo Alto HPL-2005-14 January 20, 2005* digital photographs, outof-focus, sharpness,
More information