
An Automatic Image Quality Assessment Technique Incorporating Higher Level Perceptual Factors

Wilfried Osberger and Neil Bergmann
Space Centre for Satellite Navigation, Queensland University of Technology, GPO Box 2434, Brisbane, 4001, Australia

Anthony Maeder
School of Engineering, University of Ballarat, Ballarat, 3353, Australia

Abstract

We present an objective image quality assessment technique which is based on the properties of the human visual system (HVS). It consists of two major components: an early vision model (multi-channel and designed specifically for complex natural images), and a visual attention model which indicates regions of interest in a scene through the use of Importance Maps. Visible errors are then weighted, depending on the perceptual importance of the region in which they occur. We show that this technique produces a high correlation with subjective test data (0.93), compared to only 0.65 for PSNR. This technique is particularly useful for images coded with spatially varying quality.

1 Introduction

The limited accuracy obtained by simple objective quality metrics such as peak signal-to-noise ratio (PSNR) and mean squared error (MSE) has led to the development of more advanced quality assessment techniques. Models based on the early stages of human vision [4, 12] (i.e. up to the primary visual cortex) have shown promise and are useful for determining whether compression errors are visible or not at each location in an image. However, many compression applications introduce suprathreshold errors, and allow variable quality across different parts of the image. Vision models which treat visible errors equally, regardless of their location in the image, may not be powerful enough to accurately predict picture quality in such cases. This is because we are known to be more sensitive to errors in areas of the scene to which we are paying attention than to errors in peripheral areas [10, 11].
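The baseline metrics referred to above, MSE and PSNR, follow their standard definitions; for reference, a minimal sketch for 8-bit greyscale images (this is textbook code, not code from the paper):

```python
import numpy as np

def mse(original, distorted):
    """Mean squared error between two equal-sized greyscale images."""
    diff = original.astype(np.float64) - distorted.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(original, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak = 255 for 8-bit images)."""
    err = mse(original, distorted)
    if err == 0:
        return np.inf  # identical images: error is zero, PSNR unbounded
    return 10.0 * np.log10(peak ** 2 / err)
```

Both metrics treat every pixel error identically, which is exactly the limitation the paper addresses.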
In this paper we revise an HVS-based quality metric for monochrome images that we have recently developed [16] and demonstrate its high correlation with subjective opinion by comparing its predictions with subjective Mean Opinion Score (MOS) data. The HVS model contains both an early vision stage (which detects whether an error is visible, assuming foveal viewing) and a model of human visual attention. The early vision model has been designed specifically to take into account the structure of complex natural images. The attention model takes into account several factors which are known to influence visual attention, to produce an Importance Map (IM) [13, 15]. This map is used to weight the influence of the visible errors produced by the early vision model, in order to obtain a more accurate indication of picture quality.

2 Important HVS Characteristics

Extensive physiological and psychophysical experimentation into the operation of the primate visual system has resulted in a good understanding of the HVS, in particular up to the primary visual cortex (area V1). Important features of our early vision which need to be considered by a vision model include:

- Sensitivity to contrast changes rather than luminance changes (approximated by Weber's law at high luminance, and the de Vries–Rose law at lower luminance levels).
- Frequency-dependent sensitivity to stimuli. This can be modeled by a Contrast Sensitivity Function (CSF), which estimates the visibility threshold for stimuli at different spatial frequencies. However, the shape and threshold of the CSF depend upon the type of stimulus used. For naturalistic images, the CSF reduces significantly at low to mid frequencies [22].
- Masking, which refers to our reduced ability to detect a stimulus on a spatially or temporally complex background. Thus errors are less visible along strong edges, in textured areas, or immediately following a scene change. The amount of masking caused by a background depends not

only on the background's contrast, but also on the level of uncertainty created by the background and stimulus [21]. Areas of high uncertainty (e.g. complex areas of a scene, textures) induce higher masking than areas of the same contrast with lower uncertainty (e.g. edges, gratings).

2.1 Attention and Eye Movements

In order to deal efficiently with the tremendous amount of information with which it is presented, the HVS operates with variable resolution. High acuity is obtained in a very small area called the fovea, which is only around 2 degrees of visual angle in diameter. Our visual field, however, spans 200 × 135 degrees. Our fovea is repositioned by rapid, directed eye movements called saccades, which occur every 100–500 milliseconds. Visual attention processes control the location of future saccades by scanning peripheral regions in parallel, looking for important or uncertain areas which require foveal viewing. Thus a strong relation exists between eye movements and attention.

Studies of viewer eye movement patterns while viewing natural image or video scenes have shown that subjects tend to fixate on similar locations in a scene [20, 23], provided they are not given different instructions prior to viewing. Fixations are generally not distributed evenly over the scene; instead, there tend to be a few regions which receive a disproportionately large number of fixations, while other areas are not foveally viewed at all. Our perception of overall picture quality is heavily influenced by the quality in these areas of interest, while peripheral regions can undergo significant degradation without strongly affecting overall quality [10, 11]. However, care must be taken when reducing peripheral resolution not to remove future visual attractors or create new ones [6]. A number of factors have been found to influence our eye movements and visual attention.
In general, objects which stand out from their surrounds with respect to a particular factor are likely to attract our attention. Some of the most important factors include:

- Contrast [8, 19]. Regions which have high contrast with their surrounds attract our attention and are likely to be of greater visual importance.
- Size. Larger regions are more likely to attract attention than smaller ones [2, 8]. However, a saturation point exists, after which importance due to size levels off.
- Shape [9, 19]. Long, thin regions have been found to be visual attractors.
- Colour [3, 14]. A strong influence occurs when the colour of a region is distinct from the background colour.
- Motion [14]. Our peripheral vision is highly tuned to detecting changes in motion, and our attention is involuntarily drawn to peripheral areas undergoing motion distinct from their surrounds.
- Location. Eye-tracking experiments have shown that viewers' eyes are directed at the centre 25% of a screen for a majority of viewing material [7].
- Foreground/Background [2, 3]. Viewers are more likely to be attracted to objects in the foreground than those in the background.
- People. Many studies [9, 19, 23] have shown that we are drawn to focus on people in a scene, in particular their faces, eyes, mouths, and hands.
- Context [9, 23]. Viewers' eye movements can be dramatically changed, depending on the instructions they are given prior to or during the observation of an image.

3 Model Description

A block diagram of our quality assessment technique is shown in Figure 1.

Figure 1: Block diagram of the image quality assessment technique. The original and distorted images feed a multi-channel early vision model (producing the PDM) and an importance calculation over 5 factors (producing the IM); the PDM is weighted by the IM and summed to give the IPQR.

The model is based on the approach used in [16]. The multi-channel early vision model is described in detail in [17].
In brief, it consists of the following processes: a conversion from luminance to contrast using Peli's LBC algorithm [18]; application of a CSF using data obtained from naturalistic stimuli; a contrast masking function which raises the CSF threshold more quickly in textured (uncertain) areas than in flat or edge regions; visibility thresholding; and Minkowski summation. This process produces what we term a Perceptual Distortion Map (PDM). This map indicates the parts of the coded picture that contain visible distortions, assuming that each area is viewed foveally.
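The first step of this pipeline, Peli's local band-limited contrast [18], divides each frequency band's response by the local mean luminance beneath it. A simplified sketch of that idea follows; note that Gaussian-difference bands stand in here for Peli's cosine-log filters, and `n_bands` and `base_sigma` are illustrative values, not the paper's parameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_band_limited_contrast(image, n_bands=4, base_sigma=1.0):
    """Peli-style local band-limited contrast (simplified sketch).

    Band k's contrast is the bandpass response at that scale divided by
    the lowpass image (local mean luminance) below it.
    """
    image = image.astype(np.float64)
    # Lowpass stack: progressively blurrier versions of the image.
    lowpass = [image]
    for k in range(n_bands):
        lowpass.append(gaussian_filter(image, base_sigma * 2 ** (k + 1)))
    contrasts = []
    for k in range(n_bands):
        bandpass = lowpass[k] - lowpass[k + 1]          # detail in band k
        local_mean = np.maximum(lowpass[k + 1], 1e-6)   # avoid divide-by-zero
        contrasts.append(bandpass / local_mean)
    return contrasts  # one contrast map per frequency band
```

Expressing errors as contrast rather than raw luminance difference is what lets the subsequent CSF and masking stages apply visibility thresholds consistently across light and dark image regions.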

As discussed in Section 2.1, our perception of overall picture quality is strongly influenced by the quality of the most perceptually important parts of a scene. To give an indication of the location of regions of interest in a scene, we utilise Importance Maps (IMs) [13, 15]. These automatically generated maps identify visually salient regions based on factors known to influence visual attention. A complete description of how IMs are generated can be found in [15]. In brief, the original image is first segmented into homogeneous regions. The salience of each region is then determined with respect to each of 5 factors: contrast, size, shape, location, and foreground/background. The flexible structure of the algorithm allows additional factors, such as colour or a priori knowledge of scene content, to be incorporated easily. The factors are weighted equally and summed to produce an overall IM for the image. Each region is assigned an importance value in the range 0.0–1.0, with 1.0 representing highest importance.

The IM is used to weight the output of the PDM, such that errors in areas classified as being of lower importance have a lower influence than errors in areas of high perceptual importance. This is given by:

    IPDM(x, y) = PDM(x, y) · IM(x, y)^α    (1)

where IPDM(x, y) represents the IM-weighted PDM, and α is an importance scaling power. We have found a value of α = 1.0 to give good results. To produce a single number representing picture quality from the IPDM, Minkowski summation with an exponent of 2 is performed [5]. This summed error value E can be converted to a scale from 1–5, so that correlation with subjective Mean Opinion Score (MOS) data can be determined. This is done using:

    (I)PQR = 5 / (1 + pE)    (2)

where (I)PQR represents an (IM-weighted) Perceptual Quality Rating and p is a scaling constant. As a result of subjective testing we have found that a value of p = 0.8 gives good correlation with subjective data.
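Equations (1) and (2) can be sketched in a few lines. In this sketch, `alpha` is the importance scaling power and `p` the scaling constant from the text; mean-based Minkowski pooling and all function/parameter names are assumptions, not the authors' implementation:

```python
import numpy as np

def ipqr(pdm, im, alpha=1.0, minkowski_exp=2.0, p=0.8):
    """Importance-weighted Perceptual Quality Rating (sketch of Eqs. 1-2).

    pdm : Perceptual Distortion Map (visible-error magnitudes).
    im  : Importance Map, per-pixel importance in [0, 1].
    """
    # Eq. (1): weight each visible error by the importance of its region.
    ipdm = pdm * im ** alpha
    # Minkowski summation (exponent 2) pools the map into one error value E.
    E = np.mean(np.abs(ipdm) ** minkowski_exp) ** (1.0 / minkowski_exp)
    # Eq. (2): map the pooled error onto a 1-5 quality scale.
    return 5.0 / (1.0 + p * E)
```

A distortion-free image (all-zero PDM) scores 5.0, and the score falls toward 1 as importance-weighted error grows; errors in low-importance regions are attenuated before pooling.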
4 Results

An example of the outputs produced by this technique can be seen in Fig. 2 for the image lighthouse. The coded image (Fig. 2(c)) is actually a composite of two wavelet coded images: one at a high bitrate (1.02 bits/pixel) and one at a low bitrate (0.20 bits/pixel). Rectangular areas in the regions of interest in the scene (i.e. the lighthouse and surrounding buildings) have been cut from the high bitrate image and pasted onto the lower bitrate image. The result is an image with high quality in the regions of interest, and lower quality in the periphery. This type of image typically has a higher subjective quality than an image of the same bitrate which has been coded at a single quality level.

Table 1: Correlation with MOS achieved by PSNR, PQR and IPQR.

    Images used               PSNR    PQR     IPQR
    All images                0.65    0.87    0.93
    JPEG & wavelet only       0.74    0.94    0.97
    Variable quality only     0.55    0.84    0.90

If the MSE (Fig. 2(d)) or PDM (Fig. 2(e)) are decomposed into a single number representing image quality (PSNR and PQR respectively), they are not capable of predicting the increased subjective quality of this composite coded picture, since they fail to take into account the perceptual importance of the location of the distortion. However, when the IPDM is summed to produce an IPQR, the increase in subjective quality is predicted.

We have performed subjective testing in order to determine the correlation of our technique with subjective opinion. The subjective tests were performed in accordance with the CCIR Rec. 500-6 DSCQS [1] testing procedure. Eighteen subjects were asked to assess the quality of four different images (announcer, beach, lighthouse, and splash), which were coded using wavelet and JPEG coders at a variety of bitrates (0.15 to 1.6 bits/pixel). This produced a set of 32 compressed images.
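The correlation values reported in Table 1 are presumably Pearson linear correlation coefficients between each metric's predictions and the subjective MOS; the paper does not give the formula, so the following is a standard-definition sketch for reference:

```python
import numpy as np

def pearson_correlation(predictions, mos):
    """Pearson linear correlation between metric outputs and subjective MOS."""
    x = np.asarray(predictions, dtype=np.float64)
    y = np.asarray(mos, dtype=np.float64)
    x = x - x.mean()  # centre both series
    y = y - y.mean()
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))
```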
To provide a more challenging test, we included a further 12 images which were a composite of high bitrate (in an area we chose as the region of interest) and low bitrate (in all other areas) images. Plots of the predictions of our technique and PSNR versus MOS are shown in Fig. 3. Although PSNR generally provides a reasonable correlation with subjective opinion for a particular image and coder, it is not robust across different images and coders. This results in a poor correlation across all tested scenes. However, the predictions of the IPQR metric in Fig. 3(b) show that this technique performs independently of scene, coder, and bitrate. Unlike PSNR, the IPQR gives good results on variable quality images, since it takes into account the location as well as the magnitude of the errors. The correlations with subjective MOS achieved by PSNR, PQR, and IPQR are shown in Table 1. The IPQR technique produced a significantly higher correlation (0.93) than PSNR (0.65), and also improved upon the PQR (0.87). As expected, IPQR was the

Figure 2: Quality assessment for the image lighthouse, wavelet coded at two levels for an average bitrate of 0.35 bits/pixel. (a) Original image, (b) Importance Map, with lighter regions representing areas of higher importance, (c) coded image, (d) MSE, (e) PDM, and (f) IPDM.

Figure 3: Quality predictions compared to subjective MOS. (a) PSNR, (b) IPQR. Diamonds = announcer, squares = beach, crosses = lighthouse, circles = splash.

most successful technique when predicting the quality of the variable coded scenes. However, it also provided improved correlation over PQR for standard JPEG and wavelet coded scenes.

5 Discussion

This paper has presented a quality assessment technique which incorporates higher level perceptual factors into a visual model, and demonstrated the improved quality prediction which can be achieved using this approach. A significant improvement over the commonly used PSNR was achieved. The IPQR metric is particularly useful at predicting quality in images which have spatial variations in picture quality. An extension of the metric to video quality assessment may therefore be useful for assessing the quality of object-based coding schemes such as MPEG-4.

There are several areas in which this technique can be improved or extended. We currently consider only monochrome images, so inclusion of colour in both the early vision model and the IM calculation is an obvious progression. Other visual attractors may also be included in the IM algorithm. For instance, if prior knowledge of the type of scene being viewed were available, it could be used to provide a better prediction of the location of important areas.

References

[1] Methodology for the subjective assessment of the quality of television pictures. ITU-R Recommendation 500-6, 1994.
[2] T. Boersema and H. J. G. Zwaga. Searching for routing signs in public buildings: the distracting effects of advertisements. In D. Brogan, editor, Visual Search, pages 151–157. Taylor and Francis, 1990.
[3] B. L. Cole and P. K. Hughes. Drivers don't search: they just notice. In D. Brogan, editor, Visual Search, pages 407–417. Taylor and Francis, 1990.
[4] S. Daly. The visible difference predictor: an algorithm for the assessment of image fidelity. In A. B. Watson, editor, Digital Images and Human Vision, pages 179–206. MIT Press, Cambridge, Massachusetts, 1993.
[5] H. de Ridder.
Minkowski-metrics as a combination rule for digital image coding impairments. In Proceedings of the SPIE – Human Vision, Visual Processing and Digital Display III, volume 1666, pages 16–26, San Jose, February 1992.
[6] A. T. Duchowski and B. H. McCormick. Pre-attentive considerations for gaze-contingent image processing. In Proceedings of the SPIE – Human Vision, Visual Processing and Digital Display VI, volume 2411, pages 128–139, San Jose, February 1995.
[7] G. S. Elias, G. W. Sherwin, and J. A. Wise. Eye movements while viewing NTSC format television. SMPTE Psychophysics Subcommittee white paper, March 1984.
[8] J. M. Findlay. The visual stimulus for saccadic eye movements in human observers. Perception, 9:7–21, September 1980.
[9] A. G. Gale. Human response to visual stimuli. In W. R. Hendee and P. N. T. Wells, editors, The Perception of Visual Information, pages 127–147. Springer-Verlag, 1997.
[10] G. A. Geri and Y. Y. Zeevi. Visual assessment of variable-resolution imagery. Journal of the Optical Society of America, 12(10):2367–2375, October 1995.
[11] P. Kortum and W. Geisler. Implementation of a foveated image coding system for image bandwidth reduction. In SPIE – Human Vision and Electronic Imaging, volume 2657, pages 350–360, February 1996.
[12] J. Lubin. A visual discrimination model for imaging system design and evaluation. In E. Peli, editor, Vision Models for Target Detection and Recognition, pages 245–283. World Scientific, New Jersey, 1995.
[13] A. Maeder, J. Diederich, and E. Niebur. Limiting human perception for image sequences. In SPIE – Human Vision and Electronic Imaging, volume 2657, pages 330–337, San Jose, February 1996.
[14] E. Niebur and C. Koch. Computational architectures for attention. In R. Parasuraman, editor, The Attentive Brain. MIT Press, Cambridge, MA, 1997.
[15] W. Osberger and A. J. Maeder. Automatic identification of perceptually important regions in an image using a model of the human visual system.
To appear in 14th International Conference on Pattern Recognition, August 1998.
[16] W. Osberger, A. J. Maeder, and N. W. Bergmann. A technique for image quality assessment based on a human visual system model. To appear in 9th European Signal Processing Conference (EUSIPCO-98), September 1998.
[17] W. Osberger, A. J. Maeder, and D. McLean. A computational model of the human visual system for image quality assessment. In Proceedings DICTA-97, pages 337–342, New Zealand, December 1997.
[18] E. Peli. Contrast in complex images. JOSA A, 7(10):2032–2040, October 1990.
[19] J. W. Senders. Distribution of visual attention in static and dynamic displays. In Proceedings of the SPIE – Human Vision and Electronic Imaging II, volume 3016, pages 186–194, February 1997.
[20] L. B. Stelmach, W. J. Tam, and P. J. Hearty. Static and dynamic spatial resolution in image coding: an investigation of eye movements. In Proceedings of the SPIE, volume 1453, pages 147–152, San Jose, 1991.
[21] A. B. Watson, R. Borthwick, and M. Taylor. Image quality and entropy masking. In Proceedings of the SPIE – Human Vision and Electronic Imaging II, volume 3016, pages 2–12, February 1997.
[22] M. A. Webster and E. Miyahara. Contrast adaptation and the spatial structure of natural images. JOSA A, 14(9):2355–2366, September 1997.
[23] A. L. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.