Impact of the subjective dataset on the performance of image quality metrics


To cite this version: Sylvain Tourancheau, Florent Autrusseau, Parvez Sazzad, Yuukou Horita. Impact of the subjective dataset on the performance of image quality metrics. IEEE International Conference on Image Processing 2008 (ICIP 2008), Oct 2008, San Diego, CA, United States. 2008. <hal-00321663>

HAL Id: hal-00321663, https://hal.archives-ouvertes.fr/hal-00321663. Submitted on 15 Sep 2008.

IMPACT OF SUBJECTIVE DATASET ON THE PERFORMANCE OF IMAGE QUALITY METRICS

Sylvain Tourancheau (1), Florent Autrusseau (1), Z.M. Parvez Sazzad (2) and Yuukou Horita (2)

(1) Laboratoire IRCCyN, Université de Nantes, Rue Christian Pauc, 44306 Nantes, France (sylvain.tourancheau@univ-nantes.fr, florent.autrusseau@univ-nantes.fr)
(2) Graduate School of Engineering, Univ. of Toyama, 3190 Gofuku, Toyama, 930-8555 Japan (parvezdu@ctg.u-toyama.ac.jp, horita@eng.u-toyama.ac.jp)

ABSTRACT

Interest in objective quality assessment has increased significantly over the past decades. Several objective quality metrics have been proposed and made publicly available, and several subjective quality assessment databases are distributed in order to evaluate and compare these metrics. However, several questions arise: Is the behaviour of objective metrics constant across databases, contents and distortions? How significantly do subjective scores fluctuate between displays (i.e. CRT or LCD)? Which objective quality metric best evaluates a given distortion? In this article, we analyse the behaviour of four objective quality metrics (including PSNR) tested on three image databases. We demonstrate that the performance of the quality metrics can fluctuate strongly depending on the database used for testing. We also show the consistency of all metrics across two distinct displays.

Index Terms: Image quality, Quality assessment, Subjective database

1. INTRODUCTION

As subjective experiments are extremely tedious and time consuming, objective quality metrics have lately been extensively studied, and their performance has significantly improved over the decades. Although the aim of an objective metric is to replace tedious subjective experiments, designing a metric still requires comparing the objective scores with subjective ones and using statistical tools to accurately evaluate the objective quality assessment.
Objective quality metrics may be used for various purposes, but they are usually designed and used in the context of image and video compression. However, interest in objective quality metrics for other image and video processing applications (such as digital watermarking) has also recently increased. Evidently, the performance of objective metrics strongly relies on the tested distortion. For instance, the quality range of distortions induced by watermarking techniques will be very narrow compared to the quality range of compressed data.

This study is motivated by the need to faithfully analyse the performance of the metrics, and to demonstrate that, to be relevant, such an analysis has to be conducted on a wide set of distortions and contents. It is thus strongly recommended to compare the metrics on several of the publicly available databases. One of the questions we address here is the monotonicity of the metrics' performances across various subjective databases. Four metrics were tested: a statistical metric (PSNR), an advanced metric exploiting Human Visual System features (C4), one based on an information-theoretic framework (VIF) and one based on structural similarities (SSIM). Three subjective databases are used, and the metrics are assessed with the linear correlation and the root mean squared error. Strong variations among the four tested metrics are observed within and across the databases. This may be explained by a rather large disparity among the databases.

In addition, we present a subjective study pointing out the strong similarities between subjective quality assessment on CRT and LCD monitors: two subjective experiments on the same dataset were conducted independently, one on each display.

(Footnote: The authors would like to thank Romuald Pépion for his assistance in obtaining the results described in the paper.)
The analysis of both experiments will show that the variations in the subjective data are very weak, and that they can moreover easily be attributed to cultural factors and lab effects rather than to the displays themselves. Similarly, a comparison is performed between release 2 of the LIVE database and an update obtained after a realignment of the raw scores; it shows minor differences in behaviour for certain metrics.

This paper is organized as follows: Section 2 describes the experiments conducted to construct our datasets, the performances of the quality metrics on each dataset are computed in Section 3, and the results are discussed in Section 4.

2. DESCRIPTION OF THE SUBJECTIVE DATASETS

2.1. IRCCyN/IVC database

The IRCCyN/IVC subjective database [1] consists of 10 original colour images with a resolution of 512 × 512 pixels, from which 235 distorted images have been generated using four different processes (JPEG, JPEG2000, LAR coding, blurring). These algorithms have the advantage of generating very different types of distortion. Each distortion type has been optimized in order to uniformly cover the whole range of quality.

Subjective evaluations were performed in a normalized room with lighting conditions and display settings adjusted according to ITU recommendation BT.500-11. The viewing distance was set to six times the picture's height. Fifteen observers participated in the experiments; they were checked for visual acuity and colour blindness. A double stimulus impairment scale (DSIS) method was used: the original and distorted pictures were displayed sequentially, and at the end of the presentation the observer was asked to assess the annoyance he/she felt on the distorted image with respect to the original one. The impairment scale contained five categories marked with adjectives and numbers as follows: 5 Imperceptible, 4 Perceptible but not annoying, 3 Slightly annoying, 2 Annoying and 1 Very annoying. The mean opinion score was then computed over the observers, after the potential rejection of observers according to the recommendation.

2.2. Toyama database

The Toyama subjective database [2] contains 182 images of 768 × 512 pixels. Of these, 14 are original images (24 bit/pixel RGB); the remaining 168 are JPEG and JPEG2000 coded images (i.e. 84 compressed images for each type of distortion). Six quality scales and six compression ratios were selected for the JPEG and JPEG2000 encoders, respectively. The following codec software was used to generate the compressed images: cjpeg for JPEG and JasPer for JPEG2000. Subjective experiments were conducted in a normalized room with low lighting conditions and display settings adjusted according to ITU-R BT.500-11. The viewing distance was set to four times the picture's height.
Prior to the session, all subjects were screened for normal visual acuity (with or without glasses), normal colour vision and familiarity with the language. Sixteen non-expert subjects were shown the database; most of them were college students. The single stimulus absolute category rating (SSACR) method was used in these subjective experiments. The subjects were asked to rate their perception of quality on a discrete five-level scale labelled with adjectives: Bad (1), Poor (2), Fair (3), Good (4) and Excellent (5). Note that the numerical values attached to each category were only used for data analysis and were not shown to the subjects. At the end of each test presentation, observers provided a quality rating using the adjective scale. The test presentation order was randomized according to standard procedure and the raw scores were collected in a data file by the computer.

As the original images were assessed as well, the raw scores were converted into difference scores for each observer by computing the difference between the score obtained by the original image and the one obtained by the distorted image. Difference mean opinion scores (DMOS) were then computed for each image, after screening of the post-experiment results (most subjects had no outliers) according to ITU-R Rec. 500-10.

The Toyama database was assessed on a CRT display at the University of Toyama in Japan. In order to check whether the display is of central importance in such an experiment, we decided to conduct the same subjective quality assessment at the IRCCyN laboratory in France, using a liquid crystal display (LCD). However, reproducing the same experiment in two different labs is a real challenge, and the so-called lab effect can occur. Even if they are set to be as similar as possible, the viewing conditions can differ from one testing room to another. Furthermore, using two different pools of observers can also lead to slight differences.
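The DMOS computation described earlier in this subsection (per-observer difference between the score of the original and the score of the distorted image, then averaging over observers) can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name `dmos_per_image`, the nested-list layout and the toy ratings are all assumptions, and the observer screening step of the ITU-R recommendation is deliberately left out.

```python
from statistics import mean


def dmos_per_image(ref_scores, dist_scores, ref_of):
    """Return one DMOS value per distorted image (hypothetical helper).

    ref_scores[o][r]  : raw score given by observer o to original image r
    dist_scores[o][d] : raw score given by observer o to distorted image d
    ref_of[d]         : index of the original image that d was generated from
    """
    n_observers = len(dist_scores)
    dmos = []
    for d, r in enumerate(ref_of):
        # Per-observer difference score: original minus distorted,
        # so a larger value means a stronger perceived degradation.
        diffs = [ref_scores[o][r] - dist_scores[o][d] for o in range(n_observers)]
        # Average the difference scores over the observers.
        dmos.append(mean(diffs))
    return dmos


# Toy ratings on the 1-5 scale (made-up values): 2 observers,
# 2 original images, 3 distorted images.
ref_scores = [[5, 4], [4, 4]]
dist_scores = [[3, 2, 4], [2, 3, 4]]
ref_of = [0, 1, 1]
toy_dmos = dmos_per_image(ref_scores, dist_scores, ref_of)
```

Taking the difference for each observer before averaging removes part of the observer's personal bias in using the scale, which is why it is done before the mean rather than after.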
Cultural differences between France and Japan may also affect the way quality is assessed. For example, observers may interpret the adjectives on the quality scale differently. In any case, these two distinct experiments provide two different subjective datasets for the same image database.

2.3. LIVE database

The LIVE database release 2 [3] contains 779 distorted images with five distortion types: JPEG, JPEG2000, white noise, Gaussian blur and bit errors in the JPEG2000 bit stream. Subjective quality scores have been published in the form of difference scores in a quality range from 0 to 100. For more details concerning the experiments, please see Ref. [3].

Recent work [4] from the LIVE laboratory has presented a new method to realign the subjective quality scores obtained from each session with each other. This realignment process uses a Z-score transform in order to attenuate the inter-observer differences, followed by an inverse transform to go back to the 0-100 quality range. This inverse transform has been adapted for each session. Following this work, an update of the subjective quality scores has been published online. In this paper, we work on both datasets (release 2 and update).

3. QUALITY METRICS

Objective image and video quality metrics can be classified according to the availability of the distortion-free image/video signal, which may be used as a reference against which the distorted signal is compared. Such metrics are usually of three kinds:

Full Reference (FR) quality assessment metrics, for which the exact original image is needed in order to assess the visual quality of any distorted version.

Reduced Reference (RR) quality metrics, for which a reduced form of the original image is used.

No Reference (NR) metrics, where only the distorted image is needed.

Several quality metrics (QM) have been used on the five datasets described previously:

Structural SIMilarity (SSIM) [5] is an objective metric for assessing perceptual image quality, working under the assumption that human visual perception is highly adapted for extracting structural information from a scene. Quality evaluation is thus based on the degradation of this structural information, assuming that error visibility should not be equated with loss of quality, as some distortions may be clearly visible but not so annoying.

C4 [6] is a metric based on the comparison between the structural information extracted from the distorted and the original images. What makes this metric interesting is that it uses reduced references containing perceptual structural information and exploits an implementation of a rather elaborate model of the HVS.

Visual Information Fidelity (VIF) [7] is a full reference quality metric using the concept of information fidelity measurement. The VIF metric combines Natural Scene Statistics with HVS features and a model of the considered distortion to quantify the loss of information.

Two classic criteria have been chosen to evaluate the performance of the QM on each database: the linear correlation coefficient (Pearson's correlation) and the root mean squared error (RMSE). These two values have been computed after a nonlinear regression on the results of the QM. This regression is performed to map the output of each QM to the quality range of the DMOS. It has been done for the whole databases, and also separately for the JPEG and JPEG2000 subsets in each database.

Fig. 1: Top line: linear correlation coefficient between QM and subjective scores for the three databases. Bottom line: root mean square error between QM and subjective scores for the three databases. Results are computed on the whole databases (white bars) and independently on the JPEG and JPEG2000 subsets (gray and black bars, respectively).
The non-linear function used to compute the regression was a logistic function with three parameters, as described in Eq. 1. The values of the three parameters a, b and c are optimized in order to minimize the RMSE.

$$Q_{\mathrm{mapped}} = \frac{a}{1 + e^{-b\,(Q - c)}} \qquad (1)$$

Figure 1 presents the linear correlation coefficient (top line) and the RMSE (bottom line) for the three QM as well as for PSNR, on the five datasets. Vertical bars indicate the 95% confidence interval of each correlation value (computed after a Fisher transformation in order to map the correlation to a normally distributed variable). This information makes it possible to determine whether two correlations are statistically distinguishable or not. In this figure, two sets of results are shown for the LIVE database (corresponding to release 2 and the updated version) and for the Toyama database (corresponding to the CRT and LCD datasets). The linear correlation between PSNR and the subjective scores on the JPEG subset of both Toyama datasets was below 0.6.
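The evaluation chain described above (logistic mapping of Eq. 1, then Pearson's correlation, RMSE and a Fisher-transform confidence interval) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the logistic is written with a negative sign in the exponent (the usual form of a three-parameter logistic), the parameters (a, b, c) are taken as given rather than fitted, and the confidence interval uses the common standard error 1/sqrt(n - 3) for the Fisher z value.

```python
import math
from statistics import NormalDist


def logistic_map(q, a, b, c):
    """Three-parameter logistic of Eq. 1: a / (1 + exp(-b * (q - c)))."""
    return a / (1.0 + math.exp(-b * (q - c)))


def pearson(x, y):
    """Pearson's linear correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean squared error between two sequences of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x))


def fisher_ci(r, n, level=0.95):
    """Confidence interval of a correlation r measured on n samples.

    r is mapped to z = atanh(r), which is approximately normal with
    standard error 1 / sqrt(n - 3); the bounds are mapped back with tanh.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    zc = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return math.tanh(z - zc * se), math.tanh(z + zc * se)


# Toy example with made-up numbers: metric outputs, DMOS values and
# logistic parameters are illustrative only.
metric_out = [20.0, 25.0, 30.0, 35.0, 40.0]
dmos = [3.6, 3.0, 2.1, 1.2, 0.5]
a, b, c = 4.0, -0.2, 30.0  # b < 0: the metric grows as DMOS shrinks
mapped = [logistic_map(q, a, b, c) for q in metric_out]
r = pearson(mapped, dmos)
err = rmse(mapped, dmos)
ci_low, ci_high = fisher_ci(r, len(dmos))
```

In practice the parameters would be obtained by a least-squares fit (which is equivalent to minimizing the RMSE), for instance with `scipy.optimize.curve_fit`, before computing the correlation and the RMSE on the mapped scores. Two correlations whose intervals do not overlap can be regarded as statistically distinguishable.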

4. DISCUSSION

As expected, the performances of the quality metrics show more or less the same tendency for the five subjective datasets. VIF and C4 appear to be the best QM, with very close results both for correlation (where the 95% confidence intervals largely overlap) and for RMSE. SSIM obtains somewhat lower results, particularly for the IVC and Toyama databases, and PSNR gets the lowest results for all datasets.

The differences across databases, particularly the decrease in performance on the IVC and Toyama databases for all QM, might be explained by the difference in quality range between the databases. As illustrated in Figure 2, low quality anchors in the LIVE database are indeed strongly distorted pictures with extremely low quality. The corresponding low anchors in the Toyama and IVC databases appear to have noticeably better quality. It is observed that quality metrics based on the statistics of images, such as PSNR, provide better results on the LIVE database than on databases with a narrower range of quality such as IVC and Toyama. Statistical metrics may indeed be less accurate in the high quality range, since quality perception in this range is mostly due to perceptual HVS features rather than to the statistics of the image. The good performance of the advanced quality metrics (VIF and C4) on all databases seems to confirm this assumption.

Fig. 2: Illustration of the difference of low anchor between two databases. (a) Original; (b) Toyama database (DMOS = 3.5); (c) LIVE database (DMOS = 65). (b-c): lowest quality pictures for JPEG distortion in the Toyama and LIVE databases, respectively.

The results showed that the type of display does not significantly affect the subjective scores. This was an expected behaviour, since the subjective datasets obtained on CRT (at the University of Toyama) and on LCD (at the IRCCyN lab) were highly correlated (Pearson's correlation of 0.957), and the quality metric results on both datasets were, of course, the same. However, it is somewhat remarkable to obtain such close results on two different subjective datasets despite the different displays, different labs and cultural effects. This is evidence that subjective quality assessment experiments are quite reliable, and that differences between databases are mostly due to differences in content, distortion type and quality range.

In this paper, we analysed various quality metrics across subjective databases. We showed that, to be relevant, the performance evaluation of a quality metric has to be carried out on several databases. Evaluating a metric with one single subjective database might not be sufficient, as the quality range of the database seems to be of central importance. It is therefore important to be aware of the differences between subjective datasets, as well as to know the profile of the tested quality metric.

5. REFERENCES

[1] P. Le Callet and F. Autrusseau, "Subjective quality assessment IRCCyN/IVC database," http://www.irccyn.ec-nantes.fr/ivcdb/, 2005.

[2] Y. Horita, Y. Kawayoke, and Z. M. Parvez Sazzad, "Image quality evaluation database," ftp://guest@mict.eng.u-toyama.ac.jp/.

[3] H.R. Sheikh, Z. Wang, L. Cormack, and A.C. Bovik, "LIVE image quality assessment database release 2," http://live.ece.utexas.edu/research/quality/.

[4] H.R. Sheikh, M.F. Sabir, and A.C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440-3451, 2006.

[5] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.

[6] M. Carnec, P. Le Callet, and D. Barba, "Objective quality assessment of color images based on a generic perceptual reduced reference," Signal Processing: Image Communication, vol. 23, no. 4, pp. 239-256, April 2008.

[7] H.R. Sheikh and A.C. Bovik, "Image information and visual quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430-444, May 2005.