Audio Informed Watermarking by means of Dirty Trellis Codes


Andrea Abrardo, Mauro Barni, Gianluigi Ferrari

Department of Information Engineering, University of Siena, Italy & CNIT Research Unit of Siena
Department of Information Engineering, University of Parma, Parma, Italy & CNIT Research Unit of Parma

Abstract

We present a frequency-domain audio watermarking scheme based on dirty convolutional codes. In the scenario addressed by the paper, a masking threshold is properly defined to characterize the inaudibility of the inserted data. In particular, the masking threshold defines the maximum modification which can be applied to each frequency sample. This represents a major deviation from classical distortion models, in which inaudibility is defined in terms of Mean Square Error (MSE), and it makes the direct application of the dirty coding paradigm, derived from a theoretical perspective, problematic. To get around this problem, we first define an informed watermarking scheme based on trellis codes, in which the same information is represented by several paths of the trellis. Then, we determine both the specific structure of the codes and the algorithm for information embedding. The proposed scheme is shown to be robust to D/A and A/D conversion, multipath, scaling, noise, and time misalignment.

I. INTRODUCTION

With early works dating back to the mid-nineties, digital watermarking is now a rather well-understood field backed by a solid theoretical background.
Yet, the application of theoretical findings to real systems is often problematic, because practical working conditions deviate from the assumptions underlying the theoretical analysis. The most striking example is the difficulty of applying the informed watermarking paradigm [1], derived by modeling the watermarking problem as one of channel coding with side information at the encoder [2], to certain practical situations, in which the older spread spectrum paradigm is often preferred by system designers. This is often the case, for instance, in audio watermarking systems designed to couple high imperceptibility with robustness against heavy distortions, including temporal de-synchronization, D/A and A/D conversions, environmental noise addition, multipath and microphone distortion. As a matter of fact, the best performing schemes proposed so far to address this scenario are not based on informed watermarking. The spread spectrum watermarking systems described in [3]-[9] are good examples of this trend. Such schemes embed a pseudo-random watermark and detect the hidden data by calculating the correlation between the pseudo-random sequence and the watermarked audio signal. Spread spectrum schemes are easy to implement, are very flexible, and can be applied to different signal representation domains, e.g., the frequency and time domains. In addition, with a proper choice of the embedding domain, spread spectrum watermarking makes it quite easy to exploit psycho-acoustic models, hence resulting in inaudible distortions even for rather high values of watermark energy. Synchronization is also quite easy with this kind of scheme, which makes the watermark robust against time misalignments and multipath. The most undesirable characteristic of spread spectrum watermarking is that the host signal acts as interference and, therefore, must be treated as a disturbing noise that limits the watermark channel capacity.
As opposed to spread spectrum watermarking, watermarking theory [1], [10]-[12] shows that, under certain hypotheses, it is possible to completely eliminate the interference between the watermark and the host signal, thus reaching the same performance obtained by systems where the decoder can access the original non-watermarked signal. To achieve this goal it is necessary to resort to so-called informed watermarking schemes (also known as dirty paper coding schemes), like the well known Quantization Index Modulation (QIM) approach [1]. At least in some application scenarios, then, there is an evident gap between theory and practice. While it is not easy to identify which deviations from the theoretical assumptions make it difficult to fulfill the expectations raised by informed watermarking, psycho-acoustic considerations surely play a major role. In audio watermarking, the inaudibility of the embedded signal is usually obtained by relying on a masking threshold which determines the maximum allowed distortion. In particular, a psycho-acoustic mask defines the maximum modification which can be applied to each frequency sample. This represents a major deviation from classical distortion models, in which inaudibility is defined in terms of Mean Square Error (MSE), and it makes the direct application of the dirty coding paradigm, derived from a theoretical analysis, problematic. Additionally, when the energy of the noise added after watermark embedding is comparable to the energy of the host signal, the advantage ensured by informed watermarking decreases.

In this paper, we propose a multi-bit informed audio watermarking system, with a payload in the range of about 2 bytes per second, that tries to cope with the above problems. In the scenario at hand, an audio track is watermarked at its creation. Since at creation time it is not known how the audio signal will be used, it is of the utmost importance that the presence of the watermark is as imperceptible as possible, even under ideal (or quasi-ideal) listening conditions. Moreover, we require robustness against distortions introduced by D/A and A/D conversions. In the considered scenario, temporal synchronization represents a challenging issue, and several blind or pilot-aided synchronization mechanisms have been proposed in the literature (e.g., see [13]). In general, whatever synchronization scheme is considered, robustness of the watermark to time offsets makes synchronization more accurate and easier to achieve. Hence, rather than referring to a specific synchronization scheme (which is not the focus of this paper), we generically require that the watermarking scheme is robust to moderate time misalignments. In addition, we assume that the audio signal can reach the receiver from multiple paths, thus qualifying the channel as one affected by multipath (i.e., linear distortions). On the other hand, we do not consider security issues and the problems raised by the presence of an adversary explicitly aiming at damaging the watermark.

The proposed scheme works in the Modified Complex Lapped Transform (MCLT) domain and is based on the use of informed trellis coded spreading sequences. Specifically, we first define an informed watermarking scheme based on Dirty Trellis Codes and Spread Spectrum (DTC-SS), in which the same information is represented by several paths of the trellis. Then, we determine both the specific structure of the codes and the algorithm for information embedding. The proposed scheme is shown to be robust to D/A and A/D conversions, multipath, scaling, noise, and time misalignment.
Specifically, our results show that introducing a moderate amount of dirtiness has a beneficial effect under various working conditions, even if pushing the informed watermarking approach too far may lead to a performance loss.

The paper is organized as follows. In Section II, the proposed DTC-SS scheme is outlined. Section III is dedicated to the characterization of the embedding feature domain. In Section IV, the proposed watermarking strategy is described in detail. After the introduction of a simple acoustic channel model in Section V, the performance of the proposed informed watermarking scheme is analyzed in Section VI. Conclusions are drawn in Section VII.

II. THE PROPOSED DTC-SS SCHEME

The proposed informed watermarking scheme is shown in Fig. 1. The DTC-SS embedder proposed in this paper relies on a dirty trellis mechanism similar to that described in [14], [15]. Let us denote by $R$ the set of all possible codewords (i.e., the codebook) of a given rate-$1/n$ binary convolutional code. Starting from the watermark sequence $w$ (i.e., the sequence of information bits to be inserted), a bit sequence $b$ is derived by interleaving each watermark bit with $M-1$ so-called free bits $v$ that will be used to select the codeword. The convolutional encoder is fed with a block $b$ of $MK$ consecutive bits, corresponding to $K$ consecutive information bits. Accordingly, a codeword $y$ is an $nKM$-long vector of binary terms. In particular, we assume that $y$ has an antipodal binary representation, i.e., a 0 output by the encoder is mapped into $+1$ and a 1 output by the encoder is mapped into $-1$. Note that, the information bits being fixed, the codeword is a function of $v$ only. The role of the information and free bits is easily understood if one adopts the conventional random binning approach to watermark embedding.
According to this approach: the information bits determine the bin (or coset) within which the codeword must be searched for; and the free bits determine the choice of a particular codeword within the bin. In classical approaches, this choice (corresponding to the choice of the free bits) is made in such a way that the embedding distortion is minimized. This is because in classical distortion models inaudibility is defined in terms of MSE. In our scenario, as will be shown in the following, inaudibility is defined in terms of the maximum modification which can be applied to each frequency sample: this represents a major deviation from a classical distortion model. Hence, according to the informed coding paradigm, at this stage we assume that the selected codeword is a generic function of the vector of features $p$ which is extracted from the original audio stream $s$. Each bit at the output of the convolutional encoder is then spread into an antipodal binary sequence of $G_s$ chips built from a spreading sequence $g$, with the conventional choice that $+1$ is mapped into $g$ and $-1$ is mapped into $-g$. Assuming that $G_s$ is even, we force $g$ to contain exactly $G_s/2$ entries equal to $-1$, randomly distributed over the sequence length, whereas the remaining $G_s/2$ entries are set to $+1$, so that the output vectors $\{c\}$ have zero temporal mean. Hence, at the output of the spreading block there is an $nKMG_s$-long vector $c$ of antipodal symbols (i.e., symbols which take values $\pm 1$). Finally, the embedding block takes, at its input, $c$ and an $nKMG_s$-long vector of features $p$ and generates, at its output, a marked version $p_w$ of the extracted features.

III. EMBEDDING FEATURES DOMAIN

Signal representation in the Modified Discrete Cosine Transform (MDCT) domain has emerged in recent years as one of the most widely adopted solutions in audio coding. The main feature of an MDCT representation is its capability of providing a perfectly invertible time-frequency transformation in the presence of overlapped windowed blocks.
To solve this apparent paradox, the MDCT uses the concept of time domain aliasing [16]. The main advantage of the MDCT domain stems from the fact that it is possible to derive a satisfactory frequency representation of the signal which avoids blocking effects and lends itself to the integration of a perceptual model of the human auditory system. However, the presence of time aliasing makes the MDCT domain vulnerable to time shifts. Indeed, time misalignments produce distortions both in the phase and in the magnitude of the frequency transformation, so that the MDCT samples are strongly modified even in the presence of small time shifts. In order to avoid high resolution synchronization and to devise a data hiding scheme robust to multipath, it is then necessary to consider a domain where no aliasing is produced. In this case, introducing the hidden data in the magnitudes of the frequency transformation samples makes the system robust to imperfect synchronization (i.e., to time shifts) and/or multipath propagation. To this aim, a possible solution is represented by audio coding techniques based on a 2x oversampled filterbank that provides perfect reconstruction, e.g., the Modified Complex Lapped Transform (MCLT) introduced in [17]. Although the MCLT is similar to a DFT filterbank, it possesses properties that make it attractive for audio processing, especially when integrated with compression systems. As an example, the original signal can be reconstructed from just $N$ complex samples (i.e., half of the total) of the frequency samples, whereas the classical DFT requires $N+1$ samples. Moreover, the real part of the MCLT corresponds to the MDCT, i.e., the original signal can be recovered from just the real part of the MCLT. Hence, we adopt the magnitudes of the MCLT samples as the domain where the watermark embedding occurs, i.e., the features $p$ in Fig. 1 represent blocks of MCLT magnitudes extracted from the original audio following the time-frequency block-based analysis illustrated below.

Fig. 1. Overall scheme of the considered watermark embedder.

IV. THE PROPOSED WATERMARKING PROCEDURE

A. Time-Frequency Block-based Analysis and Synthesis

We consider a windowed version of the MCLT transformation [17], which reduces the frequency distortions due to block-based processing. In this case, consecutive blocks of the original audio stream must overlap to allow perfect reconstruction of the original audio stream.
Denote by $s_n$, $n = 0, \ldots, L-1$, the $n$-th sample of the original $L$-long audio stream. As a first step, the stream is divided into overlapped slots of length $2N$: each slot overlaps by $N$ samples with the previous and the following slot, so that, neglecting border effects, the overall number of slots contained in $s$ is $N_B = L/N$. Denote by $s^{(m)} = \{s(mN+n),\ n = 0, \ldots, 2N-1\}$ the data contained in the $m$-th slot. We then define a Resource Block (RB) for data embedding as a block of $N_s$ consecutive slots $s^{(m)}$. Accordingly, the signal contained in the $r$-th RB is denoted by $x^{(r)} = \{s^{(rN_s)}, s^{(rN_s+1)}, \ldots, s^{((r+1)N_s-1)}\}$. For the sake of notational simplicity, in the following we drop the superscript $r$ (indicating the RB number), except when strictly required. More precisely, the generic RB is denoted by $x = \{s^{(0)}, \ldots, s^{(N_s-1)}\}$. We then consider the $N$ MCLT samples of the windowed version of $s^{(j)}$, $j = 0, \ldots, N_s-1$, denoting them by $S^{(j)}$, i.e., $S^{(j)} = \mathrm{MCLT}(s^{(j)})$. Note that the $N$-dimensional MCLT domain corresponds to a uniform sampling of the frequency interval $[0, f_s/2]$, $f_s$ being the sampling rate of the audio signal. In the considered data hiding scheme, only the coefficients between $f_{\min} \ge 0$ and $f_{\max} \le f_s/2$ are considered for data embedding. Hence, the number of samples considered in each slot is

$$N_e = N\,\frac{f_{\max}-f_{\min}}{f_s/2}.$$

The frequency interval $[f_{\min}, f_{\max}]$ is then divided into $Q$ contiguous subbands of $N_f$ samples, fulfilling the constraint $QN_f = N_e$. The $N_f$ MCLT magnitudes contained in the $q$-th subband of the $j$-th slot are denoted as $p_q^{(j)}$, $q = 0, \ldots, Q-1$, and the $N_sN_f$-long vector collecting the magnitudes of the $q$-th subband over all slots of the RB is denoted by $p_q = \{p_q(r),\ r = 0, \ldots, N_sN_f-1\}$. The rationale for the subband division is to avoid (or to limit) channel distortion on each $p_q$, i.e., the subband width $(f_{\max}-f_{\min})/Q$ should be noticeably narrower than the coherence bandwidth of the channel.
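The slot and subband bookkeeping described above can be checked numerically. The following Python sketch is only illustrative: the function names `subband_layout` and `num_slots` are hypothetical, the sampling rate of 44.1 kHz is an assumption, the remaining values are those quoted later in Section VI, and $N_e$ is rounded down to a multiple of $N_f$ so that $QN_f = N_e$ holds with an integer $Q$ (the text does not say how rounding is handled).

```python
def subband_layout(n_slot, fs, f_min, f_max, n_f):
    """Return (N_e, Q, subband_width_hz) for the MCLT embedding band.

    n_slot       : N, number of MCLT coefficients per slot (slots span 2N samples)
    fs           : sampling rate in Hz
    f_min, f_max : edges of the embedding band in Hz
    n_f          : N_f, number of coefficients per subband
    """
    if not 0 <= f_min < f_max <= fs / 2:
        raise ValueError("need 0 <= f_min < f_max <= fs/2")
    # The N coefficients sample [0, fs/2] uniformly, so the band holds this many:
    n_e_raw = n_slot * (f_max - f_min) / (fs / 2)
    # Assumption of this sketch: round down so that Q * N_f = N_e with integer Q.
    q = int(n_e_raw // n_f)
    n_e = q * n_f
    # Width of one subband in Hz: N_f bins of fs / (2N) Hz each.
    width_hz = n_f * fs / (2 * n_slot)
    return n_e, q, width_hz


def num_slots(total_samples, n_slot):
    """N_B = L / N, neglecting border effects (integer division)."""
    return total_samples // n_slot


# fs = 44100 Hz is an assumed value, not taken from the text.
n_e, q, width = subband_layout(n_slot=512, fs=44100, f_min=1400, f_max=4400, n_f=4)
print(n_e, q, round(width, 1))
```

Under these assumed values the band splits into $Q = 17$ subbands, each a little over 170 Hz wide, which is compatible with the subband width of approximately 170 Hz quoted in Section VI.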
Afterwards, with reference to the embedding strategy depicted in Fig. 1, we impose that $p_q$ embeds exactly one information bit. To this aim, we set $N_sN_f = nMG_s$, and the number $K$ of consecutive bits forming a coding block is assumed to be a multiple of $Q$, i.e., $K = HQ$. Finally, the features $p$ extracted from the block-based analysis are obtained by taking $H$ consecutive RBs. Given the above, it is straightforward to derive the information data rate as

$$R_b = \frac{2(f_{\max}-f_{\min})}{N_fN_s} = \frac{2(f_{\max}-f_{\min})}{nMG_s}.$$

Note that in the proposed scheme there is a one-to-one correspondence between the chip sequence and the vector of features. It is then convenient, for future developments, to interpret the spread codeword $c$ as formed by a sequence of contiguous $G_s$-long vectors $c(k,m,j) = \{c(k,m,j,i),\ i = 0, \ldots, G_s-1\}$, with $k = 0, \ldots, K-1$, $m = 0, \ldots, M-1$, $j = 0, \ldots, n-1$. Similarly, $p$ can be interpreted as formed by a sequence of contiguous vectors $p(k,m,j) = \{p(k,m,j,i),\ i = 0, \ldots, G_s-1\}$, with $k = 0, \ldots, K-1$, $m = 0, \ldots, M-1$, $j = 0, \ldots, n-1$.

As for block synthesis, the modified features $p_w$ are reorganized by reversing the procedure described above, so that in each block $S^{(j)}$ the MCLT magnitudes in the frequency interval $[f_{\min}, f_{\max}]$ are replaced with the corresponding modified features, while the phase spectrum is left equal to the original phase spectrum. Then, the inverse MCLT (IMCLT) is computed to produce consecutive time slots of $2N$ samples, and the modified audio stream is obtained by overlapping each slot with its neighbours by $N$ samples.

B. Psycho-Acoustic Frequency Masking Model

The data hiding procedure applied to audio signals may yield unpleasant audible artifacts, regardless of the data hiding scheme used. Of course, reducing the strength of the hidden signal can mitigate this drawback, at the expense of data extraction robustness. A more effective solution to ensure inaudibility is signal shaping based on psycho-acoustic models [3], [4], [18]-[21].
These models, which are usually designed for audio compression, exploit frequency masking effects to ensure inaudibility by shaping the quantization noise according to the so-called masking threshold. The global masking threshold, which is evaluated from the frequency representation of each windowed block of $2N$ samples, determines the maximum admissible signal power variation in dB at a given frequency $f$ before the distortion becomes audible. Such a maximum power variation will be referred to as $\gamma(f)$ in the following. It is worth noting that the signal might not be stationary over the duration of $2N$ samples. In this case, a combination of frequency and time masking effects would be required to obtain an effective audio distortion model. In this paper, rather than considering time-frequency masking thresholds (which would noticeably complicate the system model), we limit $N$ so that a slot lasts a few milliseconds (and the signal can be considered approximately stationary over $2N$ samples), and we limit $\gamma(f)$ to a maximum acceptable value, denoted as $\gamma_{\max}$, sufficient to guard against residual imperfections in the psycho-acoustic model. Reasonable values of $\gamma_{\max}$ are in the order of 2-3 dB (e.g., see [17]).

C. Embedding Strategy

According to the above discussion, if we denote by $p_w(k,m,j,i)$ the modified features after data embedding and by $\gamma(k,m,j,i)$ the maximum admissible distortion associated with $p(k,m,j,i)$, the constraints for marking inaudibility may be formulated as:

$$p(k,m,j,i)\,10^{-\gamma(k,m,j,i)/20} \le p_w(k,m,j,i) \le p(k,m,j,i)\,10^{\gamma(k,m,j,i)/20}. \tag{1}$$

Denoting $\delta_l(k,m,j,i) = 1 - 10^{-\gamma(k,m,j,i)/20}$ and $\delta_h(k,m,j,i) = 10^{\gamma(k,m,j,i)/20} - 1$, we can reformulate (1) as:

$$p(k,m,j,i)\,\bigl(1-\delta_l(k,m,j,i)\bigr) \le p_w(k,m,j,i) \le p(k,m,j,i)\,\bigl(1+\delta_h(k,m,j,i)\bigr). \tag{2}$$

As for the embedding strategy, we then devise a scheme which aims at maximizing the correlation with the transmitted spread codeword $c$ subject to constraints (2).
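As a rough numerical illustration of constraints (1)-(2), and of why a linear correlation objective under per-sample bounds is maximized by pushing every sample to one of its two bounds, consider the following Python sketch. The helper names `masking_bounds` and `embed`, the random magnitudes and chips, and the constant 3 dB mask are assumptions made for the example only.

```python
import numpy as np


def masking_bounds(p, gamma_db):
    """Per-sample admissible interval [p*(1 - delta_l), p*(1 + delta_h)] of (2)."""
    p = np.asarray(p, dtype=float)
    gamma_db = np.asarray(gamma_db, dtype=float)
    delta_l = 1.0 - 10.0 ** (-gamma_db / 20.0)
    delta_h = 10.0 ** (gamma_db / 20.0) - 1.0
    return p * (1.0 - delta_l), p * (1.0 + delta_h)


def embed(p, c, gamma_db):
    """Push each magnitude to the bound that agrees with the sign of its chip."""
    lower, upper = masking_bounds(p, gamma_db)
    return np.where(np.asarray(c) > 0, upper, lower)


rng = np.random.default_rng(0)
p = rng.uniform(0.5, 2.0, size=8)   # hypothetical MCLT magnitudes
c = rng.choice([-1, 1], size=8)     # hypothetical spread chips
gamma = np.full(8, 3.0)             # assumed 3 dB mask on every sample

p_w = embed(p, c, gamma)
lower, upper = masking_bounds(p, gamma)

# Draw another feasible point and compare its correlation with c.
x = rng.uniform(lower, upper)
print(np.dot(p_w, c) >= np.dot(x, c))
```

Because the objective is a sum of independent per-sample terms, each term can be maximized separately: a chip equal to $+1$ favours the upper bound and a chip equal to $-1$ favours the lower bound.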
More precisely, the proposed embedding strategy can be formulated as follows:

$$\begin{aligned} p_w = \arg\max_{x}\ & \sum_{k=0}^{K-1}\sum_{m=0}^{M-1}\sum_{j=0}^{n-1}\sum_{i=0}^{G_s-1} x(k,m,j,i)\,c(k,m,j,i) \\ \text{s.t.}\ & p(k,m,j,i)\bigl(1-\delta_l(k,m,j,i)\bigr) \le x(k,m,j,i) \le p(k,m,j,i)\bigl(1+\delta_h(k,m,j,i)\bigr) \quad \forall\, k,m,j,i. \end{aligned} \tag{3}$$

Since the constraints in (3) are defined separately for each sample (i.e., for each index tuple $(k,m,j,i)$), it is straightforward to prove that the optimization defined in (3) is solved by the following embedding strategy:

$$p_w = \begin{cases} (1+\delta_h)\,p & \text{if } c = 1 \\ (1-\delta_l)\,p & \text{if } c = -1 \end{cases} \tag{4}$$

where, in (4), we have omitted the indexes $k$, $m$, $j$ and $i$ for notational simplicity.

D. Detection Metric

In order to perform Viterbi decoding at the receiver, it is necessary to define a metric associated with each observation. To this end, recall that $c(k,m,j)$ represents the portion of the transmitted codeword associated with a single output bit, say $y(k,m,j)$. Accordingly, one has $c(k,m,j) = y(k,m,j)\,g = \pm g$. Hence, consider the despreading correlation:

$$W(k,m,j) = \sum_{i=0}^{G_s-1} g(i)\,r(k,m,j,i) \tag{5}$$

where $r(k,m,j,i)$ are the features associated with the received audio stream. Assume a simple Additive White Gaussian Noise (AWGN) model for the communication channel, i.e., $r(k,m,j,i) = p_w(k,m,j,i) + n(k,m,j,i)$. In this case, omitting the indexes $k$, $m$, and $j$, and denoting $n_i = \sum_{i=0}^{G_s-1} p(i)\,g(i)$ and $n_d = \sum_{i=0}^{G_s-1} n(i)\,g(i)$, from (5) it is straightforward to obtain:

$$W = \begin{cases} \displaystyle \sum_{i:\,g(i)=1} p(i)\,\delta_h(i) + \sum_{i:\,g(i)=-1} p(i)\,\delta_l(i) + n_i + n_d & \text{if } y = 1 \\[2ex] \displaystyle -\sum_{i:\,g(i)=-1} p(i)\,\delta_h(i) - \sum_{i:\,g(i)=1} p(i)\,\delta_l(i) + n_i + n_d & \text{if } y = -1. \end{cases} \tag{6}$$

Note that $n_i$ is an intrinsic zero mean noise term (see footnote 1) introduced by the host signal $p(i)$, and $n_d$ is a zero mean noise term introduced by the channel. Accordingly, the metric $W$ encompasses an unbiased noisy version of the transmitted bit and, hence, it can be directly fed to the input of the Viterbi decoder for estimating the transmitted sequence $w$. More specifically, the Viterbi decoder estimates both $w$ and $v$ by solving the following maximization problem:

$$(\hat{w}, \hat{v}) = \arg\max_{x,y} \sum_{k=0}^{K-1}\sum_{m=0}^{M-1}\sum_{j=0}^{n-1}\sum_{i=0}^{G_s-1} r(k,m,j,i)\,c^{(x,y)}(k,m,j,i). \tag{7}$$

Although the metric in (7) is optimal only in the case of an AWGN channel, we adopt it also in the presence of more realistic channel models. Simulation results, shown in Section VI, will confirm the effectiveness of this choice. It is worth noting that the elements in (5) that correspond to high values of $p(k,m,j,i)$ contribute more than the others to the estimation of the transmitted bit. This means that the effective spreading length $G_s$ would reduce if $p(k,m,j,i)$ presented high fluctuations over $i$. On the other hand, recall that, according to the time-frequency signal representation considered in this work, the values $\{p(k,m,j,i)\}_{i=0}^{G_s-1}$ come from the same subband of $N_f$ samples. Hence, for small values of $N_f$, they represent MCLT magnitudes concentrated in a narrow frequency range, i.e., they are not expected to present strong fluctuations.

E. Codeword Selection

As implied by the scheme in Fig. 1, codeword selection is carried out with the aim of generating, within the bin, the transmitted codeword which yields the maximum correlation after data embedding. More precisely, denote by $c^{(w,v)}$ the transmitted spread codeword for the combination $(w, v)$ of information and free bits.
Hence, the free bits $v$ can be set according to the following optimization strategy:

$$\hat{v} = \arg\max_{x} \sum_{k=0}^{K-1}\sum_{m=0}^{M-1}\sum_{j=0}^{n-1}\sum_{i=0}^{G_s-1} p_w(k,m,j,i)\,c^{(w,x)}(k,m,j,i) \tag{8}$$

where $p_w(k,m,j,i)$ is defined in (4). Such a maximization can be accomplished by means of the Viterbi algorithm with observations $p_w(k,m,j,i)$. In particular, since a subset of the input bits is fixed, the Viterbi algorithm is run over a restricted trellis, i.e., we consider only those state transitions corresponding to the actual values of $w$.

Footnote 1: This is due to the fact that the spreading sequences have zero mean.

V. THE ACOUSTIC CHANNEL MODEL

Accurate modeling of the air acoustic communication channel stems from the seminal work by Schroeder [22], [23]. We assume that there is no distortion in the transmission and acquisition phases; this might not be the case in the presence of low quality loudspeakers or microphones. Under the assumption of a single loudspeaker (mono transmission), denoting as $x(t)$ the transmitted audio signal ($t \ge 0$, with $x(t) = 0$ for $t < 0$), the received signal $r(t)$ can be expressed as follows:

$$r(t) = \underbrace{G_{\mathrm{amp}}\,\frac{x(t-t_{\mathrm{flight}})}{t_{\mathrm{flight}}}}_{(A)} + \underbrace{\sum_{i=1}^{n_{\mathrm{rif}}} G_i\,\frac{x(t-t_i)\,(1-\alpha_i)}{t_i}}_{(B)} \tag{9}$$

where the two macro-addenda correspond to: (A) the direct audio signal; (B) the deterministic reflections. The coefficients appearing in the two macro-addenda have the following meanings: $G_{\mathrm{amp}}$ represents the transmission gain between loudspeaker (transmitter) and microphone (receiver) (dimension: [dBW]); $t_{\mathrm{flight}}$ is the propagation time of the direct audio signal; $n_{\mathrm{rif}}$ is the (deterministic) number of reflections; $\{G_i\}$ are the gains relative to the paths of the various reflections; $\{t_i\}$, with $t_{\mathrm{flight}} < t_1 < \ldots < t_{n_{\mathrm{rif}}} < t_{\mathrm{flight}} + t_{\mathrm{mix}}$, are the arrival delays of the $n_{\mathrm{rif}}$ reflections; $\{\alpha_i\}$ are the absorption coefficients of the surfaces over which the sound reflects ($0.2 < \alpha_i < 0.3$, $\forall i$).
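A discrete-time reading of the model in (9) may help to fix ideas. The Python sketch below is a simplification under stated assumptions: the function name `acoustic_channel` is hypothetical, delays are rounded to whole samples, times are expressed in seconds (so the overall scale is arbitrary), and the sampling rate, delays, gains and absorption coefficients in the example are illustrative values, not the experimental settings of the paper.

```python
import numpy as np


def acoustic_channel(x, fs, t_flight, g_amp, delays, gains, alphas):
    """Direct path plus deterministic reflections, as in (9).

    Delays are rounded to an integer number of samples (assumption of this
    sketch); each path is attenuated by the inverse of its propagation time.
    """
    x = np.asarray(x, dtype=float)

    def delayed(sig, t):
        # Shift the signal by round(t * fs) samples, padding with zeros.
        d = int(round(t * fs))
        out = np.zeros_like(sig)
        if d < len(sig):
            out[d:] = sig[:len(sig) - d]
        return out

    # (A) direct audio signal
    r = g_amp * delayed(x, t_flight) / t_flight
    # (B) deterministic reflections
    for t_i, g_i, a_i in zip(delays, gains, alphas):
        r += g_i * (1.0 - a_i) * delayed(x, t_i) / t_i
    return r


# Impulse response for illustrative (assumed) parameters.
x = np.zeros(32)
x[0] = 1.0
r = acoustic_channel(x, fs=1000, t_flight=0.012, g_amp=1.0,
                     delays=[0.014, 0.016], gains=[0.8, 0.8],
                     alphas=[0.25, 0.25])
print(np.nonzero(r)[0])
```

Feeding a unit impulse exposes the taps of the channel: one for the direct path and one for each reflection, each scaled by its gain, its absorption factor and the inverse of its delay.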
For the direct signals and the reflections, the attenuation is inversely proportional to the propagation time.

VI. PERFORMANCE RESULTS

In this section, we present the performance results obtained by applying the proposed DTC-SS scheme, implemented in Matlab, to 3 different single track (mono) audio clips, experimentally acquired with a sampling frequency $f_s$ = Hz. The audio clips are referred to as talk, rock, and jazz. They represent, respectively, a speech track with musical background (extracted from a movie) and two different music styles (rock and jazz). The duration of each clip is around 45 s, thus yielding around $N_t$ = samples. The slot length $N$ is set to 512 samples; this corresponds to nearly 11.5 ms, which is a sufficiently short interval to assume that the signal is stationary within $N$ samples. The watermark is inserted in the frequency interval [1400, 4400] Hz, i.e., $f_{\min} = 1400$ Hz and $f_{\max} = 4400$ Hz. Note that the whole audio signal band cannot be used for watermark embedding, because the high frequencies may be eliminated by an MPEG audio encoder or by the microphone. Moreover, at frequencies below 1400 Hz, the imperfections of the psycho-acoustic model could lead to audible distortions. We have then set $N_f = 4$, which yields a subband width $N_f f_s/(2N)$ approximately equal to 170 Hz. Moreover, we have set $N_s = 64$, so that the product $N_fN_s$ is equal to 256, i.e., the information rate is $R_b$ = bps.

As for channel coding, we use a rate-1/2 terminated convolutional code with 16 states and generator polynomials (in octal form) (23, 35). Such a coding scheme achieves a free distance equal to 7 [24]. The number of information bits $K$ of each codeword is set to $K = 68$ (which corresponds to 4 RBs, i.e., $H = 4$). Note that, owing to the presence of the termination bits, the actual information rate is slightly lower than $R_b$. In particular, we have a net rate equal to bps. The value of $K$ is chosen to guarantee that the decoder can detect a hidden message in a relatively short time interval, namely, in a few seconds. For the proposed coding schemes, we will consider different values of $M$. Note that $M = 1$ means that no free bit is added at the input of the encoder and, hence, the system turns out to be a non-informed embedder with classical clean trellis encoding. Conversely, the higher $M$, the higher the degree of dirtiness. Note that, since $N_fN_s$ is fixed (i.e., the rate is fixed), the higher $M$, the lower $G_s$. In other words, for a fixed rate, higher dirtiness is achieved at the expense of a lower spreading gain.

As for the propagation environment, we refer to the channel model shown in (9). More precisely, we consider 4 deterministic reflections, i.e., $n_{\mathrm{rif}} = 4$, and we set $G_{\mathrm{amp}} = 1$ (0 dBW), $G_i \simeq 0.8$ and $(1-\alpha_i) = 0.75$, $i = 1, \ldots, n_{\mathrm{rif}}$ (macro-addendum (B)). Moreover, in order to set the delay terms $t_{\mathrm{flight}}$ and $\{t_i\}_{i=1}^{n_{\mathrm{rif}}}$, we consider a typical medium room environment (between 30 m³ and 40 m³), with $t_{\mathrm{flight}} = 12$ ms (which corresponds to a distance $d_{\mathrm{flight}} = 4$ m) and $t_i = \{2, 4, 6, 8\}$. In this case, the coherence bandwidth is nearly 340 Hz, i.e., twice the subband width.

As for the audio sensor (microphone) model, we consider a very simple model which encompasses only additive white noise. In particular, the Signal-to-Noise Ratio (SNR) is used to provide a measure of the introduced noise level. As a matter of fact, beyond the introduction of white noise, real audio sensors introduce both linear and nonlinear distortions.
As for linear distortions, they can be easily prevented by limiting the bandwidth of the embedded data, so that the frequency response can be assumed flat in the region of interest, i.e., in the $(f_{\min}, f_{\max})$ interval. Nonlinear distortions depend on nonlinearities in the whole reproduction and acquisition chain and, hence, they are very difficult to model, since they strongly depend on the employed devices. A thorough investigation of the impact of microphone nonlinearities goes beyond the scope of this paper.

In Fig. 2, the BER is shown, as a function of the SNR, for the three considered audio clips: (a) talk, (b) rock, and (c) jazz. In each case, various values of $M$ are considered, namely 1, 2, 4. In all cases, $\gamma_{\max}$ is set to 3 dB. The following observations can be drawn from the results in Fig. 2. First, for each type of audio clip, the relative behaviour of the curves associated with different values of $M$ is similar. In particular, the best performance is obtained with $M = 2$. The performance with $M = 4$ is generally worse than that with $M = 1$, except for the jazz audio clip (Fig. 2 (c)), where the scheme with $M = 4$ outperforms that with $M = 1$ for medium-to-high SNRs. This means that in the considered scenario a slight level of dirtiness is beneficial. At the same time, pushing the informed approach too far does not provide any additional advantage, and it often leads to a performance loss.

Fig. 2. BER, as a function of the SNR, for the considered audio clips: (a) talk, (b) rock, and (c) jazz. In each case, various values of M are considered.

Second, from the results in Fig. 2 one can also observe that different audio clips yield different performance, with the rock audio clip exhibiting the best performance and the jazz audio clip the worst one.
In particular, considering the case with $M = 2$, the rock audio clip (Fig. 2 (b)) approaches a BER equal to $10^{-3}$ for an SNR value around dB, with gains of 9 dB and 14 dB with respect to the talk (Fig. 2 (a)) and jazz (Fig. 2 (c)) audio clips, respectively. The reason behind this difference among the three audio clips resides in the different spectrum shapes. In particular, the more noise-like the spectrum, the higher the allowed watermarking distortion. For instance, the Data-to-Watermark Ratio (DWR) (i.e., the ratio between the average signal energy and the average watermark energy) takes values approximately equal to 20 dB, 22 dB, and 26 dB in the rock, talk and jazz cases, respectively.

As a final test, we evaluate the robustness of the proposed scheme against time desynchronization. In particular, a time desynchronization $\delta \in (-1, 1)$ corresponds to a scenario where the receiver and the transmitter are misaligned by $\delta T_s$, where $T_s = 2N/f_s = 23$ ms is the time slot. In Fig. 3, the BER is shown, as a function of the desynchronization $\delta$, considering the three audio clips of Fig. 2. The SNRs are set to 23 dB, 17.5 dB and 30 dB for the talk, rock and jazz audio clips, respectively. In all cases, $M = 2$ (the optimized value, in all cases, according to the results in Fig. 2).

Fig. 3. BER versus δ for the considered audio clips. The SNRs are set to 23 dB, 17.5 dB and 30 dB for talk, rock and jazz audio clips, respectively. In all cases, M = 2.

It can be concluded that for desynchronization values with $\delta < 0.05$, i.e., for time lags shorter than 1.1 ms, the BER is not significantly affected, regardless of the type of audio clip. This shows that the proposed scheme has an intrinsic robustness against time shifts. While the need for a dedicated synchronization mechanism cannot be avoided, such a moderate intrinsic robustness to de-synchronization suggests that achieving full synchronization between the watermark embedder and the decoder should not be a difficult task. This remains an open issue to be investigated.

VII. CONCLUSIONS

In this paper, we have proposed a DTC-SS informed audio watermarking scheme in the MCLT domain. The obtained results show that there exists an optimized value of dirtiness (identified by an optimized value, equal to 2, of the parameter $M$) which maximizes the system performance (i.e., minimizes the BER). In particular, increasing the dirtiness beyond this value leads to a performance degradation. While the proposed watermarking algorithm has been analytically derived under the assumption of an AWGN communication channel, its applicability to a realistic multipath audio channel has been verified. Owing to the embedding domain, the proposed algorithm is also moderately robust against time misalignments between the watermark embedder and the detector.

REFERENCES

[1] B. Chen and G. Wornell. Quantization index modulation: A class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inform. Theory, 47(4), May.
[2] I. J. Cox, M. L. Miller, and A. L. McKellips. Watermarking as communications with side information. Proceedings of the IEEE, 87(7), July.
[3] L. Boney, A. H. Tewfik, and K. N. Hamdy. Digital watermarks for audio signal. In International Conference on Multimedia Computing and Systems, Hiroshima, Japan.
[4] N. Cvejic, A. Keskinarkaus, and T. Seppanen. Audio watermarking using m-sequences and temporal masking. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, USA.
[5] D. Kirovski and H. Malvar. Robust spread spectrum audio watermarking. In Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), Salt Lake City, UT, USA, May.
[6] H. Kim. Stochastic model based audio watermark and whitening filter for improved detection. In Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), Istanbul, Turkey, June.
[7] S. K. Lee and Y. S. Ho. Digital audio watermarking in the cepstrum domain. IEEE Trans. on Consumer Electronics, 46(3), August.
[8] J. Seok, J. Hong, and J. Kim. A novel audio watermarking algorithm for copyright protection of digital audio. ETRI Journal, 24(3), June.
[9] B.-S. Ko, R. Nishimura, and Y. Suzuki. Time-spread echo method for digital audio watermarking. IEEE Transactions on Multimedia, 7(2), April.
[10] P. Moulin and J. A. O'Sullivan. Information-theoretic analysis of information hiding. IEEE Trans. Inform. Theory, 49(3), March.
[11] P. Moulin. The role of information theory in watermarking and its application to image watermarking. Signal Process., 81(6).
[12] A. S. Cohen and A. Lapidoth. The Gaussian watermarking game. IEEE Trans. Inform. Theory, 48(6), June.
[13] S. Wu, J. Huang, D. Huang, and Y. Q. Shi. Efficiently self-synchronized audio watermarking for assured audio data transmission. IEEE Trans. on Broadcasting, 51(1):69-76, March.
[14] M. L. Miller, G. J. Doerr, and I. J. Cox. Applying informed coding and embedding to design a robust, high capacity, watermark. IEEE Trans. on Image Processing, 13(6), June.
[15] A. Abrardo, M. Barni, F. Pérez-González, and C. Mosquera. Improving the performance of RDM watermarking by means of trellis coded quantisation. IEE Proc.-Inf. Secur., 153(3), September.
[16] H. S. Malvar. Signal Processing with Lapped Transforms. Artech House.
[17] H. S. Malvar. A modulated complex lapped transform and its application to audio processing. In Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), Phoenix, AZ, USA, March.
[18] M. Arnold and K. Schilz. Quality evaluation of watermarked audio tracks. SPIE Proceedings, 4675:91-101, April.
[19] P. Bassia, I. Pitas, and N. Nikolaidis. Robust audio watermarking in the time domain. IEEE Transactions on Multimedia, 3(2), June 2001.

[20] N. Cvejic and T. Seppanen. Improving audio watermarking scheme using psychoacoustic watermark filtering. In IEEE International Symposium on Signal Processing and Information Technology, Cairo, Egypt.
[21] T. Painter and A. Spanias. Perceptual coding of digital audio. Proceedings of the IEEE, 88(4), April.
[22] M. R. Schroeder. Natural sounding artificial reverberation. J. Audio Engineering Society, 10(3), July.
[23] M. R. Schroeder. Digital simulation of sound transmission in reverberant spaces. J. Acoustical Society of America, 47(2).
[24] J. G. Proakis and M. Salehi. Communication Systems Engineering. Prentice Hall, Upper Saddle River, NJ, USA, 2002.
