Parallel Image Filtering Using WPVM in a Windows Multicomputer

Luís Fabrício W. Góes {lfwg@pucmg.br}
Luiz Eduardo S. Ramos {luizedu@pucmg.br}
Carlos Augusto P. S. Martins {capsm@pucminas.br}
Computer Science Department / Post-Graduation Program in Electrical Engineering
Pontifical Catholic University of Minas Gerais
Av. Dom José Gaspar 500, 30535-610 Belo Horizonte, MG, Brazil
Telephone/Fax: 55-31-33194305

ABSTRACT

In this paper we present a parallel implementation of image filtering using WPVM (Windows Parallel Virtual Machine), a message passing library. The main goal is to develop an implementation of an image filtering operation that maintains image quality and reduces response time. We analyze its performance in a Windows multicomputer using different filtering mask dimensions and types, image sizes and numbers of computers employed in the image filtering.

1. INTRODUCTION

Images are very important to applications in many areas of knowledge, such as computation, entertainment and communication [1]. Image processing techniques are used to manipulate a given image so that the result is more suitable for a specific application (e.g., linear image processing, edge detection, feature extraction, template matching, regularization theory, morphological image processing, etc.) [2][3].

Some image processing operations, like filtering or interpolation, are based on convolution [1], which is a common image processing technique used in many computer vision tasks [2][4]. Convolution is tied to a Fourier transform relationship that constitutes a basic link between the space and frequency domains [3]. This relation holds because the image filtering process corresponds to a convolution in the space domain between the image pixels and the convolution mask, which is the image filter in the space domain. Equivalently, the filtering process corresponds to a multiplication in the frequency domain between the Fourier transforms of the digital image and of the image filter.
Thus in this process a new filtered image is generated [1].

The image filtering process is computationally intensive [2], and the number of computations involved tends to increase as large, high-resolution images become more common [5][6]. For this reason, several techniques have been employed to increase the speed of computations at the algorithmic and architectural levels [4][5][6][7][8][9]. One of them is parallel processing, which has started to dominate the supercomputing industry [2].

One trend in parallel processing is towards the use of clusters [2], which consist of a collection of autonomous general purpose workstations connected by a local area network (LAN) or system area network (SAN) to form a parallel computer [10]. These multicomputers are commonly used to reduce an application's response time or to increase its throughput [11].

The communication between the nodes of a cluster can be carried out by applications based on the message passing paradigm. This model generally uses the explicit programming method, in which the interaction between processes, the data allocation and the workloads must be specified by the programmer [11]. A frequently used option for implementing parallel applications based on this model is a message passing library such as PVM, MPI or MPI-2 [10].

PVM (Parallel Virtual Machine) is a public domain system that uses message passing and is presently available for different platforms and operating systems [11][12]. It consists of a user library and a daemon process running on each workstation. Among the MS Windows PVM implementations there is WPVM (Windows PVM), presented in [13].

1.1. MOTIVATION

The demand for computational resources increases as real-time and complex image processing applications become more common. Image filtering involves a great number of computations and requires alternative high-performance solutions to deliver results within a previously defined time. The filtering calculations are independent of each other, so they can be carried out in parallel [7]. Parallel processing therefore offers a low-cost, high-performance solution.

1.2. GOALS AND OBJECTIVES

The main goal of this work is to develop an implementation of an image filtering operation that maintains image quality and reduces response time. The main objective is to analyze the performance of the parallel implementation using different filtering mask dimensions and types, image sizes and numbers of computers employed in the image filtering.

2. TRADITIONAL FILTERING OPERATION

Image processing techniques in the space domain category are based on the direct manipulation of the pixels of an image, while techniques in the frequency domain category are based on the modification of the Fourier transform of the image [1][3].

Figure 1. 3x3 neighborhood centered at (x,y) using a square shaped mask.

Space domain methods are procedures that operate directly on the aggregate of pixels that compose an image. Equation 1 expresses the space domain image processing functions, where g(x,y) is the processed image resulting from the operation T on the original image f(x,y). The operator T is defined over some neighborhood of (x,y) [3].

$$ g(x,y) = T[f(x,y)] $$ (equation 1)

The main approach to defining a neighborhood about (x,y) is to use a sub-image area centered at (x,y), as shown in Figure 1. The sub-image commonly has a square or rectangular shape, because these are easier to implement [3].
The sub-image is moved from pixel to pixel in a specified direction (e.g., from left to right, row by row) and the operator is applied at each (x,y), generating a processed value that is used to compose g(x,y) [3][5][7]. A threshold can be used to eliminate insignificant results. This approach is based on two-dimensional arrays called masks (filters), whose coefficients are chosen to detect a given property in an image [3].

Frequency domain techniques are based on the convolution theorem, found in [3]. In equation 2, g(x,y) is an image formed by the convolution of an image f(x,y) and a position-invariant operator h(x,y). The operator h(x,y) is invariant because its result depends on the value of f(x,y) at a given point in the image (and not on the position of that point) [3]. Equation 2 also describes a spatial process that is analogous to the one described for the spatial-domain methods.

$$ g(x,y) = h(x,y) * f(x,y) $$ (equation 2)

Equation 3 is a frequency-domain relation that can be obtained from the convolution theorem, where G, H and F are the Fourier transforms of g, h and f, respectively. It is important to notice that H(u,v) (called the transfer function) and h(x,y) must be of the same size, because the convolution theorem requires this.

$$ G(u,v) = H(u,v) \cdot F(u,v) $$ (equation 3)

From the relations established above, we define a 2D convolution operation in equation 4, where I[0..N-1, 0..N-1] is an NxN matrix containing the image to be processed, C[0..N-1, 0..N-1] is the processed image and T[0..M-1, 0..M-1] is the convolution (filtering) mask, also called the kernel.

$$ C[i,j] = \sum_{u=-M}^{M} \sum_{v=-M}^{M} I[i+u,\, j+v] \cdot T[u,v] $$ (equation 4)

In the traditional sequential implementation, each element of the I and C matrices corresponds to the value of a single image pixel (commonly an 8-bit value representing a gray level) and each element of T corresponds to a kernel weight [2].
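The sum in equation 4 can be sketched in C, the language the paper reports for its sequential version. This is a minimal sketch and not the authors' code: the names convolve2d and clamp_u8 are hypothetical, the mask is assumed to have an odd dimension m and to be stored from index 0 with its center at offset r = m/2, results are assumed to be clamped to the 8-bit range, and border pixels are left black as Section 3 describes. Like equation 4, it applies the mask without flipping it.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a filtered value to the 8-bit gray-level range (an assumption;
   the paper does not say how out-of-range results are handled). */
static unsigned char clamp_u8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (unsigned char)(v + 0.5);
}

/* Filter an n x n gray-level image `in` with an m x m mask (m odd),
   writing the result to `out`. Images and mask are row-major.
   Pixels closer than r = m/2 to the border are left black. */
void convolve2d(const unsigned char *in, unsigned char *out, size_t n,
                const double *mask, size_t m)
{
    assert(m % 2 == 1 && m <= n);
    size_t r = m / 2;

    for (size_t k = 0; k < n * n; ++k)
        out[k] = 0;                      /* black border */

    for (size_t i = r; i + r < n; ++i) {
        for (size_t j = r; j + r < n; ++j) {
            double acc = 0.0;
            /* Weighted sum over the m x m neighborhood of (i, j). */
            for (size_t u = 0; u < m; ++u)
                for (size_t v = 0; v < m; ++v)
                    acc += in[(i + u - r) * n + (j + v - r)] * mask[u * m + v];
            out[i * n + j] = clamp_u8(acc);
        }
    }
}
```

Replacing the contents of mask is all that is needed to switch between, for example, a low-pass and an edge detection filter.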

A discrete image convolution in the space domain is the process of converting a source image into a different one through filtering, that is, applying a convolution mask (kernel) with discrete values to an image. An application can therefore perform different kinds of filtering simply by replacing the mask used in the convolution [5][7].

3. PARALLEL FILTERING IMPLEMENTATION

The proposed image filter implementation consists of a master-slave process farm, in which slave processes execute computations under centralized control (the master process) [5][11].

In our implementation, a master process (controller) creates the slave processes and specifies the different stripes (image regions or sub-images) to be filtered by them. It then sends a small message to each slave containing data about a stripe. This message tells the slave which stripe it must load and filter (see Figure 2). It is important to remark that physical copies of the images were previously placed on each node of the cluster, so that the master process did not need to send the image stripes to its slaves, avoiding extra communication overhead. Likewise, a slave process was able to load a sub-image independently of the master and of the other slaves.

A slave process creates an internal stripe formed by the received stripe plus the intersections (also shown in Figure 2), which are necessary for the filtering operation. Then it carries out the filtering and sends its results to the master.

Figure 2. Original image divided into N stripes (stripe 1 to stripe N) with intersections (intersection 1 to intersection N-1) between neighboring stripes.

Another important remark is that slave processes do not send whole stripes all at once after the entire calculation. They send a number of messages whose sizes are defined by the user before the execution.
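The stripe decomposition with intersections described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the type stripe_t and the function make_stripe are hypothetical names, stripes are assumed to be horizontal bands of rows, leftover rows are assumed to go to the first stripes, and the intersection is assumed to be r = m/2 rows on each side for an m x m mask.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t first;       /* first image row owned by the stripe          */
    size_t count;       /* number of rows owned by the stripe           */
    size_t load_first;  /* first row to load, including the intersection */
    size_t load_count;  /* rows to load, including the intersections     */
} stripe_t;

/* Describe stripe k (0-based) of n_stripes over an image with `rows`
   rows, extended by r intersection rows above and below where they exist. */
stripe_t make_stripe(size_t rows, size_t n_stripes, size_t k, size_t r)
{
    assert(n_stripes > 0 && n_stripes <= rows && k < n_stripes);

    stripe_t s;
    size_t base  = rows / n_stripes;   /* rows every stripe receives    */
    size_t extra = rows % n_stripes;   /* leftover rows, one per stripe */

    s.first = k * base + (k < extra ? k : extra);
    s.count = base + (k < extra ? 1 : 0);

    /* Extend by the intersection, clipped to the image. */
    size_t lo = s.first >= r ? s.first - r : 0;
    size_t hi = s.first + s.count + r;
    if (hi > rows)
        hi = rows;

    s.load_first = lo;
    s.load_count = hi - lo;
    return s;
}
```

In this sketch the master would only need to send k, n_stripes and the mask size to each slave, which matches the small control message the text describes; the slave then loads rows load_first to load_first + load_count - 1 from its local copy of the image.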
Sending the results in several messages allows the messages from different slaves to be interleaved, which helps to avoid message collisions and message waiting and reduces the bottleneck caused by having a single master. Finally, the master process collects and merges the processed stripes, building a new image. To simplify the filter implementation we did not handle the border effect, so the borders of filtered images were simply left black.

4. EXPERIMENTAL METHOD

The experimental method consists of three stages: sequential implementation, parallel implementation and performance analysis.

In the first stage, a sequential image filtering operation is implemented and validated through qualitative and quantitative analysis, comparing the results with those of commercial image processing software. In the second stage, a parallel image filter is implemented and validated by comparing its results with the sequential ones. In the last stage, a performance analysis is done by varying the filtering masks, the image sizes and the number of computers used in the parallel image filtering. Metrics such as speedup, efficiency and response time are used to analyze the results.

To perform the tests, we used Prober version 2.0, presented in [10], a functional and performance analysis tool for parallel programs, proposed and developed during an undergraduate research project by our group. The tool was chosen because it combines the features of monitors, performance analyzers, benchmarks and job management systems, making testing, monitoring and performance analysis easier.

5. EXPERIMENTAL RESULTS

The first step was the development of a sequential implementation, coded in C, for the Windows environment. It allows masks of different dimensions to be applied to gray-scale images of different sizes.
This version was validated through qualitative analysis (subjective image quality) and quantitative analysis (absolute difference image and histogram comparison) of its results against those generated by JASC Paintshop Pro and Adobe Photoshop, two well-known commercial image processing packages. Three image filtering masks were used: edge detection, low-pass and high-pass. Figures 3, 4 and 5 show three images before and after the filtering operation performed by our implementation.

Figure 3. Low-pass smoothing filter with 5x5 mask on a 512x512 image.

Figure 4. 3x3 Laplacian edge detection mask on a 512x512 image.

Figure 5. High-pass filter with 11x11 mask on a 512x512 image.

The next step was the implementation of the parallel version using the WPVM message-passing library. We validated this version by comparing sequential results with parallel ones. Quantitative and qualitative analyses showed that the filtered images were the same for all implementations.

To analyze the performance of our implementation, tests were run in a homogeneous multicomputer environment. The environment is a cluster of nine Pentium III 1.0 GHz machines (with 128 MB of RAM), running WPVM over the Windows 98 operating system. A Fast Ethernet switch interconnects the nodes of the cluster. The tests were performed using five filtering mask dimensions (3x3, 5x5, 7x7, 9x9 and 11x11 pixels) on images of three sizes (512x512, 1024x1024 and 2048x2048 pixels). Each combination was executed for 10 iterations. As an example, we used the border detection mask. The results were collected by Prober through its batch execution mechanism, for which we created submission files specifying the number of iterations, the arguments, the variation of these arguments and the executable files.

As stated in the description of the parallel algorithm, we assume that the images are on each machine before the program starts to run. This avoids the communication overhead of transferring the image. The algorithm has another special feature: the size of the message that carries the filtered lines from a slave process can be selected. It is possible to specify how many bytes, in other words how many result lines, the slave sends back to the master in each message during its processing. Instead of sending all their results at once, slaves send parts of their results at intervals.

To reduce the number of tests, given the large number of mask dimensions used, we decided to analyze the impact of the message size only for the 11x11 mask, which requires the most processing time. Figure 6 shows the response time using eight slave processes. The minimum message size permitted is the size of one line of the image. Varying the message size from 512 bytes to 256 Kbytes, we observed that sending short messages (1 to 4 lines) tends not to lead to good results, particularly on large images, because the overhead of packing and sending messages becomes larger than the processing time, decreasing performance. A large image has a coarse grain, and short messages prevent it from taking advantage of this feature. The largest message size for each image also showed a decrease in performance, because larger messages increase the probability of collisions and waiting on the network.
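The relation between the user-selected message size and the number of result lines per message can be sketched as below. This is a hedged sketch, not the authors' code: the function names rows_per_message and message_count are hypothetical, pixels are assumed to take one byte each (8-bit gray levels), and a message is assumed to carry only whole lines, with one line as the minimum as stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Lines carried by one result message of at most msg_bytes bytes for an
   image that is `width` pixels wide (one byte per pixel). A message never
   carries less than one line. */
size_t rows_per_message(size_t width, size_t msg_bytes)
{
    assert(width > 0);
    size_t rows = msg_bytes / width;
    return rows > 0 ? rows : 1;
}

/* Number of messages a slave needs to return stripe_rows filtered lines. */
size_t message_count(size_t stripe_rows, size_t width, size_t msg_bytes)
{
    size_t per_msg = rows_per_message(width, msg_bytes);
    return (stripe_rows + per_msg - 1) / per_msg;   /* ceiling division */
}
```

Under these assumptions, a slave holding a 256-line stripe of a 2048-pixel-wide image would send 16 messages of 16 lines each with a 32 Kbyte message size, but 256 messages with a one-line message size, which illustrates why very short messages raise the packing and sending overhead.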

Figure 6. Response time vs. size of message using an 11x11 mask.

In the next experiments, which involve other mask sizes, we used a message size of 32 Kbytes as a standard, because it presented good results on all images (see Figure 6).

Figure 7. Response time vs. size of the mask using eight slave processes.

The size of the mask influences the processing time needed to perform the operation. Despite the variation in mask size, Figure 7 shows that the communication overhead prevailed over the processing time on the 512x512 and 1024x1024 images, leading to a constant response time. Because of its coarser grain (more processing), the 2048x2048 image presented a longer response time as the mask size increased. In this case, the processing time prevailed over the communication time.

In Figure 8, we can observe the speedup for all image sizes using an 11x11 edge detection mask. The 512x512 image had no speedup because the computation part, or grain size, was too small. Nevertheless, the 1024x1024 image reached a speedup of .8 for 6 processes, after which the performance started to decrease. As we expected, the 2048x2048 image presented the best result, reaching a speedup of 4.7 for 8 processes, showing that the program is scalable up to 8 processes. We observed that the larger the image and the filter, the better the speedup, because the computation part increases and prevails over the communication overhead.

Figure 8. Speedup vs. number of slave processes using an 11x11 mask.

Another important performance metric is efficiency. For all images the efficiency decreased as the number of processes increased. For the small images the rate of decrease was high, although the efficiency tends to become constant beyond a certain number of processes. For the 1024x1024 and 2048x2048 images, the rate of decrease was lower, demonstrating better speedup scalability (Figure 9).

Figure 9. Efficiency vs. number of slave processes using an 11x11 mask.

6. CONCLUSION

Based on the results, we conclude that our proposed implementation executes the image filtering operation with no loss in image quality while achieving speedup and efficiency, in accordance with our main goal. The use of this implementation reduced the filtering response time in many situations and with different masks. A parallel image filter is more efficient on larger images using larger filtering masks (a large number of arithmetic operations in comparison with the communication overhead). On small images using small masks, however, our image filter implementation was not as efficient.

The size of the result messages influences the performance of the program. Both very small and very large messages generate considerable communication overhead (packing, sending and network collisions). A medium-sized message (between 4 and 32 lines of the image) proved to be a good choice, increasing the performance of the program.

In addition, the source image must already be located on each node (machine) of the cluster. This may not be suitable for cases where the image must be transmitted, but the implementation can easily be altered to support image transmission, possibly at the cost of performance.

Our implementation has some limitations: it uses a high-latency Fast Ethernet network and the TCP/IP protocol, which are inadequate for clusters. These limitations should be eliminated in future versions.

The final conclusion is that our parallel image filter implementation is more efficient than the traditional sequential one. The major contribution of this work is a parallel implementation of the image filtering operation that reduces response time while maintaining image quality. Our implementation has shown that the filtering problem is scalable up to eight machines, but is not limited to that number. The use of a public domain message passing implementation in enterprise clusters provides low cost, high performance and good scalability for the image filtering operation.

7. FUTURE WORKS

As future work we propose the implementation of parallel interpolation operations, which have a coarser grain than the image filtering presented in this article.
We are also interested in comparing the use of other parallel and distributed message passing systems, such as MPI, RMI and CORBA.

8. ACKNOWLEDGEMENTS

We would like to thank the Department of Mechanical Engineering for lending the laboratory, and PIBIC, CNPq and the Computer Science Department for their support.

9. REFERENCES

[1] C. Martins, J. Zuffo, and S. Kofuji, Two Dimensional Normalized Sampled Finite Sinc Reconstructor, in AeroSense 97, Proc. SPIE-3074, SPIE, Orlando, 1997.
[2] M. Hamdi, and C. Lee, Efficient Image Processing Applications on a Network of Workstations, in Proc. Computer Architectures for Machine Perception (CAMP 95), 1995 (ieee).
[3] R. Gonzalez, and P. Wintz, Digital Image Processing, 2nd edition, Addison-Wesley Publishing Company Inc, 1987.
[4] G. Erten, and F. Salam, Real-Time Realization of Early Visual Perception, Inter. Journal of Computers and Electrical Engineering, Vol. 4-/3, 1999.
[5] M. Clement, et al., Parallel Algorithms for Image Convolution, Technical Report (www.cse.ucsd.edu/users/rbharath/papers.html), 1998.
[6] X. Zhang, and S. Dykes, Distributed Edge Detection: Issues and Implementations, IEEE Computational Science & Engineering, 1997 (ieee).
[7] R. Bharath, Parallel Implementations of Image Algorithms with MPI, unpublished technical report, 2001 (www.cse.ucsd.edu/users/rbharath/papers/cse60a.pdf).
[8] S. Dykes, An Efficient Data Parallel Algorithm for 2-D Convolutions, Master's Thesis, University of Texas at San Antonio, 1994.
[9] X. Zhang, and S. Dykes, Folding spatial image filters on the CM-5, in Proc. 9th International Parallel Processing Symposium (IPPS 95), 1995.
[10] L. Goes, L. Ramos, and C. Martins, Prober: Uma Ferramenta de Análise Funcional e de Desempenho de Programas Paralelos e Configuração de Cluster, in Proc. WSCAD 01, 2001 (in Portuguese).
[11] K. Hwang, et al., Scalable Parallel Computing, WCB/McGraw-Hill, 1998.
[12] "PVM: Parallel Virtual Machine" (www.epm.ornl.gov/pvm/pvm_home.html).
[13] A. Alves, et al., WPVM: Parallel Computing for the People, in Proc. HPCN'95, 1995.