CHAPTER 4 - BREAST CANCER STAGE DETECTION (BCSD) USING MULTI VIEW UNIVARIATE CLASSIFICATION

4.1 INTRODUCTION

Data mining techniques are used in many kinds of medical image analysis, as described in the previous chapters. Clustering, classification and associative relationship techniques are used for descriptive mining, while neural network models and genetic approaches are used for predictive modelling. Medical image analysis and clinical decision-making systems require highly accurate results because every decision directly affects human life. Mammographic analysis was identified as a promising research area for data mining techniques, and its literature was reviewed in the previous chapters. The third chapter dealt with the medical image analysis process using data mining techniques. Breast cancer can be detected using clinical diagnosis, mammographic analysis, ultrasound and computer-aided detection methods. This research work is limited to the analysis of digital mammograms using data mining techniques. Before the techniques are discussed, the indication of breast cancer and the importance of its detection are described.

4.2 BREAST CANCER INDICATION

A malignant tumour is a cluster of cancer cells that may invade nearby tissues or spread (metastasize) to distant areas of the body. Breast cancer is a malignant tumour that starts in the ducts or lobules of the breast, as shown in the following figure. Cells lining the ducts or lobules can grow uncontrollably and develop into cancer. Some breast cancers are discovered when they
are still confined to the ducts or lobules of the breast. This is termed pre-invasive breast cancer [online 52,53,54,55]. The most common types are ductal carcinoma in situ (DCIS) and lobular carcinoma in situ (LCIS), as shown in Figures (a) and (b) respectively.

(a) Ductal carcinoma
(b) Lobular carcinoma

Diagnosis and early detection go hand in hand in battling breast cancer, and early detection is the key to increasing one's survival rate. The stages of breast disease management are shown in the figure below. Signs and symptoms discovered through breast self-examination (BSE) or clinical breast examination (CBE), or an abnormality detected by screening mammography, will often initiate diagnosis. Most of these women have benign lesions in spite of suspicious physical or radiological findings [Brown ML et al 1995]. As such, these women are put through unwarranted multiple diagnostic tests such as additional mammography, ultrasound, fine-needle aspiration and open surgical biopsy [Peer PGM et al 1996]. This often results in anxiety for patients and excessive health-care spending [Davies, et al 1995]. It is hence important to safely reduce the number of diagnostic procedures undertaken by these women through cost-effective means [M. Faupel et al 1997].
In this approach, mammographic analysis plays a role in both the detection and the diagnosis process. This research work is carried out with the multi view univariate classification approach described below.

4.3 DIGITAL MAMMOGRAM CLASSIFICATION FOR BCSD

The method starts from the mammographic image and proceeds to the determination and verification of the cancer stage. The collected digital mammogram image is processed and its pixel values are converted into the corresponding Digital Numbers or index values. The digital mammogram values are classified by univariate classification using multi view analysis. Through the clustering and classification process, the density level of the digital mammogram area is identified.
The clustering analysis is implemented with a direct image method and a rotational conversion method. In the direct method the image is processed with the ascending red16, auxctq16, BWlnVLog16, BW Parabolic16, correction16, cyclic16, descending red16, Design16, Grayscale16, Hot body16, Hot metal16, Isocount16, Heart16, Rainbow16, Red16, Spectrum, Parathyroid16 and Warm metal16 attributes. Here 16 denotes the bit depth of the process and the attribute name denotes the preprocessed image. The rotational method uses the Mirror +90, Mirror -90, +90, -90, magnify + and Magnify properties. After all the above methods have been applied, the average and the frequency of the image stage are considered to determine the stage level. The index value of the identified area and its property are used for the stage level calculation using the average method. The grade and stage of the cancer are determined so that practitioners can recommend prevention possibilities. The data source is described below.

4.3.1 Data Collection

The images are collected from the cancer institute laboratory, and 18 image properties are then applied to each original image, yielding 12 color images and 6 grayscale images. The image properties include ascending red16, descending red16, red16, region16, mirror, magnify, minify, Spectrum, bwlog, bwparabolic, designer16, -90, +90 and so on. The following figure shows the original image of the patient Amul. The size of this image is 1012 x 688 pixels. The original image and its preprocessed versions are described below.

a) Original image: It is obtained directly from the mammographic process, in which the radiation is converted into a digital image and presented as a JPEG image after RGB conversion.
b) Ascendingred16: The values are presented in ascending order without changing the pixel positions. The image is presented as an array of values grouped in sorted order.

c) Bwlnvlog16: This is a black-and-white conversion of the grayscale image using logarithmic values.

d) Correction16: The image values are converted from the 8-bit unit representation and corrected for a 16-bit representation. Each pixel is represented using 16 bits (2 bytes): 5 bits for red, 6 bits for green and 5 bits for blue. The total number of possible colors is about 65 thousand (256 x 256 = 65,536).

e) Grays16: The intensity of a pixel is expressed within a given range between a minimum and a maximum, inclusive. This range is represented in an abstract way as running from 0 (total absence, black) to 1 (total presence, white), with any fractional values in between. This notation is used in academic papers, but it does not define what "black" or "white" is in terms of colorimetry. Conversion of a color image to grayscale is not unique; different weightings of the color channels effectively represent the effect of shooting black-and-white film with different-colored photographic filters on the camera. A common strategy is to match the luminance of the grayscale image to the luminance of the color image. To convert any color to a grayscale representation of its luminance, one must first obtain the values of its red, green and blue (RGB) primaries in linear intensity encoding, by gamma expansion.
Figure panels: (a) original, (b) ascending red16, (c) bwlnvlog16, (d) correction16, (e) grays16

f) Invert grayscale: Each converted grayscale value is complemented with respect to the maximum value of 255, and the image is constructed from the complements.

g) Mag125: The image is magnified to 125% of its existing pixel size.

h) Micro delta Hot Metal: The ratio values are fixed according to the metal radiation, the RGB values are converted according to the reflection, and the image is constructed. This color scheme is a true color presentation with 32-bit values.

i) Mini 08: The 16-bit converted values are presented in the 8-bit value range.

j) Mirror: The pixel positions are turned through 180° and the image is generated from the relocated positions.

k) Para thyroid: The pixels are reworked with the 16-bit range values using thermal reflection.

l) Rainbow: The image is presented with the VIBGYOR color combination according to the range of the pixels.
m) +90 and -90: The pixel directions are rotated by a θ value of +90 or -90 from the current position of the pixel.

Similarly, the geometric properties are processed through the Syngo Fast View tool, which is used for the preprocessing image conversion, and the following preprocessed sub-images are obtained.

Figure panels: auxctq16, heart16, invert grayscale16, mag125, micro delta hot metal16, mini08, mirror, rainbow16, spectrum16, warm metal16
Figure panels: -90, Parathyroid, Stars16, +90

Preprocessing includes cropping, which is used to cut away the black parts of the image and the written labels. It removes the unwanted parts of the image to improve its appearance, for example by changing the aspect ratio. This operation is done using image editing software such as Adobe Photoshop, a pixel-based image editor used for editing, animating and authoring. The original image contains 1012 x 688 pixels. After cropping, the image is 260 pixels wide and 555 pixels high. The horizontal resolution is 300 dpi. The vertical resolution is 555 dpi. The bit depth is 24. The frame count is 1. The color representation is RGB.

Syngo fast view is a standalone viewing tool for DICOM images. It is used to generate the ascending red16, auxctq16, BWlnVLog16, BW Parabolic16, correction16, Cyclic16, Descending red16, Design16, Grayscale16, Hot body16, Hot metal16, Iso count16, Heart16, Rainbow16, Red16, Spectrum, Parathyroid16, Warm metal16, Mirror +90, Mirror -90, +90, -90, magnify + and Magnify attributed images for the clustering process.
The image is swept vertically and horizontally to make all images the same size. To sweep the image vertically, a 2.2 x 4.7 inch region is cut; to sweep it horizontally, a 4.7 x 2.2 inch region is cut, using Adobe Photoshop. The image resolution is the DPI of the output device, i.e. 300 dpi. After cropping, the size of the image is 0.87 x 1.85 inches. Good-quality 300 dpi preprocessed images are used for the clustering process in MATLAB 7.0. There are 4 images for each patient. Each image is clustered into 5 clustered images, and 18 properties are applied to each image. In total, 420 images are analyzed for each patient to identify the cancer stage. The preprocessed images of the four patients are attached as Appendix I.

4.4 DIGITAL MAMMOGRAM INTO DIGITAL NUMBERS

An image consists of a number of pixels arranged in rows and columns. The position of a pixel is determined by an x, y co-ordinate system. Each pixel is associated with a digital number; digital numbers range from zero to some higher number on a grayscale. A pixel can therefore be described numerically in a three co-ordinate system x, y and Z, where (x, y) is the co-ordinate position and Z is the intensity, i.e. the digital number that is displayed as a grayscale intensity value. In this mammographic analysis each pixel is a combination of R, G and B values. In this process the unwanted parts of the image are removed and only the breast area is retained, so that all images are of equal size and the Region of Interest can be processed at the same resolution. If an image is resized, differing pixel counts become equal. Selecting a Region of Interest means selecting a part of an image on which to perform some operation. In our research, we have selected the whole breast area as the Region of Interest.
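The description above of a pixel as a position plus R, G and B Digital Numbers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `image_to_dn_rows` and `split_layers` and the tiny 2 x 2 sample image are hypothetical, and the original work was carried out in MATLAB 7.0, whose code is not shown in this chapter.

```python
def image_to_dn_rows(image):
    """Flatten a row-major RGB image into (x, y, R, G, B) records.

    `image` is a list of rows; each row is a list of (R, G, B) tuples.
    x is the column index and y is the row index of a pixel.
    """
    records = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            records.append((x, y, r, g, b))
    return records


def split_layers(records):
    """Return the R, G and B Digital Numbers as three separate lists."""
    red = [rec[2] for rec in records]
    green = [rec[3] for rec in records]
    blue = [rec[4] for rec in records]
    return red, green, blue


# Tiny 2 x 2 stand-in for a cropped mammogram (illustrative values only).
sample = [
    [(134, 125, 137), (141, 134, 132)],
    [(144, 141, 141), (145, 139, 141)],
]
records = image_to_dn_rows(sample)
red, green, blue = split_layers(records)
```

Plain lists are used here so the sketch has no dependencies. For a full 555 x 260 region an array library would probably be the more natural choice.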
The image is represented in multi-dimensional layers based on digital numbers (DNs). The DNs are represented per layer, namely R, G and B. This combined array is organised by layer, and the layer arrays are converted into an equivalent two-dimensional array. Sample pixel Digital Numbers are presented below. A true color image is specified by three values per pixel, one each for the red, green and blue components, giving an array of 555 x 260 x 3.

Layer Red            Layer Green          Layer Blue
134 141 147 128      125 134 143 146      137 132 139 139
144 145 148 147      141 139 146 138      141 141 153 151
134 126 126 125      141 135 126 111      145 129 120 130
127 139 145 144      140 139 135 126      123 125 139 137
144 137 146 140      138 139 148 145      131 148 153 143
127 128 131 122      143 139 116 106      139 133 134 132

4.5 MULTI VIEW UNIVARIATE CLASSIFICATION

Classification refers to the process of grouping items that have similar feature values. Classification can be done in two ways: based on layers or based on range values. Univariate classification is the method used in this research work. It explores untransformed or transformed datasets to analyze (classify and re-classify) image data and to display continuous field data. In the majority of cases these procedures perform a classification that is based purely on the input dataset, without reference to separate external evaluation criteria.
In almost all instances the objects to be classified are regarded as discrete, distinct items that can only reside in one class at a time. Separate schemes exist for classifying objects that have uncertain class membership and/or unclear boundaries, or which require classification on the basis of multiple attributes, referred to here as equal interval values. Typically the attributes used in classification have numerical values of real or integer type. In most instances these numeric values represent interval or ratio-scaled variables. The table below provides details of a number of univariate classification schemes together with comments on their use. A useful variant of the method, known as hybrid equal interval, in which the inter-quartile range is itself divided into equal intervals, does not appear to be implemented in mainstream software.

Selected univariate classification schemes

Classification scheme: Description / application

Unique values: Each value is treated separately, for example mapped as a distinct color.

Manual classification: The analyst specifies the boundaries between classes required as a list, or specifies a lower bound and interval, or lower and upper bound plus number of intervals required.

Equal interval, Slice: The attribute values are divided into n classes with each interval having the same width = range/n. For raster maps this operation is often called slice.

Defined interval: A variant of manual and equal interval, in which the user defines each of the intervals required.
Exponential interval: Intervals are selected so that the number of observations in each successive interval increases (or decreases) exponentially.

Equal count or quantile: Intervals are selected so that the number of observations in each interval is the same. If each interval contains 25% of the observations the result is known as a quartile classification. Ideally the procedure should indicate the exact numbers assigned to each class, since they will rarely be exactly equal.

In the equal interval method, the attribute values are divided into n classes with each interval having the same width = range/n. This method helps to determine the range values of the available attribute values. The classification is not exact because of the multi-variable nature of the image. In this work, the range values are determined for an assumed number of clusters of five. The interval is calculated according to the unique values method. Therefore the color range 0-255 is divided into 5 ranges with rounded bounds. The interval is 256/5 ≈ 51.

Range       1     2     3     4     5
Starting    0     51    103   155   207
End         50    102   154   206   255
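The range table above can be turned into a small lookup, sketched here in Python. The lower bounds are copied from the table rather than recomputed, because an exact split of 256 values into five classes gives a width of 51.2 and the tabulated bounds are rounded. The names `RANGE_STARTS`, `equal_interval_width` and `cluster_of` are hypothetical, and mapping range k to cluster number k - 1 (clusters 0 to 4) is an assumption made for illustration.

```python
from bisect import bisect_right

# Lower bounds of the five ranges, copied from the range table.
RANGE_STARTS = [0, 51, 103, 155, 207]


def equal_interval_width(minimum, maximum, n_classes):
    """Width of one class under the equal interval rule: range / n.

    The range is counted as the number of distinct values, so 0..255
    gives 256 values.
    """
    return (maximum - minimum + 1) / n_classes


def cluster_of(dn, starts=RANGE_STARTS):
    """Return the cluster number (0 to 4) whose range contains `dn`."""
    if not 0 <= dn <= 255:
        raise ValueError("digital number must lie in 0..255")
    # bisect_right counts how many lower bounds are <= dn.
    return bisect_right(starts, dn) - 1


width = equal_interval_width(0, 255, 5)
clusters = [cluster_of(dn) for dn in (0, 50, 51, 102, 103, 206, 207, 255)]
```

Looking up the bounds with `bisect_right` keeps the boundary handling in one place: a Digital Number equal to a starting value falls into the range that starts there, which matches the Starting and End rows of the table.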
The classification is done with individual layer values and with combined layer values. In the individual layer case, only the values of the selected layer are considered for the classification and the remaining values are extended. The clustered image is adopted according to the affected area. The affected image part is classified into 5 clustered images, numbered 0 to 4. Based on the concept discussed above, the univariate algorithm is derived and presented below for the classification.

4.6 UNIVARIATE CLASSIFICATION ALGORITHM

The preprocessed image is classified according to the range values. The range values are divided into five ranges, the Region of Interest is processed, and the classified pixels are grouped and presented as a sub-image. The algorithm starts from the initial stage of data set collection from the patient: the image is captured using a radiation sensor and represented as a digital image. The digital image is preprocessed, the Region of Interest is identified and the image is converted into Digital Numbers. The digital values are divided into ranges according to the concepts presented above. The range values are applied to the three layer values, and the pixels of the affected areas are identified and clustered together. The non-zero elements (NZE) of the affected ROI's pixels are processed and the density is calculated. The high density values are computed for around 18 geometrically preprocessed images, which are classified. From the computed density values the frequency is computed. From this computation, the final density index is computed and the corresponding stage is identified. The density index procedure is presented below.

Algorithm for Analysis of Digital Mammogram for Breast Cancer Stage Detection Using Multi View Univariate Classification Data Mining Techniques
Step 1: Profile of the patient
    Get the patient ID
    If the patient ID exists in the DB
        Load the patient history
    Else
        Create a new profile for the patient

Step 2: Reading of the patient mammographic image
    If (pid = loaded profile)
        Load the array of images
    Else
        Print (invalid patient id)

Step 3: Preprocessing of the image
    Resize the image as per the evaluation pre-definitions {image size is 0.87 x 1.85}

Step 4: Image conversion with attributes
    Load the image
    Get the attribute image property number between 1 and 21
    Apply the attribute to the image
    Assign the attributed image as the process image

Step 5: Reading the mammographic process image
    Get file name
    If file format = *.jpg then
        Read the pixel from the selected file
        Convert the pixel into the corresponding (R, G, B) values
        Store X, Y, R, G, B into an array p_id_img()
        Repeat to read the pixels till the end of the file
    Else
        Invalid file format
    Count the number of pixels in the process image

Step 6: Determination of minimum and maximum
    Find the maximum of the range
    Find the minimum of the range
    Get the number of clusters {five by default, because of the cancer stages}
    Find the difference between the minimum and the maximum
    Calculate the interval using the difference of the DNs

Step 7: Fixation of cluster
    Fix the start range and end range for each cluster
    Read the pixel
    Evaluate the pixel (R, G, B) against the range values
    Assign the cluster number according to the range value evaluation

Step 8: Construction of sub cluster image
    Read the cluster number of every pixel
    Group the pixels by cluster number
    Construct the image for the clustered pixels

Step 9: Process on clustered image
    Count the number of pixels in the process image
    Identify the NZE
    Count the number of NZE
    Get the DN values of the NZE
    Calculate the sum of DNs of R
    Calculate the sum of DNs of G
    Calculate the sum of DNs of B
    Determine the percentage of DNs of R
    Determine the percentage of DNs of G
    Determine the percentage of DNs of B

Step 10: Density determination
    Percentage of pixel = (Number of NZE / total number of pixels) x 100
    Single pixel density R = sum of DNs of NZE of R / total number of NZE
    Affected density on R (ADR) = Single pixel density R / Percentage of pixel
    Single pixel density G = sum of DNs of NZE of G / total number of NZE
    Affected density on G (ADG) = Single pixel density G / Percentage of pixel
    Single pixel density B = sum of DNs of NZE of B / total number of NZE
    Affected density on B (ADB) = Single pixel density B / Percentage of pixel

Step 11: Average of affected density (AAD)
    AAD = (ADR + ADG + ADB) / 3

Step 12: Repeat step 5 to step 11 for all attributes of the selected patient

Step 13: High density identification
    Select the high density value of each sub clustered attributed image
Step 14: Frequency of high density
    Identify the high density sub cluster image and its frequency

Step 15: Frequency average of density (FAD)
    Calculate the sum of high density
    FAD = Sum of high density / frequency

Step 16: Stage determination
    Density Index = (FAD / 20) - 1
    If FAD < 1 then the affected tissues may be fat formation (benign)
    Else if FAD < 20 then stage is 0
    Else if FAD < 40 then stage is 1
    Else if FAD < 60 then stage is 2
    Else if FAD < 80 then stage is 3
    Else stage is 4

Step 17: Correlation determination
    Calculate the correlation on the density

Step 18: Evaluation of the stage
    Verify the determined stage of the patient index with an expert and validate it

According to the above algorithm, the following range values are obtained.
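Steps 10, 11, 15 and 16 of the algorithm above reduce to a few arithmetic rules, sketched below in Python. The function names are hypothetical. Two points are assumptions rather than statements from the algorithm text: the percentage is scaled by 100, which is what the tabulated sample results imply, and the "frequency" in Step 15 is taken to be the number of high density values being averaged. The sample call uses the red-layer figures for Cluster 1 from Table 1 in Section 4.7.

```python
def affected_density(nze_count, dn_sum, total_pixels):
    """Step 10 for one layer of one cluster.

    Returns (percentage of pixels, single pixel density, affected density).
    The factor 100 is an assumption inferred from the tabulated results.
    """
    percentage = 100.0 * nze_count / total_pixels
    single_pixel_density = dn_sum / nze_count
    return percentage, single_pixel_density, single_pixel_density / percentage


def average_affected_density(adr, adg, adb):
    """Step 11: mean of the affected densities of the R, G and B layers."""
    return (adr + adg + adb) / 3


def frequency_average_density(high_densities):
    """Step 15: sum of the high density values divided by their count."""
    return sum(high_densities) / len(high_densities)


def stage_from_fad(fad):
    """Step 16: map FAD to a label using the thresholds 1, 20, 40, 60, 80."""
    if fad < 1:
        return "benign (possible fat formation)"
    for stage, upper in enumerate((20, 40, 60, 80)):
        if fad < upper:
            return f"stage {stage}"
    return "stage 4"


# Red layer, Cluster 1: 19504 non-zero elements, DN sum 1571552,
# 144300 pixels in the ROI.
pct, spd, adr = affected_density(19504, 1571552, 144300)
```

Rounded to two decimals, the three values agree with the corresponding row of Table 1. The stage labels returned by `stage_from_fad` follow the thresholds literally and are not a clinical judgement; Step 18 still requires verification by an expert.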
4.7 UNIVARIATE RANGE RESULTS

In digital image editing, layers are used to separate different elements of an image. There are different types of layers; they represent a part of a picture either as pixels or as modification instructions. The simplest kind of layer is called the basic layer. It contains just a picture, which can be superimposed on another one. From the image, the affected area is identified. The selected ROI is processed, and the three layers of the converted pixels are clustered using the range values specified in Section 4.5. The clustered images are processed to obtain the number of elements and the total number of non-zero elements (NZE), from which the density is computed. The pixel density is computed from the total sum of the pixels and the number of affected pixels according to the range values. All the range values are computed and the percentage of pixels is determined; a sample calculation is presented below.

The name of the patient is Amul. Four images are processed and fetched for the analysis. As a sample, the result for the first image is presented; all the image results are attached as Appendix III. The preprocessing property is ascending Red 16. The total number of pixels in the ROI is 144300.

Table 1: Affected density in Red Layer

Name        NZE      Sum         % Red    Single Pixel Density    Affected Density in Red
Cluster 0   88528    906237      61.35    10.24                   0.17
Cluster 1   19504    1571552     13.52    80.58                   5.96
Cluster 2   21488    2782105     14.89    129.47                  8.69
Cluster 3   5022     849674      3.48     169.19                  48.61
Cluster 4   9758     2486869     6.76     254.85                  37.69

Table 2: Affected density in Green Layer

Name        NZE      Sum         % Green  Single Pixel Density    Affected Density in Green
Cluster 0   88528    16862576    61.35    190.48                  3.10
Cluster 1   19504    2321412     13.52    119.02                  8.81
Cluster 2   21488    2476934     14.89    115.27                  7.74
Cluster 3   5022     573311      3.48     114.16                  32.80
Cluster 4   9758     2152809     6.76     220.62                  32.62

Table 3: Affected density in Blue Layer

Name        NZE      Sum         % Blue   Single Pixel Density    Affected Density in Blue
Cluster 0   88528    18785858    61.35    212.20                  3.46
Cluster 1   19504    2189329     13.52    112.25                  8.30
Cluster 2   21488    2326645     14.89    108.28                  7.27
Cluster 3   5022     452196      3.48     90.04                   25.87
Cluster 4   9758     2121168     6.76     217.38                  32.15

Based on the above calculation for each layer, the average affected density is computed and presented for the patient Amul, for the first image with the first geometric preprocessing.

Table 4: Average affected density of patient Amul

Cluster     Affected Density in Red    Affected Density in Green    Affected Density in Blue    Average
Cluster 0   0.17                       3.10                         3.46                        2.24
Cluster 1   5.96                       8.81                         8.30                        7.69
Cluster 2   8.69                       7.74                         7.27                        7.90
Cluster 3   48.61                      32.80                        25.87                       35.76
Cluster 4   37.69                      32.62                        32.15                       34.15

The highest value and its corresponding cluster number are identified. From the table values, cluster 3 is marked, with the highest average density of 35.76. The execution process is carried out with two different ways of selecting attributes, namely common attributes and multiple attributes.
The selection of common attributes means applying the same attributes to identify the density ratio of four different images from the same patient. The attributes are applied to the same selected region of interest, the values are classified into five clusters, and their densities are calculated. The density ratio is computed, and from it the stage is obtained. With multiple attributes, the same methods are used but the attributes selected for image preprocessing differ from one image to another, because the mammographic images are not taken from the same angle and position. According to the observation side, position, area and size of the mammographic images, suitable attributes are selected from the 64 attribute process.

4.8 SUMMARY

This chapter provides a detailed account of the breast cancer detection procedure and the available methods. The geometrical preprocessing of the captured digital mammographic image and the possible preprocessed images are presented. The conversion of the image into digital numbers with the RGB color scheme is explained as part of the image conversion. The classification algorithm is described along with its concept and attributes. The major attributes obtained, the executed results and their interpretation are discussed in the next chapter.