PROJECT REPORT. Using Deep Learning to Classify Malignancy Associated Changes


Hakan Wieslander, Gustav Forslid
Project in Computational Science: Report, January 2017
Department of Information Technology

Abstract

Screening specimens from the cervix is an important way to discover cervical cancer in time. Today these screenings are done manually by trained medical experts. The procedure is time consuming and thus costly. A suggested solution is to use deep learning for detecting malignancy associated changes. The scope of this project was to use convolutional neural networks to classify the different cell types associated with cervical cancer. Since most malignancy associated changes are present in the nuclei of the cells, one approach is to segment out the nuclei and train the networks on those. As a comparison, a case with unsegmented images is also investigated. The results showed that a deeper network was needed for the segmented images and a shallower network for the unsegmented images. With the unsegmented images we found that the network can distinguish between normal cells and more cancerous cells. With the segmented images the networks got high scores for the low-grade cancerous cells but could not distinguish between the other cell types. In conclusion, there is potential in the deep learning approach for detecting malignancy associated changes, but more extensive studies are needed.

Contents

1 Introduction
    Background
    Purpose of project
2 Theory
    Cervical Cancer
    Deep Learning
        Convolutional Neural Networks
            Convolutional Layer
            Pooling layer
            Fully Connected Layer
            ReLU Layer
            Dropout layer
            Batch Normalization layer
        Backpropagation
        Stochastic Gradient Descent
        Caffe
    Image Processing
        Watershed segmentation
        Morphological dilation
        Normalization
    The Dataset
3 Method
    Data collection
    Image pre-processing
    Deep Learning
4 Result
    Network architecture
    Testing result
        Deeper architecture
        Smaller architecture
5 Discussion
    Network Architecture
    Images
        Segmented Images
        Unsegmented Images
6 Conclusion
    Using Deep Learning to classify Malignancy Associated Changes
    Future work

1 Introduction

1.1 Background

Cervical cancer was for a long time one of the deadliest cancer types for women. Nowadays women have a few options to prevent or reduce the suffering cervical cancer can cause. In highly developed countries many women are vaccinated against cervical cancer. For those who are not vaccinated it is recommended to take a test every five years. The most common such test is the pap-smear screening. A pap-smear is a specimen taken from the cervix, and the screening is done by trained medical experts. Because the pap-smears are examined manually, the procedure is time consuming and thus expensive. As a result, some countries do not provide these screenings, or the women themselves cannot afford to be screened. Since most cervical cancer cases occur in developing countries, one might conclude that it is the women in these countries who suffer the most.

Deep learning can be traced back to the 1960s, but it is not until the last decade that the field exploded. The reason for this is the possibility to run all the mathematical operations on a graphics processing unit (GPU). As the field grew, so did the range of applications for deep learning. A Convolutional Neural Network (CNN) is a type of deep learning method that can be used for image classification and feature detection. A CNN is built up of layers containing weights and biases. These weights and biases are updated during training and optimized with steepest descent. The capability of classifying images at a high rate can be applied to many different fields.

1.2 Purpose of project

The purpose of this project is to investigate whether deep learning can be applied as a classification tool in the normal procedure of pap-smear screenings. The benefit would be that medical experts only need to screen the pap-smears containing cancerous cell types instead of screening all the pap-smears.
This classification would then create a much more time-efficient procedure and hopefully, in the end, help more women to be screened. The following tasks are pursued:

    Determine the best way to process the images.
    Construct the best architecture for classifying the different cell types.

2 Theory

2.1 Cervical Cancer

Cervical cancer is a malignant disease in the cervix. Smoking, reproductive history and number of sexual partners can be risk factors, but the leading cause is infection by the Human Papillomavirus (HPV). HPV is a common sexually transmitted infection whose effect varies between individuals. In some cases the infection regresses without treatment, but in other cases it develops into cervical cancer. This transformation is very slow: for women with a normal immune system it can take up to years to develop into cervical cancer. For women with a damaged immune system it might only take 5-10 years. The earlier the cancer is discovered, the more likely it is that the patient can be cured [1].

Malignancy Associated Changes are referred to as MAC. MAC describes subtle changes in the texture and morphology of a cell nucleus. These subtle changes are difficult to distinguish in practice, but have been shown to be a reliable source for detecting cancerous cells [2].

The system for reporting a cervical diagnosis is called The Bethesda System. This system is used for pap-smear screenings, and below follows a list of the most common cell types [3].

    NILM - Negative for Intraepithelial Lesion or Malignancy
    LSIL - Low-grade Squamous Intraepithelial Lesion
    HSIL - High-grade Squamous Intraepithelial Lesion
    SCC - Squamous Cell Carcinomas
    Adenocarcinoma - Adenosquamous Carcinomas
    ASC-H - Atypical Squamous Cells, which cannot exclude a High-grade lesion
    ASC-US - Atypical Squamous Cells of Undetermined Significance

2.2 Deep Learning

Deep learning is a subset of machine learning. The main difference is that deep learning methods are able to extract features from a data set automatically, instead of being told which features to look for. The theory behind deep learning dates back to the 1960s [4]. At that time computers could not handle the large number of mathematical operations that deep learning requires. At the beginning of the 21st century computers could finally perform at the required rate, which led to more work and research in the deep learning field. In the mid-2000s it became possible to run these mathematical operations on a graphics processing unit (GPU). By 2009 it was almost 20 times faster to run a mathematical operation on the GPU than on a CPU [5]. Today deep learning is one of the fastest growing fields in computer science. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual challenge on classification and detection of hundreds of categories in millions of images, has since 2012 been won exclusively by convolutional neural networks [6].

2.2.1 Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a type of deep learning method inspired by the animal visual cortex. Each cell in the visual cortex is sensitive to a small region of the visual field, called its receptive field. This inspiration resulted in the spatial structure of a CNN, where a specific region in one layer is connected to a specific region in the next layer [7]. CNNs are constructed from neurons which have learnable weights and biases. The spatial structure creates a volume of these neurons with a width, height and depth. The depth represents the number of filter-kernels each layer contains, which is also the number of outputs that layer produces. When training a network the filter-kernels will learn to identify different features in different parts of the input image.
This means that the more features the input images contain, the more filter-kernels might be needed [8].

The architecture of a CNN can vary a lot. Depending on the purpose of the CNN the architect can choose different layer types, the number of layers and the construction of each layer. These choices are referred to as the hyperparameters of the network. The most common layers are convolutional layers, pooling layers and fully connected layers. Other examples are ReLU layers, batch normalization layers, softmax layers and dropout layers. When training a CNN you have to include some metric of how well or badly the network performs. This is called a loss function and can be constructed in a couple of different ways. The loss function takes the label of the input image, compares it to the network prediction and calculates the loss. Minimizing the loss makes the network better and better at predicting its input images [9].

Figure 1: Example architecture of a simple Convolutional Neural Network

2.2.1.1 Convolutional Layer

The main task of convolutional layers is to detect local conjunctions of features in the input. Convolutional layers are built up of learnable filter-kernels with a set size. Each convolutional layer has a spatial architecture given by the height and width of the filter-kernels and a depth equal to the number of filter-kernels. The filter-kernels are convolved over the input, where at each position they perform a dot product that produces one output value, as can be seen in Figure 2. Each convolutional layer has a set stride, which indicates how far the filter-kernel is moved before performing a new dot product. During training each filter-kernel will start looking for some specific feature. The first convolutional layer usually picks up primitive features, e.g. vertical and diagonal lines. The deeper the convolutional layer is placed, the more advanced the features its filter-kernels will pick up.
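The sliding dot product and the stride described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the Caffe implementation used in the project: the function name `conv2d_single` is hypothetical, a single 2-D filter-kernel is used, no padding is applied, and the kernel is not flipped (as is common in CNN frameworks).

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one filter-kernel over a 2-D input (no padding, kernel not flipped)
    and compute a dot product at every position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The stride decides how far the kernel moves between dot products.
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Illustrative 5x5 input and a 3x3 kernel of ones, moved two pixels at a time.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
result = conv2d_single(image, kernel, stride=2)
print(result.shape)  # (2, 2)
```

With a stride of 2 the 5x5 input yields a 2x2 output, which shows how a larger stride reduces the spatial size of the layer output.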


Figure 2: Convolving a 3x3 filter-kernel over an input and producing an output with a dot product.

2.2.1.2 Pooling layer

Pooling layers are often inserted after a convolutional layer. The main task of the pooling layers is to decrease the spatial size of the representation. The reduced spatial size means fewer computations, which speeds up the training process. There are two types of pooling layers that are more common than others. The first one is called Max Pooling and is described in Figure 3a. Here a kernel of set size is slid over the input and the output is the maximum value inside the kernel. The second type is Average Pooling and is described in Figure 3b. Here a kernel of set size is also slid over the input, and the output is the average of the pixel values inside the kernel. The pooling layer operates on each depth slice of the input independently and resizes it spatially [8].

(a) Max pooling. (b) Average pooling.
Figure 3: Pooling

2.2.1.3 Fully Connected Layer

Fully connected layers are referred to as FC. The main task of FC layers is to compute the class scores, which indicate which class the network predicts the input to belong to. This can be described as the classification step in a CNN. Neurons in the FC layer are connected to all neurons in the previous layer, which is why they are called fully connected. The depth of the final FC layer equals the number of classes the network has. The output from the last FC layer represents the probabilities of each class for the input according to the network. Summing all the probabilities in the output of the last FC layer always gives 1. This is due to an activation function called softmax. Softmax takes a vector of real-valued class scores and squashes it into numbers between 0 and 1 that sum to one [8].

2.2.1.4 ReLU Layer

ReLU stands for rectified linear unit and is a non-linear pixel-by-pixel operation.
ReLU layers are often systematically inserted in a CNN architecture after a convolutional layer. The main reason for adding ReLU layers is to introduce non-linearity in the CNN. This is needed since almost all real-world data is non-linear. The operation is defined in Equation (1). It takes the input value, compares it with zero and picks the larger of the two as the output [10].

$$\mathrm{Output} = \max(0, \mathrm{Input}) \tag{1}$$

2.2.1.5 Dropout layer

Dropout layers are included in the architecture to increase the accuracy by reducing overfitting. Overfitting is a phenomenon that occurs when a network is trained with too few images. The network becomes good at classifying the images in the training set, but has a lower accuracy when classifying any other images. During backpropagation, when the network adapts its weights and biases, the network becomes fragile and creates a structure that only works for the training data. Dropout layers counteract these fragile networks by breaking up some relations at random. For each input batch in the training process some connections between two layers are temporarily removed from the network. Dropout can be seen as training multiple thinner networks which share the same weights and biases. A drawback of dropout layers is that the training time increases: in general, training a network with dropout takes 2-3 times longer than without [11].

2.2.1.6 Batch Normalization layer

Batch normalization layers are included to speed up the training process. Since the input of each layer in a CNN is affected by the parameters of the previous layers, small changes might be amplified through the network. This phenomenon is referred to as internal covariate shift. Batch normalization layers reduce this phenomenon by first dividing the input data into mini-batches and then normalizing these. The normalization is done so that the features in each mini-batch are in the same range for each layer. By applying batch normalization to the network architecture the learning rate can be much higher and the network is less sensitive to initialization. In some cases it can remove the need for a dropout layer, as it can act as a regularizer.
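The per-mini-batch normalization can be sketched as follows. This is a simplified training-time forward pass only, assuming a 2-D input of shape (batch, features); the function name `batch_norm` and the example values are hypothetical. The learnable scale `gamma` and shift `beta` follow the formulation in [12], and the running averages used at test time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta              # learnable scale (gamma) and shift (beta)

# Two features on very different scales end up in the same range.
batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
normed = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
```

After normalization both feature columns have approximately zero mean and unit variance, regardless of their original scale.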
In the original study, batch normalization reached the same accuracy with approximately 14 times fewer training steps and also improved the final result [12].

Backpropagation

After the class scores are calculated in the FC layer, the loss is calculated by comparing the network's prediction for the input images with the ground truth. For the network to learn from its errors, the error is backpropagated through the network. In the backpropagation the gradient of the error is calculated with respect to the weights and biases, which are then updated so as to minimize the error of the output [10]. One of the most common ways to perform this update is stochastic gradient descent.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is an algorithm used for updating the weights and the biases during the backpropagation. The update procedure is shown in Equation 2, where $\theta$ are the parameters, $\alpha$ is the learning rate, $x^{(i)}$ is the training example and $y^{(i)}$ is its label.

$$\theta = \theta - \alpha \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)}) \tag{2}$$

The SGD algorithm is designed so that it can evaluate the gradient after seeing just a small part of the data set. This can be compared with standard gradient descent, which has to see the entire data set before doing the evaluation [13].

Caffe

Caffe is a framework designed for deep learning. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. The Caffe framework stores and manipulates the data in the network in blobs, which are used as the memory interface between the layers. Each layer has a top blob and a bottom blob. A layer gets its data from the bottom blob and outputs data to the top blob [14].

2.3 Image Processing

An important part of this project is the image processing. This section introduces some of the important methods used for processing the images before training.


2.3.1 Watershed segmentation

A digital image is represented by the intensity value of each pixel. Dark areas have low intensity values and brighter areas have higher intensity values. The image can be visualized as a landscape with valleys and peaks at the different intensity values. Watershed segmentation can be described as imaginary drops of water flooding the landscape. Each local minimum in the image represents a valley in the landscape. When water from two valleys meets, a border, or watershed, is placed which prevents the water from the two valleys from merging. This creates a representation of the image where only the borders of the watersheds are present. To limit the number of regions that are segmented with watershed you can place seeds at the specific objects of interest and at the background. Those seeds are then the valleys that will be flooded, and all pixels within each region will be given the same label [15].

Figure 4: Visualization of watershed segmentation. Water floods the valleys and the landscape is divided by the watersheds.

2.3.2 Morphological dilation

Morphological dilation is an image processing method for filling gaps and extending object boundaries in an image. The method is mostly used on binary images. Dilation can be seen as a convolution between a mask (structuring element) and the image and is calculated as

$$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \emptyset \,\} \tag{3}$$

where $A$ is the image, $B$ is the mask, $(\hat{B})_z$ is the reflection of $B$ translated by $z$, and $A$ and $B$ are sets in $\mathbb{Z}^2$. The mask is constructed so that the purpose of the dilation is fulfilled. The mask is moved over each pixel of the image, and a pixel is set in the dilated image if the overlap between the mask and the image at that position is non-empty [15].

Figure 5: Effect of dilation on a binary image with a 3x3 square mask

2.3.3 Normalization

Normalization is a process to scale down the range of features in the data. If different features in the data have different scales it might become hard for the network to adjust its weights to favor all those features.
The normalization is usually done in two steps: first the mean of the image is subtracted from each pixel, then the result is divided by the standard deviation, as described in Equation 4 [9].

$$\mathrm{Normalized\ Image} = \frac{\mathrm{Image} - \mathrm{mean}(\mathrm{Image})}{\mathrm{std}(\mathrm{Image})} \tag{4}$$
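Equation 4 translates directly into NumPy. The sketch below is an illustration rather than the code used in the project; the function name `normalize_image` and the guard for constant images are assumptions added here.

```python
import numpy as np

def normalize_image(image):
    """Subtract the image mean and divide by the image standard deviation."""
    image = image.astype(np.float64)
    centered = image - image.mean()
    std = image.std()
    if std == 0:
        # Assumption: a constant image is only centered, to avoid dividing by zero.
        return centered
    return centered / std

# Illustrative 2x2 patch with 8-bit intensities.
patch = np.array([[0, 50], [100, 250]], dtype=np.uint8)
normalized = normalize_image(patch)
```

The normalized patch has zero mean and unit standard deviation, independent of the original intensity range.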

2.4 The Dataset

The data set contained stacked images from pap-smear tests. The images were photographed at different heights so that each cell is in focus in at least one image. All cells were marked with their cell type, the stack image in which they have the best focus, and their coordinates [16]. The number of samples for each cell type is presented below in Table 1.

Table 1: Number of samples in the dataset

Cell Type            Samples
Nilm                 6064
Lsil                 402
Hsil                 471
Scc                  653
Adenocarcinoma       53
Other distorted      17
Other occluded       144
Other inflammatory   53
Other degenerative   16
ASC-H                2
ASC-US               8

3 Method

Since most of the malignancy associated changes are present in the nuclei of the cells, one approach was to segment out the nuclei from each image. As a comparison, a case with unsegmented images was also investigated.

3.1 Data collection

To handle the lack of samples of several cell types (Table 1) we grouped some of the cell types into one class. This class contained Other-distorted, -occluded, -inflammatory, -degenerative, ASC-H and ASC-US. From this we obtained 6 different classes to classify. All marked cells in the data set were cut out in 200x200 segments. At a later stage the images were cropped down to a size of 100x100 with the cell nucleus centered in the image. All images were saved with the name of their specific cell type. The images were divided into three folders, training, validation and testing, as seen in Table 2.

Table 2: Number of samples for training, validation and testing

Cell Type            Training   Validation   Test
Nilm
Lsil
Hsil
Scc
Adenocarcinoma
Other

3.2 Image pre-processing

To segment out the nucleus of each cell, seeded watershed was used. Since the cells have a roughly circular shape, an elliptical area around the cell can be marked as the background. The watershed segmentation gives a labeled representation of the image where the object, the border and the background have different labels.
This image can be made binary by setting all pixels labeled as object or border to one and the background pixels to zero. To make the transition between the background and the nuclei smoother, the images were dilated with a 3x3 square mask three times. The remaining background was set to the mean intensity value of the dilated area. This was done to further smooth the transitions between the nuclei and the background.
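The binarization, dilation and background-filling steps above can be sketched with plain NumPy. This is a minimal illustration and not the code used in the project: the function names `dilate3x3` and `mask_nucleus` are hypothetical, the background label is assumed to be 0, and the label image is assumed to come from a seeded watershed step that is not shown.

```python
import numpy as np

def dilate3x3(mask):
    """Binary dilation with a 3x3 square mask, written as an OR over shifted copies."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def mask_nucleus(image, labels, background_label=0, n_dilations=3):
    """Keep the dilated nucleus region and fill the rest with its mean intensity."""
    mask = labels != background_label      # object and border pixels become True
    for _ in range(n_dilations):
        mask = dilate3x3(mask)
    result = image.astype(np.float64).copy()
    if mask.any():
        result[~mask] = result[mask].mean()  # background set to mean of dilated area
    return result, mask

# Illustrative 9x9 example: dark "nucleus" region on a bright background.
image = np.full((9, 9), 200.0)
image[1:8, 1:8] = 50.0
labels = np.zeros((9, 9), dtype=int)
labels[4, 4] = 1                           # single labeled object pixel in the center
segmented, mask = mask_nucleus(image, labels)
```

In this toy example three dilations grow the single object pixel to a 7x7 region, and the bright border outside it is replaced by the mean intensity of that region.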


Figure 6: Pre-processing from raw image to final segmented image used for training

To handle the unevenly distributed data set (Table 2), the cell types with the fewest samples were augmented. The augmentation was done by first mirroring the image in three ways: up-down, left-right and diagonally, by mirroring up-down and then left-right (Figure 7a). After this, random rotations were made until there was an equal number of images in each class (Figure 7b). To get rid of the rotation border, the center of each image was cropped out in a 100x100 segment. The augmentation was done for the training and validation sets. Lastly the images were labeled from zero to five for the six different classes. Before training the network the images were normalized to scale down the range of the pixel intensities.

(a) Mirrored images. (b) Rotated and scaled images.
Figure 7: Mirrored and rotated images of a cell

3.3 Deep Learning

For this project Caffe was used as the framework for implementing the CNNs. Two different architectures were investigated: one deeper architecture, which was a scaled-down version of the classic AlexNet architecture, and one smaller architecture where batch normalization was introduced. The hyperparameters of the architectures were decided by trial and error.

4 Result

4.1 Network architecture

The resulting network architectures that were investigated are presented below (Table 3 and Table 4). For each network the layers with corresponding hyperparameters and sub layers are presented.

Table 3: Deeper architecture with used layers and sub layers

Layer             Num. of features   Filter size   Pooling (kernel/stride)   Non-linearity   Regularizer
Convolution                                        Max. 3x3, 2               Relu            -
Convolution                                        Max. 3x3, 2               Relu            -
Convolution                                                                  Relu            -
Convolution                                                                  Relu            -
Convolution                                        Max. 3x3, 2               Relu            -
Fully connected                                                              Relu            Dropout
Fully connected                                                              Relu            Dropout
Fully connected

Table 4: Smaller architecture with used layers and sub layers

Layer             Num. of features   Filter size   Pooling (kernel/stride)   Non-linearity   Regularizer
Convolution       32                 5             Max. 3x3, 2               Relu            Batch norm.
Convolution       32                 5             Avg. 3x3, 2               Relu            Batch norm.
Convolution       64                 5             Avg. 3x3, 2               Relu            Batch norm.
Fully connected

4.2 Testing result

The results when testing each network with images not previously seen by the network are presented in four confusion matrices. The confusion matrix presents the ground truth versus the predicted result of the network. Each row corresponds to a ground-truth class and each column to a predicted class.

Deeper architecture

The results when testing the deeper architecture with both segmented and unsegmented images are presented in Table 5 and Table 6.

Table 5: Confusion matrix representing results from segmented images for the deeper network

         Nilm     Lsil     Hsil     Scc      Adeno    Other
Nilm     0.00%    58.91%   1.07%    19.09%   0.00%    20.93%
Lsil     0.00%    76.39%   4.17%    15.28%   0.00%    4.17%
Hsil     0.00%    17.14%   41.43%   38.57%   0.00%    2.89%
Scc      0.00%    36.44%   23.73%   33.05%   0.00%    6.78%
Adeno    0.00%    66.67%   0.00%    0.00%    0.00%    33.33%
Other    0.00%    56.25%   0.00%    25.00%   0.00%    18.75%

Table 6: Confusion matrix representing results from unsegmented images for the deeper network

         Nilm     Lsil     Hsil     Scc      Adeno    Other
Nilm     53.99%   20.00%   7.74%    5.93%    2.14%    10.21%
Lsil     47.56%   26.83%   9.76%    7.32%    2.44%    6.10%
Hsil     7.37%    7.37%    35.79%   35.79%   9.47%    4.21%
Scc      7.52%    14.29%   24.81%   38.35%   8.27%    6.77%
Adeno    8.33%    8.33%    25.00%   17.67%   0.00%    41.67%
Other    30.95%   16.67%   11.90%   4.76%    4.76%    30.95%

Smaller architecture

The results when testing the smaller trained network are presented in Table 7 and Table 8.

Table 7: Confusion matrix representing results for segmented images for the shallower network

         Nilm     Lsil     Hsil     Scc      Adeno    Other
Nilm     0.00%    20.25%   34.79%   27.42%   1.36%    16.18%
Lsil     0.00%    16.67%   44.44%   22.22%   2.78%    13.89%
Hsil     0.00%    21.43%   40.00%   32.86%   4.29%    1.43%
Scc      0.00%    16.10%   36.44%   28.81%   6.78%    11.86%
Adeno    0.00%    0.00%    25.00%   33.33%   0.00%    41.67%
Other    0.00%    15.62%   28.12%   31.25%   3.12%    21.88%

Table 8: Confusion matrix representing results for unsegmented images for the shallower network

         Nilm     Lsil     Hsil     Scc      Adeno    Other
Nilm     58.93%   7.24%    1.98%    0.74%    2.55%    28.56%
Lsil     46.34%   15.85%   3.66%    1.22%    0.00%    32.93%
Hsil     0.00%    30.53%   33.68%   12.63%   5.26%    17.89%
Scc      1.50%    21.05%   24.81%   21.80%   12.03%   18.80%
Adeno    0.00%    8.33%    0.00%    8.33%    0.00%    83.33%
Other    38.10%   4.76%    4.76%    9.52%    0.00%    42.86%

5 Discussion

5.1 Network Architecture

A difficult task in this project was finding a suitable network architecture. Since the choice of architecture is crucial, it might have a large impact on the result. The network architecture was decided by trial and error, where the size and hyperparameters were experimented with. The two architectures presented in the results (Table 3 and Table 4) were the ones that produced the best results when experimenting with different depths and hyperparameters. As can be seen in Table 5 and Table 7, choosing a deeper network for the segmented images generated a slightly better result. The test scores for the Lsil and Hsil cells are slightly higher than for the shallower architecture. For the unsegmented images the shallower network tends to classify images as the Other class more often than the deeper architecture does. An important result is the lower scores in the first column for the Hsil, Scc and Adenocarcinoma cells in the shallower network (Table 8). This indicates that the shallower network is better at distinguishing between the Nilm cells and the cancerous cells.
You might also argue that the deeper network is better at distinguishing between the Lsil cells and the more cancerous Hsil and Scc cells (Table 6), which is also an important result.

5.2 Images

Segmented Images

The best result with the segmented images was obtained using the deeper architecture. When reviewing Table 5, a couple of interesting things can be seen. The network did not predict the Nilm or Adenocarcinoma class for any cell. Since the data set contained so few Adenocarcinoma cells, the network might have had too few samples to pick up any specific features for this type. In the case of failing to predict Nilm cells, one explanation might be that information about the surroundings is lost when segmenting the images. Since the nuclei of Nilm and Lsil cells look similar, this might be the reason the network predicts Nilm as Lsil. Another thing that can be observed is that the network predicts Lsil very well. One can also see that the network can distinguish between Lsil and Hsil, but cannot really distinguish between Hsil and Scc. The explanation for this might be that the Hsil and Scc cells are quite alike but differ from the Lsil cells.

Unsegmented Images

As discussed in Section 5.1, the smaller architecture generated the most interesting result for the unsegmented images. Looking at the confusion matrix in Table 8 one can see that the class called Other is predicted a lot for all classes. The Other class contains samples from many types of cells. This class probably has the largest variety of cells, which might be the reason that the network has a bias towards predicting this class. Another thing that can be observed is the failure to predict the Adenocarcinoma cells. This can be caused by the fact that the data set only contained approximately 60 samples of this cell type. Since there are so few samples, even with the augmentation, the features for this class might be hard for the network to determine. You can also see that the network is best at predicting Nilm cells, but Lsil cells are predicted as Nilm almost half of the time. This can be because Lsil cells only have small abnormal changes in comparison to the Nilm cells. Since there are more samples of the Nilm cells than of the Lsil cells, there is more variety in that class. The most interesting result is that the network is very good at distinguishing Hsil, Scc and Adenocarcinoma cells from Nilm cells. Hsil cells are not likely to return to normal, and Scc and Adenocarcinoma are already cancerous cells. This means that the network can actually see a difference between the more dangerous cells and the Nilm cells.

6 Conclusion

6.1 Using Deep Learning to classify Malignancy Associated Changes

Even though the results are not perfect, we see that there is potential for using deep learning for detecting malignancy associated changes. The major reason for this is the result for the unsegmented images, where the network distinguished the more dangerous cell types from the Nilm cells. Due to the shortage of time in this project, we believe that more extensive studies are needed to evaluate the method and generate a better result.

6.2 Future work

The major limitation in this project was the data set. The data set was unevenly distributed and thus many classes had to be augmented to a large extent. Collecting more samples of the cancerous cells would give a much more evenly distributed data set and would make the augmentation less crucial. This might generate a more accurate distinction between the Hsil, Scc and Adenocarcinoma cells.
It would also be interesting to work with the depth in the image stack. Instead of training on only one image for each cell, a substack with different lens focus could be used. This might give the network some extra information about each cell. On the image processing side, more work could be done on enhancing features in the different cell types. This requires more studies on how the malignancy associated changes appear. An area with a lot of potential for generating a better result is the architecture of the CNN. There is room for development in working with the depth, different hyperparameters and different architectures.

Acknowledgments

We would like to thank our supervisors Carolina Wählby, Sajith Kecheril Sadanandan and Petter Ranefall for all the help and inspiration they have given us during this project.

References

[1] Human papillomavirus (HPV) and cervical cancer (fact sheet), 2016.
[2] Hallinan, J., Detection of malignancy associated changes in cervical cells using statistical and evolutionary computation techniques, Ph.D. thesis, The University of Queensland.
[3] Pap Test FAQ, Wisconsin State Laboratory of Hygiene.
[4] Schmidhuber, J., My First Deep Learning System of 1991 + Deep Learning Timeline, arXiv [cs.NE].
[5] LeCun, Y., Bengio, Y., Hinton, G., Deep learning, Nature 521, 2015.
[6] Russakovsky, O.*, Deng, J.*, Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., Fei-Fei, L. (* = equal contribution), ImageNet Large Scale Visual Recognition Challenge, IJCV.
[7] Convolutional Neural Networks (LeNet), LISA lab.
[8] Karpathy, A., Convolutional Neural Networks: Architectures, Convolution / Pooling Layers.
[9] Karpathy, A., Neural Networks Part 2: Setting up the Data and the Loss.
[10] Graupe, D., Deep Learning Neural Networks: Design and Case Studies, World Scientific Publishing Co Inc, 2016.
[11] Srivastava, N., Hinton, G., Sutskever, I., Salakhutdinov, R., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Department of Computer Science, University of Toronto, 2014.
[12] Ioffe, S., Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.
[13] Optimization: Stochastic Gradient Descent, Stanford University.
[14] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, arXiv [cs.CV].
[15] Gonzalez, R. C., Woods, R. E., Digital Image Processing, 2nd edition, Prentice-Hall, 2002.
[16] Malm, P., Multi-resolution Cervical Cell Dataset, Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, Technical report (Blue series) No. 37.
