Classifying Malignant and Benign Breast Cancer Tumours with a Neural Network

@Joshua Payne | 5 minute read

The code in this article is based on a tutorial found here. This article assumes a basic intuitive understanding of neural networks. For background, check this out.

Using this dataset, I created a neural network capable of classifying breast tumors. The features are measured characteristics of cell nuclei within the tumor, including perimeter, concavity, and smoothness. The labels are 0 or 1, representing benign and malignant diagnoses respectively. With my network, I mapped the relationship between these two variables.

Here’s how I did it

We’ll first import the following libraries.

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

We’ll then use pandas to read in our features and labels data by assigning each of the sets of data to variables x and y. In pandas, these are called dataframes, which are basically the same as tables.

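x = pd.read_csv('')  # path to the features CSV (missing in the source)
y = pd.read_csv('')  # path to the labels CSV (missing in the source)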

We’ll then scale our features data as part of the preprocessing stage.

x = preprocessing.scale(x)
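
By default, preprocessing.scale standardizes each feature column to zero mean and unit variance. As a rough illustration of that idea (not scikit-learn's own code), here is a minimal NumPy sketch; the toy `features` matrix is made up for the example.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales.
features = np.array([[10.0, 200.0],
                     [12.0, 400.0],
                     [14.0, 600.0],
                     [16.0, 800.0]])

# Standardize each column: subtract its mean, divide by its (population) standard deviation.
col_mean = features.mean(axis=0)
col_std = features.std(axis=0)
scaled = (features - col_mean) / col_std

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

One design note: scaling before the split lets the test samples influence the scaling statistics. Fitting a StandardScaler on the training data alone is a common alternative.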

It’s now time to split our data into testing and training data. Training data is what our neural network uses to learn how to map our features to our labels, and testing data is what we use to see our model in action on data samples it hasn’t seen before. 20% of our data will be testing data, and 80% will be training data.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
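
To make the 80/20 split concrete, here is a minimal sketch of the underlying idea: shuffle the row indices, then cut off a fifth of them for testing. This is an illustration under assumed toy data, not scikit-learn's implementation, and the `_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is repeatable

n_samples = 10
features = np.arange(n_samples * 3).reshape(n_samples, 3)  # hypothetical 10 x 3 feature table
labels = np.arange(n_samples) % 2                          # hypothetical 0/1 labels

# Shuffle the row indices, then take the first 20% as the test set.
shuffled = rng.permutation(n_samples)
n_test = int(np.ceil(0.2 * n_samples))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

x_train_demo, x_test_demo = features[train_idx], features[test_idx]
y_train_demo, y_test_demo = labels[train_idx], labels[test_idx]

print(x_train_demo.shape, x_test_demo.shape)  # (8, 3) (2, 3)
```

In practice, train_test_split also accepts random_state for a reproducible split and stratify to keep the benign/malignant ratio similar in both sets.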

Let’s now convert our training and testing data into numpy arrays so that we can use them with a Keras neural network.

x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

Let’s now build our actual model. We’re simply building a feedforward neural network, so the Sequential model easily suffices. Our model is composed of four dense layers with 20 nodes each (the first with no activation function, the next three with ReLU), and then ends with 1 sigmoid node representing the final classification prediction. To learn more about what the activation functions sigmoid and ReLU mean here, check this out.

Our input shape represents the shape of our feature arrays.

model = Sequential()
model.add(Dense(20, input_shape=(30,)))
model.add(Dense(20, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
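
For readers who want a quick feel for the two activation functions used in the model, here is a minimal pure-Python sketch of their standard definitions. The helper names `relu` and `sigmoid` are just for this illustration; Keras applies its own implementations internally.

```python
import math

def relu(z):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid: squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), round(sigmoid(z), 4))
```

The sigmoid on the last layer is what lets us read the network's single output as a probability of the tumor being malignant.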

We’ll now compile our model. The Adam optimizer is a solid default choice, and binary cross-entropy is a go-to loss function for classification problems of 2 classes. It compares our sigmoid output, a probability between 0 and 1, against the dataset’s output, and penalizes confident wrong predictions most heavily. The ‘output’ represents the predicted or actual diagnosis for breast cancer: 0 for benign and 1 for malignant. We’ll also track the accuracy metric, which rounds the sigmoid output to 0 or 1 before comparing, so we can understand how accurate our model is at classifying these types of breast tumours.
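
As a rough illustration of how binary cross-entropy scores predictions, here is a minimal pure-Python sketch. The labels and probabilities are made up, and the clipping value `eps` is an assumption of this sketch rather than something taken from the article.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and sigmoid outputs, for illustration only.
labels = [0, 1, 1, 0]
confident_right = [0.1, 0.9, 0.8, 0.2]
confident_wrong = [0.9, 0.1, 0.2, 0.8]

loss_right = binary_cross_entropy(labels, confident_right)
loss_wrong = binary_cross_entropy(labels, confident_wrong)
print(loss_right)  # small loss
print(loss_wrong)  # much larger loss
```

The loss works on the raw probabilities, so a prediction of 0.9 for a malignant sample is rewarded more than a prediction of 0.6, even though both round to 1.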

Afterward, we’ll fit our model to our training data, passing over the data 500 times (500 epochs) with a validation split of 0.3. This means that 30% of our training data is held out as validation data, which the model is evaluated on after each epoch but never trained on. This is different from our testing samples, which stay entirely outside of our model’s training and are used for our own predictions. We’ll then save the history of our neural network.

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=500, validation_split=0.3)
history_dict = history.history
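
Keras takes the validation samples from the end of the arrays passed to fit, before any shuffling. The idea can be sketched with NumPy as follows; this is a simplified illustration on made-up data, not Keras's exact code, and the `_demo`, `_fit` and `_val` names are hypothetical.

```python
import numpy as np

validation_split = 0.3
x_demo = np.arange(20).reshape(10, 2)  # hypothetical 10 training samples, 2 features
y_demo = np.arange(10) % 2             # hypothetical 0/1 labels

# Hold out the trailing 30% of the rows for validation; train on the rest.
n_val = int(round(len(x_demo) * validation_split))
split_at = len(x_demo) - n_val
x_fit, x_val = x_demo[:split_at], x_demo[split_at:]
y_fit, y_val = y_demo[:split_at], y_demo[split_at:]

print(len(x_fit), len(x_val))  # 7 3
```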

It’s now time to plot our training loss and validation loss measured during the model’s training on a graph, to better understand how our network is operating.

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
plt.plot(loss_values, 'bo', label='training loss')    # blue dots
plt.plot(val_loss_values, 'r', label='validation loss')  # red line
plt.xlabel('epoch')
plt.legend()  # show the labels defined above
plt.show()
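
One thing such a plot helps spot is overfitting, where validation loss starts rising while training loss keeps falling. As a small sketch of reading that off the numbers rather than the graph, the snippet below finds the epoch with the lowest validation loss. The loss values are invented for illustration and only mimic the shape of history_dict['val_loss'].

```python
# Hypothetical per-epoch validation losses, shaped like history.history['val_loss'].
val_loss_demo = [0.60, 0.35, 0.22, 0.18, 0.17, 0.19, 0.24]

# Find the epoch (1-based, as Keras prints it) where validation loss was lowest.
best_index = min(range(len(val_loss_demo)), key=lambda i: val_loss_demo[i])
best_epoch = best_index + 1

print(best_epoch, val_loss_demo[best_index])  # 5 0.17
```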

After training the model over 500 epochs, here were the metrics for the final epoch and our graph.

Epoch 500/500
317/317 [==============================] - 0s 155us/sample - loss: 0.0851 - accuracy: 0.9621 - val_loss: 0.1539 - val_accuracy: 0.9270

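Our model was highly successful! The final epoch shows 96.2% training accuracy and 92.7% validation accuracy, with losses of 0.0851 and 0.1539 respectively. The gap between training and validation loss hints at some overfitting, but the model still does well on samples it wasn’t trained on.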

Let’s see it in action!

We’ll pick a data sample from our testing data, and see what our model predicts as its label. We’ll be using x_test[1], which is the second sample, since Python indexes from 0.

x_test[1] = [1.096e+01 1.762e+01 7.079e+01 3.656e+02 9.687e-02 9.752e-02 5.263e-02 2.788e-02 1.619e-01 6.408e-02 1.507e-01 1.583e+00 
1.165e+00 1.009e+01 9.501e-03 3.378e-02 4.401e-02 1.346e-02 1.322e-02 3.534e-03 1.162e+01 2.651e+01 7.643e+01 4.075e+02 1.428e-01 
2.510e-01 2.123e-01 9.861e-02 2.289e-01 8.278e-02]

However, Keras makes predictions on batches, so it expects a 2D input with one row per sample, even when we only have a single sample. Let’s make a new variable that wraps the sample’s values, separated by commas, in a list of a list.

x_test_1 = [[1.096e+01, 1.762e+01, 7.079e+01, 3.656e+02, 9.687e-02, 9.752e-02, 5.263e-02, 2.788e-02, 1.619e-01, 6.408e-02, 1.507e-01, 
1.583e+00, 1.165e+00, 1.009e+01, 9.501e-03, 3.378e-02, 4.401e-02, 1.346e-02, 1.322e-02, 3.534e-03, 1.162e+01, 2.651e+01, 7.643e+01, 
4.075e+02, 1.428e-01, 2.510e-01, 2.123e-01, 9.861e-02, 2.289e-01, 8.278e-02]]
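
Rather than retyping the values, the same batch shape can be obtained directly from the array. Here is a minimal NumPy sketch of two ways to do that; `x_test_demo` is a hypothetical stand-in filled with zeros, not the real testing data.

```python
import numpy as np

# Hypothetical stand-in for the test set: 5 samples with 30 features each.
x_test_demo = np.zeros((5, 30))

single = x_test_demo[1]                   # shape (30,): one sample, no batch axis
as_batch_slice = x_test_demo[1:2]         # shape (1, 30): slicing keeps the batch axis
as_batch_reshape = single.reshape(1, -1)  # shape (1, 30): add the batch axis explicitly

print(single.shape, as_batch_slice.shape, as_batch_reshape.shape)
# (30,) (1, 30) (1, 30)
```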

An output of ‘0’ means the tumor is predicted to be benign and an output of ‘1’ means a malignant prediction. By creating a classes variable we can modify our output to say the type of tumor, not 0 or 1.

classes = ['benign', 'malignant']

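We can now actually use our model! The sigmoid output is a probability between 0 and 1, so we compare it against 0.5 to get the numeric label, 0 or 1, for our testing data sample, and use that label as an index for the classes variable. If 0 is predicted, benign is the prediction, and if 1 is predicted, malignant is the prediction.

probability = model.predict(np.array(x_test_1))[0][0]  # sigmoid output for our one sample
prediction = classes[int(probability > 0.5)]  # threshold at 0.5 to get 0 or 1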
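
Because the network's output is a probability rather than a clean 0 or 1, how it gets turned into an index matters. The sketch below, using made-up probabilities and a hypothetical helper `to_class`, shows a 0.5 threshold next to what plain int() does to the same value.

```python
classes = ['benign', 'malignant']

def to_class(probability, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a class name using a cut-off."""
    return classes[int(probability >= threshold)]

# Hypothetical model outputs, for illustration only.
print(to_class(0.03))  # benign
print(to_class(0.97))  # malignant
print(int(0.97))       # 0, since plain int() truncates and would index 'benign'
```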

Let’s see what our model predicted!

>>> benign

We can now check if this prediction was accurate, because we know what the label actually is in the dataset. The model predicted the label (y) of x_test[1], so the matching entry in our testing labels, y_test[1], gives us the actual diagnosis to cross-reference.

>>> benign
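
A single correct prediction is encouraging, but the same check can be run over many testing samples at once. Here is a minimal sketch with invented probabilities and labels (not results from this model) that computes accuracy and counts missed malignant cases.

```python
# Hypothetical sigmoid outputs and true labels for six test samples.
probabilities = [0.02, 0.91, 0.40, 0.77, 0.10, 0.65]
true_labels = [0, 1, 1, 1, 0, 0]

# Threshold each probability at 0.5 to get a 0/1 prediction.
predicted = [int(p >= 0.5) for p in probabilities]
correct = sum(int(p == t) for p, t in zip(predicted, true_labels))
accuracy = correct / len(true_labels)

# Count malignant cases the model missed (predicted 0, actual 1).
false_negatives = sum(1 for p, t in zip(predicted, true_labels) if p == 0 and t == 1)

print(predicted)        # [0, 1, 0, 1, 0, 1]
print(accuracy)         # 0.6666666666666666
print(false_negatives)  # 1
```

With the real model, model.evaluate(x_test, y_test) reports the loss and accuracy over the whole testing set in one call.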

The actual diagnosis was benign, meaning our model successfully predicted whether the input data belonged to a benign or malignant breast cancer tumor!

Key Takeaways