Breast cancer dataset


The following notebook is an exploration of the Breast Cancer Dataset downloaded from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.
The aim is to predict the diagnosis from the features provided. The notebook explores the data, performs a forward selection of features, and compares four models to find a suitable candidate for prediction.

In [1]:
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, plot_roc_curve, plot_confusion_matrix,
                             confusion_matrix, plot_precision_recall_curve,
                             precision_recall_fscore_support, roc_auc_score, auc)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_validate, KFold
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
df = pd.read_csv('./data/data.csv')
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB

From this summary we know that we have to predict a categorical variable (diagnosis) from numeric features only.

In [5]:
df.head()
Out[5]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 32 columns

In [6]:
df.isna().sum()
Out[6]:
id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

There are no missing values in any column, so the dataset needs no imputation, yay!

In [7]:
df['id'].apply(str).describe()
Out[7]:
count        569
unique       569
top       908445
freq           1
Name: id, dtype: object

Each id is unique, so the column only identifies rows, carries no predictive information, and can be dropped.

In [8]:
df.drop('id', axis = 1, inplace = True)

Defining the target variable

In [9]:
target = 'diagnosis'
In [10]:
df[target].value_counts().plot(kind='bar')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d56ebd7ec8>
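
The bar chart shows the class balance only visually. A minimal sketch for getting the counts and shares as numbers follows. It assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) as a stand-in for `./data/data.csv`; in that copy the target is coded 0 = malignant, 1 = benign, so it is mapped back to 'M'/'B' here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for ./data/data.csv.
# Its target is coded 0 = malignant, 1 = benign.
raw = load_breast_cancer()
diagnosis = pd.Series(raw.target, name='diagnosis').map({0: 'M', 1: 'B'})

counts = diagnosis.value_counts()
shares = diagnosis.value_counts(normalize=True).round(3)
print(pd.DataFrame({'count': counts, 'share': shares}))
```

The shares give the accuracy of a classifier that always predicts the majority class, which is a useful baseline for the accuracies reported later.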

Removing the target from the list of columns; the remaining columns are the features described below.

In [11]:
features = df.columns.to_list()
features.pop(features.index(target))
Out[11]:
'diagnosis'
In [12]:
len(features)
Out[12]:
30

Data exploration

First, let's do some descriptive statistics for the different features.

In [13]:
for feature in features:
    print(f'***\n{feature}\n***')
    print(df[feature].describe(), '\n')
***
radius_mean
***
count    569.000000
mean      14.127292
std        3.524049
min        6.981000
25%       11.700000
50%       13.370000
75%       15.780000
max       28.110000
Name: radius_mean, dtype: float64 

***
texture_mean
***
count    569.000000
mean      19.289649
std        4.301036
min        9.710000
25%       16.170000
50%       18.840000
75%       21.800000
max       39.280000
Name: texture_mean, dtype: float64 

***
perimeter_mean
***
count    569.000000
mean      91.969033
std       24.298981
min       43.790000
25%       75.170000
50%       86.240000
75%      104.100000
max      188.500000
Name: perimeter_mean, dtype: float64 

***
area_mean
***
count     569.000000
mean      654.889104
std       351.914129
min       143.500000
25%       420.300000
50%       551.100000
75%       782.700000
max      2501.000000
Name: area_mean, dtype: float64 

***
smoothness_mean
***
count    569.000000
mean       0.096360
std        0.014064
min        0.052630
25%        0.086370
50%        0.095870
75%        0.105300
max        0.163400
Name: smoothness_mean, dtype: float64 

***
compactness_mean
***
count    569.000000
mean       0.104341
std        0.052813
min        0.019380
25%        0.064920
50%        0.092630
75%        0.130400
max        0.345400
Name: compactness_mean, dtype: float64 

***
concavity_mean
***
count    569.000000
mean       0.088799
std        0.079720
min        0.000000
25%        0.029560
50%        0.061540
75%        0.130700
max        0.426800
Name: concavity_mean, dtype: float64 

***
concave points_mean
***
count    569.000000
mean       0.048919
std        0.038803
min        0.000000
25%        0.020310
50%        0.033500
75%        0.074000
max        0.201200
Name: concave points_mean, dtype: float64 

***
symmetry_mean
***
count    569.000000
mean       0.181162
std        0.027414
min        0.106000
25%        0.161900
50%        0.179200
75%        0.195700
max        0.304000
Name: symmetry_mean, dtype: float64 

***
fractal_dimension_mean
***
count    569.000000
mean       0.062798
std        0.007060
min        0.049960
25%        0.057700
50%        0.061540
75%        0.066120
max        0.097440
Name: fractal_dimension_mean, dtype: float64 

***
radius_se
***
count    569.000000
mean       0.405172
std        0.277313
min        0.111500
25%        0.232400
50%        0.324200
75%        0.478900
max        2.873000
Name: radius_se, dtype: float64 

***
texture_se
***
count    569.000000
mean       1.216853
std        0.551648
min        0.360200
25%        0.833900
50%        1.108000
75%        1.474000
max        4.885000
Name: texture_se, dtype: float64 

***
perimeter_se
***
count    569.000000
mean       2.866059
std        2.021855
min        0.757000
25%        1.606000
50%        2.287000
75%        3.357000
max       21.980000
Name: perimeter_se, dtype: float64 

***
area_se
***
count    569.000000
mean      40.337079
std       45.491006
min        6.802000
25%       17.850000
50%       24.530000
75%       45.190000
max      542.200000
Name: area_se, dtype: float64 

***
smoothness_se
***
count    569.000000
mean       0.007041
std        0.003003
min        0.001713
25%        0.005169
50%        0.006380
75%        0.008146
max        0.031130
Name: smoothness_se, dtype: float64 

***
compactness_se
***
count    569.000000
mean       0.025478
std        0.017908
min        0.002252
25%        0.013080
50%        0.020450
75%        0.032450
max        0.135400
Name: compactness_se, dtype: float64 

***
concavity_se
***
count    569.000000
mean       0.031894
std        0.030186
min        0.000000
25%        0.015090
50%        0.025890
75%        0.042050
max        0.396000
Name: concavity_se, dtype: float64 

***
concave points_se
***
count    569.000000
mean       0.011796
std        0.006170
min        0.000000
25%        0.007638
50%        0.010930
75%        0.014710
max        0.052790
Name: concave points_se, dtype: float64 

***
symmetry_se
***
count    569.000000
mean       0.020542
std        0.008266
min        0.007882
25%        0.015160
50%        0.018730
75%        0.023480
max        0.078950
Name: symmetry_se, dtype: float64 

***
fractal_dimension_se
***
count    569.000000
mean       0.003795
std        0.002646
min        0.000895
25%        0.002248
50%        0.003187
75%        0.004558
max        0.029840
Name: fractal_dimension_se, dtype: float64 

***
radius_worst
***
count    569.000000
mean      16.269190
std        4.833242
min        7.930000
25%       13.010000
50%       14.970000
75%       18.790000
max       36.040000
Name: radius_worst, dtype: float64 

***
texture_worst
***
count    569.000000
mean      25.677223
std        6.146258
min       12.020000
25%       21.080000
50%       25.410000
75%       29.720000
max       49.540000
Name: texture_worst, dtype: float64 

***
perimeter_worst
***
count    569.000000
mean     107.261213
std       33.602542
min       50.410000
25%       84.110000
50%       97.660000
75%      125.400000
max      251.200000
Name: perimeter_worst, dtype: float64 

***
area_worst
***
count     569.000000
mean      880.583128
std       569.356993
min       185.200000
25%       515.300000
50%       686.500000
75%      1084.000000
max      4254.000000
Name: area_worst, dtype: float64 

***
smoothness_worst
***
count    569.000000
mean       0.132369
std        0.022832
min        0.071170
25%        0.116600
50%        0.131300
75%        0.146000
max        0.222600
Name: smoothness_worst, dtype: float64 

***
compactness_worst
***
count    569.000000
mean       0.254265
std        0.157336
min        0.027290
25%        0.147200
50%        0.211900
75%        0.339100
max        1.058000
Name: compactness_worst, dtype: float64 

***
concavity_worst
***
count    569.000000
mean       0.272188
std        0.208624
min        0.000000
25%        0.114500
50%        0.226700
75%        0.382900
max        1.252000
Name: concavity_worst, dtype: float64 

***
concave points_worst
***
count    569.000000
mean       0.114606
std        0.065732
min        0.000000
25%        0.064930
50%        0.099930
75%        0.161400
max        0.291000
Name: concave points_worst, dtype: float64 

***
symmetry_worst
***
count    569.000000
mean       0.290076
std        0.061867
min        0.156500
25%        0.250400
50%        0.282200
75%        0.317900
max        0.663800
Name: symmetry_worst, dtype: float64 

***
fractal_dimension_worst
***
count    569.000000
mean       0.083946
std        0.018061
min        0.055040
25%        0.071460
50%        0.080040
75%        0.092080
max        0.207500
Name: fractal_dimension_worst, dtype: float64 
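
The loop above prints thirty separate blocks. As an alternative sketch, a transposed `describe()` puts the same statistics into one table with one row per feature, which makes the very different feature scales easier to compare. The scikit-learn copy of the dataset is assumed as a stand-in for the CSV; its column names ('mean radius') differ from the CSV's ('radius_mean').

```python
from sklearn.datasets import load_breast_cancer

# Assumption: bundled scikit-learn copy instead of ./data/data.csv.
X = load_breast_cancer(as_frame=True).data

# One row per feature, one column per statistic.
summary = X.describe().T
print(summary[['mean', 'std', 'min', '50%', 'max']].head())
```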

Let's continue with some boxplots to explore the data

In [14]:
for feature in features:
    # Open a new figure first, so each feature gets its own plot
    # and no empty figure is left over after the loop.
    plt.figure()
    sns.boxplot(data=df, x=target, y=feature)

The following distribution plots showcase the same information, but present it differently.

In [15]:
for feature in features:
    # Open a new figure first, so each feature gets its own plot
    # and no empty figure is left over after the loop.
    plt.figure()
    sns.distplot(df[df[target] == 'M'][feature], color='r')  # malignant
    sns.distplot(df[df[target] == 'B'][feature], color='g')  # benign

From the above boxplots and distribution plots it is apparent that benign tumors are smaller in area, radius and perimeter. The features radius_mean, concavity_mean and concave points_mean look like they could prove to be good predictors. Let's have a look at a correlation matrix to see which features are indeed highly correlated with the outcome. As it is a binary outcome, let's map benign to 0 and malignant to 1.
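
The visual impression about tumor size can be checked numerically by comparing per-class medians. This is a minimal sketch that assumes the scikit-learn copy of the dataset as a stand-in for the CSV (column names there are 'mean radius' rather than 'radius_mean', and the target is coded 0 = malignant, 1 = benign).

```python
from sklearn.datasets import load_breast_cancer

# Assumption: bundled scikit-learn copy instead of ./data/data.csv.
raw = load_breast_cancer(as_frame=True)
data = raw.data.copy()
data['diagnosis'] = raw.target.map({0: 'M', 1: 'B'})

# Median of the size-related features within each diagnosis group.
size_cols = ['mean radius', 'mean perimeter', 'mean area']
medians = data.groupby('diagnosis')[size_cols].median()
print(medians)
```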

Start modelling

First, do feature selection using a forward selection method with four models.

In [16]:
# The diagnosis column has dtype object, so map it to integers: malignant = 1, benign = 0
df[target] = df[target].map({'M': 1, 'B': 0})

Plot the correlation matrix as a heatmap.

In [17]:
plt.subplots(figsize=(14,14))
sns.heatmap(df.corr())
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d56ef1c808>

Based on the correlation values with the diagnosis, I'll order the features by their correlation coefficient, from highest to lowest, and use forward selection to determine which of them are relevant for making a good prediction.

In [18]:
correldf = df.corr()
# Correlation of every feature with the target, highest first;
# dropping the target removes its self-correlation of 1.
feature_list = correldf[target].drop(target).sort_values(ascending=False).index
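
Sorting the signed coefficients places negatively correlated features at the end of the list regardless of how strong the relationship is. That only matters when strong negative correlations are present, but ranking by absolute value avoids the issue altogether. The following is a sketch of that variant, not the ranking used in this notebook; it assumes the scikit-learn copy of the dataset, recoded so that malignant = 1.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: bundled scikit-learn copy instead of ./data/data.csv.
raw = load_breast_cancer(as_frame=True)
data = raw.data.copy()
# scikit-learn codes malignant as 0; flip it to match the notebook (M = 1).
data['diagnosis'] = 1 - raw.target

# Pearson correlation of each feature with the 0/1 target.
corr_with_target = data.corr()['diagnosis'].drop('diagnosis')

# Rank by strength of the relationship, ignoring its sign.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked.head())
```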

In the cell below, I define four different models. Each of them is fitted on the growing feature sets produced by the forward selection procedure and scored with cross-validation.

In [19]:
models = {
    'logistic_regression': LogisticRegression(),
    'svc': SVC(gamma='auto'),
    'random_forest': RandomForestClassifier(n_estimators=100),
    'gradient_boost': GradientBoostingClassifier()
}
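
SVC with an RBF kernel and logistic regression both depend on the scale of the inputs, and the features here range from about 0.001 (smoothness_se) to over 4000 (area_worst). A hedged sketch of how scaling could be added is to wrap the estimator in a pipeline, so the scaler is refit inside every cross-validation fold. The data loading and the shuffled `KFold` with a fixed seed are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled scikit-learn copy instead of ./data/data.csv.
X, y = load_breast_cancer(return_X_y=True)
cv = KFold(5, shuffle=True, random_state=0)

# Same SVC as in the models dict, once on raw and once on standardized features.
raw_svc = SVC(gamma='auto')
scaled_svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))

raw_acc = cross_val_score(raw_svc, X, y, cv=cv).mean()
scaled_acc = cross_val_score(scaled_svc, X, y, cv=cv).mean()
print(f'unscaled: {raw_acc:.3f}, scaled: {scaled_acc:.3f}')
```

Tree-based models such as the random forest and gradient boosting are not affected by monotone rescaling of the features, so they need no such wrapper.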
In [20]:
def evaluate_models(models: dict, df: pd.DataFrame = df, target: str = 'diagnosis', feature_list: list = feature_list):
    """
    Cross-validate every model on growing prefixes of feature_list and
    return one row of mean scores per (model, feature subset).

    The defaults are bound to the notebook's globals when the function is
    defined, so it can be called with just `models` and still be reused later.
    """
    rows = []
    cv = KFold(5)
    # roc_auc is computed as well but not recorded in the result table.
    scorers = ['accuracy', 'roc_auc', 'precision', 'recall']
    y = df[target]
    for name, model in models.items():
        relevant_cols = []
        for feature in feature_list:
            # Forward step: add the next feature in correlation order.
            relevant_cols.append(feature)
            X = df[relevant_cols]

            scores = cross_validate(model, X, y, cv=cv, scoring=scorers,
                                    return_estimator=False)
            rows.append({
                'colnames': ','.join(relevant_cols),
                'model': name,
                'accuracy': scores['test_accuracy'].mean(),
                'precision': scores['test_precision'].mean(),
                'recall': scores['test_recall'].mean(),
            })

    # DataFrame.append was removed in pandas 2.0, so build the frame in one go.
    return pd.DataFrame(rows)
In [21]:
model_performance = evaluate_models(models)
model_performance.reset_index(inplace=True, drop=True)
model_performance['length'] = model_performance.colnames.apply(
    lambda x: len(x.split(',')))
model_performance['accuracy'] = model_performance.accuracy.apply(
    lambda x: round(x, 2))
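
The procedure above adds features in a fixed order given by the correlation ranking, so it evaluates nested prefixes of that list rather than searching over all remaining candidates at each step. For comparison, here is a sketch of greedy forward selection with scikit-learn's `SequentialFeatureSelector`, which at each step adds the feature that most improves the cross-validated score. This is a different technique from the one used in this notebook; the choice of three features, the scaled logistic regression and the bundled dataset copy are all assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled scikit-learn copy instead of ./data/data.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Greedy search: start from no features and add the best one per step.
selector = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction='forward',
    scoring='accuracy', cv=5)
selector.fit(X, y)

selected = X.columns[selector.get_support()].to_list()
print(selected)
```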

Let's have a look at the best performing 15 models.

In [22]:
model_performance.sort_values(
    ['accuracy', 'length'], ascending=[False, True]).head(15)
Out[22]:
colnames model accuracy precision recall length
107 concave points_worst,perimeter_worst,concave p... gradient_boost 0.96 0.946343 0.942233 18
78 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.960255 0.942491 19
79 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.956921 0.939550 20
80 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.950627 0.942491 21
110 concave points_worst,perimeter_worst,concave p... gradient_boost 0.96 0.957222 0.936351 21
81 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.947644 0.937491 22
82 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.967246 0.943632 23
85 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.953519 0.929799 26
86 concave points_worst,perimeter_worst,concave p... random_forest 0.96 0.964276 0.934550 27
66 concave points_worst,perimeter_worst,concave p... random_forest 0.95 0.927278 0.929799 7
98 concave points_worst,perimeter_worst,concave p... gradient_boost 0.95 0.944548 0.914166 9
69 concave points_worst,perimeter_worst,concave p... random_forest 0.95 0.949255 0.934550 10
10 concave points_worst,perimeter_worst,concave p... logistic_regression 0.95 0.932226 0.920880 11
70 concave points_worst,perimeter_worst,concave p... random_forest 0.95 0.949255 0.934550 11
100 concave points_worst,perimeter_worst,concave p... gradient_boost 0.95 0.948496 0.915976 11

Selecting the winning model from the performance results and evaluating it with further metrics.

In [23]:
winning_model = model_performance.sort_values(
    ['accuracy', 'length'], ascending=[False, True]).iloc[0]
In [24]:
winning_features = winning_model['colnames'].split(',')

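Fit a model on the winning feature set and evaluate it on a held-out test set using accuracy, precision, recall, F-score, the confusion matrix, and the precision-recall and ROC curves. Note that the cells below fit a random forest, although the top-ranked row in the table above is a gradient boosting model; only the feature set is taken from the winner.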

In [25]:
X = df[winning_features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)
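
The split above is random and unseeded, so the test scores below change from run to run, and the class proportions in the two parts can drift apart. A minimal sketch of a reproducible, stratified split follows; the seed value and the bundled dataset copy are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: bundled scikit-learn copy instead of ./data/data.csv.
X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(f'test size: {len(y_test)}')
print(f'positive share train: {y_train.mean():.3f}, test: {y_test.mean():.3f}')
```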
In [26]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Out[26]:
RandomForestClassifier()
In [27]:
rfc.score(X_test, y_test)
Out[27]:
0.9473684210526315
In [28]:
predictions = rfc.predict(X_test)
precision, recall, f_score, support = precision_recall_fscore_support(
    y_test, predictions)
print('model evaluation:\n')
print(f'precision: {precision}\nrecall: {recall}\nf-score: {f_score}')
model evaluation:

precision: [0.94871795 0.94444444]
recall: [0.97368421 0.89473684]
f-score: [0.96103896 0.91891892]
In [29]:
accuracy_score(y_test,predictions)
Out[29]:
0.9473684210526315
In [30]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
In [31]:
print(f'{tn}, {fp}, {fn}, {tp}')
74, 2, 4, 34
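
The four counts are enough to reproduce the scores reported above for the malignant class (label 1). This worked example uses the printed values directly; specificity is added here as an extra metric that the notebook does not report.

```python
# Counts printed above; class 1 (malignant) is the positive class.
tn, fp, fn, tp = 74, 2, 4, 34

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)       # recall of the benign class
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} specificity={specificity:.4f} f1={f1:.4f}')
# accuracy=0.9474 precision=0.9444 recall=0.8947 specificity=0.9737 f1=0.9189
```

The four false negatives are malignant tumors predicted as benign, which is the costlier error in a diagnostic setting, so recall deserves at least as much attention as accuracy here.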
In [32]:
plot_confusion_matrix(rfc, X_test, y_test, cmap='Blues')
Out[32]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1d573aa4f08>
In [33]:
plot_precision_recall_curve(rfc, X_test, y_test)
Out[33]:
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x1d573b1c0c8>
In [34]:
plot_roc_curve(rfc, X_test, y_test)
Out[34]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x1d573b84b88>
In [35]:
y_score = rfc.predict_proba(X_test)

round(roc_auc_score(y_test, y_score[:, 1]),2)
Out[35]:
0.99