The following notebook is an exploration of the Breast Cancer Dataset downloaded from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.
The aim is to predict the diagnosis based on the features provided.
The notebook explores the data, performs a forward selection of features, and tests four models against each other to find a potential candidate for making predictions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# the plot_* helpers were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_recall_fscore_support,
                             roc_auc_score, ConfusionMatrixDisplay, PrecisionRecallDisplay,
                             RocCurveDisplay)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_validate, KFold
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('./data/data.csv')
df.info()
We can see that we have to predict a categorical variable (the diagnosis) from exclusively numeric features.
df.head()
df.isna().sum()
The dataset is fully complete and does not require any cleaning, yay!!
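That said, some exports of this CSV end every row with a trailing comma, which makes pandas add an all-NaN 'Unnamed: 32' column. A defensive drop of fully empty columns is a no-op if the frame is already clean:
# defensively drop any columns that are entirely NaN (e.g. an 'Unnamed: 32'
# artefact of a trailing comma in some exports of this CSV)
df.dropna(axis=1, how='all', inplace=True)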
df['id'].apply(str).describe()
Each id is unique.
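As a quick sanity check, we can assert that uniqueness programmatically before discarding the column:
# sanity check: the number of distinct ids should equal the number of rows
assert df['id'].nunique() == len(df)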
df.drop('id', axis = 1, inplace = True)
Defining the target variable
target = 'diagnosis'
df[target].value_counts().plot(kind='bar')
Removing the target from the feature list and describing the rest
features = df.columns.to_list()
features.pop(features.index(target))
len(features)
First, let's do some descriptive statistics for the different features.
for feature in features:
    print(f'***\n{feature}\n***')
    print(df[feature].describe(), '\n')
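As an aside, the same summary statistics can also be collected into a single table, which is easier to scan than thirty separate blocks:
# one-table alternative: summary statistics for all features at once
df[features].describe().T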
Let's continue with some boxplots to explore the data
for feature in features:
    plt.figure()
    sns.boxplot(data=df, x=target, y=feature)
    plt.show()
The following distribution plots showcase the same information, but present it differently.
for feature in features:
    plt.figure()
    # sns.distplot is deprecated; histplot with a KDE overlay is the modern equivalent
    sns.histplot(df[df[target] == 'M'][feature], color='r', kde=True, stat='density')
    sns.histplot(df[df[target] == 'B'][feature], color='g', kde=True, stat='density')
    plt.show()
From the above boxplots and distribution plots it is apparent that benign tumors are smaller in area, radius and perimeter. Looking at the features, radius_mean, concavity_mean and concave points_mean could prove to be good predictors. Let's have a look at a correlation matrix to see which features are indeed highly correlated with the outcome. As it is a binary outcome, let's map benign to 0 and malignant to 1.
# the diagnosis column is of object type, so map it to an integer encoding
df[target] = df[target].map({'M': 1, 'B': 0})
Plotting the correlation matrix
plt.subplots(figsize=(14,14))
sns.heatmap(df.corr())
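The heatmap also hints at strong collinearity among the size-related features (radius, perimeter, area). A quick sketch lists the most strongly correlated feature pairs explicitly:
import numpy as np
# absolute pairwise correlations between the features
corr = df[features].corr().abs()
# keep only the upper triangle (k=1 excludes the diagonal), then list the strongest pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
upper.stack().sort_values(ascending=False).head(10)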
Based on the correlation values with the diagnosis, I'll order the columns by their correlation coefficient and use forward selection to determine which of the columns are relevant for making a good prediction.
correldf = df.corr()
# correlations with the target, strongest first; [1:] drops the diagnosis itself (correlation 1.0)
correldf[target].sort_values(ascending=False)[1:]
feature_list = correldf[target].sort_values(ascending=False)[1:].index
In the cell below, I define four different models; each will be fitted via forward selection over the ordered feature list and scored with cross-validation.
models = {
'logistic_regression': LogisticRegression(),
'svc': SVC(gamma='auto'),
'random_forest': RandomForestClassifier(n_estimators=100),
'gradient_boost': GradientBoostingClassifier()
}
def evaluate_models(models: dict, df: pd.DataFrame = df, target: str = 'diagnosis',
                    feature_list: list = feature_list):
    """
    Forward selection: for every model, add the features one by one (in order of their
    correlation with the target) and record the cross-validated scores for each subset.
    Default arguments are provided so the function can be reused later without breaking.
    """
    cv = KFold(5)
    scorers = ['accuracy', 'roc_auc', 'precision', 'recall']
    results = []
    for name, model in models.items():
        relevant_cols = []
        for feature in feature_list:
            relevant_cols.append(feature)
            X = df[relevant_cols]
            y = df[target]
            x_df = pd.DataFrame(cross_validate(model, X, y, cv=cv,
                                               scoring=scorers, return_estimator=False))
            results.append(pd.DataFrame(
                {'colnames': [','.join(relevant_cols)], 'model': name,
                 'accuracy': x_df['test_accuracy'].mean(),
                 'roc_auc': x_df['test_roc_auc'].mean(),
                 'precision': x_df['test_precision'].mean(),
                 'recall': x_df['test_recall'].mean()}))
    # DataFrame.append was removed in pandas 2.0, so collect rows and concatenate once
    return pd.concat(results, ignore_index=True)
model_performance = evaluate_models(models)
model_performance.reset_index(inplace=True, drop=True)
model_performance['length'] = model_performance.colnames.apply(
lambda x: len(x.split(',')))
model_performance['accuracy'] = model_performance.accuracy.apply(
lambda x: round(x, 2))
Let's have a look at the 15 best-performing feature-set/model combinations.
model_performance.sort_values(
['accuracy', 'length'], ascending=[False, True]).head(15)
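To see how forward selection behaves as features are added, a quick sketch (using the length column computed above) plots cross-validated accuracy against the number of selected features for each model:
# accuracy as a function of the number of forward-selected features, per model
plt.figure(figsize=(10, 5))
sns.lineplot(data=model_performance, x='length', y='accuracy', hue='model')
plt.xlabel('number of features')
plt.show()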
Selecting the winning model from the performance results and evaluating it with further metrics.
winning_model = model_performance.sort_values(
['accuracy', 'length'], ascending=[False, True]).iloc[0]
winning_features = winning_model['colnames'].split(',')
Fitting and evaluating the winning feature set using the accuracy score, a confusion matrix, and precision-recall and ROC curves. The refit below assumes the top of the leaderboard came from the random forest; if another model wins on your run, substitute it here.
X = df[winning_features]
y = df[target]
# stratify keeps the class balance in both splits; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)
predictions = rfc.predict(X_test)
precision, recall, f_score, support = precision_recall_fscore_support(
y_test, predictions)
print('model evaluation:\n')
print(f'precision: {precision}\nrecall: {recall}\nf-score: {f_score}')
accuracy_score(y_test,predictions)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(f'{tn}, {fp}, {fn}, {tp}')
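Since false negatives (missed malignancies) are the costlier error here, it is worth spelling out sensitivity and specificity from the counts just computed:
# sensitivity = recall on the malignant class, specificity = recall on the benign class
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f'sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}')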
# plot_confusion_matrix and friends were removed in scikit-learn 1.2; use the Display API
ConfusionMatrixDisplay.from_estimator(rfc, X_test, y_test, cmap='Blues')
PrecisionRecallDisplay.from_estimator(rfc, X_test, y_test)
RocCurveDisplay.from_estimator(rfc, X_test, y_test)
y_score = rfc.predict_proba(X_test)
round(roc_auc_score(y_test, y_score[:, 1]),2)
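As an optional last check (assuming the refit model is the random forest from above), the forest's built-in importances can be compared with the correlation ordering that drove the forward selection:
# impurity-based feature importances of the selected features
pd.Series(rfc.feature_importances_, index=winning_features).sort_values().plot(kind='barh')
plt.show()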