The following notebook is an exploration of the Breast Cancer Dataset downloaded from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.
The aim is to predict the diagnosis based on the features provided.
The notebook explores the data, performs a forward selection of features, and tests four models against each other to find a potential candidate for making predictions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# the plot_* helpers were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_recall_fscore_support,
                             roc_auc_score, ConfusionMatrixDisplay, PrecisionRecallDisplay,
                             RocCurveDisplay)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_validate, KFold
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('./data/data.csv')
df.info()
We can see that we have to predict a categorical variable (the diagnosis) from exclusively numeric features.
df.head()
df.isna().sum()
The dataset is fully complete and does not require any cleaning, yay!!
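That said, some exports of this CSV end every row with a trailing comma, which makes pandas add an all-NaN 'Unnamed: 32' column. A defensive drop of fully empty columns is a no-op if the frame is already clean:
# defensively drop any columns that are entirely NaN (e.g. an 'Unnamed: 32'
# artefact of a trailing comma in some exports of this CSV)
df.dropna(axis=1, how='all', inplace=True)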
df['id'].apply(str).describe()
Each id is unique.
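As a quick sanity check, we can assert that uniqueness programmatically before discarding the column:
# sanity check: the number of distinct ids should equal the number of rows
assert df['id'].nunique() == len(df)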
df.drop('id', axis = 1, inplace = True)
Defining the target variable
target = 'diagnosis'
df[target].value_counts().plot(kind='bar')
Removing the target from the feature list and describing the rest
features = df.columns.to_list()
features.pop(features.index(target))
len(features)
First, let's do some descriptive statistics for the different features.
for feature in features:
    print(f'***\n{feature}\n***')
    print(df[feature].describe(), '\n')
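As an aside, the same summary statistics can also be collected into a single table, which is easier to scan than thirty separate blocks:
# one-table alternative: summary statistics for all features at once
df[features].describe().T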
Let's continue with some boxplots to explore the data
for feature in features:
    plt.figure()
    sns.boxplot(data=df, x=target, y=feature)
    plt.show()
The following distribution plots showcase the same information, but present it differently.
for feature in features:
    plt.figure()
    # sns.distplot is deprecated; histplot with a KDE overlay is the modern equivalent
    sns.histplot(df[df[target] == 'M'][feature], color='r', kde=True, stat='density')
    sns.histplot(df[df[target] == 'B'][feature], color='g', kde=True, stat='density')
    plt.show()
From the above boxplots and distribution plots it is apparent that benign tumors are smaller in area, radius and perimeter. Looking at the features, radius_mean, concavity_mean and concave points_mean could prove to be good predictors. Let's have a look at a correlation matrix to see which features are indeed highly correlated with the outcome. As it is a binary outcome, let's map benign to 0 and malignant to 1.
# the diagnosis column is of object type, so map it to an integer encoding
df[target] = df[target].map({'M': 1, 'B': 0})
Plotting the correlation matrix
plt.subplots(figsize=(14,14))
sns.heatmap(df.corr())
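The heatmap also hints at strong collinearity among the size-related features (radius, perimeter, area). A quick sketch lists the most strongly correlated feature pairs explicitly:
import numpy as np
# absolute pairwise correlations between the features
corr = df[features].corr().abs()
# keep only the upper triangle (k=1 excludes the diagonal), then list the strongest pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
upper.stack().sort_values(ascending=False).head(10)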
Based on the correlation values with the diagnosis, I'll order the columns by their correlation coefficient and use forward selection to determine which of the columns are relevant for making a good prediction.
correldf = df.corr()
# correlations with the target, strongest first; [1:] drops the diagnosis itself (correlation 1.0)
correldf[target].sort_values(ascending=False)[1:]
feature_list = correldf[target].sort_values(ascending=False)[1:].index
In the cell below, I define four different models; each will be fitted via forward selection over the ordered feature list and scored with cross-validation.
models = {
'logistic_regression': LogisticRegression(),
'svc': SVC(gamma='auto'),
'random_forest': RandomForestClassifier(n_estimators=100),
'gradient_boost': GradientBoostingClassifier()
}
def evaluate_models(models: dict, df: pd.DataFrame = df, target: str = 'diagnosis',
                    feature_list: list = feature_list):
    """
    Forward selection: for every model, add the features one by one (in order of their
    correlation with the target) and record the cross-validated scores for each subset.
    Default arguments are provided so the function can be reused later without breaking.
    """
    cv = KFold(5)
    scorers = ['accuracy', 'roc_auc', 'precision', 'recall']
    results = []
    for name, model in models.items():
        relevant_cols = []
        for feature in feature_list:
            relevant_cols.append(feature)
            X = df[relevant_cols]
            y = df[target]
            x_df = pd.DataFrame(cross_validate(model, X, y, cv=cv,
                                               scoring=scorers, return_estimator=False))
            results.append(pd.DataFrame(
                {'colnames': [','.join(relevant_cols)], 'model': name,
                 'accuracy': x_df['test_accuracy'].mean(),
                 'roc_auc': x_df['test_roc_auc'].mean(),
                 'precision': x_df['test_precision'].mean(),
                 'recall': x_df['test_recall'].mean()}))
    # DataFrame.append was removed in pandas 2.0, so collect rows and concatenate once
    return pd.concat(results, ignore_index=True)
model_performance = evaluate_models(models)
model_performance.reset_index(inplace=True, drop=True)
model_performance['length'] = model_performance.colnames.apply(
lambda x: len(x.split(',')))
model_performance['accuracy'] = model_performance.accuracy.apply(
lambda x: round(x, 2))
Let's have a look at the 15 best-performing feature-set/model combinations.
model_performance.sort_values(
['accuracy', 'length'], ascending=[False, True]).head(15)
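To see how forward selection behaves as features are added, a quick sketch (using the length column computed above) plots cross-validated accuracy against the number of selected features for each model:
# accuracy as a function of the number of forward-selected features, per model
plt.figure(figsize=(10, 5))
sns.lineplot(data=model_performance, x='length', y='accuracy', hue='model')
plt.xlabel('number of features')
plt.show()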
Selecting the winning model from the performance results and evaluating it with further metrics.
winning_model = model_performance.sort_values(
['accuracy', 'length'], ascending=[False, True]).iloc[0]
winning_features = winning_model['colnames'].split(',')
Fitting and evaluating the winning feature set using the accuracy score, a confusion matrix, and precision-recall and ROC curves. The refit below assumes the top of the leaderboard came from the random forest; if another model wins on your run, substitute it here.
X = df[winning_features]
y = df[target]
# stratify keeps the class balance in both splits; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)
predictions = rfc.predict(X_test)
precision, recall, f_score, support = precision_recall_fscore_support(
y_test, predictions)
print('model evaluation:\n')
print(f'precision: {precision}\nrecall: {recall}\nf-score: {f_score}')
accuracy_score(y_test,predictions)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(f'{tn}, {fp}, {fn}, {tp}')
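Since false negatives (missed malignancies) are the costlier error here, it is worth spelling out sensitivity and specificity from the counts just computed:
# sensitivity = recall on the malignant class, specificity = recall on the benign class
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f'sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}')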
# plot_confusion_matrix and friends were removed in scikit-learn 1.2; use the Display API
ConfusionMatrixDisplay.from_estimator(rfc, X_test, y_test, cmap='Blues')
PrecisionRecallDisplay.from_estimator(rfc, X_test, y_test)
RocCurveDisplay.from_estimator(rfc, X_test, y_test)
y_score = rfc.predict_proba(X_test)
round(roc_auc_score(y_test, y_score[:, 1]),2)
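As an optional last check (assuming the refit model is the random forest from above), the forest's built-in importances can be compared with the correlation ordering that drove the forward selection:
# impurity-based feature importances of the selected features
pd.Series(rfc.feature_importances_, index=winning_features).sort_values().plot(kind='barh')
plt.show()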