Supervised machine learning (ML) using R

Summary

The goal is to analyse the Breast Cancer Wisconsin Data Set using a variety of supervised ML methods.

Analysis

R libraries

We need several libraries for the exercises. Let’s load them all up front:

library(ggplot2)
library(reshape2)
library(caret)
library(rpart.plot)
library(class)
library(randomForest)
library(pander)
library(kernlab)

Data Exploration

Load the data

df = read.table("data.csv.gz", header=TRUE, sep=",")   # TRUE rather than T, which can be reassigned
df$outcome = factor(df$outcome)   # Outcome is a categorical variable
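
Before plotting, it can help to confirm that the load step produced what we expect. The sketch below is not part of the original analysis: it uses a small stand-in data frame called toy with invented values and placeholder class labels, because the real file is not available here. The column names are the ones used in the plots in this section. With the real data, the same calls would be applied to df.

```r
# Toy stand-in for the real data. All values and the 0/1 class labels
# are invented for illustration; they are NOT taken from data.csv.gz.
toy = data.frame(
    outcome     = factor(c(0, 0, 1, 1, 1)),
    mean.radius = c(17.9, 20.6, 13.5, 12.4, 11.8),
    worst.area  = c(2019, 1956, 711, 567, 480)
)

dims      = dim(toy)              # number of rows and columns
balance   = table(toy$outcome)    # number of observations per class
n_missing = sum(is.na(toy))       # total number of missing cells

str(toy)          # column types; outcome should be a factor
print(dims)
print(balance)
print(n_missing)
```

A strongly unbalanced class table or a non-zero count of missing values would be worth knowing about before any model is trained.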

Summarize the data set

ggplot(data=df, aes(x=outcome, y=mean.radius)) + geom_boxplot(aes(group=outcome))

ggplot(data=df, aes(x=outcome, y=worst.area)) + geom_boxplot(aes(group=outcome))
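
The boxplots compare the two outcome groups visually. As a numeric complement, a per-group median table can be computed with base R's aggregate. This is a minimal sketch on an invented stand-in data frame (toy, with made-up values and placeholder class labels), not the author's method; with the real data, df would take the place of toy.

```r
# Toy stand-in for the real data. All values and the 0/1 class labels
# are invented for illustration.
toy = data.frame(
    outcome     = factor(c(0, 0, 1, 1, 1)),
    mean.radius = c(17.9, 20.6, 13.5, 12.4, 11.8),
    worst.area  = c(2019, 1956, 711, 567, 480)
)

# Median of every feature within each outcome group.
# The dot on the left-hand side stands for "all columns except outcome".
medians = aggregate(. ~ outcome, data = toy, FUN = median)
print(medians)
```

Each row of the result corresponds to one outcome class, and each feature column holds the value drawn as the centre line of the matching boxplot.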

Plot all features at once

# Reshape to long format: one row per (observation, feature) pair
lf = melt(df, id.vars="outcome")
ggplot(data=lf, aes(x=outcome, y=value)) +
    geom_boxplot(aes(group=outcome)) +
    facet_wrap(~variable, scales="free_y")
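
The faceted plot depends on the long format produced by melt, so it may be useful to see what that reshaping does on a tiny example. The data frame toy below is a hypothetical stand-in with invented values and placeholder class labels; only the column names come from this section.

```r
library(reshape2)

# Toy stand-in for the real data. All values and the 0/1 class labels
# are invented for illustration.
toy = data.frame(
    outcome     = factor(c(0, 1, 1)),
    mean.radius = c(17.9, 13.5, 12.4),
    worst.area  = c(2019, 711, 567)
)

# Keep outcome as the identifier; stack every other column into
# a (variable, value) pair.
long = melt(toy, id.vars = "outcome")
print(long)
```

Three observations with two features become six rows, with the feature name in the variable column and the measurement in the value column. This is the column that facet_wrap(~variable) uses to draw one panel per feature.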

ggplot(data=lf, aes(x=value)) +
    geom_freqpoly(aes(group=outcome, color=outcome), stat="bin", bins=30) +
    facet_wrap(~variable, scales="free")