Data Exploration using R

Summary

The goal is to learn a few ways to summarize and visualize the continuous and categorical variables of the Iris data set.

Methods

The central value of a variable is measured here with the mean and the median.

The variance or dispersion of the data is measured here with the standard deviation (sd), the variance, the range, and quantiles.

These statistical methods can be used to compare variables:

Analysis

R libraries

We need a few libraries for the exercises. Let’s load them all up front:

library(ggplot2)
library(reshape2)
library(Hmisc)
library(pastecs)

Calculating the mean and sd

Let’s load the data set and calculate the mean and standard deviation (sd) of the sepal length feature.

data(iris)
print(dim(iris))
## [1] 150   5
print(head(iris))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
mean(iris$Sepal.Length)
## [1] 5.843333
sd(iris$Sepal.Length)
## [1] 0.8280661
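
To make explicit what mean() and sd() compute, the following sketch reproduces both by hand. Note that sd() is the sample standard deviation, which divides by n - 1 rather than n. The names x, n, manual_mean, and manual_sd are illustrative and not part of the original analysis.

```r
x <- iris$Sepal.Length
n <- length(x)

# Arithmetic mean: sum of the values divided by their count
manual_mean <- sum(x) / n

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1 (not n)
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both comparisons should return TRUE
all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```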

Calculating quantiles

We can also calculate the 25%, 50%, and 75% quantiles of the data. The 50% quantile is the median, so median() returns the same value.

quantile(iris$Sepal.Length, probs=c(0.25, 0.5, 0.75))
## 25% 50% 75% 
## 5.1 5.8 6.4
median(iris$Sepal.Length)
## [1] 5.8
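
The distance between the 25% and 75% quantiles is the interquartile range (IQR), a dispersion measure that is less sensitive to extreme values than the sd. A minimal sketch, with q and iqr_manual as illustrative names:

```r
q <- quantile(iris$Sepal.Length, probs = c(0.25, 0.5, 0.75))

# Interquartile range: 75% quantile minus 25% quantile
iqr_manual <- unname(q["75%"] - q["25%"])

# The built-in IQR() should agree with the manual calculation
all.equal(iqr_manual, IQR(iris$Sepal.Length))
```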

Let’s visualize these quantiles in a histogram of sepal length.

ggplot(data=iris, aes(x=Sepal.Length)) + geom_histogram(bins=30) + geom_vline(xintercept=c(5.1, 5.8, 6.4))
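
Typing the quantile values in by hand works for a single feature, but it is easy to get out of sync with the data. One possible variant, sketched here, computes the quantiles once and passes them to geom_vline(); the names q and p and the dashed line type are illustrative choices.

```r
library(ggplot2)

# Compute the quartiles once instead of typing them in by hand
q <- unname(quantile(iris$Sepal.Length, probs = c(0.25, 0.5, 0.75)))

# Histogram of sepal length with one vertical line per quartile
p <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = q, linetype = "dashed")
print(p)
```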

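Let’s do the same with a boxplot. The first plot shows the plain boxplot; the second overlays the three quantiles as horizontal lines, which coincide with the edges of the box and the median line.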

ggplot(data=iris, aes(x="Sepal.Length", y=Sepal.Length)) + geom_boxplot()

ggplot(data=iris, aes(x="Sepal.Length", y=Sepal.Length)) + geom_boxplot() + geom_hline(yintercept=c(5.1, 5.8, 6.4), color="lightblue")
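
The numbers behind a boxplot can also be inspected without plotting. As a sketch, base R's boxplot.stats() returns the five values of a standard box-and-whisker plot and any points beyond the whiskers. Its hinges are computed slightly differently from quantile(), so small differences from the ggplot2 box are possible in general. The name bs is illustrative.

```r
bs <- boxplot.stats(iris$Sepal.Length)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Observations beyond the whiskers (more than 1.5 box lengths from the box)
bs$out
```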

Summarizing all features (columns)

R has several useful summary functions that report the mean, sd, and other statistics: summary() from base R, describe() from Hmisc, and stat.desc() from pastecs.

print(summary(iris))
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
print(describe(iris))
## iris 
## 
##  5  Variables      150  Observations
## ---------------------------------------------------------------------------
## Sepal.Length 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      150        0       35    0.998    5.843   0.9462    4.600    4.800 
##      .25      .50      .75      .90      .95 
##    5.100    5.800    6.400    6.900    7.255 
## 
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## ---------------------------------------------------------------------------
## Sepal.Width 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      150        0       23    0.992    3.057   0.4872    2.345    2.500 
##      .25      .50      .75      .90      .95 
##    2.800    3.000    3.300    3.610    3.800 
## 
## lowest : 2.0 2.2 2.3 2.4 2.5, highest: 3.9 4.0 4.1 4.2 4.4
## ---------------------------------------------------------------------------
## Petal.Length 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      150        0       43    0.998    3.758    1.979     1.30     1.40 
##      .25      .50      .75      .90      .95 
##     1.60     4.35     5.10     5.80     6.10 
## 
## lowest : 1.0 1.1 1.2 1.3 1.4, highest: 6.3 6.4 6.6 6.7 6.9
## ---------------------------------------------------------------------------
## Petal.Width 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      150        0       22     0.99    1.199   0.8676      0.2      0.2 
##      .25      .50      .75      .90      .95 
##      0.3      1.3      1.8      2.2      2.3 
## 
## lowest : 0.1 0.2 0.3 0.4 0.5, highest: 2.1 2.2 2.3 2.4 2.5
## ---------------------------------------------------------------------------
## Species 
##        n  missing distinct 
##      150        0        3 
##                                            
## Value          setosa versicolor  virginica
## Frequency          50         50         50
## Proportion      0.333      0.333      0.333
## ---------------------------------------------------------------------------
print(pastecs::stat.desc(iris))
##              Sepal.Length  Sepal.Width Petal.Length  Petal.Width Species
## nbr.val      150.00000000 150.00000000  150.0000000 150.00000000      NA
## nbr.null       0.00000000   0.00000000    0.0000000   0.00000000      NA
## nbr.na         0.00000000   0.00000000    0.0000000   0.00000000      NA
## min            4.30000000   2.00000000    1.0000000   0.10000000      NA
## max            7.90000000   4.40000000    6.9000000   2.50000000      NA
## range          3.60000000   2.40000000    5.9000000   2.40000000      NA
## sum          876.50000000 458.60000000  563.7000000 179.90000000      NA
## median         5.80000000   3.00000000    4.3500000   1.30000000      NA
## mean           5.84333333   3.05733333    3.7580000   1.19933333      NA
## SE.mean        0.06761132   0.03558833    0.1441360   0.06223645      NA
## CI.mean.0.95   0.13360085   0.07032302    0.2848146   0.12298004      NA
## var            0.68569351   0.18997942    3.1162779   0.58100626      NA
## std.dev        0.82806613   0.43586628    1.7652982   0.76223767      NA
## coef.var       0.14171126   0.14256420    0.4697441   0.63555114      NA
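
Several rows of the stat.desc() output are derived from the mean and sd. The following sketch recomputes three of them for Sepal.Length using the textbook formulas, which should reproduce the values printed above. The variable names are illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)

# Standard error of the mean: sd divided by the square root of n
se_mean <- sd(x) / sqrt(n)

# Half-width of the 95% confidence interval of the mean,
# based on the t distribution with n - 1 degrees of freedom
ci_mean <- qt(0.975, df = n - 1) * se_mean

# Coefficient of variation: sd relative to the mean
coef_var <- sd(x) / mean(x)

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_mean, coef.var = coef_var), 4)
```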

With the help of ggplot2 we can visualize all features grouped by the outcome variable “Species”. melt() from reshape2 first converts the data to long format, with one row per measurement.

lf <- melt(iris, id.vars = "Species")
ggplot(data=lf, aes(x=Species, y=value)) + geom_boxplot(aes(group=Species)) + facet_wrap(~variable, scales="free_y")

ggplot(data=lf, aes(x=value)) + geom_freqpoly(aes(group=Species, color=Species), stat="bin", bins=30) + facet_wrap(~variable, scales="free")
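
The differences between species that the plots show can also be put into numbers. A minimal sketch, assuming the same long-format data as above: aggregate() computes the mean of every feature within every species, and dcast() from reshape2 arranges the result with one row per species. The name species_means is illustrative.

```r
library(reshape2)

# Long format: one row per (Species, variable, value) measurement
lf <- melt(iris, id.vars = "Species")

# Mean of every feature within every species
species_means <- aggregate(value ~ Species + variable, data = lf, FUN = mean)
print(species_means)

# The same information in a wide layout, one row per species
print(dcast(species_means, Species ~ variable, value.var = "value"))
```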