Analysis of variance (ANOVA) using R

Wolfgang

7 years ago

Analysis of variance (ANOVA) is a common way to test whether there is a statistically significant difference among sample means. Manual calculations are fine if you strive to understand the concept, but larger examples quickly get tedious. Here, the long-standing open source R statistics package comes to our rescue. To show how R handles a simple ANOVA, the ensuing steps provide a quick intro as well as the source code for a fully working example.

Usual structure of an analysis of variance (ANOVA) results table.

ANOVA Refresher

To be clear about the results, here is a classic ANOVA results table as you would usually encounter it in statistics textbooks, for instance in Lind et al.'s (2015) perennial standard textbook. This is an excerpt from a handy LaTeX-based statistics formulae collection that you can share and compile online. Since it is an online working copy, it will reflect updates automatically once new formulas are added or corrections made. So stay tuned.
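For readers who want the layout written out, the usual structure of such a table can be sketched in LaTeX as follows. This is a reconstruction in common textbook notation, not a copy of the formulae collection: k (number of treatments), n (total number of observations) and the labels SST, SSE, MST and MSE are assumptions about the notation.

```latex
\begin{tabular}{lcccc}
\hline
Source of variation & Sum of squares & Degrees of freedom & Mean square & $F$ \\
\hline
Treatments & $SST$ & $k - 1$ & $MST = SST / (k - 1)$ & $MST / MSE$ \\
Error      & $SSE$ & $n - k$ & $MSE = SSE / (n - k)$ & \\
Total      & $SS_{\mathrm{total}}$ & $n - 1$ & & \\
\hline
\end{tabular}
```

The R output at the end of this post follows the same layout, with the treatment row labelled ind and the error row labelled Residuals.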

So when do we use ANOVA? The underlying F distribution comes into consideration in two cases:

when we intend to test whether two samples are from populations having equal variances (σ²), or
when we compare several population means (μ) simultaneously, which is the ANOVA case.

The test statistic for ANOVA follows the F distribution, which serves as a kind of referee on whether we are able to reject the null hypothesis (H₀) in favor of the alternate hypothesis (H₁).
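To make the F statistic concrete, it can be written as the ratio of the variation between the groups to the variation within the groups. The notation here is an assumption: k groups, n_j observations in group j, n observations in total, group means x̄_j and grand mean x̄_G. Note that some textbooks use SST for the total rather than the treatment sum of squares.

```latex
\begin{align*}
SST &= \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}_G\right)^2 \\
SSE &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
F   &= \frac{SST / (k - 1)}{SSE / (n - k)}
\end{align*}
```

A large F value means the group means differ by more than the scatter within the groups would explain.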

The Test Set: Flight Surveys

The test set contains flight survey results for four different airlines. Satisfaction scores range from a minimum of 0 to a maximum of 100. The aim is to test whether the mean satisfaction level differs among the four airlines surveyed. The survey results are as follows:

Northern (a): 94, 90, 85, 80
WTA (e): 75, 68, 77, 83, 88
Pocono (o): 70, 73, 76, 78, 80, 68, 65
Branson (x): 68, 70, 72, 65, 74, 65
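If you do not have the csv file at hand, the scores listed above can also be typed into R directly. The following is a minimal sketch. The object names scores, group.sizes, group.means and grand.mean are chosen here for illustration and are not part of the original script.

```r
# Survey scores per airline, typed in from the listing above.
scores <- list(
  northern = c(94, 90, 85, 80),
  wta      = c(75, 68, 77, 83, 88),
  pocono   = c(70, 73, 76, 78, 80, 68, 65),
  branson  = c(68, 70, 72, 65, 74, 65)
)

# Number of responses and mean score per airline.
group.sizes <- sapply(scores, length)
group.means <- sapply(scores, mean)

# Grand mean over all 22 responses.
grand.mean <- mean(unlist(scores))

print(group.sizes)
print(round(group.means, 2))
print(round(grand.mean, 2))
```

The group sizes differ, so this is an unbalanced design, which a one-way ANOVA handles without further adjustment.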

The stated null and alternate hypotheses are:

At a 0.01 significance level (α) the null hypothesis states that the mean scores are the same across the four airlines.

H₀:   μₐ = μₑ = μₒ = μₓ

The alternate hypothesis states that the mean scores are not all the same across the four airlines, that is, at least one mean differs from the others.

H₁:   not all of μₐ, μₑ, μₒ, μₓ are equal
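The decision rule can also be expressed as a critical F value: H₀ is rejected if the computed F statistic exceeds it. A short sketch, assuming k = 4 airlines and n = 22 responses as in the data above (the variable names are illustrative):

```r
alpha <- 0.01   # significance level
k <- 4          # number of airlines (treatments)
n <- 22         # total number of survey responses

df.treatment <- k - 1   # numerator degrees of freedom
df.error     <- n - k   # denominator degrees of freedom

# Upper-tail critical value of the F distribution.
f.critical <- qf(1 - alpha, df1 = df.treatment, df2 = df.error)
print(round(f.critical, 2))
```

With 3 and 18 degrees of freedom this comes out at roughly 5.09, matching the value found in printed F tables.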

Scripting ANOVA with R

Once you have access to your R environment, you simply need the data as a csv file. You may download the data file survey.csv as well as the full working R script on this share.

First, import the csv file using this command.

survey = read.csv(file.choose()) # this opens up a file import window

Next you can inspect the imported data from various angles.

dim(survey) # retrieves the dimensions (rows, columns) of the R object
str(survey) # compact display of the R object structure
head(survey) # returns the first rows of the object (six by default)

After executing the head(survey) command you should get the following output:

  X...northern wta pocono branson
1           94  75     70      68
2           90  68     73      70
3           85  77     76      72
4           80  83     78      65
5           NA  88     80      74
6           NA  NA     68      65
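The X... prefix on the first column name is not part of the data. A prefix like this typically appears when the csv file starts with a UTF-8 byte order mark (BOM), although that is an assumption about survey.csv rather than something verified here. The sketch below uses a small temporary file as a hypothetical stand-in to show how read.csv can be told to skip the BOM.

```r
# Write a tiny stand-in csv that starts with a UTF-8 byte order mark (EF BB BF).
# This temporary file is a hypothetical substitute for survey.csv.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("northern,wta\n94,75\n90,68\n")), tmp)

# Declaring the encoding keeps the BOM out of the first column name.
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(clean))

# Remove the temporary file again.
unlink(tmp)
```

If you import the real file this way, the first column would be called northern, so the commands further down would need northern in place of X...northern.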

The next command attaches the data frame survey to R's search path so that its columns can be referenced directly by name.

attach(survey)

Then the columns need to be combined into a single data frame, where combined.surveys is the name chosen for the new object, with one column per airline.

combined.surveys <- data.frame(cbind(X...northern, wta, pocono, branson))
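The same data frame can be built without attach() by selecting the columns by name, which avoids accidental clashes with other objects in the workspace. A sketch under stated assumptions: the survey data frame is re-created by hand as a stand-in for the imported file, and the first column is named northern rather than X...northern.

```r
# Stand-in for the imported data: shorter columns are padded with NA,
# mirroring the shape that read.csv produces for the survey file.
survey <- data.frame(
  northern = c(94, 90, 85, 80, NA, NA, NA),
  wta      = c(75, 68, 77, 83, 88, NA, NA),
  pocono   = c(70, 73, 76, 78, 80, 68, 65),
  branson  = c(68, 70, 72, 65, 74, 65, NA)
)

# Select the columns by name instead of attaching the data frame.
combined.surveys <- survey[, c("northern", "wta", "pocono", "branson")]

# Column means, skipping the NA padding.
print(round(colMeans(combined.surveys, na.rm = TRUE), 2))
```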

The summary command provides additional useful insights like mean, median, etc.

summary(combined.surveys)

After the summary(combined.surveys) command you should get the following output:

  X...northern        wta           pocono         branson
 Min.   :80.00   Min.   :68.0   Min.   :65.00   Min.   :65.00
 1st Qu.:83.75   1st Qu.:75.0   1st Qu.:69.00   1st Qu.:65.75
 Median :87.50   Median :77.0   Median :73.00   Median :69.00
 Mean   :87.25   Mean   :78.2   Mean   :72.86   Mean   :69.00
 3rd Qu.:91.00   3rd Qu.:83.0   3rd Qu.:77.00   3rd Qu.:71.50
 Max.   :94.00   Max.   :88.0   Max.   :80.00   Max.   :74.00
 NA's   :3       NA's   :2                      NA's   :1

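Next the columns need to be stacked into a long format, with one row per observation, so that the sums of squares (SS) can be computed. The name stacked.surveys is the name chosen for the new object, which holds the scores in the column values and the airline in the column ind. Entering stacked.surveys on its own simply displays the result.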

stacked.surveys <- stack(combined.surveys)
stacked.surveys
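To see what stack() does, it helps to try it on a very small data frame. The example below is a sketch using a hypothetical two-column excerpt (wide, long and long.complete are illustrative names):

```r
# A small wide-format excerpt with one NA as padding.
wide <- data.frame(
  northern = c(94, 90, NA),
  wta      = c(75, 68, 77)
)

# stack() returns one row per cell: the score in 'values' and
# the originating column name as a factor in 'ind'.
long <- stack(wide)
print(long)

# The NA padding is kept by stack(); na.omit() would drop it explicitly.
long.complete <- na.omit(long)
print(nrow(long.complete))
```

The NA rows do not need to be removed by hand for the next step, since aov() drops incomplete observations by default and reports how many were deleted.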

Now you perform the actual ANOVA calculation. The name anova.result is the name chosen for the results object.

anova.result <- aov(values ~ ind, data = stacked.surveys)

Finally, the command summary(anova.result) renders the ANOVA results.

summary(anova.result)

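            Df Sum Sq Mean Sq F value   Pr(>F)
ind          3  890.7  296.89   8.991 0.000743 ***
Residuals   18  594.4   33.02
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6 observations deleted due to missingness

The p-value of 0.000743 is well below the 0.01 significance level, so we reject H₀ and conclude that the mean satisfaction scores are not all the same across the four airlines.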
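As a cross-check, the figures in this table can be reproduced by hand from the raw scores. The following sketch recomputes the sums of squares, mean squares, F value and p-value. All variable names are chosen for illustration, and the scores are typed in directly rather than read from survey.csv.

```r
# Raw scores per airline, as listed in the test set.
scores <- list(
  northern = c(94, 90, 85, 80),
  wta      = c(75, 68, 77, 83, 88),
  pocono   = c(70, 73, 76, 78, 80, 68, 65),
  branson  = c(68, 70, 72, 65, 74, 65)
)

k <- length(scores)            # number of groups
n <- length(unlist(scores))    # total number of observations

group.sizes <- sapply(scores, length)
group.means <- sapply(scores, mean)
grand.mean  <- mean(unlist(scores))

# Between-group (treatment) and within-group (error) sums of squares.
ss.treatment <- sum(group.sizes * (group.means - grand.mean)^2)
ss.error     <- sum(sapply(scores, function(x) sum((x - mean(x))^2)))

# Mean squares, F statistic and upper-tail p-value.
ms.treatment <- ss.treatment / (k - 1)
ms.error     <- ss.error / (n - k)
f.value      <- ms.treatment / ms.error
p.value      <- pf(f.value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(round(c(ss.treatment = ss.treatment, ss.error = ss.error), 1))
print(round(c(ms.treatment = ms.treatment, ms.error = ms.error), 2))
print(round(f.value, 3))
print(signif(p.value, 3))
```

The rounded results should agree with the Sum Sq, Mean Sq, F value and Pr(>F) columns printed by summary(anova.result).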

References

YouTube video on ANOVA by statisticfun 2014
Lind, D.A., Marchal, W.G., and Wathen, S.A. (2015). Statistical Techniques in Business and Economics (New York, NY: McGraw-Hill Education).
