Linear Regression using R

Wolfgang

7 years ago

In a similar vein to the previous post, which dealt with analysis of variance (ANOVA), let’s shift our focus to another problem-solving approach: linear regression. In principle, this model emphasizes the ability to predict an outcome from a given set of variables. In other cases it is used to discern whether there are relations or dependencies among certain facts and whether one influences the other. For example, suppose you are tasked with examining whether the number of sales calls made by a sales representative during a certain period has a bearing on the number of copiers sold. You might suspect that more sales calls result in more copiers sold. Statisticians have their own vocabulary for the components of a regression analysis. The variable assumed to drive or influence the outcome is called the independent variable, while the outcome itself is called the dependent variable. In our example, the number of sales calls is the independent variable that allegedly drives the number of copiers sold, which is therefore the dependent variable. We will take up this example and work through it using the R statistics package.

To be clear about the results, have a look at one of the formulas dealing with linear regression. It is taken from a statistics textbook, for instance the perennial standard text by Lind et al. (2015). This is an excerpt from a handy LaTeX-based statistics formulae collection that you can share and compile online. There you will find all relevant formulas pertaining to linear regression. Since it is an online working copy, it reflects updates automatically once new formulas are added or corrections are made. So stay tuned.
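For orientation, here is a minimal sketch of the standard textbook formulas for simple linear regression with one independent variable. The symbols (a, b, r, s_x, s_y) are assumptions chosen for this sketch and may differ from the notation used in the formulae collection.

```latex
% Fitted regression line: predicted value of y for a given x
\hat{y} = a + b\,x

% Slope: correlation coefficient times the ratio of the standard deviations,
% which equals the sum of cross products over the sum of squares of x
b = r\,\frac{s_y}{s_x}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

% Intercept: the line passes through the point of means
a = \bar{y} - b\,\bar{x}
```

In the sales example, x would be the number of sales calls and y the number of copiers sold.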

The Sales Representative Example

Let’s turn our attention to the aforementioned example taken from Lind et al. (2015). Is the number of sales calls related to the number of copiers sold? Refer to the listing on the right.

Listing of sales calls and copiers sold by sales representative.

The sample comprises 15 sales reps with their respective number of sales calls and copiers sold. At first glance it looks as if there is a positive relation between calls and sales, but how strong is this trend actually? To illustrate the relationship graphically we usually resort to a scatter diagram, plotting the number of sales calls (independent variable) on the x-axis and the resulting copier sales (dependent variable) on the y-axis.

Apparently there is a positive, that is, upward relationship between calls and sales, though not as strong as we might have assumed. The regression line is the least squares line, that is, the line that minimizes the sum of squared vertical distances to the plotted points. Even so, some points lie far from the line, especially in the range 80 < x < 100.

Scatter diagram of the sales calls and copiers sold including a regression line.
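To make the least squares idea concrete, here is a small sketch that computes slope and intercept by hand and compares them with what lm() returns. It uses only the six rows printed by head() later in this post, not the full sample of 15 reps, so its numbers will not match the full-sample results. The vector names and the sse() helper are made up for this illustration.

```r
# Six (calls, copiers) pairs copied from the head() excerpt in this post.
# Assumption: this is only an excerpt, not the full 15-rep sample.
calls   <- c(96, 40, 104, 128, 164, 76)
copiers <- c(41, 41, 51, 60, 61, 29)

# Least squares slope and intercept computed by hand
b <- cov(calls, copiers) / var(calls)
a <- mean(copiers) - b * mean(calls)

# Hypothetical helper: sum of squared residuals for an arbitrary line
sse <- function(intercept, slope) {
  sum((copiers - (intercept + slope * calls))^2)
}

# lm() arrives at the same line
fit <- lm(copiers ~ calls)
same_line <- isTRUE(all.equal(unname(coef(fit)), c(a, b)))

# Any other line, e.g. one with a slightly steeper slope, has a larger error
worse_line <- sse(a, b) < sse(a, b + 0.05)

print(c(same_line = same_line, worse_line = worse_line))
```

Both checks should come out TRUE: the hand-computed line is the one lm() finds, and nudging the slope away from it increases the sum of squared residuals.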

Scripting with R

Once you have access to your R environment, you simply need the data as a CSV file. You may download the data file survey.csv as well as the full working R script on this share.

First, import the CSV file using this command.

sales_rep = read.csv(file.choose()) # this opens up a file import window

Next, you can inspect the imported data from various angles.

dim(sales_rep) # returns the dimensions of the object (rows, columns)
str(sales_rep) # compact display of the object's structure
head(sales_rep) # returns the first rows of the object (six by default)

After executing the head(sales_rep) command you should get the following output. Of course, it shows only an excerpt of the sample.

       sales_rep sales_call copiers_sold
1   Brian Virost         96           41
2 Carlos Ramirez         40           41
3     Carol Saia        104           51
4      Greg Fish        128           60
5      Jeff Hall        164           61
6  Mark Reynolds         76           29

With the next command you tell R to attach the data frame sales_rep to the search path, so that its columns can be referred to directly by name.

attach(sales_rep)

The following command produces the scatter diagram above.

plot(sales_call, copiers_sold, pch = 16, cex = 1.3, col = "blue", main = "Copier Sales based on Sales Calls", xlab = "# sales calls", ylab = "# copiers sold")

Now add a regression line to the graph.

abline(lm(copiers_sold ~ sales_call))

Next, fit the regression model with copiers_sold as a function of sales_call.

regression <- lm(copiers_sold ~ sales_call)
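As an aside, the same kind of model can be fitted without attach() by handing the data frame to lm() through its data argument. The sketch below is self-contained and therefore rebuilds only the six rows shown in the head() excerpt. The name sales_excerpt is made up for this illustration, and the resulting coefficients will differ from those of the full 15-rep sample.

```r
# Six rows copied from the head() excerpt above.
# Assumption: an excerpt only, so results differ from the full sample.
sales_excerpt <- data.frame(
  sales_call   = c(96, 40, 104, 128, 164, 76),
  copiers_sold = c(41, 41, 51, 60, 61, 29)
)

# Fit the model; 'data =' tells lm() where to look up the variables
fit <- lm(copiers_sold ~ sales_call, data = sales_excerpt)

# Intercept and slope of the fitted line
print(coef(fit))

# Predicted copiers sold for a rep making 100 sales calls
pred_100 <- predict(fit, newdata = data.frame(sales_call = 100))
print(pred_100)
```

Passing the data frame explicitly avoids name clashes on the search path, which can happen with attach() when several objects share a name.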

Display the results.

summary(regression)

It produces all relevant regression numbers.

Call:
lm(formula = copiers_sold ~ sales_call)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.873  -2.861   0.255   3.511  10.595 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19.9800     4.3897   4.552 0.000544 ***
sales_call    0.2606     0.0420   6.205 3.19e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.72 on 13 degrees of freedom
Multiple R-squared: 0.7476, Adjusted R-squared: 0.7282 
F-statistic: 38.5 on 1 and 13 DF, p-value: 3.193e-05
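To see how these numbers fit together, the following sketch redoes some of the arithmetic using only values copied from the summary output above. The helper predict_copiers() is a made-up name for this illustration, not part of the original script.

```r
# Values copied from the summary output above
intercept      <- 19.9800
slope          <- 0.2606
slope_std_err  <- 0.0420
r_squared      <- 0.7476

# Hypothetical helper: predicted copiers sold for a given number of sales calls
predict_copiers <- function(calls) intercept + slope * calls

# A rep making 100 sales calls is predicted to sell about 46 copiers
print(predict_copiers(100))   # 46.04

# The t value of the slope is the estimate divided by its standard error
t_slope <- slope / slope_std_err
print(round(t_slope, 3))

# With one independent variable, the correlation coefficient is the square
# root of R-squared, with the sign of the slope (positive here)
r <- sqrt(r_squared)
print(round(r, 3))
```

Read this way, each additional sales call is associated with roughly 0.26 more copiers sold, and about 75% of the variation in copiers sold is accounted for by the number of sales calls. The small p-value of the slope (3.19e-05) indicates that the relationship is statistically significant at the usual levels.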

References

YouTube video on linear regression by statisticfun 2014
Lind, D.A., Marchal, W.G., and Wathen, S.A. (2015). Statistical Techniques in Business and Economics (New York, NY: McGraw-Hill Education).
