Steph Locke
2017-04-20
Understand what a logistic regression is, how to prepare data for a logistic regression, and how to evaluate a model.
Founded by Steph Locke (that’s me!), Locke Data is a data science consultancy focused on helping organisations get the most out of their data science resources. While we’re happy to do data science projects for you, we’d really like to set you up to do them yourself!
Locke Data offers a broad range of services including:
If you’d like more information about our services please get in touch via our website, itsalocke.com.
I am a Microsoft Data Platform MVP with a decade of business intelligence and data science experience.
Having worked in a variety of industries – including finance, utilities, insurance, and cyber-security – I’ve tackled a wide range of business challenges with data.
However, I’m probably best known for my community activities, including presenting, training, blogging, and speaking on panels and webinars.
If you have any questions about today’s session, community activity, or data science in general, please get in touch via Locke Data, or my Twitter, @SteffLocke
A logistic regression is a linear regression, applied to categorical outcomes by using a transformation function.
A linear regression uses a line of best fit (the old \(y = mx + c\)) over multiple variables to predict a continuous variable.
If you’re trying to predict whether someone survives (1) or dies (0), does it make sense to say they’re -0.2 alive, 0.5 alive, or 1.1 alive?
We can measure the probability of someone surviving. This gives us data in the range \([0,1]\), which is better, but still not our ideal of \([-\infty,+\infty]\).
The next step is to convert probabilities to odds, \(p/(1-p)\). As probability can never be less than 0 or greater than 1, odds give us a range of \([0,+\infty)\).
The final step in this transformation is to take the log of the odds, which is commonly called the logit. This gets us to \([-\infty,+\infty]\).
logits | odds | probs | pred_class |
---|---|---|---|
-4 | 0.0183156 | 0.0179862 | FALSE |
-3 | 0.0497871 | 0.0474259 | FALSE |
-2 | 0.1353353 | 0.1192029 | FALSE |
-1 | 0.3678794 | 0.2689414 | FALSE |
0 | 1.0000000 | 0.5000000 | TRUE |
1 | 2.7182818 | 0.7310586 | TRUE |
2 | 7.3890561 | 0.8807971 | TRUE |
3 | 20.0855369 | 0.9525741 | TRUE |
4 | 54.5981500 | 0.9820138 | TRUE |
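The table above can be reproduced directly in R by inverting the two transformations, first exponentiating the logits to get odds, then converting odds to probabilities:

```r
# Reproduce the logit -> odds -> probability mapping shown above
logits <- -4:4
odds   <- exp(logits)          # invert the log
probs  <- odds / (1 + odds)    # equivalently plogis(logits)
pred_class <- probs >= 0.5     # classify at the 0.5 threshold

data.frame(logits, odds, probs, pred_class)
```

Note that a logit of 0 corresponds to odds of 1 and a probability of exactly 0.5, the usual decision boundary.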
glm()
The glm function is used for fitting logistic regressions. It can fit other generalised linear models too, via its family argument.
glm(vs ~ mpg, data = mtcars, family = binomial(link = "logit"))
##
## Call: glm(formula = vs ~ mpg, family = binomial(link = "logit"), data = mtcars)
##
## Coefficients:
## (Intercept) mpg
## -8.8331 0.4304
##
## Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
## Null Deviance: 43.86
## Residual Deviance: 25.53 AIC: 29.53
R uses a formula system for specifying a model:

- The tilde (`~`) is used for saying "predicted by"
- Add `-1` to your formula to remove the intercept
- Use `.` to predict by all other variables e.g. `y ~ .`
- Use `+` to provide multiple independent variables e.g. `y ~ a + b`
- Use `:` to use the interaction of two variables e.g. `y ~ a:b`
- Use `*` to use two variables and their interaction e.g. `y ~ a*b` (equivalent to `y ~ a + b + a:b`)
- Apply transformations inline e.g. `y ~ log(x)`, or use `I()` when adding values e.g. `y ~ I(a+b)`

For more info, check out `?formula`
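To make the operators concrete, here is a quick sketch (an assumption: these fits use the mtcars data, as elsewhere in this deck) of a few formula variants with glm:

```r
# Fit the same outcome with different formula operators (sketch on mtcars)
m1 <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)  # two main effects
m2 <- glm(vs ~ mpg:wt,   data = mtcars, family = binomial)  # interaction term only
m3 <- glm(vs ~ mpg * wt, data = mtcars, family = binomial)  # main effects + interaction

names(coef(m3))  # "(Intercept)" "mpg" "wt" "mpg:wt"
```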
Other useful arguments to glm():

- `na.action` can be set to amend the handling of missings in the data
- `model`, `x`, and `y` control whether you get extra info about the model and data back. Setting these to `FALSE` saves space.

glm

Some useful functions for working with a fitted glm:
Function | Purpose |
---|---|
coefficients | Extract coefficients |
summary | Output a basic summary |
fitted | Return the predicted values for the training data |
predict | Predict some values for new data |
plot | Produce some basic diagnostic plots |
residuals | Return the errors on predicted values for the training data |
You can provide a glm with continuous and categorical variables.
There are many ways to explore your data for outliers, patterns, issues etc.
library(caret)
featurePlot(mtcarsVars, mtcarsOut)
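The objects `mtcarsVars` and `mtcarsOut` aren't defined on this slide; a plausible setup (an assumption, with the definitions inferred from how the names are used later) is:

```r
# Assumed setup (hypothetical, not shown on the slide): split mtcars into
# a predictor data frame and a factor outcome for vs (engine shape).
mtcarsVars <- mtcars[, setdiff(names(mtcars), "vs")]  # every column except vs
mtcarsOut  <- factor(mtcars$vs)                       # binary outcome as a factor
```

These are the objects passed to featurePlot() above and to createDataPartition() in the sampling step.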
Commonly, we will take a training sample and a testing sample.
set.seed(77887)
trainRows <- createDataPartition(mtcarsOut, p = .7, list = FALSE)
training_x <- mtcarsVars[trainRows, ]
training_y <- mtcarsOut[trainRows]
testing_x <- mtcarsVars[-trainRows, ]
testing_y <- mtcarsOut[-trainRows]
Sampling before scaling and other preprocessing prevents information about the test data leaking into our model. By preventing such leaks we get a truer view of how well our model generalises later.
Perform z-score scaling in R with the `scale` function:
x <- rnorm(50, mean = 50, sd = 10)
x_s <- scale(x, center = TRUE, scale = TRUE)
summary(x_s)
## V1
## Min. :-2.47800
## 1st Qu.:-0.51156
## Median :-0.02056
## Mean : 0.00000
## 3rd Qu.: 0.72480
## Max. : 2.75145
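Under the hood, `scale` with both options on is plain z-scoring: subtract the mean, divide by the standard deviation. A quick check (re-generating the data with a seed for reproducibility; the slide itself sets none):

```r
set.seed(123)
x <- rnorm(50, mean = 50, sd = 10)
x_s <- scale(x, center = TRUE, scale = TRUE)

# scale(center = TRUE, scale = TRUE) is exactly (x - mean(x)) / sd(x)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(x_s), manual)  # TRUE
```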
Use `caret` to scale multiple variables simultaneously and get a reusable scaling model for applying to test data, and eventually production data.
transformations <- preProcess(training_x)
scaledVars <- predict(transformations, training_x)
knitr::kable(t(summary(scaledVars)))
Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
mpg | -1.6256 | -0.7609 | -0.1762 | 0.0000 | 0.5485 | 2.2450 |
cyl | -1.2291 | -1.2291 | -0.1418 | 0.0000 | 0.9455 | 0.9455 |
disp | -1.3730 | -0.8722 | 0.2015 | 0.0000 | 0.8418 | 2.0043 |
hp | -1.35226 | -0.75519 | -0.02242 | 0.00000 | 0.38467 | 2.48799 |
drat | -1.3944 | -0.8474 | 0.0989 | 0.0000 | 0.6198 | 2.3736 |
wt | -1.8341 | -0.5987 | 0.3213 | 0.0000 | 0.5562 | 2.3459 |
qsec | -1.6263 | -0.5026 | -0.1018 | 0.0000 | 0.6703 | 2.5046 |
am | -0.7142 | -0.7142 | -0.7142 | 0.0000 | 1.3392 | 1.3392 |
gear | -0.8462 | -0.8462 | -0.8462 | 0.0000 | 0.3702 | 1.5866 |
carb | -1.0299 | -0.4521 | -0.4521 | 0.0000 | 0.7033 | 3.0142 |
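The key point is that held-out data must be scaled with the training set's means and standard deviations, not its own. A base-R sketch of the idea behind preProcess + predict (the `train`/`test` split and `scale_with` helper here are illustrative, not from the slides):

```r
# Learn centre/scale on training data, apply them unchanged to test data.
train <- mtcars[1:23, ]   # arbitrary illustrative split
test  <- mtcars[24:32, ]

centres <- sapply(train, mean)
scales  <- sapply(train, sd)

# Hypothetical helper: scale every column of df with the supplied statistics
scale_with <- function(df, centres, scales) {
  as.data.frame(Map(function(col, m, s) (col - m) / s, df, centres, scales))
}

train_s <- scale_with(train, centres, scales)
test_s  <- scale_with(test, centres, scales)

round(colMeans(train_s), 10)  # training means are exactly 0
colMeans(test_s)              # test means are NOT exactly 0 - no leakage
```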
`caret` is very useful for these preprocessing tasks.
Common methods for coping with missing data include dropping incomplete rows, or imputing values, e.g. with the mean, median, or a predictive model.
I try to build some candidate models:
fullmodel <- glm(training_y ~ ., data = training_x, family = binomial(link = "logit"))
steppedmodel <- step(fullmodel, direction = "both", trace = FALSE)
summary(steppedmodel)
##
## Call:
## glm(formula = training_y ~ wt + qsec, family = binomial(link = "logit"),
## data = training_x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.747e-05 -2.100e-08 -2.100e-08 2.100e-08 3.668e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1150.82 949688.13 -0.001 0.999
## wt -57.98 55673.18 -0.001 0.999
## qsec 74.94 58397.10 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3.1492e+01 on 22 degrees of freedom
## Residual deviance: 2.8881e-09 on 20 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25

The enormous standard errors and 25 Fisher scoring iterations are a warning sign: on this tiny training set the classes are perfectly separated, so the coefficient estimates are unstable and the p-values meaningless.
broom
Use broom
to make tidy versions of model outputs.
library(broom)
# Coefficients
knitr::kable(tidy(steppedmodel))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -1150.81869 | 949688.13 | -0.0012118 | 0.9990331 |
wt | -57.98390 | 55673.18 | -0.0010415 | 0.9991690 |
qsec | 74.94005 | 58397.10 | 0.0012833 | 0.9989761 |
# Fitted data
knitr::kable(head(augment(steppedmodel)))
.rownames | training_y | wt | qsec | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|---|
Hornet 4 Drive | 1 | 3.215 | 19.44 | 119.59772 | 93130.23 | 0 | 1.90e-06 | 1.23e-05 | 0 | 0 |
Hornet Sportabout | 0 | 3.440 | 17.02 | -74.80359 | 60604.06 | 0 | 8.00e-07 | 1.23e-05 | 0 | 0 |
Valiant | 1 | 3.460 | 20.22 | 163.84490 | 128950.92 | 0 | 3.70e-06 | 1.23e-05 | 0 | 0 |
Duster 360 | 0 | 3.570 | 15.84 | -170.77076 | 132543.30 | 0 | 3.90e-06 | 1.23e-05 | 0 | 0 |
Merc 240D | 1 | 3.190 | 20.00 | 163.01374 | 126134.55 | 0 | 3.50e-06 | 1.23e-05 | 0 | 0 |
Merc 230 | 1 | 3.150 | 22.90 | 382.65926 | 295784.32 | 0 | 1.94e-05 | 1.23e-05 | 0 | 0 |
# Key statistics
knitr::kable(glance(steppedmodel))
null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|
31.49235 | 22 | 0 | 6 | 9.406483 | 0 | 20 |
deviance(fullmodel)
## [1] 4.780243e-10
AIC(fullmodel)
## [1] 22
training_pred <- ifelse(predict(steppedmodel, training_x) > 0, "1", "0")
# newer caret versions require both arguments to be factors
confusionMatrix(factor(training_pred), factor(training_y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13 0
## 1 0 10
##
## Accuracy : 1
## 95% CI : (0.8518, 1)
## No Information Rate : 0.5652
## P-Value [Acc > NIR] : 2e-06
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5652
## Detection Rate : 0.5652
## Detection Prevalence : 0.5652
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
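Why threshold at 0? `predict` on a glm returns the link (log-odds) scale by default, and a log-odds of 0 corresponds to a probability of 0.5; `type = "response"` returns probabilities directly. A sketch on a small standalone refit:

```r
# Small illustrative model (assumption: mtcars, as elsewhere in this deck)
model <- glm(vs ~ mpg + wt, data = mtcars, family = binomial)

link_scores <- predict(model)                     # log-odds (the default scale)
probs       <- predict(model, type = "response")  # probabilities

# Thresholding log-odds at 0 is identical to thresholding probabilities at 0.5
identical(link_scores > 0, probs > 0.5)  # TRUE
```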
testing_pred <- ifelse(predict(fullmodel, testing_x) > 0, "1", "0")
confusionMatrix(factor(testing_pred), factor(testing_y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5 2
## 1 0 2
##
## Accuracy : 0.7778
## 95% CI : (0.3999, 0.9719)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.1575
##
## Kappa : 0.5263
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.5000
## Pos Pred Value : 0.7143
## Neg Pred Value : 1.0000
## Prevalence : 0.5556
## Detection Rate : 0.5556
## Detection Prevalence : 0.7778
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : 0
##