1 Logistic regression

We use logistic regression so that we can apply our simple linear regression methods to categorical data.

1.1 Linear Regression

In a linear regression, the data we are trying to predict is a continuous value that varies with our predictors.
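
As a minimal illustration (the variable names and simulated values here are assumptions, not taken from any real dataset), continuous data of this kind could be generated and plotted with:

library(ggplot2)

# Simulated continuous outcome: y varies smoothly with x, plus some noise
x_lin <- runif(100)
y_lin <- 2 * x_lin + rnorm(100, sd = 0.2)
qplot(x_lin, y_lin)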

1.2 Binomial class

In a logistic regression, the data we are trying to predict is a categorical class, typically coded as 0 or 1.
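
As a rough sketch (again with assumed variable names and simulated values, and with ggplot2 loaded above), a binary outcome plotted against a predictor could be produced like this:

# Simulated binary outcome: the 0/1 class depends on x
x_bin <- runif(100)
y_bin <- rbinom(100, size = 1, prob = x_bin)
qplot(x_bin, y_bin)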

1.3 Probability

Fitting a standard linear regression to this data would result in a lot of error; instead, we need some way of transforming the output of a linear regression so that its values map onto the two binomial outcomes.

We could use the probability of a given outcome, since this gives us a continuous range of values rather than just the two classes.

# Probabilities lie in the range 0 to 1
prob_y <- seq(0, 1, by = .001)
qplot(prob_y, prob_y)

The trouble with probabilities is that they must lie between 0 and 1 to make sense, and it would be tough to constrain a linear regression to this range.

1.4 Odds

To break out of this range, we could use odds: the probability of something happening divided by the probability of it not happening. This gives us a more dispersed set of values.

# Odds: probability of the outcome divided by the probability of not the outcome
odds_y <- prob_y / (1 - prob_y)
qplot(prob_y, odds_y)

Using odds allows us to exceed 1, but negative values are still not possible. Additionally, the distribution of values is difficult to model linearly.

1.5 Logit

To get both positive and negative values, we can take the log of the odds.

qplot(prob_y, log(odds_y))

This now gives us a strong dispersal of values into positive and negative ranges, with a distribution much more suited to a linear regression model.

We can take a nuanced probabilistic approach to the values, or simply say that if the logit is positive we predict outcome 1, and otherwise outcome 0.

1.6 Transforming logits

Getting back to a probability from a logit (or vice versa) is pretty simple, but I wrote some helper functions in the optiRum package to facilitate this.

library(optiRum)

logits     <- -4:4                # a range of logit values
odds       <- logit.odd(logits)   # logit -> odds
probs      <- odd.prob(odds)      # odds -> probability
pred_class <- logits >= 0         # simple classification rule

knitr::kable(data.frame(logits, odds, probs, pred_class))
 logits        odds      probs  pred_class
     -4   0.0183156  0.0179862       FALSE
     -3   0.0497871  0.0474259       FALSE
     -2   0.1353353  0.1192029       FALSE
     -1   0.3678794  0.2689414       FALSE
      0   1.0000000  0.5000000        TRUE
      1   2.7182818  0.7310586        TRUE
      2   7.3890561  0.8807971        TRUE
      3  20.0855369  0.9525741        TRUE
      4  54.5981500  0.9820138        TRUE
prob.odd
## function (prob) 
## {
##     prob/(1 - prob)
## }
## <environment: namespace:optiRum>
odd.logit
## function (odds) 
## {
##     log(odds)
## }
## <environment: namespace:optiRum>
logit.odd
## function (logit) 
## {
##     exp(logit)
## }
## <environment: namespace:optiRum>
odd.prob
## function (odds) 
## {
##     odds/(1 + odds)
## }
## <environment: namespace:optiRum>
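
For comparison, base R provides plogis() and qlogis(), which perform the same probability/logit conversions; a quick check against the vectors above (a sketch, assuming the logits and probs objects created earlier):

# qlogis(p) is log(p / (1 - p)); plogis(x) is exp(x) / (1 + exp(x))
all.equal(plogis(logits), probs)
all.equal(qlogis(probs), logits)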

2 Analysis workflow

  • Understand the problem
  • Gather some data
  • Clean the data
  • Clean the data some more
  • Exploration and visualisation
  • Feature Reduction
  • Feature Engineering
  • (Depending on models) Feature Selection
  • Candidate model builds
  • Evaluate models
  • In-depth evaluation of selected model
  • Productionising

3 Sources of change in analysis

3.1 Exercise

What sort of things can alter the results of a piece of analysis?

3.2 Answers

  • Changes in data
  • Changes in code behaviours
  • Changes in behaviours in dependencies
  • Randomness

4 Accounting for change

4.1 Exercise

What sort of things can we do to prevent changes creeping into our analysis that stop it from being “deterministic”?

4.2 Answers

  • Checksums to flag if anything has changed
  • Keeping a separate copy of the data
  • Keeping dependencies the same over time
  • Source control
  • Unit testing and validating code
  • set.seed to fix random number generation (see the sketch after this list)
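
As a small sketch of two of these safeguards (the file path is a placeholder, not a real file from this project), a checksum can flag data changes and set.seed makes any sampling repeatable:

# Record a checksum of the raw data so any later change is flagged
tools::md5sum("data/raw_data.csv")

# Fix the random number generator so sampling steps give the same result each run
set.seed(2016)
sample(10, 3)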

5 GLM step-by-step – Project setup

5.1 Project checklist

  • Git
  • Project options
    • No Rdata or history!
    • Insert spaces for tabs
  • Packrat + packrat::init()
  • Folder structure
    • data
    • processeddata
    • analysis
    • outputs
    • docs
  • DESCRIPTION
  • LICENSE
  • .Rbuildignore
  • README.Rmd
  • Makefile
  • .travis.yml

5.2 Travis setup

5.3 Github setup

6 GLM step-by-step – Data

  • Source
  • Verification steps
  • Multiple outputs?
  • Main report
  • Supplementary data quality report
  • Shiny?

7 GLM step-by-step – Data processing

  • Cleaning steps
  • Sampling (a sketch of this and feature scaling follows this list)
  • Feature scaling
  • Univariate analysis
  • Bivariate analysis
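
A hedged sketch of the sampling and feature scaling steps (the data frame and column names are invented for illustration):

# Toy data frame standing in for the cleaned data
set.seed(2016)
policies <- data.frame(premium = rnorm(100, mean = 500, sd = 100),
                       claimed = rbinom(100, size = 1, prob = 0.2))

# Sampling: hold out 30% of rows for testing
test_rows <- sample(nrow(policies), size = floor(0.3 * nrow(policies)))
train     <- policies[-test_rows, ]
test      <- policies[test_rows, ]

# Feature scaling: centre and scale a numeric column
train$premium_scaled <- as.numeric(scale(train$premium))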

8 GLM step-by-step – Candidate models

  • Feature selection
  • Various glm* models (a minimal sketch follows this list)
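
As a minimal sketch of fitting one candidate logistic regression with base R's glm (continuing the invented train/test split from the data processing sketch above):

# Logistic regression: binomial family, logit link by default
candidate_1 <- glm(claimed ~ premium, data = train, family = binomial())
summary(candidate_1)

# Predicted probabilities for the held-out rows
test$pred_prob <- predict(candidate_1, newdata = test, type = "response")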

9 GLM step-by-step – Evaluation

  • Scaling sample
  • Single model evaluation techniques (see the sketch after this list)
  • Comparing multiple models
  • Cross-validation
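
A very small sketch of single-model evaluation, continuing the example above and using the simple 0.5 probability (logit >= 0) cut-off from section 1:

# Classify with the 0.5 probability cut-off and compare against the actual class
test$pred_class <- test$pred_prob >= 0.5
table(actual = test$claimed, predicted = test$pred_class)

# Overall accuracy of that cut-off
mean(test$pred_class == (test$claimed == 1))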

10 GLM step-by-step – Model selection

  • Using evaluation metrics to select best model
  • Presenting model
  • In-depth evaluation of best model

11 GLM step-by-step – Supplementary materials

  • Data lineage
  • Data quality
  • Feature analysis in-depth
  • Candidate model evaluations
  • Code
  • Reproducibility info