R for developers

Steph Locke

2017-04-20

Steph Locke | Who I am

R

R | A brief history

  • S, born 1976 at Bell Laboratories
  • R, born 1997 at the University of Auckland, NZ
  • R Foundation, born 2002
  • R Consortium, born 2015
  • R now ranked 6th most popular language by IEEE

R | Why use it?

  • For the stats
  • For the cross-platform capability
  • For the reproducibility
  • For the time it can save
  • For the flexibility
  • For the money

R | Why not Python?

Firstly, Python is a valid way to go. There are a number of really good libraries out there for number crunching etc., and it is a well-written language with few “quirks”.

Secondly, why R is my preference:

  • Easy install
  • Strong community base
  • Data focused
  • Very flexible and extendable
  • It’s the favourite to win

But… “production” code can be faster in Python

R fundamentals

Basics

# Define a variable
a<-25

# Call a variable
a
## [1] 25
# Do something to it
a+1
## [1] 26

Data types | p1

# Numeric
25
## [1] 25
# Character
"25"
## [1] "25"
# Logical
TRUE
## [1] TRUE

Data types | p2

# Dates
as.Date("2015-08-05")
## [1] "2015-08-05"
as.POSIXct("2015-08-01", tz = "UTC") # set tz explicitly so the result does not depend on the machine
## [1] "2015-08-01 UTC"
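Once a value is a Date, it supports arithmetic and formatting. The following is a small sketch using base R only; the variable names `d` and `days_between` are made up for illustration.

```r
d <- as.Date("2015-08-05")

# Adding a number to a Date moves it forward by that many days
later <- d + 30
later                               # "2015-09-04"

# Subtracting two dates gives a difftime; as.numeric() extracts the day count
days_between <- as.numeric(d - as.Date("2015-08-01"))
days_between                        # 4

# format() turns a date back into text using strptime-style codes
format(d, "%Y/%m/%d")               # "2015/08/05"
```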

Data types | p3

# Factor
as.numeric(factor("25"))
## [1] 1
as.character(factor("25"))
## [1] "25"
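The 1 returned by as.numeric() above is the factor's internal level code, not the label "25". A minimal sketch of the usual workaround, assuming the labels are valid numbers (the names `f` and `recovered` are illustrative):

```r
# A factor stores integer codes that index into its levels
f <- factor(c("25", "30", "25"))

levels(f)        # the distinct labels: "25" "30"
as.numeric(f)    # the internal codes: 1 2 1

# Convert the labels, not the codes, to recover the original numbers
recovered <- as.numeric(as.character(f))
recovered        # 25 30 25
```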

Constructs | p1

# Vector
a<-c(25, 30)

# Matrix
matrix(a)
##      [,1]
## [1,]   25
## [2,]   30

Constructs | p2

# Data frame
data.frame(a,b=a/5,c=LETTERS[1:2])
##    a b c
## 1 25 5 A
## 2 30 6 B
# List
list(vector=a, matrix=matrix(a))
## $vector
## [1] 25 30
## 
## $matrix
##      [,1]
## [1,]   25
## [2,]   30
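Having built these constructs, the pieces inside them can be pulled back out by name. A short sketch, reusing the objects from the slides above (the names `df` and `l` are chosen here for illustration):

```r
a  <- c(25, 30)
df <- data.frame(a, b = a / 5)
l  <- list(vector = a, matrix = matrix(a))

df$b            # a column as a vector: 5 6
l$vector        # a list element by name: 25 30
l[["matrix"]]   # the same idea with [[ ]], handy when the name is held in a variable
l["vector"]     # single brackets keep the list wrapper (a list of length 1)
```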

Subsetting | Vectors

a <- sample(1:20, size = 5, replace = TRUE) # setup
a # visual check
## [1]  3 18 15  7 12
a[1:2] # positions
## [1]  3 18
a[a<=10] # value filters
## [1] 3 7

Subsetting | Data.frames p1

df <- data.frame(a=1:10, b = LETTERS[1:5]) # setup
df[1:2,] # row numbers
##   a b
## 1 1 A
## 2 2 B
df[df$a<2,] # value filters
##   a b
## 1 1 A

Subsetting | Data.frames p2

df[df$a<3,1] # column filter
## [1] 1 2
df[df$a<3,1, drop=FALSE] # column filter (keep data.frame)
##   a
## 1 1
## 2 2
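The same bracket syntax extends to combined conditions and to selecting columns by name. This is a sketch in base R; `filtered`, `by_name` and `same` are illustrative names.

```r
df <- data.frame(a = 1:10, b = LETTERS[1:5])  # b is recycled to fill 10 rows

# Combine conditions with & (and) or | (or)
filtered <- df[df$a > 2 & df$b == "A", ]
filtered

# Select columns by name rather than by position
by_name <- df[df$a < 3, c("b", "a")]
by_name

# subset() is a base R convenience that applies the same filter as above
same <- subset(df, a > 2 & b == "A")
```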

Functions

# Define a function
showAsPercent<-function(x) {
  paste0(round(x*100 ,0) ,"%")
}

# Call a function
showAsPercent(0.1)
## [1] "10%"
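Functions can take default arguments and, because most R operations are vectorised, often work on whole vectors without a loop. A possible extension of the function above, where the `digits` parameter is an addition made for this sketch:

```r
# digits has a default, so callers can omit it
showAsPercent <- function(x, digits = 0) {
  paste0(round(x * 100, digits), "%")
}

showAsPercent(0.1)                 # "10%"
showAsPercent(0.1234, digits = 1)  # "12.3%"
showAsPercent(c(0.5, 0.25))        # works on vectors: "50%" "25%"
```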

Extending R

# Get a package
install.packages("caret")

# Activate a package
library(caret)

What does R look like? | OO

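# The original OO system is S3 (see cyclismo.org/tutorial/R/s3Classes.html); this example uses R6
library(R6)
Loan <- R6Class("Loan",
  public = list(
    term = NA,
    # Store the term if one is supplied when the object is created
    initialize = function(term) {
      if (!missing(term)) {
        self$term <- term
      }
    },
    # Lengthen the loan by ext periods, modifying the object in place
    extendBy = function(ext) {
      self$term <- self$term + ext
    }
  )
)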

What does R look like? | OO

acc<-Loan$new(36)
acc$extendBy(6)
acc$term
## [1] 42
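For comparison, the same loan could be sketched in S3, the original OO system mentioned in the comment above. This is one possible translation rather than part of the original deck; `new_loan` and `extend_by` are hypothetical names. Unlike R6, the method returns a modified copy instead of changing the object in place.

```r
# Constructor: a plain list tagged with a class attribute
new_loan <- function(term) {
  structure(list(term = term), class = "loan")
}

# Generic: dispatches on the class of its first argument
extend_by <- function(x, ext) UseMethod("extend_by")

# Method for class "loan"; returns a modified copy rather than changing x
extend_by.loan <- function(x, ext) {
  x$term <- x$term + ext
  x
}

acc <- new_loan(36)
acc <- extend_by(acc, 6)
acc$term  # 42
```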

Building up an R script

Commands | magrittr

magrittr's pipe operator %>% passes the result of one expression into the next function call, so you can read code left to right instead of unpicking nested brackets

library(magrittr)
# Typical
pairs(iris)
pairs(tail(iris))
pairs(tail(iris,nrow(iris)/5))

# Pipe
iris %>% pairs
iris %>% tail %>% pairs
iris %>% {tail(.,nrow(.)/5)} %>% pairs
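The pipe rules used above can be shown on smaller pieces. A sketch assuming magrittr is installed; `total`, `head_rows` and `fit` are illustrative names.

```r
library(magrittr)

# The left-hand side becomes the first argument of the right-hand side
total <- c(1, 4, 9) %>% sqrt %>% sum   # same as sum(sqrt(c(1, 4, 9)))
total                                  # 6

# Extra arguments go in the call as usual
head_rows <- iris %>% head(3)          # same as head(iris, 3)

# Use . to place the left-hand side somewhere other than the first argument
fit <- iris %>% lm(Sepal.Length ~ Sepal.Width, data = .)
```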

Commands | dplyr

Use dplyr to transform your datasets

library(dplyr)
iris %>% 
  filter(Petal.Width<2) %>%
  group_by(Species) %>%
  summarise_all(mean) # mean of every non-grouping column; summarise_each() and funs() are deprecated in newer dplyr
## # A tibble: 3 × 5
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##       <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1     setosa     5.006000    3.428000      1.46200    0.246000
## 2 versicolor     5.936000    2.770000      4.26000    1.326000
## 3  virginica     6.338095    2.790476      5.32381    1.761905
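Other dplyr verbs chain in the same way. A sketch assuming dplyr is installed; `Petal.Area` is a made-up column (length times width, a rough rectangle rather than a true area) and `result` is an illustrative name.

```r
library(dplyr)

result <- iris %>%
  mutate(Petal.Area = Petal.Length * Petal.Width) %>%   # add a derived column
  group_by(Species) %>%
  summarise(n = n(), mean_area = mean(Petal.Area)) %>%  # one row per species
  arrange(desc(mean_area))                              # largest first

result
```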

Read data | CSV & Excel

library(readr)
OrderData<-read_csv("Order.csv")

library(readxl)
OrderData<-read_excel("Order.xlsx", sheet = "Orders")

Read data | Databases

library(RODBC)
azure <- odbcDriverConnect(
  "Driver={SQL Server Native Client 11.0};
  Server=mhknbn2kdz.database.windows.net;
  Database=AdventureWorks2012;
  Uid=sqlfamily;
  Pwd=sqlf@m1ly;")

Order    <- sqlQuery( azure, 
            "SELECT * FROM [Sales].[SalesOrderHeader]")

Write data | CSV

This is the easiest and most portable option

write.csv(iris,"iris.csv", row.names = FALSE)
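One way to check that a CSV export is faithful is to read it straight back in. A base R sketch; it writes to a temporary file (an assumption made here so nothing in the working directory is overwritten), and `path` and `iris_back` are illustrative names.

```r
path <- tempfile(fileext = ".csv")   # temporary file, so nothing is overwritten

write.csv(iris, path, row.names = FALSE)

# stringsAsFactors = TRUE reads Species back as a factor on any R version
iris_back <- read.csv(path, stringsAsFactors = TRUE)

all.equal(iris, iris_back)           # TRUE if the data survived the round trip
```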

Script best practices

  • Make sure to declare packages using library() at the top of the script
  • Label sections of work using # ---- SectionName ---- to allow you to pick up the code into a LaTeX or markdown doc later
  • Never delete or overwrite; always make new objects that hold the modifications
  • If you are comfortable with coding, or have a lot of data to process, use data.table over dplyr
  • Try to do all data manipulation at the top of the script
  • Reuse your code by writing functions for anything you do frequently
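Put together, a script following these practices might be laid out as below. This is only a skeleton under assumed names: the section labels, `iris_raw`, `iris_small` and `share_kept` are invented for the sketch, and `library(stats)` stands in for whichever packages a real script needs.

```r
# ---- Packages ----
library(stats)   # placeholder for the packages the script depends on

# ---- Functions ----
showAsPercent <- function(x) {
  paste0(round(x * 100, 0), "%")
}

# ---- DataPrep ----
iris_raw   <- iris                                   # keep the source untouched
iris_small <- iris_raw[iris_raw$Petal.Width < 2, ]   # new object per modification

# ---- Analysis ----
share_kept <- showAsPercent(nrow(iris_small) / nrow(iris_raw))
share_kept
```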

“Best” practices

Charts

library(ggplot2)
ggplot(data=iris, 
       aes(x=Sepal.Width, y=Sepal.Length, colour=Species)) + 
  geom_point() 

Documentation

Document as you go!

  • Use markdown for a light syntax that integrates your code and commentary
  • Use LaTeX for a finer level of control over output layout
  • Use these instead of Excel or Word so that a change in assumption means updating the code and re-running the doc
  • Load your main R file up and use the labels from # ---- SectionName ---- to save repetition
  • Present in HTML, slide decks, PDF, Word at the click of a button

Interactive reports

Consider building a Shiny application that lets readers explore the data and findings
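A very small app of this kind could look like the sketch below, assuming the shiny package is installed. The input and output ids (`species`, `scatter`) and the choice of plot are made up for illustration.

```r
library(shiny)

# UI: a drop-down to pick a species and a placeholder for the plot
ui <- fluidPage(
  selectInput("species", "Species", choices = levels(iris$Species)),
  plotOutput("scatter")
)

# Server: redraw the scatter plot whenever the selected species changes
server <- function(input, output) {
  output$scatter <- renderPlot({
    chosen <- iris[iris$Species == input$species, ]
    plot(chosen$Sepal.Width, chosen$Sepal.Length,
         xlab = "Sepal.Width", ylab = "Sepal.Length")
  })
}

app <- shinyApp(ui, server)  # builds the app object without launching it
```

Calling runApp(app) in an interactive session starts the app in a browser.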

Workflow best practices

  • Focus on working code & documentation
  • Use source control (GitHub)
  • Write tests (assertive, assertthat, testthat)
  • Regularly visualise (ggplot2)
  • Do code reviews
  • Modularise
  • Use RStudio
  • Consider structuring as a package
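As an illustration of the testing bullet, here is how the showAsPercent() function from earlier might be covered with testthat, assuming the package is installed. The test description and the chosen cases are examples only.

```r
library(testthat)

showAsPercent <- function(x) {
  paste0(round(x * 100, 0), "%")
}

# Each expectation raises an error if the function's behaviour changes
test_that("showAsPercent formats proportions as whole percentages", {
  expect_equal(showAsPercent(0.1), "10%")
  expect_equal(showAsPercent(c(0.5, 1)), c("50%", "100%"))
  expect_true(is.character(showAsPercent(0.333)))
})
```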

Next steps

Find out more

Online

In-person

Get this presentation

This presentation is available on github.com/stephlocke/Rtraining. All the code is available for you to take a copy and play with to help you learn on the go.

If you have any questions, contact me!

itsalocke.com | github.com/StephLocke | @SteffLocke