data.table power hour

Steph Locke

2017-04-20

Welcome!

Today’s aim

Learn about using data.table in R to process bigger data sets in R and save yourself tons of typing!

Polls

Hands up if you …

  • know the difference between a table and a data.frame in R
  • use dplyr
  • have heard of data.table
  • use data.table

Agenda

  • Ways data.table has been awesome
  • Basic syntax
  • Data I/O
  • Joins
  • Within group activities
  • Updates & adding columns
  • Pivoting data

About Me

Locke Data

Founded by Steph Locke (that’s me!), Locke Data is a data science consultancy focused on helping organisations get the most out of their data science resources. While we’re happy to do data science projects for you, we’d really like to set you up to do them yourself!

Locke Data offers a broad range of services including:

  • Data Science Readiness Reviews
  • Training and mentoring programmes
  • Independent audits and ethics reviews
  • Recruitment assistance

If you’d like more information about our services please get in touch via our website, itsalocke.com.

Steph Locke

I am a Microsoft Data Platform MVP with a decade of business intelligence and data science experience.

Having worked in a variety of industries – including finance, utilities, insurance, and cyber-security – I’ve tackled a wide range of business challenges with data.

However, I’m probably best known for my community activities; including presenting, training, blogging and speaking on panels and webinars.

If you have any questions about today’s session, community activity, or data science in general, please get in touch via Locke Data, or my Twitter, @SteffLocke

Ways data.table has been awesome

Variable content CSVs/tables

myList<-list( p1=iris[1:50, -1 ], 
              p2=iris[51:100,-2]  ,
              p3=iris[101:150,-3] ) 
dt<-rbindlist(myList
 , use.names=TRUE, fill=TRUE, idcol="myList")

Splitting out fields

myTbl<-data.table(a="a,b,c")
myTbl[,(LETTERS[1:3]):=tstrsplit(a, split=",")]

My “big data” modelling story

  • 250k rows
  • 500 models to evaluate

data.table basics

SELECT

# Base R
iris[1,"Species"]

# data.table R
irisDT[1, Species]

SELECT

# Base R
iris[iris$Sepal.Length>5,"Species"]

# data.table R
irisDT[Sepal.Length>5, Species]

SELECT

# Base R
iris[ ,c("Species","Sepal.Width")]

# data.table R
irisDT[ , .(Species, Sepal.Width)]

SELECT

# Base R
iris[ ,setdiff(colnames(iris),c("Species"))]

# data.table R
irisDT[ , .SDcols=-Species]

Aliasing

# Base R
irisTMP <- iris[,c("Species","Sepal.Width")]
names(irisTMP) <- c("Species","SepalW")
irisTMP

# data.table R
irisDT[, .(Species, SepalW = Sepal.Width)]

WHERE

# Base R
iris[iris$Sepal.Length > 5, ]

# data.table R
irisDT[Sepal.Length > 5, ]

WHERE BETWEEN

# Base R
iris[iris$Sepal.Length >= 5 & iris$Sepal.Length <= 6, ]

# data.table R
irisDT[between(Sepal.Length, 5, 6), ]

WHERE LIKE

# Base R
iris[grep("set",iris$Species), ]

# data.table R
irisDT[like(Species,"set") ]

ORDER BY

# Base R
iris[order(iris$Sepal.Length),]

# data.table R
irisDT[order(Sepal.Length),]

ORDER BY

# Base R
iris[order(iris$Species, -iris$Sepal.Length),]

# data.table R
irisDT[order(Species, -Sepal.Length),]

COUNT

# Base R
sum(iris$Sepal.Width > 5)

# data.table R
irisDT[Sepal.Width > 5, .N]

COUNT DISTINCT

# Base R
length(unique(iris$Sepal.Width))

# data.table R
irisDT[, uniqueN(Sepal.Width)]

Aggregate

# Base R
data.frame(Count=nrow(iris), SumSepal= sum(iris$Sepal.Width))

# data.table R
irisDT[, .(Count=.N, SumSepal=sum(Sepal.Width))]

GROUP BY

# Base R
data.frame(table(iris$Species))

# data.table R
irisDT[, .N, Species]

GROUP BY

# Base R
data.frame(table(iris$Species,iris$Sepal.Length>5))

# data.table R
irisDT[, .N, .(Species, Gt5=Sepal.Length>5)]

Combining commands

Exercise

For flowers with sepals wider than 1, group by species and if their sepals are less than or greater than 5, return the average sepal width and count in each group.

base R

irisTMP<-iris[iris$Sepal.Width>1,]
irisTMP$gt5<-irisTMP$Sepal.Length>5
irisCount<-data.frame(table(irisTMP$Species,irisTMP$gt5>5))
names(irisCount) <- c("Species", "gt5", "Freq")
irisMean<-aggregate(Petal.Length~Species+gt5, data=irisTMP,FUN=mean)
merge(irisCount,irisMean)
## Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, 1L, 1L, :
## invalid factor level, NA generated

data.table R

irisDT[  Sepal.Width > 1
       , .(Count= .N, meanPetal=mean(Petal.Length))
       , .(Species, Gt5=Sepal.Length>5)]

IO

Problem

Setup

library(data.table)
colnames<-paste0("reallyLongColumnNamesSuuuuuck",1:13000)
rows<-13000
sampledt<-data.table(a=1:rows)
sampledt[,c(colnames):=1]

File

system.time(
fwrite(sampledt,"sample.csv",verbose = FALSE)
)

paste0(floor(file.size("sample.csv")/1024^2),"Mb")

Read

system.time(sampledt<-fread("sample.csv",verbose = FALSE))
system.time(sampledf<-read.csv("sample.csv"))
getSize(sampledt)
getSize(sampledf)

Results

Joins

Indexes

We can “index” columns for performance of queries. You can set multiple indexes.

setindex(irisDT, Species, Sepal.Width)
setindex(irisDT, Sepal.Length)
indices(irisDT)

Keys

We can do a “primary key” in data.table - unlike a typical PK it doesn’t have a uniqueness constraint and allows missings. It performs a sort, when it sets a key.

setkey(irisDT, Species)
key(irisDT)

However, we don’t often need to do this these days!

Implicit joins

If there are keys set on two tables then we can do an implicit join.

  • Can join on multiple columns
  • Works on position
  • Will join on columns to the shortest key length
irisLK<-data.table(a="setosa",b=12, key = "a")
irisDT[irisLK,][1:5,]

Explicit joins

data.table also has an ON command to enable us to specify join criteria at join time.

irisDT[irisLK, on=.(Species=a)]

Comparison joins

We can add range-based joins too

irisDT[irisLK, on=.(Species=a,Sepal.Length<=b)]

Join types

# Inner join
irisDT[irisLK,]

# Left join
irisLK[irisDT,]

# Full outer
merge(irisDT, irisLK, by.x="Species", by.y="a", all=TRUE)

Union

UNION ALL multiple tables at once with rbindlist. - Can do UNION by wrapping the rbindlist in a unique

rbindlist(list(irisDT, irisDT[1:50]))
unique(rbindlist(list(irisDT, irisDT[1:50])))

Grouping

.SD

data.table has a concept of a subset of a data.table

  • a table is a subset of itself
  • when you group, each group is a subset
  • each subset is a data.table

Within group

irisDT[
  , .SD[which.min(Sepal.Width), ]
  , Species]

By column

irisDT[ 
  , lapply(.SD, mean)
  , Species]

UPDATEs/INSERTs

Basic update

The := operator is used to update or insert.

irisDT[, Sepal.Area:= Sepal.Width*Sepal.Length]

Update multiple columns

Use the := as a function!

irisDT[, `:=`(Sepal.Area= Sepal.Width*Sepal.Length,
              Petal.Area= Petal.Width*Petal.Length)]

Dynamic names

Put a vector on the LHS of names

irisDT[ ,c("Mean","Median"):=.(mean(Sepal.Width),median(Sepal.Width))
        , Species]

Pivoting

Unpivot

Unpivot with the melt command

irisM<-melt(irisDT,id.vars = "Species")

Pivot

Use dcast to pivot data

dcast(irisM, Species ~ variable, mean)

Conclusion

data.table

  • Fast
  • Concise
  • Tackles interesting data manipulation challenges easily
  • I haven’t covered even close to everything!

Followup

  • Get the slides: stephlocke.info/Rtraining/datatablepowerhour.html
  • @SteffLocke & @LockeData
  • itsalocke.com
  • I’m here all week!

Thank you!

Thank you to the sponsors, organisers, helpers, and you peeps in the audience!