Steph Locke
2017-04-20
Learn about using data.table in R to process bigger data sets in R and save yourself tons of typing!
Hands up if you …
table
and a data.frame
in Rdplyr
data.table
data.table
Founded by Steph Locke (that’s me!), Locke Data is a data science consultancy focused on helping organisations get the most out of their data science resources. While we’re happy to do data science projects for you, we’d really like to set you up to do them yourself!
Locke Data offers a broad range of services including:
If you’d like more information about our services please get in touch via our website, itsalocke.com.
I am a Microsoft Data Platform MVP with a decade of business intelligence and data science experience.
Having worked in a variety of industries – including finance, utilities, insurance, and cyber-security – I’ve tackled a wide range of business challenges with data.
However, I’m probably best known for my community activities; including presenting, training, blogging and speaking on panels and webinars.
If you have any questions about today’s session, community activity, or data science in general, please get in touch via Locke Data, or my Twitter, @SteffLocke
myList<-list( p1=iris[1:50, -1 ],
p2=iris[51:100,-2] ,
p3=iris[101:150,-3] )
dt<-rbindlist(myList
, use.names=TRUE, fill=TRUE, idcol="myList")
myTbl<-data.table(a="a,b,c")
myTbl[,(LETTERS[1:3]):=tstrsplit(a, split=",")]
# Base R
iris[1,"Species"]
# data.table R
irisDT[1, Species]
# Base R
iris[iris$Sepal.Length>5,"Species"]
# data.table R
irisDT[Sepal.Length>5, Species]
# Base R
iris[ ,c("Species","Sepal.Width")]
# data.table R
irisDT[ , .(Species, Sepal.Width)]
# Base R
iris[ ,setdiff(colnames(iris),c("Species"))]
# data.table R
irisDT[ , .SDcols=-Species]
# Base R
irisTMP <- iris[,c("Species","Sepal.Width")]
names(irisTMP) <- c("Species","SepalW")
irisTMP
# data.table R
irisDT[, .(Species, SepalW = Sepal.Width)]
# Base R
iris[iris$Sepal.Length > 5, ]
# data.table R
irisDT[Sepal.Length > 5, ]
# Base R
iris[iris$Sepal.Length >= 5 & iris$Sepal.Length <= 6, ]
# data.table R
irisDT[between(Sepal.Length, 5, 6), ]
# Base R
iris[grep("set",iris$Species), ]
# data.table R
irisDT[like(Species,"set") ]
# Base R
iris[order(iris$Sepal.Length),]
# data.table R
irisDT[order(Sepal.Length),]
# Base R
iris[order(iris$Species, -iris$Sepal.Length),]
# data.table R
irisDT[order(Species, -Sepal.Length),]
# Base R
sum(iris$Sepal.Width > 5)
# data.table R
irisDT[Sepal.Width > 5, .N]
# Base R
length(unique(iris$Sepal.Width))
# data.table R
irisDT[, uniqueN(Sepal.Width)]
# Base R
data.frame(Count=nrow(iris), SumSepal= sum(iris$Sepal.Width))
# data.table R
irisDT[, .(Count=.N, SumSepal=sum(Sepal.Width))]
# Base R
data.frame(table(iris$Species))
# data.table R
irisDT[, .N, Species]
# Base R
data.frame(table(iris$Species,iris$Sepal.Length>5))
# data.table R
irisDT[, .N, .(Species, Gt5=Sepal.Length>5)]
For flowers with sepals wider than 1, group by species and if their sepals are less than or greater than 5, return the average sepal width and count in each group.
irisTMP<-iris[iris$Sepal.Width>1,]
irisTMP$gt5<-irisTMP$Sepal.Length>5
irisCount<-data.frame(table(irisTMP$Species,irisTMP$gt5>5))
names(irisCount) <- c("Species", "gt5", "Freq")
irisMean<-aggregate(Petal.Length~Species+gt5, data=irisTMP,FUN=mean)
merge(irisCount,irisMean)
## Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, 1L, 1L, :
## invalid factor level, NA generated
irisDT[ Sepal.Width > 1
, .(Count= .N, meanPetal=mean(Petal.Length))
, .(Species, Gt5=Sepal.Length>5)]
Yikes! Survey file with 13,000 columns with titles so long that R cuts out and binary readers can't get to the end of the line… Madness!
— Simon Whiteley (@MrSiWhiteley) November 29, 2016
library(data.table)
colnames<-paste0("reallyLongColumnNamesSuuuuuck",1:13000)
rows<-13000
sampledt<-data.table(a=1:rows)
sampledt[,c(colnames):=1]
system.time(
fwrite(sampledt,"sample.csv",verbose = FALSE)
)
paste0(floor(file.size("sample.csv")/1024^2),"Mb")
system.time(sampledt<-fread("sample.csv",verbose = FALSE))
system.time(sampledf<-read.csv("sample.csv"))
getSize(sampledt)
getSize(sampledf)
@SteffLocke Ok… ooook. I'll admit it, fixing the R script was waaay quicker than writing C# (both dev & running!). Thanks for the steer :) pic.twitter.com/clPKNhdxbO
— Simon Whiteley (@MrSiWhiteley) December 3, 2016
We can “index” columns for performance of queries. You can set multiple indexes.
setindex(irisDT, Species, Sepal.Width)
setindex(irisDT, Sepal.Length)
indices(irisDT)
We can do a “primary key” in data.table - unlike a typical PK it doesn’t have a uniqueness constraint and allows missings. It performs a sort, when it sets a key.
setkey(irisDT, Species)
key(irisDT)
However, we don’t often need to do this these days!
If there are keys set on two tables then we can do an implicit join.
irisLK<-data.table(a="setosa",b=12, key = "a")
irisDT[irisLK,][1:5,]
data.table also has an ON
command to enable us to specify join criteria at join time.
irisDT[irisLK, on=.(Species=a)]
We can add range-based joins too
irisDT[irisLK, on=.(Species=a,Sepal.Length<=b)]
# Inner join
irisDT[irisLK,]
# Left join
irisLK[irisDT,]
# Full outer
merge(irisDT, irisLK, by.x="Species", by.y="a", all=TRUE)
UNION ALL multiple tables at once with rbindlist
. - Can do UNION by wrapping the rbindlist
in a unique
rbindlist(list(irisDT, irisDT[1:50]))
unique(rbindlist(list(irisDT, irisDT[1:50])))
data.table has a concept of a subset of a data.table
irisDT[
, .SD[which.min(Sepal.Width), ]
, Species]
irisDT[
, lapply(.SD, mean)
, Species]
The :=
operator is used to update or insert.
irisDT[, Sepal.Area:= Sepal.Width*Sepal.Length]
Use the :=
as a function!
irisDT[, `:=`(Sepal.Area= Sepal.Width*Sepal.Length,
Petal.Area= Petal.Width*Petal.Length)]
Put a vector on the LHS of names
irisDT[ ,c("Mean","Median"):=.(mean(Sepal.Width),median(Sepal.Width))
, Species]
Unpivot with the melt
command
irisM<-melt(irisDT,id.vars = "Species")
Use dcast
to pivot data
dcast(irisM, Species ~ variable, mean)
Thank you to the sponsors, organisers, helpers, and you peeps in the audience!