The Ultimate R Cheat Sheet - Data Management (Page 3 of 4) in pdf

important option is na.rm=T, which means that missing values are ignored in the calculations, e.g.

mean(VAR1, na.rm=T). Without that option, the function returns missing values as a result of

missing values in the input.



rowSums(), colSums(), rowMeans() or colMeans() applies functions to rows or columns of a

table. For example, rowsum(dat1[,10:15]) will return the row-sums of the variables in columns

10 to 15. Don’t forget na.rm=T.



apply(dat,1,max) apply any function (e.g. max), to either rows (1) or columns (2) of a table (dat).

Pivot table functionality



The functions aggregate and ddply can be used to summarize data similarly to working with Excel

pivot tables. Aggregate has simpler syntax if you have many variables that you want to summarize in

the same way; ddply is better if you have few variables but want several custom summary statistics.



aggregate(dat1[,4:9], by=list(TREAT1=dat1$TREAT1, TREAT2=dat2$TREAT2),

mean) calculates the means of a number of numerical variables in columns 4 to 9 for two treatments.



ddply(dat1,.(TREAT), summarise, mVAR1=mean(VAR1)) returns means of VAR1 by a class

variables TREAT. This requires installation of the library plyr.



ddply(dat1,.(TREAT1, TREAT2), summarise, cVAR1=length(VAR1[!is.na(VAR1)])

returns a count of non-missing values in variable VAR1 for each combination of two class variables.



ddply(dat1,.(TREAT1, TREAT2), summarise, mVAR1=mean(VAR1, na.rm=T),

seVAR1=sd(VAR1, na.rm=T)/sqrt(length(VAR1[!is.na(VAR1)])). A clever piece of code

to calculate means and standard errors, with missing values being properly handled.

Long-to-wide and wide-to-long data table conversions



First we generate an artificial dataset to play with (copy and paste to R to see what it does):

long=expand.grid(SITE=c("A","B"),TREAT=c("low","med","high"), REP=c(1:5))

long$YIELD=round(rnorm(10)*5+15); long



Long-to-wide conversion with the reshape2 package, where SITE and REP remain columns, but the

treatment levels of TREAT now become several new columns:

wide=dcast(long, SITE+REP~TREAT, value.var="YIELD")

Wide-to-long conversion back to what we had before. The variables that you want to maintain as

columns are SITE and REP, all others will be gathered into a new variable where the remaining

columns become treatment levels. You have to fix the variable names to get to the original long:

long2=melt(wide, id.vars=c("SITE","REP"))

names(long2)=c("SITE","REP","TREAT","YIELD")

Dealing with missing values



transform(dat1, VAR1=ifelse(is.na(VAR1),0,VAR1)) sets missing values in variable

VAR1 to 0. You may do this if missing has a biological meaning of zero, e.g. zero productivity.



transform(dat1, VAR1=ifelse(VAR1==0,NA,VAR1)) … or vice versa if zero really means

that the measurement was not taken.



dat1[is.na(dat1)]=0 sets missing values to 0 in entire dataset.



dat1[dat1==0]=NA ... vice versa.



dat2=na.omit(dat1) delete rows with missing values in any variable



dat2=dat1[!is.na(dat1$VAR1),] delete rows with missing values in VAR1



dat2=transform(dat1, VAR2=ifelse(is.na(VAR1),NA,VAR2))modify a second variable

(here: set to missing) based on missing values in a first variable.

Dealing with duplicate data entries or IDs:



unique(dat1) or dat1[!duplicated(dat1),] removes exact duplicate rows.



dat1[duplicated(dat1),] returns the duplicate rows.



dat1[!duplicated(dat1[,c("ID")]),] removes all rows with duplicate IDs (first is kept).



dat1[duplicated(dat1[,c("ID")]),] returns the rows with duplicate IDs.

The Ultimate R Cheat Sheet - Data Management Page 3

Related Articles

Related forms

Related Categories