The Ultimate R Cheat Sheet - Data Management Page 3

ADVERTISEMENT

important option is na.rm=T, which means that missing values are ignored in the calculations, e.g.
mean(VAR1, na.rm=T). Without that option, the function returns missing values as a result of
missing values in the input.
rowSums(), colSums(), rowMeans() or colMeans() applies functions to rows or columns of a
table. For example, rowsum(dat1[,10:15]) will return the row-sums of the variables in columns
10 to 15. Don’t forget na.rm=T.
apply(dat,1,max) apply any function (e.g. max), to either rows (1) or columns (2) of a table (dat).
Pivot table functionality
The functions aggregate and ddply can be used to summarize data similarly to working with Excel
pivot tables. Aggregate has simpler syntax if you have many variables that you want to summarize in
the same way; ddply is better if you have few variables but want several custom summary statistics.
aggregate(dat1[,4:9], by=list(TREAT1=dat1$TREAT1, TREAT2=dat2$TREAT2),
mean) calculates the means of a number of numerical variables in columns 4 to 9 for two treatments.
ddply(dat1,.(TREAT), summarise, mVAR1=mean(VAR1)) returns means of VAR1 by a class
variables TREAT. This requires installation of the library plyr.
ddply(dat1,.(TREAT1, TREAT2), summarise, cVAR1=length(VAR1[!is.na(VAR1)])
returns a count of non-missing values in variable VAR1 for each combination of two class variables.
ddply(dat1,.(TREAT1, TREAT2), summarise, mVAR1=mean(VAR1, na.rm=T),
seVAR1=sd(VAR1, na.rm=T)/sqrt(length(VAR1[!is.na(VAR1)])). A clever piece of code
to calculate means and standard errors, with missing values being properly handled.
Long-to-wide and wide-to-long data table conversions
First we generate an artificial dataset to play with (copy and paste to R to see what it does):
long=expand.grid(SITE=c("A","B"),TREAT=c("low","med","high"), REP=c(1:5))
long$YIELD=round(rnorm(10)*5+15); long
Long-to-wide conversion with the reshape2 package, where SITE and REP remain columns, but the
treatment levels of TREAT now become several new columns:
wide=dcast(long, SITE+REP~TREAT, value.var="YIELD")
Wide-to-long conversion back to what we had before. The variables that you want to maintain as
columns are SITE and REP, all others will be gathered into a new variable where the remaining
columns become treatment levels. You have to fix the variable names to get to the original long:
long2=melt(wide, id.vars=c("SITE","REP"))
names(long2)=c("SITE","REP","TREAT","YIELD")
Dealing with missing values
transform(dat1, VAR1=ifelse(is.na(VAR1),0,VAR1)) sets missing values in variable
VAR1 to 0. You may do this if missing has a biological meaning of zero, e.g. zero productivity.
transform(dat1, VAR1=ifelse(VAR1==0,NA,VAR1)) … or vice versa if zero really means
that the measurement was not taken.
dat1[is.na(dat1)]=0 sets missing values to 0 in entire dataset.
dat1[dat1==0]=NA ... vice versa.
dat2=na.omit(dat1) delete rows with missing values in any variable
dat2=dat1[!is.na(dat1$VAR1),] delete rows with missing values in VAR1
dat2=transform(dat1, VAR2=ifelse(is.na(VAR1),NA,VAR2))modify a second variable
(here: set to missing) based on missing values in a first variable.
Dealing with duplicate data entries or IDs:
unique(dat1) or dat1[!duplicated(dat1),] removes exact duplicate rows.
dat1[duplicated(dat1),] returns the duplicate rows.
dat1[!duplicated(dat1[,c("ID")]),] removes all rows with duplicate IDs (first is kept).
dat1[duplicated(dat1[,c("ID")]),] returns the rows with duplicate IDs.

ADVERTISEMENT

00 votes

Related Articles

Related forms

Related Categories

Parent category: Education
Go
Page of 4