R Programming Cheat Sheet Page 2

Created by Arianne Colton and Sean Chen
Based on content from 'R for Everyone' by Jared Lander

FUNCTIONS AND CONTROLS

Create Function    say_hello <- function(first, last = 'hola') { }
Call Function      say_hello(first = 'hello')

• R automatically returns the value of the last line of code in a function. This is bad practice. Use return() explicitly instead.
• do.call() - specify the name of a function either as a string (e.g. 'mean') or as an object (e.g. mean) and provide the arguments as a list.

    do.call(say_hello, args = list(first = '1st'))

if / else / else if / switch

                                             if { } else   ifelse
Works with Vectorized Argument               No            Yes
Most Efficient for Non-Vectorized Argument   Yes           No
Works with NA *                              No            Yes
Use &&, || **†                               Yes           No
Use &, | ***†                                No            Yes

* NA == 1 evaluates to NA, so if fails with an error. ifelse returns NA instead.
** &&, || are best used in if, since they only compare the first element of the vector on each side.
*** &, | are necessary for ifelse, as they compare every element of the vector on each side.
† &&, || are similar to if in that they don't work with vectors, whereas ifelse, &, | do work with vectors.

• Similar to C++/Java, for &, |, both sides of the operator are always checked. For &&, ||, if the left side already decides the result, the right side is not checked.
• } else: else must be on the same line as }

DATA MUNGING

apply (apply, tapply, lapply, mapply)

• apply - most restrictive. Must be used on a matrix, and all elements must be the same type.
• If used on some other object, such as a data.frame, it will be converted to a matrix first.

    apply(matrix1, 1 (rows) or 2 (columns), function to apply)
    # if rows, then pass each row as input to the function

• By default, computation on NA (missing data) always returns NA, so if a matrix contains NAs, you can ignore them by passing na.rm = TRUE in the apply(..) call, which keeps the NAs out of your function.

lapply
Applies a function to each element of a list and returns the results as a list.

sapply
Same as lapply except it returns the results as a vector.

Note: lapply and sapply can both take a vector as input; a vector is technically a form of list.

aggregate (SQL GROUP BY)

    aggregate(formula, data, function)

• Formulas: y ~ x, where y represents a variable that we want to make a calculation on and x represents one or more variables we want to group the calculation by.
• Can only use one function in aggregate(). To apply more than one function, use the plyr package.

In the examples below diamonds is a data.frame; price, cut, color etc. are columns of diamonds.

    aggregate(price ~ cut, diamonds, mean)                 # average price of the different cuts
    aggregate(price ~ cut + color, diamonds, mean)         # group by cut and color
    aggregate(cbind(price, carat) ~ cut, diamonds, mean)   # average price and average carat of the different cuts

plyr ('split-apply-combine')

• ddply(), llply(), ldply(), etc. (1st letter = the type of input, 2nd = the type of output)
• plyr can be slow. Most of its functionality can be accomplished using base functions or other packages, but plyr is easier to use.

ddply
Takes a data.frame, splits it according to some variable(s), performs a desired action on it and returns a data.frame.

llply
• Can be used instead of lapply.
• For sapply, can use laply ('a' is array/vector/matrix); however, the laply result does not include the names.

dplyr (for data.frame ONLY)

• Basic functions: filter(), slice(), arrange(), select(), rename(), distinct(), mutate(), summarise(), group_by(), sample_n()
• Chain functions:

    df1 %>% group_by(year, month) %>% select(col1, col2) %>% summarise(col1mean = mean(col1))

• Much faster than plyr, with four types of easy-to-use joins (inner, left, semi, anti).
• Abstracts the way data is stored so you can work with data frames, data tables, and remote databases with the same set of functions.

Helper functions

each() - supply multiple functions to a function like aggregate

    aggregate(price ~ cut, diamonds, each(mean, median))

DATA RESHAPING

Rearrange

Melt Data - from column to row

    reshape2::melt(df1, id.vars = c('col1', 'col2'), variable.name = 'newCol1', value.name = 'newCol2')

Cast Data - from row to column

    reshape2::dcast(df1, col1 + col2 ~ newCol1, value.var = 'newCol2')

If df1 has 3 more columns, col3 to col5, 'melting' creates a new df that has 3 rows for each combination of col1 and col2, with the values coming from the respective col3 to col5.

Combine (multiple sets into one)

1. cbind - bind by columns

    cbind(v1, v2)      # combine two vectors column-wise (the result is a matrix; wrap in data.frame() for a data.frame)
    cbind(df1, df2)    # data.frame combining df1 and df2 columns

2. rbind - similar to cbind but for rows. You can assign new column names to vectors in cbind:

    cbind(col1 = v1, ...)

3. Joins - (merge, join, data.table) using common keys

3.1 Merge
• by.x and by.y specify the key columns used in the merge() operation.
• Merge can be much slower than the alternatives.

    merge(x = df1, y = df2, by.x = c('col1', 'col3'), by.y = c('col3', 'col6'))

3.2 Join
• join() in the plyr package works similarly to merge but is much faster. The drawback is that the key columns in each table must have the same name.
• join() has an argument for specifying left, right, inner joins.

    join(x = df1, y = df2, by = c('col1', 'col3'))

3.3 data.table

    dt1 <- data.table(df1, key = c('1', '2')), dt2 <- ... ‡

• Left Join

    dt1[dt2]

‡ Data table join requires specifying the keys for the data tables.

DATA

Load data from csv

• Read csv:

    read.table(file = url or filepath, header = TRUE, sep = ',')

• The "stringsAsFactors" argument defaulted to TRUE before R 4.0.0; set it to FALSE to prevent converting columns to factors. This saves computation time and maintains character data.
• Other useful arguments are "quote" and "colClasses", specifying the character used for enclosing cells and the data type for each column.
• If commas appear inside cells (for example as the decimal mark), use read.csv2() (semicolon-separated) or read.delim2() (tab-separated) instead of read.table().

Database

Connect to Database    db1 <- RODBC::odbcConnect('conStr')
Query Database         df1 <- RODBC::sqlQuery(db1, 'SELECT ..', stringsAsFactors = FALSE)
Close Connection       RODBC::odbcClose(db1)

• Only one connection may be open at a time. The connection automatically closes if R closes or another connection is opened.
• If a table name contains a space, use [ ] to surround the table name in the SQL string.
• which() in R is similar to 'where' in SQL.

Included data

R and some packages come with data included.

List Available Datasets                         data()
List Available Datasets in a Specific Package   data(package = 'ggplot2')

Missing data (NA and NULL)

NULL is not missing, it's nothingness. NULL is atomic and cannot exist within a vector. If used inside a vector, it simply disappears.

Check Missing Data    is.na()
Avoid Using           is.null()

GRAPHICS

Default basic graphic

    hist(df1$col1, main = 'title', xlab = 'x axis label')
    plot(col2 ~ col1, data = df1)    # aka y ~ x, or plot(x, y)

lattice and ggplot2 (more popular)

• Initialize the object and add layers (points, lines, histograms) using +; map a variable in the data to an axis or aesthetic using 'aes'.

    ggplot(data = df1) + geom_histogram(aes(x = col1))

• Normalized histogram (pdf, not relative frequency histogram):

    ggplot(data = df1) + geom_density(aes(x = col1), fill = 'grey50')
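
The notes on apply, lapply and sapply can be tried out with a small base R sketch. The matrix m and all variable names below are made up for illustration and are not part of the original cheat sheet.

```r
# Hypothetical 2 x 3 matrix with one missing value (filled column by column)
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)

# MARGIN = 1 passes each row to the function, MARGIN = 2 each column
row_sums_na    <- apply(m, 1, sum)                # NA propagates into row 1
row_sums_clean <- apply(m, 1, sum, na.rm = TRUE)  # extra arguments are forwarded to sum()

# lapply returns a list; sapply simplifies the same results to a vector
squares_list <- lapply(1:3, function(x) x^2)
squares_vec  <- sapply(1:3, function(x) x^2)

print(row_sums_na)
print(row_sums_clean)
print(squares_vec)
```

Passing na.rm = TRUE after the function name works because apply forwards any additional arguments to the function it calls.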
Updated: December 2, 2015
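
The if versus ifelse comparison on this page (NA handling, & versus &&) can be illustrated with a minimal base R sketch. The vector x and the helper describe() are hypothetical names chosen for this example only.

```r
x <- c(5, NA, -3)   # hypothetical vector with a missing value

# ifelse() is vectorized: an NA in the test yields NA in the result
sign_label <- ifelse(x > 0, "pos", "non-pos")

# if() needs a single TRUE/FALSE, so an NA condition is an error
if_on_na <- tryCatch(if (NA == 1) "yes" else "no",
                     error = function(e) "error")

# Guard against NA explicitly before using if() on a single value
describe <- function(v) {
  if (is.na(v)) {
    return("missing")
  } else if (v > 0) {
    return("pos")
  } else {
    return("non-pos")
  }
}
labels <- sapply(x, describe)

# & compares element by element; && short-circuits on single values,
# so the right-hand side below is never evaluated
both <- (x > 0) & (x < 10)
safe <- FALSE && stop("never evaluated")

print(sign_label)
print(labels)
```

Note that each else sits on the same line as the closing brace before it, as the cheat sheet requires.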
