R Programming Cheat Sheet (Page 2 of 2)

Created by Arianne Colton and Sean Chen
Based on content from 'R for Everyone' by Jared Lander
Updated: December 2, 2015

DATA MUNGING

Apply (apply, tapply, lapply, mapply)

• apply is the most restrictive. It must be used on a matrix, and all elements must be the same type.
• If used on some other object, such as a data.frame, the object is converted to a matrix first.

    apply(matrix1, 1 - rows or 2 - columns, function to apply)
    # if rows, then pass each row as input to the function

• By default, computation on NA (missing data) always returns NA. If a matrix contains NAs, you can ignore them by passing na.rm = TRUE in the apply(..) call; it is forwarded to functions such as mean() or sum(), which then drop the NAs.

lapply
Applies a function to each element of a list and returns the results as a list.

sapply
Same as lapply, except that it returns the results as a vector.
Note: lapply and sapply can both take a plain vector as input, not only a list.

Aggregate (SQL GROUP BY)

    aggregate(formula, data, function)

• Formulas: y ~ x, where y represents a variable that we want to make a calculation on and x represents one or more variables we want to group the calculation by.
• Can only use one function in aggregate(). To apply more than one function, use the plyr package.

In the examples below, diamonds is a data.frame; price, cut, color etc. are columns of diamonds.

    aggregate(price ~ cut, diamonds, mean)
    # get the average price of different cuts for the diamonds
    aggregate(price ~ cut + color, diamonds, mean)
    # group by cut and color
    aggregate(cbind(price, carat) ~ cut, diamonds, mean)
    # get the average price and average carat of different cuts

Plyr ('split-apply-combine')

• ddply(), llply(), ldply(), etc. (1st letter = the type of input, 2nd = the type of output).
• plyr can be slow. Most of the functionality in plyr can be accomplished using base functions or other packages, but plyr is easier to use.

ddply
Takes a data.frame, splits it according to some variable(s), performs a desired action on it and returns a data.frame.

llply
• Can use this instead of lapply.
• For sapply, can use laply ('a' is array/vector/matrix); however, the laply result does not include the names.

Dplyr (for data.frame ONLY)

• Basic functions: filter(), slice(), arrange(), select(), rename(), distinct(), mutate(), summarise(), group_by(), sample_n()
• Chain functions:

    df1 %>% group_by(year, month) %>%
      select(col1, col2) %>%
      summarise(col1mean = mean(col1))

• Much faster than plyr, with four types of easy-to-use joins (inner, left, semi, anti).
• Abstracts the way data is stored so you can work with data frames, data tables, and remote databases with the same set of functions.

Helper Functions

each() - supply multiple functions to a function like aggregate

    aggregate(price ~ cut, diamonds, each(mean, median))

FUNCTIONS AND CONTROLS

Create Function:   say_hello <- function(first, last = 'hola') { }
Call Function:     say_hello(first = 'hello')

• R automatically returns the value of the last line of code in a function. This is bad practice. Use return() explicitly instead.
• do.call() - specify the name of a function either as a string (e.g. 'mean') or as an object (e.g. mean) and provide the arguments as a list.

    do.call(say_hello, args = list(first = '1st'))

if / else / else if / switch

                                                if { } else    ifelse
    Works with Vectorized Argument              No             Yes
    Most Efficient for Non-Vectorized Argument  Yes            No
    Works with NA *                             No             Yes
    Use &&, || **†                              Yes            No
    Use &, | ***†                               No             Yes

* NA == 1 evaluates to NA, so if fails with an error. ifelse returns NA instead.
** &&, || is best used in if, since it compares only a single value from each side (older R versions silently used the first element of a longer vector; R 4.3.0 and later raise an error).
*** &, | is necessary for ifelse, as it compares every element of the vector from each side.
† &&, || are similar to if in that they don't work with vectors, whereas ifelse, &, | work with vectors.

• Similar to C++/Java, for &, |, both sides of the operator are always checked. For &&, ||, the right side is skipped once the left side decides the result.
• } else: else must be on the same line as }.

DATA RESHAPING

Rearrange

Melt Data - from column to row:

    reshape2::melt(df1, id.vars = c('col1', 'col2'), variable.name = 'newCol1', value.name = 'newCol2')

Cast Data - from row to column:

    reshape2::dcast(df1, col1 + col2 ~ newCol1, value.var = 'newCol2')

If df1 has 3 more columns, col3 to col5, 'melting' creates a new df that has 3 rows for each combination of col1 and col2, with the values coming from the respective col3 to col5.

Combine (multiple sets into one)

1. cbind - bind by columns

    cbind(v1, v2)      # two vectors as columns (returns a matrix; use data.frame(v1, v2) for a data.frame)
    cbind(df1, df2)    # data.frame combining df1 and df2 columns

2. rbind - similar to cbind but for rows. You can assign new column names to vectors in cbind:

    cbind(col1 = v1, ...)

3. Joins - (merge, join, data.table) using common keys

3.1 Merge
• by.x and by.y specify the key columns used in the join operation.
• Merge can be much slower than the alternatives.

    merge(x = df1, y = df2, by.x = c('col1', 'col3'), by.y = c('col3', 'col6'))

3.2 Join
• join() in the plyr package works similarly to merge but is much faster; the drawback is that the key columns in each table must have the same name.
• join() has an argument for specifying left, right, inner joins.

    join(x = df1, y = df2, by = c('col1', 'col3'))

3.3 data.table

    dt1 <- data.table(df1, key = c('1', '2')), dt2 <- ... ‡

• Left Join

    dt1[dt2]

‡ Data table join requires specifying the keys for the data tables.

DATA

Load Data from CSV

• Read csv:

    read.table(file = url or filepath, header = TRUE, sep = ',')

• The "stringsAsFactors" argument defaults to TRUE in R versions before 4.0.0; set it to FALSE to prevent converting columns to factors. This saves computation time and maintains character data.
• Other useful arguments are "quote" and "colClasses", specifying the character used for enclosing cells and the data type for each column.
• If the cell separator has been used inside a cell, then use read.csv2() or read.delim2() instead of read.table().

Database

Connect to Database:   db1 <- RODBC::odbcConnect('conStr')
Query Database:        df1 <- RODBC::sqlQuery(db1, 'SELECT ..', stringsAsFactors = FALSE)
Close Connection:      RODBC::odbcClose(db1)

• Only one connection may be open at a time. The connection automatically closes if R closes or another connection is opened.
• If a table name contains a space, use [ ] to surround the table name in the SQL string.
• which() in R is similar to 'where' in SQL.

Included Data

R and some packages come with data included.

List Available Datasets:                        data()
List Available Datasets in a Specific Package:  data(package = 'ggplot2')

Missing Data (NA and NULL)

NULL is not missing, it's nothingness. NULL is atomic and cannot exist within a vector. If used inside a vector, it simply disappears.

Check Missing Data:  is.na()
Avoid Using:         is.null()

GRAPHICS

Default Basic Graphic

    hist(df1$col1, main = 'title', xlab = 'x axis label')
    plot(col2 ~ col1, data = df1), aka plot(y ~ x), or plot(x, y)

Lattice and ggplot2 (more popular)

• Initialize the object and add layers (points, lines, histograms) using +, and map a variable in the data to an axis or aesthetic using 'aes'.

    ggplot(data = df1) + geom_histogram(aes(x = col1))

• Normalized histogram (pdf, not relative frequency histogram):

    ggplot(data = df1) + geom_density(aes(x = col1), fill = 'grey50')
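
The apply, lapply and sapply notes in the Data Munging section can be tried on a small example. The sketch below uses base R only; the matrix m and the squaring function are made up for illustration and are not part of the original sheet.

```r
# Hypothetical 2 x 3 numeric matrix with one missing value.
# matrix() fills column by column: row 1 is (1, NA, 5), row 2 is (2, 4, 6).
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)

# MARGIN = 1 passes each row to the function; MARGIN = 2 passes each column.
row_means <- apply(m, 1, mean)                       # a row containing NA yields NA
row_means_no_na <- apply(m, 1, mean, na.rm = TRUE)   # na.rm is forwarded to mean()
col_sums <- apply(m, 2, sum, na.rm = TRUE)

# lapply always returns a list; sapply simplifies to a vector when it can.
squares_list <- lapply(1:3, function(x) x^2)
squares_vec <- sapply(1:3, function(x) x^2)
```

Extra arguments after the function name, such as na.rm = TRUE here, are handed on to the applied function, which is why the NA handling depends on that function supporting na.rm.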
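
The contrast between if and ifelse, and between && and &, in the Functions and Controls section may be easier to see in code. This is a minimal base R sketch; the vector x and the helper describe() are hypothetical names chosen for the example.

```r
x <- c(5, NA, -2)

# ifelse() is vectorised: one result per element, and an NA test gives NA.
labels <- ifelse(x > 0, "positive", "not positive")

# & compares element by element, so it also works on whole vectors.
both <- (x > -5) & (x < 10)

# if () needs a single non-missing TRUE/FALSE, so guard against NA first.
# && stops at the left side when it is FALSE, so value > 0 is never
# evaluated for a missing value.
describe <- function(value) {
  if (!is.na(value) && value > 0) {
    return("positive")
  } else {
    return("not positive or missing")
  }
}

first <- describe(x[1])
second <- describe(x[2])
```

Calling if (x[2] > 0) directly would stop with an error because the condition is NA, which is the situation the first footnote of the table describes.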
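
The aggregate() formulas from the Data Munging section can be checked without installing ggplot2. The sketch below assumes a tiny hand-made data.frame called gems as a stand-in for diamonds; its values are invented purely for illustration.

```r
# Hypothetical stand-in for the diamonds data used in the sheet.
gems <- data.frame(
  cut   = c("Good", "Good", "Ideal", "Ideal"),
  color = c("E", "F", "E", "E"),
  price = c(100, 300, 500, 700),
  carat = c(0.2, 0.4, 0.5, 0.7)
)

# One grouping variable: mean price per cut.
by_cut <- aggregate(price ~ cut, gems, mean)

# Two grouping variables: mean price per (cut, color) combination
# that actually occurs in the data.
by_cut_color <- aggregate(price ~ cut + color, gems, mean)

# Two measured variables at once via cbind() on the left-hand side.
both_means <- aggregate(cbind(price, carat) ~ cut, gems, mean)
```

The left side of the formula names what is summarised and the right side names the grouping columns, which mirrors SELECT ... GROUP BY in SQL.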
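
For the Merge entry in the Data Reshaping section, a short base R sketch shows how by.x and by.y pair up differently named key columns. The data frames orders and customers are hypothetical, and all.x = TRUE is the standard merge() argument for keeping every row of x (a left join).

```r
# Hypothetical tables whose key columns have different names.
orders <- data.frame(cust_id = c(1, 2, 3), amount = c(10, 20, 30))
customers <- data.frame(id = c(1, 2, 4), name = c("Ann", "Bob", "Dee"))

# Inner join (the default): only keys present in both tables are kept.
inner <- merge(x = orders, y = customers, by.x = "cust_id", by.y = "id")

# Left join: keep every row of x, filling unmatched columns with NA.
left <- merge(x = orders, y = customers, by.x = "cust_id", by.y = "id",
              all.x = TRUE)
```

The result keeps the key under the name given in by.x, so the joined column here is called cust_id.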
