
Statistics Cheat Sheet

Basic Statistics Definitions:

Statistics – Practice or science of collecting and analyzing numerical data

Data – Values collected by direct or indirect observation

Population – Complete set of all observations in existence

Sample – Slice of population meant to represent, as accurately as possible, that population

Measure – Measurement of population/sample, an example would be some “score” (a.k.a. an observation)

Hypothesis – Educated guess about what’s going on

Skew – Not symmetrical, crooked or uneven

Impute – To fill in missing values

Type I Error (false positive) – In hypothesis testing, when you incorrectly reject Null Hypothesis

Type II Error (false negative) – In hypothesis testing, when you incorrectly fail to reject Null Hypothesis

A Normal Distribution:

A.K.A. “Bell Curve”

- Way to visualize how volume of a population is distributed based on some measurement
- Largest volume is packed around middle
- Volume curves down towards zero to left and right
- Symmetrical around middle

[Figure: bell curve. 68% of data within 1 standard deviation of mean, 95% within 2 standard deviations, 99.7% within 3 standard deviations]

Interesting Fact: The Mean, Median, and Mode are all the same and at the exact center

Big takeaway: Most measurements of a normally distributed population will be centered around the middle.

Why you care: If population is “normally distributed” then we can use a bunch of useful characteristics to help describe it.

How We Describe Things... (Measures of Central Tendency)

Mean – Also called “Average”, probably the most popular statistic, calculated as sum of all values divided by number of values

Median – Value at center

Mode – Value that occurs most
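A minimal Python sketch of the three measures of central tendency, using only the standard library. The `scores` list is made up for illustration; the mean is computed by hand to mirror the "sum divided by number of values" definition, while the median and mode come from the `statistics` module.

```python
import statistics

# Hypothetical scores, invented purely for illustration.
scores = [70, 85, 85, 90, 100]

# Mean: sum of all values divided by number of values.
mean = sum(scores) / len(scores)

# Median: value at the center once the values are sorted.
median = statistics.median(scores)

# Mode: value that occurs most often.
mode = statistics.mode(scores)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

With an even number of values, `statistics.median` returns the average of the two middle values.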

Standard Deviation – Measure of how spread out values are around the mean, and the usual yardstick for how far a single value is from the mean.

The more standard deviations a value is from the mean, the more unusual… and perhaps interesting… it becomes.
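As a rough illustration of distance from the mean in standard deviations, the sketch below computes a standard deviation and then a "z-score" for each value. The data and the helper name `z_score` are assumptions for this example, and it uses the population standard deviation (`statistics.pstdev`); for a sample, `statistics.stdev` would be the usual choice.

```python
import statistics

# Hypothetical scores, invented purely for illustration.
scores = [70, 85, 85, 90, 100]

mean = statistics.mean(scores)
# Population standard deviation; swap in statistics.stdev for a sample.
sd = statistics.pstdev(scores)


def z_score(value, mean, sd):
    """Number of standard deviations a value sits above (+) or below (-) the mean."""
    return (value - mean) / sd


for value in scores:
    # The larger the absolute z-score, the further the value is from the mean.
    print(value, round(z_score(value, mean, sd), 2))
```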

Is My Data Special?

Make sure to review Hazards! section regarding skewed distributions

Null Hypothesis in Layman’s Terms: There is nothing different, or special, about this data

[Figure: bell curve. Middle 95%: Can’t reject it, nothing special. Each 2.5% tail: REJECT IT!]

- Best used when you need to know if your data is different or somehow special
- Always start out assuming Null Hypothesis is TRUE
- Goal is to either “reject” or “fail to reject” Null Hypothesis
- If FAIL TO REJECT Null Hypothesis then there is nothing really different about the data
- If REJECT Null Hypothesis then we are confident that what we see is different or special
- On curve above, can only say that an observation is different/special if it falls in either of shaded regions (called “tails”)
- The tails begin about 2 Standard Deviations away from (either above or below) the Mean
- Assumes dealing with a normal distribution! See Hazards!

Big takeaway: If your data falls within +/- 2 Standard Deviations of Mean then it’s probably not all that different. If your data falls outside those boundaries then it is most likely something to take note of.

Caution Hazard

Beware of... BIAS

Bias can affect both how samples are selected and what conclusions you draw from them (i.e. interpretation).

Selection Bias – when an individual or observation is more likely than others to be picked for sampling (in other words, NOT random)

Observer Bias – when you subconsciously let your preconceptions influence how you perform your analysis

Detection Bias – when something is more likely to be detected in a specific set of observations (e.g. measuring website traffic on Black Friday)

Funding Bias – when selection or interpretation favors a financial sponsor

Extrapolation Bias – when you assume results of a study describe a larger population than what you originally started with (e.g. assuming a study of college students is a good proxy for entire country)

Reporting Bias – when availability of data favors a certain subgroup within true population

Confirmation Bias – when you tend to listen only to information that confirms your hypothesis, assumption, or opinion

Sampling

Consider sampling when population you’re working with is too big to handle

Aim is to get a good representation of actual population

Keep bias out of it by ensuring a RANDOM sample!

Good Sampling Rule of Thumb:

Generally the bigger the sample the better, but a simple tip is:
- At minimum your sample size should be 100
- At maximum your sample size should be 10% of the population or 1000, whichever is smaller

Some Sampling Methods:
- Simple Random (probably the only one you will ever see or use)
- Systematic Random
- Stratified
- Cluster
- Multistage

Random Numbers

Random numbers are an excellent way to create a Simple Random Sample. Most analytical tools (including Excel & Google Sheets) have a random number generator you can use. Just apply a random number to each row, sort in ascending order by the random number, then select the top however-many rows.
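The random-number recipe from the Sampling notes (apply a random number to each row, sort ascending, keep the top rows) can be sketched in Python roughly as follows. The function names, the row IDs, and the seed are assumptions for illustration. The helper for the rule of thumb simply restates the two bullet points; for populations under 1000 its maximum falls below the minimum of 100, and the sheet does not say how to resolve that.

```python
import random


def simple_random_sample(rows, sample_size, seed=None):
    """Draw a Simple Random Sample by sorting rows on a random number."""
    rng = random.Random(seed)
    # Apply a random number to each row.
    keyed = [(rng.random(), row) for row in rows]
    # Sort in ascending order by the random number only.
    keyed.sort(key=lambda pair: pair[0])
    # Select the top however-many rows.
    return [row for _, row in keyed[:sample_size]]


def rule_of_thumb_bounds(population_size):
    """Return (minimum, maximum) sample size from the sheet's rule of thumb."""
    minimum = 100
    # 10% of the population or 1000, whichever is smaller.
    maximum = min(population_size // 10, 1000)
    return minimum, maximum


# Hypothetical population: 50,000 row IDs.
population = list(range(1, 50_001))
low, high = rule_of_thumb_bounds(len(population))
sample = simple_random_sample(population, high, seed=42)
print(low, high, len(sample))
```

In everyday Python, `random.sample(rows, k)` draws the same kind of sample directly; the sort-based version is shown only because it mirrors the spreadsheet procedure.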

Caution Hazard

Skewed Distributions...

[Figure: skewed distribution curve with mode, median, and mean marked]

Not all data is normally distributed… and when your data is not normally distributed, all those helpful characteristics of a normal distribution no longer apply! For instance, Hypothesis testing limits will change, Mean & Median will shift, and many statistical models (think regression) rely on the assumption of normality!

Imputing Missing Values...

Missing values are a part of real-life data analysis. But resist temptation to just fill them in with Mean or Median. Sometimes this is an OK option, but remember that missing values can be trying to send you a message about some process that you are unaware of (i.e. telling a story). Also, there are a number of imputation methods out there; be sure to review them thoroughly to see if any better fit your needs/data.

Confusing Confidence Intervals... with probability.

95% confidence just means that if you repeated the sampling many times, about 95% of the intervals built this way would contain the true (population) value. It is not a 95% probability that the true value lies within the one interval you computed.

Thinking that Correlation proves Causation (it doesn’t)

Check out Probability & Correlation Cheat Sheet for more on this one!

Multiple Inference... Faking it ’til you’re making it

Running a hypothesis test over and over, the same way on the same data, until you get a “significant” result greatly increases chances you will get a false positive (Type I Error) result because… there is always the chance of getting a randomly significant result.
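To put a number on the Multiple Inference hazard, the sketch below estimates how quickly the chance of at least one false positive grows with the number of tests. It assumes, for simplicity, that the Null Hypothesis is true and that each test behaves like an independent draw with a 5% false positive rate; real reruns on overlapping data are not fully independent, so treat the figures as indicative only. The function names are made up for this example.

```python
import random


def chance_of_false_positive(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate(alpha, n_tests, n_trials=10_000, seed=0):
    """Estimate the same probability by simulation under a true Null Hypothesis."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Each test comes out "significant" purely by chance with probability alpha.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(0.05, n), 3), simulate(0.05, n))
```

Under these assumptions, 20 tests at the 5% level give roughly a 64% chance of at least one "significant" result even though nothing special is going on.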

