Statistics Cheat Sheet

Basic Statistics Definitions:
Statistics – Practice or science of collecting and analyzing numerical data
Data – Values collected by direct or indirect observation
Population – Complete set of all observations in existence
Sample – Slice of population meant to represent, as accurately as possible, that population
Measure – Measurement of population/sample, an example would be some “score” (a.k.a. an observation)
Hypothesis – Educated guess about what’s going on
Skew – Not symmetrical, crooked or uneven
Impute – To fill in missing values
Type I Error (false positive) – In hypothesis testing, when you incorrectly reject Null Hypothesis
Type II Error (false negative) – In hypothesis testing, when you incorrectly fail to reject Null Hypothesis

A Normal Distribution:
A.K.A. “Bell Curve”
Way to visualize how volume of a population is distributed based on some measurement
Largest volume is packed around middle
Volume curves down towards zero to left and right
Symmetrical around middle
68% of data within 1 standard deviation of mean
95% within 2 standard deviations
99.7% within 3 standard deviations
Interesting Fact: The Mean, Median, and Mode are all the same and at the exact center
Big takeaway: Most measurements of a normally distributed population will be centered around the middle.
Why you care: If population is “normally distributed” then we can use a bunch of useful characteristics to help describe it.

How We Describe Things...
(Measures of Central Tendency)
Mean – Also called “Average”, probably the most popular statistic, calculated as sum of all values divided by number of values
Median – Value at center
Mode – Value that occurs most
Standard Deviation – Measure of spread relative to the mean: roughly how far a typical value sits from the mean.
The farther a value is from the mean, the more unique… and perhaps interesting… it becomes.
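As a rough illustration of the measures above, here is a minimal sketch using Python's built-in statistics module. The scores list is made-up example data, and z_score and share_within are hypothetical helper names introduced only for this example.

```python
import statistics

# Hypothetical sample of test scores (illustrative data, not from the cheat sheet)
scores = [70, 72, 75, 75, 78, 80, 82, 85, 88, 95]

mean = statistics.mean(scores)      # sum of all values / number of values
median = statistics.median(scores)  # value at the center once sorted
mode = statistics.mode(scores)      # value that occurs most often
stdev = statistics.pstdev(scores)   # standard deviation, treating scores as the whole population


def z_score(value, mean, stdev):
    """How many standard deviations a value sits from the mean."""
    return (value - mean) / stdev


def share_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    inside = [v for v in values if abs(v - m) <= k * s]
    return len(inside) / len(values)


print("mean:", mean, "median:", median, "mode:", mode)
print("standard deviation:", round(stdev, 2))
print("z-score of 95:", round(z_score(95, mean, stdev), 2))
for k in (1, 2, 3):
    print(f"share within {k} standard deviation(s):", share_within(scores, k))
```

With only ten values the shares will not match the 68% / 95% / 99.7% figures exactly; those describe a normally distributed population, and a small sample only approximates them.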
Is My Data Special?
Make sure to review Hazards! section regarding skewed distributions
Null Hypothesis in Layman’s Terms: There is nothing different, or special, about this data
Best used when you need to know if your data is different or somehow special
Always start out assuming Null Hypothesis is TRUE
Goal is to either “reject” or “fail to reject” Null Hypothesis
If FAIL TO REJECT Null Hypothesis then there is nothing really different about the data
If REJECT Null Hypothesis then we are confident that what we see is different or special
[Figure: bell curve with the middle 95% labeled “Can’t reject it, nothing special” and a shaded 2.5% tail on each side labeled “REJECT IT!”]
On curve above, can only say that an observation is different/special if it falls in either of shaded regions (called “tails”)
The tails start about 2 Standard Deviations away from (either above or below) the Mean
Assumes dealing with a normal distribution! See Hazards!
Big takeaway: If your data falls within +/- 2 Standard Deviations of Mean then it’s probably not all that different. If your data falls outside those boundaries then it is most likely something to take note of.

Sampling
Consider sampling when population you’re working with is too big to handle
Aim is to get a good representative for actual population
Keep bias out of it by ensuring a RANDOM sample!
Good Sampling Rule of Thumb:
Generally the bigger the sample the better, but a simple tip is:
- At minimum your sample size should be 100
- At maximum your sample size should be 10% of the population or 1000, whichever is smaller
Some Sampling Methods:
- Simple Random (probably the only one you will ever see or use)
- Systematic Random
- Stratified
- Cluster
- Multistage
Random Numbers
Are an excellent way to create a Simple Random Sample. Most analytical tools (including Excel & Google Sheets) have a random number generator you can use. Just apply a random number to each row, sort in ascending order by the random number, then select the top however-many rows.

Caution Hazard
Beware of... BIAS
Bias can affect both how samples are selected, and also what conclusions you draw from them (i.e. interpretation).
Selection Bias – when an individual or observation is more likely to be picked for sampling (in other words, NOT random)
Observer Bias – when you subconsciously let your preconceptions influence how you perform your analysis
Detection Bias – when something is more likely to be detected in a specific set of observations (e.g. measuring website traffic on Black Friday)
Funding Bias – when selection or interpretation favors a financial sponsor
Extrapolation Bias – when you assume results of a study describe a larger population than what you originally started with (e.g. assuming a study of college students is a good proxy for entire country)
Reporting Bias – when availability of data favors a certain subgroup within true population
Confirmation Bias – tend to listen only to information that confirms hypothesis, assumption, or opinion
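The random-number recipe from the Sampling section (give every row a random number, sort by it, take the top rows) is one simple way to keep selection bias out of a sample. A minimal sketch in Python follows; simple_random_sample and max_sample_size are hypothetical names made up for this example, and the 5,000-row population is illustrative only.

```python
import random


def simple_random_sample(rows, sample_size, seed=None):
    """Tag each row with a random number, sort ascending, keep the top rows."""
    rng = random.Random(seed)  # a seed is optional; it just makes the draw repeatable
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:sample_size]]


def max_sample_size(population_size):
    """Upper bound from the rule of thumb: 10% of the population or 1000, whichever is smaller."""
    return min(population_size // 10, 1000)


# Hypothetical population of 5,000 row ids
population = list(range(5000))
size = max_sample_size(len(population))
sample = simple_random_sample(population, size, seed=42)
print("sample size:", len(sample))
```

Sorting by a random key mirrors what you would do in a spreadsheet; inside Python, random.sample(population, size) draws the same kind of sample more directly.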
Skewed Distributions...
[Figure: two skewed curves, each marking the positions of the mode, median, and mean]
Not all data is normally distributed… and when your data is not normally distributed, all those helpful characteristics of a normal distribution no longer apply! For instance Hypothesis testing limits will change, Mean & Median will shift, and many statistical models (think regression) rely on a normality assumption (in regression, about the errors)!

Imputing Missing Values...
Missing values are a part of real-life data analysis. But, resist temptation to just fill them in with Mean or Median. Sometimes this is an OK option, but remember that missing values can be trying to send you a message about some process that you are unaware of (i.e. telling a story).
Also, there are a number of imputation methods out there; be sure to review them thoroughly to see if there are any that better fit your needs/data.

Confusing Confidence Intervals...
…with probability. 95% confidence just means that if you repeated the sampling many times, about 95% of the intervals built this way would contain the true (population) value. It does not mean there is a 95% chance that the true value sits inside the one interval you calculated.

Multiple Inference...Faking it ‘till you’re making it
Running test after test on the same data until you get a “significant” result greatly increases chances you will get a false positive (Type I Error) result because… there is always the chance of getting a randomly significant result.

Thinking that Correlation proves Causation (it doesn’t)
Check out Probability & Correlation Cheat Sheet for more on this one!
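To see why the Multiple Inference hazard matters, here is a small simulation sketch in Python. It assumes a normally distributed population where the Null Hypothesis is TRUE, uses the +/- 2 Standard Deviations rule as the rejection cutoff, and treats each repeated test as an independent draw, which is a simplification. The function names, sample size, and number of tests are all made up for this example.

```python
import math
import random


def one_test_rejects(rng, n=25, mu=0.0, sigma=1.0, cutoff=2.0):
    """Draw a sample where the Null Hypothesis is TRUE, then test its mean."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    sample_mean = sum(sample) / n
    # How many standard errors the sample mean sits from the true mean
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    return abs(z) > cutoff  # True means a false positive (Type I Error)


def false_positive_rates(studies=1000, tests_per_study=20, seed=0):
    """Compare running one test per study with running many and keeping any 'hit'."""
    rng = random.Random(seed)
    single_hits = 0
    any_hits = 0
    for _ in range(studies):
        results = [one_test_rejects(rng) for _ in range(tests_per_study)]
        single_hits += results[0]  # only the first test counts
        any_hits += any(results)   # at least one "significant" result
    return single_hits / studies, any_hits / studies


single_rate, any_rate = false_positive_rates()
print("false positive rate with 1 test:", single_rate)
print("false positive rate with 20 tests:", any_rate)
```

Under these assumptions a single test should reject only around 5% of the time, while at least one of 20 tests rejects far more often, even though nothing special is going on in the data.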
