How do I determine whether my data are normal?
- There are four interrelated approaches to determining normality, and ideally all of them should be used.
- Look at a histogram with the normal curve superimposed. A histogram provides a useful graphical representation of the data.
- As a rough example of normality and non-normality, see the
following histograms. The black line superimposed on each histogram
represents the bell-shaped "normal" curve. Notice how the data for
variable1 are normal, and the data for variable2 are non-normal. In
this case, the non-normality is driven by the presence of an outlier.
For more information about outliers, see the pages "What are outliers?", "How do I detect outliers?", and "How do I deal with outliers?"
Problem -- All samples deviate somewhat from normal, so the question is
how much deviation from the black line indicates “non-normality”.
Unfortunately, graphical representations such as histograms provide no
hard-and-fast rules. Only after you have viewed many (many!) histograms
will you develop a sense for the normality of data.
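The histogram check described above can be sketched in MATLAB as follows. This is only an illustration: the data are simulated, the names variable1 and variable2 are reused from the text purely for convenience, and histfit is assumed to be available from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: histograms with a fitted normal curve superimposed.
% Assumes the Statistics and Machine Learning Toolbox (histfit).
rng(0);                                   % fixed seed so the figure is repeatable
variable1 = 50 + 10 * randn(200, 1);      % simulated, roughly normal sample
variable2 = [variable1; 150];             % same sample plus one extreme outlier

figure;
subplot(1, 2, 1);
histfit(variable1);                       % histogram plus fitted normal curve
title('variable1');

subplot(1, 2, 2);
histfit(variable2);                       % the outlier stretches the right tail
title('variable2');
```

By default histfit fits a normal distribution, so no distribution name needs to be passed for this check.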
- Look at the values of Skewness and Kurtosis. Skewness involves the
symmetry of the distribution, and Kurtosis involves its peakedness. A
normal distribution is perfectly symmetric. A positively skewed distribution has
scores clustered to the left, with the tail extending to the right. A
negatively skewed distribution has scores clustered to the right, with
the tail extending to the left. Skewness is 0 in a normal distribution,
so the farther the value is from 0, the less symmetric, and hence the
less normal, the distribution. A skewness of 0 does not by itself
establish normality, because other symmetric distributions also have
zero skewness.
- The histogram above for variable1 represents perfect symmetry
(skewness) and perfect peakedness (kurtosis), and the descriptive
statistics below for variable1 parallel this information by reporting
"0" for both skewness and kurtosis (kurtosis here is excess kurtosis,
which is 0 for a normal distribution). The histogram above for variable2
represents positive skewness (tail extending to the right), and the
descriptive statistics below for variable2 parallel this information.
Problem -- The question is “how much” skew renders the data non-normal?
This is an arbitrary determination, and it is sometimes difficult to make
from the values of Skewness alone. Luckily, there are more objective tests
of normality, described next.
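As a rough sketch of how these descriptive statistics could be obtained in MATLAB, the snippet below computes skewness and excess kurtosis for two simulated samples. The data are made up for illustration, and skewness and kurtosis are assumed to come from the Statistics and Machine Learning Toolbox. Note that MATLAB's kurtosis returns 3 for a normal distribution, so 3 is subtracted to match the "0 means normal" convention used in the text.

```matlab
% Sketch: skewness and excess kurtosis of two simulated samples.
% Assumes the Statistics and Machine Learning Toolbox (skewness, kurtosis).
rng(0);                                   % fixed seed for repeatability
variable1 = 50 + 10 * randn(200, 1);      % simulated, roughly normal sample
variable2 = [variable1; 150];             % same sample plus one extreme outlier

skew1 = skewness(variable1);              % near 0 for symmetric data
skew2 = skewness(variable2);              % pulled positive by the outlier

% kurtosis() equals 3 for a normal distribution; subtract 3 for excess kurtosis.
exKurt1 = kurtosis(variable1) - 3;
exKurt2 = kurtosis(variable2) - 3;

fprintf('variable1: skewness = %.3f, excess kurtosis = %.3f\n', skew1, exKurt1);
fprintf('variable2: skewness = %.3f, excess kurtosis = %.3f\n', skew2, exKurt2);
```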
- Look at established tests for normality, which are sensitive to
departures in both Skewness and Kurtosis simultaneously. The Kolmogorov-Smirnov
(K-S) test and the Shapiro-Wilk (S-W) test are designed to test normality by
comparing your data to a normal distribution with the same mean and
standard deviation as your sample. If the test is NOT significant (p greater
than .05), then there is no evidence against normality and the data are
treated as normal. If the test is significant (p less than .05), then the
data are non-normal.
- See the data below, which indicate that variable1 is normal and variable2
is non-normal. Keep in mind that one limitation of the normality tests
is that the larger the sample size, the more likely you are to get significant
results. Thus, you may get significant results with only slight
deviations from normality when sample sizes are large.
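The sample-size limitation can be explored with a small simulation. The sketch below applies the same mild skew at two sample sizes and runs the Anderson-Darling test on each. The data, the sample sizes, and the choice of adtest are assumptions made for illustration only, and MATLAB may warn when a p-value falls outside the range it tabulates.

```matlab
% Sketch: the same mild skew tested at a small and a large sample size.
% Assumes the Statistics and Machine Learning Toolbox (adtest).
rng(0);                                   % fixed seed for repeatability
sampleSizes = [30 5000];                  % illustrative sizes, not from the source

for n = sampleSizes
    x = exp(0.15 * randn(n, 1));          % simulated, slightly right-skewed data
    [h, p] = adtest(x);                   % h = 1: normality rejected at the 5% level
    fprintf('n = %4d: h = %d, p = %.4f\n', n, h, p);
end
```

With a deviation this small, the larger sample is far more likely to produce a significant result, even though the shape of the distribution is the same in both cases.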
- Look at normality plots of the data. The “Normal Q-Q Plot” provides
a graphical way to determine the level of normality. The black line
indicates the values your sample should adhere to if the distribution
were normal. The dots are your actual data. If the dots fall close to
the black line, then your data are approximately normal. If they deviate
systematically from the black line, your data are non-normal.
- Notice how the data for variable1 fall along the line, whereas the data for variable2 deviate from the line.
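A normal Q-Q plot like the one described above could be produced in MATLAB roughly as follows. The data are simulated for illustration, and qqplot is assumed to be available from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: normal Q-Q plots for two simulated samples.
% Assumes the Statistics and Machine Learning Toolbox (qqplot).
rng(0);                                   % fixed seed so the figure is repeatable
variable1 = 50 + 10 * randn(200, 1);      % simulated, roughly normal sample
variable2 = [variable1; 150];             % same sample plus one extreme outlier

figure;
subplot(1, 2, 1);
qqplot(variable1);                        % sample quantiles vs. normal quantiles
title('variable1');

subplot(1, 2, 2);
qqplot(variable2);                        % the outlier sits far off the reference line
title('variable2');
```

Called with a single input, qqplot compares the sample against a normal distribution and draws a reference line through the quartiles of the data.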
- MATLAB code for testing the normality of data:
- Anderson-Darling test
h = adtest(x)
h = adtest(x) returns a test decision for the null hypothesis that the data in vector x is from a population with a normal distribution, using the Anderson-Darling test. The alternative hypothesis is that x is not from a population with a normal distribution. The result h is 1 if the test rejects the null hypothesis at the 5% significance level, or 0 otherwise.
- One-sample Kolmogorov-Smirnov test
h = kstest(x)
h = kstest(x) returns a test decision for the null hypothesis that the data in vector x comes from a standard normal distribution, against the alternative that it does not come from such a distribution, using the one-sample Kolmogorov-Smirnov test. The result h is 1 if the test rejects the null hypothesis at the 5% significance level, or 0 otherwise.
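Because kstest(x) compares the data to a standard normal distribution (mean 0, standard deviation 1), raw data on any other scale will usually be rejected even when they are normal. One common workaround, sketched below with simulated data, is to standardize the sample first. Keep in mind that standardizing with the sample's own mean and standard deviation makes the K-S test conservative, which is the situation the Lilliefors test is designed to handle.

```matlab
% Sketch: kstest on raw versus standardized data.
% Assumes the Statistics and Machine Learning Toolbox (kstest).
rng(0);                                   % fixed seed for repeatability
x = 50 + 10 * randn(200, 1);              % simulated normal data, mean 50, SD 10

[hRaw, pRaw] = kstest(x);                 % compared to N(0,1): rejected due to scale

z = (x - mean(x)) / std(x);               % standardize to mean 0, SD 1
[hStd, pStd] = kstest(z);                 % now tests the shape, not the scale

fprintf('raw:          h = %d, p = %.4g\n', hRaw, pRaw);
fprintf('standardized: h = %d, p = %.4g\n', hStd, pStd);
```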
- Jarque-Bera test
h = jbtest(x)
h = jbtest(x) returns a test decision for the null hypothesis that the data in vector x comes from a normal distribution with an unknown mean and variance, using the Jarque-Bera test. The alternative hypothesis is that it does not come from such a distribution. The result h is 1 if the test rejects the null hypothesis at the 5% significance level, and 0 otherwise.
- Lilliefors test
h = lillietest(x)
h = lillietest(x) returns a test decision for the null hypothesis that the data in vector x comes from a distribution in the normal family, against the alternative that it does not come from such a distribution, using a Lilliefors test. The result h is 1 if the test rejects the null hypothesis at the 5% significance level, and 0 otherwise.
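To compare the four tests side by side, each function can also return a p-value as a second output. The following is a minimal sketch on simulated data, not a recommended workflow. The variable names and the summary table layout are illustrative assumptions, and MATLAB may warn when a p-value falls outside the range tabulated for a given test.

```matlab
% Sketch: run the four normality tests on one sample and tabulate the results.
% Assumes the Statistics and Machine Learning Toolbox.
rng(0);                                   % fixed seed for repeatability
x = 50 + 10 * randn(200, 1);              % simulated normal data
z = (x - mean(x)) / std(x);               % kstest expects standard normal data

[hAD, pAD] = adtest(x);                   % Anderson-Darling
[hKS, pKS] = kstest(z);                   % one-sample Kolmogorov-Smirnov
[hJB, pJB] = jbtest(x);                   % Jarque-Bera
[hLF, pLF] = lillietest(x);               % Lilliefors

% h = 1 means the test rejects normality at the 5% significance level.
results = table([hAD; hKS; hJB; hLF], [pAD; pKS; pJB; pLF], ...
    'VariableNames', {'Reject', 'pValue'}, ...
    'RowNames', {'Anderson-Darling', 'Kolmogorov-Smirnov', ...
                 'Jarque-Bera', 'Lilliefors'});
disp(results);
```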
http://www.psychwiki.com
MATLAB Help