Top 3 methods for handling skewed data: log, square root, and Box-Cox transformations.
(What is the Box-Cox Power Transformation?)
- A procedure to identify an appropriate exponent (Lambda, λ) to use to transform data into a “normal shape.”
- The Lambda value indicates the power to which all data should be raised.
The Box-Cox transformation is a useful family of transformations. \
- Many statistical tests and intervals are based on the assumption of normality.
- The assumption of normality often leads to tests that are simple, mathematically tractable, and powerful compared to tests that do not make the normality assumption.
- Unfortunately, many real data sets are in fact not approximately normal.
- However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution.
- This increases the applicability and usefulness of statistical techniques based on the normality assumption.
IMPORTANT: After a transformation (c), we need to measure the normality of the resulting transformed data (d).
- One measure is the correlation coefficient of a normal probability plot (d).
- The correlation is computed between the vertical and horizontal axis variables of the probability plot and is a convenient measure of the linearity of the probability plot.
- In other words: the more linear the probability plot, the better a normal distribution fits the data!
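The check described above can be sketched in Python with `scipy.stats.probplot`, which returns the least-squares fit of the probability plot together with its correlation coefficient `r`. The lognormal sample and the helper name `ppcc` are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed, strictly positive sample.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)


def ppcc(sample):
    """Correlation coefficient of the normal probability plot of a 1-D sample."""
    # probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
    _, (_, _, r) = stats.probplot(sample, dist="norm")
    return r


r_raw = ppcc(skewed)          # probability plot of the raw data
r_log = ppcc(np.log(skewed))  # probability plot after a log transformation

print(f"raw data:        r = {r_raw:.4f}")
print(f"log-transformed: r = {r_log:.4f}")
```

The closer `r` is to 1, the more linear the probability plot, so the log-transformed sample should score noticeably higher than the raw one here.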
*NOTE: there is another useful link that explains it with figures, but I did not read it.
GUARANTEED NORMALITY?
- NO!
- This is because the method does not actually check for normality;
- it searches for the Lambda that gives the smallest standard deviation of the transformed data (rescaled so that different Lambda values are comparable).
- The assumption is that, among all transformations with Lambda values between -5 and +5, the transformed data has the highest likelihood (but no guarantee) of being normally distributed when its standard deviation is smallest.
- It is therefore absolutely necessary to always check the transformed data for normality using a probability plot (d).
+ Additionally, the Box-Cox Power transformation only works if all the data is strictly positive (greater than 0).
+ This is easily achieved by adding a constant ‘c’ to all data so that every value becomes positive before it is transformed. The transformation equation is then:\
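For reference, the shifted transformation is usually written in the standard textbook form below. This is a sketch under assumed notation: x is an original value, c is the added constant, and λ is the exponent. The simpler power form (x + c)^λ is also in common use.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{(x + c)^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[2ex]
  \ln(x + c), & \lambda = 0.
\end{cases}
```

The λ = 0 case is the limit of the first expression as λ approaches 0, which is why the log transformation belongs to the same family.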
COMMON TRANSFORMATION FORMULAS (based on the actual formula)
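As an illustration, a few Lambda values reduce the standard Box-Cox formula to familiar transformations: -1 gives the reciprocal, -0.5 the reciprocal square root, 0 the log, 0.5 the square root, and 2 the square, each up to a shift and rescale. The sketch below checks this numerically with `scipy.stats.boxcox` at a fixed `lmbda`; the sample values and the dictionary `common` are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical strictly positive sample values.
x = np.array([0.5, 1.0, 2.0, 4.0, 9.0])

# Lambda -> the simple transformation it corresponds to.
common = {
    -1.0: lambda v: 1.0 / v,           # reciprocal
    -0.5: lambda v: 1.0 / np.sqrt(v),  # reciprocal square root
    0.5: np.sqrt,                      # square root
    2.0: np.square,                    # square
}

# With a fixed lmbda, boxcox returns only the transformed array:
# (x**lmbda - 1) / lmbda, i.e. the simple transformation shifted and rescaled.
matches = {
    lam: bool(np.allclose(stats.boxcox(x, lmbda=lam), (simple(x) - 1.0) / lam))
    for lam, simple in common.items()
}

# Lambda = 0 is the special case that reduces to the natural log.
log_matches = bool(np.allclose(stats.boxcox(x, lmbda=0.0), np.log(x)))

print(matches, log_matches)
```

Lambda = 1 leaves the shape of the data unchanged (it only shifts it by 1), so an optimal Lambda close to 1 suggests that no transformation is needed.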
Finally: an awesome tutorial (dead); here is a new one in Python with code examples, and there is also another code example here.
“Simply pass a 1-D array into the function and it will return the Box-Cox transformed array and the optimal value for lambda. You can also specify a number, alpha, which calculates the confidence interval for that value. (For example, alpha = 0.05 gives the 95% confidence interval).” \
* There may be a slight discrepancy between the Python and R code (details here), but this needs investigating.
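The quoted description matches the behaviour of `scipy.stats.boxcox`. A minimal sketch, assuming a synthetic lognormal sample (the data and variable names are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical strictly positive, right-skewed 1-D sample.
data = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Without alpha: returns the transformed array and the maximum-likelihood lambda.
transformed, best_lambda = stats.boxcox(data)

# With alpha=0.05: additionally returns a 95% confidence interval for lambda.
_, _, (ci_low, ci_high) = stats.boxcox(data, alpha=0.05)

print(f"lambda = {best_lambda:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"skewness before: {stats.skew(data):.3f}, after: {stats.skew(transformed):.3f}")
```

Note that `scipy.stats.boxcox` raises an error if any value is zero or negative, so data that is not strictly positive needs the constant shift described earlier before being passed in.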
(What is the Mann–Whitney U test?) - The Mann–Whitney U test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.\
In other words: this test can be used to determine whether two independent samples were selected from populations having the same distribution.
Unlike the t-test, it does not require the assumption of normal distributions. It is nearly as efficient as the t-test on normal distributions (asymptotic relative efficiency of about 0.95).
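A minimal sketch of the test using `scipy.stats.mannwhitneyu`. The two exponential samples are made up for illustration; they are deliberately non-normal, which is the situation where this test is preferred over the t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two hypothetical independent, skewed samples; group_b tends to be larger.
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=2.0, size=200)

# Two-sided test of the null hypothesis that a value drawn from one group is
# equally likely to be smaller or larger than a value drawn from the other.
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"U = {u_stat:.1f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("Reject the null hypothesis: the two distributions differ.")
else:
    print("No significant difference detected.")
```

The 0.05 threshold is only the conventional choice; the significance level should be fixed before looking at the data.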
- What chi-square is, what a null hypothesis is, how to calculate observed vs. expected values, and how to check whether we can reject the null hypothesis and find a significant difference.
- Analytics Vidhya
- What is hypothesis testing
- Intro to t-tests (Analytics Vidhya) - always good
- ANOVA (analysis of variance): one-way, two-way, MANOVA
- ANOVA tests whether the means of two or more groups are significantly different from each other. It checks the impact of one or more factors by comparing the means of different samples.
- A significant one-way ANOVA tells us that at least two groups are different from each other, but it won’t tell us which groups are different.
- When the outcome or dependent variable (in our case the test scores) is affected by two independent variables/factors, we use a slightly modified technique called two-way ANOVA.
- When there is more than one dependent variable, we have the multivariate case, and the technique used to solve it is known as MANOVA.
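A minimal one-way ANOVA sketch using `scipy.stats.f_oneway`. The three groups of test scores are synthetic and their names are hypothetical; only the third group is generated with a different mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical test scores for three groups (for example, three teaching methods).
scores_a = rng.normal(loc=70, scale=5, size=40)
scores_b = rng.normal(loc=70, scale=5, size=40)
scores_c = rng.normal(loc=78, scale=5, size=40)  # this group has a higher mean

# Null hypothesis: all group means are equal.
f_stat, p_value = stats.f_oneway(scores_a, scores_b, scores_c)

print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A small p-value here only says that at least one group mean differs. Finding out which one requires a follow-up (post-hoc) comparison between pairs of groups.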