# The Box-Cox Transformation

The Box-Cox transformation is a family of power transform functions that are used to stabilize variance and make a dataset look more like a normal distribution. Lots of useful tools require normal-like data in order to be effective, so by using the Box-Cox transformation on your wonky-looking dataset you can then utilize some of these tools.

Here’s the transformation in its basic form. For value $x > 0$ and parameter $\lambda$: $\displaystyle \frac{x^{\lambda}-1}{\lambda} \quad \text{if} \quad \lambda \neq 0$ $\displaystyle \log(x) \quad \text{if} \quad \lambda = 0$

Before getting to an example, I’ll ask and answer as best I can a few questions that seemed good to me, but (mostly) didn’t appear in any explanation of the subject that I could find.

• Why subtract 1 from $x^{\lambda}$?

For all values of $\lambda$, when our datapoint is 1, $x^{\lambda}$ evaluates to 1. Noticing this, we have an opportunity to center transformed data: by subtracting 1 from $x^{\lambda}$, we ensure that the transform maps $x = 1$ to exactly 0 for every $\lambda$, so transformed datasets share a common anchor point.
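To see this numerically, here’s a quick sketch in plain Python (the `boxcox` helper is my own name for the formula, not a library function):

```python
def boxcox(x, lam):
    # Box-Cox transform for positive x and nonzero lambda
    return (x ** lam - 1) / lam

# x = 1 maps to 0 for every lambda, since 1 ** lam == 1
for lam in (-2.0, -0.5, 0.5, 2.0):
    print(lam, boxcox(1.0, lam))  # equals 0 in every case
```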

• Why divide by lambda?

As a normalizing constant. As lambda increases, the effect of exponentiation increases, but so does the divisor. They don’t grow on the same scale, but the division still has some normalizing effect. Also, this makes taking the first derivative a little nicer when it comes to finding the optimal value of $\lambda$, though I haven’t done the derivation for the log-likelihood or anything like that and can’t say whether this was the motivation behind dividing by lambda.

• Why is zero a different value?

When $\lambda = 0$, the numerator $x^{\lambda} - 1$ and the denominator $\lambda$ are both 0, so the expression is an undefined $0/0$ form; the function needs a separate definition at that point.

• Why log(x) at $\lambda = 0$?

As $\lambda$ approaches 0, the Box-Cox expression approaches the log function: for small $\lambda$, $x^{\lambda} = e^{\lambda \log(x)} \approx 1 + \lambda \log(x)$, so $(x^{\lambda} - 1)/\lambda \approx \log(x)$. Try it out.
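A quick numerical check of that limit (again, `boxcox` here is a hand-rolled helper, not a library function):

```python
import math

def boxcox(x, lam):
    # piecewise Box-Cox transform for positive x
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

x = 5.0
for lam in (1.0, 0.1, 0.01, 0.001):
    print(lam, boxcox(x, lam))  # creeps toward log(5) as lam shrinks
print("log(5) =", math.log(x))
```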

• What about negative values of $x$?

A straightforward solution is to include a shift parameter such that for all values $x$, $x + \text{shift}$ is positive. Box-Cox is a family of power transformations, and there are lots of variations that have different ways of dealing with exceptions like negative values, e.g. one proposed variation exponentiates the data values.
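A minimal sketch of the shift idea (the particular choice of shift here, pushing the minimum up to 1, is my own arbitrary pick; conventions vary):

```python
import numpy as np

data = np.array([-2.0, -1.0, 0.0, 3.0, 8.0])

# choose a shift so every value is strictly positive;
# here we push the minimum up to exactly 1
shift = 1.0 - data.min()
shifted = data + shift
print(shifted)
```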

In practice, a Box-Cox function in a software package like scipy basically tries out a range of values for $\lambda$ and returns the value of $\lambda$ that maximizes the log-likelihood function, i.e. the value of $\lambda$ that makes your data most normal-looking.
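With scipy, that search is a single call (this assumes `scipy` is installed; the lognormal sample is just illustrative right-skewed data, not the post’s dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.6, size=1000)  # right-skewed sample

# with lmbda=None (the default), boxcox returns the transformed data
# plus the lambda that maximized the log-likelihood
transformed, best_lambda = stats.boxcox(data)
print(best_lambda)  # near 0 for lognormal data, i.e. roughly a log transform
```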

So, let’s take some data and see what happens when we apply Box-Cox. Great, we put in some data, removed outliers (we’ll revisit the removal of outliers a little later) and found the best value of $\lambda$ to make our data more normal.

Here’s our data distribution before Box-Cox, where we see some unwanted skew and kurtosis (peakiness).

And here’s the probability plot for our data. A probability plot puts your data against data generated from a theoretical distribution, in this case the normal distribution. The data is scaled such that the straight line contains data points from a normal distribution; the closer our data is to this line, the more normal it is. Again, you can see the right skew and high-variance data points on the right.

Here, we look at a range of $\lambda$ values and plot them against the $R^2$ values from the probplot, thus finding the optimal value of $\lambda$. Pretty cool, huh? As an aside, most explanations of Box-Cox mention that the search range for $\lambda$ is between -5 and 5. Those are obviously arbitrary parameter bounds, and yet they’re entirely appropriate on this dataset.
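A sketch of that sweep (scipy also ships `stats.boxcox_normplot`, which does essentially this; the grid resolution and the lognormal sample here are my own choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(sigma=0.5, size=500)  # skewed sample

lambdas = np.linspace(-5, 5, 101)
r_squared = []
for lam in lambdas:
    y = stats.boxcox(data, lmbda=lam)             # transform at a fixed lambda
    _, (slope, intercept, r) = stats.probplot(y, dist="norm")
    r_squared.append(r ** 2)                      # fit against the normal line

best = lambdas[int(np.argmax(r_squared))]
print(best)  # the lambda whose probplot hugs the line most closely
```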

Here’s our new probplot and distribution, looking much more normal. Out of curiosity, I ran the Box-Cox again, but this time left in 3 outliers sitting at the bottom of the range; it makes quite a difference. Note both the range of all values, as well as the range over which the left and right tails loosely begin and end: ~0.03 for the dataset above, ~0.6 for the dataset below.

## 2 thoughts on “The Box-Cox Transformation”

1. Felix says:

Thank you for this great overview! After reading your explanation, I still have one question:
x is transformed following (x^lambda - 1) / lambda, which makes sense to me. Dividing by lambda changes the sign for negative values and reduces the stretching due to large lambda. However, why do we then end up transforming our data by x^lambda and not (x^lambda - 1) / lambda?
