Day: 2016/11/28

Gini Coefficients: How to Calculate Them in Excel and What The F They Mean

Social scientists and wonk readers are familiar with Gini coefficients, and while I’ve used them a few times myself, I have no idea what they really mean.

Okay, fine, I know what a Lorenz curve is, and I can explain it mathematically. But ultimately any understanding of the Gini coefficient is inherently relative, e.g. “They say .3 is good for income distribution, but anything above .5 is bad. The U.S. is .45, so that’s bad … ish.” It doesn’t say anything about the actual distribution.

Indeed, the first two Wonkblog posts I found in a brief Internet search describe the Gini coefficient essentially as a (well-known) measure of inequality. No details.

Vox is slightly better: “[I]t’s an incredibly abstract idea that’s difficult to verbally describe. [Thanks, Vox.] The advantage to using a gini coefficent is that in principle it summarizes all the information about the distribution of income and thus facilitates easy comparisons.”

Easy comparisons? To what? Other Gini coefficients? Whatever.

Still, aside from eyeballing a bunch of Lorenz curves—fun fact: I’m going to make you eyeball a bunch of Lorenz curves today—a Gini coefficient is totally undescriptive because it’s a decimal without any units.

This post tries to remedy that, but first, per the title here’s a big ol’ array formula for calculating Gini coefficients in MS Excel. The source is Excel and UDF Performance Stuff. I chose the Angus Deaton version and improved on it by replacing the ROW formulae with COUNTIFs because those account for blank cells in the data. It also avoids the pitfall of negative data numbers, which mess up Lorenz curves. However, as an array formula, it must be entered with CTRL-SHIFT-ENTER otherwise it won’t work. Behold:

((COUNTIF(Datarange,”>=0″)+1)/(COUNTIF(Datarange,”>=0″)-1)-2/(COUNTIF(Datarange,”>=0″)*(COUNTIF(Datarange,”>=0″)-1)*AVERAGEIF(Datarange,”>=0″))*SUM((IF(Datarange>=0,RANK(Datarange,Datarange))+((COUNT(Datarange)+1-IF(Datarange>=0,RANK(Datarange,Datarange,0))-IF(Datarange>=0,RANK(Datarange,Datarange,1)))/2))*IF(Datarange>=0,Datarange)))/(COUNTIF(Datarange,”>=0″)/(COUNTIF(Datarange,”>=0″)-1))

*Where “Datarange” is something like, “$B$1:$B$1000”.

The Web site contains a few other methods, but this one is the most comprehensive and isn’t limited to 4,000 data points. I’ve tested this version with hand-cranked Ginis, that is, sorting the data by size, creating a cumulative sum of them (the Lorenz curve), calculating a bunch of trapezoids among them to find the area under the Lorenz curve, adding those up, subtracting them by equivalent triangle of equality, and then dividing them by that triangle.

So what does the Gini coefficient mean? In other words, what kind of distribution can one imagine when provided with a given Gini number?

(more…)