Gini Coefficients: How to Calculate Them in Excel and What The F They Mean

Social scientists and wonk readers are familiar with Gini coefficients, and while I’ve used them a few times myself, I have no idea what they really mean.

Okay, fine, I know what a Lorenz curve is, and I can explain it mathematically. But ultimately any understanding of the Gini coefficient is inherently relative, e.g. “They say .3 is good for income distribution, but anything above .5 is bad. The U.S. is .45, so that’s bad … ish.” It doesn’t say anything about the actual distribution.

Indeed, the first two Wonkblog posts I found in a brief Internet search describe the Gini coefficient essentially as a (well-known) measure of inequality. No details.

Vox is slightly better: “[I]t’s an incredibly abstract idea that’s difficult to verbally describe. [Thanks, Vox.] The advantage to using a gini coefficent is that in principle it summarizes all the information about the distribution of income and thus facilitates easy comparisons.”

Easy comparisons? To what? Other Gini coefficients? Whatever.

Still, aside from eyeballing a bunch of Lorenz curves—fun fact: I’m going to make you eyeball a bunch of Lorenz curves today—a Gini coefficient is totally undescriptive because it’s a decimal without any units.

This post tries to remedy that, but first, per the title here’s a big ol’ array formula for calculating Gini coefficients in MS Excel. The source is Excel and UDF Performance Stuff. I chose the Angus Deaton version and improved on it by replacing the ROW formulae with COUNTIFs because those account for blank cells in the data. It also avoids the pitfall of negative data numbers, which mess up Lorenz curves. However, as an array formula, it must be entered with CTRL-SHIFT-ENTER otherwise it won’t work. Behold:

((COUNTIF(Datarange,”>=0″)+1)/(COUNTIF(Datarange,”>=0″)-1)-2/(COUNTIF(Datarange,”>=0″)*(COUNTIF(Datarange,”>=0″)-1)*AVERAGEIF(Datarange,”>=0″))*SUM((IF(Datarange>=0,RANK(Datarange,Datarange))+((COUNT(Datarange)+1-IF(Datarange>=0,RANK(Datarange,Datarange,0))-IF(Datarange>=0,RANK(Datarange,Datarange,1)))/2))*IF(Datarange>=0,Datarange)))/(COUNTIF(Datarange,”>=0″)/(COUNTIF(Datarange,”>=0″)-1))

*Where “Datarange” is something like, “$B$1:$B$1000”.

The Web site contains a few other methods, but this one is the most comprehensive and isn’t limited to 4,000 data points. I’ve tested this version with hand-cranked Ginis, that is, sorting the data by size, creating a cumulative sum of them (the Lorenz curve), calculating a bunch of trapezoids among them to find the area under the Lorenz curve, adding those up, subtracting them by equivalent triangle of equality, and then dividing them by that triangle.

So what does the Gini coefficient mean? In other words, what kind of distribution can one imagine when provided with a given Gini number?

It’s a damn good question that I figured out an answer to. I created a bunch of columns in Excel with a bunch of RAND() formulae, gathered some statistics, and then averaged the results. The RAND() formula produces a random number from 0 to 1. The more of them you use, the bigger, more robust the distribution you get. A .1 is as likely as a .9, etc. The mean and median will be .5. But what’s the Gini coefficient of a large number of random numbers within a set range?

Answer: One third.

This is huge for me because now I know when someone says that the Gini coefficient for the income distribution in some country is .333333, I can visualize that it’s equivalent to a random income distribution. Because one third is such a round number, I had a hunch that taking each RAND() formula to different powers would yield other round Gini coefficients, so, here’s a table of Gini coefficients as a power of each data point from a random distribution. (And yes, this blog endorses the divisible properties of the number 12.) I’m adding similar information for the resulting Lorenz curves. It’s quite intriguing—and quite tedious because it takes a few minutes for my computer to spit out the numbers. But this blog wouldn’t be anything if it didn’t hog some clock cycles:

DATA POINTS SHARE OF MAXIMUM AT GIVEN GINI COEFFICIENT
POWER GINI MINIMUM 10TH PERCENTILE 25TH PERCENTILE MEDIAN 75TH PERCENTILE 90TH PERCENTILE MAXIMUM MEAN
1/12 .040 48.9% 82.5% 89.1% 94.4% 97.6% 99.1% 100.0% 92.3%
2/12 .077 24.2% 68.0% 79.4% 89.1% 95.3% 98.3% 100.0% 85.7%
3/12 .111 12.1% 56.1% 70.7% 84.1% 93.1% 97.4% 100.0% 80.0%
4/12 .143 06.1% 46.3% 63.0% 79.4% 90.9% 96.6% 100.0% 75.0%
5/12 .173 03.1% 38.2% 56.1% 74.9% 88.8% 95.7% 100.0% 70.6%
6/12 .200 01.6% 31.5% 50.0% 70.7% 86.7% 94.9% 100.0% 66.7%
7/12 .226 00.8% 26.0% 44.5% 66.8% 84.6% 94.0% 100.0% 63.2%
8/12 .250 00.4% 21.4% 39.7% 63.0% 82.6% 93.2% 100.0% 60.0%
9/12 .273 00.2% 17.7% 35.3% 59.5% 80.7% 92.4% 100.0% 57.2%
10/12 .294 00.1% 14.6% 31.5% 56.2% 78.8% 91.6% 100.0% 54.6%
11/12 .315 00.1% 12.0% 28.1% 53.0% 76.9% 90.8% 100.0% 52.2%
12/12 .334 00.0% 09.9% 25.0% 50.0% 75.1% 90.0% 100.0% 50.0%
16/12 .400 00.0% 04.6% 15.7% 39.7% 68.3% 86.9% 100.0% 42.9%
20/12 .454 00.0% 02.1% 09.9% 31.5% 62.1% 83.9% 99.9% 37.6%
24/12 .500 00.0% 01.0% 06.3% 25.0% 56.4% 81.0% 99.9% 33.4%
30/12 .555 00.0% 00.3% 03.1% 17.7% 48.9% 76.9% 99.9% 28.6%
36/12 .600 00.0% 00.1% 01.6% 12.5% 42.4% 72.9% 99.9% 25.1%
42/12 .636 00.0% 00.0% 00.8% 08.9% 36.8% 69.2% 99.9% 22.3%
48/12 .666 00.0% 00.0% 00.4% 06.3% 31.9% 65.6% 99.9% 20.1%
54/12 .692 00.0% 00.0% 00.2% 04.4% 27.6% 62.3% 99.9% 18.2%
60/12 .714 00.0% 00.0% 00.1% 03.1% 24.0% 59.1% 99.8% 16.7%
66/12 .733 00.0% 00.0% 00.0% 02.2% 20.8% 56.1% 99.8% 15.4%
72/12 .749 00.0% 00.0% 00.0% 01.6% 18.0% 53.2% 99.8% 14.3%
84/12 .777 00.0% 00.0% 00.0% 00.8% 13.5% 47.9% 99.8% 12.6%
96/12 .799 00.0% 00.0% 00.0% 00.4% 10.2% 43.1% 99.7% 11.2%
120/12 .833 00.0% 00.0% 00.0% 00.1% 05.8% 35.0% 99.7% 09.1%
144/12 .856 00.0% 00.0% 00.0% 00.0% 03.3% 28.3% 99.6% 07.7%
240/12 .909 00.0% 00.0% 00.0% 00.0% 00.3% 12.3% 99.4% 04.8%

And here is the same information for the Lorenz curves.

CUMULATIVE DATA POINTS SHARE OF MAXIMUM AT GIVEN GINI COEFFICIENT (LORENZ CURVES)
POWER GINI MINIMUM 10TH PERCENTILE 25TH PERCENTILE MEDIAN 75TH PERCENTILE 90TH PERCENTILE MAXIMUM MEAN
1/12 .040 00.0% 08.2% 22.3% 47.2% 73.3% 89.2% 100.0% 48.0%
2/12 .077 00.0% 06.8% 19.8% 44.5% 71.5% 88.5% 100.0% 46.2%
3/12 .111 00.0% 05.6% 17.7% 42.0% 69.8% 87.7% 100.0% 44.4%
4/12 .143 00.0% 04.6% 15.7% 39.7% 68.2% 86.9% 100.0% 42.9%
5/12 .173 00.0% 03.8% 14.0% 37.4% 66.6% 86.2% 100.0% 41.4%
6/12 .200 00.0% 03.1% 12.5% 35.3% 65.0% 85.4% 100.0% 40.0%
7/12 .226 00.0% 02.6% 11.1% 33.4% 63.5% 84.7% 100.0% 38.7%
8/12 .250 00.0% 02.1% 09.9% 31.5% 62.0% 83.9% 100.0% 37.5%
9/12 .273 00.0% 01.8% 08.8% 29.7% 60.5% 83.2% 100.0% 36.4%
10/12 .294 00.0% 01.5% 07.8% 28.1% 59.1% 82.5% 100.0% 35.3%
11/12 .315 00.0% 01.2% 07.0% 26.5% 57.7% 81.8% 100.0% 34.3%
12/12 .334 00.0% 01.0% 06.2% 25.0% 56.3% 81.1% 100.0% 33.3%
16/12 .400 00.0% 00.5% 03.9% 19.8% 51.2% 78.3% 100.0% 30.0%
20/12 .454 00.0% 00.2% 02.5% 15.8% 46.5% 75.6% 100.0% 27.3%
24/12 .500 00.0% 00.1% 01.6% 12.5% 42.3% 73.0% 100.0% 25.0%
30/12 .555 00.0% 00.0% 00.8% 08.9% 36.7% 69.3% 100.0% 22.3%
36/12 .600 00.0% 00.0% 00.4% 06.3% 31.8% 65.8% 100.0% 20.0%
42/12 .636 00.0% 00.0% 00.2% 04.4% 27.5% 62.4% 100.0% 18.2%
48/12 .666 00.0% 00.0% 00.1% 03.1% 23.9% 59.3% 100.0% 16.7%
54/12 .692 00.0% 00.0% 00.0% 02.2% 20.7% 56.3% 100.0% 15.4%
60/12 .714 00.0% 00.0% 00.0% 01.6% 18.0% 53.4% 100.0% 14.3%
66/12 .733 00.0% 00.0% 00.0% 01.1% 15.6% 50.7% 100.0% 13.4%
72/12 .749 00.0% 00.0% 00.0% 00.8% 13.5% 48.1% 100.0% 12.6%
84/12 .777 00.0% 00.0% 00.0% 00.4% 10.2% 43.3% 100.0% 11.2%
96/12 .799 00.0% 00.0% 00.0% 00.2% 07.6% 39.1% 100.0% 10.1%
120/12 .833 00.0% 00.0% 00.0% 00.1% 04.3% 31.7% 100.0% 08.4%
144/12 .856 00.0% 00.0% 00.0% 00.0% 02.4% 25.7% 100.0% 07.2%
240/12 .909 00.0% 00.0% 00.0% 00.0% 00.3% 11.2% 100.0% 04.6%

And as promised, some Lorenz curves:

gini-coefficients-at-given-powers-of-random-distribution

Ooooh, the colors.

And for a fun digression, here’s a Lorenz curve of U.S. earnings.

lorenz-curve-of-u-s-earnings-2015

(Source: Social Security Administration)

The data aren’t presented in a way that can be properly sorted, so I can’t calculate the area under the curve, but at the median, the Gini coefficient is probably about .5. At the 90th percentile, it’s more like .8. I wouldn’t be surprised if U.S. income polarization is worse than .45.

So, now you know how to estimate a Gini coefficient in MS Excel, and if someone throws Gini numbers at you, you can look at the tables to get a sense of what they mean in terms of the maximum data point. It’s still a number without any units, but at least you can see the relationships among the data points more clearly.

Advertisements

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s