4 February 2020
Gaps in income between groups can arise from differences in characteristics across groups, differences in the returns to those characteristics, and outright discrimination that is independent of productive characteristics. Each of these sources of income gaps has dramatically different policy implications. Consequently, it is important to decompose income gaps into these different components. This exercise will walk you through these decompositions with historical earnings data from the federal census.
We will work with the IPUMS 1 percent sample of the 1950 federal census. The file ipums-1950-25-55-year-olds.dta contains data from an extract of the 1950 sample restricted to individuals between the ages of 25 and 55 (we want to focus on adults in the labor force). This file can be downloaded from our course website. Alternatively, you could create your own extract through IPUMS if you want to play around with additional variables or other age ranges.
Let’s focus on income gaps between black and white males. To simplify our lives, the first thing we’ll do is open the data and then drop all females and all races other than white or black. Note that in the IPUMS extracts, sex is coded as equal to one for males and two for females. Race is coded as one for white and two for black.
. clear
. use ipums-1950-25-55-year-olds.dta
. keep if sex==1 & (race==1 | race==2)
(383,544 observations deleted)
It is not necessarily the best practice to take this approach. I am dropping the other observations here to keep subsequent commands shorter for the sake of making the tutorial easier to read. Typically, I would prefer to keep all of the data and simply use if statements on all of my commands to focus on the groups of interest; you never know when those other observations may come in useful.
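For example, the tabulation of incomes by race further below could have been run on the full data set (after the same recoding of inctot) by folding the sample restrictions into the if condition; a sketch of the equivalent command (output omitted):
. tab race if sex==1 & (race==1 | race==2) & empstat==1, sum(inctot)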
Now let's take a look at some basic summary statistics for the incomes of black and white males using the inctot variable, total individual pre-tax income from all sources. A quick way to do this is to use the tabulate command with the summarize option. To make meaningful comparisons, we will restrict our attention to only those who are employed (those for whom the empstat variable is equal to one). Before we look at the summary statistics, there is one small but important step we need to take. If you look at the documentation for inctot, the variable takes on a value of 9999999 to designate N/A. We certainly do not want to count these individuals as earning ten million dollars, so we need to set these values to missing. Once we do that, we can summarize incomes by race.
. replace inctot=. if inctot==9999999
(274,960 real changes made, 274,960 to missing)
. tab race if empstat==1, sum(inctot)
Race │
[general │ Summary of Total personal income
version] │ Mean Std. Dev. Freq.
────────────┼────────────────────────────────────
White │ 3253.3299 2001.1598 77,089
Black/Afr │ 1746.7236 1160.6887 7,376
────────────┼────────────────────────────────────
Total │ 3121.7638 1988.3331 84,465
Here we can see a pretty large difference in average incomes between black and white males, with white males having roughly twice the average income of black males. We could run a simple t-test in Stata to confirm that this gap is statistically significant, but in anticipation of starting to decompose the gap, let's switch to a regression framework. We will begin with a simple regression:
\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \varepsilon_{i} \end{equation}\]The variable \(Black_{i}\) is simply an indicator variable equal to one if an individual is black and zero if an individual is white. The coefficient \(\beta_{1}\) will then give us the average difference in log income between black and white individuals. Let’s give it a try, first creating the needed log income and black indicator variables (when generating the black indicator variable, recall that race is coded as one for white and two for black):
. gen lninctot = ln(inctot)
(279,673 missing values generated)
. gen black = race - 1
. reg lninctot black
Source │ SS df MS Number of obs = 88,848
─────────────┼────────────────────────────────── F(1, 88846) = 5587.54
Model │ 3287.86373 1 3287.86373 Prob > F = 0.0000
Residual │ 52279.4359 88,846 .58842757 R-squared = 0.0592
─────────────┼────────────────────────────────── Adj R-squared = 0.0592
Total │ 55567.2996 88,847 .625426853 Root MSE = .76709
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
black │ -.6683385 .008941 -74.75 0.000 -.6858627 -.6508142
_cons │ 7.863199 .0026995 2912.87 0.000 7.857908 7.86849
─────────────┴────────────────────────────────────────────────────────────────
Here we get our basic, unconditional black-white gap in log earnings, -0.67. Now we are ready to start thinking about how to explain this gap.
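A quick back-of-the-envelope way to interpret a log gap of this size: exponentiating the coefficient converts the difference in average log incomes into a ratio of incomes,
\[\begin{equation} e^{-0.67} \approx 0.51, \end{equation}\]
so black male incomes were roughly half of white male incomes, in line with the roughly two-to-one difference in average incomes we saw in the summary table above.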
We are going to first consider the extent to which we can explain the gap by differences in observable characteristics. Let’s see what happens when we introduce controls for years of educational attainment and potential years of experience. First things first, let’s recode the higrade variable to be equal to years of educational attainment. Note that higrade takes on a value of zero if it is not applicable, so we want these to be treated as missing, and otherwise it is basically equal to years of education plus three (refer to the codes on IPUMS to see what I mean by this). Using that information, we can construct a years of schooling variable:
. gen schooling = higrade - 3
. replace schooling = . if higrade==0
(274,960 real changes made, 274,960 to missing)
. replace schooling = 0 if schooling<0
(1,138 real changes made)
Now for years of potential experience, we’ll assume that schooling starts at age 5 and then is completed without gaps. This means we can estimate years of potential experience based on age and our new schooling variable:
. gen exp = age - schooling - 5
(274,960 missing values generated)
. gen exp2 = exp^2
(274,960 missing values generated)
Note that I created a variable for the square of potential experience as well. We will want to include a quadratic in experience in our regression to account for diminishing returns to experience.
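To see why the quadratic captures diminishing returns, note that (using the coefficient labels from the equation below) the implied return to an additional year of experience is
\[\begin{equation} \frac{\partial \, ln(y_{i})}{\partial \, Exp_{i}} = \beta_{3} + 2 \beta_{4} Exp_{i}, \end{equation}\]
which declines as experience accumulates whenever \(\beta_{4} < 0\) and peaks at \(Exp_{i} = -\beta_{3}/(2\beta_{4})\) years. Our new regression equation is now: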
\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \beta_{2} Schooling_{i} + \beta_{3} Exp_{i} + \beta_{4} Exp_{i}^{2} + \varepsilon_{i} \end{equation}\]Let’s run this new specification and see what happens to the black-white log earnings gap once we control for differences in schooling and experience:
. reg lninctot black schooling exp exp2
Source │ SS df MS Number of obs = 88,848
─────────────┼────────────────────────────────── F(4, 88843) = 4301.60
Model │ 9015.72537 4 2253.93134 Prob > F = 0.0000
Residual │ 46551.5742 88,843 .523975713 R-squared = 0.1622
─────────────┼────────────────────────────────── Adj R-squared = 0.1622
Total │ 55567.2996 88,847 .625426853 Root MSE = .72386
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
black │ -.4311397 .0087577 -49.23 0.000 -.4483048 -.4139747
schooling │ .0802859 .00082 97.91 0.000 .0786787 .0818931
exp │ .0431854 .0011701 36.91 0.000 .040892 .0454789
exp2 │ -.0006657 .0000228 -29.21 0.000 -.0007104 -.0006211
_cons │ 6.474822 .0177423 364.94 0.000 6.440048 6.509597
─────────────┴────────────────────────────────────────────────────────────────
Our coefficient on \(Black_{i}\) has now shrunk to -0.43; controlling for differences in schooling and experience has accounted for about a third of the black-white log income gap. What other characteristics might matter? Consider location. A quick check confirms that the black population share is substantially higher in the southern regions of the United States, and these southern regions also have lower incomes on average, so perhaps location can explain some of the black-white gap:
. tab region, sum(black)
Census │
region and │ Summary of black
division │ Mean Std. Dev. Freq.
────────────┼────────────────────────────────────
New Engla │ .01583286 .12483133 22,927
Middle At │ .06750081 .25088894 77,525
East Nort │ .06883883 .25318157 74,580
West Nort │ .03016848 .17105331 32,882
South Atl │ .24544119 .43035253 53,084
East Sout │ .2235597 .4166378 27,581
West Sout │ .16176428 .36823972 34,847
Mountain │ .01428449 .11866613 11,691
Pacific D │ .04182134 .2001837 33,404
────────────┼────────────────────────────────────
Total │ .10343508 .30452671 368,521
. tab region, sum(inctot)
Census │
region and │ Summary of Total personal income
division │ Mean Std. Dev. Freq.
────────────┼────────────────────────────────────
New Engla │ 2934.633 1906.1064 5,649
Middle At │ 3137.5167 2033.8479 19,901
East Nort │ 3251.4877 1935.5788 19,273
West Nort │ 2880.4717 2010.2336 8,427
South Atl │ 2458.6902 1941.6888 12,655
East Sout │ 2077.2632 1808.1586 6,299
West Sout │ 2635.378 2077.82 8,793
Mountain │ 3087.2279 2028.3844 3,058
Pacific D │ 3388.7455 2076.9102 9,506
────────────┼────────────────────────────────────
Total │ 2939.0832 2020.6728 93,561
To include regional variation in our regressions, let’s construct a few new indicator variables:
. gen midwest = 0
. replace midwest = 1 if region==21 | region==22
(107,462 real changes made)
. gen south = 0
. replace south = 1 if region==31 | region==32 | region==33
(115,512 real changes made)
. gen west = 0
. replace west = 1 if region==41 | region==42
(45,095 real changes made)
With region controls, our regression equation now becomes:
\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \beta_{2} Schooling_{i} + \beta_{3} Exp_{i} + \beta_{4} Exp_{i}^{2} + \beta_{5} Midwest_{i} + \beta_{6} South_{i} + \beta_{7} West_{i} + \varepsilon_{i} \end{equation}\]Note that there is no indicator variable for the Northeast. We need to omit one category; otherwise the indicator variables would be collinear and we would not be able to get unique coefficients (if we did include a Northeast indicator variable and ran the regression, Stata would drop one of the variables with a message noting the collinearity). Each region coefficient tells us how average log income in that region compares to average log income in the Northeast, controlling for race, schooling, and experience.
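If you would like to verify the collinearity point for yourself, something along these lines (not run here) should lead Stata to drop one of the four region indicators and print a collinearity note; this uses the same IPUMS region codes as above, under which New England and the Middle Atlantic make up the Northeast:
. gen northeast = (region==11 | region==12)
. reg lninctot black schooling exp exp2 northeast midwest south west
Let's see how this impacts our estimated black-white income gap: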
. reg lninctot black schooling exp exp2 midwest south west
Source │ SS df MS Number of obs = 88,848
─────────────┼────────────────────────────────── F(7, 88840) = 2720.92
Model │ 9809.92955 7 1401.41851 Prob > F = 0.0000
Residual │ 45757.3701 88,840 .515053693 R-squared = 0.1765
─────────────┼────────────────────────────────── Adj R-squared = 0.1765
Total │ 55567.2996 88,847 .625426853 Root MSE = .71767
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
black │ -.3721592 .0088123 -42.23 0.000 -.3894312 -.3548872
schooling │ .0750892 .0008242 91.10 0.000 .0734738 .0767046
exp │ .0430693 .0011601 37.12 0.000 .0407954 .0453431
exp2 │ -.0006769 .0000226 -29.96 0.000 -.0007212 -.0006326
midwest │ -.0057455 .0063868 -0.90 0.368 -.0182635 .0067725
south │ -.2170893 .0065642 -33.07 0.000 -.2299551 -.2042235
west │ .0030648 .0080198 0.38 0.702 -.012654 .0187835
_cons │ 6.595718 .0182088 362.23 0.000 6.560029 6.631407
─────────────┴────────────────────────────────────────────────────────────────
Sure enough, a significant part of the black-white gap was being driven by black individuals being disproportionately located in the low-income South.
Up to this point, we have been assuming that the returns to characteristics are the same for black and white workers. However, part of the black-white income gap will be driven by differences in the returns to characteristics. In other words, an additional year of schooling might lead to greater labor market returns for a white worker than for a black worker. To see if this is the case, we can run two separate regressions, one for black males and one for white males, and see whether the coefficients differ. We will use the same specification as above, except that we can drop the \(Black_{i}\) variable since there will be no variation in race within each regression sample:
. reg lninctot schooling exp exp2 midwest south west if black==1
Source │ SS df MS Number of obs = 8,099
─────────────┼────────────────────────────────── F(6, 8092) = 250.18
Model │ 951.424234 6 158.570706 Prob > F = 0.0000
Residual │ 5128.95486 8,092 .633830308 R-squared = 0.1565
─────────────┼────────────────────────────────── Adj R-squared = 0.1558
Total │ 6080.37909 8,098 .75084948 Root MSE = .79613
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
schooling │ .0550572 .0030105 18.29 0.000 .0491558 .0609586
exp │ .0315035 .0046549 6.77 0.000 .0223786 .0406284
exp2 │ -.0005181 .0000816 -6.35 0.000 -.0006781 -.0003582
midwest │ .082265 .0302801 2.72 0.007 .0229081 .1416218
south │ -.4216104 .0258107 -16.33 0.000 -.4722059 -.3710148
west │ .0718059 .0455241 1.58 0.115 -.0174331 .1610448
_cons │ 6.64194 .075739 87.70 0.000 6.493472 6.790408
─────────────┴────────────────────────────────────────────────────────────────
. reg lninctot schooling exp exp2 midwest south west if black==0
Source │ SS df MS Number of obs = 80,749
─────────────┼────────────────────────────────── F(6, 80742) = 1897.79
Model │ 5710.0202 6 951.670033 Prob > F = 0.0000
Residual │ 40489.0366 80,742 .501461898 R-squared = 0.1236
─────────────┼────────────────────────────────── Adj R-squared = 0.1235
Total │ 46199.0568 80,748 .572138713 Root MSE = .70814
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
schooling │ .0765422 .0008552 89.50 0.000 .074866 .0782185
exp │ .0438008 .0012051 36.35 0.000 .0414389 .0461627
exp2 │ -.0006864 .0000237 -28.90 0.000 -.0007329 -.0006398
midwest │ -.0116585 .0064823 -1.80 0.072 -.0243638 .0010467
south │ -.1917959 .0067935 -28.23 0.000 -.2051111 -.1784807
west │ .0017851 .0080763 0.22 0.825 -.0140445 .0176147
_cons │ 6.565453 .0186564 351.91 0.000 6.528886 6.602019
─────────────┴────────────────────────────────────────────────────────────────
Sure enough, the coefficients turn out to be quite different, with the returns to schooling and experience lower for black workers and the region coefficients differing as well. These differences in the returns to characteristics contribute to the overall black-white income gap and help explain why such a large gap remained even after controlling for differences in characteristics.
We would now like to say something about how much of the black-white income gap is driven by differences in characteristics and how much is driven by differences in the returns to those characteristics. We can do this using a Blinder-Oaxaca decomposition. To see how the decomposition works, let’s start by writing down equations for the average log income for black individuals and for white individuals by plugging in the average values for each characteristic into our race-specific regression equations. For simplicity, I will revert back to the equations that did not include region controls (even though we now know they are important):
\[\begin{equation} \overline{ln(y)}_{W} = \beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} + \overline{\varepsilon}_{W} \end{equation}\] \[\begin{equation} \overline{ln(y)}_{B} = \beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} + \overline{\varepsilon}_{B} \end{equation}\]Now we will take the difference between these, noting that the mean of the error term in each case is zero and can be dropped:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} \right) - \left(\beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} \right) \end{equation}\]With a little bit of algebra, we can transform this equation into a rather intuitive decomposition of the income gap. First we will add and subtract a series of identical terms; you'll see why in just a second:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} \right) - \left(\beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} \right) + \end{equation}\] \[\begin{equation} \left(\beta_{1,W} \overline{Schooling}_{B} - \beta_{1,W} \overline{Schooling}_{B}\right) + \left(\beta_{2,W} \overline{Exp}_{B} - \beta_{2,W} \overline{Exp}_{B}\right) + \left(\beta_{3,W} \overline{Exp^{2}}_{B} - \beta_{3,W} \overline{Exp^{2}}_{B}\right) \end{equation}\]Now we can rearrange and group terms together:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} - \beta_{0,B}\right) + \end{equation}\] \[\begin{equation} \left(\beta_{1,W}-\beta_{1,B}\right) \overline{Schooling}_{B} + \left(\beta_{2,W}-\beta_{2,B}\right) \overline{Exp}_{B} + \left(\beta_{3,W}-\beta_{3,B}\right) \overline{Exp^{2}}_{B} + \end{equation}\] \[\begin{equation}\beta_{1,W} \left(\overline{Schooling}_{W}-\overline{Schooling}_{B}\right) + \beta_{2,W} \left(\overline{Exp}_{W}-\overline{Exp}_{B}\right) + \beta_{3,W}\left(\overline{Exp^{2}}_{W}-\overline{Exp^{2}}_{B}\right) \end{equation}\]What we are left with is our decomposition. The first difference is the difference in intercepts, essentially the black-white gap that would exist between two workers with no schooling or experience. The next three terms capture the difference in black and white log incomes due to differences in the returns to characteristics. The final three terms capture the difference in black and white log income due to differences in average levels of the characteristics.
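Written more compactly, with \(\overline{X}_{k}\) denoting the group mean of the \(k\)-th characteristic (schooling, experience and experience squared in our case), the decomposition we will compute is:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} - \beta_{0,B}\right) + \sum_{k}\left(\beta_{k,W} - \beta_{k,B}\right) \overline{X}_{k,B} + \sum_{k}\beta_{k,W}\left(\overline{X}_{k,W} - \overline{X}_{k,B}\right) \end{equation}\]
The second sum is the part of the gap due to differences in returns, evaluated at black mean characteristics, and the third sum is the part due to differences in characteristics, evaluated at white returns.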
A simple way to do the Blinder-Oaxaca decomposition would be to copy the relevant regression coefficients and variable means into Excel and calculate all of the relevant terms there. However, we can also do all of the calculations right in Stata if we get a bit more advanced with our commands. The first thing we need to do is have Stata store our regression coefficients in a way that is easy for us to work with. We will do this by re-running the regressions and then saving the coefficients to a matrix:
. reg lninctot schooling exp exp2 midwest south west if black==1
Source │ SS df MS Number of obs = 8,099
─────────────┼────────────────────────────────── F(6, 8092) = 250.18
Model │ 951.424234 6 158.570706 Prob > F = 0.0000
Residual │ 5128.95486 8,092 .633830308 R-squared = 0.1565
─────────────┼────────────────────────────────── Adj R-squared = 0.1558
Total │ 6080.37909 8,098 .75084948 Root MSE = .79613
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
schooling │ .0550572 .0030105 18.29 0.000 .0491558 .0609586
exp │ .0315035 .0046549 6.77 0.000 .0223786 .0406284
exp2 │ -.0005181 .0000816 -6.35 0.000 -.0006781 -.0003582
midwest │ .082265 .0302801 2.72 0.007 .0229081 .1416218
south │ -.4216104 .0258107 -16.33 0.000 -.4722059 -.3710148
west │ .0718059 .0455241 1.58 0.115 -.0174331 .1610448
_cons │ 6.64194 .075739 87.70 0.000 6.493472 6.790408
─────────────┴────────────────────────────────────────────────────────────────
. matrix b_coefficients = e(b)
. reg lninctot schooling exp exp2 midwest south west if black==0
Source │ SS df MS Number of obs = 80,749
─────────────┼────────────────────────────────── F(6, 80742) = 1897.79
Model │ 5710.0202 6 951.670033 Prob > F = 0.0000
Residual │ 40489.0366 80,742 .501461898 R-squared = 0.1236
─────────────┼────────────────────────────────── Adj R-squared = 0.1235
Total │ 46199.0568 80,748 .572138713 Root MSE = .70814
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
schooling │ .0765422 .0008552 89.50 0.000 .074866 .0782185
exp │ .0438008 .0012051 36.35 0.000 .0414389 .0461627
exp2 │ -.0006864 .0000237 -28.90 0.000 -.0007329 -.0006398
midwest │ -.0116585 .0064823 -1.80 0.072 -.0243638 .0010467
south │ -.1917959 .0067935 -28.23 0.000 -.2051111 -.1784807
west │ .0017851 .0080763 0.22 0.825 -.0140445 .0176147
_cons │ 6.565453 .0186564 351.91 0.000 6.528886 6.602019
─────────────┴────────────────────────────────────────────────────────────────
. matrix w_coefficients = e(b)
. matrix w_minus_b = w_coefficients - b_coefficients
The first matrix command tells Stata to create a matrix named b_coefficients containing the regression coefficients, which are stored temporarily after running the regression as e(b). If you want to see all of the results stored by Stata after running a command, you can use the commands return list and ereturn list. Let's see what our new matrices look like:
. matrix list b_coefficients
b_coefficients[1,7]
schooling exp exp2 midwest south west _cons
y1 .0550572 .03150349 -.00051812 .08226497 -.42161036 .07180586 6.6419397
. matrix list w_coefficients
w_coefficients[1,7]
schooling exp exp2 midwest south west _cons
y1 .07654223 .04380079 -.00068639 -.01165852 -.19179591 .00178511 6.5654529
. matrix list w_minus_b
w_minus_b[1,7]
schooling exp exp2 midwest south west _cons
y1 .02148503 .0122973 -.00016827 -.09392349 .22981445 -.07002076 -.07648672
These matrices contain all of the various \(\beta_{W}\)'s and \(\beta_{B}\)'s we need for our decomposition. Now we need to store the variable means. The summarize command will store the mean of the summarized variable for us, but only one variable at a time. We could get fancy here by looping over the regression variables (a sketch of that approach appears after we finish storing the means below), but we have already done enough fancy stuff for one tutorial, so we will just use summarize one variable at a time and store the values we need as local macros. There is one last detail to handle. If we were to just summarize a variable, the mean would include observations that were not actually used in the regression (think of observations missing values for one or more variables). To restrict our summary statistics to just the regression samples, we can use e(sample), information returned by the regression command that identifies the observations used in the regression:
. reg lninctot schooling exp exp2 midwest south west if black==1
Source │ SS df MS Number of obs = 8,099
─────────────┼────────────────────────────────── F(6, 8092) = 250.18
Model │ 951.424234 6 158.570706 Prob > F = 0.0000
Residual │ 5128.95486 8,092 .633830308 R-squared = 0.1565
─────────────┼────────────────────────────────── Adj R-squared = 0.1558
Total │ 6080.37909 8,098 .75084948 Root MSE = .79613
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
schooling │ .0550572 .0030105 18.29 0.000 .0491558 .0609586
exp │ .0315035 .0046549 6.77 0.000 .0223786 .0406284
exp2 │ -.0005181 .0000816 -6.35 0.000 -.0006781 -.0003582
midwest │ .082265 .0302801 2.72 0.007 .0229081 .1416218
south │ -.4216104 .0258107 -16.33 0.000 -.4722059 -.3710148
west │ .0718059 .0455241 1.58 0.115 -.0174331 .1610448
_cons │ 6.64194 .075739 87.70 0.000 6.493472 6.790408
─────────────┴────────────────────────────────────────────────────────────────
. sum lninctot if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
lninctot │ 8,099 7.194861 .8665157 3.912023 9.21034
. local b_lninctot = r(mean)
. sum schooling if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
schooling │ 8,099 6.769972 3.75231 0 17
. local b_schooling = r(mean)
. sum exp if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp │ 8,099 26.71836 10.32438 3 50
. local b_exp = r(mean)
. sum exp2 if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp2 │ 8,099 820.4504 576.6295 9 2500
. local b_exp2 = r(mean)
. sum midwest if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
midwest │ 8,099 .187554 .3903797 0 1
. local b_midwest = r(mean)
. sum south if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
south │ 8,099 .6058773 .4886916 0 1
. local b_south = r(mean)
. sum west if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
west │ 8,099 .0497592 .2174605 0 1
. local b_west = r(mean)
Now we will do the same for the white regression and variable means:
. reg lninctot schooling exp exp2 midwest south west if black==0
Source │ SS df MS Number of obs = 80,749
─────────────┼────────────────────────────────── F(6, 80742) = 1897.79
Model │ 5710.0202 6 951.670033 Prob > F = 0.0000
Residual │ 40489.0366 80,742 .501461898 R-squared = 0.1236
─────────────┼────────────────────────────────── Adj R-squared = 0.1235
Total │ 46199.0568 80,748 .572138713 Root MSE = .70814
─────────────┬────────────────────────────────────────────────────────────────
lninctot │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
schooling │ .0765422 .0008552 89.50 0.000 .074866 .0782185
exp │ .0438008 .0012051 36.35 0.000 .0414389 .0461627
exp2 │ -.0006864 .0000237 -28.90 0.000 -.0007329 -.0006398
midwest │ -.0116585 .0064823 -1.80 0.072 -.0243638 .0010467
south │ -.1917959 .0067935 -28.23 0.000 -.2051111 -.1784807
west │ .0017851 .0080763 0.22 0.825 -.0140445 .0176147
_cons │ 6.565453 .0186564 351.91 0.000 6.528886 6.602019
─────────────┴────────────────────────────────────────────────────────────────
. sum lninctot if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
lninctot │ 80,749 7.863199 .7563985 3.912023 9.21034
. local w_lninctot = r(mean)
. sum schooling if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
schooling │ 80,749 10.04538 3.47595 0 17
. local w_schooling = r(mean)
. sum exp if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp │ 80,749 23.72776 10.10725 3 50
. local w_exp = r(mean)
. sum exp2 if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
exp2 │ 80,749 665.162 510.1659 9 2500
. local w_exp2 = r(mean)
. sum midwest if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
midwest │ 80,749 .30975 .4623933 0 1
. local w_midwest = r(mean)
. sum south if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
south │ 80,749 .263446 .4405049 0 1
. local w_south = r(mean)
. sum west if e(sample)
Variable │ Obs Mean Std. Dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
west │ 80,749 .1440885 .3511816 0 1
. local w_west = r(mean)
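As promised above, here is a sketch of how a loop over the regression variables could replace all of that repetitive typing for one group (assuming the black regression has just been run); the white means would work the same way with a w_ prefix:

foreach v in lninctot schooling exp exp2 midwest south west {
    quietly summarize `v' if e(sample)   // mean computed over the regression sample only
    local b_`v' = r(mean)                // store the mean in a local named after the variable
}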
Finally we are ready to do our decomposition. We will calculate the overall difference in log incomes, the component due to the difference in intercepts, the component due to differences in the returns to characteristics, and the component due to the differences in mean levels of those characteristics:
. local overall_gap = `w_lninctot' - `b_lninctot'
. local intercept_gap = w_minus_b[1,7]
. local coefficient_gap = w_minus_b[1,1] * `b_schooling' + w_minus_b[1,2] * `b_exp' + w_minus_b[1,3] * `b_exp2' + w_minus_b[1,4] * `b_midwest' + w_minus_b[1,5] * `b_south' + w_minus_b[1,6] * `b_west'
. local characteristic_gap = w_coefficients[1,1] * (`w_schooling' - `b_schooling') + w_coefficients[1,2] * (`w_exp'-`b_exp') + w_coefficients[1,3] * (`w_exp2'-`b_exp2') + w_coefficients[1,4] * (`w_midwest'-`b_midwest') + w_coefficients[1,5] * (`w_south'-`b_south') + w_coefficients[1,6] * (`w_west'-`b_west')
A quick note about the notation used above. To use the value of a local macro in an expression, you put single quotes around the name of the macro (i.e., the use of `w_lninctot' in the first line above). To use the value of a cell in a matrix, you use the matrix name followed by the cell's row and column in brackets (i.e., the use of w_minus_b[1,7] in the second line above). To see the values of any of these local macros we have been creating, we can use the display command. Let's take a look at the different gap components we have generated and then calculate each component's share of the overall gap:
. display `overall_gap'
.66833847
. display `intercept_gap'
-.07648672
. display `coefficient_gap'
.45409959
. display `characteristic_gap'
.2907256
. local share_coefficients = `coefficient_gap' / `overall_gap'
. local share_characteristics = `characteristic_gap' / `overall_gap'
. display `share_coefficients'
.67944554
. display `share_characteristics'
.43499755
We have finally arrived at our decomposition. Differences in average characteristics account for 0.29 log points, or a share of 0.43, of the overall gap of 0.67 log points. Differences in the returns to those characteristics account for 0.45 log points, or a share of 0.68. (The shares sum to more than one because the intercept component of the gap is negative.) Clearly both sources are important for explaining black-white income gaps in the middle of the 20th century.
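One last aside: if you find yourself doing these decompositions often, there is a user-written Stata command, oaxaca (available from SSC), that automates Blinder-Oaxaca decompositions. Something along these lines should produce a comparable decomposition, though its default options weight and report the components a bit differently than our hand calculation:
. ssc install oaxaca
. oaxaca lninctot schooling exp exp2 midwest south west, by(black)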