4 February 2020
Gaps in income between groups can arise from differences in characteristics across groups, differences in the returns to those characteristics, and outright discrimination that is independent of productive characteristics. Each of these sources of income gaps has dramatically different policy implications. Consequently, it is important to decompose income gaps into these different components. This exercise will walk you through these decompositions with historical earnings data from the federal census.
We will work with the IPUMS 1 percent sample of the 1950 federal census. The file ipums-1950-25-55-year-olds.dta contains data from an extract of the 1950 sample restricted to individuals between the ages of 25 and 55 (we want to focus on adults in the labor force). This file can be downloaded from our course website. Alternatively, you could create your own extract through IPUMS if you want to play around with additional variables or other age ranges.
Let’s focus on income gaps between black and white males. To simplify our lives, the first thing we’ll do is open the data and then drop all females and all races other than white or black. Note that in the IPUMS extracts, sex is coded as equal to one for males and two for females. Race is coded as one for white and two for black.
. clear

. use ipums-1950-25-55-year-olds.dta

. keep if sex==1 & (race==1 | race==2)
(383,544 observations deleted)
It is not necessarily the best practice to take this approach. I am dropping the other observations here to keep subsequent commands shorter for the sake of making the tutorial easier to read. Typically, I would prefer to keep all of the data and simply use if statements on all of my commands to focus on the groups of interest; you never know when those other observations may come in useful.
Now let’s take a look at some basic summary statistics for the incomes of black and white males using the inctot variable, total individual pre-tax income from all sources. A quick way to do this is to use the tabulate command with the summarize option. To make meaningful comparisons, we will restrict our attention to only those who are employed (those for whom the empstat variable is equal to one). Before we look at the summary statistics, there is one small but important step we need to take. If you look at the documentation for inctot, the variable takes on a value of 9999999 to designate N/A. We certainly do not want to count these individuals as earning ten million dollars, so we need to set these values to missing. Once we do that, we can summarize incomes by race.
. replace inctot=. if inctot==9999999
(274,960 real changes made, 274,960 to missing)

. tab race if empstat==1, sum(inctot)

       Race │
   [general │     Summary of Total personal income
   version] │        Mean   Std. Dev.       Freq.
────────────┼────────────────────────────────────
      White │   3253.3299   2001.1598      77,089
  Black/Afr │   1746.7236   1160.6887       7,376
────────────┼────────────────────────────────────
      Total │   3121.7638   1988.3331      84,465
Here we can see a pretty large difference in average incomes between black and white males, with white males having roughly twice the average income of black males. We could do a simple t-test in Stata to confirm that this is a statistically significant gap but, in anticipation of starting to decompose the gap, let’s switch to using a regression framework to estimate the gap. We will begin with a simple regression:
\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \varepsilon_{i} \end{equation}\]

The variable \(Black_{i}\) is simply an indicator variable equal to one if an individual is black and zero if an individual is white. The coefficient \(\beta_{1}\) will then give us the average difference in log income between black and white individuals. Let’s give it a try, first creating the needed log income and black indicator variables (when generating the black indicator variable, recall that race is coded as one for white and two for black):
. gen lninctot = ln(inctot)
(279,673 missing values generated)

. gen black = race - 1

. reg lninctot black

      Source │       SS           df       MS      Number of obs   =    88,848
─────────────┼──────────────────────────────────   F(1, 88846)     =   5587.54
       Model │  3287.86373         1  3287.86373   Prob > F        =    0.0000
    Residual │  52279.4359    88,846   .58842757   R-squared       =    0.0592
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0592
       Total │  55567.2996    88,847  .625426853   Root MSE        =    .76709

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       black │  -.6683385    .008941   -74.75   0.000    -.6858627   -.6508142
       _cons │   7.863199   .0026995  2912.87   0.000     7.857908     7.86849
─────────────┴────────────────────────────────────────────────────────────────
Here we get our basic, unconditional black-white gap in log earnings, -0.67. Now we are ready to start thinking about how to explain this gap.
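Before doing so, one quick aside on interpreting the magnitude: because the outcome is in logs, exponentiating the coefficient gives the implied ratio of black to white earnings, roughly exp(-0.67) ≈ 0.51 here, which lines up with the raw means above showing black incomes at about half of white incomes. A minimal check in do-file form, run right after the regression:

* implied ratio of black to white earnings from the estimated log gap (about 0.51)
display exp(_b[black])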
We are going to first consider the extent to which we can explain the gap by differences in observable characteristics. Let’s see what happens when we introduce controls for years of educational attainment and potential years of experience. First things first, let’s recode the higrade variable to be equal to years of educational attainment. Note that higrade takes on a value of zero if it is not applicable, so we want these to be treated as missing, and otherwise it is basically equal to years of education plus three (refer to the codes on IPUMS to see what I mean by this). Using that information, we can construct a years of schooling variable:
. gen schooling = higrade - 3

. replace schooling = . if higrade==0
(274,960 real changes made, 274,960 to missing)

. replace schooling = 0 if schooling<0
(1,138 real changes made)
Now for years of potential experience, we’ll assume that schooling starts at age 5 and then is completed without gaps. This means we can estimate years of potential experience based on age and our new schooling variable:
. gen exp = age - schooling - 5
(274,960 missing values generated)

. gen exp2 = exp^2
(274,960 missing values generated)
Note that I created a variable for the square of potential experience as well. We will want to use a quadratic in experience in our regression to account for diminishing returns to experience. Our new regression equation is now:
\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \beta_{2} Schooling_{i} + \beta_{3} Exp_{i} + \beta_{4} Exp_{i}^{2} + \varepsilon_{i} \end{equation}\]

Let’s run this new specification and see what happens to the black-white log earnings gap once we control for differences in schooling and experience:
. reg lninctot black schooling exp exp2

      Source │       SS           df       MS      Number of obs   =    88,848
─────────────┼──────────────────────────────────   F(4, 88843)     =   4301.60
       Model │  9015.72537         4  2253.93134   Prob > F        =    0.0000
    Residual │  46551.5742    88,843  .523975713   R-squared       =    0.1622
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1622
       Total │  55567.2996    88,847  .625426853   Root MSE        =    .72386

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       black │  -.4311397   .0087577   -49.23   0.000    -.4483048   -.4139747
   schooling │   .0802859     .00082    97.91   0.000     .0786787    .0818931
         exp │   .0431854   .0011701    36.91   0.000      .040892    .0454789
        exp2 │  -.0006657   .0000228   -29.21   0.000    -.0007104   -.0006211
       _cons │   6.474822   .0177423   364.94   0.000     6.440048    6.509597
─────────────┴────────────────────────────────────────────────────────────────
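Before turning to the coefficient of interest, a quick aside on the experience quadratic: with a positive coefficient on exp and a negative one on exp2, predicted log income rises with experience at a decreasing rate and peaks at \(-\beta_{3}/(2\beta_{4})\). A small check in do-file form (with the estimates above this works out to roughly 32 years of experience):

* years of experience at which the estimated experience profile peaks (about 32 here)
display -_b[exp]/(2*_b[exp2])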
Our coefficient on \(Black_{i}\) has now shrunk to -0.43; controlling for differences in schooling and experience has accounted for about a third of the black-white log income gap. What other characteristics might matter? Consider location. A quick check confirms that the black population share is substantially higher in the southern regions of the United States. These southern regions also have lower incomes on average, so perhaps this could explain some of the black-white gap:
. tab region, sum(black)

     Census │
 region and │           Summary of black
   division │        Mean   Std. Dev.       Freq.
────────────┼────────────────────────────────────
  New Engla │   .01583286   .12483133      22,927
  Middle At │   .06750081   .25088894      77,525
  East Nort │   .06883883   .25318157      74,580
  West Nort │   .03016848   .17105331      32,882
  South Atl │   .24544119   .43035253      53,084
  East Sout │    .2235597    .4166378      27,581
  West Sout │   .16176428   .36823972      34,847
   Mountain │   .01428449   .11866613      11,691
  Pacific D │   .04182134    .2001837      33,404
────────────┼────────────────────────────────────
      Total │   .10343508   .30452671     368,521

. tab region, sum(inctot)

     Census │
 region and │   Summary of Total personal income
   division │        Mean   Std. Dev.       Freq.
────────────┼────────────────────────────────────
  New Engla │    2934.633   1906.1064       5,649
  Middle At │   3137.5167   2033.8479      19,901
  East Nort │   3251.4877   1935.5788      19,273
  West Nort │   2880.4717   2010.2336       8,427
  South Atl │   2458.6902   1941.6888      12,655
  East Sout │   2077.2632   1808.1586       6,299
  West Sout │    2635.378     2077.82       8,793
   Mountain │   3087.2279   2028.3844       3,058
  Pacific D │   3388.7455   2076.9102       9,506
────────────┼────────────────────────────────────
      Total │   2939.0832   2020.6728      93,561
To include regional variation in our regressions, let’s construct a few new indicator variables:
. gen midwest = 0

. replace midwest = 1 if region==21 | region==22
(107,462 real changes made)

. gen south = 0

. replace south = 1 if region==31 | region==32 | region==33
(115,512 real changes made)

. gen west = 0

. replace west = 1 if region==41 | region==42
(45,095 real changes made)
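It is worth making sure the new indicators line up with the intended region codes before using them. One optional sanity check (not part of the original steps) is to cross-tabulate each indicator against region; every census division should fall entirely in one column of each table. In do-file form:

* each division should appear in only one column of each cross-tab
tab region midwest
tab region south
tab region west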
With region controls, our regression equation now becomes:
\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \beta_{2} Schooling_{i} + \beta_{3} Exp_{i} + \beta_{4} Exp_{i}^{2} + \beta_{5} Midwest_{i} + \beta_{6} South_{i} + \beta_{7} West_{i} + \varepsilon_{i} \end{equation}\]

Note that there is no indicator variable for the Northeast. We need to omit one category; otherwise we will have collinear variables and won’t be able to get unique coefficients (if we did include a Northeast indicator variable and ran the regression, Stata would drop one of the variables with a warning message noting the collinearity; see the short check after the next regression). Each region coefficient tells us how average income in that region compares to average income in the Northeast, controlling for race, schooling and experience. Let’s see how this impacts our estimated black-white income gap:
. reg lninctot black schooling exp exp2 midwest south west

      Source │       SS           df       MS      Number of obs   =    88,848
─────────────┼──────────────────────────────────   F(7, 88840)     =   2720.92
       Model │  9809.92955         7  1401.41851   Prob > F        =    0.0000
    Residual │  45757.3701    88,840  .515053693   R-squared       =    0.1765
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1765
       Total │  55567.2996    88,847  .625426853   Root MSE        =    .71767

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       black │  -.3721592   .0088123   -42.23   0.000    -.3894312   -.3548872
   schooling │   .0750892   .0008242    91.10   0.000     .0734738    .0767046
         exp │   .0430693   .0011601    37.12   0.000     .0407954    .0453431
        exp2 │  -.0006769   .0000226   -29.96   0.000    -.0007212   -.0006326
     midwest │  -.0057455   .0063868    -0.90   0.368    -.0182635    .0067725
       south │  -.2170893   .0065642   -33.07   0.000    -.2299551   -.2042235
        west │   .0030648   .0080198     0.38   0.702     -.012654    .0187835
       _cons │   6.595718   .0182088   362.23   0.000     6.560029    6.631407
─────────────┴────────────────────────────────────────────────────────────────
Sure enough, a significant part of the black-white gap was being driven by black individuals being disproportionately located in the low-income South.
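As an aside, if you are curious what the collinearity problem mentioned above looks like in practice, you could construct a Northeast indicator and include all four region dummies; Stata should then omit one of them and note the collinearity. A hypothetical check in do-file form (the northeast variable is not used anywhere else in this exercise):

* with all four region indicators plus the constant, one must be dropped
gen northeast = 1 - midwest - south - west
reg lninctot black schooling exp exp2 northeast midwest south west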
Up to this point, we have been assuming that the returns to characteristics are the same for black and white workers. However, part of the black-white income gap may be driven by differences in the returns to characteristics. In other words, an additional year of schooling might lead to greater labor market returns for a white worker than for a black worker. To see if this is the case, we can run two separate regressions, one for black males and one for white males, and see if the coefficients differ. We will use the same specification as above, except we can drop the \(Black_{i}\) variable as there will be no variation in race within each regression sample:
. reg lninctot schooling exp exp2 midwest south west if black==1

      Source │       SS           df       MS      Number of obs   =     8,099
─────────────┼──────────────────────────────────   F(6, 8092)      =    250.18
       Model │  951.424234         6  158.570706   Prob > F        =    0.0000
    Residual │  5128.95486     8,092  .633830308   R-squared       =    0.1565
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1558
       Total │  6080.37909     8,098   .75084948   Root MSE        =    .79613

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
   schooling │   .0550572   .0030105    18.29   0.000     .0491558    .0609586
         exp │   .0315035   .0046549     6.77   0.000     .0223786    .0406284
        exp2 │  -.0005181   .0000816    -6.35   0.000    -.0006781   -.0003582
     midwest │    .082265   .0302801     2.72   0.007     .0229081    .1416218
       south │  -.4216104   .0258107   -16.33   0.000    -.4722059   -.3710148
        west │   .0718059   .0455241     1.58   0.115    -.0174331    .1610448
       _cons │    6.64194    .075739    87.70   0.000     6.493472    6.790408
─────────────┴────────────────────────────────────────────────────────────────

. reg lninctot schooling exp exp2 midwest south west if black==0

      Source │       SS           df       MS      Number of obs   =    80,749
─────────────┼──────────────────────────────────   F(6, 80742)     =   1897.79
       Model │   5710.0202         6  951.670033   Prob > F        =    0.0000
    Residual │  40489.0366    80,742  .501461898   R-squared       =    0.1236
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1235
       Total │  46199.0568    80,748  .572138713   Root MSE        =    .70814

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
   schooling │   .0765422   .0008552    89.50   0.000      .074866    .0782185
         exp │   .0438008   .0012051    36.35   0.000     .0414389    .0461627
        exp2 │  -.0006864   .0000237   -28.90   0.000    -.0007329   -.0006398
     midwest │  -.0116585   .0064823    -1.80   0.072    -.0243638    .0010467
       south │  -.1917959   .0067935   -28.23   0.000    -.2051111   -.1784807
        west │   .0017851   .0080763     0.22   0.825    -.0140445    .0176147
       _cons │   6.565453   .0186564   351.91   0.000     6.528886    6.602019
─────────────┴────────────────────────────────────────────────────────────────
Sure enough, the coefficients turn out to be quite different, with the returns to schooling and experience being lower for black workers and the region coefficients differing as well. These differences in the returns to characteristics contribute to the overall black-white income gap and help explain why such a large gap still existed even after controlling for differences in characteristics.
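As a side note, eyeballing two sets of coefficients does not tell us whether the differences are statistically meaningful. One way to check (a sketch, not part of the original exercise) is to estimate a single fully interacted regression, in which the interaction terms directly estimate the black-white differences in each return and can be tested jointly. In do-file form:

* interaction coefficients give the black-white differences in each return
reg lninctot i.black##(c.schooling c.exp c.exp2 i.midwest i.south i.west)

* joint test of whether any of the returns differ by race
testparm i.black#c.schooling i.black#c.exp i.black#c.exp2 i.black#i.midwest i.black#i.south i.black#i.west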
We would now like to say something about how much of the black-white income gap is driven by differences in characteristics and how much is driven by differences in the returns to those characteristics. We can do this using a Blinder-Oaxaca decomposition. To see how the decomposition works, let’s start by writing down equations for the average log income of black individuals and of white individuals, obtained by plugging the average value of each characteristic into our race-specific regression equations. For simplicity, I will revert to the equations that did not include region controls (even though we now know they are important):
\[\begin{equation} \overline{ln(y)}_{W} = \beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} + \overline{\varepsilon}_{W} \end{equation}\] \[\begin{equation} \overline{ln(y)}_{B} = \beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} + \overline{\varepsilon}_{B} \end{equation}\]

Now we will take the difference between these, noting that the mean of the error term in each case is zero and can be dropped:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} \right) - \left(\beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} \right) \end{equation}\]

With a little bit of algebra, we can transform this equation into a rather intuitive decomposition of the income gap. First we will add and subtract a series of identical terms; you’ll see why in just a second:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} \right) - \left(\beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} \right) + \end{equation}\] \[\begin{equation} \left(\beta_{1,W} \overline{Schooling}_{B} - \beta_{1,W} \overline{Schooling}_{B}\right) + \left(\beta_{2,W} \overline{Exp}_{B} - \beta_{2,W} \overline{Exp}_{B}\right) + \left(\beta_{3,W} \overline{Exp^{2}}_{B} - \beta_{3,W} \overline{Exp^{2}}_{B}\right) \end{equation}\]

Now we can rearrange and group terms together:
\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} - \beta_{0,B}\right) + \end{equation}\] \[\begin{equation} \left(\beta_{1,W}-\beta_{1,B}\right) \overline{Schooling}_{B} + \left(\beta_{2,W}-\beta_{2,B}\right) \overline{Exp}_{B} + \left(\beta_{3,W}-\beta_{3,B}\right) \overline{Exp^{2}}_{B} + \end{equation}\] \[\begin{equation}\beta_{1,W} \left(\overline{Schooling}_{W}-\overline{Schooling}_{B}\right) + \beta_{2,W} \left(\overline{Exp}_{W}-\overline{Exp}_{B}\right) + \beta_{3,W}\left(\overline{Exp^{2}}_{W}-\overline{Exp^{2}}_{B}\right) \end{equation}\]

What we are left with is our decomposition. The first difference is the difference in intercepts, essentially the black-white gap that would exist between two workers with no schooling or experience. The next three terms capture the difference in black and white log incomes due to differences in the returns to characteristics. The final three terms capture the difference in black and white log income due to differences in average levels of the characteristics.
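One caveat worth keeping in mind: this split is not unique. We chose to add and subtract the white coefficients evaluated at the black means; had we instead added and subtracted the black coefficients evaluated at the white means, the returns differences would be weighted by the white means and the characteristic differences by the black coefficients, giving the equally valid alternative form:

\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} - \beta_{0,B}\right) + \left(\beta_{1,W}-\beta_{1,B}\right) \overline{Schooling}_{W} + \left(\beta_{2,W}-\beta_{2,B}\right) \overline{Exp}_{W} + \left(\beta_{3,W}-\beta_{3,B}\right) \overline{Exp^{2}}_{W} + \beta_{1,B} \left(\overline{Schooling}_{W}-\overline{Schooling}_{B}\right) + \beta_{2,B} \left(\overline{Exp}_{W}-\overline{Exp}_{B}\right) + \beta_{3,B}\left(\overline{Exp^{2}}_{W}-\overline{Exp^{2}}_{B}\right) \end{equation}\]

The two versions can give noticeably different splits in practice, so it is good practice to say which reference coefficients you used. Below we stick with the white-coefficient version derived above.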
A simple way to do the Blinder-Oaxaca decomposition would be to copy the relevant regression coefficients and variables means into Excel and calculate all of the relevant terms in there. However, we can also do all of the calculations right in Stata if we get a bit more advanced with our commands. The first thing we need to do is have Stata store our regression coefficients in a way that is easy for us to work with. We will do this by re-running the regressions and then saving the coefficients to a matrix:
. reg lninctot schooling exp exp2 midwest south west if black==1

      Source │       SS           df       MS      Number of obs   =     8,099
─────────────┼──────────────────────────────────   F(6, 8092)      =    250.18
       Model │  951.424234         6  158.570706   Prob > F        =    0.0000
    Residual │  5128.95486     8,092  .633830308   R-squared       =    0.1565
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1558
       Total │  6080.37909     8,098   .75084948   Root MSE        =    .79613

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
   schooling │   .0550572   .0030105    18.29   0.000     .0491558    .0609586
         exp │   .0315035   .0046549     6.77   0.000     .0223786    .0406284
        exp2 │  -.0005181   .0000816    -6.35   0.000    -.0006781   -.0003582
     midwest │    .082265   .0302801     2.72   0.007     .0229081    .1416218
       south │  -.4216104   .0258107   -16.33   0.000    -.4722059   -.3710148
        west │   .0718059   .0455241     1.58   0.115    -.0174331    .1610448
       _cons │    6.64194    .075739    87.70   0.000     6.493472    6.790408
─────────────┴────────────────────────────────────────────────────────────────

. matrix b_coefficients = e(b)

. reg lninctot schooling exp exp2 midwest south west if black==0

      Source │       SS           df       MS      Number of obs   =    80,749
─────────────┼──────────────────────────────────   F(6, 80742)     =   1897.79
       Model │   5710.0202         6  951.670033   Prob > F        =    0.0000
    Residual │  40489.0366    80,742  .501461898   R-squared       =    0.1236
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1235
       Total │  46199.0568    80,748  .572138713   Root MSE        =    .70814

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
   schooling │   .0765422   .0008552    89.50   0.000      .074866    .0782185
         exp │   .0438008   .0012051    36.35   0.000     .0414389    .0461627
        exp2 │  -.0006864   .0000237   -28.90   0.000    -.0007329   -.0006398
     midwest │  -.0116585   .0064823    -1.80   0.072    -.0243638    .0010467
       south │  -.1917959   .0067935   -28.23   0.000    -.2051111   -.1784807
        west │   .0017851   .0080763     0.22   0.825    -.0140445    .0176147
       _cons │   6.565453   .0186564   351.91   0.000     6.528886    6.602019
─────────────┴────────────────────────────────────────────────────────────────

. matrix w_coefficients = e(b)

. matrix w_minus_b = w_coefficients - b_coefficients
If you look at that first matrix command, it tells Stata to create a matrix named b_coefficients containing the regression coefficients. These regression coefficients are stored temporarily after running the regression as e(b). If you want to see all of the results stored by Stata after running a command, you can use the commands return list and ereturn list. Let’s see what our new matrices look like:
. matrix list b_coefficients

b_coefficients[1,7]
     schooling         exp        exp2     midwest       south        west       _cons
y1    .0550572   .03150349  -.00051812   .08226497  -.42161036   .07180586   6.6419397

. matrix list w_coefficients

w_coefficients[1,7]
     schooling         exp        exp2     midwest       south        west       _cons
y1   .07654223   .04380079  -.00068639  -.01165852  -.19179591   .00178511   6.5654529

. matrix list w_minus_b

w_minus_b[1,7]
     schooling         exp        exp2     midwest       south        west       _cons
y1   .02148503    .0122973  -.00016827  -.09392349   .22981445  -.07002076  -.07648672
These matrices contain all of the various \(\beta_{W}\)’s and \(\beta_{B}\)’s we need for our decomposition. Now we need to store the variable means. The summarize command will store the mean of the summarized variable for us. Unfortunately, it will only do this one variable at a time. We could get fancy here by looping over the regression variables, but we have already done enough fancy stuff for one tutorial (a sketch of what such a loop might look like appears after the next two blocks of output). We will just use summarize one variable at a time and store the values we need as local macros. There is one last detail to take care of. If we were to just summarize a variable, the mean would include observations that were not actually used in the regression (think of observations missing values for one or more variables). To restrict our summary statistics to just the regression samples, we can use e(sample), information returned by the regression command that identifies the observations used in the regression sample:
. reg lninctot schooling exp exp2 midwest south west if black==1

      Source │       SS           df       MS      Number of obs   =     8,099
─────────────┼──────────────────────────────────   F(6, 8092)      =    250.18
       Model │  951.424234         6  158.570706   Prob > F        =    0.0000
    Residual │  5128.95486     8,092  .633830308   R-squared       =    0.1565
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1558
       Total │  6080.37909     8,098   .75084948   Root MSE        =    .79613

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
   schooling │   .0550572   .0030105    18.29   0.000     .0491558    .0609586
         exp │   .0315035   .0046549     6.77   0.000     .0223786    .0406284
        exp2 │  -.0005181   .0000816    -6.35   0.000    -.0006781   -.0003582
     midwest │    .082265   .0302801     2.72   0.007     .0229081    .1416218
       south │  -.4216104   .0258107   -16.33   0.000    -.4722059   -.3710148
        west │   .0718059   .0455241     1.58   0.115    -.0174331    .1610448
       _cons │    6.64194    .075739    87.70   0.000     6.493472    6.790408
─────────────┴────────────────────────────────────────────────────────────────

. sum lninctot if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    lninctot │      8,099    7.194861    .8665157   3.912023    9.21034

. local b_lninctot = r(mean)

. sum schooling if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
   schooling │      8,099    6.769972     3.75231          0         17

. local b_schooling = r(mean)

. sum exp if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         exp │      8,099    26.71836    10.32438          3         50

. local b_exp = r(mean)

. sum exp2 if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        exp2 │      8,099    820.4504    576.6295          9       2500

. local b_exp2 = r(mean)

. sum midwest if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
     midwest │      8,099     .187554    .3903797          0          1

. local b_midwest = r(mean)

. sum south if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
       south │      8,099    .6058773    .4886916          0          1

. local b_south = r(mean)

. sum west if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        west │      8,099    .0497592    .2174605          0          1

. local b_west = r(mean)
Now we will do the same for the white regression and variable means:
. reg lninctot schooling exp exp2 midwest south west if black==0

      Source │       SS           df       MS      Number of obs   =    80,749
─────────────┼──────────────────────────────────   F(6, 80742)     =   1897.79
       Model │   5710.0202         6  951.670033   Prob > F        =    0.0000
    Residual │  40489.0366    80,742  .501461898   R-squared       =    0.1236
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1235
       Total │  46199.0568    80,748  .572138713   Root MSE        =    .70814

─────────────┬────────────────────────────────────────────────────────────────
    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
   schooling │   .0765422   .0008552    89.50   0.000      .074866    .0782185
         exp │   .0438008   .0012051    36.35   0.000     .0414389    .0461627
        exp2 │  -.0006864   .0000237   -28.90   0.000    -.0007329   -.0006398
     midwest │  -.0116585   .0064823    -1.80   0.072    -.0243638    .0010467
       south │  -.1917959   .0067935   -28.23   0.000    -.2051111   -.1784807
        west │   .0017851   .0080763     0.22   0.825    -.0140445    .0176147
       _cons │   6.565453   .0186564   351.91   0.000     6.528886    6.602019
─────────────┴────────────────────────────────────────────────────────────────

. sum lninctot if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    lninctot │     80,749    7.863199    .7563985   3.912023    9.21034

. local w_lninctot = r(mean)

. sum schooling if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
   schooling │     80,749    10.04538     3.47595          0         17

. local w_schooling = r(mean)

. sum exp if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         exp │     80,749    23.72776    10.10725          3         50

. local w_exp = r(mean)

. sum exp2 if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        exp2 │     80,749     665.162    510.1659          9       2500

. local w_exp2 = r(mean)

. sum midwest if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
     midwest │     80,749      .30975    .4623933          0          1

. local w_midwest = r(mean)

. sum south if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
       south │     80,749     .263446    .4405049          0          1

. local w_south = r(mean)

. sum west if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        west │     80,749    .1440885    .3511816          0          1

. local w_west = r(mean)
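For completeness, here is a sketch of the loop mentioned earlier that would grab all of these means in one go rather than running summarize seven times by hand. In do-file form, run immediately after one of the race-specific regressions (the macro names match the ones created above):

* store the mean of each variable, computed over the estimation sample only
foreach var of varlist lninctot schooling exp exp2 midwest south west {
    quietly summarize `var' if e(sample)
    local b_`var' = r(mean)
}
* (after the white regression, change the b_ prefix to w_)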
Finally we are ready to do our decomposition. We will calculate the overall difference in log incomes, the component due to the difference in intercepts, the component due to differences in the returns to characteristics, and the component due to the differences in mean levels of those characteristics:
. local overall_gap = `w_lninctot' - `b_lninctot'

. local intercept_gap = w_minus_b[1,7]

. local coefficient_gap = w_minus_b[1,1] * `b_schooling' + w_minus_b[1,2] * `b_exp' + w_minus_b[1,3] * `b_exp2' + w_minus_b[1,4] * `b_midwest' + w_minus_b[1,5] * `b_south' + w_minus_b[1,6] * `b_west'

. local characteristic_gap = w_coefficients[1,1] * (`w_schooling' - `b_schooling') + w_coefficients[1,2] * (`w_exp' - `b_exp') + w_coefficients[1,3] * (`w_exp2' - `b_exp2') + w_coefficients[1,4] * (`w_midwest' - `b_midwest') + w_coefficients[1,5] * (`w_south' - `b_south') + w_coefficients[1,6] * (`w_west' - `b_west')
A quick note about the notation used above. To use the value of a local macro in an expression, you wrap the macro name in single quotes, a left quote (`) before and a right quote (') after (e.g., the use of `w_lninctot' in the first line above). To use the value of a cell in a matrix, you use the matrix name followed by the cell's row and column in brackets (e.g., the use of w_minus_b[1,7] in the second line above). To see the values of any of the local macros we have been creating, we can use the display command. Let's take a look at the different gap components we have generated and then calculate the shares of each component's contribution to the overall gap:
. display `overall_gap'
.66833847

. display `intercept_gap'
-.07648672

. display `coefficient_gap'
.45409959

. display `characteristic_gap'
.2907256
. local share_coefficients = `coefficient_gap' / `overall_gap'

. local share_characteristics = `characteristic_gap' / `overall_gap'
. display `share_coefficients'
.67944554

. display `share_characteristics'
.43499755
We have finally arrived at our decomposition. The differences in average characteristics account for 0.29, or a share of 0.43, of the overall log income gap. Differences in the returns to those characteristics account for 0.45, or a share of 0.68. These two shares sum to more than one because the remaining piece, the difference in intercepts, is negative (about -0.08, or a share of roughly -0.11). Clearly both sources are important for explaining black-white income gaps in the middle of the 20th century.
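Finally, as a cross-check on the by-hand calculations (not something you need for this exercise), there is a community-contributed Stata command, oaxaca by Ben Jann, that implements Blinder-Oaxaca decompositions. Something along the following lines should produce comparable pieces, though note that its default output is a threefold decomposition (endowments, coefficients, and an interaction term), so the components will be grouped somewhat differently than in our twofold, white-coefficient split:

* install once from SSC, then decompose log income by race
ssc install oaxaca
oaxaca lninctot schooling exp exp2 midwest south west, by(black)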