Decomposing Income Gaps

John M. Parman

4 February 2020


Gaps in income between groups can arise from differences in characteristics across groups, differences in the returns to those characteristics, and outright discrimination that is independent of productive characteristics. Each of these sources of income gaps has dramatically different policy implications. Consequently, it is important to decompose income gaps into these different components. This exercise will walk you through these decompositions with historical earnings data from the federal census.

We will work with the IPUMS 1 percent sample of the 1950 federal census. The file ipums-1950-25-55-year-olds.dta contains data from an extract of the 1950 sample restricted to individuals between the ages of 25 and 55 (we want to focus on adults in the labor force). This file can be downloaded from our course website. Alternatively, you could create your own extract through IPUMS if you want to play around with additional variables or other age ranges.

Basic Income Gaps

Let’s focus on income gaps between black and white males. To simplify our lives, the first thing we’ll do is open the data and then drop all females and all races other than white or black. Note that in the IPUMS extracts, sex is coded as equal to one for males and two for females. Race is coded as one for white and two for black.

. clear

. use ipums-1950-25-55-year-olds.dta

. keep if sex==1 & (race==1 | race==2)
(383,544 observations deleted)

It is not necessarily the best practice to take this approach. I am dropping the other observations here to keep subsequent commands shorter for the sake of making the tutorial easier to read. Typically, I would prefer to keep all of the data and simply use if statements on all of my commands to focus on the groups of interest; you never know when those other observations may come in useful.
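As a rough sketch of that alternative, the full extract can be kept in memory and each command restricted to the group of interest. The if condition below simply restates the sample restriction used in the keep command above; nothing else is assumed.

```stata
* Sketch: keep every observation and restrict individual commands instead of dropping.
use ipums-1950-25-55-year-olds.dta, clear

* Count and summarize only white and black males (sex==1; race==1 or race==2).
count if sex==1 & inlist(race, 1, 2)
summarize age if sex==1 & inlist(race, 1, 2)
```

The rest of this tutorial assumes the keep command above was run, so later commands omit these conditions.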

Now let’s take a look at some basic summary statistics for the incomes of black and white males using the inctot variable, total individual pre-tax earnings from all sources. A quick way to do this is to use the tabulate command with the summarize option. To make meaningful comparisons, we will restrict our attention to only those who are employed (those for whom the empstat variable is equal to one). Before we look at the summary statistics, there is one small but important step we need to take. If you look at the documentation for inctot, the variable takes on a value of 9999999 to designate N/A. We certainly do not want to count these individuals as earning ten million dollars, so we need to set these values to missing. Once we do that, we can summarize incomes by race.

. replace inctot=. if inctot==9999999
(274,960 real changes made, 274,960 to missing)

. tab race if empstat==1, sum(inctot)

       Race │
   [general │  Summary of Total personal income
   version] │        Mean   Std. Dev.       Freq.
      White │   3253.3299   2001.1598      77,089
  Black/Afr │   1746.7236   1160.6887       7,376
      Total │   3121.7638   1988.3331      84,465

Here we can see a pretty large difference in average incomes between black and white males, with white males having roughly twice the average income of black males. We could do a simple t-test in Stata to confirm that this is a statistically significant gap but, in anticipation of starting to decompose the gap, let’s switch to using a regression framework to estimate the gap. We will begin with a simple regression:

\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \varepsilon_{i} \end{equation}\]

The variable \(Black_{i}\) is simply an indicator variable equal to one if an individual is black and zero if an individual is white. The coefficient \(\beta_{1}\) will then give us the average difference in log income between black and white individuals. Let’s give it a try, first creating the needed log income and black indicator variables (when generating the black indicator variable, recall that race is coded as one for white and two for black):

. gen lninctot = ln(inctot)
(279,673 missing values generated)

. gen black = race - 1

. reg lninctot black

      Source │       SS           df       MS      Number of obs   =    88,848
─────────────┼──────────────────────────────────   F(1, 88846)     =   5587.54
       Model │  3287.86373         1  3287.86373   Prob > F        =    0.0000
    Residual │  52279.4359    88,846   .58842757   R-squared       =    0.0592
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0592
       Total │  55567.2996    88,847  .625426853   Root MSE        =    .76709

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       black │  -.6683385    .008941   -74.75   0.000    -.6858627   -.6508142
       _cons │   7.863199   .0026995  2912.87   0.000     7.857908     7.86849

Here we get our basic, unconditional black-white gap in log earnings, -0.67. Now we are ready to start thinking about how to explain this gap.
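Two quick follow-ups, sketched under the assumption that the lninctot and black variables created above are in memory. Because the dependent variable is in logs, exponentiating the coefficient converts it into a proportional difference (strictly, a ratio of geometric means). With a single indicator as the only regressor, the regression is also equivalent to the equal-variance two-sample t-test mentioned earlier.

```stata
* Re-run the regression quietly so that _b[black] refers to this model.
quietly regress lninctot black

* Proportional gap implied by the log-point coefficient: exp(b) - 1.
display exp(_b[black]) - 1

* Equal-variance t-test on log income by race. The reported difference in means
* is white minus black, i.e. the regression coefficient on black with the sign flipped.
ttest lninctot, by(black)
```

With the coefficient of -0.668 reported above, exp(-0.668) - 1 is roughly -0.49, so in geometric-mean terms black males in this sample earn about half of what white males earn.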

Differences in Characteristics

We are going to first consider the extent to which we can explain the gap by differences in observable characteristics. Let’s see what happens when we introduce controls for years of educational attainment and potential years of experience. First things first, let’s recode the higrade variable to be equal to years of educational attainment. Note that higrade takes on a value of zero if it is not applicable, so we want these to be treated as missing, and otherwise it is basically equal to years of education plus three (refer to the codes on IPUMS to see what I mean by this). Using that information, we can construct a years of schooling variable:

. gen schooling = higrade - 3

. replace schooling = . if higrade==0
(274,960 real changes made, 274,960 to missing)

. replace schooling = 0 if schooling<0
(1,138 real changes made)
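The three recoding steps can also be collapsed into a single command. The sketch below uses a hypothetical variable name, schooling_alt, so that it can be checked against the schooling variable built above.

```stata
* One-step version: years of schooling = higrade - 3, floored at zero,
* and left missing where higrade is 0 (N/A) or missing.
gen schooling_alt = max(higrade - 3, 0) if higrade != 0 & !missing(higrade)

* Confirm the two constructions agree for every observation (including missings).
assert schooling_alt == schooling
```

The explicit missing() check matters because Stata's max() function ignores missing arguments and would otherwise return zero for a missing higrade.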

Now for years of potential experience, we’ll assume that schooling starts at age 5 and then is completed without gaps. This means we can estimate years of potential experience based on age and our new schooling variable:

. gen exp = age - schooling - 5
(274,960 missing values generated)

. gen exp2 = exp^2
(274,960 missing values generated)

Note that I created a variable for the square of potential experience as well. We will want to use a quadratic in experience in our regression to account for diminishing returns to experience. Our new regression equation is now:

\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \beta_{2} Schooling_{i} + \beta_{3} Exp_{i} + \beta_{4} Exp_{i}^{2} + \varepsilon_{i} \end{equation}\]

Let’s run this new specification and see what happens to the black-white log earnings gap once we control for differences in schooling and experience:

. reg lninctot black schooling exp exp2

      Source │       SS           df       MS      Number of obs   =    88,848
─────────────┼──────────────────────────────────   F(4, 88843)     =   4301.60
       Model │  9015.72537         4  2253.93134   Prob > F        =    0.0000
    Residual │  46551.5742    88,843  .523975713   R-squared       =    0.1622
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1622
       Total │  55567.2996    88,847  .625426853   Root MSE        =    .72386

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       black │  -.4311397   .0087577   -49.23   0.000    -.4483048   -.4139747
   schooling │   .0802859     .00082    97.91   0.000     .0786787    .0818931
         exp │   .0431854   .0011701    36.91   0.000      .040892    .0454789
        exp2 │  -.0006657   .0000228   -29.21   0.000    -.0007104   -.0006211
       _cons │   6.474822   .0177423   364.94   0.000     6.440048    6.509597
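One way to read the quadratic in experience, sketched with Stata's nlcom command run immediately after the regression above: the fitted log income profile peaks where its derivative with respect to experience is zero, which is at minus the coefficient on exp divided by twice the coefficient on exp2.

```stata
* Experience level at which the fitted log-income profile peaks,
* reported with a delta-method standard error.
nlcom -_b[exp] / (2 * _b[exp2])
```

With the coefficients reported above, this works out to roughly 32 years of potential experience.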

Our coefficient on \(Black_{i}\) has now shrunk to -0.43; controlling for differences in schooling and experience has accounted for about a third of the black-white log income gap. What other characteristics might matter? Consider location. A quick check confirms that the black population share is substantially higher in the southern regions of the United States. These southern regions also have lower incomes on average, so location could perhaps explain some of the black-white gap:

. tab region, sum(black)

     Census │
 region and │          Summary of black
   division │        Mean   Std. Dev.       Freq.
  New Engla │   .01583286   .12483133      22,927
  Middle At │   .06750081   .25088894      77,525
  East Nort │   .06883883   .25318157      74,580
  West Nort │   .03016848   .17105331      32,882
  South Atl │   .24544119   .43035253      53,084
  East Sout │    .2235597    .4166378      27,581
  West Sout │   .16176428   .36823972      34,847
  Mountain  │   .01428449   .11866613      11,691
  Pacific D │   .04182134    .2001837      33,404
      Total │   .10343508   .30452671     368,521

. tab region, sum(inctot)

     Census │
 region and │  Summary of Total personal income
   division │        Mean   Std. Dev.       Freq.
  New Engla │    2934.633   1906.1064       5,649
  Middle At │   3137.5167   2033.8479      19,901
  East Nort │   3251.4877   1935.5788      19,273
  West Nort │   2880.4717   2010.2336       8,427
  South Atl │   2458.6902   1941.6888      12,655
  East Sout │   2077.2632   1808.1586       6,299
  West Sout │    2635.378     2077.82       8,793
  Mountain  │   3087.2279   2028.3844       3,058
  Pacific D │   3388.7455   2076.9102       9,506
      Total │   2939.0832   2020.6728      93,561

To include regional variation in our regressions, let’s construct a few new indicator variables:

. gen midwest = 0

. replace midwest = 1 if region==21 | region==22
(107,462 real changes made)

. gen south = 0

. replace south = 1 if region==31 | region==32 | region==33
(115,512 real changes made)

. gen west = 0

. replace west = 1 if region==41 | region==42
(45,095 real changes made)

With region controls, our regression equation now becomes:

\[\begin{equation} ln(y_{i}) = \beta_0 + \beta_{1} Black_{i} + \beta_{2} Schooling_{i} + \beta_{3} Exp_{i} + \beta_{4} Exp_{i}^{2} + \beta_{5} Midwest_{i} + \beta_{6} South_{i} + \beta_{7} West_{i} + \varepsilon_{i} \end{equation}\]

Note that there is no indicator variable for the Northeast. We need to omit one category, otherwise we will have collinear variables and won’t be able to get unique coefficients (if we did include a Northeast indicator variable and ran the regression, Stata would drop one of the variables with a warning message noting the collinearity). Each region coefficient tells us how average income in that region compares to the average income in the Northeast controlling for race, schooling and experience. Let’s see how this impacts our estimated black-white income gap:

. reg lninctot black schooling exp exp2 midwest south west

      Source │       SS           df       MS      Number of obs   =    88,848
─────────────┼──────────────────────────────────   F(7, 88840)     =   2720.92
       Model │  9809.92955         7  1401.41851   Prob > F        =    0.0000
    Residual │  45757.3701    88,840  .515053693   R-squared       =    0.1765
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1765
       Total │  55567.2996    88,847  .625426853   Root MSE        =    .71767

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       black │  -.3721592   .0088123   -42.23   0.000    -.3894312   -.3548872
   schooling │   .0750892   .0008242    91.10   0.000     .0734738    .0767046
         exp │   .0430693   .0011601    37.12   0.000     .0407954    .0453431
        exp2 │  -.0006769   .0000226   -29.96   0.000    -.0007212   -.0006326
     midwest │  -.0057455   .0063868    -0.90   0.368    -.0182635    .0067725
       south │  -.2170893   .0065642   -33.07   0.000    -.2299551   -.2042235
        west │   .0030648   .0080198     0.38   0.702     -.012654    .0187835
       _cons │   6.595718   .0182088   362.23   0.000     6.560029    6.631407

Sure enough, a meaningful part of the black-white gap was being driven by black individuals being disproportionately located in the low-income South: adding the region controls shrinks the coefficient on \(Black_{i}\) from -0.43 to -0.37.
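To see the three specifications side by side, one option is to store each set of estimates and tabulate them together. This is only a sketch; the names m1, m2 and m3 are arbitrary labels introduced here, not anything defined earlier.

```stata
* Store the three specifications estimated so far.
quietly regress lninctot black
estimates store m1

quietly regress lninctot black schooling exp exp2
estimates store m2

quietly regress lninctot black schooling exp exp2 midwest south west
estimates store m3

* Show only the coefficient on black, with standard errors, sample size and R-squared.
estimates table m1 m2 m3, keep(black) b(%9.4f) se(%9.4f) stats(N r2)
```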

Differences in Returns to Characteristics

Up to this point, we have been assuming that returns to characteristics are the same for black and white workers. However, part of the black-white income gap will be driven by differences in the returns to characteristics. In other words, an additional year of schooling might lead to greater labor market returns for a white worker than a black worker. To see if this is the case, we can run two separate regressions, one for black males and one for white males and see if the coefficients differ. We will use the same specification as above except we can drop the \(Black_{i}\) variable as there will be no variation in race within each regression sample:

. reg lninctot schooling exp exp2 midwest south west if black==1

      Source │       SS           df       MS      Number of obs   =     8,099
─────────────┼──────────────────────────────────   F(6, 8092)      =    250.18
       Model │  951.424234         6  158.570706   Prob > F        =    0.0000
    Residual │  5128.95486     8,092  .633830308   R-squared       =    0.1565
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1558
       Total │  6080.37909     8,098   .75084948   Root MSE        =    .79613

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   schooling │   .0550572   .0030105    18.29   0.000     .0491558    .0609586
         exp │   .0315035   .0046549     6.77   0.000     .0223786    .0406284
        exp2 │  -.0005181   .0000816    -6.35   0.000    -.0006781   -.0003582
     midwest │    .082265   .0302801     2.72   0.007     .0229081    .1416218
       south │  -.4216104   .0258107   -16.33   0.000    -.4722059   -.3710148
        west │   .0718059   .0455241     1.58   0.115    -.0174331    .1610448
       _cons │    6.64194    .075739    87.70   0.000     6.493472    6.790408

. reg lninctot schooling exp exp2 midwest south west if black==0

      Source │       SS           df       MS      Number of obs   =    80,749
─────────────┼──────────────────────────────────   F(6, 80742)     =   1897.79
       Model │   5710.0202         6  951.670033   Prob > F        =    0.0000
    Residual │  40489.0366    80,742  .501461898   R-squared       =    0.1236
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1235
       Total │  46199.0568    80,748  .572138713   Root MSE        =    .70814

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   schooling │   .0765422   .0008552    89.50   0.000      .074866    .0782185
         exp │   .0438008   .0012051    36.35   0.000     .0414389    .0461627
        exp2 │  -.0006864   .0000237   -28.90   0.000    -.0007329   -.0006398
     midwest │  -.0116585   .0064823    -1.80   0.072    -.0243638    .0010467
       south │  -.1917959   .0067935   -28.23   0.000    -.2051111   -.1784807
        west │   .0017851   .0080763     0.22   0.825    -.0140445    .0176147
       _cons │   6.565453   .0186564   351.91   0.000     6.528886    6.602019

Sure enough, the coefficients turn out to be quite different, with the returns to schooling and experience being lower for black workers (roughly 0.055 versus 0.077 log points per year of schooling, for example) and the region coefficients differing as well. These differences in the returns to characteristics are going to contribute to the overall black-white income gap and help explain why such a large gap still existed even after controlling for differences in characteristics.
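A formal way to check whether these differences are more than noise is to pool the two groups and interact the black indicator with each regressor. The sketch below is one possible approach rather than part of the original analysis; the interaction variable names (black_schooling and so on) are hypothetical, and the test as written assumes a common error variance across the two groups.

```stata
* Interact the black indicator with each regressor (hypothetical variable names).
foreach v in schooling exp exp2 midwest south west {
    gen black_`v' = black * `v'
}

* Pooled regression; the interaction coefficients equal the black-minus-white
* differences in the race-specific coefficients reported above.
regress lninctot black schooling exp exp2 midwest south west ///
    black_schooling black_exp black_exp2 black_midwest black_south black_west

* Joint F-test that the returns to schooling and experience are the same for both groups.
test black_schooling black_exp black_exp2
```

Adding vce(robust) to the regress command would relax the common-variance assumption.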

The Blinder-Oaxaca Decomposition

We would now like to say something about how much of the black-white income gap is driven by differences in characteristics and how much is driven by differences in the returns to those characteristics. We can do this using a Blinder-Oaxaca decomposition. To see how the decomposition works, let’s start by writing down equations for the average log income for black individuals and for white individuals by plugging in the average values for each characteristic into our race-specific regression equations. For simplicity, I will revert to the equations that did not include region controls (even though we now know they are important):

\[\begin{equation} \overline{ln(y)}_{W} = \beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} + \overline{\varepsilon}_{W} \end{equation}\] \[\begin{equation} \overline{ln(y)}_{B} = \beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} + \overline{\varepsilon}_{B} \end{equation}\]

Now we will take the difference between these, noting that the mean of the error term in each case is zero and can be dropped:

\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} \right) - \left(\beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} \right) \end{equation}\]

With a little bit of algebra, we can transform this equation into a rather intuitive decomposition of the income gap. First we will add and subtract a series of identical terms; you’ll see why in just a second:

\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} + \beta_{1,W} \overline{Schooling}_{W} + \beta_{2,W} \overline{Exp}_{W} + \beta_{3,W} \overline{Exp^{2}}_{W} \right) - \left(\beta_{0,B} + \beta_{1,B} \overline{Schooling}_{B} + \beta_{2,B} \overline{Exp}_{B} + \beta_{3,B} \overline{Exp^{2}}_{B} \right) + \end{equation}\] \[\begin{equation} \left(\beta_{1,W} \overline{Schooling}_{B} - \beta_{1,W} \overline{Schooling}_{B}\right) + \left(\beta_{2,W} \overline{Exp}_{B} - \beta_{2,W} \overline{Exp}_{B}\right) + \left(\beta_{3,W} \overline{Exp^{2}}_{B} - \beta_{3,W} \overline{Exp^{2}}_{B}\right) \end{equation}\]

Now we can rearrange and group terms together:

\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \left(\beta_{0,W} - \beta_{0,B}\right) + \end{equation}\] \[\begin{equation} \left(\beta_{1,W}-\beta_{1,B}\right) \overline{Schooling}_{B} + \left(\beta_{2,W}-\beta_{2,B}\right) \overline{Exp}_{B} + \left(\beta_{3,W}-\beta_{3,B}\right) \overline{Exp^{2}}_{B} + \end{equation}\] \[\begin{equation}\beta_{1,W} \left(\overline{Schooling}_{W}-\overline{Schooling}_{B}\right) + \beta_{2,W} \left(\overline{Exp}_{W}-\overline{Exp}_{B}\right) + \beta_{3,W}\left(\overline{Exp^{2}}_{W}-\overline{Exp^{2}}_{B}\right) \end{equation}\]

What we are left with is our decomposition. The first term is the difference in intercepts, essentially the black-white gap that would exist between two workers with no schooling or experience. The next three terms capture the difference in black and white log incomes due to differences in the returns to characteristics. The final three terms capture the difference in black and white log incomes due to differences in average levels of the characteristics.
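For reference, the same decomposition can be written compactly for any set of \(K\) characteristics. The notation \(\overline{X}_{k,g}\) for the mean of characteristic \(k\) in group \(g\) is introduced here for brevity and is not used elsewhere in this tutorial:

\[\begin{equation} \overline{ln(y)}_{W} - \overline{ln(y)}_{B} = \underbrace{\left(\beta_{0,W} - \beta_{0,B}\right)}_{\text{intercepts}} + \underbrace{\sum_{k=1}^{K} \left(\beta_{k,W} - \beta_{k,B}\right) \overline{X}_{k,B}}_{\text{returns}} + \underbrace{\sum_{k=1}^{K} \beta_{k,W} \left(\overline{X}_{k,W} - \overline{X}_{k,B}\right)}_{\text{characteristics}} \end{equation}\]

Setting \(K = 3\) with schooling, experience and experience squared recovers the expression derived above.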

Running the Decomposition in Stata

A simple way to do the Blinder-Oaxaca decomposition would be to copy the relevant regression coefficients and variable means into Excel and calculate all of the relevant terms in there. However, we can also do all of the calculations right in Stata if we get a bit more advanced with our commands. The first thing we need to do is have Stata store our regression coefficients in a way that is easy for us to work with. We will do this by re-running the regressions and then saving the coefficients to a matrix:

. reg lninctot schooling exp exp2 midwest south west if black==1

      Source │       SS           df       MS      Number of obs   =     8,099
─────────────┼──────────────────────────────────   F(6, 8092)      =    250.18
       Model │  951.424234         6  158.570706   Prob > F        =    0.0000
    Residual │  5128.95486     8,092  .633830308   R-squared       =    0.1565
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1558
       Total │  6080.37909     8,098   .75084948   Root MSE        =    .79613

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   schooling │   .0550572   .0030105    18.29   0.000     .0491558    .0609586
         exp │   .0315035   .0046549     6.77   0.000     .0223786    .0406284
        exp2 │  -.0005181   .0000816    -6.35   0.000    -.0006781   -.0003582
     midwest │    .082265   .0302801     2.72   0.007     .0229081    .1416218
       south │  -.4216104   .0258107   -16.33   0.000    -.4722059   -.3710148
        west │   .0718059   .0455241     1.58   0.115    -.0174331    .1610448
       _cons │    6.64194    .075739    87.70   0.000     6.493472    6.790408

. matrix b_coefficients = e(b)

. reg lninctot schooling exp exp2 midwest south west if black==0

      Source │       SS           df       MS      Number of obs   =    80,749
─────────────┼──────────────────────────────────   F(6, 80742)     =   1897.79
       Model │   5710.0202         6  951.670033   Prob > F        =    0.0000
    Residual │  40489.0366    80,742  .501461898   R-squared       =    0.1236
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1235
       Total │  46199.0568    80,748  .572138713   Root MSE        =    .70814

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   schooling │   .0765422   .0008552    89.50   0.000      .074866    .0782185
         exp │   .0438008   .0012051    36.35   0.000     .0414389    .0461627
        exp2 │  -.0006864   .0000237   -28.90   0.000    -.0007329   -.0006398
     midwest │  -.0116585   .0064823    -1.80   0.072    -.0243638    .0010467
       south │  -.1917959   .0067935   -28.23   0.000    -.2051111   -.1784807
        west │   .0017851   .0080763     0.22   0.825    -.0140445    .0176147
       _cons │   6.565453   .0186564   351.91   0.000     6.528886    6.602019

. matrix w_coefficients = e(b)

. matrix w_minus_b = w_coefficients - b_coefficients

If you look at that first matrix command, it tells Stata to create a matrix named b_coefficients containing the regression coefficients. These regression coefficients are stored temporarily after running the regression as e(b). If you want to see all of the results stored by Stata after running a command, you can use the commands return list and ereturn list. Let’s see what our new matrices look like:

. matrix list b_coefficients

     schooling         exp        exp2     midwest       south        west       _cons
y1    .0550572   .03150349  -.00051812   .08226497  -.42161036   .07180586   6.6419397

. matrix list w_coefficients

     schooling         exp        exp2     midwest       south        west       _cons
y1   .07654223   .04380079  -.00068639  -.01165852  -.19179591   .00178511   6.5654529

. matrix list w_minus_b

     schooling         exp        exp2     midwest       south        west       _cons
y1   .02148503    .0122973  -.00016827  -.09392349   .22981445  -.07002076  -.07648672
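Matrix cells can also be looked up by column name rather than by position, which guards against miscounting columns. A minimal sketch using Stata's colnumb() function on the matrices created above:

```stata
* Column positions of the schooling coefficient and the constant in w_minus_b.
display colnumb(w_minus_b, "schooling")
display colnumb(w_minus_b, "_cons")

* Difference in intercepts, looked up by name instead of by column number.
display w_minus_b[1, colnumb(w_minus_b, "_cons")]
```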

These matrices contain all of the various \(\beta_{W}\)’s and \(\beta_{B}\)’s we need for our decomposition. Now we need to store the variable means. The summarize command will store the mean of the summarized variable for us. Unfortunately, it will only do this one variable at a time. We could try to get fancy here by looping over the regression variables but we have already done enough fancy stuff for one tutorial. We will just use summarize one variable at a time and store the values we need as local macros. There is one last fancy thing we need to do. If we were to just summarize a variable, the mean would include observations that were not actually used in the regression (think observations missing values for one or more variables). To restrict our summary statistics to just the regression samples, we can use e(sample), information returned by the regression command that identifies the observations used in the regression sample:

. reg lninctot schooling exp exp2 midwest south west if black==1

      Source │       SS           df       MS      Number of obs   =     8,099
─────────────┼──────────────────────────────────   F(6, 8092)      =    250.18
       Model │  951.424234         6  158.570706   Prob > F        =    0.0000
    Residual │  5128.95486     8,092  .633830308   R-squared       =    0.1565
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1558
       Total │  6080.37909     8,098   .75084948   Root MSE        =    .79613

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   schooling │   .0550572   .0030105    18.29   0.000     .0491558    .0609586
         exp │   .0315035   .0046549     6.77   0.000     .0223786    .0406284
        exp2 │  -.0005181   .0000816    -6.35   0.000    -.0006781   -.0003582
     midwest │    .082265   .0302801     2.72   0.007     .0229081    .1416218
       south │  -.4216104   .0258107   -16.33   0.000    -.4722059   -.3710148
        west │   .0718059   .0455241     1.58   0.115    -.0174331    .1610448
       _cons │    6.64194    .075739    87.70   0.000     6.493472    6.790408

. sum lninctot if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
    lninctot │      8,099    7.194861    .8665157   3.912023    9.21034

. local b_lninctot = r(mean)

. sum schooling if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
   schooling │      8,099    6.769972     3.75231          0         17

. local b_schooling = r(mean)

. sum exp if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
         exp │      8,099    26.71836    10.32438          3         50

. local b_exp = r(mean)

. sum exp2 if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
        exp2 │      8,099    820.4504    576.6295          9       2500

. local b_exp2 = r(mean)

. sum midwest if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
     midwest │      8,099     .187554    .3903797          0          1

. local b_midwest = r(mean)

. sum south if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
       south │      8,099    .6058773    .4886916          0          1

. local b_south = r(mean)

. sum west if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
        west │      8,099    .0497592    .2174605          0          1

. local b_west = r(mean)

Now we will do the same for the white regression and variable means:

. reg lninctot schooling exp exp2 midwest south west if black==0

      Source │       SS           df       MS      Number of obs   =    80,749
─────────────┼──────────────────────────────────   F(6, 80742)     =   1897.79
       Model │   5710.0202         6  951.670033   Prob > F        =    0.0000
    Residual │  40489.0366    80,742  .501461898   R-squared       =    0.1236
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1235
       Total │  46199.0568    80,748  .572138713   Root MSE        =    .70814

    lninctot │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   schooling │   .0765422   .0008552    89.50   0.000      .074866    .0782185
         exp │   .0438008   .0012051    36.35   0.000     .0414389    .0461627
        exp2 │  -.0006864   .0000237   -28.90   0.000    -.0007329   -.0006398
     midwest │  -.0116585   .0064823    -1.80   0.072    -.0243638    .0010467
       south │  -.1917959   .0067935   -28.23   0.000    -.2051111   -.1784807
        west │   .0017851   .0080763     0.22   0.825    -.0140445    .0176147
       _cons │   6.565453   .0186564   351.91   0.000     6.528886    6.602019

. sum lninctot if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
    lninctot │     80,749    7.863199    .7563985   3.912023    9.21034

. local w_lninctot = r(mean)

. sum schooling if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
   schooling │     80,749    10.04538     3.47595          0         17

. local w_schooling = r(mean)

. sum exp if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
         exp │     80,749    23.72776    10.10725          3         50

. local w_exp = r(mean)

. sum exp2 if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
        exp2 │     80,749     665.162    510.1659          9       2500

. local w_exp2 = r(mean)

. sum midwest if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
     midwest │     80,749      .30975    .4623933          0          1

. local w_midwest = r(mean)

. sum south if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
       south │     80,749     .263446    .4405049          0          1

. local w_south = r(mean)

. sum west if e(sample)

    Variable │        Obs        Mean    Std. Dev.       Min        Max
        west │     80,749    .1440885    .3511816          0          1

. local w_west = r(mean)
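The one-variable-at-a-time approach above can be condensed with a pair of loops. This is a sketch under the assumption that it is run in one go (local macros disappear when a do-file finishes); it creates the same b_ and w_ local macros as the commands above.

```stata
* Variables whose regression-sample means are needed for the decomposition.
local vars lninctot schooling exp exp2 midwest south west

foreach g in 0 1 {
    * Prefix for the macro names: "b" for black (black==1), "w" for white (black==0).
    local prefix = cond(`g' == 1, "b", "w")

    * Re-run the group-specific regression so that e(sample) marks its observations.
    quietly regress lninctot schooling exp exp2 midwest south west if black == `g'

    foreach v of local vars {
        quietly summarize `v' if e(sample)
        local `prefix'_`v' = r(mean)
    }
}

* Example: difference in mean years of schooling, white minus black.
display `w_schooling' - `b_schooling'
```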

Finally we are ready to do our decomposition. We will calculate the overall difference in log incomes, the component due to the difference in intercepts, the component due to differences in the returns to characteristics, and the component due to the differences in mean levels of those characteristics:

. local overall_gap = `w_lninctot' - `b_lninctot'

. local intercept_gap = w_minus_b[1,7]

. local coefficient_gap = w_minus_b[1,1] * `b_schooling' + w_minus_b[1,2] * `b_exp' + w_minus_b[1,3] * `b_exp2' +
> w_minus_b[1,4] * `b_midwest' + w_minus_b[1,5] * `b_south' + w_minus_b[1,6] * `b_west'

. local characteristic_gap = w_coefficients[1,1] * (`w_schooling' - `b_schooling') + w_coefficients[1,2] * (`w_exp' - `b_exp') +
> w_coefficients[1,3] * (`w_exp2' - `b_exp2') + w_coefficients[1,4] * (`w_midwest' - `b_midwest') +
> w_coefficients[1,5] * (`w_south' - `b_south') + w_coefficients[1,6] * (`w_west' - `b_west')

A quick note about the notation used above. To use the value of a local macro in an expression, you enclose the name of the macro in a left single quote (the backtick, `) and a right single quote (the apostrophe, '), as in `w_lninctot' in the first line above. To use the value of a cell in a matrix, you use the matrix name followed by the cell row and column in brackets (e.g., w_minus_b[1,7] in the second line above). To see the values of any of these local macros we have been creating, we can use the display command. Let’s take a look at the different gap components we have generated and then calculate the shares of each component’s contribution to the overall gap:

. display `overall_gap'

. display `intercept_gap'

. display `coefficient_gap'

. display `characteristic_gap'

. local share_coefficients = `coefficient_gap' / `overall_gap'

. local share_characteristics = `characteristic_gap' / `overall_gap'

. display `share_coefficients'

. display `share_characteristics'

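We have finally arrived at our decomposition. The differences in average characteristics account for 0.29 log points, or a share of 0.43 of the overall log income gap of 0.67. Differences in returns to those characteristics account for 0.45 log points, or a share of 0.68 of the overall log income gap. These two shares sum to more than one because the remaining component, the difference in intercepts stored in w_minus_b[1,7], is negative (about -0.08, or a share of about -0.11). Clearly both sources are important for explaining black-white income gaps in the middle of the 20th century.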
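As a sanity check, the three components should add up to the overall gap. The sketch below assumes the local macros created above are still defined in the current session; share_intercept is a new name introduced here. The last command uses oaxaca, a community-contributed command that is not part of official Stata and must be installed first; the description of its options in the comments reflects my understanding of that command rather than anything in the original tutorial.

```stata
* Share attributable to the difference in intercepts.
local share_intercept = `intercept_gap' / `overall_gap'
display `share_intercept'

* The three components should add up to the overall gap (difference near zero)...
display `intercept_gap' + `coefficient_gap' + `characteristic_gap' - `overall_gap'

* ...and the three shares should add up to one.
display `share_intercept' + `share_coefficients' + `share_characteristics'

* Optional cross-check (install once with: ssc install oaxaca).
* by(black) treats white (black==0) as group 1, and weight(1) evaluates
* differences in characteristics at the group 1 coefficients.
oaxaca lninctot schooling exp exp2 midwest south west, by(black) weight(1)
```

In a two-part decomposition of this kind, the unexplained component combines the intercept and coefficient pieces that are reported separately above.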


  1. Follow the same procedures to decompose the male-female log income gap into the share due to differences in characteristics and the share due to differences in returns to those characteristics. Discuss any interesting similarities or differences to the black-white decomposition.
  2. The income variable in 1950 is top-coded at $10,000. Explain whether you think this will lead to overestimates or underestimates of the true black-white income gap. Using the 1950 data, make a case for whether you think this top-coding presents a serious issue for our analysis.
  3. It may not be years of education that matter in the labor market but simply whether you have a high school degree and whether you have a college degree. Redo the decomposition of the black-white income gap using regression specifications that focus on high school graduation and college graduation rather than years of schooling.
  4. When we derived the formula for the decomposition, we added and subtracted terms in a way that left us with an equation that evaluated the impact of differences in characteristics on income using the white coefficients. Redo the derivation and analysis to instead use the black coefficients to evaluate the impact of differences in characteristics. Does this change the conclusions in a meaningful way?