31 January 2018

# Creating a Sample of Fathers and Sons

First, let’s construct a sample of fathers and sons that matches certain characteristics of the joint distribution of father and son earnings in the United States. We will start by generating a sample of 5,000 fathers whose incomes are log-normally distributed, with the mean and standard deviation of log income equal to those given for the sample in Solon (1992).

. clear

. set obs 5000
number of observations (_N) was 0, now 5,000

. gen id = _n

The set obs command generates an empty 5,000 observation dataset. The gen id command simply creates a unique id number for each observation equal to its observation number.
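The log above does not set a random-number seed, so rerunning the commands will produce slightly different numbers from those shown in this post. A minimal sketch of a reproducible setup follows; the seed value 12345 is an arbitrary choice of ours, not the one behind the output shown here:

```stata
clear
* Arbitrary seed (our assumption): any fixed integer makes reruns repeatable
set seed 12345
set obs 5000
gen id = _n
```

With a fixed seed, every later call to a random-number function such as rnormal or runiform returns the same draws each time the commands are run from the top.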

Now we are going to generate fathers’ incomes, assuming that earnings are log-normally distributed. We can do this by drawing log income as a normal random variable with Stata’s rnormal function, using the mean and standard deviation from Table 1 in Solon (1992).

. gen log_father_inc = rnormal(10.1,0.69)

As for sons’ earnings, we would like those to be a function of fathers’ earnings. We will assume that sons’ log earnings are linearly related to fathers’ log earnings with a mean zero, normally distributed error term:

$$\ln(y_{s}) = \beta_0 + \beta_{1} \ln(y_{f}) + \varepsilon$$

The value of $$\beta_{1}$$ can be taken directly from the estimated coefficient in Table 2 of Solon (1992). The value of $$\beta_{0}$$ is then simply equal to the mean log income for sons in Table 1 minus $$\beta_{1}$$ times the mean log income for fathers in Table 1. Finally, we can choose the standard deviation for $$\varepsilon$$ that, once used in the above equation, generates son incomes that match the standard deviation of log son earnings given in Table 1. This leads to a value of 0.413 for $$\beta_{1}$$, 5.58 for $$\beta_{0}$$ and 0.94 for $$\sigma_{\varepsilon}$$. With these values, we can now generate sons’ log income values:

. gen son_epsilon = rnormal(0,0.94)

. gen log_son_inc = 5.58+.413 * log_father_inc + son_epsilon

. gen father_inc = exp(log_father_inc)

. gen son_inc = exp(log_son_inc)
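As a check on these parameter choices, the implied moments of sons’ log income follow directly from the linear model, given that $$\varepsilon$$ is drawn independently of $$\ln(y_{f})$$. The symbols $$\mu$$ and $$\sigma$$ for means and standard deviations are our notation, and the figures are those implied by the parameters stated above rather than values quoted from Solon’s tables:

$$\mu_{s} = \beta_0 + \beta_{1} \mu_{f} = 5.58 + 0.413 \times 10.1 \approx 9.75$$

$$\sigma_{s} = \sqrt{\beta_{1}^2 \sigma_{f}^2 + \sigma_{\varepsilon}^2} = \sqrt{0.413^2 \times 0.69^2 + 0.94^2} \approx 0.98$$

The simulated sample should therefore have a mean and standard deviation of log son income close to 9.75 and 0.98.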

Let’s take a quick look at our generated incomes, summarizing the data, looking at the correlation between father and son incomes, and then looking at the income distributions graphically.

. sum log_son_inc log_father_inc

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
 log_son_inc │      5,000    9.746891    .9820698   6.654865   13.42529
log_father~c │      5,000    10.10787    .6848147   7.425219   12.61664

. corr log_son_inc log_father_inc
(obs=5,000)

             │ log_so~c log_fa~c
─────────────┼──────────────────
 log_son_inc │   1.0000
log_father~c │   0.3145   1.0000

. histogram father_inc, frequency ytitle(Frequency) xtitle(Father's income)
(bin=36, start=1677.7662, width=8329.3728)

. graph export father_inc.png, width(500) replace
(file father_inc.png written in PNG format)

. histogram log_father_inc, frequency ytitle(Frequency) xtitle(Father's log income)
(bin=36, start=7.4252186, width=.14420621)

. graph export log_father_inc.png, width(500) replace
(file log_father_inc.png written in PNG format)

. histogram son_inc, frequency ytitle(Frequency) xtitle(Son's income)
(bin=36, start=776.55328, width=18781.49)

. graph export son_inc.png, width(500) replace
(file son_inc.png written in PNG format)

. histogram log_son_inc, frequency ytitle(Frequency) xtitle(Son's log income)
(bin=36, start=6.6548653, width=.18806746)

. graph export log_son_inc.png, width(500) replace
(file log_son_inc.png written in PNG format)

To take a graphical look at the relationship between father and son log incomes, we could use a standard scatterplot. However, with 5,000 observations, a scatterplot will be somewhat uninformative (go ahead and try it yourself using Stata’s scatter command if you would like to see why). Instead, we can use the user-written binscatter package to create a binned scatterplot, which we then export:

. graph export father_son_scatter.png, width(500) replace
(file father_son_scatter.png written in PNG format)

Notice the nice, linear relationship between son’s log income and father’s log income. This should not come as a surprise given that this is how we constructed son’s log income in the first place. If you would like to use the binscatter program on your own computer, you can install it with the following command: ssc install binscatter.
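The binscatter call itself does not appear in the log above. A plausible version is sketched here, assuming default options; the exact command and options used for the figure are not shown in this post:

```stata
* Install the user-written package from SSC (replace makes this safe to rerun)
ssc install binscatter, replace

* Plot binned means of son's log income against father's log income
binscatter log_son_inc log_father_inc
graph export father_son_scatter.png, width(500) replace
```

By default binscatter groups the x-variable into 20 equal-sized bins and plots the mean of each variable within each bin, along with a linear fit line.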

# The Effect of Rounding on Estimated Intergenerational Income Elasticities

First, let’s confirm that our simulated data match the real data used in Solon (1992). To check, let’s run a regression to recover the intergenerational income elasticity for the sample:

. reg log_son_inc log_father_inc

Source │       SS           df       MS      Number of obs   =     5,000
─────────────┼──────────────────────────────────   F(1, 4998)      =    548.73
Model │  476.970622         1  476.970622   Prob > F        =    0.0000
Residual │  4344.37078     4,998  .869221845   R-squared       =    0.0989
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0987
Total │   4821.3414     4,999  .964461173   Root MSE        =    .93232

───────────────┬────────────────────────────────────────────────────────────────
log_son_inc │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
───────────────┼────────────────────────────────────────────────────────────────
log_father_inc │   .4510567   .0192553    23.43   0.000     .4133079    .4888056
_cons │   5.187666   .1950764    26.59   0.000     4.805231    5.570102
───────────────┴────────────────────────────────────────────────────────────────

From the regression results, we see that we get a coefficient on father’s log income of 0.45. Thus we have an intergenerational income elasticity (roughly) equal to that of Solon (1992). We can think of this as the true intergenerational income elasticity for our sample. Now we will consider what happens with two common problems in the way incomes are recorded in survey data: rounding and censoring.

First, we will explore the effects of rounding. Suppose that the survey provides options for income that are in $5,000 intervals (alternatively, assume that people tend to round their incomes to the nearest $5,000). We can generate rounded versions of the father and son incomes using Stata’s round function and then take the natural log to get rounded log income values:

. gen rounded_father_inc = round(father_inc,5000)

. gen rounded_son_inc = round(son_inc,5000)

. gen log_rounded_father_inc = ln(rounded_father_inc)
(3 missing values generated)

. gen log_rounded_son_inc = ln(rounded_son_inc)
(123 missing values generated)

Now we can use these new variables to re-estimate our intergenerational income elasticity:

. reg log_rounded_son_inc log_rounded_father_inc

Source │       SS           df       MS      Number of obs   =     4,875
─────────────┼──────────────────────────────────   F(1, 4873)      =    469.51
Model │  355.815138         1  355.815138   Prob > F        =    0.0000
Residual │  3693.00687     4,873  .757850785   R-squared       =    0.0879
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0877
Total │  4048.82201     4,874  .830697992   Root MSE        =    .87055

───────────────────────┬────────────────────────────────────────────────────────────────
log_rounded_son_inc │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
───────────────────────┼────────────────────────────────────────────────────────────────
log_rounded_father_inc │   .3930834   .0181411    21.67   0.000     .3575186    .4286482
_cons │   5.834878   .1839725    31.72   0.000     5.474209    6.195547
───────────────────────┴────────────────────────────────────────────────────────────────

Notice that the measurement error we introduced by rounding incomes has led to an attenuation bias for the intergenerational income elasticity, reducing the estimated coefficient on father’s log income from 0.45 to 0.39. Using rounded incomes without acknowledging the impact of this rounding on the estimation would lead us to conclude that there is substantially more income mobility than is actually in the underlying data.
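For intuition, the textbook result for classical measurement error in a regressor can be written as follows. It assumes the observed regressor equals the true one plus noise $$u$$ that is uncorrelated with both the true value and the regression error. Rounding error only approximates this, and here the dependent variable is rounded as well, so this is a guide rather than an exact description of the simulation. The symbols $$x^{*}$$, $$u$$ and $$\hat{\beta}_{1}$$ are our notation:

$$x = x^{*} + u, \qquad \operatorname{plim} \hat{\beta}_{1} = \beta_{1} \frac{\sigma_{x^{*}}^2}{\sigma_{x^{*}}^2 + \sigma_{u}^2}$$

Because the ratio is below one whenever $$\sigma_{u}^2 > 0$$, the estimated coefficient is pulled toward zero.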

The rounding exercise also demonstrates another problem. If you look closely at the commands generating new log incomes, you will notice that missing values were generated: 3 for fathers and 123 for sons. These are cases where the income was rounded to zero; the natural log of zero does not exist, hence the missing value for log income. One criticism of the intergenerational income elasticity is that its calculation requires dropping individuals with no earnings.
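Since the simulated incomes are strictly positive, each missing log value should correspond to an income that rounded to zero. A short sketch for checking this, and for seeing how much of the change in the coefficient comes from the dropped observations rather than the rounding itself, assuming the variables created above are still in memory:

```stata
* Count incomes that rounded to zero (these become missing after ln())
count if rounded_father_inc == 0
count if rounded_son_inc == 0

* Re-run the unrounded regression on the observations that survive rounding
reg log_son_inc log_father_inc if !missing(log_rounded_son_inc, log_rounded_father_inc)
```

Comparing this last regression with the rounded one holds the estimation sample fixed, so any remaining difference in the coefficient is due to the rounded values alone.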

# The Effect of Censoring on Estimated Intergenerational Income Elasticities

Now we will consider what happens when we top code incomes, a common practice in income datasets. We will impose a top code of $100,000 on the rounded incomes using Stata’s min function (all incomes above $100,000 simply get coded as $100,000):

. gen censored_father_inc = min(rounded_father_inc,100000)

. gen censored_son_inc = min(rounded_son_inc,100000)

. gen log_censored_father_inc = ln(censored_father_inc)
(3 missing values generated)

. gen log_censored_son_inc = ln(censored_son_inc)
(123 missing values generated)

. reg log_censored_son_inc log_censored_father_inc

Source │       SS           df       MS      Number of obs   =     4,875
─────────────┼──────────────────────────────────   F(1, 4873)      =    455.88
Model │  318.699794         1  318.699794   Prob > F        =    0.0000
Residual │  3406.62359     4,873  .699081385   R-squared       =    0.0855
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0854
Total │  3725.32338     4,874  .764325684   Root MSE        =    .83611

────────────────────────┬────────────────────────────────────────────────────────────────
log_censored_son_inc │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
────────────────────────┼────────────────────────────────────────────────────────────────
log_censored_father_inc │   .3794401   .0177712    21.35   0.000     .3446006    .4142796
_cons │   5.959218   .1801042    33.09   0.000     5.606133    6.312304
────────────────────────┴────────────────────────────────────────────────────────────────

Notice that this further reduces our estimated intergenerational income elasticity to 0.38. The main takeaway is the same whether we introduce mismeasurement through rounding or through censoring: in both cases the mismeasured data show a weaker relationship between fathers’ and sons’ incomes. This lowers the estimated intergenerational income elasticity and leads us to the false conclusion that there is greater mobility. However, this greater estimated mobility is simply a product of measurement; it has nothing to do with sons’ fortunes being less closely tied to those of their fathers.
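Because the top code in the commands above is applied to the already rounded incomes, the estimate of 0.38 reflects rounding and censoring together. A minimal sketch for isolating the top code alone, assuming the variables created earlier are still in memory (the topcoded_* variable names are ours):

```stata
* Count how many rounded incomes are affected by the $100,000 top code
count if rounded_father_inc > 100000
count if rounded_son_inc > 100000

* Apply the top code to the unrounded incomes instead (hypothetical names)
gen topcoded_father_inc = min(father_inc, 100000)
gen topcoded_son_inc = min(son_inc, 100000)
gen log_topcoded_father_inc = ln(topcoded_father_inc)
gen log_topcoded_son_inc = ln(topcoded_son_inc)
reg log_topcoded_son_inc log_topcoded_father_inc
```

No observations are lost in this version, since top coding never produces a zero income.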

# The Effect of Transitory Fluctuations in Income

Now we will turn our attention to the difference between average income over the life cycle and income in the current period. In general, annual income over the life cycle follows a concave shape, with earnings rising over the early career of an individual and then falling in the final years of the career. This suggests that observing earnings very early or very late in an individual’s career will lead to underestimates of average earnings, while observing earnings at the peak of a career will lead to overestimates. This problem can be handled reasonably well by controlling for a quadratic in an individual’s age.
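A sketch of what such a specification could look like in Stata follows. The simulated dataset has no age variables, so father_age and son_age are hypothetical placeholders drawn at random purely to make the syntax runnable; they carry no information about income here:

```stata
* Hypothetical ages, generated only so the command below runs on the simulated data
gen father_age = 40 + floor(runiform() * 15)
gen son_age = 25 + floor(runiform() * 10)

* Factor-variable notation c.x##c.x adds both the linear and the squared age term
reg log_son_inc log_father_inc c.father_age##c.father_age c.son_age##c.son_age
```

In real survey data the age terms would absorb the life-cycle profile of earnings; in this simulation they should simply come out close to zero.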

More problematic is that individuals experience transitory fluctuations in income over their careers: temporary rises and falls in income unrelated to overall trends over the life cycle. To examine the effect these transitory fluctuations have on the estimated income elasticity, let’s introduce some random ups and downs in fathers’ and sons’ earnings. We can do this by treating our son_inc and father_inc variables as average lifetime annual income and creating a new observation of annual income that includes a random increase or decrease relative to this average.

. gen father_income_shock = (runiform()-.5)*.5

. gen transitory_father_inc = father_inc * (1+father_income_shock)

. gen log_transitory_father_inc = ln(transitory_father_inc)

. gen son_income_shock = (runiform()-.5)*.5

. gen transitory_son_inc = son_inc * (1+son_income_shock)

. gen log_transitory_son_inc = ln(transitory_son_inc)

In the above commands, we have adjusted each income by a random percentage drawn uniformly between negative 25% and positive 25%. We can think of these new incomes as observations of a single year of income and the original income variables as observations of the true lifetime average annual income. Now we can see the impact of using one year’s earnings rather than average annual earnings on our estimate of the intergenerational income elasticity:

. reg log_transitory_son_inc log_transitory_father_inc

Source │       SS           df       MS      Number of obs   =     5,000
─────────────┼──────────────────────────────────   F(1, 4998)      =    514.56
Model │  462.431111         1  462.431111   Prob > F        =    0.0000
Residual │  4491.67431     4,998  .898694339   R-squared       =    0.0933
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0932
Total │  4954.10542     4,999  .991019288   Root MSE        =    .94799

──────────────────────────┬────────────────────────────────────────────────────────────────
log_transitory_son_inc │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
──────────────────────────┼────────────────────────────────────────────────────────────────
log_transitory_father_inc │   .4330768   .0190918    22.68   0.000     .3956485    .4705052
_cons │   5.365198   .1932628    27.76   0.000     4.986318    5.744078
──────────────────────────┴────────────────────────────────────────────────────────────────
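A rough back-of-the-envelope calculation suggests how much attenuation to expect here. It assumes the classical errors-in-variables approximation and treats $$\ln(1+u) \approx u$$ for small shocks $$u$$; the notation is ours. A uniform shock on $$[-0.25, 0.25]$$ has variance $$0.5^2/12 \approx 0.021$$, while fathers’ log income has variance $$0.69^2 \approx 0.476$$:

$$\frac{\sigma_{f}^2}{\sigma_{f}^2 + \sigma_{u}^2} \approx \frac{0.476}{0.476 + 0.021} \approx 0.96$$

Multiplying the baseline coefficient of 0.451 by this factor gives roughly 0.43, in line with the estimate in the table above. Under the same assumptions, the shock to sons’ income sits in the dependent variable and adds noise without biasing the slope.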

The effect on the estimated coefficient on father’s log income is quite small: it falls only from 0.45 to 0.43. This result seems to run a bit counter to what we see in real-world data. For example, Table 2 in Solon (1992), the source for our empirical elasticity estimate, shows a clear increase in estimated intergenerational income elasticities as more periods are used to construct average incomes. In a more recent example, Mazumder (2005) finds large changes in the intergenerational income elasticity when using a single observation of annual income versus an average of several years of annual income observations (see Figure 4). One important difference is that our random income shocks may differ from real-world income shocks. In particular, Mazumder notes that transitory income shocks may exhibit some persistence. This autocorrelation in real-world transitory shocks will further weaken the association between father and son incomes, creating a greater attenuation of the intergenerational income elasticity estimate.
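To see how averaging over more years interacts with transitory shocks, one could extend the simulation as sketched below. This is our illustrative extension, not part of the original exercise: it assumes shocks are independent across years (the opposite of the persistence Mazumder describes), and the choice of five years and the avg5_* and father_inc_year* names are ours.

```stata
* Hypothetical extension: average five independent single-year draws of father's income
gen avg5_father_inc = 0
forvalues t = 1/5 {
    gen father_inc_year`t' = father_inc * (1 + (runiform() - .5) * .5)
    replace avg5_father_inc = avg5_father_inc + father_inc_year`t' / 5
}
gen log_avg5_father_inc = ln(avg5_father_inc)

* Compare with the single-year regression above
reg log_transitory_son_inc log_avg5_father_inc
```

With independent shocks the five-year average sits closer to lifetime income than any single year does, so the coefficient would be expected to move back toward the baseline estimate. Persistent shocks would average out more slowly.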