31 January 2018

First, let’s construct a sample of fathers and sons, matching certains characteristics of the joint distribution of father and son earnings in the United States. We will start by generated a sample of 10,000 fathers whose incomes are distributed log normally with a mean and standard deviation equal to that given in the sample if Solon (1992).

. clear . set obs 5000 number of observations (_N) was 0, now 5,000 . gen id = _n

The set obs command generates an empty 5,000 observation dataset. The gen id command simply creates a unique id number for each observation equal to its observation number.

Now we are going to generate fathers’ incomes assuming that earnings are distributed log normal. We can do this by creating log income as a random variable using Stata’s rnormal function, using the mean and standard deviation from Table 1 in Solon (1992).

. gen log_father_inc = rnormal(10.1,0.69)

As for sons’ earnings, we would like those to be a function of fathers’ earnings. We will assume that sons’ log earnings are linearly related to fathers’ log earnings with a mean zero, normally distributed error term:

\[\begin{equation} ln(y_{s}) = \beta_0 + \beta_{1} ln(y_{f}) + \varepsilon \end{equation}\]The value of \(\beta_{1}\) can be taken directly from the estimated coefficient in Table 2 of Solon (1992). The value of \(\beta_{0}\) is then simply equal to the mean log income for sons in Table 1 minus \(\beta_{1}\) times the mean log income for fathers in Table 1. Finally, we can choose the standard deviation for \(\varepsilon\) that, once used in the above equation, generates son incomes that match the standard deviation of log son earnings given in Table 1. This leads to a value of 0.413 for \(\beta_{1}\), 5.58 for \(\beta_{0}\) and 0.94 for \(\sigma_{\varepsilon}\). With these values, we can now generate sons’ log income values:

. gen son_epsilon = rnormal(0,0.94) . gen log_son_inc = 5.58+.413 * log_father_inc + son_epsilon . gen father_inc = exp(log_father_inc) . gen son_inc = exp(log_son_inc)

Let’s take a quick look at our generated incomes, summarizing the data, looking at the correlation between father and son incomes, and then looking at the income distributions graphically.

. sum log_son_inc log_father_inc Variable │ Obs Mean Std. Dev. Min Max ─────────────┼───────────────────────────────────────────────────────── log_son_inc │ 5,000 9.746891 .9820698 6.654865 13.42529 log_father~c │ 5,000 10.10787 .6848147 7.425219 12.61664 . corr log_son_inc log_father_inc (obs=5,000) │ log_so~c log_fa~c ─────────────┼────────────────── log_son_inc │ 1.0000 log_father~c │ 0.3145 1.0000 . histogram father_inc, frequency ytitle(Frequency) xtitle(Father's income) (bin=36, start=1677.7662, width=8329.3728) . graph export father_inc.png, width(500) replace (file father_inc.png written in PNG format) . histogram log_father_inc, frequency ytitle(Frequency) xtitle(Father's log income) (bin=36, start=7.4252186, width=.14420621) . graph export log_father_inc.png, width(500) replace (file log_father_inc.png written in PNG format) . histogram son_inc, frequency ytitle(Frequency) xtitle(Son's income) (bin=36, start=776.55328, width=18781.49) . graph export son_inc.png, width(500) replace (file son_inc.png written in PNG format) . histogram log_son_inc, frequency ytitle(Frequency) xtitle(Son's log income) (bin=36, start=6.6548653, width=.18806746) . graph export log_son_inc.png, width(500) replace (file log_son_inc.png written in PNG format)

To take a graphical look at the relationship between father and son log incomes, we could use a standard scatterplot. However, with 5,000 observations, a scatterplot will be somewhat uninformative (go ahead and try it yourself using Stata’s scatter command if you would like to see why). Instead, we can use a package for Stata to create a binned scatterplot:

. binscatter log_son_inc log_father_inc, xtitle("Father's log income") ytitle("Son's log income") . graph export father_son_scatter.png, width(500) replace (file father_son_scatter.png written in PNG format)

Notice the nice, linear relationship between son’s log income and father’s log income. This should not come as a surprise given that this is how we constructed son’s log income in the first place. If you would like to use the binscatter program on your own computer, you can install it with the following command: *ssc install binscatter*.

First let’s confirm that our simulated data match the real data used in Solon (1992). To check, let’s run a regression to recover the intergenerational income elasticity for the sample:

. reg log_son_inc log_father_inc Source │ SS df MS Number of obs = 5,000 ─────────────┼────────────────────────────────── F(1, 4998) = 548.73 Model │ 476.970622 1 476.970622 Prob > F = 0.0000 Residual │ 4344.37078 4,998 .869221845 R-squared = 0.0989 ─────────────┼────────────────────────────────── Adj R-squared = 0.0987 Total │ 4821.3414 4,999 .964461173 Root MSE = .93232 ───────────────┬──────────────────────────────────────────────────────────────── log_son_inc │ Coef. Std. Err. t P>|t| [95% Conf. Interval] ───────────────┼──────────────────────────────────────────────────────────────── log_father_inc │ .4510567 .0192553 23.43 0.000 .4133079 .4888056 _cons │ 5.187666 .1950764 26.59 0.000 4.805231 5.570102 ───────────────┴────────────────────────────────────────────────────────────────

From the regression results, we see that we get a coefficient on father’s log income 0.45. Thus we have an intergenerational income elasticity (roughly) equal to that of Solon (1992). We can think of this as the true intergenerational income elasticity for our sample. Now we will consider what happens with two common problems with the way incomes are recorded in survey data: rounding and censoring.

First, we will explore the effects of rounding. Suppose that the survey provides options for income that are in $5,000 intervals (alternatively, assume that people tend to round their incomes to the nearest $5,000). We can generate rounded versions of the father and son incomes using Stata’s round function and then take the natural log to get rounded log income values:

. gen rounded_father_inc = round(father_inc,5000) . gen rounded_son_inc = round(son_inc,5000) . gen log_rounded_father_inc = ln(rounded_father_inc) (3 missing values generated) . gen log_rounded_son_inc = ln(rounded_son_inc) (123 missing values generated)

Now we can use these new variables to re-estimate our intergenerational income elasticity:

. reg log_rounded_son_inc log_rounded_father_inc Source │ SS df MS Number of obs = 4,875 ─────────────┼────────────────────────────────── F(1, 4873) = 469.51 Model │ 355.815138 1 355.815138 Prob > F = 0.0000 Residual │ 3693.00687 4,873 .757850785 R-squared = 0.0879 ─────────────┼────────────────────────────────── Adj R-squared = 0.0877 Total │ 4048.82201 4,874 .830697992 Root MSE = .87055 ───────────────────────┬──────────────────────────────────────────────────────────────── log_rounded_son_inc │ Coef. Std. Err. t P>|t| [95% Conf. Interval] ───────────────────────┼──────────────────────────────────────────────────────────────── log_rounded_father_inc │ .3930834 .0181411 21.67 0.000 .3575186 .4286482 _cons │ 5.834878 .1839725 31.72 0.000 5.474209 6.195547 ───────────────────────┴────────────────────────────────────────────────────────────────

Notice that the measurement error we introduced by rounding incomes has led to an attenuation bias for the intergenerational income elasticity, substantially reducing the estimated coefficient on father’s log income to 0.39. Using rounded incomes, without acknowledging the impact of this rounding on the estimation, would lead us to conclude there is significantly more income mobility than is actually in the underlying data.

The rounding exercise also demonstrates another problem. If you look closely at the commands generating new log incomes, you will notice that several missing values were generated. These missing values are cases where the income was rounded to zero and the natural log of zero does not exist, hence the missing value for log income. One criticism of the intergenerational income elasticity is that its calculation requires dropping individuals with no earnings.

Now we will consider what happens when we top code incomes, a common practice in income datasets. We will impose a top code of $100,000 in our dataset using Stata’s min function (all incomes above $100,000 simply get coded as $100,000):

. gen censored_father_inc = min(rounded_father_inc,100000) . gen censored_son_inc = min(rounded_son_inc,100000) . gen log_censored_father_inc = ln(censored_father_inc) (3 missing values generated) . gen log_censored_son_inc = ln(censored_son_inc) (123 missing values generated) . reg log_censored_son_inc log_censored_father_inc Source │ SS df MS Number of obs = 4,875 ─────────────┼────────────────────────────────── F(1, 4873) = 455.88 Model │ 318.699794 1 318.699794 Prob > F = 0.0000 Residual │ 3406.62359 4,873 .699081385 R-squared = 0.0855 ─────────────┼────────────────────────────────── Adj R-squared = 0.0854 Total │ 3725.32338 4,874 .764325684 Root MSE = .83611 ────────────────────────┬──────────────────────────────────────────────────────────────── log_censored_son_inc │ Coef. Std. Err. t P>|t| [95% Conf. Interval] ────────────────────────┼──────────────────────────────────────────────────────────────── log_censored_father_inc │ .3794401 .0177712 21.35 0.000 .3446006 .4142796 _cons │ 5.959218 .1801042 33.09 0.000 5.606133 6.312304 ────────────────────────┴────────────────────────────────────────────────────────────────

Notice that this further reduces our estimated intergenerational income elasticity to 0.38. The main takeaway is the same, whether we introduce mismeasurement through rounding or censoring of the data, any mismeasurement leads to the appearance of a weaker relationship between father and son’s incomes. This leads to a lower estimated intergenerational income elasticity, leading us to the false conclusion that there is greater mobility. However, this greater estimated mobility is simply the product of measurement, it has nothing to do with sons’ fortunes being less closely tied to those of their fathers.

Now we will turn our attention to the difference between average income over the life cycle and income in the current period. In general, annual income over the life cycle follows a concave shape, with earnings rising over the early career of an individual and then falling in the final years of the career. This suggests that observing earnings very early or very late in an individual’s career will lead to underestimates of average earnings and observing earnings in the peak of a career will lead to overestimates of average earnings. This problem can be handled reasonably well by controlling for a quadratic in an individual’s age.

More problematic is that individuals experience transitory fluctuations in income over their careers, temporary rises and falls in income unrelated to overall trends over the life cycle. To examine the effect these transitory fluctuations have on the estimated income elasticity, let’s introduce some random ups and downs in father and son’s earnings. We can introduce these transitory fluctuations by treating our *son_inc* and *father_inc* variables as our average lifetime annual income and creating a new observation of annual income that includes a random increase or decrease relative to this average income.

. gen father_income_shock = (runiform()-.5)*.5 . gen transitory_father_inc = father_inc * (1+father_income_shock) . gen log_transitory_father_inc = ln(transitory_father_inc) . gen son_income_shock = (runiform()-.5)*.5 . gen transitory_son_inc = son_inc * (1+son_income_shock) . gen log_transitory_son_inc = ln(transitory_son_inc)

In the above commands, we have adjusted incomes by a random percentage ranging with a uniform probability between negative 25% and positive 25%. We can think of these new incomes as observations of a single year of income and the original income variables as observations of the true lifetime average annual income. Now we can see the impact of using one year’s earnings rather than average annual earnings on our estimate of intergenerational income elasticity:

. reg log_transitory_son_inc log_transitory_father_inc Source │ SS df MS Number of obs = 5,000 ─────────────┼────────────────────────────────── F(1, 4998) = 514.56 Model │ 462.431111 1 462.431111 Prob > F = 0.0000 Residual │ 4491.67431 4,998 .898694339 R-squared = 0.0933 ─────────────┼────────────────────────────────── Adj R-squared = 0.0932 Total │ 4954.10542 4,999 .991019288 Root MSE = .94799 ──────────────────────────┬──────────────────────────────────────────────────────────────── log_transitory_son_inc │ Coef. Std. Err. t P>|t| [95% Conf. Interval] ──────────────────────────┼──────────────────────────────────────────────────────────────── log_transitory_father_inc │ .4330768 .0190918 22.68 0.000 .3956485 .4705052 _cons │ 5.365198 .1932628 27.76 0.000 4.986318 5.744078 ──────────────────────────┴────────────────────────────────────────────────────────────────

The effect on the estimated coefficient on father’s log income is quite small. This result seems to run a bit counter to what we see in real world data. For example, Table 2 in Solon (1992), our source for our empirical elasticity estimate, demonstrates a clear increase in estimated intergenerational income elasticities as more periods are used to construct average incomes. In a more recent example, Mazumder (2005) finds large changes in the intergenerational income elasticity when using a single observation of annual income versus an average of several years of annual income observations (see Figure 4). One important difference is that our random income shocks may be a bit different than real world random income shocks. In particular, Mazumder notes that transitory income shocks may exhibit some persistence. This autocorrelation in real-world transitory shocks will further weaken the association between father and son incomes, creating a greater attenuation of the intergenerational income elasticity estimate.