Age Heaping as a Proxy for Human Capital

John M. Parman

28 February 2018

Looking for Age Heaping

I have pulled demographic and educational attainment data from the IPUMS one percent samples of the 1880, 1900, 1920 and 1940 federal censuses. The resulting dataset is available here. We will use these data to explore age heaping, the tendency to round ages to particular numbers. The most common numbers to round to are those ending in zero or five.

First things first, let’s load the data and construct a variable that gives us the last digit of each individual’s age:

. clear

. use ipums-sample-for-age-heaping.dta

. gen last_digit = mod(age,10)

The mod function gives the modulus of the first number with respect to the second number. In other words, it gives you the remainder after dividing the first number by the second number. By choosing 10 as the second number, the mod function is effectively giving us the last digit of age.
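
To see the behavior on a couple of concrete values, the lines below are a small sketch that can be run in Stata. The ages 47 and 30 are arbitrary illustrations, not values taken from the dataset.

```stata
* mod(x, 10) is the remainder after dividing x by 10; for a
* non-negative integer age this is the last digit
* displays 7
display mod(47, 10)
* displays 0
display mod(30, 10)
```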

The last digit of age should be distributed fairly uniformly, since we do not expect any real cyclicality of births relative to census years. So if people are correctly reporting their ages, we would expect roughly the same number of people to have an age ending in zero, one, two, three and so on. However, this will not necessarily be true at very young ages and very old ages where mortality is an issue. Particularly in the earlier census years, childhood mortality is still a big issue, and a significant number of child deaths occur at early ages. So there will be more one-year-olds than two-year-olds and more two-year-olds than three-year-olds even if the same number of children are born each year. The same goes for the elderly: many of the people who make it to age 70 will not make it to age 71, many who make it to 71 will not make it to 72, and so on. So for the very young and the very old, we do not expect a uniform distribution of the last digit of age. Instead, lower digits should be more likely than higher digits. Let’s check this really quickly:

. tab last_digit if year==1880 & age<10

 last_digit │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │     14,510       10.78       10.78
          1 │     12,788        9.50       20.27
          2 │     14,377       10.68       30.95
          3 │     13,966       10.37       41.32
          4 │     14,076       10.45       51.78
          5 │     13,598       10.10       61.88
          6 │     13,718       10.19       72.07
          7 │     13,063        9.70       81.77
          8 │     12,898        9.58       91.35
          9 │     11,652        8.65      100.00
────────────┼───────────────────────────────────
      Total │    134,646      100.00

. tab last_digit if year==1880 & age>59

 last_digit │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │      6,931       24.30       24.30
          1 │      2,457        8.61       32.91
          2 │      3,158       11.07       43.98
          3 │      2,808        9.84       53.82
          4 │      2,517        8.82       62.64
          5 │      3,406       11.94       74.58
          6 │      2,035        7.13       81.72
          7 │      1,775        6.22       87.94
          8 │      1,904        6.67       94.61
          9 │      1,537        5.39      100.00
────────────┼───────────────────────────────────
      Total │     28,528      100.00

So in 1880, it looks like mortality is affecting the distribution of last digits, particularly for the elderly. Let’s see if this changes by 1940 when medical technology is much better:

. tab last_digit if year==1940 & age<10

 last_digit │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │     19,179        8.55        8.55
          1 │     21,692        9.67       18.21
          2 │     23,161       10.32       28.53
          3 │     22,381        9.97       38.50
          4 │     22,808       10.16       48.66
          5 │     22,961       10.23       58.89
          6 │     22,051        9.82       68.72
          7 │     22,563       10.05       78.77
          8 │     23,870       10.64       89.41
          9 │     23,772       10.59      100.00
────────────┼───────────────────────────────────
      Total │    224,438      100.00

. tab last_digit if year==1940 & age>59

 last_digit │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │     21,822       15.94       15.94
          1 │     14,320       10.46       26.39
          2 │     16,213       11.84       38.23
          3 │     14,707       10.74       48.97
          4 │     14,027       10.24       59.22
          5 │     15,084       11.02       70.23
          6 │     10,992        8.03       78.26
          7 │     10,570        7.72       85.98
          8 │     10,216        7.46       93.44
          9 │      8,983        6.56      100.00
────────────┼───────────────────────────────────
      Total │    136,934      100.00

By 1940, we do not really see a systematic trend in the frequency of last digits among children, but we still see a noticeable downward trend for the elderly. This should make sense. Improvements in health at the turn of the century made a big difference for infants and children, but deaths at old ages are always going to be present.

Let’s ignore the deaths at very young and very old ages, focusing our attention on people between the ages of 20 and 59. Within this range, it is reasonable to assume that the last digit of age should be fairly uniformly distributed. Let’s check:

. tab last_digit if age>19 & age<60

 last_digit │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          0 │    243,892       12.94       12.94
          1 │    177,709        9.43       22.38
          2 │    204,582       10.86       33.23
          3 │    185,421        9.84       43.07
          4 │    184,922        9.81       52.89
          5 │    204,573       10.86       63.74
          6 │    177,017        9.39       73.14
          7 │    165,788        8.80       81.94
          8 │    178,544        9.48       91.41
          9 │    161,788        8.59      100.00
────────────┼───────────────────────────────────
      Total │  1,884,236      100.00

This looks pretty uniform except for one unusual feature. There appear to be far more people with ages ending in zero (and somewhat more with ages ending in five) than we would expect. This prominence of zeros and, to a lesser extent, fives suggests that there may be age heaping taking place. Of course, it could be the case that there really are many more 20, 30, 40, and 50 year olds than 21, 31, 41 and 51 year olds. How can we check whether this is the true distribution of ages or whether there is rounding?

The Correlates of Age Heaping

We cannot know for certain, but we can make some assumptions about how the likelihood of heaping should correspond to other individual characteristics. In particular, let’s assume that people with lower levels of human capital are more likely to round their ages. We can use the data to test whether people with ages ending in zero or five are more educated or less educated on average than people with ages ending in other digits. To do this, we will start by constructing an indicator variable equal to one if an individual has an age ending in zero or five and equal to zero otherwise. To measure human capital, we will construct an indicator for literacy, an indicator for being a high school graduate, and a measure of years of schooling:

. gen zero_or_five = 0 if last_digit~=0 & last_digit~=5
(833,278 missing values generated)

. replace zero_or_five = 1 if last_digit==0 | last_digit==5
(833,278 real changes made)

. gen literate = 0 if lit==1
(3,511,419 missing values generated)

. replace literate = 1 if lit==2 | lit==3 | lit==4
(1,616,129 real changes made)

. gen years_of_schooling = 0 if higrade==1
(3,481,606 missing values generated)

. replace years_of_schooling = higrade - 3 if higrade>1 & higrade~=.
(1,168,072 real changes made)

. gen hs_grad = 0 if years_of_schooling < 12 & years_of_schooling~=.
(2,570,058 missing values generated)

. replace hs_grad = 1 if years_of_schooling > 11 & years_of_schooling~=.
(256,524 real changes made)
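
Indicators like these can also be built in one step each. The following is a sketch of an alternative construction, not the code used to produce the output in this document. The variable names ending in _alt are hypothetical and are introduced only so that the originals are not overwritten.

```stata
* inlist() is 1 when last_digit equals any listed value and 0 otherwise;
* the if qualifier leaves the result missing when last_digit is missing
generate byte zero_or_five_alt = inlist(last_digit, 0, 5) if !missing(last_digit)

* a logical expression evaluates to 1 or 0, so no replace step is needed;
* this mirrors hs_grad, including missing values where schooling is unknown
generate byte hs_grad_alt = (years_of_schooling >= 12) if !missing(years_of_schooling)

* both checks should pass silently if the two constructions agree
assert zero_or_five_alt == zero_or_five
assert hs_grad_alt == hs_grad
```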

Now we can run some simple regressions to see if these measures of human capital help us predict having an age ending in zero or five:

. reg zero_or_five literate if age>19 & age<60

      Source │       SS           df       MS      Number of obs   = 1,150,044
─────────────┼──────────────────────────────────   F(1, 1150042)   =   4539.90
       Model │  841.535029         1  841.535029   Prob > F        =    0.0000
    Residual │  213176.744 1,150,042  .185364312   R-squared       =    0.0039
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0039
       Total │  214018.279 1,150,043  .186095893   Root MSE        =    .43054

─────────────┬────────────────────────────────────────────────────────────────
zero_or_five │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
    literate │  -.0956001   .0014188   -67.38   0.000     -.098381   -.0928192
       _cons │   .3344165   .0013552   246.77   0.000     .3317605    .3370726
─────────────┴────────────────────────────────────────────────────────────────

. reg zero_or_five hs_grad if age>19 & age<60

      Source │       SS           df       MS      Number of obs   =   734,192
─────────────┼──────────────────────────────────   F(1, 734190)    =     73.24
       Model │  12.7133561         1  12.7133561   Prob > F        =    0.0000
    Residual │  127445.619   734,190    .1735867   R-squared       =    0.0001
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0001
       Total │  127458.333   734,191   .17360378   Root MSE        =    .41664

─────────────┬────────────────────────────────────────────────────────────────
zero_or_five │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
     hs_grad │   .0091517   .0010694     8.56   0.000     .0070558    .0112477
       _cons │   .2209283   .0005779   382.30   0.000     .2197957     .222061
─────────────┴────────────────────────────────────────────────────────────────

. reg zero_or_five years_of_schooling if age>19 & age<60

      Source │       SS           df       MS      Number of obs   =   734,192
─────────────┼──────────────────────────────────   F(1, 734190)    =      0.05
       Model │  .009526401         1  .009526401   Prob > F        =    0.8148
    Residual │  127458.323   734,190  .173604003   R-squared       =    0.0000
─────────────┼──────────────────────────────────   Adj R-squared   =   -0.0000
       Total │  127458.333   734,191   .17360378   Root MSE        =    .41666

───────────────────┬────────────────────────────────────────────────────────────────
      zero_or_five │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
───────────────────┼────────────────────────────────────────────────────────────────
years_of_schooling │   -.000032   .0001367    -0.23   0.815       -.0003    .0002359
             _cons │    .223886   .0013106   170.82   0.000     .2213172    .2264548
───────────────────┴────────────────────────────────────────────────────────────────
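
Because literate is a single zero-one regressor, the first regression above is a linear probability model whose coefficients translate directly into group shares. As a rough reading of those estimates (rounded to four decimals), the implied shares of ages ending in zero or five are:

\[\begin{equation} \begin{aligned} \text{illiterate:}\quad & 0.3344 \\ \text{literate:}\quad & 0.3344 - 0.0956 = 0.2388 \end{aligned} \end{equation}\]

Both shares exceed the 0.20 that a uniform distribution of last digits would imply, but the excess is far larger for the illiterate group. By contrast, the hs_grad coefficient is small and positive, and the years_of_schooling coefficient is statistically indistinguishable from zero. Note that those two regressions are estimated on a different, smaller sample (734,192 observations rather than 1,150,044).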

Let’s do this a little more systematically, including additional potentially relevant controls.

. gen black = 0

. replace black = 1 if race==2
(395,115 real changes made)

. gen female = 0 

. replace female = 1 if sex==2
(1,806,991 real changes made)

. reg zero_or_five literate black female

      Source │       SS           df       MS      Number of obs   = 1,769,976
─────────────┼──────────────────────────────────   F(3, 1769972)   =   3006.96
       Model │  1644.62128         3  548.207094   Prob > F        =    0.0000
    Residual │  322688.174 1,769,972  .182312587   R-squared       =    0.0051
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0051
       Total │  324332.795 1,769,975  .183241455   Root MSE        =    .42698

─────────────┬────────────────────────────────────────────────────────────────
zero_or_five │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
    literate │  -.0792631   .0012291   -64.49   0.000    -.0816722   -.0768541
       black │   .0449308   .0011236    39.99   0.000     .0427285    .0471332
      female │   .0045789   .0006421     7.13   0.000     .0033204    .0058374
       _cons │   .3069861   .0012561   244.39   0.000     .3045241    .3094481
─────────────┴────────────────────────────────────────────────────────────────

Finally, let’s add some fixed effects for region and census year:

. xi: reg zero_or_five literate black female i.region i.year
i.region          _Iregion_11-91      (naturally coded; _Iregion_11 omitted)
i.year            _Iyear_1880-1940    (naturally coded; _Iyear_1880 omitted)
note: _Iyear_1940 omitted because of collinearity

      Source │       SS           df       MS      Number of obs   = 1,769,976
─────────────┼──────────────────────────────────   F(14, 1769961)  =    773.55
       Model │  1972.40261        14  140.885901   Prob > F        =    0.0000
    Residual │  322360.392 1,769,961  .182128528   R-squared       =    0.0061
─────────────┼──────────────────────────────────   Adj R-squared   =    0.0061
       Total │  324332.795 1,769,975  .183241455   Root MSE        =    .42677

─────────────┬────────────────────────────────────────────────────────────────
zero_or_five │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
    literate │  -.0744088   .0012435   -59.84   0.000     -.076846   -.0719716
       black │   .0460044   .0012098    38.03   0.000     .0436332    .0483757
      female │   .0044168   .0006423     6.88   0.000     .0031578    .0056758
 _Iregion_12 │   .0022653   .0013544     1.67   0.094    -.0003892    .0049198
 _Iregion_21 │  -.0102248   .0013531    -7.56   0.000    -.0128769   -.0075728
 _Iregion_22 │  -.0111654   .0014748    -7.57   0.000     -.014056   -.0082748
 _Iregion_31 │  -.0009243   .0015087    -0.61   0.540    -.0038813    .0020327
 _Iregion_32 │  -.0093862   .0016187    -5.80   0.000    -.0125589   -.0062136
 _Iregion_33 │  -.0102054   .0016365    -6.24   0.000    -.0134128    -.006998
 _Iregion_41 │   .0000999   .0023554     0.04   0.966    -.0045167    .0047165
 _Iregion_42 │    .000526    .001924     0.27   0.785     -.003245    .0042971
 _Iregion_91 │  -.0386298   .0137706    -2.81   0.005    -.0656197   -.0116399
 _Iyear_1900 │  -.0266067   .0009017   -29.51   0.000     -.028374   -.0248393
 _Iyear_1920 │  -.0324146    .000855   -37.91   0.000    -.0340905   -.0307388
 _Iyear_1940 │          0  (omitted)
       _cons │    .331187   .0017642   187.73   0.000     .3277292    .3346447
─────────────┴────────────────────────────────────────────────────────────────

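These regressions seem to confirm that there is likely rounding taking place. Zeros and fives are substantially more common among illiterate individuals, among black individuals relative to white individuals, and in 1880 relative to 1900 and 1920. The 1940 indicator is omitted because of collinearity, which suggests that no 1940 observations enter the regression, presumably because literacy is missing in the 1940 sample.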
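
A side note on syntax: the xi: prefix is an older way of creating indicator variables. In Stata 11 and later, factor-variable notation lets regress expand i.region and i.year directly. The line below is a sketch of that alternative, not the command that produced the table above. It should give the same coefficient estimates, with the indicator names displayed differently.

```stata
* factor-variable notation: i.region and i.year are expanded into
* indicators on the fly, with the lowest category as the base level
regress zero_or_five literate black female i.region i.year
```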

Graphing the Relationship between Age Heaping and Literacy

In the literature on age heaping, a few different measures are used to describe the extent of heaping, or rounding, in the sample. One simple measure is the following heaping index:

\[\begin{equation} H_{i} = \frac{5}{4} \left( X_{i} - 20\right) \end{equation}\]

\(X_{i}\) is the proportion of individuals in sample \(i\) with an age ending in either zero or five, expressed as a percentage. \(H_{i}\) then ranges from zero in the case of no heaping (20 percent of individuals have an age ending in zero or five) to 100 (100 percent of individuals have an age ending in zero or five). Let’s go ahead and calculate this heaping index at the state-race-year level using the collapse command:

. keep if age>19 & age<60
(1,781,030 observations deleted)

. collapse (sum) rounded_sum = zero_or_five (count) all_sum = zero_or_five (mean) mean_literacy_rate = literate, by(stateicp black year)

. gen heaping_index = (5/4) * ( 100 * rounded_sum/all_sum - 20)

. twoway (scatter heaping_index mean_literacy_rate if black==0 & all_sum>100) (scatter heaping_index mean_literacy_rate if black==1 & all_sum>100), ytitle(Heaping index) xtitle(Literacy rate) legend(order(1 "White" 2 "Black"))

. graph export heaping-v-literacy.png, width(500) replace
(file heaping-v-literacy.png written in PNG format)
Age heaping and literacy by state and race

That is a pretty clear negative relationship at the state level between age heaping and literacy.

A couple of quick notes about the graphing command above. Notice that I restricted the plot to state-race-year cells with more than 100 individuals used to calculate the heaping index. When there is only a small number of individuals, any index like the heaping index will become quite noisy and rather uninformative (you can take a look at just how ugly the graph gets for yourself if you drop the restriction on the all_sum variable). Also note that I changed the keys for the legend. Unless you do a good job of nicely labeling the values of your variables, you will typically end up with pretty uninformative legends.
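
To get a feel for how noisy the index is in small cells, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that each of \(n\) individuals independently reports an age ending in zero or five with probability \(p = 0.2\) (the no-heaping benchmark); neither assumption comes from the data. Writing the index as \(H = \frac{5}{4}\left(100\,\hat{p} - 20\right)\), where \(\hat{p}\) is the sample share, its standard deviation at \(n = 100\) is

\[\begin{equation} \mathrm{sd}(H) = 125 \sqrt{\frac{p(1-p)}{n}} = 125 \sqrt{\frac{0.2 \times 0.8}{100}} = 125 \times 0.04 = 5 \end{equation}\]

For comparison, the pooled tabulation for ages 20 to 59 shown earlier has 12.94 + 10.86 = 23.80 percent of ages ending in zero or five, which corresponds to an index of \(\frac{5}{4}(23.80 - 20) = 4.75\). Under these assumptions, sampling noise in a cell of 100 people is about as large as the pooled index itself.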

Exercises

  1. In class, we saw a slightly different approach to age heaping used by Ó Gráda and Mokyr. They estimated the following measure of age heaping: \[\begin{equation} \gamma = \sum_{i=15}^{34} \left( \frac{n_{i}}{\sum{n_{i}}} - \frac{\hat{n}_{i}}{\sum\hat{n}_{i}}\right)^{2} \end{equation}\] where \(n_{i}\) is the observed number of people of age \(i\) and \(\hat{n}_{i}\) is the predicted number based on a smooth polynomial. Try implementing this alternative definition of age heaping using a quartic polynomial to get the predicted \(\hat{n}_{i}\) values.

  2. In theory, the original dataset contains the same individuals at multiple points in time (in practice, these are based on one percent samples, so it is very unlikely the same individual appears twice). This gives us a chance to see whether people round ages more or less as they get older. For the cohort born between 1860 and 1870, determine whether age heaping is getting more or less pronounced as they get older.

  3. For young children, we would not expect to see ages getting rounded to zero or five; the difference between 43 and 45 is a whole lot smaller than the difference between 3 and 5. However, that does not mean ages are being accurately reported for young children. Can you detect any patterns that suggest rounding of ages for children under the age of 10 in the census?

  4. One other reason for misreporting age is that various laws have age cutoffs. For example, under current laws you must be 18 to vote and 21 to purchase alcohol. If you asked for ages on a college campus, you might find the number of people stating they are 21 to be curiously higher than the number of people stating they are 20. What ages might people falsely claim in these census samples (1880, 1900, 1920, 1940)? Do you see evidence that individuals are falsely claiming these ages?