Names as Proxies for Wealth

John M. Parman

25 September 2019


For this exercise, we will work with 1870 federal census data to explore the links between names and socioeconomic status. The correlation between first names and socioeconomic status forms the basis of the pseudo-linking approach to estimating intergenerational mobility used by Olivetti and Paserman (2015). The correlation between last names and socioeconomic status is exploited by Clark and Cummins (2015) in their work on long-term trends in British wealth mobility.

We will work with a dataset constructed from the IPUMS samples of the 1870 federal census, ipums-1870-census-data-with-names-and-wealth.dta. This file is available on our course website. However, you can also obtain the same data directly through IPUMS. The key variables contained in the dataset are first and last name, value of real and personal property, occupational status, and basic demographic characteristics.

Identifying Common Names

First things first, we need to identify common names. We will begin by loading the data and doing a bit of cleaning of the names. For our purposes, there is a slight problem with the way IPUMS provides the first names: many of them contain initials either before or after a full first name. These initials are going to make it difficult to group common names, so our first step after opening the data is to remove them with a few string function tricks:

. clear

. use ipums-1870-census-data-with-names-and-wealth.dta

. gen first_space = strpos(namefrst," ")

The line above uses the strpos() function to find the first space in the first name variable, returning its position in the string. If there is no space, it returns a value of zero. We’ll create a new cleaned version of first names, starting by simply creating a new variable equal to namefrst if the original namefrst had no spaces in it:

. gen cleaned_namefrst = namefrst if first_space == 0
(173,713 missing values generated)
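
As a quick check of what strpos() returns, the two display commands below apply it to literal strings (the example names are made up for illustration, not taken from the data). The space in JOHN W is its fifth character, and MARY contains no space:

. display strpos("JOHN W", " ")
5

. display strpos("MARY", " ")
0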

. split namefrst, gen(namepart) parse(" ") 
variables created as string: 
namepart1  namepart2  namepart3  namepart4  namepart5  namepart6

This splits up the first name string at each space (the parse() option) and creates new variables for each substring (namepart1, namepart2, and so on). Now for people with spaces in their first name (i.e., people with an initial), we can set the cleaned first name equal to the longer of the first two substrings (the full name rather than the initial):

. replace cleaned_namefrst = namepart1 if length(namepart1)>length(namepart2) & namepart2~=""
(145,968 real changes made)

. replace cleaned_namefrst = namepart2 if length(namepart2)>length(namepart1) & namepart2~=""
(10,358 real changes made)

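We could keep going through the remaining namepart variables, but for now we’ll say this is good enough (only 0.5% of our sample has more than one space in the namefrst string). Note that the two replace commands require one part to be strictly longer than the other, so 17,387 observations (173,713 minus 145,968 minus 10,358) are left with a blank cleaned_namefrst, for example when the two parts have the same length. Now we are ready to sort on our cleaned-up first names and count them: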

. sort cleaned_namefrst

. gen name_counter = 1

. gen new_name = 0

. replace new_name = 1 if cleaned_namefrst~=cleaned_namefrst[_n-1]
(29,557 real changes made)

. replace name_counter = name_counter + name_counter[_n-1] if new_name==0
(797,178 real changes made, 17,387 to missing)

. gsort cleaned_namefrst -name_counter

. replace name_counter = name_counter[_n-1] if cleaned_namefrst==cleaned_namefrst[_n-1]
(779,791 real changes made)

In the above commands, our new_name variable is set to one for any individual whose first name differs from the previous first name in the sorted data. The name_counter variable keeps a running count of the individuals with the same name: we continue counting until we hit a new name (until new_name is no longer equal to zero), so the last person with a given first name has a value for name_counter equal to the total number of people with that name. The final two lines sort each name’s largest count to the top and copy it to everyone with that name. The 17,387 counts set to missing appear to belong to observations with a blank cleaned_namefrst: blanks sort first, so the running count for them starts from a nonexistent previous observation.
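
A more compact way to get the same per-name totals is sketched below, using the by prefix with _N (the number of observations in each by-group). The variable name name_count_alt is ours, introduced only for illustration, and the assert line is a check that should pass if the two approaches agree for every non-blank name:

. bysort cleaned_namefrst: gen name_count_alt = _N

. assert name_count_alt == name_counter if cleaned_namefrst~=""

Keep in mind that bysort re-sorts the data by cleaned_namefrst, so this is best run at a point where a change in sort order does no harm.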

Now we can take a look at common names by restricting our attention to first names with a high value for name_counter. Our dataset has a little over 800,000 observations in it. Let’s define a common name as one held by at least 0.5% of the population, or roughly 4,000 people in our sample:

. tab cleaned_namefrst if name_counter>4000

cleaned_namefrst │      Freq.     Percent        Cum.
             ANN │      5,572        1.53        1.53
            ANNA │      7,732        2.12        3.65
        CAROLINE │      4,266        1.17        4.81
       CATHARINE │      4,320        1.18        6.00
         CHARLES │     13,150        3.60        9.60
           DAVID │      5,276        1.45       11.05
          EDWARD │      5,413        1.48       12.53
           ELIZA │      7,209        1.98       14.51
       ELIZABETH │     12,540        3.44       17.94
           ELLEN │      6,484        1.78       19.72
            EMMA │      6,565        1.80       21.52
           FRANK │      6,063        1.66       23.18
          GEORGE │     16,222        4.44       27.62
           HENRY │     13,209        3.62       31.24
           JACOB │      4,338        1.19       32.43
           JAMES │     21,388        5.86       38.29
            JANE │      7,544        2.07       40.36
            JOHN │     41,210       11.29       51.65
          JOSEPH │      9,447        2.59       54.24
           JULIA │      4,822        1.32       55.56
          LOUISA │      4,508        1.24       56.79
        MARGARET │      8,119        2.22       59.02
          MARTHA │      9,631        2.64       61.66
            MARY │     51,407       14.09       75.74
           NANCY │      6,608        1.81       77.55
          ROBERT │      5,745        1.57       79.13
          SAMUEL │      5,840        1.60       80.73
           SARAH │     19,357        5.30       86.03
           SUSAN │      6,104        1.67       87.71
          THOMAS │     10,709        2.93       90.64
         WILLIAM │     26,803        7.34       97.98
              WM │      7,358        2.02      100.00
           Total │    364,959      100.00

These are our common names. Let’s see if we notice any differences in average wealth across these names:

. tab cleaned_namefrst if name_counter>4000, sum(realprop)

cleaned_namefrs │    Summary of Real estate value
              t │        Mean   Std. Dev.       Freq.
            ANN │   109.21752   955.42588       5,572
           ANNA │   66.638645   1008.8238       7,732
       CAROLINE │   149.34365   2693.7278       4,266
      CATHARINE │   157.38287   1835.2368       4,320
        CHARLES │   494.81057   4844.0074      13,150
          DAVID │    1072.508   4170.7742       5,276
         EDWARD │   587.57325   3896.3046       5,413
          ELIZA │   160.68525   2472.5584       7,209
      ELIZABETH │    141.8437   1753.4728      12,540
          ELLEN │    44.13942   667.04011       6,484
           EMMA │   99.923839   3231.6566       6,565
          FRANK │   233.15026   2011.2645       6,063
         GEORGE │   695.88263   4339.7406      16,222
          HENRY │   796.44235   7830.7822      13,209
          JACOB │   1315.0251   4425.5907       4,338
          JAMES │   655.86857   5037.8151      21,388
           JANE │    173.6983   2463.7724       7,544
           JOHN │   809.22575   4778.7251      41,210
         JOSEPH │   698.06658    2833.772       9,447
          JULIA │   189.91705   6142.7525       4,822
         LOUISA │   86.435226   1211.0327       4,508
       MARGARET │   89.779529   912.80828       8,119
         MARTHA │   140.83647   4398.3662       9,631
           MARY │   90.501002   1321.2063      51,407
          NANCY │   131.15224   1706.8991       6,608
         ROBERT │   641.41897    3219.652       5,745
         SAMUEL │   1240.1541   14502.762       5,840
          SARAH │   94.800021   937.85688      19,357
          SUSAN │   146.54653    1776.962       6,104
         THOMAS │   787.60855   8578.0982      10,709
        WILLIAM │   716.34164   4307.9256      26,803
             WM │   845.30402   3604.0505       7,358
          Total │   444.69382   4305.0563     364,959

. tab cleaned_namefrst if name_counter>4000, sum(persprop)

cleaned_namefrs │ Summary of Value of personal estate
              t │        Mean   Std. Dev.       Freq.
            ANN │   48.433058   460.89857       5,572
           ANNA │   28.609674   444.81434       7,732
       CAROLINE │   170.76535   5797.5976       4,266
      CATHARINE │   82.553009   1350.2736       4,320
        CHARLES │   273.83779    3578.223      13,150
          DAVID │   510.25095   3007.7902       5,276
         EDWARD │   481.44892   5824.4034       5,413
          ELIZA │   52.684006   466.95549       7,209
      ELIZABETH │   61.518979   819.20172      12,540
          ELLEN │   16.076342   272.23555       6,484
           EMMA │   10.076161   280.83655       6,565
          FRANK │   150.06762   1934.0911       6,063
         GEORGE │   314.35643   2357.7149      16,222
          HENRY │   329.58112   2378.1733      13,209
          JACOB │   440.14292   1728.5116       4,338
          JAMES │   267.91495    1616.831      21,388
           JANE │   54.068134   646.57473       7,544
           JOHN │   350.72184   3022.5267      41,210
         JOSEPH │   317.80872   2051.4446       9,447
          JULIA │   100.20966   2706.0297       4,822
         LOUISA │   54.941437    1519.506       4,508
       MARGARET │   30.520015    369.5543       8,119
         MARTHA │   97.966982   3693.4969       9,631
           MARY │   39.144805   673.07076      51,407
          NANCY │   65.822791   1338.7636       6,608
         ROBERT │   373.76432   5307.7804       5,745
         SAMUEL │    429.6351   2346.4793       5,840
          SARAH │   62.792633   1327.9369      19,357
          SUSAN │   42.281291    515.4781       6,104
         THOMAS │   297.28079   3589.8667      10,709
        WILLIAM │   315.04511   2477.8445      26,803
             WM │   493.60356   4447.2619       7,358
          Total │   201.17816   2421.4008     364,959

We have a slight problem here. Our common female names have much lower average wealth values than our common male names. This is unsurprising given that during this time period, wealth was most likely listed under the husband but not the wife in a household. For a more meaningful comparison, let’s construct two different measures of household wealth, one that captures the sum of wealth for individual household members and one that is simply the highest wealth value reported by any individual in the household. We can identify households using the serial variable. We will also calculate the highest occscore within the household:

. sort serial

. gen hh_realprop_sum = 0

. gen hh_realprop_max = 0

. replace hh_realprop_sum = realprop if serial~=serial[_n-1]
(10,229 real changes made)

. replace hh_realprop_max = realprop if serial~=serial[_n-1]
(10,229 real changes made)

. replace hh_realprop_sum = hh_realprop_sum[_n-1] + realprop if serial==serial[_n-1] 
(355,556 real changes made)

. replace hh_realprop_max = realprop if serial==serial[_n-1] & realprop>realprop[_n-1]
(69,537 real changes made)

. gsort serial -hh_realprop_sum

. replace hh_realprop_sum = hh_realprop_sum[_n-1] if serial==serial[_n-1]
(304,280 real changes made)

. gsort serial -hh_realprop_max

. replace hh_realprop_max = hh_realprop_max[_n-1] if serial==serial[_n-1]
(525,977 real changes made)

. sort serial

. gen hh_persprop_sum = 0

. gen hh_persprop_max = 0

. replace hh_persprop_sum = persprop if serial~=serial[_n-1]
(57,231 real changes made)

. replace hh_persprop_max = persprop if serial~=serial[_n-1]
(57,231 real changes made)

. replace hh_persprop_sum = hh_persprop_sum[_n-1] + persprop if serial==serial[_n-1] 
(572,785 real changes made)

. replace hh_persprop_max = persprop if serial==serial[_n-1] & persprop>persprop[_n-1]
(42,419 real changes made)

. gsort serial -hh_persprop_sum

. replace hh_persprop_sum = hh_persprop_sum[_n-1] if serial==serial[_n-1]
(188,884 real changes made)

. gsort serial -hh_persprop_max

. replace hh_persprop_max = hh_persprop_max[_n-1] if serial==serial[_n-1]
(616,810 real changes made)

. sort serial

. gen hh_occscore_max = 0

. replace hh_occscore_max = occscore if serial~=[_n-1]
(282,644 real changes made)

. replace hh_occscore_max = occscore if serial==serial[_n-1] & occscore>occscore[_n-1]
(0 real changes made)

. gsort serial -hh_occscore_max

. replace hh_occscore_max = hh_occscore_max[_n-1] if serial==serial[_n-1]
(690,419 real changes made)
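
The household totals and maxima can also be sketched with egen, which computes a group-level statistic and assigns it to every member of the group in one step. The _alt variable names are ours, not part of the original log. By default total() treats missing values as zero and max() ignores them, so results could differ from the step-by-step approach if any wealth or occscore values are missing:

. bysort serial: egen hh_realprop_sum_alt = total(realprop)

. bysort serial: egen hh_realprop_max_alt = max(realprop)

. bysort serial: egen hh_persprop_sum_alt = total(persprop)

. bysort serial: egen hh_persprop_max_alt = max(persprop)

. bysort serial: egen hh_occscore_max_alt = max(occscore)

Comparing these with the variables built above (for example, compare hh_realprop_max hh_realprop_max_alt) is a useful check on the sort-and-fill logic.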

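One caution about the occscore commands above: the first replace is written with the condition serial~=[_n-1], without serial in front of the subscript, unlike the parallel commands for real and personal property. The logged counts (282,644 changes, then 0 changes in the next replace) suggest that the condition held for every observation, so that step simply copied occscore into hh_occscore_max for everyone. The final sort-and-fill step still gives each household its highest occscore, so the result is unaffected, but the intended condition is serial~=serial[_n-1].

Now we can do a fairer comparison of wealth across common first names: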

. tab cleaned_namefrst if name_counter>4000, sum(hh_realprop_max)

cleaned_namefrs │     Summary of hh_realprop_max
              t │        Mean   Std. Dev.       Freq.
            ANN │   4292.0352   16630.774       5,572
           ANNA │   4745.5934   14170.522       7,732
       CAROLINE │   4183.2278   11282.965       4,266
      CATHARINE │   4250.5255     11988.4       4,320
        CHARLES │   4351.0875   12332.875      13,150
          DAVID │   4329.5866   11066.307       5,276
         EDWARD │   4581.5365    14158.56       5,413
          ELIZA │   4201.5937   14254.015       7,209
      ELIZABETH │   4617.5067   16263.988      12,540
          ELLEN │   4715.4917   19476.574       6,484
           EMMA │   4761.4821   13850.093       6,565
          FRANK │   4753.2436   14674.191       6,063
         GEORGE │   4286.9851   14024.497      16,222
          HENRY │   4343.9806   18780.531      13,209
          JACOB │   4670.3649   12190.851       4,338
          JAMES │   3727.4042   11979.452      21,388
           JANE │    4454.928    20005.92       7,544
           JOHN │   4029.2289   14006.478      41,210
         JOSEPH │   4023.7135   12070.107       9,447
          JULIA │   4023.9915   13329.259       4,822
         LOUISA │   3876.3951   11075.418       4,508
       MARGARET │   4151.7958   17566.872       8,119
         MARTHA │   3850.2244    14586.55       9,631
           MARY │   4343.4392   16696.405      51,407
          NANCY │   3777.7518   14301.808       6,608
         ROBERT │   3702.6956    9832.979       5,745
         SAMUEL │   4426.0663    21320.09       5,840
          SARAH │   4169.5699   13296.015      19,357
          SUSAN │    4048.422   11807.303       6,104
         THOMAS │   3897.5595   15028.204      10,709
        WILLIAM │   4171.1279   13840.455      26,803
             WM │   4889.0034   26181.796       7,358
          Total │   4228.0345   15133.884     364,959

. tab cleaned_namefrst if name_counter>4000, sum(hh_persprop_max)

cleaned_namefrs │     Summary of hh_persprop_max
              t │        Mean   Std. Dev.       Freq.
            ANN │   2234.0246   17417.099       5,572
           ANNA │   2398.7254   13537.411       7,732
       CAROLINE │   2109.7506   11457.601       4,266
      CATHARINE │   2110.3565   10544.089       4,320
        CHARLES │   2046.5221   10010.474      13,150
          DAVID │    1883.561   6488.0704       5,276
         EDWARD │   2338.9483   10969.133       5,413
          ELIZA │   1798.2559    7251.552       7,209
      ELIZABETH │   1942.6273   8714.3953      12,540
          ELLEN │    2555.891   20860.553       6,484
           EMMA │   2162.1922   9377.9149       6,565
          FRANK │   2387.3924   14155.326       6,063
         GEORGE │   1905.5353   8072.6011      16,222
          HENRY │   1882.7166   10950.825      13,209
          JACOB │   1731.6955   7105.9839       4,338
          JAMES │   1648.2388   7444.9111      21,388
           JANE │   2231.0881   16746.072       7,544
           JOHN │   1810.1869   12778.949      41,210
         JOSEPH │   1820.0874   11333.676       9,447
          JULIA │   1921.6794   8132.7159       4,822
         LOUISA │     1781.26   7879.4137       4,508
       MARGARET │   2096.0002   15131.974       8,119
         MARTHA │   2004.9226   16064.886       9,631
           MARY │   2034.8685   13582.329      51,407
          NANCY │   1602.4355   12520.717       6,608
         ROBERT │   1784.5909   8395.4167       5,745
         SAMUEL │   1978.0942   15846.906       5,840
          SARAH │   2044.2208   12648.068      19,357
          SUSAN │   1818.2477   9127.6428       6,104
         THOMAS │   1466.9695   5777.2505      10,709
        WILLIAM │   1780.6675   10889.471      26,803
             WM │   2033.1903   8976.3855       7,358
          Total │   1940.0685   11875.405     364,959

Recall that Olivetti and Paserman are focused on rich parents and poor parents choosing different names for their children. Let’s restrict our attention to children and collapse the data by name to make it a little easier to work with:

. keep if age<19
(429,152 observations deleted)

. collapse (mean) mean_persprop = hh_persprop_sum mean_realprop = hh_realprop_sum mean_occscore = hh_occscore_max (sd) sd_persprop = hh_persprop_sum sd_realprop = hh_realprop_sum sd_occscore = hh_occscore_max (count) n_name = hh_realprop_sum, by(cleaned_namefrst)

Now we can take a look at the variation in household wealth by children’s names. Our n_name variable will let us restrict our attention to common names, as it tells us how many children in the sample had that particular name (note that this is different from our name_counter variable, which measured the frequency of the name in the entire population). Let’s look at names that are held by at least 500 children, listing the 10 highest wealth names and the 10 lowest wealth names:

. gen common_name = 0

. replace common_name = 1 if n_name>499
(108 real changes made)

. gsort -common_name -mean_persprop

. list in 1/10 if n_name>500

     │ cleaned~t   mean_p~p   mean_r~p   mean_o~e   sd_per~p   sd_rea~p   sd_occ~e   n_name   common~e │
  1. │    AMELIA   4683.733   6572.407   24.61992   49090.13   40342.23   11.11685      713          1 │
  2. │ HENRIETTA   3678.848    5457.76   24.25475   37475.78   32999.08   12.44691      526          1 │
  3. │      ROSA   3566.188   5617.619   24.65568    39675.8   35667.18   12.41829      819          1 │
  4. │ FREDERICK   3456.565    5718.45    25.5491   36027.55   32129.81   12.06074      998          1 │
  5. │      CHAS   3155.746   5809.289   26.51714   20770.77   20198.71   12.63821     1021          1 │
  6. │   MATILDA   3148.551   5319.021   23.03366   37384.63   33854.36   11.48259      921          1 │
  7. │     CLARA   3038.783    5409.69   26.17065   14769.41    14761.9   12.31589     1547          1 │
  8. │     ALICE   3013.635   5074.588   24.74371   14674.97   12994.76   12.93647     2661          1 │
  9. │     HARRY   2971.353   7041.266   26.85243   7529.225    32662.7   13.44357     1091          1 │
 10. │    JENNIE    2920.01   6305.681     26.362   11588.57   19379.29   14.86505     1279          1 │

. gsort -common_name mean_persprop

. list in 1/10 if n_name>500

     │ cleaned~t   mean_p~p   mean_r~p   mean_o~e   sd_per~p   sd_rea~p   sd_occ~e   n_name   common~e │
  1. │   PATRICK   849.7098   2587.869   23.55033   2159.022   6942.212   9.970986      765          1 │
  2. │   MICHAEL   1102.291   2928.301   23.49669   3103.881   5673.021   10.11357     1059          1 │
  3. │   BRIDGET   1186.301   2638.668   23.95361   3107.279   5976.119   10.62713      582          1 │
  4. │    MARTIN   1238.226   4319.599   21.81702   2792.288   20381.97   11.28614      705          1 │
  5. │ ELISABETH   1305.282   3932.609   23.07977   2270.936   6894.267   10.12974      514          1 │
  6. │  VIRGINIA   1350.705    2860.15   22.37542   4641.422   6554.518    12.4571      594          1 │
  7. │   RICHARD   1447.788    3533.83   22.78967   4361.714   10391.61    12.2511     1355          1 │
  8. │     PETER   1451.647   4986.575   22.77657   7653.858   36967.51    11.9052     1383          1 │
  9. │   LUCINDA   1462.328   3250.463   21.94888   5365.538   8646.404    12.2797      626          1 │
 10. │   FRANCES   1474.999   3691.926   22.65911   5682.605   11718.98   12.01183     1015          1 │

Note that the difference in average wealth between our richest and poorest names is rather substantial. It is this type of variation that allows Olivetti and Paserman to use children’s names as a proxy for parents’ socioeconomic status. Play around with the data a bit more to explore the ways that first names inform us about socioeconomic status. It would also be instructive for you to repeat this exercise for last names as opposed to first names.
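
As a starting point for the last-name version, the sketch below counts surnames and tabulates household real estate wealth for the most frequent ones. It assumes the full, uncollapsed dataset with the household variables is back in memory, that the last-name variable is called namelast (the usual IPUMS name; check it in your copy of the data), and that 1,000 is a sensible cutoff, which is an arbitrary choice for illustration:

. bysort namelast: gen surname_count = _N

. tab namelast if surname_count>1000 & namelast~="", sum(hh_realprop_max)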


  1. One common naming practice is to name sons after their fathers. We can identify these names in the data by looking for individuals with junior (or III, IV, etc.) as part of their name. Create an indicator variable identifying sons named after their fathers and use this variable to assess whether these sons tend to have higher or lower levels of wealth than the general population. Can you give a potential explanation for your finding?
  2. Another common naming practice is choosing biblical names. Assess whether biblical names are associated with higher or lower levels of wealth and determine whether any relationship you find varies across regions.
  3. In the literature on names and labor market discrimination, one aspect of names occasionally associated with worse labor market outcomes is the uniqueness of the name. In this question, you will explore this relationship in two different ways. First, estimate the average difference in wealth between individuals with unique names and those with non-unique names. Second, construct a continuous measure for the commonness of a name and estimate the marginal effect of an increase in name commonness on wealth.
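
For the third question, one possible setup is sketched below. It assumes the uncollapsed data are in memory with cleaned_namefrst and hh_realprop_sum already constructed. The variable names name_freq, unique_name, and name_share are ours, and using summed household real estate wealth as the outcome is only one of several reasonable choices:

. bysort cleaned_namefrst: gen name_freq = _N if cleaned_namefrst~=""

. gen byte unique_name = (name_freq==1) if name_freq<.

. gen name_share = name_freq/_N

. regress hh_realprop_sum unique_name

. regress hh_realprop_sum name_share

In this sketch, the coefficient on unique_name is the difference in mean wealth between individuals with unique and non-unique names, and the coefficient on name_share is the estimated change in wealth associated with a one-unit increase in the share of the sample holding the name.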