25 September 2019
For this exercise, we will work with 1870 federal census data to explore the links between names and socioeconomic status. The correlation between first names and socioeconomic status forms the basis of the pseudo-linking approach to estimating intergenerational mobility used by Olivetti and Paserman (2015). The correlation between last names and socioeconomic status is exploited by Clark and Cummins (2015) in their work on long-term trends in British wealth mobility.
We will work with a dataset constructed from the IPUMS samples of the 1870 federal census, ipums-1870-census-data-with-names-and-wealth.dta. This file is available on our course website. However, you can also obtain the same data directly through IPUMS. The key variables contained in the dataset are first and last name, value of real and personal property, occupational status, and basic demographic characteristics.
First things first, we need to identify common names. We will begin by loading the data and doing a bit of cleaning of the names. For our purposes, there is a slight problem with the way IPUMS provides the first names: many of them contain initials either before or after a full first name. These initials make it difficult to group common names, so our first step after opening the data is to remove them with a few string function tricks:
. clear
. use ipums-1870-census-data-with-names-and-wealth.dta
. gen first_space = strpos(namefrst," ")
The last command above finds the first space in the first name variable, returning its position in the string using the strpos() function. If there is no space, it returns a value of zero. We’ll create a new cleaned version of first names, starting by simply creating a new variable equal to namefrst if the original namefrst had no spaces in it. We then split the names that do contain spaces into their pieces:
. gen cleaned_namefrst = namefrst if first_space == 0
(173,713 missing values generated)
. split namefrst, gen(namepart) parse(" ")
variables created as string:
namepart1  namepart2  namepart3  namepart4  namepart5  namepart6
The split command breaks up the first name string at each space (the character given in the parse() option) and creates a new variable for each substring (namepart1, namepart2, and so on). Now for people with spaces in their first name (i.e. people with an initial), we can set the cleaned first name equal to the longer of the first two substrings (the full name rather than the initial):
. replace cleaned_namefrst = namepart1 if length(namepart1)>length(namepart2) & namepart2~=""
(145,968 real changes made)
. replace cleaned_namefrst = namepart2 if length(namepart2)>length(namepart1) & namepart2~=""
(10,358 real changes made)
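To see what these steps do on a small scale, here is a minimal sketch that runs the same commands on five made-up names (the names are hypothetical and are not drawn from the census file). It also shows a case that the two replace commands leave empty: a name whose first two pieces are equally long.

```stata
* Toy illustration of the cleaning steps; the five names are made up.
clear
input str20 namefrst
"JOHN"
"WM H"
"J HENRY"
"MARY ANN E"
"ED AL"
end

* Position of the first space; 0 when the name has no space
gen first_space = strpos(namefrst, " ")

* Names without a space are already clean
gen cleaned_namefrst = namefrst if first_space == 0

* Break the remaining names into pieces at each space
split namefrst, gen(namepart) parse(" ")

* Keep whichever of the first two pieces is longer
replace cleaned_namefrst = namepart1 if length(namepart1) > length(namepart2) & namepart2 != ""
replace cleaned_namefrst = namepart2 if length(namepart2) > length(namepart1) & namepart2 != ""

list namefrst cleaned_namefrst, noobs
```

In this sketch "WM H" becomes WM, "J HENRY" becomes HENRY, and "MARY ANN E" becomes MARY, while "ED AL" stays empty because neither piece is longer than the other.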
We could keep going through the remaining namepart variables but for now we’ll say this is good enough (only 0.5% of our sample has more than one space in the namefrst string). Now we are ready to sort on our cleaned up first names and count them:
. sort cleaned_namefrst
. gen name_counter = 1
. gen new_name = 0
. replace new_name = 1 if cleaned_namefrst~=cleaned_namefrst[_n-1]
(29,557 real changes made)
. replace name_counter = name_counter + name_counter[_n-1] if new_name==0
(797,178 real changes made, 17,387 to missing)
. gsort cleaned_namefrst -name_counter
. replace name_counter = name_counter[_n-1] if cleaned_namefrst==cleaned_namefrst[_n-1]
(779,791 real changes made)
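In the above commands, our new_name variable is set to one for any individual whose first name differs from the previous first name. The name_counter variable is used to count the number of individuals with the same name. We continue counting them until we hit a new name (until new_name is no longer equal to zero). The last person with a given first name should then have a value for name_counter equal to the total number of people with that name. The final two lines simply replace the name_counter with this total number for everyone with that name. The 17,387 counters set to missing belong to observations whose cleaned first name is still empty (for example, because the first two pieces of the name are equally long). These observations sort to the top of the data, so the running count has no valid previous value to build on.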
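The same count can be obtained more compactly. The following is an alternative sketch, not the approach used above: it lets bysort do the grouping and uses _N, the number of observations in the current by-group. The toy data and the variable name name_counter_alt are made up for illustration.

```stata
* Toy data: six hypothetical cleaned first names
clear
input str10 cleaned_namefrst
"MARY"
"JOHN"
"MARY"
"ANN"
"JOHN"
"MARY"
end

* Within each name group, _N is the number of observations in that group
bysort cleaned_namefrst: gen name_counter_alt = _N

list, sepby(cleaned_namefrst) noobs
```

One difference worth keeping in mind: with this approach, observations with an empty name form their own group and receive that group's size rather than a missing value.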
Now we can take a look at common names by restricting our attention to first names with a high value for name_counter. Our dataset has a little over 800,000 observations in it. Let’s define a common name as one held by at least 0.5% of the population, or roughly 4,000 people in our sample:
. tab cleaned_namefrst if name_counter>4000

cleaned_namefrst │      Freq.     Percent        Cum.
─────────────────┼───────────────────────────────────
             ANN │      5,572        1.53        1.53
            ANNA │      7,732        2.12        3.65
        CAROLINE │      4,266        1.17        4.81
       CATHARINE │      4,320        1.18        6.00
         CHARLES │     13,150        3.60        9.60
           DAVID │      5,276        1.45       11.05
          EDWARD │      5,413        1.48       12.53
           ELIZA │      7,209        1.98       14.51
       ELIZABETH │     12,540        3.44       17.94
           ELLEN │      6,484        1.78       19.72
            EMMA │      6,565        1.80       21.52
           FRANK │      6,063        1.66       23.18
          GEORGE │     16,222        4.44       27.62
           HENRY │     13,209        3.62       31.24
           JACOB │      4,338        1.19       32.43
           JAMES │     21,388        5.86       38.29
            JANE │      7,544        2.07       40.36
            JOHN │     41,210       11.29       51.65
          JOSEPH │      9,447        2.59       54.24
           JULIA │      4,822        1.32       55.56
          LOUISA │      4,508        1.24       56.79
        MARGARET │      8,119        2.22       59.02
          MARTHA │      9,631        2.64       61.66
            MARY │     51,407       14.09       75.74
           NANCY │      6,608        1.81       77.55
          ROBERT │      5,745        1.57       79.13
          SAMUEL │      5,840        1.60       80.73
           SARAH │     19,357        5.30       86.03
           SUSAN │      6,104        1.67       87.71
          THOMAS │     10,709        2.93       90.64
         WILLIAM │     26,803        7.34       97.98
              WM │      7,358        2.02      100.00
─────────────────┼───────────────────────────────────
           Total │    364,959      100.00
These are our common names. Let’s see if we notice any differences in average wealth across these names:
. tab cleaned_namefrst if name_counter>4000, sum(realprop)

cleaned_namefrs │    Summary of Real estate value
              t │        Mean   Std. Dev.       Freq.
────────────────┼────────────────────────────────────
            ANN │   109.21752   955.42588       5,572
           ANNA │   66.638645   1008.8238       7,732
       CAROLINE │   149.34365   2693.7278       4,266
      CATHARINE │   157.38287   1835.2368       4,320
        CHARLES │   494.81057   4844.0074      13,150
          DAVID │    1072.508   4170.7742       5,276
         EDWARD │   587.57325   3896.3046       5,413
          ELIZA │   160.68525   2472.5584       7,209
      ELIZABETH │    141.8437   1753.4728      12,540
          ELLEN │    44.13942   667.04011       6,484
           EMMA │   99.923839   3231.6566       6,565
          FRANK │   233.15026   2011.2645       6,063
         GEORGE │   695.88263   4339.7406      16,222
          HENRY │   796.44235   7830.7822      13,209
          JACOB │   1315.0251   4425.5907       4,338
          JAMES │   655.86857   5037.8151      21,388
           JANE │    173.6983   2463.7724       7,544
           JOHN │   809.22575   4778.7251      41,210
         JOSEPH │   698.06658    2833.772       9,447
          JULIA │   189.91705   6142.7525       4,822
         LOUISA │   86.435226   1211.0327       4,508
       MARGARET │   89.779529   912.80828       8,119
         MARTHA │   140.83647   4398.3662       9,631
           MARY │   90.501002   1321.2063      51,407
          NANCY │   131.15224   1706.8991       6,608
         ROBERT │   641.41897    3219.652       5,745
         SAMUEL │   1240.1541   14502.762       5,840
          SARAH │   94.800021   937.85688      19,357
          SUSAN │   146.54653    1776.962       6,104
         THOMAS │   787.60855   8578.0982      10,709
        WILLIAM │   716.34164   4307.9256      26,803
             WM │   845.30402   3604.0505       7,358
────────────────┼────────────────────────────────────
          Total │   444.69382   4305.0563     364,959

. tab cleaned_namefrst if name_counter>4000, sum(persprop)

cleaned_namefrs │ Summary of Value of personal estate
              t │        Mean   Std. Dev.       Freq.
────────────────┼────────────────────────────────────
            ANN │   48.433058   460.89857       5,572
           ANNA │   28.609674   444.81434       7,732
       CAROLINE │   170.76535   5797.5976       4,266
      CATHARINE │   82.553009   1350.2736       4,320
        CHARLES │   273.83779    3578.223      13,150
          DAVID │   510.25095   3007.7902       5,276
         EDWARD │   481.44892   5824.4034       5,413
          ELIZA │   52.684006   466.95549       7,209
      ELIZABETH │   61.518979   819.20172      12,540
          ELLEN │   16.076342   272.23555       6,484
           EMMA │   10.076161   280.83655       6,565
          FRANK │   150.06762   1934.0911       6,063
         GEORGE │   314.35643   2357.7149      16,222
          HENRY │   329.58112   2378.1733      13,209
          JACOB │   440.14292   1728.5116       4,338
          JAMES │   267.91495    1616.831      21,388
           JANE │   54.068134   646.57473       7,544
           JOHN │   350.72184   3022.5267      41,210
         JOSEPH │   317.80872   2051.4446       9,447
          JULIA │   100.20966   2706.0297       4,822
         LOUISA │   54.941437    1519.506       4,508
       MARGARET │   30.520015    369.5543       8,119
         MARTHA │   97.966982   3693.4969       9,631
           MARY │   39.144805   673.07076      51,407
          NANCY │   65.822791   1338.7636       6,608
         ROBERT │   373.76432   5307.7804       5,745
         SAMUEL │    429.6351   2346.4793       5,840
          SARAH │   62.792633   1327.9369      19,357
          SUSAN │   42.281291    515.4781       6,104
         THOMAS │   297.28079   3589.8667      10,709
        WILLIAM │   315.04511   2477.8445      26,803
             WM │   493.60356   4447.2619       7,358
────────────────┼────────────────────────────────────
          Total │   201.17816   2421.4008     364,959
We have a slight problem here. Our common female names have much lower average wealth values than our common male names. This is unsurprising given that, during this time period, wealth will most likely be listed under the husband but not the wife in a household. For a more meaningful comparison, let’s construct two different measures of household wealth: one that captures the sum of wealth across individual household members and one that is simply the highest wealth value reported by any individual in the household. We can identify households using the serial variable. We will also calculate the highest occscore within the household:
. sort serial
. gen hh_realprop_sum = 0
. gen hh_realprop_max = 0
. replace hh_realprop_sum = realprop if serial~=serial[_n-1]
(10,229 real changes made)
. replace hh_realprop_max = realprop if serial~=serial[_n-1]
(10,229 real changes made)
. replace hh_realprop_sum = hh_realprop_sum[_n-1] + realprop if serial==serial[_n-1]
(355,556 real changes made)
. replace hh_realprop_max = realprop if serial==serial[_n-1] & realprop>realprop[_n-1]
(69,537 real changes made)
. gsort serial -hh_realprop_sum
. replace hh_realprop_sum = hh_realprop_sum[_n-1] if serial==serial[_n-1]
(304,280 real changes made)
. gsort serial -hh_realprop_max
. replace hh_realprop_max = hh_realprop_max[_n-1] if serial==serial[_n-1]
(525,977 real changes made)

. sort serial
. gen hh_persprop_sum = 0
. gen hh_persprop_max = 0
. replace hh_persprop_sum = persprop if serial~=serial[_n-1]
(57,231 real changes made)
. replace hh_persprop_max = persprop if serial~=serial[_n-1]
(57,231 real changes made)
. replace hh_persprop_sum = hh_persprop_sum[_n-1] + persprop if serial==serial[_n-1]
(572,785 real changes made)
. replace hh_persprop_max = persprop if serial==serial[_n-1] & persprop>persprop[_n-1]
(42,419 real changes made)
. gsort serial -hh_persprop_sum
. replace hh_persprop_sum = hh_persprop_sum[_n-1] if serial==serial[_n-1]
(188,884 real changes made)
. gsort serial -hh_persprop_max
. replace hh_persprop_max = hh_persprop_max[_n-1] if serial==serial[_n-1]
(616,810 real changes made)
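. sort serial
. gen hh_occscore_max = 0
. replace hh_occscore_max = occscore if serial~=[_n-1]
(282,644 real changes made)
. replace hh_occscore_max = occscore if serial==serial[_n-1] & occscore>occscore[_n-1]
(0 real changes made)
. gsort serial -hh_occscore_max
. replace hh_occscore_max = hh_occscore_max[_n-1] if serial==serial[_n-1]
(690,419 real changes made)

Note that the first replace in the occscore block compares serial with [_n-1] rather than serial[_n-1], which is presumably a typo. Judging by the reported counts, the condition held for essentially every observation, so hh_occscore_max starts out equal to each person's own occscore and the second replace has nothing left to change. The gsort and the final replace still assign the largest value in each household to every member, so the end result is the household maximum we wanted.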
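The household totals and maxima can also be computed in one step each with egen. The following is an alternative sketch on made-up data, not the method used above; the variable names ending in _alt and the toy values are assumptions for illustration.

```stata
* Toy data: three hypothetical households identified by serial
clear
input serial realprop
1 0
1 1500
1 200
2 0
2 0
3 800
end

* Sum of real property across all members of each household
egen hh_realprop_sum_alt = total(realprop), by(serial)

* Largest real property value reported by any member of the household
egen hh_realprop_max_alt = max(realprop), by(serial)

list, sepby(serial) noobs
```

Both functions skip missing values: total() treats them as zero and max() ignores them. That may or may not be what you want if wealth is missing for some household members.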
Now we can do a fairer comparison of wealth across common first names:
. tab cleaned_namefrst if name_counter>4000, sum(hh_realprop_max)

cleaned_namefrs │     Summary of hh_realprop_max
              t │        Mean   Std. Dev.       Freq.
────────────────┼────────────────────────────────────
            ANN │   4292.0352   16630.774       5,572
           ANNA │   4745.5934   14170.522       7,732
       CAROLINE │   4183.2278   11282.965       4,266
      CATHARINE │   4250.5255     11988.4       4,320
        CHARLES │   4351.0875   12332.875      13,150
          DAVID │   4329.5866   11066.307       5,276
         EDWARD │   4581.5365    14158.56       5,413
          ELIZA │   4201.5937   14254.015       7,209
      ELIZABETH │   4617.5067   16263.988      12,540
          ELLEN │   4715.4917   19476.574       6,484
           EMMA │   4761.4821   13850.093       6,565
          FRANK │   4753.2436   14674.191       6,063
         GEORGE │   4286.9851   14024.497      16,222
          HENRY │   4343.9806   18780.531      13,209
          JACOB │   4670.3649   12190.851       4,338
          JAMES │   3727.4042   11979.452      21,388
           JANE │    4454.928    20005.92       7,544
           JOHN │   4029.2289   14006.478      41,210
         JOSEPH │   4023.7135   12070.107       9,447
          JULIA │   4023.9915   13329.259       4,822
         LOUISA │   3876.3951   11075.418       4,508
       MARGARET │   4151.7958   17566.872       8,119
         MARTHA │   3850.2244    14586.55       9,631
           MARY │   4343.4392   16696.405      51,407
          NANCY │   3777.7518   14301.808       6,608
         ROBERT │   3702.6956    9832.979       5,745
         SAMUEL │   4426.0663    21320.09       5,840
          SARAH │   4169.5699   13296.015      19,357
          SUSAN │    4048.422   11807.303       6,104
         THOMAS │   3897.5595   15028.204      10,709
        WILLIAM │   4171.1279   13840.455      26,803
             WM │   4889.0034   26181.796       7,358
────────────────┼────────────────────────────────────
          Total │   4228.0345   15133.884     364,959

. tab cleaned_namefrst if name_counter>4000, sum(hh_persprop_max)

cleaned_namefrs │     Summary of hh_persprop_max
              t │        Mean   Std. Dev.       Freq.
────────────────┼────────────────────────────────────
            ANN │   2234.0246   17417.099       5,572
           ANNA │   2398.7254   13537.411       7,732
       CAROLINE │   2109.7506   11457.601       4,266
      CATHARINE │   2110.3565   10544.089       4,320
        CHARLES │   2046.5221   10010.474      13,150
          DAVID │    1883.561   6488.0704       5,276
         EDWARD │   2338.9483   10969.133       5,413
          ELIZA │   1798.2559    7251.552       7,209
      ELIZABETH │   1942.6273   8714.3953      12,540
          ELLEN │    2555.891   20860.553       6,484
           EMMA │   2162.1922   9377.9149       6,565
          FRANK │   2387.3924   14155.326       6,063
         GEORGE │   1905.5353   8072.6011      16,222
          HENRY │   1882.7166   10950.825      13,209
          JACOB │   1731.6955   7105.9839       4,338
          JAMES │   1648.2388   7444.9111      21,388
           JANE │   2231.0881   16746.072       7,544
           JOHN │   1810.1869   12778.949      41,210
         JOSEPH │   1820.0874   11333.676       9,447
          JULIA │   1921.6794   8132.7159       4,822
         LOUISA │     1781.26   7879.4137       4,508
       MARGARET │   2096.0002   15131.974       8,119
         MARTHA │   2004.9226   16064.886       9,631
           MARY │   2034.8685   13582.329      51,407
          NANCY │   1602.4355   12520.717       6,608
         ROBERT │   1784.5909   8395.4167       5,745
         SAMUEL │   1978.0942   15846.906       5,840
          SARAH │   2044.2208   12648.068      19,357
          SUSAN │   1818.2477   9127.6428       6,104
         THOMAS │   1466.9695   5777.2505      10,709
        WILLIAM │   1780.6675   10889.471      26,803
             WM │   2033.1903   8976.3855       7,358
────────────────┼────────────────────────────────────
          Total │   1940.0685   11875.405     364,959
Recall that Olivetti and Paserman are focused on rich parents and poor parents choosing different names for their children. Let’s restrict our attention to children and collapse the data by name to make it a little easier to work with:
. keep if age<19
(429,152 observations deleted)
. collapse (mean) mean_persprop = hh_persprop_sum mean_realprop = hh_realprop_sum mean_occscore = hh_occscore_max
>     (sd) sd_persprop = hh_persprop_sum sd_realprop = hh_realprop_sum sd_occscore = hh_occscore_max
>     (count) n_name = hh_realprop_sum, by(cleaned_namefrst)
Now we can take a look at the variation in household wealth by children’s names. Our n_name variable will let us restrict our attention to common names, as it tells us how many children in the sample had that particular name (note that this is different from our name_counter variable, which measured the frequency of the name in the entire population). Let’s look at names that are held by at least 500 children, listing the 10 names with the highest mean household personal property and the 10 with the lowest:
. gen common_name = 0
. replace common_name = 1 if n_name>499
(108 real changes made)
. gsort -common_name -mean_persprop
. list in 1/10 if n_name>500

     ┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
     │ cleaned~t   mean_p~p   mean_r~p   mean_o~e   sd_per~p   sd_rea~p   sd_occ~e   n_name   common~e │
     ├─────────────────────────────────────────────────────────────────────────────────────────────────┤
  1. │    AMELIA   4683.733   6572.407   24.61992   49090.13   40342.23   11.11685      713          1 │
  2. │ HENRIETTA   3678.848    5457.76   24.25475   37475.78   32999.08   12.44691      526          1 │
  3. │      ROSA   3566.188   5617.619   24.65568    39675.8   35667.18   12.41829      819          1 │
  4. │ FREDERICK   3456.565    5718.45    25.5491   36027.55   32129.81   12.06074      998          1 │
  5. │      CHAS   3155.746   5809.289   26.51714   20770.77   20198.71   12.63821     1021          1 │
     ├─────────────────────────────────────────────────────────────────────────────────────────────────┤
  6. │   MATILDA   3148.551   5319.021   23.03366   37384.63   33854.36   11.48259      921          1 │
  7. │     CLARA   3038.783    5409.69   26.17065   14769.41    14761.9   12.31589     1547          1 │
  8. │     ALICE   3013.635   5074.588   24.74371   14674.97   12994.76   12.93647     2661          1 │
  9. │     HARRY   2971.353   7041.266   26.85243   7529.225    32662.7   13.44357     1091          1 │
 10. │    JENNIE    2920.01   6305.681     26.362   11588.57   19379.29   14.86505     1279          1 │
     └─────────────────────────────────────────────────────────────────────────────────────────────────┘

. gsort -common_name mean_persprop
. list in 1/10 if n_name>500

     ┌─────────────────────────────────────────────────────────────────────────────────────────────────┐
     │ cleaned~t   mean_p~p   mean_r~p   mean_o~e   sd_per~p   sd_rea~p   sd_occ~e   n_name   common~e │
     ├─────────────────────────────────────────────────────────────────────────────────────────────────┤
  1. │   PATRICK   849.7098   2587.869   23.55033   2159.022   6942.212   9.970986      765          1 │
  2. │   MICHAEL   1102.291   2928.301   23.49669   3103.881   5673.021   10.11357     1059          1 │
  3. │   BRIDGET   1186.301   2638.668   23.95361   3107.279   5976.119   10.62713      582          1 │
  4. │    MARTIN   1238.226   4319.599   21.81702   2792.288   20381.97   11.28614      705          1 │
  5. │ ELISABETH   1305.282   3932.609   23.07977   2270.936   6894.267   10.12974      514          1 │
     ├─────────────────────────────────────────────────────────────────────────────────────────────────┤
  6. │  VIRGINIA   1350.705    2860.15   22.37542   4641.422   6554.518    12.4571      594          1 │
  7. │   RICHARD   1447.788    3533.83   22.78967   4361.714   10391.61    12.2511     1355          1 │
  8. │     PETER   1451.647   4986.575   22.77657   7653.858   36967.51    11.9052     1383          1 │
  9. │   LUCINDA   1462.328   3250.463   21.94888   5365.538   8646.404    12.2797      626          1 │
 10. │   FRANCES   1474.999   3691.926   22.65911   5682.605   11718.98   12.01183     1015          1 │
     └─────────────────────────────────────────────────────────────────────────────────────────────────┘
Note that the difference in average wealth between our richest and poorest names is substantial: mean household personal property for children named AMELIA (4,683.73) is more than five times that for children named PATRICK (849.71). It is this type of variation that allows Olivetti and Paserman to use children’s names as a proxy for parents’ socioeconomic status. Play around with the data a bit more to explore the ways that first names inform us about socioeconomic status. It would also be instructive for you to repeat this exercise for last names as opposed to first names.
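As a starting point for the last-name version, here is a minimal sketch of the collapse step on made-up data. It assumes the last-name variable is called namelast (the name IPUMS typically uses, but check your copy of the dataset) and that a household wealth measure such as hh_persprop_sum has already been built as above. The surnames and values below are hypothetical.

```stata
* Toy data: hypothetical surnames and household personal property values
clear
input str12 namelast hh_persprop_sum
"SMITH" 1000
"SMITH" 3000
"MURPHY" 500
"MURPHY" 700
"MURPHY" 900
"ASTOR" 50000
end

* One row per surname: mean, standard deviation, and number of people
collapse (mean) mean_persprop = hh_persprop_sum ///
    (sd) sd_persprop = hh_persprop_sum ///
    (count) n_name = hh_persprop_sum, by(namelast)

* Richest surnames first
gsort -mean_persprop
list, noobs
```

With the real data you would want to restrict the listing to surnames with a reasonably large n_name, just as we did for first names, since rare surnames can have extreme means driven by a single household.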