The Basic Process
Each semester we start with a different cohort of Williamsburg residents. The Spring 2019 class first searched the 1920 federal census for all male children born between 1906 and 1917 living in the Williamsburg area. The focus on males reflects the need to use last names to find individuals in later censuses, since women typically changed their last names at marriage. The age restriction allows us to compare their parents' labor market outcomes in 1920 to the children's own outcomes as adults in 1930 and 1940.
Each student linked a sample of 50 individuals forward to 1930 and 1940. For each individual, students searched the complete census records on Ancestry Library by name, birth year, birth state and any other relevant information such as a spouse's name and age. Students recorded the best matches in 1930 and 1940 and a score for their confidence in each match. We then compiled all of the census data available for each match, resulting in the linked dataset provided below. The Fall 2019 class followed a similar procedure to link all young males in Williamsburg in 1910 forward to 1930 and 1940. Students in the Spring 2020 class are currently working on a dataset focused on local outcomes during Reconstruction, linking black and white individuals from the 1870 census forward to 1880 and 1900.
To get a better sense of how all of this matching works and what sort of information is available in each census, take a look at the example below.
F.D. Roosevelt, 1920
Here we have F.D. Roosevelt, age 37, in the 1920 federal census. From the census, we learn that F.D. is currently the Assistant Secretary of the Navy, residing in Washington, D.C. We also know that he is married to Eleanor, born in New York in 1882, and has children named Anna, James, Elliot, Franklin and John. These details can be used to search for F.D. in the 1930 census. Click here to see the full census manuscript page.
Hon. Franklin Roosevelt, 1930
We have a match in the 1930 federal census: the Hon. Franklin Roosevelt, with the right age, the right birth state and the right family members. Franklin is showing a bit of geographic and occupational mobility; he is now living in New York and working as Governor of New York. Click here to see the full census manuscript page.
Franklin D. Roosevelt, 1940
Searching for Franklin by name, birth year, birth state and family members in the 1940 federal census yields another match. We can see he has moved locations and jobs yet again, now residing at 1600 Pennsylvania Avenue in Washington, D.C., and listing his occupation as 'President of U.S.A.' Click here to see the full census manuscript page.
The linked data are provided in Stata, Excel and CSV formats below. Cleaning and documenting the data is an ongoing process. In addition, each subsequent class will be adding new linked observations to the dataset. Check back in the future to see if newer versions of the data have been posted. There are a few things to keep in mind when using the data:
While all of the data used to construct this linked sample are public, the names and addresses of individuals have been removed to respect the privacy of people in the dataset and their descendants.
Students were instructed to identify a best match for each person, even if it was a poor match. The data contain two variables that researchers can use to identify reliable matches. The first, MatchType, indicates whether the match was considered good or bad and, if good, whether it was unique. In general, a value of bad indicates that the student does not think this could be the right person. Non-unique, good matches are cases where the individual looks like a plausible match but another individual looks like an equally good match. The second, MatchQuality, gives the student's assessment of how good the match is on a scale from zero (a terrible match) to 10 (a perfect match).
The variables generated by the data collection process are MatchType and MatchQuality, described above; LinkingYears, giving the linking strategy for that particular semester (e.g. linking 1910 records to 1930 and 1940); and IdNumber, a unique identifier for each individual. All other variables are data from the census manuscript pages. Variable names should be descriptive enough to identify the relevant column on the manuscript page. Links to sample manuscript pages from the National Archives censuses are provided below:
1910 Manuscript Page, 1920 Manuscript Page, 1930 Manuscript Page, 1940 Manuscript Page
The data are panel data with three observations per individual: one for the original census year and one each for the 1930 and 1940 links. Use the IdNumber variable to identify individuals and the Year variable to identify the census year.
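As a rough illustration of the notes above, the following Python sketch filters the linked observations down to reliable matches and groups the panel by individual. It uses only the standard library and a few made-up rows in place of the real CSV file. The column names IdNumber, Year, MatchType and MatchQuality come from the descriptions above, but the Occupation column, the value labels ("good unique", "good non-unique", "bad"), the blank match fields on base-year rows and the quality cutoff of 7 are all assumptions for illustration; check them against the actual file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Toy rows standing in for the CSV release. The Occupation column, the
# MatchType labels and the blank match fields on base-year rows are
# assumptions, not the documented coding of the real data.
SAMPLE_CSV = """IdNumber,Year,MatchType,MatchQuality,Occupation
1,1920,,,none
1,1930,good unique,9,clerk
1,1940,good unique,8,manager
2,1920,,,none
2,1930,good non-unique,6,farmer
2,1940,bad,1,laborer
"""


def reliable_links(rows, min_quality=7, good_label="good unique"):
    """Keep linked rows whose MatchType equals good_label and whose
    MatchQuality is at least min_quality (an arbitrary example cutoff)."""
    kept = []
    for row in rows:
        if row["MatchType"] != good_label:
            continue  # skips other match types and rows with no match info
        if float(row["MatchQuality"]) >= min_quality:
            kept.append(row)
    return kept


def to_panel(rows):
    """Group rows by individual: {IdNumber: {Year: row}}."""
    panel = defaultdict(dict)
    for row in rows:
        panel[row["IdNumber"]][int(row["Year"])] = row
    return dict(panel)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
panel = to_panel(rows)                  # every observation, by person and year
links = to_panel(reliable_links(rows))  # only the links judged reliable

for id_number, by_year in panel.items():
    trusted_years = sorted(links.get(id_number, {}))
    print(id_number, "observed in", sorted(by_year), "reliable links:", trusted_years)
```

To run this against the real data, replace the `io.StringIO(SAMPLE_CSV)` argument with an open file handle for the downloaded CSV. Whether to keep non-unique good matches, and where to set the quality cutoff, is a judgment call that depends on the analysis.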
Version 1.1 (Spring 2019 class data, linking from 1920 to 1930 and 1940)
Linked Data Version 1.1, Stata .dta format
Linked Data Version 1.1, Excel .xlsx format
Linked Data Version 1.1, CSV format
Version 1.2 (adds Fall 2019 class data, linking from 1910 to 1930 and 1940)
Linked Data Version 1.2, Stata .dta format
Linked Data Version 1.2, Excel .xlsx format
Linked Data Version 1.2, CSV format
Version 2.1 (Spring 2019 and Fall 2019 data with 1910 and 1920 census data for parents included)