The Census Soundex Code

The following information on the Soundex code for the US census was compiled from various messages posted to the referenced mailing lists and newsgroup.

See also the Rootsweb Soundex Converter online.

Soundex codes quoted from a message posted to the Roots-L mailing list in June 1995:

To quote Emily Anne Croom in her book Unpuzzling Your Past—A Basic Guide to Genealogy, here is the Soundex code:

1 = b, p, f, v
2 = c, s, k, g, j, q, x, z
3 = d, t
4 = l
5 = m, n
6 = r
Notice all vowels are dropped.


Campbell = C  MPB  LL
           C  51   4    (Notice that there are only 4 characters; like letters are shared.)

The following information on the Soundex code is from GenKit, a freeware genealogy utility software program, created by Ray Cox and Richard A. Pence.

The Soundex Coding System

"A soundex code is a four character representation based on the way a name sounds rather than the way it is spelled. Theoretically, using this system you should be able to index a name so that it can be found no matter how it is spelled. The WPA [Works Progress Administration] used the soundex coding system in the 1930s to do a partial indexing on 3x5 cards of the 1880 and 1900 censuses, and nearly full indexing of the 1910 and 1920 censuses. The soundex indexes of the 1880, 1900 and 1910 censuses are available on microfilm. Every soundex code consists of a letter and three numbers. The letter is always the first letter of the name, and the numbers are assigned in this way:

1 = b, p, f, v;
2 = c, s, k, g, j, q, x, z;
3 = d, t;
4 = l;
5 = m, n;
6 = r;
a, e, i, o, u, h, w, and y are disregarded. Double letters and side-by-side letters with the same value are coded only once. Add zeroes if you run out of letters before you have three numbers.

"To figure out a surname's code, do this:  JOHNSON

- eliminate any a, e, i, o, u, h, w, y          JNSN

- Write the first letter, as is,
   followed by the codes determined
   by the above rules.                       JNSN = J525"
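
The rules quoted above can be sketched in a few lines of Python. This is only an illustration, not GenKit's own code: the function name soundex is made up for the example, and two details are assumptions because the quoted text does not spell them out. First, a disregarded letter (a, e, i, o, u, h, w, y) is taken to end a run, so only letters that sit side by side in the original spelling share a digit. Second, the first letter's own code is taken to suppress an immediately following letter with the same value.

```python
# Digit assignments taken from the table quoted above.
CODES = {
    letter: digit
    for digit, letters in [("1", "bpfv"), ("2", "cskgjqxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]
    for letter in letters
}


def soundex(name):
    """Return the first letter plus three digits (hypothetical helper)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    # Assumption: the first letter's code counts as the start of a run.
    previous = CODES.get(letters[0])
    for ch in letters[1:]:
        code = CODES.get(ch)  # None for a, e, i, o, u, h, w, y
        if code is not None and code != previous:
            digits.append(code)
        # Assumption: a disregarded letter ends the current run.
        previous = code
    # Pad with zeroes, then keep the letter and three digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]


print(soundex("JOHNSON"))   # J525
print(soundex("Campbell"))  # C514
```

Both results match the worked examples quoted above. Whether "h" and "w" should really end a run is exactly the point disputed in the newsgroup discussion further down.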

On the topic of why the Soundex was created, the following is quoted from a message posted to the GEN-NYS-L mailing list in 1998:

"The following is quoted from the well-respected genealogist, lecturer, and editor, William Dollarhide in "Census Records: Look Again!" Genealogy Bulletin, No. 35, (Bountiful, Utah: AGLL, Sep-Oct. 1996) p. 16:

"When Social Security began in 1935, the first old-age pension system was established for every citizen of the United States of the age of 65 or over. An immediate concern was how to prove an age for a person applying for social security, since not very many people could produce a birth certificate in 1935. Many people who were qualified could not prove their age.

"To counter this problem, a special branch of the Census Bureau was created, called the Age Search group. This group would take a person's application for social security, and attempt to find that same person in a census record where a name and age would be given. It was soon determined that indexes would be needed to speed up the work of finding a particular person's name and age listing.

"The Census Bureau hired the Rand Corporation to design an indexing system based on phonetic sounds for a name, which became known as "soundex." Under the supervision of the Age Search group, the Works Progress Administration (WPA) employed several hundred clerical workers to create the indexes to the 1880, 1900, and 1920 censuses. For several months, the WPA workers prepared index cards for heads of households from the 1880 census with children 10 years or younger, as well as the index cards for all heads of households from the 1900 and 1920 censuses. The soundex code was given at the top of the index card, followed by the name of the head of the household, and the names and ages of each member of the family were listed below, showing a citation to the census schedules on which they appeared. The cards were then arranged by the soundex codes for each census index.

"For the Age Search group's purposes, it was decided that the 1880 census did not need to be completely indexed. People in 1935 who were 55-65 years old would have been 10 years or younger in 1880. The 1880 index, therefore, was to be used to provide another check to confirm a person's age. Since the only copy of the 1890 census had been destroyed by fire, the Age Search group decided they needed to have a complete heads of households index for the 1900 and 1920 censuses.

"Many years later, the Age Search group on their own undertook a census index of the 1910 census, but limited the index to just those states that did not have state-wide birth registration by 1910. The 1910 index was the first census index to employ the use of computers. Because of the Rand Corporation's trademark restrictions, the Age Search Group couldn't use the term "Soundex," so they called their index "Miracode" instead. The coding used for both the soundex and miracode systems, however, was exactly the same. Today all the soundex cards prepared by the WPA for the Age Search group have been microfilmed and made available to genealogists."

On the subject of problems using the Soundex code, the following is quoted from a message posted to the soc.genealogy.misc newsgroup in February, 1998:

[Quotation of an earlier message:]

> It appears that the guidance provided today by various
> sources on how to generate Soundex codes is not the
> same as the guidance provided to those that actually
> assigned Soundex codes in the 1930's (WPA projects).
> I would code my name "KATSCHKE" using the guidance
> in the "SOURCE" and the Soundex indexes and would
> come up with the code "K322." However, I could not
> find this name in the Soundex films until I looked
> under "K320."
> Obviously a lot of time was lost looking for the
> wrong code. Looking for other cases like this I've found
> PATSCHKE (P320) and DEMSHKI (D520) that are also
> incorrectly coded using today's guidance. I wasn't here
> back in the 30's when the Soundex project was being
> done, nor do I have access to the guidance given out to
> the coders at that time. But it appears that the problem
> is with the "H." Today we treat it like any vowel and "W"
> and "Y" and drop it, but equally important, the dropped
> character acted as a divider between the characters on
> either side. For example the "SCHK" converts to "22,"
> i.e., the "C" and the "K" are not considered adjacent to
> each other. However the original coders apparently
> disregarded the "H" as if it did not exist and coded "SCHK"
> the same as "SCK" or a single "2." I suspect that the original
> rules stated that the "H," perhaps because it was considered
> a "silent" character, was disregarded as if it didn't exist.
> This may have applied to the "W" as well, but I have not
> seen any cases to prove it either way.
> Is this assumption possibly correct?
> Have you run across any names that coded differently using
> today's guidance from what is shown in the Soundex films?
> Do you know of any other names that have any of the letters
> C, S, K, G, J, Q, X, or Z on both sides of the letter H or the
> letter W, such as SHS, CHS, KHZ, SWS, KWS, CWK, etc.? If you
> do, I would like to look the name up in the Soundex films
> and compare their coding with today's rules for coding.
> How about an H or W between B, P, F, or V, e.g., PHF,
> BHV, FWP, etc.? Or an H or W between D or T like DHT or
> DHD, TWD, TWT, etc.? Or an H or W between M or N like
> MHN or NWN?
> I have looked at a number of computer programs (Ancestral
> Quest, Family Origins, Legacy, The Master Genealogist,
> Ultimate Family Tree, Family Gatherings and WinFam) and
> they all code these names incorrectly. Do you know of any
> computer program that properly codes KATSCHKE as K320
> (not K322) or DEMSHKI as D520 (not D522)?
> Any suggestions as to why there might be these differences
> between 1930's Soundexes and today's Soundexes?

[Posted Reply]

It appears that almost all Soundex generators found in genealogy programs use the coding rules published by the National Archives, which are evidently incorrect for the censuses.

The book Genealogical Resources in the New York Metropolitan Area by Estelle M. Guzik (Jewish Genealogical Society, 1989) has a table (Appendix C, p. 382) comparing the Soundex coding schemes attributed to the National Archives, the Hebrew Immigrant Aid Society, the NY City Health Dept, and the NY State Health Dept. The HIAS system uses (among other things) the embedded 'H' rule earlier described. The NYS Health Dept uses both the embedded 'H' and 'W' rules mentioned earlier.

The book also says that the Soundex system used by the National Archives was patented by Robert C. Russell, April 2, 1918 (#1,261,167). It would be interesting to look up the original patent to see which rules are given.

Another recent poster presented the name "ASHCROFT" as an example where the National Archives rules did not match the actual census Soundex. He said he checked it against (at least) the 1880, 1900, and 1920 Soundexes for several (unspecified) states and found a consistent mismatch.

It may not be very comforting, but any of the programs which use the National Archives scheme can be made to work by just omitting the 'H' where it is embedded in a surname. I've yet to learn of a surname with an embedded 'W' which causes the problem, but I'm sure there must be one, probably Eastern European.
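
The difference between the two readings, and the workaround of dropping the embedded 'H', can be tried out with a short Python sketch. It is an illustration under stated assumptions, not the code of any program named above: the function name soundex_variant and the flag hw_separates are invented for the example, and treating 'W' the same way as 'H' is a guess that the discussion above leaves unconfirmed.

```python
# Digit assignments from the Soundex table quoted earlier on this page.
CODES = {
    letter: digit
    for digit, letters in [("1", "bpfv"), ("2", "cskgjqxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]
    for letter in letters
}


def soundex_variant(name, hw_separates=True):
    """Hypothetical helper comparing the two treatments of H and W.

    hw_separates=True:  H and W end a run, like vowels (the rule the
                        poster says modern programs follow).
    hw_separates=False: H and W are skipped as if absent (the rule the
                        poster suspects the 1930s coders used).
    """
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0])
    for ch in letters[1:]:
        if ch in "hw" and not hw_separates:
            continue  # skipped entirely, so the current run continues
        code = CODES.get(ch)  # None for vowels, y, and separating h/w
        if code is not None and code != previous:
            digits.append(code)
        previous = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]


for name in ("KATSCHKE", "PATSCHKE", "DEMSHKI"):
    print(name, soundex_variant(name), soundex_variant(name, hw_separates=False))
# KATSCHKE K322 K320
# PATSCHKE P322 P320
# DEMSHKI D522 D520

# The workaround: drop the embedded H by hand and use the modern rule.
print(soundex_variant("KATSCKE"))  # K320
```

With hw_separates=False the sketch reproduces the codes the poster reports finding on the films (K320, P320, D520), which is consistent with the embedded 'H' explanation but does not prove it.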

Some database software libraries include Soundex coding functions with proprietary rules which are different from any of the above. Beware of confusion because of this.

It would be worthwhile to collect a list of names and copies of the census Soundex cards where the National Archives system does not produce the same result and notify the National Archives about this. It's probable they're not officially aware of the problem.