[A version of this article appeared in Key Words, the newsletter of the American Society of Indexers, May/June & July/August 1997 (vol. 5, nos. 3 and 4).]
For information on subscribing to
Contact Bill Earle, the editor, at editor@nilesregister.com, or see http://www.nilesregister.com
CREATING ELECTRONIC DATABASES FROM HISTORICAL PERIODICALS
W.H. Earle
INTRODUCTION
Newspapers and magazines from bygone eras are among the printed materials that will be digitized and made available electronically in the coming decades. Historians, genealogists, and other researchers are already looking forward to being able to search the full texts of, say, Godey's Lady's Book or The National Intelligencer. If mere "full text" is what these electronic products offer, however, searchers are likely to be disappointed. In this arena, full text -- the summum bonum researchers hope for and database creators strive toward -- isn't good enough.
The problem is that the search strings likely to be used by a searcher will routinely fail to retrieve the material the periodical contains. Archaic spellings, styles, and usages will prevent most students and even most scholars from being able to devise appropriate search strings. As the simplest kind of example, imagine a student searching for references to electricity in periodicals of the late 18th and early 19th centuries. If the student uses "electricity" as a search term, he will retrieve nothing even if the full texts he is searching do indeed refer to electricity -- because the texts call it "galvanism" rather than "electricity." For the student's search to work, therefore, the editor of the database will have to have inserted the term "electricity" into any appropriate articles. In fact, as illustrated below, it will probably be necessary to insert indexing terms that are meaningful to modern searchers into virtually every article from a historical periodical in order for a database created from it to be truly usable by a broad audience.
What follows is an outline of some of the problems that the creators of databases from historical periodicals must be prepared to deal with. The illustrations are not exhaustive, but they certainly cover the largest problems the database creator will encounter. They were derived from work on Niles' Register, the newsweekly that was published from 1811 until 1849. The product of that work was a cumulative index to the Register rather than a full-text database, but the problems involved in devising appropriate, meaningful index entries are precisely those that a database creator will encounter in devising artificial search strings to remedy the limitations of a full-text database. For simplicity's sake, the outline is organized into sections on Persons, Places, and Things.
PERSONS
Persons - Changes in Spelling. The spelling of words was much more variable in olden days than it is today, and this freedom extended to personal names. For example, in the earliest years of Niles' Register, James Monroe was generally referred to as James Munroe. Unless the database editor inserts "James Monroe" as an artificial indexing term, references to the individual will not be retrievable by the searcher who doesn't realize that multiple spellings are possible.
Foreign names are particularly prone to this difficulty, especially where a non-Roman alphabet is involved. Niles' Register usually referred to the great native war leader who resisted British incursions in India as "Tippoo" and called the great Egyptian pasha of the early 19th century "Mehemit Ali," but the forms of these names likely to be used by a modern searcher are "Tipu" and "Mohammed Ali." Unless these latter terms are artificially inserted in a full-text database, references to these individuals will not be retrievable.
Persons - Reference by Title. It was not uncommon in the nineteenth century for newspaper articles to refer to an individual only by a title, without mentioning any name, even a last name. Thus an article about John C. Calhoun might refer to him only as "the Secretary of War." Unless his name is artificially inserted into such an article, the article will not be retrievable by a search for his name. A searcher aware of this problem could attempt to date-limit a search for the term "Secretary of War" to the period when Calhoun was in office, but this is clumsy and not likely to work well. (For example, it will fail to retrieve articles where the title "Secretary of War" appears in the archaic form "Secretary at War," as it does in volume 16 on page 112 (April 3, 1819).) Merely linking "secretary" and "war" together with a Boolean AND would retrieve a fairly complete set of responses, but it would be corrupted with a large number of irrelevant responses, too -- for example, references to actions taken by the "Secretary of the Navy ... during the late war."
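The title problem can be sketched in a few lines of code. What follows is only a minimal illustration, assuming the editor keeps a small table of office holders; the table layout, the function name, and the approximate years of service are assumptions made for the example, not a description of how any actual index was built.

```python
import re
from datetime import date

# Hypothetical lookup table: (title pattern, first year, last year, full name).
# The years are approximate and serve only as an illustration.
TITLE_HOLDERS = [
    (re.compile(r"\bSecretary\s+(?:of|at)\s+War\b", re.IGNORECASE),
     1817, 1825, "John C. Calhoun"),
]

def names_for_titles(text, issue_date):
    """Return full names implied by titles found in text for a given issue date."""
    names = []
    for pattern, first_year, last_year, full_name in TITLE_HOLDERS:
        in_office = first_year <= issue_date.year <= last_year
        if in_office and pattern.search(text) and full_name not in names:
            names.append(full_name)
    return names

# The archaic form "Secretary at War" is matched as well as "Secretary of War".
print(names_for_titles("A report from the Secretary at War.", date(1819, 4, 3)))
```

Because the pattern requires the whole title, a passage about the "Secretary of the Navy ... during the late war" is not matched, which avoids the false hits that a bare Boolean AND would produce.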
Changes in titles can also present problems for searchers in a database constructed from historical periodicals. One version of this problem relates to titles of nobility. If the right years of publication are cumulated together, the Duke of Wellington could appear in a text as Arthur Wesley (until 1798), Arthur Wellesley (1798 to 1809), Viscount
Another version of this problem relates to military officers. If enough years of text are cumulated together in one database, a given Army officer might be referred to as Lt. Smith, Capt. Smith, Maj. Smith, Col. Smith, and Gen. Smith. All the references would be to the same individual, but a definitive full name would need to be inserted into each article to make that fact apparent to a searcher.
Yet another version of this problem relates to the married names of women. One famous American beauty in the first half of the 19th century was at various times referred to as Mary Ann Caton, Mary Ann Caton Patterson, Mary Ann Patterson, Mrs. Robert Patterson, and the Marchioness of Wellesley. To complicate matters further, she sometimes used the name Marianne. Unless some consistent search string denoting her is inserted in a full-text database artificially, references to her would be extremely difficult to retrieve.
Persons - Abbreviations. Abbreviations of personal names in the text of Niles' Register generally took two forms. One form involved short-hand references like "Mr. E." (volume 66, page 415,
The more common form of abbreviation, however, involved the use of initials in articles containing letters or reports with signatures on them. For example, on page 300 of volume 30 of Niles' Register (June 24, 1826), there is some correspondence involving Henry Clay in which his name appears only as "H. Clay." This was a standard signature for him. Unless "Henry Clay" is inserted into the text of this article as an artificial search term, the article could easily be overlooked in a search for references to him.
Persons - Periphrastic References. "Periphrastic references" are references to individuals by catch phrases that would have been clear to a contemporary reader, but which will thwart a modern-day searcher seeking information about a given individual. They are not uncommon today in opinion columns in newspapers -- a given columnist, for example, may write about "Slick Willie" without ever mentioning the name Bill Clinton -- but they were more common in earlier eras, when journalistic writing was more florid and self-consciously stylish. For example, an article in volume 17 of Niles' Register (page 10,
Here are some other examples of the same problem, all drawn from Niles' Register:
* In volume 21, page 245 (
* In volume 26, page 344 (
* In volume 30, page 427 (
* In volume 33, page 34 (September 15, 1827) there is a reference to "the old general who conquered the savages of the south west, and rescued New Orleans from [Britain's] fatal grasp." Neither "Andrew Jackson" nor "Jackson" appears in the article.
Persons - Ambiguous References. This difficulty is coming to be known as "the Mr. Smith problem." It refers to personal name references which would have been comprehensible to a contemporary reader, but which can only be interpreted with great difficulty nowadays. For example, Niles' Register might quote a speech by "Mr. Smith" in the Senate in a certain year. Contemporary readers would have known that the only Mr. Smith then serving in the Senate was Samuel Smith, the powerful Republican from
Furthermore, the problem gets worse when one begins cumulating multiple volumes of a given historical periodical or of multiple periodicals. In the case of Niles' Register, a reference to "Mr. Smith" in the Senate in one volume might denote Samuel Smith while a reference to Mr. Smith in the House in the same volume might denote Caleb Blood Smith of Indiana. So long as there was only one Mr. Smith in the House and one in the Senate, no further identification would have been necessary, and the text would contain none. Ten volumes later, however, identical references to "Mr. Smith" in the Senate and House might have referred not to Samuel Smith and to Caleb Blood Smith, but to some other senator named Smith and some other representative named Smith. Again, if there was only one "Mr. Smith" in each house at the time, no further identification would have been necessary for a contemporary reader. However, when one begins cumulating multiple years of such references -- whether in an index or in full text -- they quickly become meaningless. Unless the editor has artificially distinguished the references by elaborating all the "Mr. Smith" references into full names (plus dates of birth and death when necessary), tens of thousands of entries will congregate under the ambiguous entry for "Mr. Smith." Such indiscriminate, hodgepodge attributions would of course be utterly useless -- and they will get worse and worse (and more and more useless) as the database is made "more powerful" by the incorporation into it of additional years of a given publication or of multiple publications.
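One way to picture the editorial work involved is as a lookup against a roster of members. The sketch below is hypothetical throughout: the roster layout, the function name, and especially the volume ranges are invented for illustration and do not reflect anyone's actual terms of service.

```python
# Hypothetical roster: (chamber, surname, first volume, last volume, full name).
# The volume ranges are invented for illustration only.
ROSTER = [
    ("Senate", "Smith", 1, 10, "Samuel Smith"),
    ("House", "Smith", 1, 10, "Caleb Blood Smith"),
    ("Senate", "Smith", 11, 20, "Another Senator Smith (placeholder)"),
]

def resolve_member(chamber, surname, volume):
    """Return the single full name matching the reference, or None when the
    reference is unknown or still ambiguous and needs an editor's judgment."""
    matches = [
        full_name
        for ch, sn, first, last, full_name in ROSTER
        if ch == chamber and sn == surname and first <= volume <= last
    ]
    return matches[0] if len(matches) == 1 else None

# The same words, "Mr. Smith" in the Senate, resolve differently by volume.
print(resolve_member("Senate", "Smith", 5))
print(resolve_member("Senate", "Smith", 15))
```

Returning None rather than guessing reflects the point made above: where the roster cannot settle the matter, the reference has to go to a human editor.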
PLACES
Places - Changes in Names. Listed below are some illustrative 19th-century geographic terms as they appeared in Niles' Register and their 20th-century equivalents for the same entities. Some are terms that most searchers would think of in trying to devise a search expression in a historical periodical database, but many are not. In general, therefore, modern equivalents will need to be artificially inserted into a full-text historical periodical database in order to make geographic entries retrievable.
| 19th-Century Term | 20th-Century Term |
| Ava | |
| Banda Oriental | |
| Ispahan | |
| Jedo | |
| Otaheite | |
| Yeddo | |
Places - Changes in Spelling. Even when the name of a given place has not changed, its spelling might have -- and in ways that will thwart a modern-day searcher who is unaware of the variations. Listed below are some 19th-century geographic names as they appeared in Niles' Register with the altered spellings of their 20th-century equivalents.
| 19th-Century Term | 20th-Century Term |
| Aapopka | Apopka |
| Arkansaw | |
| Beyrout | |
| Chili | |
| Cooloosahatchee | Caloosahatchee |
| Faxyardo | Fajardo |
| Milwaukie | Milwaukee |
| Nangasaki | Nagasaki |
| Nepaul | Nepal |
| Ouisconsin | Wisconsin |
| Oural Mountains | Ural Mountains |
| Outhlachuchy | Withlacoochee |
| Pekin | Peking (or Beijing) |
| Porto Rico | Puerto Rico |
| Wiskonsan | Wisconsin |
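A list of variant spellings like the one above lends itself to a simple index keyed on the modern form. The following is a rough sketch under stated assumptions: the article identifiers and sample sentences are invented, the function name is hypothetical, and plain substring matching stands in for whatever matching a real editor would use.

```python
from collections import defaultdict

# Variant spellings drawn from the list above, keyed to the modern form.
MODERN_SPELLING = {
    "ouisconsin": "Wisconsin",
    "wiskonsan": "Wisconsin",
    "milwaukie": "Milwaukee",
    "nangasaki": "Nagasaki",
    "porto rico": "Puerto Rico",
}

def build_place_index(articles):
    """Map each modern place name to the ids of articles that use a variant."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        lowered = text.lower()
        for variant, modern in MODERN_SPELLING.items():
            if variant in lowered:
                index[modern].add(article_id)
    return index

# Invented sample articles, keyed by hypothetical ids.
sample = {
    1: "Settlers are arriving in Ouisconsin.",
    2: "News from the Wiskonsan territory.",
    3: "A vessel lately from Nangasaki.",
}
print(sorted(build_place_index(sample)["Wisconsin"]))
```

A search for "Wisconsin" then reaches both archaic spellings without the searcher having to know that either exists.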
Places - Abbreviations. Unless geographical abbreviations are artificially expanded by a database editor, many references containing them will be unretrievable. A searcher cannot reasonably be expected to know that he needs to search "I.T." in order to capture references to "Indian Territory." Furthermore, he can hardly be expected to know that "
THINGS
Things - Changes in Spelling. As with persons and places, changes in spelling could throw off a searcher in a mere full-text historical periodical database that has not been appropriately augmented with modern spellings. In Niles' Register, the spelling of "molasses" was generally just that, but in an earlier era the word was spelled "melasses." As far as Niles' Register alone is concerned, a searcher interested in "cigars" would need to search that term and "segars" as well. Someone looking for references to the cloth that we call "crepe" would need to search both for that term and for "crape." A searcher interested in the "Comanche" Indians would need to search that term but also "Camanche."
Things - Abbreviations. There is a particular problem relating to displays of data in Niles' Register that certainly affects other publications as well. In listings of commercial transactions, something like the following might appear:
Flour ... $5.25 per barrel
The entry "do." means "ditto," and it means that the quotation applies to rye flour -- but unless the term "rye flour" is artificially inserted into the text, it will not appear, and it will not be retrievable. Of course, the same problem arises when double quote marks are used to mean ditto.
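The expansion of "do." entries can be sketched as a pass that carries the last fully named commodity forward. This is only an illustration: the function name is hypothetical, the "..." separator is assumed from the listing format shown above, and the second sample line, including its price, is invented for the example.

```python
def expand_dittos(lines):
    """Return one index term per line, replacing a leading "do." with the
    most recent commodity that was named in full."""
    terms = []
    last_commodity = None
    for line in lines:
        # The item name is whatever precedes the "..." separator.
        item = line.split("...")[0].strip()
        if item.lower().startswith("do."):
            qualifier = item[3:].strip(" ,")
            # "do., rye" under "Flour" becomes "rye flour".
            term = f"{qualifier} {last_commodity}".strip() if last_commodity else item
        else:
            last_commodity = item.lower()
            term = last_commodity
        terms.append(term)
    return terms

# The second line is a hypothetical entry with an invented price.
listing = [
    "Flour ... $5.25 per barrel",
    "do., rye ... $3.00 per barrel",
]
print(expand_dittos(listing))
```

A listing that uses double quote marks for ditto would need the same treatment with a different marker.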
Sometimes unfamiliarity with the term being abbreviated makes it difficult to interpret an abbreviation. It is not likely that modern searchers will search for such abbreviations as "crim. con." or "
Things - Changes in Usage. We don't always use words in the same way that our forefathers did. For example, early in the period covered by Niles' Register, the word "convention" always meant a meeting concerning a constitution. Thus an article about a constitutional convention need not mention the word "constitution" in any form since readers would understand that that was what a "convention" was for. Here are some examples:
*
·
Unless a database editor inserts the artificial search tag "constitution," these articles cannot be retrieved by a searcher interested in the state constitutions of
Listed below are some common 19th-century terms as they appeared in Niles' Register and their 20th-century equivalents. The list is hardly exhaustive, but it illustrates the difficulty a searcher might have in trying to retrieve relevant items from a mere full-text database.
| 19th-Century Term | 20th-Century Term |
| administration party | Democratic Party |
| amalgamation | mixed-race sexual relations or marriage |
| anniversary | annual meeting |
| aerostation | experiments with dirigibles |
| cars | train |
| caoutchouc | rubber |
| cracker | firecracker |
| defalcation | embezzlement |
| duties | taxes |
| emigrants | immigrants |
| freshet | flood |
| friends of Andrew Jackson | Democratic Party |
| galvanism | electricity or electromagnetism |
| gum elastic | rubber |
| ice island | iceberg |
| incendiary | arsonist |
| incendiary materials | abolitionist publications |
| inundation | flood |
| national anniversary | Fourth of July |
| sulphureous gas | natural gas |
| passengers | immigrants |
| railroad iron | rails |
| receipt | recipe |
| spermaceti oil | whale oil |
| torpedo | naval mine |
| Van Buren party | Democratic Party |
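To make the idea of inserting modern terms concrete, here is a minimal sketch that attaches modern index terms to an article wherever one of the older terms appears. It assumes a simple one-to-one mapping drawn from a few of the less ambiguous entries in the list above; the function name is hypothetical, and entries such as "cars" or "duties," whose meaning depends on context, are deliberately left out because they would call for editorial judgment rather than mechanical matching.

```python
import re

# A few of the less ambiguous entries from the list above (simplified:
# "galvanism" is mapped to "electricity" only).
MODERN_TERMS = {
    "caoutchouc": "rubber",
    "gum elastic": "rubber",
    "freshet": "flood",
    "inundation": "flood",
    "galvanism": "electricity",
    "ice island": "iceberg",
    "defalcation": "embezzlement",
}

def modern_index_terms(article_text):
    """Return the modern terms to attach to an article, without duplicates."""
    found = []
    for archaic, modern in MODERN_TERMS.items():
        # Whole-word match, allowing a simple plural "s".
        pattern = r"\b" + re.escape(archaic) + r"s?\b"
        if re.search(pattern, article_text, re.IGNORECASE) and modern not in found:
            found.append(modern)
    return found

print(modern_index_terms("A great freshet followed; experiments in Galvanism continue."))
```

The terms returned would be stored alongside the article, not substituted into it, so that the original wording stays intact for the reader.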
(In some cases, too, the use of a twentieth-century search term will pull more from a full-text historical periodical database than is warranted. The term "hurricane" in Niles' Register referred not only to
Things - Periphrastic References. As with persons, articles referring to a thing only metaphorically were common in bygone days. Thus an article on the abolition of slavery in the British West Indies (volume 47, page 17, September 13, 1834) might discuss "the vast project, to result in good or evil, [that] has just commenced" without mentioning "slaves," "slavery," "abolition," or "emancipation" anywhere in the article. Such terms would need to be inserted in a full-text database for the article to be appropriately retrievable.
Similarly, here is an example that refers to
She [a recently deceased individual] also left property worth 50,000 dollars to the college at
Another form of this problem commonly applies to such entities as railroads, canals, and wars in Niles' Register. A modern searcher will generally need a text with a proper name in it to retrieve articles about a given railroad, canal, or war, but articles in Niles' frequently omitted such entries. Thus, in volume 41, page 249 (
CONCLUSION
The discussion above identifies at least some of the difficulties that an editor will need to confront in order to convert the text of a historical periodical into a searchable electronic database that yields satisfactory results for the vast majority of searchers. Future editors of this type of work will perhaps be able to identify additional problem areas, and they will certainly be able to add to the examples used as illustrations in this essay. None of these problems is insurmountable, however, and attention to them will allow editors to create databases in which users can actually find what they are seeking.