ARCHIVE - news aggregator | The Science of Language

Semantic fail

Language Log - Thu, 2009-09-03 19:11

Leena Rao at TechCrunch points out a case where semantic search turned into anti-semitic search.

This morning I wrote about NetBase Solutions’ healthBase, a semantic search engine that aggregates medical content from millions of authoritative health sites including WebMD, Wikipedia, and PubMed. But is it a semantic engine or an anti-semitic search engine?

Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.

The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?

Apparently this was not the result of amalgamating medical advice from Hamas, but rather a consequence of some artificial stupidity applied to Wikipedia, as a company representative explained:

This is an unfortunate example of homonymy, i.e. words that have different meanings.
The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery. ” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.
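The conflation the representative describes can be reproduced in a few lines. What follows is only a rough sketch, not healthBase's actual pipeline: the suffix-stripping `crude_stem` and the all-caps guard in `acronym_aware_stem` are hypothetical helpers invented for illustration. Lowercasing and stemming map both "AIDS" and "aiding" to the same key, while treating all-caps tokens as acronyms keeps them apart.

```python
import re

def crude_stem(token: str) -> str:
    """Lowercase a token and strip one common suffix (a deliberately naive stemmer)."""
    word = token.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def acronym_aware_stem(token: str) -> str:
    """Leave all-caps tokens such as 'AIDS' untouched; stem everything else."""
    if token.isupper() and len(token) > 1:
        return token
    return crude_stem(token)

def matches(query: str, sentence: str, normalize) -> bool:
    """True if any word in the sentence normalizes to the same key as the query."""
    key = normalize(query)
    return any(normalize(tok) == key for tok in re.findall(r"[A-Za-z]+", sentence))

# The Wikipedia sentence quoted in the company's explanation.
SENTENCE = ("Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, "
            "and sentences all Jews to slavery.")

print(matches("AIDS", SENTENCE, crude_stem))          # naive normalization
print(matches("AIDS", SENTENCE, acronym_aware_stem))  # acronym kept distinct
```

The same naive stemmer also reduces "Jews" to "jew", which is presumably how a singular "Jew" ended up as a result label.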

And that's not the end of the fun and games:

If you look at the pros of AIDS (yes, it thinks there are pros to having AIDS), it comically lists the “Spanish Civil War.” One of the causes of hemorrhoids is “Bronco” (I don’t even want to know).

It only took a few clicks for me to get here:

Or here:

And vice versa respectively

Language Log - Thu, 2009-09-03 13:24

At some time approximately 30 to 35 years ago — that is, in the 1970s, back when disco had a future — I received a letter from my friend Jim Hurford. We were young lecturers then, me in London and him in Lancaster, though he was later to become Professor of General Linguistics at the University of Edinburgh. Here is what his letter asked me:

"Can you construct a grammatical and meaningful English sentence that ends with the words and vice versa, respectively?"

Jim is now Professor Emeritus, and I now hold the Chair that he held for so many years, and I still have not succeeded in constructing an example of the mind-twistingly difficult sort he requested.

But Jim made the mistake of asking me many years before the Internet was invented. Today there is the web, and there are blogs, and comments areas, and there is Language Log. I am sure that our ingenious and talented readers — that would be you — will take up the challenge in the comments space below. If any of them should succeed, I will present the sentence to Jim, and take all the credit. (He will not read this post, because he is working busily on a book about the evolution of grammatical structures, and has no time for things like Language Log.) You (if you are the one who succeeds) will have the inward pleasure of knowing that you solved the puzzle. I will have the gratitude and admiration of my friend Jim. And Jim will have the example sentence he has so long sought. Doesn't that sound like a win-win-win scenario to you?

[Afterword: I don't know what reminded me of the puzzle Jim set all those years ago, or why it seemed so hard at the time, or what he thought would be the relevance of the structure. It is of course not a serious surprise that Language Log readers were able to solve it (see below). But it is perhaps a surprise that some of them completed it in less than ten minutes after the post went up on this site! What is particularly wonderful is that (of course, as JS Bangs was the first to point out) today we can simply Google up genuine examples from real texts. I didn't bother because I knew you would. The real point of this post is that thirty years has changed the nature of syntactic explorations beyond imagining. Back in the 1970s when we were young, you could have devoted your whole life optimistically scanning print sources and had only decades of disappointment for your pains. Today we have the trillion-word corpus of the web to search, and the entire investigation takes maybe eight seconds. (Incidentally, I'm deleting the comments that construct sentences that merely quote the phrase. If you allow quotations of ungrammatical bits and pieces to count, then instantly every string becomes grammatical and syntax is pointlessly trivialized. If you don't, then I think it turns out that nearly all word strings are ungrammatical.) —GKP]

Misleading pseudo-scientific argument of the week

Language Log - Thu, 2009-09-03 08:15

According to Abigail Norfleet James, Teaching the Male Brain: How Boys Think, Feel, and Learn in School (2007), p 37:

The shape of the inner ear is not the same for boys and girls. As we have seen in the previous chapter, the female cochlea responds more quickly to sound than does the male cochlea (Don et al., 1993). That means that boys are likely to respond to aural information of questions just a bit slower than girls will. Because boys don't hear soft or high sounds very well and because they don't respond to sounds as rapidly as do girls, boys may have trouble with auditory sources of information.

The reference is to M. Don et al., "Gender differences in cochlear response time: An explanation for gender amplitude differences in the unmasked auditory brain-stem response", J. Acoust. Soc. Am. 94(4): 2135-2148, 1993. And yes, there really are sex differences in cochlear response time — but the distributions for males and females overlap, as usual, and the average sex differences are less than a thousandth of a second.

There are different ways to measure the response, and different frequency bands to check (you can read the paper to survey all the differences in detail), but here's a typical figure from Don et al. showing the sex differences in cochlear latency for two types of measurement in one frequency range:

And another figure showing differences for one type of measurement in different frequency ranges:

Most of the differences are in the range of .0001 to .0003 seconds (1 to 3 ten-thousandths of a second), and none are larger than .0007 sec.

In comparison, simple acoustic reaction time in adults ranges from about 120 to 300 msec. (R.D. Luce, Response times, 1986). For children, mean simple acoustic reaction times range "from 465 msec at age five to 190 msec at 15 years" (K. Andersen et al., "The Development of Simple Acoustic Reaction Time in Normal Children", Developmental Medicine and Child Neurology 26(4), 2008). "Simple acoustic reaction time" is how long it takes to respond to a sound when you know it's coming, and all you need to do is to press a key as soon as you hear it. Choice reaction times (where you need to interpret the stimulus and respond accordingly) are much longer. And the time that it takes even the most attentive and cooperative child to respond to a simple verbal instruction is more like two or three seconds, if only because the instruction itself is likely to take nearly that long to be expressed.

So the average sex difference in cochlear response time of .0002 to .0006 seconds — even if this difference is preserved through the brain stem and the cortex, and translated to the interpretation of the stimulus and the formulation and execution of the response — is roughly a thousandth of children's typical simple acoustic reaction time, and about one part in ten thousand of the time that it takes them to respond to even the simplest verbal instructions.
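The orders of magnitude in that comparison are easy to check. Here is a minimal sketch using only the figures quoted above; the variable names and the `ratio_range` helper are mine, and pairing the extremes of each range is just one simple way to bracket the ratio.

```python
# Figures quoted in the post, all in seconds.
cochlear_diff = (0.0002, 0.0006)    # average sex difference in cochlear response time
child_reaction = (0.190, 0.465)     # mean simple acoustic reaction time at ages 15 and 5
instruction_time = (2.0, 3.0)       # time to respond to a simple verbal instruction

def ratio_range(numer, denom):
    """Smallest and largest ratio obtainable from two (low, high) ranges."""
    return numer[0] / denom[1], numer[1] / denom[0]

rt_low, rt_high = ratio_range(cochlear_diff, child_reaction)
in_low, in_high = ratio_range(cochlear_diff, instruction_time)

print(f"fraction of reaction time:      {rt_low:.5f} to {rt_high:.5f}")
print(f"fraction of instruction timing: {in_low:.6f} to {in_high:.6f}")
```

The first ratio falls between about 0.0004 and 0.003, which brackets one part in a thousand; the second falls between about 0.00007 and 0.0003, which is on the order of one part in ten thousand.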

While we're looking at the Don et al. paper, though, let's also reproduce what they found about pure-tone thresholds:

[The subjects in the Don et al. study were 17 females and 14 males aged 18-38.]

Dr. James claims that "Because boys don't hear soft or high sounds very well and because they don't respond to sounds as rapidly as do girls, boys may have trouble with auditory sources of information."

I don't know whether it's really true that boys have more "trouble with auditory sources of information" than girls do. I do know, however, that when Dr. James tries to persuade her readers of this by citing research about sex differences in cochlear response times and audiometric profiles, her argument is at best irrelevant and at worst dishonest.

And this error is not an isolated one. There is a growing popular literature on the biology of human sex differences, and the use of misinterpreted or overinterpreted "scientific" evidence is all too typical of the strand of this work that emphasizes sex differences in order to argue for sex-specific (and often sex-segregated) educational practices. For more on this with respect to sex differences in audiometric thresholds, comfortable listening levels, etc., see here, here, and here.

Quiz

Language Log - Thu, 2009-09-03 05:12

What language is this?

Hint: it's one that you know.

Well, some might argue that you probably don't know it, because basilectal Jamaican Creole English is not mutually intelligible with the kinds of English that you probably do know.

And according to the discussion at Steve Cotler's Irrepressibly True Tales ("Draw Your Brakes–A Jamaican Creole Shout"), even Peter Patrick, who grew up in Jamaica and who has studied Jamaican Creole for years, didn't know what these two phrases meant. He had to consult with Kenneth Bilby, who knew the answer only because he'd asked Bunny Brown, a friend of the (deceased) speaker, David Scott, who explained a couple of crucial bits of 1970s Jamaican slang.

The study of creole languages led in the 1960s to the concept of the "creole continuum", describing the spectrum of forms between a standard language and a local variety which (in the absence of the intermediate forms) would clearly be a separate language.

Something very similar happens in cases where the rest of the "creole language" package is nowhere in sight. Thus in Italian- or German-speaking parts of Europe, people may use a spectrum of mixtures between the standard national language and a local variety, which may be different enough that speakers from another region would be unable to understand it in a pure form. And I'm told that the same sort of thing happens with mixtures of fuṣḥā (Modern Standard Arabic) and various regional/social colloquial varieties.

The usual term for these non-creole cases is "code-switching", which implies a process where the message is sometimes in one variety and sometimes in another, in contrast to the term "creole continuum", which suggests a spectrum of intermediate varieties. However, I'm skeptical that there's actually any systematic difference between what happens in Jamaica or Haiti and what happens in Stuttgart or Tunis.

Illustrating the maxim of quantity…

Language Log - Wed, 2009-09-02 07:50

… or if you prefer, some aspects of relevance theory, the next-to-latest xkcd:

Joshua Whatmough and the donkey

Language Log - Tue, 2009-09-01 13:00

At Steve Cotler's Irrepressibly True Tales, an irrepressible (and no doubt true) tale of Prof. Whatmough's Linguistics 120 at Harvard in 1962. If you read to the end, you'll find out about the donkey.

G. Nick Clements, 1940-2009

Language Log - Tue, 2009-09-01 08:15

I’m sad to report that phonologist Nick Clements passed away in Chatham, Massachusetts, on August 30. There is an obituary by Beth Hume on LINGUIST List. Beth co-organized a symposium on tones and features in honor of Nick in June; the speaker list was a veritable who’s who of phonology, of which Nick was also of course a prominent member. He will be missed.

Food Choices on Indian Airlines

Language Log - Tue, 2009-09-01 05:53

When I fly on non-Indian carriers to India, or when there are Indians (South Asians) on non-Indian carriers flying elsewhere, I often encounter individuals who complain that they cannot eat the special meals that they ordered.

Steward(ess): But, sir, you ordered a vegetarian meal, didn't you?

Passenger: Yes, but I cannot eat this kind of vegetarian meal.

Steward(ess): I can assure you, sir / ma'am, that our vegetarian meals have no meat or meat products in them.

Passenger: But what you have given me has X, Y, Z in it. I cannot eat it. Please get me something else.

Steward(ess): I am sorry, we do not have any other kinds of vegetarian meals.

Whereupon the passenger pulls out some biscuits from his / her carry-on bag and survives on them and whatever else he / she can scrounge up for the duration of the flight.

Stefan Krasowski recently booked a flight on India's Jet Airways. Here are the choices he was offered for meals:

As Stefan observes, "The mind-boggling array of options demonstrates India's diversity. I am especially curious about the 'bland meal'!"

I suppose that "bland" implies "not spicy," "spicy" being the default option for (most of the many types of) Indian cuisine.

Now, what intrigues me about this bewildering array of choices is that, although they are all in English and would seem to be fairly common fare for Indian air travelers, some of the choices may set American readers to looking things up on the web, for instance to remind themselves of what "purine" is and why someone might want a meal low in it. Before consulting Wikipedia, I was ignorant of the difference between a Vegetarian Hindu Meal and a Vegetarian Jain Meal. Even knowing what "Lacto Ovo" means, I'm still puzzled about how a Lacto Ovo Meal can be different from a Vegetarian Lacto-Ovo Meal. Furthermore, what can I expect to get if I order just a plain Vegetarian Meal, not a Strict Vegetarian Meal, or any of the otherwise modified vegetarian meals? And what would Edward Said have to say about that Vegetarian Oriental Meal if he were still around to pronounce upon it?

Even with all of these possible selections, I'd bet that somebody is going to ask for a "Low Protein Low Lactose Low Purine Meal" or a "Low Calorie Low Cholesterol Bland Meal" or a…. Starbucks has figured out how to offer customers "more than 19,000 beverage possibilities", and a similar number of menu options is implied by the global combinatorics of dietary preferences and restrictions. In this context, "vegetarian" is apparently no more informative than "coffee".

If I were flying Jet Airways, after much consideration, I think my choice would be "Bland Meal - Non Veg." Being a seasoned traveler, I always carry salt and pepper with me.

Serial improvement

Language Log - Mon, 2009-08-31 11:32

Although I share Geoff Nunberg's disappointment in some aspects of Google's metadata for books, I've noticed a significant — though apparently unheralded — recent improvement. So I decided to check this out by following up Bill Poser's post yesterday about insect species, which I thought was likely to turn up an example of the right sort. And in fact, the third hit in a search for {hemipteran} is a relevant one: Irene McCulloch, "A comparison of the life cycle of Crithidia with that of Trypanosoma in the invertebrate host", University of California Publications in Zoology, 19(4) 135-190, October 4, 1919.

This paper appears in a volume that is part of a serial publication. And until recently, Google Books routinely gave all such publications the date of the first in the series, even if the result was a decade or a century out of whack.

But no longer.

True, Dr. McCulloch's article is categorized as "Juvenile Nonfiction"… But the date is correctly given as 1919, despite the fact that the series in question began in 1902. The volume in which McCulloch's article was published contains articles dated 1919 and 1920, and the treatment of the 1920 portions is a bit variable:

Still, this is a big step forward over the situation a couple of years ago. See for example the discussion in "Shack!", 7/23/2007, where I observed that Google Book Search misdated the Nov. 1956 issue of Boeing Magazine "as 1934, following its usual unfortunate practice of dating all issues of a serial in terms of the earliest issue".

The relevant page now comes up correctly dated:

And in general, serial publications now seem to be given the date of the bound volume that was scanned, rather than the date of the first publication in the series. This still leaves many errors in the date fields of Google Books' metadata. In fact, I guess it's possible that some of the errors that Geoff found (and that anyone can turn up in a few seconds of searching) were actually created by a process for inferring publication dates from OCR output, rather than relying on whatever catalog information was previously giving them the start date for all subsequent issues of a periodical or serial publication.

Overall, it would be nice if Google were a bit more open about what's going on with their metadata for books — the successes as well as the messes — but they deserve credit for this success, even if the solution created some additional messes.

Uttaris pallidipennis in Miami

Language Log - Sun, 2009-08-30 20:08

In the news today I came across this rather strange report from the Associated Press:

MIAMI — U.S. Customs and Border Protection officials say they have intercepted a rare and dangerous insect found in a shipment of flowers at a Miami airport that could cause significant damage.

Officials said Saturday they were examining a box of flowers last week at Miami International airport when they found Hemiptera. Hemiptera's are typically aphids, cicadas, and leaf hoppers and comprise about 80,000 different species. They feed on the seed heads of grasses and sedges. The insect is found in South America.

Officials believe it is the first time the insect has been found in the U.S.

The report is inaccurate in several ways. The insect actually comes from South Africa, which is different from South America. There is no indication that it is rare. The fact that it isn't found in the United States doesn't make it rare. We don't say that tsetse flies are rare because they aren't found in the United States. Nor is there any reason to believe that it is dangerous. It may be destructive of plants, which is why this discovery is of interest, but even that isn't known. It isn't considered a pest in South Africa and it isn't known what would happen if it got loose and survived in Florida.

Several points are of linguistic interest. One is the reference to "Hemiptera". Hemiptera is the name of the order to which the insect, a member of the species Uttaris pallidipennis Stal, belongs. A member of this order is a Hemipteran.

More interesting is the statement that "Hemipteras are typically aphids, cicadas, and leaf hoppers", which is not well-formed. What the author presumably meant is: "Typical members of the order Hemiptera are aphids, cicadas, and leaf hoppers". When you say that "Hemipterans are typically X", X must describe a characteristic; it can't list examples.

I suspect that there's a reason for the erroneous description of this insect as "dangerous". The usage of this word is rather interesting. Anything that can injure or destroy something can be described as "dangerous" to it. One can say, for example, that "Aphids are dangerous to roses." But this only works if you specify what is subject to the danger. If we remove "to roses" and just say "Aphids are dangerous", the sentence becomes false. It seems that "dangerous" comes with a default bearer of the danger, namely human beings. When you say that something is dangerous without specifying to what, you mean that it is dangerous to people. Furthermore, it seems that a non-human bearer of the danger must be linguistically overt. Although we can felicitously say "Owls are dangerous to mice", even if the topic of conversation is predators of mice, it is not felicitous to say "Owls are dangerous." without adding "to mice". My guess is that someone said or thought that the insect in question might be "dangerous to crops" or something along those lines, and then, without thinking, removed "to crops".

The actual facts can be found in the much better report in the Miami Herald. The odd wording appears to have originated with Customs, in this press release. (Customs is now part of the "Department of Homeland Security" but I avoid using this name. Whenever I see it, I hear "Reichssicherheitshauptamt".)

How science reporting works

Language Log - Sun, 2009-08-30 07:08

The latest from Zach Weiner at SMBC:

(There are four more panels — click on the image to see them all.)

Unfortunately, scientists are often more complicit in the exaggerations and misrepresentations than this narrative suggests.

Sino-American Name Reversion

Language Log - Sat, 2009-08-29 19:08

Yesterday an applicant from China came to my office and introduced herself to me as Runxiao ("Moist Dawn"). However, in previous correspondence, she had always referred to herself as Layn, a variant of Lane. (Other variants of the name Lane are Laen, Laene, Lain, Laine, Laney, Lanie, Layn, Layne, and Laynne; the name means "living near a lane" or, in Gaelic, "descendant of Laighean or Luan", so say the name books.) When I asked her which name she preferred, she said, "You can call me Runxiao."

"But what about Layn?" I asked. "Didn't you used to go by the name Layn?"

"Oh, yes!" she replied cheerfully with a gleaming smile. "When I was in China, I called myself Layn, but now that I'm in America, I call myself Runxiao."

"Isn't that a bit unusual?" I queried. "You call yourself by your English name in China and call yourself by your Chinese name in America?"

"All my friends do that," she answered merrily. "We all take English names in China, but we feel better using our Chinese names in America. Of course, Americans can't pronounce 'Runxiao,' so I tell them just to call me Run. Now I'm Runxiao to my Chinese friends in America, Run to everybody else in America, and Layn to my old associates in China."

This strikes me as completely counterintuitive. However, since the practice is so widespread, there must be some compelling psychological reason for it.

Here are just a few of the Chinese students and friends I know who went by Western names in China or Taiwan, but have switched to Chinese names in America:

China/TW –> America

Sophie –> Xiaofei

Tess –> Lei (her American graduate school is trying to make her use this, but she prefers to remain Tess)

Shelly –> Xiaolei

David –> Dawei

Mary –> Mali

Löwen –> Li-wen

Peter (not Petra!) –> Li-ching ("Li" to her American friends)

Gianni –> Xiang

Naturally, they had the Chinese names first when they were young (given to them by their parents), but for one reason or another had decided to go by a Western name when they got older.

Given the large flap over North Texas State Representative Betty Brown's complaint about the difficulty of pronouncing Asian names (“Rather than everyone here having to learn Chinese — I understand it’s a rather difficult language — do you think that it would behoove you and your citizens to adopt a name that we could deal with more readily here?”) [see the comment by Lance to this blog], one might think that some of the arrows would be turned in the opposite direction.

Perhaps it is precisely because progressive Americans want Chinese to feel comfortable using their original names that they actually do decide to resurrect them after they come to America. Indeed, I have often encountered situations where a Chinese student comes to America with a Western name he / she has used for many years and feels comfortable with, even proud of, only to be told by a well-meaning American that it sounds "silly" or "odd" for a Chinese to have a Western name.

Nearly all Americans I know who go to China to stay for more than a few months take Chinese names. This is especially true of those who speak one of the Chinese languages. For example, my Chinese name is Mei Weiheng and my brother Denis' name is Mei Danli. When I am in China, I become Mei Weiheng, and nobody calls me Victor Mair there. Similarly, in China Ed Shaughnessy is Xia Hanyi, William Baxter is Bai Yiping, Jerry Norman is Luo Jierui, and so forth. Not only do we take Chinese given (personal) names, we also adopt Chinese surnames, whereas Chinese (whether in China or in the West) maintain their Chinese surname, regardless of whether they go by a Western given (first, personal) name or by a Chinese given name.

Google Books: A Metadata Train Wreck

Language Log - Sat, 2009-08-29 14:46

Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata — a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic — entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th century French editions of Le Contrat Social, or to linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.

Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few. (You can find images of most of these on my slides, here — I'm not giving the url's since I expect Google will fix most of these particular errors now that they're aware of them).

And while there may be particular reasons why the 1899 date comes up so much, these misdatings are spread out all over the place. A book on Peter Drucker is dated 1905, a book of Virginia Woolf's letters is dated 1900, Tom Wolfe's The Bonfire of the Vanities is dated 1888, and an edition of Henry James's 1897 What Maisie Knew is dated 1848.

It might seem easy to cherry-pick howlers from a corpus as extensive as this one, but these errors are endemic. Do a search on "internet" in books written before 1950 and Google Scholar turns up 527 hits.

Or try searching on the names of writers or famous people, restricting your search to works published before the years of their birth. You turn up 182 hits for Charles Dickens, more than 80 percent of them misdated books referring to the writer as opposed to someone else of the same name. The same search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

A search on books mentioning candy bar that were published before 1920 turns up 66 hits, of which 46, or 70 percent, are misdated. I'd be surprised if that proportion of errors or anything like it held up in general for books in that range, and dating errors are far denser for older works than for the ones Google received from publishers. But even if the proportion is only 5 percent, that suggests hundreds of thousands of dating errors.
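To make the scale of that estimate concrete, here is a back-of-the-envelope sketch. The 66 hits, the 46 misdated and the 5 percent floor come from the paragraph above; the corpus sizes in the loop are purely hypothetical round numbers chosen for illustration, not figures reported by Google.

```python
hits, misdated = 66, 46
sample_rate = misdated / hits     # error rate in the "candy bar" sample
conservative_rate = 0.05          # the deliberately low rate suggested above

def implied_errors(corpus_size: int, error_rate: float) -> int:
    """Number of misdated volumes implied by a corpus size and an error rate."""
    return round(corpus_size * error_rate)

print(f"sample error rate: {sample_rate:.1%}")

# HYPOTHETICAL corpus sizes (assumptions, not reported figures).
for size in (4_000_000, 7_000_000, 10_000_000):
    print(f"{size:>10,} volumes -> {implied_errors(size, conservative_rate):,} misdated")
```

At a 5 percent rate, any collection of several million volumes lands in the hundreds of thousands of errors.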

In discussion after my presentation, Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries. He was woolgathering, I think. It's true that there are a few collections in the corpus that are systematically misdated, like a large group of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's doing. Of the first ten full-view misdated books turned up by a search for books published before 1812 that mention "Charles Dickens", all ten are correctly dated in the catalogues of the Harvard, Michigan, and Berkeley libraries they were drawn from. Most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text. For example the 1604 date from a 1901 auction catalogue is drawn from a bookmark reproduced in the early pages, and the 1574 dating (as of this writing) on a 1901 book about English bookplates from the Harvard Library collections is clearly taken from the frontispiece, which displays an armorial bookplate dated 1574:

Similarly, the 1719 date on a 1919 edition of Robinson Crusoe in which Dickens' name appears in an advertisement is drawn from the line on the title page that says the book is reprinted from the author's edition of 1719. And the 1774 date assigned to an 1890 book called London of to-day is derived from a front-matter advertisement for a firm that boasts it was founded in that year.
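The failure mode in these examples is what one would expect from a heuristic that trusts the first plausible year in the front matter. The sketch below is my own guess at that kind of rule, not Google's actual extraction code, and the two sample strings are made-up stand-ins for OCR output, loosely modeled on the Robinson Crusoe and bookplate cases.

```python
import re

# Any four-digit year from 1500 to 2099.
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")

def first_year(front_matter: str):
    """Naive heuristic: take the first year-like number found in the text."""
    match = YEAR.search(front_matter)
    return int(match.group(1)) if match else None

def latest_year(front_matter: str):
    """Alternative heuristic: take the most recent year mentioned."""
    years = [int(y) for y in YEAR.findall(front_matter)]
    return max(years) if years else None

# HYPOTHETICAL OCR snippets, invented for illustration only.
CRUSOE = "ROBINSON CRUSOE. Reprinted from the author's edition of 1719. London, 1919."
BOOKPLATE = "Armorial bookplate, dated 1574. English Book-Plates. London, 1901."

for text in (CRUSOE, BOOKPLATE):
    print(first_year(text), latest_year(text))
```

Taking the latest year happens to do better on these two strings, but it would fail in other ways (on later advertisements or library stamps, for instance), which is one reason to prefer the library catalogue record when there is one.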

Then there are the classification errors. William Dwight Whitney's 1891 Century Dictionary is classified as "Family & Relationships," along with Mencken's The American Language. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as "Antiques and Collectibles." An edition of Moby Dick is classed under "Computers"; a biography of Mae West is classified as "Religion"; The Cat Lover's Book of Fascinating Facts falls under "Technology & Engineering." A 1975 reprint of a classic topology text is "Didactic Poetry"; the medievalist journal Speculum is classified "Health & Fitness."

And a catalogue of copyright entries from the Library of Congress is listed under "Drama," though I had to wonder if maybe that was just Google's little joke.

Here again, the errors are endemic, not simply sporadic. Of the first ten hits for Tristram Shandy, four are classified as fiction, four as "Family & Relationships," one as "Biography & Autobiography," and one is not classified. Other editions of the novel are classified as "Literary Collections," "History," and "Music." The first ten hits for Leaves of Grass are variously classified as "Poetry," "Juvenile Nonfiction," "Fiction," "Literary Criticism," "Biography & Autobiography," and mystifyingly, "Counterfeits and Counterfeiting."

Various editions of Jane Eyre are classified as "History," "Governesses," "Love Stories," "Architecture," and "Antiques & Collectibles" ("Reader, I marketed him").

In his response on the panel, Dan Clancy said that here, too, the libraries were to blame, along with the publishers. But the libraries can't be responsible for books mislabeled as "Health and Fitness" and "Antiques and Collectibles," for the simple reason that those categories are drawn from the BISAC codes that the book industry uses to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And inasmuch as BISAC classifications weren't in use until about 20 years ago, only Google could be responsible for their misapplications on books published earlier than that: the 1904 edition of Leaves of Grass assigned to "Juvenile Nonfiction"; the 1919 edition of Robinson Crusoe assigned to "Crafts & Hobbies"; the 1845 number of the Edinburgh Review assigned to "Architecture"; the 1907 edition of Sir Thomas Browne's 1658 Hydriotaphia: Urne-Buriall, or a discourse of the sepulchrall urnes lately found in Norfolk assigned to "Gardening"; and countless others.

Google's fine Bayesian hand reveals itself even in the classifications of works published after the BISAC categories came into use, such as the 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture and the Body (misdated 1899), which is assigned to "Health & Fitness" — not a classification you could imagine coming from the University of California Press, though you can see how a probabilistic classifier could come up with it, like the "Religion" tag on the Mae West biography subtitled "Icon in Black and White."

But whether it gets the BISAC categories right or wrong, the question is why Google decided to use those headings in the first place. (Clancy denies that they were asked to do so by the publishers, though this might have to do with their own ambitions to compete with Amazon.) The BISAC scheme is well suited to organizing the shelves of a modern 35,000-square-foot chain bookstore or a small public library where ordinary consumers or patrons are browsing for books on the shelves. But it's not particularly helpful if you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example, the BISAC "Juvenile Nonfiction" subject heading has almost 300 subheadings, including separate categories for books about "New Baby," "Skateboarding," and "Deer, Moose, and Caribou." By contrast, the "Poetry" subject heading has just 20 subdivisions in all. That means that Bambi and Bullwinkle get a full shelf to themselves, while Schiller, Leopardi, and Verlaine have to scrunch together in the lone subheading reserved for "Poetry/Continental European." In short, Google has taken the great research collections of the English-speaking world and returned them in the form of a suburban mall bookstore.

These don't exhaust the metadata errors by any means. There are a number of mismatches of titles and texts. Click on the link from the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexandre François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voices of the Heart, whereas the link on a misdated number of Dickens' Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. The link from the title Supervision and Clinical Psychology takes you to a book called American Politics in Hollywood Film. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James."

More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones' having been the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled.

For the present, then, linguists, humanists and social scientists will have to forgo their visions of using Google Books to assemble all the early nineteenth-century book sale catalogues mentioning Alexander Pope or tracking the use of "Gentle Reader" in Victorian novels: the metadata and classifications are simply too poor.

Google is certainly aware of many of these problems (if not on this scale) and they've pledged to fix them, though they've acknowledged that this isn't a priority. I don't doubt their sincere desire to get this stuff right. But it isn't clear whether they plan to go about this in the same way they're addressing the many scanning errors that users report, correcting them one by one as they're notified of them. That isn't adequate here: there are simply too many errors. And while Google's machine classification will certainly improve, extracting metadata mechanically simply isn't sufficiently reliable for scholarly purposes. After some early back-and-forth, Google decided it did want to acquire the library records for scanned books along with the scans themselves, and now it evidently has them, but I understand the company hasn't licensed them for display or use. Hence, presumably, the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file.

In our panel discussion, Dan Clancy suggested that it should fall on users to take some of the responsibility for fixing these errors, presumably via some kind of independent cataloguing effort. But there are hundreds of thousands of errors to pick up on here, not to mention an even larger number of files with simply poor metadata or virtually no metadata at all. Beyond clearing up the obvious errors, the larger question is whether Google's engineers should be trusted to make all the decisions about metadata design and implementation for what will probably wind up being the universal library for a long time to come, with no contractual obligation, and only limited commercial incentives, to get it right. That's probably one of the questions the Antitrust Division of the Justice Department should be asking as it ponders the Google Books Settlement over the coming month.

Some of the slack here may be picked up by the HathiTrust, a consortium of a number of participating libraries that is planning to make available several million of the books that Google scanned along with their WorldCat records. But at present HathiTrust is only going to offer the out-of-copyright books, which are about 25 percent of the Google collection, since libraries have no right to share the orphan works. And it isn't clear what search functionalities they'll be offering, or to whom — or, in the current university climate, for how long. In any event, none of this should let Google off the hook. Google Books is unquestionably a public good, but as Pam Samuelson pointed out in her remarks at another panel, a great public good also implies a great public trust.

Sorites in the comics

Language Log - Fri, 2009-08-28 21:17

Today's Dinosaur Comics disposes of the sorites paradox.

TOC: Corpora Vol 4, No 1 (2009)

Linguist List: Journal Contents - Fri, 2009-08-28 16:06

TOC: Journal of Semantics Vol 26, No 3 (2009)

Linguist List: Journal Contents - Fri, 2009-08-28 16:01
