ARCHIVE - Language Log

URL: http://languagelog.ldc.upenn.edu/nll

Is your size your size?

Fri, 09/04/2009 - 7:19am

According to today's Cathy, men now have to worry about this too:

I don't understand spell-checkers

Fri, 09/04/2009 - 6:29am

Steffi Lewis asked whether this sentence (which, as she says, is attributed to Chico Marx) is well analyzed: Time flies like an arrow; fruit flies like a banana.

I answered as follows (with apologies to syntacticians for the casual low-class nontechnical description):

In the sensical version of the sentence, "time" is a noun phrase and "flies like an arrow" is a verb phrase (with "like an arrow" an adverbial modifier of the verb "flies"), while "fruit flies" is a noun phrase and "like a banana" is a verb phrase (with "a banana" as the object of the verb "like"). In the nonsensical version of the sentence, you just reverse those two analyses.
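The two analyses can be written out as bracketed trees, which makes the "reversal" easy to see. This is only an informal sketch: the nested-tuple representation, the category labels (S, NP, VP, PP, and so on), and the helper names `words` and `main_verb` are assumptions made for illustration, not a serious syntactic formalism.

```python
# Each tree is a nested tuple: (label, child, child, ...); a bare string is a word.
# The category labels are informal choices for this sketch.
AN_ARROW = ("NP", ("Det", "an"), ("N", "arrow"))
A_BANANA = ("NP", ("Det", "a"), ("N", "banana"))

# Sensical reading: "time" is the subject of "flies"; "fruit flies" is the subject of "like".
SENSICAL = [
    ("S", ("NP", ("N", "time")),
          ("VP", ("V", "flies"), ("PP", ("P", "like"), AN_ARROW))),
    ("S", ("NP", ("N", "fruit"), ("N", "flies")),
          ("VP", ("V", "like"), A_BANANA)),
]

# Nonsensical reading: the same word strings with the two analyses swapped.
NONSENSICAL = [
    ("S", ("NP", ("N", "time"), ("N", "flies")),
          ("VP", ("V", "like"), AN_ARROW)),
    ("S", ("NP", ("N", "fruit")),
          ("VP", ("V", "flies"), ("PP", ("P", "like"), A_BANANA))),
]


def words(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]


def main_verb(tree):
    """Return the verb heading the VP of an ("S", NP, VP) tree."""
    verb_phrase = tree[2]
    return verb_phrase[1][1]


for reading in (SENSICAL, NONSENSICAL):
    for tree in reading:
        print(" ".join(words(tree)), "| main verb:", main_verb(tree))
```

Both readings yield exactly the same word strings; only the bracketing, and therefore the main verb of each clause, differs.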

The system I was typing the response on uses a spell-checker, which objected to sensical — and I can't really blame it for that, because I sort of made it up…although I got three hits for it when I googled it just now, so (as I already knew) I'm obviously not the only person to make up that word, and besides, I find that there's an obsolete word sensical in the Oxford English Dictionary. Anyway, the spell-checker's complaint about sensical didn't bother me. But it also objected to analyses, and this seems very weird. I assume it wanted analysis instead; but can someone more expert in spell-checking than I am tell me why on earth the spell-checker wouldn't be trained to recognize the plural? What did it expect, analysises? — Maybe it did: I just googled that, and got 76,100 hits for it. But at least Google asked if I meant to google analyses instead, and for analyses I got 68,000,000 hits. So if Google knows about analyses, why doesn't the spell-checker? (I am sorry to have to report that the Language Log spell-checker is also objecting to analyses, which it has underlined in red on every occurrence in my draft of this post. The shame!) (It doesn't like analysises either. So I conclude that spell-checkers don't want you to have more than one analysis.)

Atrocious

Fri, 09/04/2009 - 12:19am

Linguists around the world right now are packing for a trip to Scotland to attend the 50th Anniversary Golden Jubilee meeting of the Linguistics Association of Great Britain here in Edinburgh (it starts on Sunday). And those listening to the BBC's Radio 4 this Friday morning may have been a little discomfited to hear the weather man, in his official capacity, use the adjective atrocious to describe the weather in Scotland over the past few days. Really! Adjective control is getting lax at Broadcasting House. The word choice should be interpreted, however, in a cultural context. Not to put too fine a point on it, a linguistic context of whingeing, moaning, snivelling, grumbling, and overstatement about the weather that probably goes back to the first settlement by Angles, Saxons, and Jutes. The fact is that no one whose experience has been limited to the British Isles has any idea what would be an appropriate meteorological use of the adjective atrocious.

Barbara and I just got back last night from a train trip to Oban, a small coastal town in the windy, rainy, northwest of Scotland (trains on time almost to the minute). We also took a day trip to the Isle of Mull (ferry system perfectly integrated with the bus times — a wondrous thing for anyone inured to California's hopelessly unintegrated public transportation chaos). Yes, it did rain. Every ten minutes, day and night. The "bright intervals" that feature so heavily in the most optimistic of British weather forecasts were generally only a few minutes at a time. So we wore head-to-toe waterproof gear that we basically took off only when in our hotel room, and the trip was a delight. That judgment is not in any way clouded by overindulgence in the excellent single malt whisky to which Oban lends its name (as it happened, we didn't do the distillery tour and never even sipped the product). The fact is that Scotland at sea level is always fairly temperate — there is nothing in the whole U.K. that could possibly be compared with the kinds of temperatures familiar to residents of the northern USA and Canada.

We walked everywhere, past dark stone walls, the brightly painted harborside houses of Tobermory, dripping green hedgerows, washed ripe blackberries for the picking, publicly accessible (and entirely unsupervised) disused castles covered with moss. We enjoyed spectacular seafood meals (the Waterfront Restaurant on Railway Quay in Oban is truly excellent), and walked home through the rain afterwards. It was wonderful. We were never cold. If this was atrocious, I'd like to experience a lot more atrocity.

Edinburgh is currently about the same: reasonable temperatures (even when Arctic southward air streams set in, Edinburgh gets temperatures no worse than a crisp October day in New England), and continuous rain. Rain on grey volcanic rocks and handsome Georgian architecture and a thousand-year-old castle, past which I will (after donning my waterproof raingear) walk to my office at the university. So don't be afraid to pack a long plastic raincoat and rainproof hat, and come to Edinburgh for the LAGB. You'll love it. It's atrocious.

Semantic fail

Thu, 09/03/2009 - 6:11pm

Leena Rao at TechCrunch points out a case where semantic search turned into anti-semitic search.

This morning I wrote about NetBase Solutions’ healthBase, a semantic search engine that aggregates medical content from millions of authoritative health sites including WebMD, Wikipedia, and PubMed. But is it a semantic engine or an anti-semitic search engine?

Several of our readers tested out the site and found that healthBase’s semantic search engine has some major glitches (see the comments). One of the most unfortunate examples is when you type in a search for “AIDS,” one of the listed causes of the disease is “Jew.” Really.

The ridiculousness continues. When you click on Jew, you can see proper “Treatments” for Jews, “Drugs And Medications” for Jews and “Complications” for Jews. Apparently, “alcohol” and “coarse salt” are treatments to get rid of Jews, as is Dr. Pepper! Who knew?

Apparently this was not the result of amalgamating medical advice from Hamas, but rather a consequence of some artificial stupidity applied to Wikipedia, as a company representative explained:

This is an unfortunate example of homonymy, i.e. words that have different meanings.
The showcase was not configured to distinguish between the disease “AIDS” and the verb “aids” (as in aiding someone). If you click on the result “Jew” you see a sentence from a Wikipedia page about 7th Century history: “Hispano-Visigothic king Egica accuses the Jews of aiding the Muslims, and sentences all Jews to slavery. ” Although Wikipedia contains a lot of great health information it also contains non-health related information (like this one) that is hard to filter out.

And that's not the end of the fun and games:

If you look at the pros of AIDS (yes, it thinks here are pros to having AIDS), it comically lists the “Spanish Civil War.” One of the causes of hemorrhoids is “Bronco” (I don’t even want to know).

It only took a few clicks for me to get here:

Or here:

And vice versa respectively

Thu, 09/03/2009 - 12:24pm

At some time approximately 30 to 35 years ago — that is, in the 1970s, back when disco had a future — I received a letter from my friend Jim Hurford. We were young lecturers then, me in London and him in Lancaster, though he was later to become Professor of General Linguistics at the University of Edinburgh. Here is what his letter asked me:

"Can you construct a grammatical and meaningful English sentence that ends with the words and vice versa, respectively ?"

Jim is now Professor Emeritus, and I now hold the Chair that he held for so many years, and I still have not succeeded in constructing an example of the mind-twistingly difficult sort he requested.

But Jim made the mistake of asking me many years before the Internet was invented. Today there is the web, and there are blogs, and comments areas, and there is Language Log. I am sure that our ingenious and talented readers — that would be you — will take up the challenge in the comments space below. If any of them should succeed, I will present the sentence to Jim, and take all the credit. (He will not read this post, because he is working busily on a book about the evolution of grammatical structures, and has no time for things like Language Log.) You (if you are the one who succeeds) will have the inward pleasure of knowing that you solved the puzzle. I will have the gratitude and admiration of my friend Jim. And Jim will have the example sentence he has so long sought. Doesn't that sound like a win-win-win scenario to you?

[Afterword: I don't know what reminded me of the puzzle Jim set all those years ago, or why it seemed so hard at the time, or what he thought would be the relevance of the structure. It is of course not a serious surprise that Language Log readers were able to solve it (see below). But it is perhaps a surprise that some of them completed it in less than ten minutes after the post went up on this site! What is particularly wonderful is that (of course, as JS Bangs was the first to point out) today we can simply Google up genuine examples from real texts. I didn't bother because I knew you would. The real point of this post is that thirty years has changed the nature of syntactic explorations beyond imagining. Back in the 1970s when we were young, you could have devoted your whole life to optimistically scanning print sources and had only decades of disappointment for your pains. Today we have the trillion-word corpus of the web to search, and the entire investigation takes maybe eight seconds. (Incidentally, I'm deleting the comments that construct sentences that merely quote the phrase. If you allow quotations of ungrammatical bits and pieces to count, then instantly every string becomes grammatical and syntax is pointlessly trivialized. If you don't, then I think it turns out that nearly all word strings are ungrammatical.) —GKP]

Misleading pseudo-scientific argument of the week

Thu, 09/03/2009 - 7:15am

According to Abigail Norfleet James, Teaching the Male Brain: How Boys Think, Feel, and Learn in School (2007), p 37:

The shape of the inner ear is not the same for boys and girls. As we have seen in the previous chapter, the female cochlea responds more quickly to sound than does the male cochlea (Don et al., 1993) That means that boys are likely to respond to aural information of questions just a bit slower than girls will. Because boys don't hear soft or high sounds very well and because they don't respond to sounds as rapidly as do girls, boys may have trouble with auditory sources of information.

The reference is to M. Don et al., "Gender differences in cochlear response time: An explanation for gender amplitude differences in the unmasked auditory brain-stem response", J. Acoust. Soc. Am. 94(4): 2135-2148, 1993. And yes, there really are sex differences in cochlear response time — but the distributions for males and females overlap, as usual, and the average sex differences are less than a thousandth of a second.

There are different ways to measure the response, and different frequency bands to check — you can read the paper to survey all the differences in detail — but here's a typical figure from Don et al. showing the sex differences in cochlear latency for two types of measurement in one frequency range:

And another figure showing differences for one type of measurement in different frequency ranges:

Most of the differences are in the range of .0001 to .0003 seconds (1 to 3 ten-thousandths of a second), and none are larger than .0007 sec.

In comparison, simple acoustic reaction time in adults ranges from about 120 to 300 msec. (R.D. Luce, Response times, 1986). For children, mean simple acoustic reaction times range "from 465 msec at age five to 190 msec at 15 years" (K. Andersen et al., "The Development of Simple Acoustic Reaction Time in Normal Children", Developmental Medicine and Child Neurology 26(4), 2008). "Simple acoustic reaction time" is how long it takes to respond to a sound when you know it's coming, and all you need to do is to press a key as soon as you hear it. Choice reaction times (where you need to interpret the stimulus and respond accordingly) are much longer. And the time that it takes even the most attentive and cooperative child to respond to a simple verbal instruction is more like two or three seconds, if only because the instruction itself is likely to take nearly that long to be expressed.

So the average sex difference in cochlear response time of .0002 to .0006 seconds — even if this difference is preserved through the brain stem and the cortex, and translated to the interpretation of the stimulus and the formulation and execution of the response — is roughly a thousandth of children's typical simple acoustic reaction time, and about one part in ten thousand of the time that it takes them to respond to even the simplest verbal instructions.
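For readers who want to check the orders of magnitude, here is a rough sketch of the arithmetic using only the figures quoted above. The variable names and the helper `ratio_bounds` are made up for illustration, and treating each quantity as a simple (low, high) range is an assumption of this sketch rather than anything in Don et al.

```python
# Figures quoted in the post, all in seconds, as (low, high) ranges.
COCHLEAR_DIFF = (0.0002, 0.0006)   # average sex difference in cochlear response time
CHILD_SIMPLE_RT = (0.190, 0.465)   # children's mean simple acoustic reaction time
VERBAL_RESPONSE = (2.0, 3.0)       # time to respond to a simple verbal instruction


def ratio_bounds(small, large):
    """Return the (lowest, highest) possible ratio small/large for two ranges."""
    return small[0] / large[1], small[1] / large[0]


rt_low, rt_high = ratio_bounds(COCHLEAR_DIFF, CHILD_SIMPLE_RT)
verbal_low, verbal_high = ratio_bounds(COCHLEAR_DIFF, VERBAL_RESPONSE)

print(f"share of simple reaction time: {rt_low:.5f} to {rt_high:.5f}")
print(f"share of verbal response time: {verbal_low:.6f} to {verbal_high:.6f}")
```

Whichever ends of the ranges are paired, the cochlear difference stays within a few thousandths of the simple reaction time and a few ten-thousandths of the verbal response time.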

While we're looking at the Don et al. paper, though, let's also reproduce what they found about pure-tone thresholds:

[The subjects in the Don et al. study were 17 females and 14 males aged 18-38.]

Dr. James claims that "Because boys don't hear soft or high sounds very well and because they don't respond to sounds as rapidly as do girls, boys may have trouble with auditory sources of information."

I don't know whether it's really true that boys have more "trouble with auditory sources of information" than girls do. I do know, however, that when Dr. James tries to persuade her readers of this by citing research about sex differences in cochlear response times and audiometric profiles, her argument is at best irrelevant and at worst dishonest.

And this error is not an isolated one. There is a growing popular literature on the biology of human sex differences, and the use of misinterpreted or overinterpreted "scientific" evidence is all too typical of the strand of this work that emphasizes sex differences in order to argue for sex-specific (and often sex-segregated) educational practices. For more on this with respect to sex differences in audiometric thresholds, comfortable listening levels, etc., see here, here, and here.

Quiz

Thu, 09/03/2009 - 4:12am

What language is this?

Hint: it's one that you know.

Well, some might argue that you probably don't know it, because basilectal Jamaican Creole English is not mutually intelligible with the kinds of English that you probably do know.

And according to the discussion at Steve Cotler's Irrepressibly True Tales ("Draw Your Brakes–A Jamaican Creole Shout"), even Peter Patrick, who grew up in Jamaica and who has studied Jamaican Creole for years, didn't know what these two phrases meant. He had to consult with Kenneth Bilby, who knew the answer only because he'd asked Bunny Brown, a friend of the (deceased) speaker, David Scott, who explained a couple of crucial bits of 1970s Jamaican slang.

The study of creole languages led in the 1960s to the concept of the "creole continuum", describing the spectrum of forms between a standard language and a local variety which (in the absence of the intermediate forms) would clearly be a separate language.

Something very similar happens in cases where the rest of the "creole language" package is nowhere in sight. Thus in Italian- or German-speaking parts of Europe, people may use a spectrum of mixtures between the standard national language and a local variety, which may be different enough that speakers from another region would be unable to understand it in a pure form. And I'm told that the same sort of thing happens with mixtures of fuṣḥā (Modern Standard Arabic) and various regional/social colloquial varieties.

The usual term for these non-creole cases is "code-switching", which implies a process where the message is sometimes in one variety and sometimes in another, in contrast to the term "creole continuum", which suggests a spectrum of intermediate varieties. However, I'm skeptical that there's actually any systematic difference between what happens in Jamaica or Haiti and what happens in Stuttgart or Tunis.

Illustrating the maxim of quantity…

Wed, 09/02/2009 - 6:50am

… or if you prefer, some aspects of relevance theory, the next-to-latest xkcd:

Joshua Whatmough and the donkey

Tue, 09/01/2009 - 12:00pm

At Steve Cotler's Irrepressibly True Tales, an irrepressible (and no doubt true) tale of Prof. Whatmough's Linguistics 120 at Harvard in 1962. If you read to the end, you'll find out about the donkey.

G. Nick Clements, 1940-2009

Tue, 09/01/2009 - 7:15am

I’m sad to report that phonologist Nick Clements passed away in Chatham, Massachusetts, on August 30. There is an obituary by Beth Hume on LINGUIST List. Beth co-organized a symposium on tones and features in honor of Nick in June; the speaker list was a veritable who’s who of phonology, of which Nick was also of course a prominent member. He will be missed.

Food Choices on Indian Airlines

Tue, 09/01/2009 - 4:53am

Whenever I fly on non-Indian carriers to India or there are Indians (South Asians) on non-Indian carriers flying elsewhere, I often encounter individuals who complain that they cannot eat the special meals that they ordered.

Steward(ess): But, sir, you ordered a vegetarian meal, didn't you?

Passenger: Yes, but I cannot eat this kind of vegetarian meal.

Steward(ess): I can assure you, sir / ma'am, that our vegetarian meals have no meat or meat products in them.

Passenger: But what you have given me has X, Y, Z in it. I cannot eat it. Please get me something else.

Steward(ess): I am sorry, we do not have any other kinds of vegetarian meals.

Whereupon the passenger pulls out some biscuits from his / her carry-on bag and survives on them and whatever else he / she can scrounge up for the duration of the flight.

Stefan Krasowski recently booked a flight on India's Jet Airways. Here are the choices he was offered for meals:

As Stefan observes, "The mind-boggling array of options demonstrates India's diversity. I am especially curious about the 'bland meal'!"

I suppose that "bland" implies "not spicy," "spicy" being the default option for (most of the many types of) Indian cuisine.

Now, what intrigues me about this bewildering array of choices is that, although they are all in English and would seem to be fairly common fare for Indian air travelers, some of the choices may set American readers to looking things up on the web, for instance to remind themselves of what "purine" is and why someone might want a meal low in it. Before consulting Wikipedia, I was ignorant of the difference between a Vegetarian Hindu Meal and a Vegetarian Jain Meal. Even knowing what "Lacto Ovo" means, I'm still puzzled about how a Lacto Ovo Meal can be different from a Vegetarian Lacto-Ovo Meal. Furthermore, what can I expect to get if I order just a plain Vegetarian Meal, not a Strict Vegetarian Meal, or any of the otherwise modified vegetarian meals? And what would Edward Said have to say about that Vegetarian Oriental Meal if he were still around to pronounce upon it?

Even with all of these possible selections, I'd bet that somebody is going to ask for a "Low Protein Low Lactose Low Purine Meal" or a "Low Calorie Low Cholesterol Bland Meal" or a…. Starbucks has figured out how to offer customers "more than 19,000 beverage possibilities", and a similar number of menu options is implied by the global combinatorics of dietary preferences and restrictions. In this context, "vegetarian" is apparently no more informative than "coffee".

If I were flying Jet Airways, after much consideration, I think my choice would be "Bland Meal - Non Veg." Being a seasoned traveler, I always carry salt and pepper with me.

Serial improvement

Mon, 08/31/2009 - 10:32am

Although I share Geoff Nunberg's disappointment in some aspects of Google's metadata for books, I've noticed a significant — though apparently unheralded — recent improvement. So I decided to check this out by following up Bill Poser's post yesterday about insect species, which I thought was likely to turn up an example of the right sort. And in fact, the third hit in a search for {hemipteran} is a relevant one: Irene McCulloch, "A comparison of the life cycle of Crithidia with that of Trypanosoma in the invertebrate host", University of California Publications in Zoology, 19(4) 135-190, October 4, 1919.

This paper appears in a volume that is part of a serial publication. And until recently, Google Books routinely gave all such publications the date of the first in the series, even if the result was a decade or a century out of whack.

But no longer.

True, Dr. McCulloch's article is categorized as "Juvenile Nonfiction"… But the date is correctly given as 1919, despite the fact that the series in question began in 1902. The volume in which McCulloch's article was published contains articles dated 1919 and 1920, and the treatment of the 1920 portions is a bit variable:

Still, this is a big step forward over the situation a couple of years ago. See for example the discussion in "Shack!", 7/23/2007, where I observed that Google Book Search misdated the Nov. 1956 issue of Boeing Magazine "as 1934, following its usual unfortunate practice of dating all issues of a serial in terms of the earliest issue".

The relevant page now comes up correctly dated:

And in general, serial publications now seem to be given the date of the bound volume that was scanned, rather than the date of the first publication in the series. This still leaves many errors in the date fields of Google Books' metadata. In fact, I guess it's possible that some of the errors that Geoff found (and that anyone can turn up in a few seconds of searching) were actually created by a process for inferring publication dates from OCR output, rather than relying on whatever catalog information was previously giving them the start date for all subsequent issues of a periodical or serial publication.

Overall, it would be nice if Google were a bit more open about what's going on with their metadata for books — the successes as well as the messes — but they deserve credit for this success, even if the solution created some additional messes.

Uttaris pallidipennis in Miami

Sun, 08/30/2009 - 7:08pm

In the news today I came across this rather strange report from the Associated Press:

MIAMI — U.S. Customs and Border Protection officials say they have intercepted a rare and dangerous insect found in a shipment of flowers at a Miami airport that could cause significant damage.

Officials said Saturday they were examining a box of flowers last week at Miami International airport when they found Hemiptera. Hemiptera's are typically aphids, cicadas, and leaf hoppers and comprise about 80,000 different species. They feed on the seed heads of grasses and sedges. The insect is found in South America.

Officials believe it is the first time the insect has been found in the U.S.

The report is inaccurate in several ways. The insect actually comes from South Africa, which is different from South America. There is no indication that it is rare. The fact that it isn't found in the United States doesn't make it rare. We don't say that tsetse flies are rare because they aren't found in the United States. Nor is there any reason to believe that it is dangerous. It may be destructive of plants, which is why this discovery is of interest, but even that isn't known. It isn't considered a pest in South Africa and it isn't known what would happen if it got loose and survived in Florida.

Several points are of linguistic interest. One is the reference to "Hemiptera". Hemiptera is the name of the order to which the insect, a member of the species Uttaris pallidipennis Stal, belongs. A member of this order is a Hemipteran.

More interesting is the statement that "Hemipteras are typically aphids, cicadas, and leaf hoppers", which is not well-formed. What the author presumably meant is: "Typical members of the order Hemiptera are aphids, cicadas, and leaf hoppers". When you say that "Hemipterans are typically X", X must describe a characteristic; it can't list examples.

I suspect that there's a reason for the erroneous description of this insect as "dangerous". The usage of this word is rather interesting. Anything that can injure or destroy something can be described as "dangerous" to it. One can say, for example, that "Aphids are dangerous to roses." But this only works if you specify what is subject to the danger. If we remove "to roses" and just say "Aphids are dangerous", the sentence becomes false. It seems that "dangerous" comes with a default bearer of the danger, namely human beings. When you say that something is dangerous without specifying to what, you mean that it is dangerous to people. Furthermore, it seems that a non-human bearer of the danger must be linguistically overt. Although we can felicitously say "Owls are dangerous to mice", even if the topic of conversation is predators of mice, it is not felicitous to say "Owls are dangerous" without adding "to mice". My guess is that someone said or thought that the insect in question might be "dangerous to crops" or something along those lines, and then, without thinking, removed "to crops".

The actual facts can be found in the much better report in the Miami Herald. The odd wording appears to have originated with Customs, in this press release. (Customs is now part of the "Department of Homeland Security" but I avoid using this name. Whenever I see it, I hear "Reichssicherheitshauptamt".)

How science reporting works

Sun, 08/30/2009 - 6:08am

The latest from Zach Weiner at SMBC:

(There are four more panels — click on the image to see them all.)

Unfortunately, scientists are often more complicit in the exaggerations and misrepresentations than this narrative suggests.

Sino-American Name Reversion

Sat, 08/29/2009 - 6:08pm

Yesterday an applicant from China came to my office and introduced herself to me as Runxiao ("Moist Dawn"). However, in previous correspondence, she had always referred to herself as Layn (a variant of Lane; other variants of the name Lane are Laen, Laene, Lain, Laine, Laney, Lanie, Layn, Layne, and Laynne; the name books gloss it as "living near a lane", or "descendant of Laighean or Luan" in Gaelic). When I asked her which name she preferred, she said, "You can call me Runxiao."

"But what about Layn?" I asked. "Didn't you used to go by the name Layn?"

"Oh, yes!" she replied cheerfully with a gleaming smile. "When I was in China, I called myself Layn, but now that I'm in America, I call myself Runxiao."

"Isn't that a bit unusual?" I queried. "You call yourself by your English name in China and call yourself by your Chinese name in America?"

"All my friends do that," she answered merrily. "We all take English names in China, but we feel better using our Chinese names in America. Of course, Americans can't pronounce 'Runxiao,' so I tell them just to call me Run. Now I'm Runxiao to my Chinese friends in America, Run to everybody else in America, and Layn to my old associates in China."

This strikes me as completely counterintuitive. However, since the practice is so widespread, there must be some compelling psychological reason for it.

Here are just a few of the Chinese students and friends I know who went by Western names in China or Taiwan, but have switched to Chinese names in America:

China/TW –> America

Sophie –> Xiaofei

Tess –> Lei (her American graduate school is trying to make her use this, but she prefers to remain Tess)

Shelly –> Xiaolei

David –> Dawei

Mary –> Mali

Löwen –> Li-wen

Peter (not Petra!) –> Li-ching ("Li" to her American friends)

Gianni –> Xiang

Naturally, they had the Chinese names first when they were young (given to them by their parents), but for one reason or another had decided to go by a Western name when they got older.

Given the large flap over North Texas State Representative Betty Brown's complaint about the difficulty of pronouncing Asian names (“Rather than everyone here having to learn Chinese — I understand it’s a rather difficult language — do you think that it would behoove you and your citizens to adopt a name that we could deal with more readily here?”) [see the comment by Lance to this blog], one might think that some of the arrows would be turned in the opposite direction.

Perhaps it is precisely because progressive Americans want Chinese to feel comfortable using their original names that they actually do decide to resurrect them after they come to America. Indeed, I have often encountered situations where a Chinese student comes to America with a Western name he / she has used for many years and feels comfortable with, even proud of, only to be told by a well-meaning American that it sounds "silly" or "odd" for a Chinese to have a Western name.

Nearly all Americans I know who go to China to stay for more than a few months take Chinese names. This is especially true of those who speak one of the Chinese languages. For example, my Chinese name is Mei Weiheng and my brother Denis' name is Mei Danli. When I am in China, I become Mei Weiheng, and nobody calls me Victor Mair there. Similarly, in China Ed Shaughnessy is Xia Hanyi, William Baxter is Bai Yiping, Jerry Norman is Luo Jierui, and so forth. Not only do we take Chinese given (personal) names, we also adopt Chinese surnames, whereas Chinese (whether in China or in the West) maintain their Chinese surname, regardless of whether they go by a Western given (first, personal) name or by a Chinese given name.

Google Books: A Metadata Train Wreck

Sat, 08/29/2009 - 1:46pm

Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata — a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic — entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th century French editions of Le Contrat Social, or to linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.

Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few. (You can find images of most of these on my slides, here — I'm not giving the url's since I expect Google will fix most of these particular errors now that they're aware of them).

And while there may be particular reasons why the 1899 date comes up so much, these misdatings are spread out all over the place. A book on Peter Drucker is dated 1905, a book of Virginia Woolf's letters is dated 1900, Tom Wolfe's The Bonfire of the Vanities is dated 1888, and an edition of Henry James's 1897 What Maisie Knew is dated 1848.

It might seem easy to cherry-pick howlers from a corpus as extensive as this one, but these errors are endemic. Do a search on "internet" in books written before 1950 and Google Scholar turns up 527 hits.

Or try searching on the names of writers or famous people, restricting your search to works published before the years of their birth. You turn up 182 hits for Charles Dickens, more than 80 percent of them misdated books referring to the writer as opposed to someone else of the same name. The same search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)
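The birth-year test described above is easy to mechanize. What follows is only a minimal sketch in Python, not anything Google or the libraries are known to run: the Record type, the suspicious_dates function, and the two sample records are all hypothetical, invented for illustration. Only the four names come from the post, and their birth years are common knowledge.

```python
from dataclasses import dataclass

# Birth years of the four people named in the post.
BIRTH_YEARS = {
    "Charles Dickens": 1812,
    "Rudyard Kipling": 1865,
    "Greta Garbo": 1905,
    "Barack Obama": 1961,
}


@dataclass
class Record:
    """A hypothetical catalogue entry: a title, the claimed date, and some text."""
    title: str
    claimed_year: int
    text: str


def suspicious_dates(records, birth_years=BIRTH_YEARS):
    """Return (title, name) pairs where the text mentions a person
    but the claimed publication year precedes that person's birth."""
    flagged = []
    for rec in records:
        for name, born in birth_years.items():
            if name in rec.text and rec.claimed_year < born:
                flagged.append((rec.title, name))
    return flagged


# Hypothetical sample data, invented for this sketch.
records = [
    Record("Sample A", 1774, "an advertisement quoting Charles Dickens"),
    Record("Sample B", 1890, "an essay on Charles Dickens"),
]

flagged = suspicious_dates(records)
print(flagged)
```

A flagged record is only a candidate for review. As the Dickens count shows, some hits really do refer to someone else of the same name, so a person still has to look.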

A search on books mentioning candy bar that were published before 1920 turns up 66 hits, of which 46, or 70 percent, are misdated. I'd be surprised if that proportion of errors or anything like it held up in general for books in that range, and dating errors are far denser for older works than for the ones Google received from publishers. But even if the proportion is only 5 percent, that suggests hundreds of thousands of dating errors.
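For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. The corpus sizes are the figures quoted in the conference notes further down this page (7 million books as of 2008, about 10 million per Dan Clancy); treating either as the right denominator is an assumption, and the 5 percent rate is the deliberately low guess from the paragraph above, not a measured value.

```python
# Sample from the "candy bar" search described above.
hits = 66
misdated = 46
sample_rate_percent = round(100 * misdated / hits)

# Corpus sizes quoted elsewhere on this page; which one applies is an assumption.
corpus_sizes = {
    "7 million (as of 2008)": 7_000_000,
    "about 10 million": 10_000_000,
}

# The deliberately conservative error rate, as a whole percentage.
conservative_percent = 5

# Integer arithmetic keeps the estimates exact.
estimates = {
    label: size * conservative_percent // 100
    for label, size in corpus_sizes.items()
}

print(f"sample error rate: {sample_rate_percent}%")
for label, count in estimates.items():
    print(f"{label}: about {count:,} misdated books at {conservative_percent}%")
```

Either way the estimate lands in the hundreds of thousands.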

In discussion after my presentation, Dan Clancy, the Chief Engineer for the Google Books project, said that the erroneous dates were all supplied by the libraries. He was woolgathering, I think. It's true that there are a few collections in the corpus that are systematically misdated, like a large group of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google's doing. Of the first ten full-view misdated books turned up by a search for books published before 1812 that mention "Charles Dickens", all ten are correctly dated in the catalogues of the Harvard, Michigan, and Berkeley libraries they were drawn from. Most of the misdatings are pretty obviously the result of an effort to automate the extraction of pub dates from the OCR'd text. For example the 1604 date from a 1901 auction catalogue is drawn from a bookmark reproduced in the early pages, and the 1574 dating (as of this writing) on a 1901 book about English bookplates from the Harvard Library collections is clearly taken from the frontispiece, which displays an armorial bookplate dated 1574:

Similarly, the 1719 date on a 1919 edition of Robinson Crusoe in which Dickens' name appears in an advertisement is drawn from the line on the title page that says the book is reprinted from the author's edition of 1719. And the 1774 date assigned to an 1890 book called London of to-day is derived from a front-matter advertisement for a firm that boasts it was founded in that year.
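To see how this kind of error can arise, consider a deliberately naive extractor that takes the first plausible year it finds in the front matter. This is a hypothetical sketch in Python, not Google's actual method, which has not been published. The function names and the title-page string are invented for illustration, with the string paraphrasing the Robinson Crusoe case above.

```python
import re

# Any four-digit year from 1500 through 2099.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def naive_pub_year(front_matter):
    """Return the first four-digit year in the text, or None.
    Deliberately naive: it cannot tell an imprint date from any other date."""
    match = YEAR.search(front_matter)
    return int(match.group(1)) if match else None


def reconcile(front_matter, catalogue_year):
    """Prefer the library catalogue date when one exists;
    fall back to the OCR guess only when it does not."""
    if catalogue_year is not None:
        return catalogue_year
    return naive_pub_year(front_matter)


# Hypothetical OCR text paraphrasing the title page described above.
title_page = (
    "ROBINSON CRUSOE. Reprinted from the author's edition of 1719. "
    "London, 1919."
)

print(naive_pub_year(title_page))   # picks up the reprint note, not the imprint
print(reconcile(title_page, 1919))  # the catalogue record wins
```

The second function is the point: where a library record already carries the date, there is no reason to guess from the OCR at all.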

Then there are the classification errors. William Dwight Whitney's 1891 Century Dictionary is classified as "Family & Relationships," along with Mencken's The American Language. A French edition of Hamlet and a Japanese edition of Madame Bovary are both classified as "Antiques and Collectibles." An edition of Moby Dick is classed under "Computers"; a biography of Mae West is classified as "Religion"; The Cat Lover's Book of Fascinating Facts falls under "Technology & Engineering." A 1975 reprint of a classic topology text is "Didactic Poetry"; the medievalist journal Speculum is classified "Health & Fitness."

And a catalogue of copyright entries from the Library of Congress is listed under "Drama" — though I had to wonder if maybe that was just Google's little joke.

Here again, the errors are endemic, not simply sporadic. Of the first ten hits for Tristram Shandy, four are classified as fiction, four as "Family & Relationships," one as "Biography & Autobiography," and one is not classified. Other editions of the novel are classified as "Literary Collections," "History," and "Music." The first ten hits for Leaves of Grass are variously classified as "Poetry," "Juvenile Nonfiction," "Fiction," "Literary Criticism," "Biography & Autobiography," and mystifyingly, "Counterfeits and Counterfeiting."

Various editions of Jane Eyre are classified as "History," "Governesses," "Love Stories," "Architecture," and "Antiques & Collectibles" ("Reader, I marketed him").

In his response on the panel, Dan Clancy said that here, too, the libraries were to blame, along with the publishers. But the libraries can't be responsible for books mislabeled as "Health and Fitness" and "Antiques and Collectibles," for the simple reason that those categories are drawn from the BISAC codes that the book industry uses to tell booksellers where to put books on the shelves, not from any of the classification systems used by libraries. And inasmuch as BISAC classifications weren't in use until about 20 years ago, only Google could be responsible for their misapplications on books published earlier than that: the 1904 edition of Leaves of Grass assigned to "Juvenile Nonfiction"; the 1919 edition of Robinson Crusoe assigned to "Crafts & Hobbies"; the 1845 number of the Edinburgh Review assigned to "Architecture"; the 1907 edition of Sir Thomas Browne's 1658 Hydriotaphia: Urne-Buriall, or a discourse of the sepulchrall urnes lately found in Norfolk assigned to "Gardening"; and countless others.

Google's fine Bayesian hand reveals itself even in the classifications of works published after the BISAC categories came into use, such as the 2003 edition of Susan Bordo's Unbearable Weight: Feminism, Western Culture and the Body (misdated 1899), which is assigned to "Health & Fitness" — not a classification you could imagine coming from the University of California Press, though you can see how a probabilistic classifier could come up with it, like the "Religion" tag on the Mae West biography subtitled "Icon in Black and White."

But whether it gets the BISAC categories right or wrong, the question is why Google decided to use those headings in the first place. (Clancy denies that they were asked to do so by the publishers, though this might have to do with their own ambitions to compete with Amazon.) The BISAC scheme is well suited to organizing the shelves of a modern 35,000-square-foot chain bookstore or a small public library where ordinary consumers or patrons are browsing for books on the shelves. But it's not particularly helpful if you're flying blind in a library with several million titles, including scholarly works, foreign works, and vast quantities of books from earlier periods. For example, the BISAC "Juvenile Nonfiction" subject heading has almost 300 subheadings, including separate categories for books about "New Baby," "Skateboarding," and "Deer, Moose, and Caribou." By contrast, the "Poetry" subject heading has just 20 subdivisions in all. That means that Bambi and Bullwinkle get a full shelf to themselves, while Schiller, Leopardi, and Verlaine have to scrunch together in the lone subheading reserved for "Poetry/Continental European." In short, Google has taken the great research collections of the English-speaking world and returned them in the form of a suburban mall bookstore.

These don't exhaust the metadata errors by any means. There are a number of mismatches of titles and texts. Click on the link from the 1818 Théorie de l'Univers, a work on cosmology by the Napoleonic mathematician and general Jacques Alexander François Allix, and it takes you to Barbara Taylor Bradford's 1983 novel Voices of the Heart, whereas the link on a misdated number of Dickens' Household Words takes you to a 1742 Histoire de l'Académie Royale des Sciences. The link from the title Supervision and Clinical Psychology takes you to a book called American Politics in Hollywood Film. Numerous entries mix up the names of authors, editors, and writers of introductions, so that the "about this book" page for an edition of one French novel shows the striking attribution, "Madame Bovary By Henry James":

More mysterious is the entry for a book called The Mosaic Navigator: The essential guide to the Internet Interface, which is dated 1939 and attributed to Sigmund Freud and Katherine Jones. My guess is that this is connected to Jones' having been the translator of Freud's Moses and Monotheism, which must have somehow triggered the other sense of the word mosaic, though the details of the process leave me baffled.

For the present, then, linguists, humanists and social scientists will have to forgo their visions of using Google Books to assemble all the early nineteenth-century book sale catalogues mentioning Alexander Pope or tracking the use of "Gentle Reader" in Victorian novels: the metadata and classifications are simply too poor.

Google is certainly aware of many of these problems (if not on this scale) and they've pledged to fix them, though they've acknowledged that this isn't a priority. I don't doubt their sincere desire to get this stuff right. But it isn't clear whether they plan to go about this in the same way they're addressing the many scanning errors that users report, correcting them one-by-one as they're notified of them. That isn't adequate here: there are simply too many errors. And while Google's machine classification will certainly improve, extracting metadata mechanically simply isn't sufficiently reliable for scholarly purposes. After some early back-and-forth, Google decided it did want to acquire the library records for scanned books along with the scans themselves, and now it evidently has them, but I understand the company hasn't licensed them for display or use — hence, presumably, the odd automated stabs at recovering dates from the OCR that are already present in the library records associated with the file.

In our panel discussion, Dan Clancy suggested that it should fall on users to take some of the responsibility for fixing these errors, presumably via some kind of independent cataloguing effort. But there are hundreds of thousands of errors to pick up on here, not to mention an even larger number of files with simply poor metadata or virtually no metadata at all. Beyond clearing up the obvious errors, the larger question is whether Google's engineers should be trusted to make all the decisions about metadata design and implementation for what will probably wind up being the universal library for a long time to come, with no contractual obligation, and only limited commercial incentives, to get it right. That's probably one of the questions the Antitrust Division of the Justice Department should be asking as it ponders the Google Books Settlement over the coming month.

Some of the slack here may be picked up by the HathiTrust, a consortium of a number of participating libraries that is planning to make available several million of the books that Google scanned along with their WorldCat records. But at present HathiTrust is only going to offer the out-of-copyright books, which are about 25 percent of the Google collection, since libraries have no right to share the orphan works. And it isn't clear what search functionalities they'll be offering, or to whom — or, in the current university climate, for how long. In any event, none of this should let Google off the hook. Google Books is unquestionably a public good, but as Pam Samuelson pointed out in her remarks at another panel, a great public good also implies a great public trust.

Sorites in the comics

Fri, 08/28/2009 - 8:17pm

Today's Dinosaur Comics disposes of the sorites paradox:

New Scientist rips off Language Log; we don't care

Fri, 08/28/2009 - 1:23pm

When I wrote my little piece of whimsy on the ridiculous story about how Republicans think Stephen Hawking would have been denied medical treatment if he lived in the UK (he has always lived and been treated in the UK), and pretended that I thought the accent of his speech synthesizer was so bad that it had failed to convince dim-witted Americans at the Investor's Business Daily of his Britishness, my idea was to offer a little joke. Ha! Ha! Another merry quip from Geoff in his inimitable tongue-in-cheek style! But it turns out that I am less inimitable than I thought. The New Scientist "Feedback" column seems (just scroll down one section) to have lifted my little conceit without acknowledgment. They relate the story and then add:

Clearly technology is to blame. They had assumed Hawking was a US citizen, safe from the horrors of socialised medicine. The whole hoo-ha arose, we confidently surmise, because of his voice synthesiser's accent.

Ha! Ha! Another merry quip from New Scientist in its inimitable tongue-in-cheek style… stolen right off of Language Log, where you read it two weeks earlier. Oh, well. We have seen this before. People rip off our stuff. We just try not to get too steamed up about it. After all, we were first; and we are funnier and sexier. And the best revenge is to live well.

The Google Books Settlement

Fri, 08/28/2009 - 6:29am

I'm spending today at Berkeley, participating in a one-day conference on "The Google Books Settlement and the Future of Information Access". I'll live-blog the discussion as the day unfolds, leaving comments off until it's over. I believe that the sessions are being recorded, and the recordings will be available on the web at some time in the near future. [Gary Price at Resource Shelf provides some other links here, and a press round-up here. Another summary by an attendee is here.]

Regular LL readers will know that we've been long-time users and supporters of Google Books, with occasional complaints about the poor quality of its metadata. For a lucid discussion of some issues with the terms of the proposed settlement, read Pamela Samuelson's articles "The Audacity of the Google Books Settlement", Huffington Post, 8/10/2009, and "Why is the Antitrust Division Investigating the Google Books Search Settlement?", Huffington Post, 8/19/2009.

[Note that in most cases below, first-person pronouns refer to the speaker, not to me.]

The opening panel is on the topic "Datamining and non-consumptive use", and the panelists are Peter Brantley and Jim Pitman, with Eric Kansa as the moderator.

AnnaLee Saxenian, Dean of Berkeley's School of Information, kicked things off by noting that this all started five years ago at the Frankfurt Book Fair, where Google announced what was then called "Google Print". In 2005, the Authors Guild and a number of publishers separately sued Google. A proposed settlement will be considered for approval in New York district court in October. The purpose of this conference is to have a conversation about the opportunities and risks of this settlement.

Eric Kansa — Google has publicly said that 7 million books had been scanned as of some time in 2008; some others estimate up to 15 million now. This is a large piece of the total universe of books — US Census reports 2.334 M books published in the U.S. between 1880 and 1998; WorldCat lists 23 million books.

Comparison and contrast to Human Genome project, which had significant public sponsorship to create; the result is largely held in public trust. Google Books is being created with private funds and will largely be privately held, with restrictions imposed by the settlement.

The largest value of this corpus might lie in the results of machine processing, even if no human eyeballs ever viewed it. These "non-display" or "non-consumptive" uses include arbitrary uses allowed within Google. For others, a "research corpus" will be created and housed at two US-based participating libraries. Qualified researchers can experiment with the data for non-commercial purposes. The two sites will be responsible for evaluating research proposals, running audits, etc.

Traditionally, US law makes a distinction between public-domain "facts" and "ideas", and copyrighted expression of such facts and ideas. The settlement appears to give rights over information extracted from the corpus to Google, the Authors Guild and the publishers.

Peter Brantley — representing the "Open Book Alliance" as well as the Internet Archive. I.A. Richards: "A book is a machine to think with". That takes on a new meaning, because we're now moving from "books as books" to "books as data". Future books will be different, in ways that most publishers and lawyers don't seem to be thinking about, and in fact no one really can predict. Moving from Gemeinschaft to Gesellschaft.

Concerns: Corpus is unique — a comprehensive collection of 20th-century literature, based on an exclusive deal, due to the treatment of "orphan works" as well as other aspects of the legal settlement. Do we want to give sole ownership of this unique resource to a single corporate actor? Even if authors pull their books from the display portion, it will still be there in the datamining portion. Prices charged to universities for this unique resource will be unregulated and unrestrained by competition.

Other scenarios: should there be some sort of compulsory licensing? Choice shouldn't just be settlement/no settlement. It's a historic moment. How can we protect the future?

Jim Pitman — The raw content of the books is a tremendous resource. But there's also the information that results from the interaction of users with this content, or from algorithmic analysis of the content, or both. JP is especially interested in subject navigation and classification. Would development of a "map of knowledge" from this corpus belong to Google? To what extent will the scholarly community be inhibited in developing such ideas by the terms of this settlement?

Do we have to ask for permission to pull out (say) a tableau of theorems, or a table of compounds?

There's lots to be done by text analysis, but you can't really do it right if you can't get access to the data on your own computer. Going to one of two designated centers to run some code in a limited time period on someone else's machine is hardly even second best.

Google has an intrinsic commercial interest in stifling innovation by others in textual analysis — at what point will they give in to the temptation to act on this interest?

To get the best results, you want to combine machine learning with human judgment. To what extent will the terms of the settlement allow this? Individuals and small organizations should be encouraged to cultivate the parts of the academic universe that they know best — Google books will be a big part of the resources for this, but it should be easy to combine it with other things. Does the proposed settlement inhibit this?

Dan Clancy of Google Books responds: Nonconsumptive access is fair use, in Google's view. So you don't have to ask permission of Google to do whatever [but you have to have the texts — myl]. Control of research access will be up to the two host sites. There are some provisions, when it talks about the information extracted, that we still don't understand as a community. But the idea is that (for example) if you use the corpus to infer a subject classification system — that's yours. But if you want to use the corpus to derive an index, and offer access to books on that basis, you can't. Similarly, you can't offer a concordance to (say) works belonging to Elsevier.

Jim Pitman: what about an indexing service for mathematical theorems?

Dan Clancy: theorems don't belong to us; but if it's a commercial service, we don't have the rights to give you the right to link to someone else's text. And if we aren't already offering a theorem search service, it's not competing, and we wouldn't be able to object.

Jim Pitman: Is there a list of Google services that we can't compete with?

Dan Clancy: No.

Jim Pitman: What about an algorithm to disambiguate references to people, to be used in other data mining applications?

Dan Clancy: I don't think that would be any problem.

Eric Kansa: What about authors' or publishers' rights to withdraw data-mined facts from downstream applications? And do you need permission to use extracted information in a commercial application?

Dan Clancy: The license in this case adds additional constraints, beyond copyright law, so that yes, the corpus can only be used for research, and if you want to use the results of algorithmic processing of the corpus in a business, you'd need to get permission from the Authors Guild and perhaps others.

Jason Schultz, a clinical professor at the Berkeley Law School: What's interesting about the settlement is its breadth — it sets up a legal regime by which Google is bound, and by which lots of others outside the lawsuits are bound. Things could get changed and a second version of the settlement could come around.

What happens if the settlement is rejected? Probably Google book search goes forward as it is now.

Dan Clancy: But in that case, there's no research corpus.

Jason: If the settlement is rejected and Google wins the case, that would be a precedent that others could rely on. Also, state universities might have sovereign immunity.

Marty: Does this provide a solid enough basis for ongoing research, including the ability to reproduce others' research? What sort of access will researchers really get, and to what? This makes a big difference to the kinds of research that can be done.

Dan Clancy: The simple answer is you get full access. But the question of how to give access to researchers to compute at scale over the research corpus is unclear. We see the two access sites as an open source community.

Jamie Love: it would be helpful to elaborate on what research institutions would NOT have the freedom to do?

Dan Clancy: No restrictions for noncommercial services; except it can't be used to compete with Google.

The second panel is on privacy issues — that is, who gets to know what books someone reads.

The first speaker is Angela Maycock: The settlement is silent on the issue of privacy. There have been lots of informal assurances, but that's all we have to go on.

Why do librarians care? What is the history? What are the threats? How does a loss of privacy rights affect users? What are the implications for this discussion?

In a library context, "Privacy" is the right to engage in open inquiry, without having your activities scrutinized by anyone else. "Confidentiality" is the right of libraries to protect users' information. Lack of privacy and confidentiality has a chilling effect, which damages first-amendment rights.

Examples of challenges — needed to counter the misconception that "good people don't have anything to hide". In Wise County TX, the DA asked the library for all people who checked out a book on childbirth during the previous nine months, as part of an investigation of a child abandonment case. In Colorado, the supreme court ruled that the DA could not access someone's history of purchases from the Tattered Cover bookstore.

The American Library Association has not opposed the settlement, but did file a letter asking for vigorous oversight. Google people have been good listeners.

Tom Leonard: Research libraries are generally not exposed to the kinds of pressures that public libraries are. Still, the core objective for research libraries remains preventing users from being monitored, except in exceptional cases where the user is fully informed.

Readers do know that we've kept a record of what they've borrowed, that this information will never be shared with anyone, and that it will be purged after the book is returned. Similarly, IP addresses are purged as soon as possible after use of online databases. The only exception is the case of rare and valuable books, where records are kept in order to guard against theft.

There are some hard cases, e.g. suppose a group abroad is using our resources to assign people to "untouchable" categories, by looking at census data and maps and so on, should that fact be conveyed to those who are affected?

What about after the GBS? We have the capacity now to monitor all digital searches, but we don't do it, and we shouldn't do it. I believe Google when they say that they feel the same way, but I would like to believe them more.

Jason Schultz: Stanford marshmallow experiment — We have a settlement in front of us: how good will this marshmallow be? How much better (or not) would things be if we wait?

When GB first came on the scene, the issues were all about copyright and fair use. I'm a big fan of the fair use argument for scanning books for information access, information location. This is the right balance in copyright law, and an excellent way to address the orphan works problem, at least in part.

But now we're in a different situation. The settlement deals with so much more than just copyright and fair use. Four things relevant to privacy:

1) The size of the deal. This is the largest copyright licensing deal in history.
2) The compulsory nature of the settlement — it includes everyone who has a copyright interest in the U.S., and it's binding on everyone unless they opt out: "You are giving Google permission, and in return, you get some benefits". Now Google offers GMail, and you can take it or leave it. But here they need permission from more people than have ever been involved in such a process before. Is privacy important enough to be part of the deal?
3) Implications for privacy. If Google can look at every page that you ever read, ??
4) This is a legal hack — which I like in general, EFF often used them — but this is a VERY big hack.

It's not like the privacy issues with the Amazon Kindle or the iPhone, because it's so much bigger, so much longer term, and it's compulsory.

Are the enforcement and accountability mechanisms adequate in this case?

Two models that might be followed:

All of the records in Google Health are kept separate from all your other Google info.
Google's location product — Latitude — has amnesia built into it, it only knows where you are now, not where you were.

Should Google Books do similarly?

If privacy violations occur, how will we know? What will the consequences be?

Michael Zimmer: Planet Google becomes the center of gravity of everyone's information-seeking activities; results are good, which is why we do it, but the resulting infrastructure is a potential threat. This is why we're worried about data retention policies, etc.

Norms of information flow: when you go into a library, in real life or on line, you have some expectations about privacy. Norms for web searching are different. A certain amount of tracking is inherent — and beneficial. A lot of people accept it and move on. Now the question arises, which norms apply to looking at Google Books on line?

Competition: when users are informed, they can "vote with their feet" if they don't like the privacy policy of a given provider. But Google Books will be a monopoly — if people don't like it, where can they go?

There's an FAQ on the Inside Google Books blog. A Google Account will not be necessary; they won't sell access information; their general privacy policy applies.

I trust Google; but like Tom says, I'd like to trust more. Google Street View had very different policies in the U.S. vs. in Europe and Canada, because privacy laws were different. But maybe the ethics should be the same in both cases.

Options: Build in anonymity online? Should book search data be protected the way that health data is? Should everything?

The first afternoon session is on Quality.

Paul Duguid is the moderator.

The panelists are me, Geoff Nunberg, Cliff Lynch, and Dan Clancy.

My presentation is here.

Summary of Geoff Nunberg's talk: This is "the last library". So it's important to get it right. But dates, authorship information, categories are often pretty bad.
[Details are in Geoff's slides here.]

Cliff Lynch: It feels like something is happening here that is much bigger than a simple legal settlement. We're making a national decision about what may well be, as Geoff said, "the last library" — and it's important to get that right.

Metadata problems are much less excusable than scan and OCR problems. OCR is an error-prone process that gets better as algorithms improve. But bibliographic metadata is a mature and well-understood area — if this is "the last library", why not take the trouble to get it right? Three kinds of things mixed up together — OCR output, assertions from bibliographic metadata, assertions from publisher metadata.

Dan Clancy: Some of the most vocal critics are also some of the most avid users. It's important to understand that dialogue and criticism is an important thing.

Is this about identifying problems and thinking through solutions? or is it about creating a dichotomy? How do we move forward?

I actually don't view Google Books as the one and only library. I don't think it will be and I don't think it should be. There will continue to be other digitization activities. To the extent that Google Book Search is our last shot, then without Google Book Search we never would have had a shot; and I don't really think that's true.

For books scanned at a library, we get the metadata from the library. We get updates from libraries every week or two to fix publication date issues. We probably do have a problem with classification, though it's not done automatically. We get classification metadata from many sources, and perhaps we're not combining them in the best way.

To the extent that the source of our data is part of the problem — which it mostly is — now we need to think about how to make things better. Merging records is a tough problem. But part of this is not about Google doing this, it's something that our partners need to do as well.

Metadata is only the tip of the iceberg — there's also public-domain determination. If we trusted our metadata about what was before 1923, our error rate would be atrocious, so we can't afford to do that.

The Human Genome project is not really comparable, because there's no competition in this case, just cooperation. We have partnerships with Hathi Trust, with Michigan, etc. It's not a competition, where as one gains the other loses. It's really about figuring out how you can build a bunch of repositories, and how do you ensure from a preservation perspective that books are held in multiple locations.

I don't think we're the only library, we're part of a broader community.

[myl: but Dan said later "Do you think that if Google hadn't organized this scanning effort, someone else would have done it? Do you think that if Google hadn't done it, someone else would do it in the next 10 or 20 years? I think the answer to both questions is 'no'".]

Ed Feigenbaum: tens of millions of dollars is a relatively small amount of money. Why is this the last library?

Someone else: Who owns the metadata? Who controls it? Under what conditions can it be distributed? Who can change it and re-distribute the changed version? Is it OCLC? Is it the contributing libraries? What about Google's contribution to merging? Cliff Lynch: it's a mess. Nobody really knows.

Dan Clancy: It's more than tens of millions of dollars, alas.

Guy from EFF: We've suggested that Google should escrow these scans, and after a reasonable time (say 28 years, which was good enough for the copyright act for our founding fathers), make them available to others.

Guy from CNET: How much has Google spent?

Dan Clancy: Alas, we do not publicly disclose the amount of money. We've scanned about 10 million books. The Internet Archive's cost is said to be about $30 per book.

The last panel is on public access, moderated by Pamela Samuelson.

The first speaker is Dan Greenstein: Why did UC libraries get involved? Libraries are all about access to information, and public libraries including libraries in public universities see public access as a sacred trust. UC libraries were active in the settlement, because of the benefits to the public.

The second speaker is Carla Hesse. Her remarks focused on the question of whether Google Books might turn out to be "too big to fail", and to require a government bailout at some point in the future, perhaps because that slice of Google's business might be sold off to some third party who is less successful at deriving revenue from it, or perhaps for other reasons.

The third speaker is Jamie Love, who offered a detailed and interesting critique of the economics of the proposed settlement. I'll try to get his slides, since there were a lot of details that are hard to get down in real time. One key point is that the settlement allows publishers to collude to fix prices, in a way that would be clearly a violation of anti-trust if they did it on their own, outside of the context of the settlement.

The fourth speaker is Molly Van Houweling. Her remarks focused on the question of what future pricing and subscription options for university libraries might look like.

Dan Clancy: Carla did a good job of broadening the scope — she placed this in the context of the broader evolution in how we access information. Now we're not dealing with physical goods, but rather with digital goods; the book search settlement is not mainly about that, it's mainly about orphaned works and so on. Future books will mainly be distributed in digital form by publishers. We don't see this settlement as the overall solution, just as the solution to the specific problem of out-of-print books that are still in copyright.

Responding to Jamie Love, many months of the settlement discussion were devoted to the questions of competition and pricing. Most of what people want, and where the competition is, is in-print books. What happens in the journal market is you have publishers who don't even let the authors distribute their works. But in the settlement, all authors are permitted CC distribution. The only place you can get the latest and greatest is from the publisher. For Google Books, there are lots of places you'll be able to get access to most of these books, for example from the library as a physical copy. For digital copies, the internet drives prices down, and our users will get free search, free preview, and low purchase prices (mostly $14 or less, whereas interlibrary loan is mostly $20 or $30).

Questions from audience:

What about access to Google Books from outside the U.S.?

Dan Clancy: U.S. Courts can only authorize use within the U.S. But as rights-holders are identified, they can choose to authorize access outside the U.S.

What about the role of hackers and pirates? In music and movies, the best metadata is created by hackers and pirates. When Google's DRM is broken, what will happen?

Dan Clancy: DRM is not worth much, true, but the long tail means that for most of the books of interest, you won't find them on a local peer.

Pam Samuelson: There are complicated, elaborate, strong, expensive-looking requirements for data security on the part of the universities hosting the "research corpus" infrastructure. This could get to the point where the research host sites will have to close down for lack of funding.

Because I'm a lawyer, I look at "what could go wrong?" There are lots of places where things could break, and the security responsibilities that the host universities have to undertake are a potential failure point.

Dan Clancy: Those standards were written so as to be consistent with current library practices. Libraries wanted to move from a statutory damages regime to an actual damages regime, which the settlement accomplishes. The likelihood that the actual damages will be significant is very, very small.

[Mark Liberman: There were a number of additional interesting points in the discussion that I didn't get down in time; if you're deeply interested in the topic, you'll find them in the recording that will be online in a week or so. I'll add a link here, as well as blogging it separately, when it becomes available.]

Failing immediately to

Thu, 08/27/2009 - 1:44pm

As BBC Radio 4 reported the death of Senator Kennedy on the news, I heard a line about how his career had been blighted by the incident at the bridge at Chappaquiddick where "he failed immediately to report an accident". You can see what has happened: in an inadvisable attempt to avoid a split infinitive, the adverb has been placed before to, but this puts it next to failed, so we get interference from a distracting and unintended meaning that involves immediate failure (whatever that might mean). It was the reporting that should have been immediate. The right word order to pick would have been "he failed to immediately report an accident". But you just can't stop writers of news copy from being worried (falsely) that splitting an infinitive is some kind of mistake.

I should add one thing. Many of the comments that began appearing below the first version of this post (I have deleted them, for reasons I will now explain) wanted to argue for "he failed to report an accident immediately". But in fact that option was rendered disastrously less plausible in the BBC sentence, which I shortened (thinking the rest of the object irrelevant). It was roughly this: "he failed immediately to report an accident in which he drove off a bridge and his female companion drowned". It would of course be much worse to try "he failed to report an accident in which he drove off a bridge and his female companion drowned immediately", where almost no one would take the adverb to modify the verb report.

Other commenters want to defend "he failed to report immediately an accident". That would in fact also have been acceptable in the full context (which I did not originally give), but only because the noun phrase was long.

Still other commenters simply quibbled with earlier commenters or made sarcastic remarks about them or made the ridiculous suggestion that "an Americanism" was involved (Americans who find a sentence questionable always say it sounds British and British speakers say it's an Americanism — they blame whichever side of the Atlantic they know least about). These bored me, and I spiked them all.

Let me make this clear: I'm not saying that you never have a choice, and I'm not saying the split infinitive is always the right choice to make. All I'm saying is that, squirm though you may, it is fairly common for placing an adverb between infinitival to and the following plain-form verb to be not just grammatical (it is always grammatical), but also the best stylistic choice. And this was one. But BBC editors resist that and worry about it. Stupidly.

Being Human