ARCHIVE - news aggregator | The Science of Language

news aggregator

TOC: Word Structure Vol 2, No 1 (2009)

Linguist List: Journal Contents - Fri, 2009-08-28 15:59

New Scientist rips off Language Log; we don't care

Language Log - Fri, 2009-08-28 14:23

When I wrote my little piece of whimsy on the ridiculous story about how Republicans think Stephen Hawking would have been denied medical treatment if he lived in the UK (he has always lived and been treated in the UK), and pretended that I thought the accent of his speech synthesizer was so bad that it had failed to convince dim-witted Americans at the Investor's Business Daily of his Britishness, my idea was to offer a little joke. Ha! Ha! Another merry quip from Geoff in his inimitable tongue-in-cheek style! But it turns out that I am less inimitable than I thought. The New Scientist "Feedback" column seems to have lifted my little conceit without acknowledgment. They relate the story and then add:

Clearly technology is to blame. They had assumed Hawking was a US citizen, safe from the horrors of socialised medicine. The whole hoo-ha arose, we confidently surmise, because of his voice synthesiser's accent.

Ha! Ha! Another merry quip from New Scientist in its inimitable tongue-in-cheek style… stolen right off of Language Log, where you read it two weeks earlier. Oh, well. We have seen this before. People rip off our stuff. We just try not to get too steamed up about it. After all, we were first; and we are funnier and sexier. And the best revenge is to live well.

The Google Books Settlement

Language Log - Fri, 2009-08-28 07:29

I'm spending today at Berkeley, participating in a one-day conference on "The Google Books Settlement and the Future of Information Access". I'll live-blog the discussion as the day unfolds, leaving comments off until it's over. I believe that the sessions are being recorded, and the recordings will be available on the web at some time in the near future. [Gary Price at Resource Shelf provides some other links here, and a press round-up here. Another summary by an attendee is here.]

Regular LL readers will know that we've been long-time users and supporters of Google Books, with occasional complaints about the poor quality of its metadata. For a lucid discussion of some issues with the terms of the proposed settlement, read Pamela Samuelson's articles "The Audacity of the Google Books Settlement", Huffington Post, 8/10/2009, and "Why is the Antitrust Division Investigating the Google Books Search Settlement?", Huffington Post, 8/19/2009.

[Note that in most cases below, first-person pronouns refer to the speaker, not to me.]

The opening panel is on the topic "Datamining and non-consumptive use", and the panelists are Peter Brantley and Jim Pitman, with Eric Kansa as the moderator.

AnnaLee Saxenian, Dean of Berkeley's School of Information, kicked things off by noting that this all started five years ago at the Frankfurt Book Fair, where Google announced what was then called "Google Print". In 2005, the Authors Guild and a number of publishers separately sued Google. A proposed settlement will be considered for approval in New York district court in October. The purpose of this conference is to have a conversation about the opportunities and risks of this settlement.

Eric Kansa: Google has publicly said that 7 million books had been scanned as of some time in 2008; some others estimate up to 15 million now. This is a large piece of the total universe of books: the US Census reports 2.334 M books published in the U.S. between 1880 and 1998, and WorldCat lists 23 million books.

Comparison and contrast to Human Genome project, which had significant public sponsorship to create; the result is largely held in public trust. Google Books is being created with private funds and will largely be privately held, with restrictions imposed by the settlement.

The largest value of this corpus might lie in the results of machine processing, even if no human eyeballs ever viewed it. These "non-display" or "non-consumptive" uses include arbitrary uses allowed within Google. For others, a "research corpus" will be created and housed at two US-based participating libraries. Qualified researchers can experiment with the data for non-commercial purposes. The two sites will be responsible for evaluating research proposals, running audits, etc.

Traditionally, US law makes a distinction between public-domain "facts" and "ideas", and copyrighted expression of such facts and ideas. The settlement appears to give rights over information extracted from the corpus to Google, the Authors Guild and the publishers.

Peter Brantley — representing the "Open Book Alliance" as well as the Internet Archive. I.A. Richards: "A book is a machine to think with". That takes on a new meaning, because we're now moving from "books as books" to "books as data". Future books will be different, in ways that most publishers and lawyers don't seem to be thinking about, and in fact no one really can predict. Moving from Gemeinschaft to Gesellschaft.

Concerns: Corpus is unique — a comprehensive collection of 20th-century literature, based on an exclusive deal, due to the treatment of "orphan works" as well as other aspects of the legal settlement. Do we want to give sole ownership of this unique resource to a single corporate actor? Even if authors pull their books from the display portion, it will still be there in the datamining portion. Prices charged to universities for this unique resource will be unregulated and unrestrained by competition.

Other scenarios: should there be some sort of compulsory licensing? Choice shouldn't just be settlement/no settlement. It's a historic moment. How can we protect the future?

Jim Pitman: The raw content of the books is a tremendous resource. But there's also the information that results from the interaction of users with this content, or from algorithmic analysis of the content, or both. JP is especially interested in subject navigation and classification. Would development of a "map of knowledge" from this corpus belong to Google? To what extent will the scholarly community be inhibited in developing such ideas by the terms of this settlement?

Do we have to ask for permission to pull out (say) a tableau of theorems, or a table of compounds?

There's lots to be done by text analysis, but you can't really do it right if you can't get access to the data on your own computer. Going to one of two designated centers to run some code in a limited time period on someone else's machine is hardly even second best.

Google has an intrinsic commercial interest in stifling innovation by others in textual analysis — at what point will they give in to the temptation to act on this interest?

To get the best results, you want to combine machine learning with human judgment. To what extent will the terms of the settlement allow this? Individuals and small organizations should be encouraged to cultivate the parts of the academic universe that they know best — Google books will be a big part of the resources for this, but it should be easy to combine it with other things. Does the proposed settlement inhibit this?

Dan Clancy of Google Books responds: Nonconsumptive access is fair use, in Google's view. So you don't have to ask permission of Google to do whatever [but you have to have the texts — myl]. Control of research access will be up to the two host sites. There are some provisions, when it talks about the information extracted, that we still don't understand as a community. But the idea is that (for example) if you use the corpus to infer a subject classification system — that's yours. But if you want to use the corpus to derive an index, and offer access to books on that basis, you can't. Similarly, you can't offer a concordance to (say) works belonging to Elsevier.

Jim Pitman: what about an indexing service for mathematical theorems?

Dan Clancy: theorems don't belong to us; but if it's a commercial service, we don't have the rights to give you the right to link to someone else's text. And if we aren't already offering a theorem search service, it's not competing, and we wouldn't be able to object.

Jim Pitman: Is there a list of Google services that we can't compete with?

Dan Clancy: No.

Jim Pitman: What about an algorithm to disambiguate references to people, to be used in other data mining applications?

Dan Clancy: I don't think that would be any problem.

Eric Kansa: What about authors' or publishers' rights to withdraw data-mined facts from downstream applications? And do you need permission to use extracted information in a commercial application?

Dan Clancy: The license in this case adds additional constraints, beyond copyright law, so that yes, the corpus can only be used for research, and if you want to use the results of algorithmic processing of the corpus in a business, you'd need to get permission from the Authors Guild and perhaps others.

Jason Schultz, a clinical professor at the Berkeley Law School: What's interesting about the settlement is its breadth — it sets up a legal regime by which Google is bound, and by which lots of others outside the lawsuits are bound. Things could get changed and a second version of the settlement could come around.

What happens if the settlement is rejected? Probably Google book search goes forward as it is now.

Dan Clancy: But in that case, there's no research corpus.

Jason: If the settlement is rejected and Google wins the case, that would be a precedent that others could rely on. Also, state universities might have sovereign immunity.

Marty: Does this provide a solid enough basis for ongoing research, including the ability to reproduce others' research? What sort of access will researchers really get, and to what? This makes a big difference to the kinds of research that can be done.

Dan Clancy: The simple answer is you get full access. But the question of how to give access to researchers to compute at scale over the research corpus is unclear. We see the two access sites as an open source community.

Jamie Love: it would be helpful to elaborate on what research institutions would NOT have the freedom to do?

Dan Clancy: No restrictions for noncommercial services; except it can't be used to compete with Google.

The second panel is on privacy issues — that is, who gets to know what books someone reads.

The first speaker is Angela Maycock: The settlement is silent on the issue of privacy. There have been lots of informal assurances, but that's all we have to go on.

Why do librarians care? What is the history? What are the threats? How does a loss of privacy rights affect users? What are the implications for this discussion?

In a library context, "Privacy" is the right to engage in open inquiry, without having your activities scrutinized by anyone else. "Confidentiality" is the right of libraries to protect users' information. Lack of privacy and confidentiality has a chilling effect, which damages first-amendment rights.

Examples of challenges — needed to counter the misconception that "good people don't have anything to hide". In Wise County TX, the DA asked the library for all people who checked out a book on childbirth during the previous nine months, as part of an investigation of a child abandonment case. In Colorado, the supreme court ruled that the DA could not access someone's history of purchases from the Tattered Cover bookstore.

The American Library Association has not opposed the settlement, but did file a letter asking for vigorous oversight. Google people have been good listeners.

Tom Leonard: Research libraries are generally not exposed to the kinds of pressures that public libraries are. Still, the core objective for research libraries remains preventing users from being monitored, except in exceptional cases where the user is fully informed.

Readers do know that we've kept a record of what they've borrowed, that this information will never be shared with anyone, and that it will be purged after the book is returned. Similarly, IP addresses are purged as soon as possible after use of online databases. The only exception is the case of rare and valuable books, where records are kept in order to guard against theft.

There are some hard cases. Suppose, for example, that a group abroad is using our resources to assign people to "untouchable" categories, by looking at census data and maps and so on: should that fact be conveyed to those who are affected?

What about after the GBS? We have the capacity now to monitor all digital searches, but we don't do it, and we shouldn't do it. I believe Google when they say that they feel the same way, but I would like to believe them more.

Jason Schultz: Stanford marshmallow experiment — We have a settlement in front of us: how good will this marshmallow be? How much better (or not) would things be if we wait?

When GB first came on the scene, the issues were all about copyright and fair use. I'm a big fan of the fair use argument for scanning books for information access, information location. This is the right balance in copyright law, and an excellent way to address the orphan works problem, at least in part.

But now we're in a different situation. The settlement deals with so much more than just copyright and fair use. Four things relevant to privacy:

1) The size of the deal. This is the largest copyright licensing deal in history.
2) The compulsory nature of the settlement — it includes everyone who has a copyright interest in the U.S., and it's binding on everyone unless they opt out: "You are giving Google permission, and in return, you get some benefits". Now Google offers GMail, and you can take it or leave it. But here they need permission from more people than have ever been involved in such a process before. Is privacy important enough to be part of the deal?
3) Implications for privacy. If Google can look at every page that you ever read, ??
4) This is a legal hack — which I like in general, EFF often used them — but this is a VERY big hack.

It's not like the privacy issues with the Amazon Kindle or the iPhone, because it's so much bigger, so much longer term, and it's compulsory.

Are the enforcement and accountability mechanisms adequate in this case?

Two models that might be followed:

All of the records in Google Health are kept separate from all your other Google info.
Google's location product — Latitude — has amnesia built into it, it only knows where you are now, not where you were.

Should Google Books do similarly?

If privacy violations occur, how will we know? What will the consequences be?

Michael Zimmer: Planet Google becomes the center of gravity of everyone's information-seeking activities; results are good, which is why we do it, but the resulting infrastructure is a potential threat. This is why we're worried about data retention policies, etc.

Norms of information flow: when you go into a library, in real life or on line, you have some expectations about privacy. Norms for web searching are different. A certain amount of tracking is inherent — and beneficial. A lot of people accept it and move on. Now the question arises, which norms apply to looking at Google Books on line?

Competition: when users are informed, they can "vote with their feet" if they don't like the privacy policy of a given provider. But Google Books will be a monopoly — if people don't like it, where can they go?

There's an FAQ on the Inside Google Books blog. A Google Account will not be necessary; they won't sell access information; their general privacy policy applies.

I trust Google; but like Tom says, I'd like to trust more. Google Street view had very different policies in the U.S. vs. in Europe and Canada, because privacy laws were different. But maybe the ethics should be the same in both cases.

Options: Build in anonymity online? Should book search data be protected the way that health data is? Should everything?

The first afternoon session is on Quality.

Paul Duguid is the moderator.

The panelists are me, Geoff Nunberg, Cliff Lynch, and Dan Clancy.

My presentation is here.

Summary of Geoff Nunberg's talk: This is "the last library". So it's important to get it right. But dates, authorship information, categories are often pretty bad.
[Details are in Geoff's slides here.]

Cliff Lynch: It feels like something is happening here that is much bigger than a simple legal settlement. We're making a national decision about what may well be, as Geoff said, "the last library" — and it's important to get that right.

Metadata problems are much less excusable than scan and OCR problems. OCR is an error-prone process that gets better as algorithms improve. But bibliographic metadata is a mature and well-understood area; if this is "the last library", why not take the trouble to get it right? Three kinds of things are mixed up together: OCR output, assertions from bibliographic metadata, and assertions from publisher metadata.

Dan Clancy: Some of the most vocal critics are also some of the most avid users. It's important to understand that dialogue and criticism is an important thing.

Is this about identifying problems and thinking through solutions? or is it about creating a dichotomy? How do we move forward?

I actually don't view Google Books as the one and only library. I don't think it will be and I don't think it should be. There will continue to be other digitization activities. To the extent that Google Book Search is our last shot, then without Google Book Search we never would have had a shot; and I don't really think that's true.

For books scanned at a library, we get the metadata from the library. We get updates from libraries every week or two to fix publication date issues. We probably do have a problem with classification, though it's not done automatically. We get classification metadata from many sources, and perhaps we're not combining them in the best way.

To the extent that the source of our data is part of the problem — which it mostly is — now we need to think about how to make things better. Merging records is a tough problem. But part of this is not about Google doing this, it's something that our partners need to do as well.

Metadata is only the tip of the iceberg — there's also public-domain determination. If we trusted our metadata about what was before 1923, our error rate would be atrocious, so we can't afford to do that.

The Human Genome project is not really comparable, because there's no competition in this case, just cooperation. We have partnerships with Hathi Trust, with Michigan, etc. It's not a competition, where as one gains the other loses. It's really about figuring out how you can build a bunch of repositories, and how do you ensure from a preservation perspective that books are held in multiple locations.

I don't think we're the only library, we're part of a broader community.

[myl: but Dan said later "Do you think that if Google hadn't organized this scanning effort, someone else would have done it? Do you think that if Google hadn't done it, someone else would do it in the next 10 or 20 years? I think the answer to both questions is 'no'".]

Ed Feigenbaum: tens of millions of dollars is a relatively small amount of money. Why is this the last library?

Someone else: Who owns the metadata? Who controls it? Under what conditions can it be distributed? Who can change it and re-distribute the changed version? Is it OCLC? Is it the contributing libraries? What about Google's contribution to merging? Cliff Lynch: it's a mess. Nobody really knows.

Dan Clancy: It's more than tens of millions of dollars, alas.

Guy from EFF: We've suggested that Google should escrow these scans, and after a reasonable time (say 28 years, which was good enough for the copyright act for our founding fathers), make them available to others.

Guy from CNET: How much has Google spent?

Dan Clancy: Alas, we do not publicly disclose the amount of money. We've scanned about 10 million books. The Internet Archive's cost is said to be about $30 per book.

The last panel is on public access, moderated by Pamela Samuelson.

The first speaker is Dan Greenstein: Why did UC libraries get involved? Libraries are all about access to information, and public libraries including libraries in public universities see public access as a sacred trust. UC libraries were active in the settlement, because of the benefits to the public.

The second speaker is Carla Hesse. Her remarks focused on the question of whether Google Books might turn out to be "too big to fail", and to require a government bailout at some point in the future, perhaps because that slice of Google's business might be sold off to some third party who is less successful at deriving revenue from it, or perhaps for other reasons.

The third speaker is Jamie Love, who offered a detailed and interesting critique of the economics of the proposed settlement. I'll try to get his slides, since there were a lot of details that are hard to get down in real time. One key point is that the settlement allows publishers to collude to fix prices, in a way that would be clearly a violation of anti-trust if they did it on their own, outside of the context of the settlement.

The fourth speaker is Molly Van Houweling. Her remarks focused on the question of what future pricing and subscription options for university libraries might look like.

Dan Clancy: Carla did a good job of broadening the scope — she placed this in the context of the broader evolution in how we access information. Now we're not dealing with physical goods, but rather with digital goods; the book search settlement is not mainly about that, it's mainly about orphaned works and so on. Future books will mainly be distributed in digital form by publishers. We don't see this settlement as the overall solution, just as the solution to the specific problem of out-of-print books that are still in copyright.

Responding to Jamie Love: many months of the settlement discussion were devoted to the questions of competition and pricing. Most of what people want, and where the competition is, is in-print books. What happens in the journal market is you have publishers who don't even let the authors distribute their works. But in the settlement, all authors are permitted CC distribution. The only place you can get the latest and greatest is from the publisher. For Google Books, there are lots of places you'll be able to get access to most of these books, for example from the library as a physical copy. For digital copies, the internet drives prices down, and our users will get free search, free preview, and low purchase prices (mostly $14 or less, whereas interlibrary loan is mostly $20 or $30).

Questions from audience:

What about access to Google Books from outside the U.S.?

Dan Clancy: U.S. Courts can only authorize use within the U.S. But as rights-holders are identified, they can choose to authorize access outside the U.S.

What about the role of hackers and pirates? In music and movies, the best metadata is created by hackers and pirates. When Google's DRM is broken, what will happen?

Dan Clancy: DRM is not worth much, true, but the long tail means that for most of the books of interest, you won't find them on a local peer.

Pam Samuelson: There are complicated, elaborate, strong, expensive-looking requirements for data security on the part of the universities hosting the "research corpus" infrastructure. This could get to the point where the research host sites will have to close down for lack of funding.

Because I'm a lawyer, I look at "what could go wrong?" There are lots of places where things could break, and the security responsibilities that the host universities have to undertake are a potential failure point.

Dan Clancy: Those standards were written so as to be consistent with current library practices. Libraries wanted to move from a statutory damages regime to an actual damages regime, which the settlement accomplishes. The likelihood that the actual damages will be significant is very very small.

[Mark Liberman: There were a number of additional interesting points in the discussion that I didn't get in time; if you're deeply interested in the topic, you'll find them in the recording that will be on line in a week or so. I'll add a link here, as well as blogging it separately when it becomes available.]

Failing immediately to

Language Log - Thu, 2009-08-27 14:44

As BBC Radio 4 reported the death of Senator Kennedy on the news, I heard a line about how his career had been blighted by the incident at the bridge at Chappaquiddick where "he failed immediately to report an accident". You can see what has happened: in an inadvisable attempt to avoid a split infinitive, the adverb has been placed before to, but this puts it next to failed, so we get interference from a distracting and unintended meaning that involves immediate failure (whatever that might mean). It was the reporting that should have been immediate. The right word order to pick would have been "he failed to immediately report an accident". But you just can't stop writers of news copy from being worried (falsely) that splitting an infinitive is some kind of mistake.

I should add one thing. Many of the comments that began appearing below the first version of this post (I have deleted them, for reasons I will now explain) wanted to argue for "he failed to report an accident immediately". But in fact that option was rendered disastrously less plausible in the BBC sentence, which I shortened (thinking the rest of the object irrelevant). It was roughly this: "he failed immediately to report an accident in which he drove off a bridge and his female companion drowned". It would of course be much worse to try "he failed to report an accident in which he drove off a bridge and his female companion drowned immediately", where almost no one would take the adverb to modify the verb report.

Other commenters want to defend "he failed to report immediately an accident". That would in fact also have been acceptable in the full context (which I did not originally give), but only because the noun phrase was long.

Still other commenters simply quibbled with earlier commenters or made sarcastic remarks about them or made the ridiculous suggestion that "an Americanism" was involved (Americans who find a sentence questionable always say it sounds British and British speakers say it's an Americanism — they blame whichever side of the Atlantic they know least about). These bored me, and I spiked them all.

Let me make this clear: I'm not saying that you never have a choice, and I'm not saying the split infinitive is always the right choice to make. All I'm saying is that, squirm though you may, it is fairly common for placing an adverb between infinitival to and the following plain-form verb to be not just grammatical (it is always grammatical), but also the best stylistic choice. And this was one. But BBC editors resist that and worry about it. Stupidly.

Where evidence counts for nothing and nobody will listen

Language Log - Thu, 2009-08-27 14:27

You just can't stop people putting themselves in harm's way. If they're not walking into the buzzsaw they're crashing like bugs into the windshield… As the previously referenced discussion about usage in The Guardian's online pages developed a bit further, a commenter called scherfig responded to Steve Jones's devastating piece of evidence about Mark Twain not obeying Fowler's which/that rule by saying this:

OK, steve, let's forget Mark Twain and Fowler (old hat) and take a giant leap forward to George Orwell in the 30's and 40's. In my opinion, in his essays, the finest writer of the English language ever. Check out his use of English - it is, after all, several decades after Twain and still 70 years ago, and he has actually written sensibly about language (quite a lot).

What Steve immediately did, of course, was to take a relevant piece of Orwell's work and look at it; scherfig, the Orwell fan, astonishingly, had been too lazy to do this. And again his result was total and almost instant annihilation of the opponent.

Here's what Steve wrote to scherfig:

I didn't bring out Mark Twain, Michael did. And it was Fowler who was responsible for the non-rule in the first place.

If you are suggesting that we should copy Orwell, 'the finest writer in the English Language ever' then you'd better jettison your nonsense about 'which' not been used in restrictive relative clauses because his famous essay 'Politics and the English Language' is full of it being used thus, starting with the very first paragraph.

belief that language is a natural growth and not an instrument which we shape for our own purposes.

Did scherfig acknowledge that this was a point against him? Not at all. Not one bit (read for yourself).

What is going on here? People insist that they believe in the fictive rule that which is a mistake at the beginning of a restrictive relative clause; they cite examples of writers they admire, and predict that these writers would never disrespect the rule; they don't look to see, not even at the first page or so; instead they publish their unchecked claim in an online department of a national newspaper; Steve Jones checks their claim quite easily in a minute or two of research and shows that they are just plain wrong; and they refuse to accept that the evidence tells us anything. They move on to suggesting a different author, or change the subject, or post personal insults against the messenger (scherfig tells Steve, who is spot-on relevant and exactly correct, "I have come to the conclusion that you have no idea what you're talking about … you just witter on…", and also accuses Steve of claiming that there are no rules at all).

Linguistics is a very strange business to be in. Matters like what rules or regularities expert users of English are following when they construct relative clauses are not difficult like quantum mechanics is difficult. They are readily settled by inspection of text that anyone could do. And all linguists want to base on such inspection is an accurate description of the language of which the texts are a sample. Linguists believe (and could anyone seriously think otherwise?) that on the whole a correct description of the grammar of a language has to be an account of the rules or regularities that expert users of English follow when they construct sentences. That doesn't mean everything any user writes down is in compliance with the rules — we all make occasional mistakes from inattention. But it does mean that overwhelming evidence concerning a regularity in published English prose should count for something (probably quite a lot) when we're discussing English grammar.

And the overwhelming evidence says that the regularity about English restrictive clauses, in speech as well as published prose, and in cases where the users themselves would say that they were not in error and did not choose their words carelessly, is that they sometimes begin with which (a thing which I have often wondered about), and sometimes begin with that (the thing that I can't understand), and sometimes begin with neither (the thing I want to explain). The evidence is overwhelming. But people can't accept it, and insist it isn't so.

It's like being a chemist and explaining to people that mercury is poisonous and accumulates in the body and causes mental deterioration; and they just keep adding mercury to their food, and eating it and going mad, and claiming that chemists say there should be no food.

"Team, Meet Girls; Girls, Meet Team"

Language Log - Thu, 2009-08-27 12:50

The ideal David Bowie song, according to (Nick Troop's interpretation of) the output of Jamie Pennebaker's LIWC program, correlated with sales figures across Bowie's oeuvre:

This is a big step up (or down, depending on your perspective) from the typical "Experts solve mystery of ___'s success" story — Prof. Troop puts his theory into practice, and lets the public judge the results.

[Hat tip: Gordon Campbell]

TOC: Language & Intercultural Communication Vol 9, No 3 (2009)

Linguist List: Journal Contents - Thu, 2009-08-27 08:07

TOC: Journal of Child Language Vol 36, No 4 (2009)

Linguist List: Journal Contents - Thu, 2009-08-27 08:04

Nun study update

Language Log - Thu, 2009-08-27 01:58

For the last dozen years, it's been known that young people who follow the stylistic advice of Strunk & White are more likely to get Alzheimer's disease when they get old. Well, at least, in a cohort of nuns,

Low idea density and low grammatical complexity in autobiographies written in early life were associated with low cognitive test scores in late life. Low idea density in early life had stronger and more consistent associations with poor cognitive function than did low grammatical complexity. Among the 14 sisters who died, neuropathologically confirmed Alzheimer's disease was present in all of those with low idea density in early life and in none of those with high idea density.

And if you look into what "idea density" means, you'll see that many aspects of Strunkish writing style, especially avoidance of adjectives and adverbs, are precisely designed to lower it. (For details and links, see "Writing style and dementia", 12/3/2004; and "Miers dementia unlikely", 10/21/2005.)

Now there's a new chapter in the story, based on looking for physical symptoms of Alzheimer's in living nuns using positron emission tomography (PET) brain imaging, rather than relying on post-mortem examination of the brains of dead ones ("Can Language Skills Ward Off Alzheimer's? A Nuns' Study", Time, 7/9/2009).

The recent journal article under discussion is D. Iacono et al., "The Nun Study. Clinically silent AD, neuronal hypertrophy, and linguistic skills in early life", Neurology, published online 7/8/2009.

Background: It is common to find substantial Alzheimer disease (AD) lesions, i.e., neuritic β-amyloid plaques and neurofibrillary tangles, in the autopsied brains of elderly subjects with normal cognition assessed shortly before death. We have termed this status asymptomatic AD (ASYMAD). We assessed the morphologic substrate of ASYMAD compared to mild cognitive impairment (MCI) in subjects from the Nun Study. In addition, possible correlations between linguistic abilities in early life and the presence of AD pathology with and without clinical manifestations in late life were considered.

Methods: Design-based stereology was used to measure the volumes of neuronal cell bodies, nuclei, and nucleoli in the CA1 region of hippocampus (CA1). Four groups of subjects were compared: ASYMAD (n = 10), MCI (n = 5), AD (n = 10), and age-matched controls (n = 13). Linguistic ability assessed in early life was compared among all groups.

Results: A significant hypertrophy of the cell bodies (+44.9%), nuclei (+59.7%), and nucleoli (+80.2%) in the CA1 neurons was found in ASYMAD compared with MCI. Similar differences were observed with controls. Furthermore, significant higher idea density scores in early life were observed in controls and ASYMAD group compared to MCI and AD groups.

Conclusions: 1) Neuronal hypertrophy may constitute an early cellular response to Alzheimer disease (AD) pathology or reflect compensatory mechanisms that prevent cognitive impairment despite substantial AD lesions; 2) higher idea density scores in early life are associated with intact cognition in late life despite the presence of AD lesions.

See also William E. Klunk et al., "Amyloid Imaging with PET in Alzheimer’s Disease, Mild Cognitive Impairment, and Clinically Unimpaired Subjects", chapter 6 in Dan Silverman (Ed.) PET in the Evaluation of Alzheimer's Disease and Related Disorders, 2009.

Crash blossoms

Language Log - Wed, 2009-08-26 18:59

From John McIntyre:

You've heard about the Cupertino. You have seen the eggcorn. You know about the snowclone. Now — flourish by trumpets and hautboys — we have the crash blossom.

At Testy Copy Editors.com, a worthy colleague, Nessie3, posted this headline:

Violinist linked to JAL crash blossoms

(If this seems a bit opaque, and it should, the story is about a young violinist whose career has prospered since the death of her father in a Japan Airlines crash in 1985.)

A quick response by subtle_body suggested that crash blossom would be an excellent name for headlines done in by some such ambiguity — a word understood in a meaning other than the intended one. The elliptical nature of headline writing makes such ambiguities an inevitable hazard.

And danbloom was quick to set up a blog to collect examples of "infelicitously worded headlines."

Chris Waigl, reporting on the same neologism, describes "crash blossoms" as "those train wrecks of newspaper headlines that lead us down the garden path to end up against a wall, scratching our head and wondering what on earth the subeditor might possibly have been thinking." Indeed, when such infelicitous headlines have come up here on Language Log, they have typically been discussed as examples of "garden path sentences." After the break, a recent headline of the classic "garden path" variety.

On Sunday night, this headline appeared on the CNN Wire before events took a more tragic turn:

The lede graf of the story explained:

Three people missing Sunday after large ocean waves knocked several people into the Atlantic off Maine's Acadia National Park have been located, a park official said.

(The headline and lede graf were quickly replaced after news emerged that one of the three people, a 7-year-old girl, had died.)

As John McIntyre points out, headlines are particularly susceptible to improper "garden path" parsing, since the elliptical nature of headlinese can lead to various syntactic ambiguities. In this example, we're so used to copula deletion in headlines that "3 missing…" is easily parsed as "3 are missing…" But by the time we get to "located" at the end of the headline, we discover that this is a misparsing. Following the structure of the lede, the headline is intended to be read as "3 [people] missing after waves hit Maine [have been] located." I'd be impressed if anyone got that reading the first time through.

For more garden path headlines, see:

"Garden paths at the Guardian" (9/21/2004)
"Surprising crocodile kin" (1/26/2006)
"Linguist thought able to read isn't" (7/16/2006)

Museum musing

Language Log - Tue, 2009-08-25 11:42

John McIntyre at You Don't Say considers a hypothetical Musée des Peevologies. The curator's job is apparently open, or will be once a founding donor is located.

Annals of offense-finding

Language Log - Tue, 2009-08-25 09:47

From the Times Online of August 23, under the head "Quangos blackball … oops, sorry … veto 'racist' everyday phrases", a story that begins:

It could be construed as a black day for the English language — but not if you work in the public sector.

Dozens of quangos and taxpayer-funded organisations have ordered a purge of common words and phrases so as not to cause offence.

Among the everyday sayings that have been quietly dropped in a bid to stamp out racism and sexism are “whiter than white”, “gentleman’s agreement”, “black mark” and “right-hand man”.

Details to follow, but first a word about quangos, for readers unfamiliar with the term.

(Hat tip to Danny Bloom.)

Quango is a mostly British term. Here are the OED (draft revision of March 2008) definition and etymology:

Originally: an ostensibly non-governmental organization which in practice carries out work for the government. Now chiefly: an administrative body which has a recognized role within the processes of national government, but which is constituted in a way which affords it some independence from government, even though it may receive state funding or support and senior appointments to it may be made by government ministers.

[Acronym, originally < the initial letters of quasi non-governmental organization…, but in later use also frequently reinterpreted as the initial letters of either quasi-autonomous non-government(al) organization or quasi-autonomous national government(al) organization.

The coinage of the acronym is frequently attributed to A. Barker of the University of Essex: see e.g. R. L. Wettenhall in Current Affairs Bulletin 57(1981) 14-22, and compare:
1982 A. BARKER Quangos in Britain 220 This was around 1970, when I invented this near-acronym from an American term ‘quasi-non-governmental organisation’.]

(NOAD2 characterizes it as "chiefly derogatory", but the OED has no such usage label, though it does label as "depreciative" the derivatives quangocracy and quangocrat.) Probably more than you wanted to know, but quangos have come up on Language Log only in a passing remark by Geoff Pullum, here, that "quangos are now called NDPBs" (that is, non-departmental public bodies).

Now let's return to our muttons, with some specific pieces of language advice from the Times Online story:

The Northern Ireland Human Rights Commission has advised staff to replace the phrase “black day” with “miserable day”, according to documents released under freedom of information rules.

It points out that certain words carry with them a “hierarchical valuation of skin colour”. The commission even urges employees to be mindful of the term “ethnic minority” because it can imply “something smaller and less important”.

The National Gallery in London believes that the phrase “gentleman’s agreement” is potentially offensive to women and suggests that staff should replace it with “unwritten agreement” or “an agreement based on trust” instead. The term “right-hand man” is also considered taboo by the gallery, with “second in command” being deemed more suitable.

Many institutions have urged their workforce to be mindful of “gender bias” in language. The Learning and Skills Council wants staff to “perfect” their brief rather than “master” it, while the Newcastle University has singled out the phrase “master bedroom” as being problematic.

Advice issued by the South West Regional Development Agency states: “Terms such as ‘black sheep of the family’, ‘black looks’ and ‘black mark’ have no direct link to skin colour but potentially serve to reinforce a negative view of all things black. Equally, certain terms imply a negative image of ‘black’ by reinforcing the positive aspects of white.

“For example, in the context of being above suspicion, the phrase ‘whiter than white’ is often used. Purer than pure or cleaner than clean are alternatives which do not infer that anything other than white should be regarded with suspicion.”

I haven't read all of the comments — there are 158 at the moment — but they seem to be uniformly negative, in various modes (dismay, outrage, mockery, and so on, with plenty of references to Orwell). It does seem like a (laudable) desire to avoid offense has gotten out of control, probably as a joint effect of the etymological fallacy (the idea that the original meaning of a word, or rather what people believe the original meaning to have been, continues to color uses of the word) and the belief that all uses of a particular pronunciation or spelling are united, so that if some use gives offense, all do (offense inheres in the physical object itself, not in a relationship between the object, the people who say or write it, the people who hear or read it, and the context of use).

[Addendum 26 August: Ray Girvan writes with an important cautionary note — something I'd intended to incorporate in the original posting but left out in my hurry to get the posting out:

The Times piece follows a very common story format in the more right-wing UK newspapers, which tend to be hostile toward various bodies (local councils, quangos, arts organisations, groups helping minorities, etc) and use "political correctness" stories as a stick to beat them with.

Quite often the edicts cited in these stories turn out to be exaggerated, urban myths, or even fictitious.

Alas, yes. I don't know the true status of the four reports in the Times story.]

Please be careful

Language Log - Tue, 2009-08-25 09:18

Not only are the stereotypical Japanese fastidiously clean, they are also extraordinarily polite. They will not just tell you to be careful not to endanger yourself. They will be sure to preface the warning with a "please" (actually the word for "please" in Japanese, KUDASAI, comes at the end of the sentence).

In today's Japan mail (from Kathryn Hemmann) come two signs, one warning, "Please Be Careful to Strong Sunlight" and the other, "Please be careful to traffic."

The first example utilizes the intriguing device of a sign within a sign, and it is all in English.

The second example in Japanese reads KURUMA NI GO-CHŪI KUDASAI (car with-regard-to [honorific]-pay-attention please; please pay attention to the cars / traffic).

The consistency of usage leads me to suspect that this may be an established pattern in Japanese translations into English. And using "be careful to" in order to mean "be aware of the dangers of" can create odd results, even when what follows is a verb phrase rather than a noun phrase (from here):

Indeed, "Please be careful to forget valuables" won the 2005 Sign Language Award of the English-Speaking Union of Japan.

And there are many web site warnings along the lines of "This is a scratch site, then please be careful to lose your way."

TOC: Journal of Germanic Linguistics Vol 21, No 3 (2009)

Linguist List: Journal Contents - Mon, 2009-08-24 22:07

Walking into a buzzsaw

Language Log - Mon, 2009-08-24 14:07

Michael Bulley made a profoundly incautious comment in a discussion in the Guardian newspaper's "Comment is free" online section today. He was following up a pathetic column on usage by the paper's style guide editor, David Marsh. Unsurprisingly, Marsh had attempted to defend the totally fake which-that rule for integrated (or ‘defining’) relative clauses, which we have so often critiqued here at Language Log. Wrote Bulley, rather pompously:

No one would deny that there are numerous examples of "which" introducing a relative clause that defines (if they weren't any, no one would object to them as being bad style!), but are you just going to say to someone "This is what lots of people do, so it's OK for you to do it as well"? I'm reading Mark Twain's Innocents Abroad. I haven't checked, but I'd bet he never uses "which" as a defining relative.

Oh, no! It was like watching someone walking backwards toward a buzzsaw. I could hardly bear to look. You don't say things like that in the age of We-Can-Fact-Check-Your-Ass!

What's more, Bulley had said this in a thread where Steve Jones was participating. Steve has (how shall I put it?) never worried unduly about whether his smartness-to-niceness ratio might stray above 1.0 sometimes. I knew he would grab an electronic copy of Twain and check immediately, and Bulley could expect no mercy. Sure enough, a few comments later, there was Steve, with a response even more brilliantly acid than I was expecting:

Well, when you've finished the table of contents, and get round to reading the Preface, you'll find this example in the third paragraph.

In this volume I have used portions of letters which I wrote for the Daily Alta California

Smack! When will people learn the new first rule of the blogosphere, WCFCYA?

(Twain was, of course, an excellent writer. He knew there was no rule forbidding which in integrated relatives, so he used it when it struck him as right. Excellent writers who have not been forced to submit to American copy editors on this point tend to use which and that in roughly equal proportions at the beginnings of their integrated relative clauses. And there may be a subtle meaning difference: it seems to me that there is a slight bias toward using which when the noun phrase is indefinite or introduces something new, and that when the noun phrase is definite or refers to something established in the discourse. It is the nervous, insecure, and gullible minor writers, not the great ones, who believe there is a hallowed rule of English grammar that they can comply with if they deprive themselves of their freedom to choose between which and that as best suits the context. Great writers know better.)

Instigation and intention

Language Log - Mon, 2009-08-24 12:44

A couple of weeks ago, Yale University Press decided to remove the illustrations from Jytte Klausen's forthcoming book The Cartoons that Shook the World. (See "Yale Press Bans Images of Muhammad in New Book", NYT, 8/12/2009). Among the many condemnations of this decision that I've read, Christopher Hitchens' ("Yale Surrenders", Slate, 8/17/2009) is the only one that makes a lexicographical argument:

[YUP director John] Donatich is a friend of mine and was once my publisher, so I wrote to him and asked how, if someone blew up a bookshop for carrying professor Klausen's book, the blood would be on the publisher's hands rather than those of the bomber. His reply took the form of the official statement from the press's public affairs department. This informed me that Yale had consulted a range of experts before making its decision and that "[a]ll confirmed that the republication of the cartoons by the Yale University Press ran a serious risk of instigating violence."

So here's another depressing thing: Neither the "experts in the intelligence, national security, law enforcement, and diplomatic fields, as well as leading scholars in Islamic studies and Middle East studies" who were allegedly consulted, nor the spokespeople for the press of one of our leading universities, understand the meaning of the plain and common and useful word instigate. If you instigate something, it means that you wish and intend it to happen. If it's a riot, then by instigating it, you have yourself fomented it. If it's a murder, then by instigating it, you have yourself colluded in it. There is no other usage given for the word in any dictionary, with the possible exception of the word provoke, which does have a passive connotation. After all, there are people who argue that women who won't wear the veil have "provoked" those who rape or disfigure them … and now Yale has adopted that "logic" as its own.

Samuel Johnson, the first great English lexicographer, wrote about the connotations of instigate from a slightly different point of view (The Plan of an English Dictionary, 1747):

There are many other characters of words which it will be of use to mention. Some have both an active and passive signification, as fearful, that which gives or which feel terror, a fearful prodigy, a fearful hare. Some have a personal, some a real meaning, as in opposition to old we use the adjective young of animated beings, and new of other things. Some are restrained to the sense of praise, and others to that of disapprobation, so commonly, though not always, we exhort to good actions, we instigate to ill; we animate, incite and encourage indifferently to good or bad. So we usually ascribe good, but impute evil; yet neither the use of these words, nor perhaps of any other in our licentious language, is so established as not to be often reversed by the correctest writers. I shall therefore, since the rules of stile, like those of law, arise from precedents often repeated, collect the testimonies on both sides, and endeavour to discover and promulgate the decrees of custom, who has so long possessed, whether by right or by usurpation, the sovereignty of words.

In his dictionary, Johnson defined instigate as "to tempt or urge to ill", which asserts that the instigated action is a bad thing, and also supports the view that the instigating party intends it. The OED gives two senses, corresponding to different syntactic frames: instigate (someone) to VerbPhrase meaning "To spur, urge on; to stir up, stimulate, incite, goad (now mostly to something evil)", and instigate NounPhrase meaning "To bring about by incitement or persuasion; to stir up, foment, provoke". In all of the citations, the subject does seem to desire the result. But following Johnson's prescribed method, let's collect recent "testimonies on both sides". Among the first 20 uses of instigate that I found in this morning's Google News, 17 (read in context) support Hitchens' contention that "[i]f you instigate something, it means that you wish and intend it to happen":

In another video, a resident was told to push another far bigger resident to instigate a fight.
WUC member spreads faked video to instigate riot
Greenpeace Uses Design to Instigate Corporate Change.
Russia, it seems, has had enough of the constant bickering and wants to instigate a change in Ukraine, one which will play in its favor.
To fail to instigate an inquiry will continue to hide the truth from terrorism’s victims and from the public.
Flintoff himself will certainly not instigate a return to Test cricket.
It would be left up to the United States to continue on its own in Afghanistan or instigate an emergency withdrawal.
… any people who are coming into the community that just want to have their way, trying to instigate the community to create more tension within the community, is just not acceptable for the Council …
Hodgson hardly had the manpower to instigate sweeping changes, so Chelsea remained largely unruffled after the interval.
O'Brien's approach enables him to instigate emotional reactions from the less stable and immature portions of our society.
It is particularly offensive when this is expressed by political hacks who are well-known to instigate such tactics.
The Christian Association of Nigeria in Kaduna says it is alarmed at an attempt by unknown persons to instigate religious antagonism
… the aim was to instigate indoctrinated Muslim youth to fight against the Russian armies …
With electron beams used widely to instigate industrial and manufacturing chemical reactions, the company’s offerings could add up to substantial energy savings if it can increase its market share.
The primary focus of this role will be to instigate a strategic marketing campaign in terms of online e-commerce and direct marketing…
100 reformists, including senior officials, stand accused of trying to instigate a so called “velvet revolution”.
The best evidence Ms. Kadeer did not instigate the riots paradoxically comes from the Chinese themselves.

But I found 3 examples where instigate seems simply to mean "cause" or "start":

If an inmate were to read the article and told other inmates, that in itself could be enough to instigate a riot.
A title certain to instigate debate is Princeton University Press's Nothing Less than Victory: Decisive Wars and the Lessons of History
The media does not look the other way when phrases like "loosely based", "inspired" and "heavily borrowed" are thrown casually in interviews. If anything, these types of responses instigate further grilling.

It's not clear that these count as examples of "the correctest writers", but at least with respect to contemporary journalistic prose, Hitchens' view that "If you instigate something, it means that you wish and intend it to happen" needs to be qualified by Johnson's "commonly, though not always". And though I'm certainly in favor of Hitchens' general opposition to "blame the victim" theories of moral responsibility, the whole argument is probably beside the point. Martin Kramer, among others, makes a plausible case that the Yale administration's motivation was not to avoid bloodshed, but rather to preserve access to money.

[Update: we can check the status of instigate's implication of intentionality more directly, by searching for things like {"unintentionally instigate|instigated|instigates|instigating"} in various places. Google Scholar turns up e.g. this

It may surprise some to learn that the pop star has, albeit unintentionally, instigated a broader examination of HIPAA violations. But HIPAA, as a relatively new law, was festering behind the curtain, waiting for a high-profile patient privacy violation before enforcement could truly begin.

where the odd use of festering doesn't give us a lot of confidence that this counts as one of the "correctest writers"; or this one

This alternative hypothesis would be in keeping with the “victim precipitation” model of aggression that assumes that victims of aggression intentionally or unintentionally instigate some negative acts such as aggression.

which is more convincing as English prose, and also suggests that for some people at least, the connection between instigation and intention is at best an implicature. This impression is strengthened by a search of Google Books, which turns up examples like

But the truth is, I unknowingly and unintentionally instigated my own seduction.

Yang Changjun unintentionally instigated the Hezhou violence by trying to save characters in his announcement. Instead of writing "Sa-la-er Hui", which unambiguously means "the Salar Muslims", he wrote "sa Hui", which might refer to the Salars or might mean "disperse and scatter the Muslims".

… by annexing a significant portion of tribal lands, had unintentionally instigated an uprising in the mountains that threatened to overwhelm the Ottoman forces in the region.

These examples seem plausible and idiomatic to me.]
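For readers who want to repeat this kind of check on their own text collection, here is a minimal sketch in Python. It is only an illustration, not the search actually run above: the pattern, the function name `find_hits`, and the tiny `SAMPLE` list (three sentences copied from the examples in this post) are assumptions made for the sake of the example.

```python
import re

# Hypothetical pattern: "unintentionally" followed directly by an
# inflected form of "instigate" (instigate, instigated, instigates, instigating).
PATTERN = re.compile(r"\bunintentionally\s+instigat(?:e|ed|es|ing)\b", re.IGNORECASE)


def find_hits(texts):
    """Return (index, matched string) pairs for every hit in a list of strings."""
    hits = []
    for i, text in enumerate(texts):
        for m in PATTERN.finditer(text):
            hits.append((i, m.group(0)))
    return hits


# Illustrative sample only: sentences quoted earlier in this post.
SAMPLE = [
    "But the truth is, I unknowingly and unintentionally instigated my own seduction.",
    "Flintoff himself will certainly not instigate a return to Test cricket.",
    "victims of aggression intentionally or unintentionally instigate some negative acts",
]

hits = find_hits(SAMPLE)
for index, phrase in hits:
    print(index, phrase)
```

A literal pattern like this one would miss "albeit unintentionally, instigated" because of the intervening comma, so any counts it produces are best treated as a lower bound.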


Bierce's bugbears

Language Log - Mon, 2009-08-24 07:59

Just a pointer to Jan Freeman's "On Language" column — she was subbing for Bill Safire — in Sunday's New York Times Magazine, about Ambrose Bierce's advice on English usage in Write It Right: A Little Blacklist of Literary Faults (1909), which Jan characterizes as "often mysterious, perverse and bizarre". With examples.

Almost every 90 seconds

Language Log - Mon, 2009-08-24 02:34

Max Heiman wrote to me with a nice point. I present it here as a guest post.

An ambiguity in a New York Times story caught my eye:

But in the wake of the financial crisis, attendance at the [Museum of American Finance] ― located at 48 Wall Street, near the epicenter of last year’s market collapse ― has risen to about 200 visitors a day, nearly double its tally last summer. (The Metropolitan Museum of Art averages that many visitors almost every 90 seconds.)

Quiz: does the Met average more or less than 200 visitors every 90 seconds?

If I were to read, "The winner ate 200 hot dogs in almost 90 seconds", my reaction would be, (1) "Gross!", and (2) "It took slightly longer than 90 seconds to eat all those hot dogs."

But in the museum example I interpreted the meaning (after a pause) in the opposite way, as 200 visitors arriving in slightly less than 90 seconds.

What struck me as interesting is that I realized that, during that pause, I was thinking about how the sentence wasn't written — specifically, in the unambiguous form "[the Met] averages almost that many visitors every 90 seconds" — and then deciding that the ambiguous version in front of me must mean the opposite.

I realize there are some problems with this analysis. For example, the author could have written "[the Met] averages more than that many visitors every 90 seconds" but this didn't occur to me as the 'unambiguous' counterpart to what I read.

My main point is that I noticed my mind trying to resolve ambiguity not by taking apart the sentence that was there, but by comparing it to the sentences that weren't.

— Max Heiman

Bloggingheads: Of Cronkiters and corpora, of fishapods and FAIL

Language Log - Sat, 2009-08-22 10:51

My brother Carl, a science writer who blogs over at The Loom, has a regular gig on Bloggingheads.tv, interviewing science-y folks for "Science Saturday." For Carl's latest installment, the Bloggingheads producers suggested he interview me about lexicography and other wordy stuff. Many of the topics we cover, from lexical blends to snowclones, will be familiar to readers of Language Log and my Word Routes column on the Visual Thesaurus. So here is our nepotistic "diavlog" for your enjoyment. (Diavlog is a second-order blend, by the way: it blends dialog and vlog, with the latter element representing a blend of video and blog. Or make that third-order, since blog blends Web and log.)

Modals of life and death

Language Log - Sat, 2009-08-22 10:25

Rope 'may have saved girl', said the headline in the Metro alongside a photo of pretty 21-year-old British tourist Emily Jordan, and I felt my heart leap with new optimism. I had read the previous day that Emily had been trapped under water while riverboarding on vacation in New Zealand, and the story had said that although her river guide had been saved, poor Emily had drowned. Now it seemed that was inaccurate: she survived, and it may have been rescue ropes that saved her! But no, reading the full story confirmed again that she was dead. What had gone wrong with my interpretation process?

The answer is that in my variety of Standard English the modal verb may, which has the present tense form may and the preterite form might, absolutely must be in the preterite form to convey either a past time reference sense (John thought he might join us, but it didn't work out that way) or the counterfactual "remote conditional" sense (If you offered me money I might take it). The story made it clear that it is now thought that rescue ropes might have saved Emily if they had been available. I can interpret Ropes might have saved her — that is, as a remote conditional (the apodosis of a conditional claim with the protasis clause implicit: it would have been possible for ropes to save her (if they had been available) but they didn't). But for me, the sentence Ropes may have saved her cannot have the counterfactual sense of the remote conditional: it means either "Possibly ropes have saved her" or "She has been saved, possibly by ropes."

But that's me. There is a slowly growing tendency for other Standard English speakers to use may both for past time reference (%John thought he may join us but it didn't work out that way — the prefix % is used to mark a sentence on which there are divided opinions within a dialect concerning the grammaticality of the example) and in a remote conditional context, as in the headline just discussed.

The usage in the newly developing subdialect indicates a separation of may from might, and perhaps a very slow eroding away of the latter. (Might already sounds a bit pompous or 20th-century to many young American speakers.)

I am well aware of the trend the new subdialect represents. It is carefully documented by Rodney Huddleston in The Cambridge Grammar of the English Language (chapter 3, section 9.8.4, pages 202-203). But this time the headline-writer's use of it threw me. I thought poor Emily might still be alive, but it didn't work out that way.
