Saturday, 18 December 2010

Google Books N-gram - wow!

Google just blew my bibliographic socks off.

Geoff Nunberg at Language Log (see Humanities research with the Google Books corpus) just posted news and some links concerning the Books N-gram Viewer that just went live.

I've enthused previously about the power of Google Books to hack into historical texts in a way that would have been impossible less than a decade ago. The Books Ngram Viewer adds to this facility with a powerful search interface that accesses a humungous corpus of texts (the English one, for instance, covers 360 billion words) and can graph, singly or in comparison, normalised frequencies. The possibilities are immense. As the Science research article abstract says:
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
- Quantitative Analysis of Culture Using Millions of Digitized Books, Michel et al., Science, DOI: 10.1126/science.1199644
What this actually means, even at a trivial level, is that anyone can do linguistic studies that would have taken years (or even been impossible). You can chart the continuous decline of "whom" in British English over nearly two centuries. You can compare changes in acceptable usage over time: for instance, Mohammedan vs Moslem vs Muslim or Esquimaux vs Eskimos vs Inuit. You can get long-term statistics for inflected vs. periphrastic versions of adjective comparisons: e.g. "pleasanter" vs. "more pleasant". You can plot the fate of variant spellings: such as how "focused", originally a minority spelling, overtook "focussed" around 1900 and came to dominate it. You can check the age of words: for instance, how people have been "holidaying" (often taken to be a neologism) since 1840. You can look at the history of coexisting forms, such as "none of us is" vs "none of us are" or "Devonshire" vs "Devon". This is a delight for lexicographical enthusiasts.
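For anyone wondering what "normalised frequency" amounts to in a comparison like "focused" vs "focussed", here is a minimal Python sketch. It assumes the simplest reading, occurrences of a word in a year divided by the total words printed that year; the function names and all the numbers are invented for illustration and are not taken from the viewer or the paper.

```python
from typing import Dict, Optional

def normalised_frequency(counts: Dict[int, int], totals: Dict[int, int]) -> Dict[int, float]:
    """Occurrences per year divided by total words printed that year."""
    return {
        year: counts.get(year, 0) / total
        for year, total in totals.items()
        if total > 0  # skip years with no text at all
    }

def first_year_ahead(a: Dict[int, float], b: Dict[int, float]) -> Optional[int]:
    """First year (in order) where series a strictly exceeds series b."""
    for year in sorted(a):
        if a[year] > b.get(year, 0.0):
            return year
    return None

# Toy data, made up purely for illustration.
totals = {1850: 1_000_000, 1900: 2_000_000, 1950: 4_000_000}
focused = {1850: 10, 1900: 60, 1950: 400}
focussed = {1850: 30, 1900: 60, 1950: 100}

freq_focused = normalised_frequency(focused, totals)
freq_focussed = normalised_frequency(focussed, totals)
overtake_year = first_year_ahead(freq_focused, freq_focussed)
print(overtake_year)
```

Dividing by the yearly total is what lets you compare years fairly: far more books were printed in 1950 than in 1850, so raw counts alone would make almost every word look as if it were on the rise.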

The setup isn't perfect. As I and others have mentioned, bad metadata and OCR errors can be a problem. For example, Mark Liberman's follow-up at Language Log - More on "culturomics" - mentions how attempts to trace the history of the word "fuck" in print (see graph) are confused by the "long s", so that pre-1820 you're actually finding occurrences of the word "suck" (like this). More fundamentally, though, once you get away from raw lexical observation and into sociological analysis - the "culturomics" part - it shouldn't be forgotten that frequency of appearance in books is merely a proxy for the multiple social factors driving that frequency. It would be, for instance, an unreliable conclusion that the British have steadily become less interested in love over the past two centuries because the word's appearance in print has more or less continuously declined.
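The long-s confusion is easy to reproduce. The Python sketch below is an illustration only: it assumes a deliberately simplified typesetting rule (every "s" except a word-final one is printed as a long s) and a naive OCR that reads every long s as "f". Real printers' conventions had more exceptions, and the function names are mine, not anything Google uses.

```python
LONG_S = "\u017f"  # the long s character

def with_long_s(word: str) -> str:
    """Simplified old-style typesetting: every 's' except a word-final one
    becomes a long s. (Assumption: real conventions had more exceptions.)"""
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", LONG_S) + last

def ocr_misread(word: str) -> str:
    """Naive OCR that mistakes every long s for an 'f'."""
    return word.replace(LONG_S, "f")

def as_indexed(word: str) -> str:
    """What a pre-1820 printing of `word` might end up indexed as."""
    return ocr_misread(with_long_s(word))

for word in ["sail", "case", "books"]:
    print(word, "->", as_indexed(word))
```

Under these assumptions "sail" gets counted as "fail" and "case" as "cafe", which is exactly the kind of phantom early usage to be suspicious of in any graph that reaches back before the long s fell out of use.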

Nevertheless, searches I've tried often reveal striking patterns, even if they may be inexplicable. Why the seemingly cyclic book references to red sunsets? Why have references to Sherlock Holmes steadily grown over the 20th century? Why do occurrences of the word "fat" rise steadily from 1840 to peak in the late 1870s? What do the peaks in references to opium mean? (This one can be partially answered: two of them coincide with the Opium Wars.) Why are there two 19th century peaks for "Batman"? (It seems to be a confluence of coverage of people with that surname, notably John Batman.) Does the post-1960 rise in references to "Frankenstein" mean anything culturally, or does it just reflect the success of particular movies? I have a feeling I'm going to be making a lot of use of this.

The Guardian has a more general piece on it here: Culturomics and the new Google tool for tracking cultural trends. See also the official Culturomics site.

Addendum: the paper Quantitative Analysis of Culture Using Millions of Digitized Books (Science DOI: 10.1126/science.1199644) - free registration with Science is required - is well worth reading. It mentions some highly interesting areas including:
  • The recent massive growth in the English lexicon (over 70% during the last 50 years).
  • The trade-off of dictionaries in balancing comprehensiveness and conciseness, with the result that over half of the English lexicon comprises "dark matter" that doesn't appear in dictionaries.
  • The ability to track trends in the regularisation of verbs, such as the shift from "-nt" endings to "-ned" (e.g. "burnt" to "burned").
  • The characteristic trajectories of appearance in print as a proxy of fame.
  • Detection of censorship by non-appearance in print: notably the absence from German texts of individuals identified as undesirables under the Nazi regime.
  • Culturomics - the identification of "fossils" of cultural trends through print frequency (e.g. "influenza" being mentioned a lot in print at the time of known pandemics).

The epidemiology example illustrates an important limitation to "culturomics". As quoted in Wired:

Patterns that can be queried from its cloud are not necessarily answers unto themselves, they say, but a way of illuminating subjects for further investigation.

"It’s not just an answer machine. It’s a question machine," said study co-author Erez Lieberman-Aiden, a computational biologist at Harvard University. "Think of this as a hypothesis-generating machine."
- Cultural Evolution Could Be Studied in Google Books Database, Wired, Dec 16th 2010

A look at the references to "cholera" shows peaks that may correspond to epidemics, but the largest, in the mid-1880s, more likely corresponds to the topicality of Robert Koch's isolation of Vibrio cholerae in 1884.

Addendum: discussion at Language Log - see True Grit isn't true - highlighted another significant problem with the setup.  For some reason (maybe to do with OCR, indexing, tokenization or the search interface) Google Books N-gram Viewer seemed to underestimate by three to four orders of magnitude (!) occurrences of forms with apostrophes. This made it useless for examining historical occurrences of contractions in English. Correction: see Google n-gram apostrophe problem fixed.

Ray

5 comments:

  1. JSB> "Google just blew my
    JSB> bibliographic socks
    JSB> off."

    [laughter] now that's an opening line that takes some topping!! :-)

    I see what you mean, though ... it's an amazing resource, however flawed.

  2. I'm even more interested, by the way, in the access to raw data...

  3. I'm even more interested, by the way, in the access to raw data...

    Agreed. Much as I like the public version, the lack of ability to incorporate contextual information makes it nearly impossible to use for some of the "culturomic" exercises claimed, such as the epidemiology example they gave.

    For example, theoretically peaks in text frequency for, say, "Lurgi" could correspond with undocumented Lurgi outbreaks - but only a pandemic major enough to show up in English overall. If you're looking for historical outbreaks of Lurgi in Devon, there's no way to narrow the search (maybe via metadata filtering) to that context.

  4. What's making me crazy is the number of anachronisms, for want of a better word. Dates are quite frequently erroneous (it seems magazines may all be grouped by the date of first publication???). Unless the text is previewed there is no way to confirm the accuracy of the claim that the publication was of that date. I just found a usage of "Obama" that says it's 1965 but discusses Obama's doings as a Senator... Precision is impossible and we are left having to hit the print books again...

  5. Yes: bad metadata is a long-standing problem with Google Books. This interface just makes it more explicitly visible. (The metadata was in many cases inherited from library data, so Google isn't entirely to blame).
