Monday, 17 October 2011

Brigham Young corpus tools

I've made a lot of use of the Google Books Ngram Viewer since it was launched: it's an excellent system for searching the whole Google Books corpus (for example, for finding when a word first came into the language). It has some limitations: occasionally poor metadata and OCR errors; the inability to select context (e.g. a search comparing the verb forms "smelled" and "smelt" will be contaminated by the fish called "smelt"); the purely graphical interface; the not-very-useful automated choice of time slots for the search links it generates; and the failure to show words at all if the frequency is too low.

Many of these problems have been overcome in the extremely nice unofficial interface to the Google corpus designed by Mark Davies, Professor of Corpus Linguistics at Brigham Young University: the Google Books American English Corpus ("155 billion words, 1810-2009"). It's very powerful and versatile:

This improves greatly on the standard n-grams interface from Google Books. It allows users to actually use the frequency data (rather than just see it in a picture) ...
This interface allows you to search the Google Books data in many ways that are much more advanced than what is possible with the simple Google Books interface. You can search by word, phrase, substring, lemma, part of speech, synonyms, and collocates (nearby words). You can copy the data to other applications for further analysis, which you can't do with the regular Google Books interface. And you can quickly and easily compare the data in two different sections of the corpus (for example, adjectives describing women or art or music in the 1960s-2000s vs the 1870s-1910s).

It's disappointing that it's currently limited to US English, but this is early days.

Note however that what you see here is just a very early version of the corpus (interface), and many features will be added and corrections will be made over the coming months. Also, in June 2011 we applied for a grant to integrate other Google Books collections into our interface, including British English, English texts from the 1500s-1700s, and texts from Spanish, German, and French. If funded (we'll receive word on this in December 2011), each of these additional corpora will be at least 50 billion words in size.

Professor Davies's other corpora, though considerably smaller, are still worth exploring. They include: Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), TIME Magazine Corpus of American English, BYU-BNC: British National Corpus, Corpus del Español, and Corpus do Português. See the main entry page: CORPUS.BYU.EDU.

- Ray (via Language Log)

1 comment:

  1. Great stuff ... I shall toddle over there immediately! :-)