Ngram Fun

by admin on February 21, 2011

Ok, I’ve been waiting to do this for a couple months.

Today I finally checked out Google’s “Ngram Viewer,” which lets you see how the frequency of use of various words or phrases in the English language (and other languages) has changed over the years. The frequencies come from the huge collection of books that Google has digitized.

First, to give you a general idea, here is the Ngram of “chess” versus “poker.”

(Sorry about the tiny print. “Chess” is in blue and “poker” is in red. Vertical lines mark the years 1910, 1930, 1950, 1970, and 1990.)

Interestingly, the words “chess” and “poker” were about equally common until about 1948, but then chess became more popular while poker became less so. That lasted until a little before the year 2000, when the popularity of poker skyrocketed. Now the two words are about equally frequent again.

What’s scary is to do the same graph in Russian.

As you can see, the word “poker” was almost non-existent in Russian — until the last year for which data is available, 2008, when “poker” shoots up and becomes almost as common as “chess.”

The day poker passes chess in popularity in Russia will be a very sad day.

Some other chess-related word searches: Which is more common, “checkmate” or “stalemate”? Let’s find out.

I was very surprised by this. In the very early years of the twentieth century, “checkmate” (blue) was more common than “stalemate,” and then they were about even through 1930. But then during the Second World War and the Cold War, “stalemate” became vastly more common than “checkmate,” appearing up to 8 times more often in the Google book database. After the end of the Cold War in 1990, “stalemate” started to decline, and it is now only about 3 times as common as “checkmate.”

Of course, the reasons, I believe, have nothing to do with chess!

Here is another example that is quite influenced by non-chess factors. Which occurs more often, “white” or “black”? In chess books, I suppose they appear just about equally. Not so much in the language as a whole …

Here, “white” is in red and “black” is in blue. (This is confusing, I know.) As you can see, “white” had a significant lead until about 1970, but only a very slight lead since then. The next figure illustrates very vividly, I think, what happened in the five years between 1965 and 1970.

Same graph as before, only now I have added the frequency of the word “red” (in green). What we see is quite remarkable. From 1910 to 1965, both “red” and “black” were just colors, and their frequency was virtually identical. But in the last half of the 1960s, with the civil rights movement, “black” took on a new meaning: “African-American.” I think that the green curve is an excellent proxy for what the frequency of “black” would have been if “black” had remained just a color, and the difference between the blue and green is an excellent measure of how frequently “black” is used to mean “African-American.”

It’s amazing to see language change before your eyes!

Finally, it was interesting to investigate the frequency of various chess openings. But I ran into a slight problem, as you’ll see.

The blue curve is “Ruy Lopez,” red is “French Defense,” and green is “French Defence.” As you can see, the British and American spellings of “Defense/ce” are split rather evenly. From about 1980 to 2000, “Ruy Lopez” was running neck and neck with both of them, which really means that “French Defense/ce” was twice as common as “Ruy Lopez.” Unfortunately, the Ngram Viewer website doesn’t let you combine two spellings into one curve. (They do let you download the raw data, so presumably it would be possible to do this on my own, but it would take a lot more time than I am prepared to spend …)
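For anyone who does want to try it, here is a minimal sketch in Python of what merging two spellings might look like. It assumes the downloaded data has already been parsed into (ngram, year, match count) rows plus a table of total counts per year; that layout, the function names `combine_spellings` and `to_frequency`, and all of the numbers are assumptions for illustration, not real Ngram Viewer data or a description of the actual file format.

```python
from collections import defaultdict


def combine_spellings(rows, variants):
    """Sum per-year match counts over several spellings of one term.

    rows: iterable of (ngram, year, match_count) tuples (assumed layout).
    variants: set of spellings to merge into a single curve.
    """
    combined = defaultdict(int)
    for ngram, year, match_count in rows:
        if ngram in variants:
            combined[year] += match_count
    return dict(combined)


def to_frequency(counts, totals):
    """Divide each year's count by that year's total, skipping years with no total."""
    return {
        year: counts[year] / totals[year]
        for year in counts
        if totals.get(year)
    }


# Made-up placeholder numbers, NOT real Ngram Viewer data.
rows = [
    ("French Defense", 1990, 30),
    ("French Defence", 1990, 28),
    ("Ruy Lopez", 1990, 31),
    ("French Defense", 1991, 33),
    ("French Defence", 1991, 29),
]
totals = {1990: 1_000_000, 1991: 1_000_000}

merged = combine_spellings(rows, {"French Defense", "French Defence"})
print(to_frequency(merged, totals))
```

Adding the raw counts first and dividing by the yearly total afterwards gives the same result as adding the two plotted frequencies, since both spellings share the same denominator for a given year.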

Even if I had the inclination to do that, I’m not sure the data is completely trustworthy! I think that there are enough occurrences of “French defense” in the titles of books or the titles of chapters or tables within books (where the word “defense” would be capitalized) to seriously distort the results. As evidence, look at the different frequencies of French and Sicilian defenses:

The terms “French Defense/ce” are in blue and red, while the terms “Sicilian Defense/ce” are in green and yellow. Surprisingly, there are many more references in print to “French Defense/ce” than to “Sicilian Defense/ce”, even though the Sicilian is more popular as a chess opening. I think that the “Sicilian” curves are bona fide cases of chess terminology, because no one is ever going to write about the Sicilian defense budget, for example. However, I think that the “French” curves probably contain more non-chess references than chess references.

Conclusion: Ngrams are lots of fun, but need to be interpreted with a great deal of skepticism.


2 comments

Marc March 10, 2011 at 2:09 pm

Whenever I hear “stalemate” in the popular media, I think they mean to say “zugzwang.” For example, “The city budget negotiations ended in stalemate today…” Did the city council really get to a point where there was no action they could legally take? No, they got to a point where everyone thought their options were so rotten that they would rather do nothing, which is actually a choice in itself. The nature of law is that it always gives you options for making your situation worse.


Jason March 22, 2011 at 8:02 pm

I’ve been having a blast with the Google ngram plotter. One I like is “cyberspace” vs. “Internet.” But there are hundreds of cool examples, and I recommend the original paper in Science for some super examples, including evidence that the Nazis purged certain artists from German literature during their rise to power (for some artists, this was already known, but this analysis highlighted new examples). The plot of the word “slavery” was also quite interesting, with a new peak in usage after the Civil Rights Movement.

