Quizzaciously. You have almost certainly heard of this word in the last 10 days. Michael Stevens, creator of the popular Vsauce YouTube channel, posted a video exploring Zipf’s Law, the Pareto Principle, and how often specific words occur in language and literature. Why word frequencies follow such a consistent pattern remains one of the most fascinating unsolved mysteries of human language.
According to George Zipf, the second most-used word in a language occurs half as often as the most-used word. The third most-used word occurs one third as often as the most-used word. This pattern continues to the end of the list.
More precisely, if x is the frequency of the most popular word, and r is a word's rank in order of popularity (1 for the most popular, 2 for the second, and so on), the expected frequency of the word at rank r is x/r. This pattern shows up in virtually any piece of spoken or written language, provided it is long enough to have repeating words. Even stranger, the pattern seems to hold for every language it has been tested on.
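The x/r rule above can be sketched in a few lines of Python. This is only an illustration: the function name `zipf_expected` and the top count of 1000 are our own assumptions, not figures from the video.

```python
def zipf_expected(top_count, ranks):
    """Expected count at each rank under the simple x/r form of Zipf's Law."""
    return [top_count / r for r in range(1, ranks + 1)]


# Assumed example: the most popular word occurs 1000 times.
expected = zipf_expected(1000, 5)
print([round(value, 1) for value in expected])
# [1000.0, 500.0, 333.3, 250.0, 200.0]
```

Each value is just the top count divided by the rank, so the second word lands at half and the third at a third of the first.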
Zipf’s Law seems to hold true for literature, casual conversations, tweets, really any sizable body of text. As you would guess, there are words that appear only once in an author’s lifetime of work, or even in all the written work of an entire language. Since we figure out the meaning of a word from context, the meanings of these seldom-used words are difficult to pin down. Such words are called hapax legomena (singular: hapax legomenon). Michael Stevens takes the word quizzaciously as his example of a hapax legomenon, one of the least-used words in English. Virtually nobody knew the meaning of this word 10 days ago. There was only one Google search result for the word, the Oxford English Dictionary definition: “in a mocking way”. If not for Michael Stevens, quizzaciously might have faded out of the English language and been forgotten.
Since italki is filled with language enthusiasts, Vsauce’s video was a big deal to us. In fact, we decided to test Zipf’s Law on our own body of written content: our language learning articles. We collected all 140 articles published on italki in English into one giant document and generated a word frequency list. We then compared the 20 most frequent words in our English article corpus against the curve predicted by Zipf’s Law.
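For readers who want to try this on their own text, here is a minimal sketch of how a word frequency list could be generated. It is not the exact script we used; the function name `word_frequencies`, the simple letters-and-apostrophes tokenizer, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=20):
    """Return the top_n most common words in text as (word, count) pairs."""
    # Assumed tokenizer: lowercase the text and keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Tiny made-up sample; a real test needs a much longer document.
sample = "the cat and the dog and the bird"
for rank, (word, count) in enumerate(word_frequencies(sample, 3), start=1):
    print(rank, word, count)
```

A real corpus would call for more careful tokenizing (curly apostrophes, hyphens, numbers), but the counting step stays the same.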
The list includes some of the words that Michael Stevens ranks among the most used words in the English language (“the”, “to”, “a”, “you”, “and”, “of”). Amusingly, the words on our list that don’t match Michael’s frequency graph for the language as a whole are “English”, “can”, and “language”.
You must have noticed by now that something about the graph doesn’t look quite right. The number of occurrences of the second-most used word (“to”) isn’t actually one-half of the number of occurrences of the most used word (“the”). Stranger yet, the third-most used word (“a”) occurs more than one-third as often as “the”. On the surface, it looks like Zipf’s Law doesn’t hold. Did we disprove it? Did the teachers writing articles for italki break Zipf’s Law?
Hold on a second before you furiously storm off to the comments section to tell us we’re wrong. Looking at the data set, the curve formed by all the words after the most popular one falls roughly in line with expectations. If we doubled the count of “the”, the frequencies of all the other words in our list would suddenly resemble the curve predicted by Zipf’s Law. Still, that seems like fudging the data to fit an expected result.
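One way to see this kind of mismatch without a graph is to multiply each word's count by its rank. Under the x/r rule every product should come out close to x. The sketch below uses made-up counts chosen only to mimic the shape described above; they are NOT our actual data, and the name `implied_top_counts` is hypothetical.

```python
def implied_top_counts(counts):
    """Multiply each count by its rank; under x/r every product equals x."""
    return [count * rank for rank, count in enumerate(counts, start=1)]


# Made-up counts for illustration only, not italki's real numbers.
hypothetical_counts = [600, 590, 400, 300, 240]
print(implied_top_counts(hypothetical_counts))
# [600, 1180, 1200, 1200, 1200]
```

In this invented example, ranks 2 through 5 agree on a top count of about 1200, while the first word shows only 600: the rest of the curve is consistent, and only the top word looks too low.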
Stumped yet? We were.
Then, we did a quick poll of our staff comparing Spanish, French, Chinese, German, Italian, Russian, Ukrainian, Tagalog, and Basque. Not a single one has a direct, one-to-one translation for “the”. Some languages have no equivalent at all; others have similar words that change based on the gender of the object described. A large number of our articles, though in English, are written by non-native speakers, who may use “the” less often than native speakers would. This seems to be a reasonable explanation for the discrepancy in our results. Taking into account the multilingual nature of our source texts, it looks like we didn’t somehow break Zipf’s Law, language, or human cognition.
The Zipf distribution seems to show up in a wide range of phenomena in the world, including human language. The question is: how is this possible? As Michael Stevens points out, language is extremely personal, as individuals have unique thoughts, experiences, and impulses. Our inner reality is vastly complex, formed by our work, thinking, and existential accident.
Likewise, every culture and language is a unique system of words, images, notions and meanings. The vocabulary, phraseology, metaphorical expressions and grammar are all distinctive, unique in their existence.
Numbers, however, follow a similar pattern. Closely related in spirit to Zipf’s Law is Benford’s Law, which describes how often each digit appears as the leading digit of numbers in many real-world data sets. “1” is the most frequent, leading about 30% of the time; “2” leads about 18% of the time, “3” about 12.5%, and so on down to “9” at under 5%. This pattern seems to be descriptive of the very nature of the universe.
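Benford’s Law is usually stated as a formula: the chance that a number’s leading digit is d equals log10(1 + 1/d). A short sketch makes the distribution concrete; the function name `benford_probability` is our own choice.

```python
import math


def benford_probability(digit):
    """Probability that a number's leading digit is `digit` under Benford's Law."""
    return math.log10(1 + 1 / digit)


# Print the expected share for each possible leading digit, 1 through 9.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine probabilities add up to 1, running from roughly 30.1% for “1” down to roughly 4.6% for “9”.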
Though we at italki have figured out our own little mystery about word frequency in our learning articles, the mystery of Zipf’s Law remains unsolved. There is no doubt that Zipf’s Law is a fascinating phenomenon, and that a lot of work remains before we have a thorough explanation.
Regardless of the why, the what remains seemingly etched within the fabric of the universe, the human brain, and language. These are the kinds of things that inspire us and keep us striving for more knowledge. As you do your life’s work of exploration, thought, and learning, we hope you keep the sense of mystery, curiosity, and wonder close. We hope you apply the same attitude to learning languages as well.
Following the Pareto Principle, learning the most popular 20% of words in a language gets you roughly 80% of the meaning. This is yet another concept mentioned by Vsauce: about 20% of the words make up about 80% of what is being said or written. Learning the 100 most popular words gets you a lot more in terms of language comprehension than worrying about the obscure and the rarely used.
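You can check this kind of claim on any text by measuring how much of it the top few words cover. The sketch below is a rough illustration; the function name `coverage` and the tiny sample are assumptions, and a real test would need a much larger corpus.

```python
from collections import Counter


def coverage(words, top_n):
    """Share of all word occurrences accounted for by the top_n most common words."""
    counts = Counter(words)
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total


# Made-up sample: "the" (3) and "and" (2) cover 5 of the 8 words.
words = "the cat and the dog and the bird".split()
print(coverage(words, 2))  # 0.625
```

On a large corpus, raising `top_n` step by step shows how quickly a small vocabulary starts covering most of what you read.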
After the initial bulk of learning, you are no longer trying to acquire a brand new language; rather, you’re just improving your vocabulary. What’s more, there are lots of words at the bottom of the ranked list of words. These words are obscure. Most native speakers didn’t know what quizzaciously meant until only a few days ago. Words like these are as foreign to the native speaker of your target language as they are to you.
Don’t be afraid of making mistakes. We are all perpetual language-learners, and we will never know any language perfectly, not even our native language. Keep learning, keep exploring.