Category Archives: linguistics

How Many English Tweets are Actually Possible?

So, recently (last week, maybe?), Randall Munroe, of xkcd fame, posted an answer to the question “How many unique English tweets are possible?” as part of his excellent “What If” series. He starts off by noting that there are 27 characters (26 letters plus the space), and a tweet length of 140 characters. This gives you 27^140, or about 10^200, possible strings.

Of course, most of these are not sensible English statements, and he goes on to estimate how many of them are. This analysis is based on Shannon’s estimate of the entropy rate for English: about 1.1 bits per letter. This leads to a revised estimate of 2^(140 x 1.1) English tweets, or about 2 x 10^46. The rest of the post explains just what a hugely big number that is. It’s a very, very big number.
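If you want to check those two figures yourself, here is a minimal Python sketch. The constant names are mine, chosen for illustration; the numbers are the ones quoted above. It works in powers of ten so the results are easy to compare with the text.

```python
import math

ALPHABET_SIZE = 27   # 26 letters plus the space
TWEET_LENGTH = 140   # characters per tweet
ENTROPY_RATE = 1.1   # bits per character, the Shannon-based figure quoted above

# Every possible string: 27^140, expressed as an exponent of ten.
log10_all_strings = TWEET_LENGTH * math.log10(ALPHABET_SIZE)

# Entropy-based estimate: 2^(140 x 1.1), expressed as an exponent of ten.
log10_entropy_estimate = TWEET_LENGTH * ENTROPY_RATE * math.log10(2)

print(f"27^140 is about 10^{log10_all_strings:.1f}")
print(f"2^(140 x 1.1) is about 10^{log10_entropy_estimate:.1f}")
```

The first exponent lands just above 200 and the second between 46 and 47, which matches the figures in the post.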

The problem is that this number is also wrong.

It’s not that the calculations are wrong. It’s that the entropy rate is the wrong basis for the calculation.

Let’s start with what the entropy rate is. Basically, it asks: given a sequence of characters, how hard is it to predict what the next character will be? Or, put another way, how much information (in bits) is given by the next character above and beyond the information you already had?

If the probability of a character being the i-th letter in the alphabet is p_i, the entropy of the next character is given by

−Σ_i p_i log2(p_i)

If all characters (26 letters plus space) were equally likely, the entropy of the character would be log2(27), or about 4.75 bits. If some letters are more likely than others (as they are), it will be less. According to Shannon’s original paper, the distribution of letter usage in English gives about 4.14 bits per character. (Note: Shannon’s analysis excluded spaces.)
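To make the formula concrete, here is a small Python sketch. The function name entropy_bits and the skewed distribution are my own inventions for illustration; the skewed numbers are not real English letter frequencies.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 27 equally likely characters: the entropy equals log2(27).
uniform_27 = [1 / 27] * 27
print(entropy_bits(uniform_27))

# A made-up skewed distribution: one character takes half the probability,
# and the other 26 share the rest evenly.
skewed = [0.5] + [0.5 / 26] * 26
print(entropy_bits(skewed))
```

The uniform case gives the 4.75 bits mentioned above, and any unevenness in the distribution pulls the entropy below that.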

But, if you condition the probabilities on the preceding character, the entropy goes down. For example, if we know that the preceding character is a b, there are many letters that might follow, but the probability that the next character is a c or a z is less than it otherwise might have been, and the probability that the next character is a vowel goes up. If the preceding letter is a q, it is almost certain that the next character will be a u, and the entropy of that character will be low, close to zero, in fact.

When we go to three characters, the marginal entropy of the third character will go down further still. For example, t can be followed by a lot of letters, including another t. But, once you have two ts in a row, the next letter almost certainly won’t be another t.

So, the more characters in the past you condition on, the more constrained the next character is. If I give you the sequence “The quick brown fox jumps over the lazy do_,” it is possible that what follows is “cent at the Natural History Museum,” but it is much more likely that the next letter is actually “g” (even without invoking the additional constraint that the phrase is a pangram). The idea is that, as you condition on longer and longer sequences, the marginal entropy of the next character asymptotically approaches some value, which has been estimated in various ways by various people at various times. Many of those estimates are in the ballpark of the 1.1 bits per character estimate that gives you 10^46 tweets.

So what’s the problem?

The problem is that these entropy-rate measures are based on the relative frequencies of use and co-occurrence in some body of English-language text. The fact that some sequences of words occur more frequently than other, equally grammatical sequences of words, reduces the observed entropy rate. Thus, the entropy rate tells you something about the predictability of tweets drawn from natural English word sequences, but tells you less about the set of possible tweets.

That is, that 10^46 number is actually better understood as the inverse of an estimate of the likelihood that two random tweets are identical, when both are drawn at random from 140-character sequences of natural English language. This will be the same as the number of possible tweets only if all possible tweets are equally likely.

Recall that the character following a q has very low entropy, since it is very likely to be a u. However, a quick check of Wikipedia’s “List of English words containing Q not followed by U” page reveals that the next character could also be space, a, d, e, f, h, i, r, s, or w. This gives you eleven different characters that could follow q. The entropy rate gives you something like the “effective number of characters that can follow q,” which is very close to one.
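Here is a rough Python sketch of that distinction. The eleven characters are the ones listed above, but the probabilities are made up for illustration (I simply gave u 99% and split the rest evenly); they are not measured from any corpus.

```python
import math

# Hypothetical probabilities for the character that follows q.
followers = {"u": 0.99}
for ch in [" ", "a", "d", "e", "f", "h", "i", "r", "s", "w"]:
    followers[ch] = 0.001

# Entropy of the next character, in bits.
entropy = -sum(p * math.log2(p) for p in followers.values())

# "Effective" number of followers implied by the entropy.
effective = 2 ** entropy

# Number of followers that are actually possible.
possible = len(followers)

print(f"entropy: {entropy:.3f} bits")
print(f"effective followers: {effective:.2f}")
print(f"possible followers: {possible}")
```

With these invented numbers the effective count comes out only a little above one, while the possible count is eleven. Counting possible tweets is a question about the second number.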

When we want to answer a question like “How many unique English tweets are possible?” we want to be thinking about the analog of the eleven number, not the analog of the very-close-to-one number.

So, what’s the answer then?

Well, one way to approach this would be to move up to the level of the word. The OED has something like 170,000 entries, not counting archaic forms. The average English word is 4.5 characters long (5.5 including the trailing space). Let’s be conservative, and say that a word takes up seven characters. This gives us up to twenty words to work with. If we assume that any sequence of English words works, we would have 4 x 10^104 possible tweets.

The xkcd calculation, based on an English entropy rate of 1.1 bits per character, predicts only 10^46 distinct tweets. 10^46 is a big number, but 10^104 is a much, much bigger number, bigger than 10^46 squared, in fact.

If we impose some sort of grammatical constraints, we might assume that not every word can follow every other word and still make sense. Now, one can argue that the constraint of “making sense” is a weak one in the specific context of Twitter (see, e.g., Horse ebooks), so this will be quite a conservative correction. Let’s say the first word can be any of the 170,000, and each of the following zero to nineteen words is constrained to 20% of the total (34,000). This gives us 2 x 10^91 possible tweets.
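The word-level arithmetic is easy to reproduce with Python’s exact integers. This is only a sketch: the constant names are mine, and it counts full twenty-word tweets only, since shorter tweets add a negligible amount to the total.

```python
VOCABULARY = 170_000             # rough OED entry count used above
MAX_WORDS = 20                   # 140 characters at seven characters per word
ALLOWED_NEXT = VOCABULARY // 5   # 20% of the vocabulary, i.e. 34,000

# Any word may follow any other word.
unconstrained = VOCABULARY ** MAX_WORDS

# First word is free; each of the next nineteen is limited to 34,000 choices.
constrained = VOCABULARY * ALLOWED_NEXT ** (MAX_WORDS - 1)

# The entropy-based estimate, squared, for comparison.
entropy_estimate_squared = (10 ** 46) ** 2

print(f"unconstrained: {unconstrained:.1e}")
print(f"constrained:   {constrained:.1e}")
print(constrained < entropy_estimate_squared < unconstrained)
```

The final line checks that 10^46 squared sits between the constrained and unconstrained counts.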

That’s less than 10^46 squared, but just barely.

10^91 is 100 billion times the estimated number of atoms in the observable universe.

By comparison, 10^46 is teeny tiny. 10^46 is only one ten-thousandth of the number of atoms in the Earth.

In fact, for random sequences of six-letter words (seven characters including the space) to total only 10^46 tweets, we would have to restrict ourselves to a vocabulary of just 200 words.
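That 200-word figure is just the twentieth root of 10^46. A minimal Python sketch, with names of my own choosing:

```python
TARGET_TWEETS = 10 ** 46   # the entropy-based estimate
WORDS_PER_TWEET = 20       # twenty seven-character slots in 140 characters

# Vocabulary size V such that V^20 equals the target count.
vocabulary_needed = TARGET_TWEETS ** (1 / WORDS_PER_TWEET)

print(f"vocabulary needed: about {vocabulary_needed:.1f} words")
```

The result falls just under 200, so a 200-word vocabulary already gets you slightly past 10^46 twenty-word sequences.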

So, while 10^46 is a big number, large even in comparison to the expected waiting time for a Cubs World Series win, it actually pales in comparison to the combinatorial potential of Twitter.

One final example. Consider the opening of Endymion by John Keats: “A thing of beauty is a joy for ever: / Its loveliness increases; it will never / Pass into nothingness;” 18 words, 103 characters. Preserving this sentence structure, imagine swapping out various words, Mad-Libs style, introducing alternative nouns for thing, beauty, loveliness, and nothingness, alternative verbs for is, increases, and will / pass, alternative prepositions for of and into, and alternative adverbs for for ever and never.

Given 10,000 nouns, 100 prepositions, 10,000 verbs, and 1,000 adverbs, we can construct 10^38 different tweets without even altering the grammatical structure. Tweets like “A jar of butter eats a button quickly: / Its perspicacity eludes; it can easily / swim through Babylon;”

That’s without using any adjectives. Add three adjective slots, with a panel of 1,000 adjectives, and you get to 10^47, just riffing on Endymion.
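For anyone who wants to see where those exponents come from, here is a short Python sketch of the Mad-Libs count. The slot counts follow the word lists given above; treating “will / pass” as a single verb slot is my assumption, made so that the total matches the figure in the text.

```python
# Word-list sizes from the text.
NOUNS, PREPOSITIONS, VERBS, ADVERBS, ADJECTIVES = 10_000, 100, 10_000, 1_000, 1_000

# Slots per category (assumption: "will / pass" counts as one verb slot).
NOUN_SLOTS = 4         # thing, beauty, loveliness, nothingness
PREPOSITION_SLOTS = 2  # of, into
VERB_SLOTS = 3         # is, increases, will / pass
ADVERB_SLOTS = 2       # for ever, never
ADJECTIVE_SLOTS = 3    # the three added adjective slots

without_adjectives = (
    NOUNS ** NOUN_SLOTS
    * PREPOSITIONS ** PREPOSITION_SLOTS
    * VERBS ** VERB_SLOTS
    * ADVERBS ** ADVERB_SLOTS
)
with_adjectives = without_adjectives * ADJECTIVES ** ADJECTIVE_SLOTS

print(without_adjectives == 10 ** 38)
print(with_adjectives == 10 ** 47)
```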

So tweet on, my friends.

Tweet on.

C. E. Shannon (1951). Prediction and Entropy of Written English. Bell System Technical Journal, 30, 50-64.

Talk like a (Somali) pirate

So, today is International Talk Like a Pirate Day. Fortunately, over at Darwin Eats Cake, iBall just got back from a big trip, and he’s ready to help you out with all your talking-like-a-pirate needs.

Best URL for sharing: http://www.darwineatscake.com/?id=142
Permanent image URL for hotlinking or embedding: http://www.darwineatscake.com/img/comic/142.png

As it turns out, everything here actually exists as a fully formed translated phrase on one of the two following sites:

http://www.omniglot.com/language/phrases/somali.php

http://www.freelang.net/online/somali.php?lg=gb

All of which makes me think that this scenario must have come up before.

What’s the plural of "octopus"?

So, how do you refer to more than one octopus? “Octopuses”? “Octopi”? “Octopodes”? In case you’re uncertain which way to go, here’s a handy guide from Darwin Eats Cake, which you can print out for easy reference.

The text may be a little bit hard to read here, but you can view a higher-resolution version here. More discussion after the picture.

Best URL for sharing: http://www.darwineatscake.com/?id=113
Permanent image URL for hotlinking or embedding: http://www.darwineatscake.com/img/comic/113.jpg

Some of you may recall this video from Kory Stamper, who argues that “octopuses” is fine, as is “octopi,” as is “octopodes,” for that matter, although, as she says, if you’re going to use it, you’d better be prepared to explain and defend it.

I think that’s all dead on, with one small addition. Stamper argues that when a word is borrowed into English, it gets the standard English pluralization, hence “octopuses.” I feel like there actually is a living grammatical rule in spoken English, where you are allowed to pluralize a word ending in “us” by changing it to “i” provided that the word is long enough, and especially if the word sounds sort of foreign-ish.

Now, I mean “allowed to” in a descriptive, rather than a prescriptive sense. That is, I take the viewpoint that if I say a word (or a phrase, or use a grammatical construct, etc.), and most native English speakers understand that word in roughly the sense in which I meant it, then it’s a part of English, whether or not it follows a rule that has been codified in a book.

So, if I were talking about more than one Krampus, and I used the word “Krampi,” I think that most people would understand what I meant (assuming that they had heard of Krampus in the first place). On the other hand, if I drop by an elementary school and start talking to the children about the line of yellow schoolbi, I’m probably going to get arrested.

From where I stand, then, “octopuses” and “octopi” are both native English pluralizations. “Octopi” just happens to use a rule that came into English through an appeal by (prescriptive) grammarians to Latin. “Octopodes,” by contrast, will only be comprehensible to someone who either has studied Greek, or who has had this particular debate pointed out to them.

The poet in me feels the need, of course, to point out that there is no such thing as an exact synonym (blah, blah, blah). So, while “octopuses” and “octopi” both refer to more than one octopus, they don’t mean exactly the same thing. In particular, if I say “Look at the octopi,” I am really saying something like “Look at the more than one octopus, and, hey, I’m doing that Latin thing.” Whichever one you use, there are aspects of social positioning involved (maybe I want to look smart, or educated, or maybe salt-of-the-earth-ish, etc.), the details of which are going to depend a lot on the specifics of the social context in which you’re talking. Really, pluralizing “octopus” is the third rail of talking about cephalopods (cephalopodes?), in that there is no way to do it where someone in the room is not going to make an issue out of it.

I also feel like maybe I should clarify what I perceive to be the game in the dorky/sophisticated outcome. The goal is not necessarily to implement pluralization as it would be done in language X by a native speaker of language X. Rather, it is to take a simple pluralization rule from language X, remove it from its native context, and implement it in English, sort of like the Krampus / Krampi thing. It’s like trying to figure out how to pluralize something in a sort of Xglish (the language-X analog of Spanglish). For example, David Winter (@TheAtavism) points out that in Maori, one octopus would be “Te wheke,” while two or more would be “Nga wheke.” I take that to imply that the appropriate Maoglish plural of “octopus” would be “ngactopus,” which is pretty fun to say.

That being said, in addition to this Maori tidbit, I have already learned some cool stuff via Twitter responses to the cartoon. Here’s a sampling:

@symbolicstorage notes that the same ambiguity exists in German, where one might say “oktopusse” or “oktopi,” adding that “octopusen” is 100% wrong. But, you know, I don’t know about that. It only looks about 20% wrong to me. 30% at most. 100% wrong would be more like “farfegnugen.”

@BobOHara says that Finns would most often use the partitive form “octopusta,” rather than the plural “octopust.” I still don’t fully understand the distinction, but it seems that “octopusta” would best be translated something like “some octopus, like probably more than one, but I’m not going to count them right now, since I have better things to do, like participate in my world-leading public education system.”

And “Kraken-wrangler” @DrSeaRotmann suggests “octoposse,” which is the only plural I am going to use from this day forward.

Have you got more? How do you say “octopus” in your native (or secondarily learned) language? How do you refer to more than one? And, how would you create an English hybrid (Xglish plural) using that pluralization rule? Post in the comments, or send a note on Twitter (@jonfwilkins), and I’ll update the list!

Oh, and by the way, I forgot to include the French “octopeaux.”

Allophones: Linguistics Humor from Darwin Eats Cake

So, here are the last two Darwin Eats Cakes. They go together to form a sort of continuing story. It’s like a soap opera, except instead of people killing each other and having weird supernatural experiences, they engage in clunky set-ups for jokes about linguistics. Woo!

Best URL for sharing: http://www.darwineatscake.com/?id=57
Permanent image URL for hotlinking or embedding: http://www.darwineatscake.com/img/comic/57.jpg

Best URL for sharing: http://www.darwineatscake.com/?id=58
Permanent image URL for hotlinking or embedding: http://www.darwineatscake.com/img/comic/58.jpg

Douchebilly of the Magi

So, for Christmas my wife got me an Urban Dictionary mug with the definition of “douchebilly” on it:

A combination of a douchebag and a hillbilly. Not just a douchebag and not just a hillbilly but both! A Douchebilly! 

My ex-husband is a real douchebilly

Why, you ask? Two reasons. First, my wife is AWESOME! Second, this is a word that I made up at a bar, and a friend contributed to Urban Dictionary. It had its origin in an unbearably precious conversation about which Ivy League school was the douchiest. (If it occurs to you to ask, the answer is, “Whichever one you went to.”) I suggested “douchebilly” as the answer posed to the question (asked in reference to me), “What do you call someone with two degrees from Harvard who wears old jeans and cowboy boots?”

I’m telling you this because I hope that you will start using this word all the time, and that you will mail me a nickel every time you do.

You might be wondering, do we really need more words, especially one like “douchebilly”? If the only use for “douchebilly” was to describe me at a bar, well, then you could argue it either way. But I also think that there’s a real need for this word in American political discourse.

For reasons I do not fully understand, American politicians have to downplay their education, upbringing, and accomplishments, at least in certain contexts. If they are not able to do so, they risk losing the votes of people for whom it is critically important that their leaders be “like them.” Bill Clinton grew up poor in Arkansas. He went on to tremendous academic achievements, but maintained a folksy, southern manner that was critical to his political success. I suspect that this was a calculated decision on his part. George Bush was a fifth-generation Yalie. Sure, he grew up partly in Texas, but in incredibly privileged circumstances, and finished high school at Phillips Academy before going to Yale. The only way he talks like that is through deliberate construction. Even Barack Obama, while running for president, would periodically slide into this vaguely southern accent. Obama did not grow up in privileged circumstances, but the guy is from Hawaii, went to Columbia and Harvard, then moved to Chicago. What’s up with that intermittent accent, then?

And when I say, “I do not fully understand,” what I mean is that I am completely and utterly baffled by this. I don’t want the people in charge of running the country to be like me. I want them to be better than me in every possible way. Maybe if we started referring to politicians as “douchebillies” whenever they actively misrepresent their educational and economic status, we could encourage them to portray themselves more honestly.

Don’t misunderstand me. I am a linguistic relativist, and there is absolutely nothing wrong with a southern, Texas, or any other sort of accent. There is nothing inherent in any accent or dialect that indicates intelligence, or education, or the ability to ably lead. I am also sympathetic to the idea (although I don’t personally feel this way) of having the country run by people who are truly representative of the overall population. The fact is, however, that accents in America are not just regional, but correlate strongly with education level and socio-economic status.

What bothers me is that we have a system stocked with a lot of “elites,” as Fox News likes to say, but elites who pander to the public by pretending to be un-elite. Some are self-made, but many were born into privilege. I would love to see more people in government who are intelligent and hard-working, but who are not obscenely wealthy, and do not come from privileged backgrounds. There also seems to be a desire among the electorate to vote for such people. I’m not sure we’re going to get a lot of them, though, so long as all you have to do to come off as a “man of the people” is drawl a little bit.

Like any good American, I have only a passing familiarity with the politics of other countries, so I do not know how widespread this phenomenon is. I am heartened, however, by the recent election in Brazil of Tiririca, a 45-year-old television clown. Not television clown like Glenn Beck, but television clown like Bozo. He ran on a campaign with slogans like: “What does a federal deputy do? Truly, I don’t know. But vote for me and I will find out for you.” After being elected to congress, Tiririca had to take a literacy test, which he passed after displaying “a minimum of intellect concerning the content of a text despite difficulties in writing.”

Tiririca – NOT a douchebilly – Just awesome.

Humanity as an emergent property: of douches and douchebags

So, why is it an insult to call someone a douchebag?

The “douche” part is easy. Anything associated with the crotchal region, however tangentially, eventually makes its way into the lexicon as an insult. These terms are insults because they carry a connotation of being “dirty,” physically and/or morally. However, I think that most people would agree that calling someone a douchebag is a step up in the degree of insult, and that is what is curious. If we naively try to interpret its meaning based on what a douchebag actually is, it should be, if anything, a milder insult, if not a compliment, as it is an implement with the specific function of remedying the very dirtiness that is the basis of any crotch-related insult.

Obviously, that is not the right way to parse “douchebag.” When we call someone a douchebag, we are really calling them a douche + bag. It works through analogy to the other “-bag” insults, like dirtbag or scumbag. Suffixes like “-bag,” “-sack,” and “-bucket” function as intensifiers, or perhaps insultifiers. They can be productively added to just about anything, and the result sounds more insulting. “Jerkbag” sounds worse than “jerk.” In many contexts (e.g., volleyball), calling someone a “wall” might be a compliment. Calling someone a “wallsack” would be mostly confusing, but could probably safely be interpreted as an insult.

I think that these modifiers implicitly deny the emergent humanity of the insultee, by reducing them to no more than the sum of their constituent parts. Stick with me here. The fact is, the very thing that makes us human, or alive for that matter, is not the components that make us up, but rather the complex relationships among those components. To quote Patty Loveless, “Break us down / to our elements, / and you might think He failed. / We’re not copper for / one penny or / even iron for one nail.” [1] The critical attribute of a bag or a sack is that its contents are disordered, and that complex functional or structural relationships among those contents are unlikely to exist. If someone calls you a douchebag, they are both calling you a douche and implying that your value is not greater than the sum of your parts.

What I like is that the productive family of suffixes “-bag,” “-sack,” etc. implies a sophisticated, if unconscious, understanding that the essence of our humanity is a classic emergent property, not derivable from a simple summation of our human components. For those of you who aspire to greater explicitness when insulting people (I know you’re out there), let me suggest the following variant next time you are thinking of calling someone a scumbucket: “high-entropy scum.”
——————————————————-
[1] This is not technically true anymore, since the introduction of copper-plated zinc pennies in 1982. According to Wikipedia, a penny weighs about 2.5 grams, and is now about 2.5% copper, which means that it takes about 62 mg of copper to make a penny. According to the Copper Development Association, the human body contains perhaps 100-200 mg of copper, depending on the size of the particular human body in question. So, we are, in fact, copper for one penny, but most of us are not copper for four pennies. In Patty’s defense, the song was released in 1994, and, given that coins can commonly last for 25 years or more, the majority of the pennies in circulation at the time may well have been of the pre-1982 variety, containing 15 to 20 people worth of copper.