The Benford’s Law Hustle™

Posted by Noah Stephens-Davidowitz on March 9, 2012

Hi, poker blog readers! I was going to post this on my new nerd blog, and maybe I’ll do that as well. But, I figured that poker players might like this, even if it’s not strictly related to poker.

Basically, this is the simplest, most convincing +EV prop bet that I know of. In other words, it’s an awesome hustle. I’ve occasionally imagined walking into a poker room, finding the nearest guy with a visible newspaper or smart phone, and leaving with his money in my pocket, but unfortunately, nerds don’t make good hustlers.

Here’s how you do the Benford’s Law Hustle™:

Walk into a casino and find one of a very large class of (pseudo)random number generators that roughly satisfy Benford’s law. (Don’t cheat and google that yet.) You can pick random numbers from newspaper articles (Turn to page A5 and find the first number in the first article) or something similar. Tons of things will work; just about the only things that won’t will be numbers from casino games. (E.g., dice and keno boards are no good.)
Find someone who’s willing to bet on the first digit of the random number. (The first digit of 2,458,193 is two. Nothing fancier than that.)
Bet on one and two.
Give your opponent both eight and nine.
Lay two to one… (E.g., you pay him $200 if the number is 8,283 or 9,722, and he pays you $100 if the number is 10,136 or 2. If it’s 637, then no money changes hands.)
Make an absurd profit.

Wait… what?

Since you’re laying 2:1 and your opponent/sucker gets just as many numbers as you, he’ll obviously agree to the bet. But, it turns out your edge on this bet is about $28.30 if you bet $200 to his $100. (That’s the edge that you get per number drawn, not per time that money changes hands. It’s absolutely massive.)

Wait… what?


Playing with Wikipedia

Do a quick experiment for me: Open a random article in Wikipedia by clicking here. (I didn’t doctor that link or anything, but if you don’t trust me, feel free to navigate to Wikipedia the old-fashioned way. The random article link can be found on the left of any article’s page.) Find the first number in the article and remember its first digit. (If there isn’t a number, go to another random article.) Repeat this three times. The vast majority of you will have seen a one or a two. Only about one in six of you will have seen an eight or a nine.

Weird, huh?

However, there’s also a pretty good chance that you noticed a gimmick when you did that, and you might think you figured this out. If you didn’t, check out the Wikipedia page for the Nobel Prize in Physics. The first number in that article is 1895, which does indeed start with a one, but, it’s a year. Wikipedia probably references a lot more years after the year 1000 than before the year 1000, so of course years appearing on Wikipedia are going to start with a one or a two a lot more often than other digits. Or, you might have seen something like the article for Lorentz, whose first number is eighteen because Lorentz was born on July 18. Numbers like this can only be between one and 31, so obviously there are a lot more of them that start with a one or a two (22 for all months except for February and 21 for February) than an eight or nine (just one of each). Obviously, dates are pretty common things to find in Wikipedia (or newspapers, etc.).

So, you’re in a casino. You’ve hustled some poor sucker out of a bet or four before he figures that out. That’s nice and all, but it’s hardly earth-shattering.

Now comes the fun part: Tell your new friend/enemy/sucker that you’ll give him the same bet, but now dates don’t count (neither the 1987 kind nor the March 12th kind), AND you’ll keep laying 2:1 and keep giving him both eight and nine, but now you only want the number one for yourself. Then, simply enjoy your profit.

Wait… what?

The Numbers

So, at this point, I guess I should stop being obnoxious and explain what’s going on. Here’s the distribution of the first digit of all numbers on Wikipedia, effectively ignoring years and dates. (See details below if you’re curious.) It turns out that Wikipedia had 172,498,007 numbers that fit the bill:

Some detail: The script used the awesome Python module lxml to quickly scan the contents of every Wikipedia article for numbers matching a simple regular expression, \d(\d|,)*\.?\d*. Numbers that start with a zero (e.g., 0.1) were simply left out of the bar graphs, but for those who are curious, about 10% of numbers in Wikipedia actually start with a 0. (That’s a lot higher than I would’ve guessed.)

I did my best to effectively solve the date problem in the laziest way possible: I excluded two-digit numbers and four-digit numbers without commas, under the assumption that years are almost never written with a comma. That’s obviously a bit of a blunt instrument, but it’s good enough for my purposes. The data comes from a May 2011 version of Wikipedia that I happened to have on my hard drive, a 31 GB XML file. It includes 11,120,945 articles, although the definition of “article” used to get that number probably isn’t a very good one. You can download all of Wikipedia here.
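For anyone who wants to try the counting step on a smaller scale, here is a minimal sketch of what it might look like. It is not the original script: it works on a plain string rather than a Wikipedia dump, the names NUMBER_RE and first_digit_counts are made up for illustration, and the way it trims trailing punctuation before applying the date filter is an assumption.

```python
import re
from collections import Counter

# Same pattern as in the text, with a non-capturing group so the whole match is returned.
NUMBER_RE = re.compile(r"\d(?:\d|,)*\.?\d*")


def first_digit_counts(text):
    """Tally leading digits of numbers in text, skipping likely dates."""
    counts = Counter()
    for match in NUMBER_RE.finditer(text):
        # Assumption: drop punctuation the pattern picks up at the end ("18," or "637.").
        token = match.group(0).rstrip(",.")
        # Blunt date filter: bare two-digit and four-digit numbers are skipped.
        if token.isdigit() and len(token) in (2, 4):
            continue
        # Numbers starting with zero (e.g., 0.1) are left out of the tally.
        if token[0] == "0":
            continue
        counts[int(token[0])] += 1
    return counts


sample = "Born on July 18, 1853, he earned 2,458,193 dollars and 0.5 percent of 637 votes."
print(first_digit_counts(sample))
```

In the sample sentence, 18 and 1853 are dropped as probable dates and 0.5 is dropped for starting with a zero, so only 2,458,193 and 637 are counted.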

As you can see, the numbers get progressively less popular as they get bigger, and the effect is really really big. If we use Wikipedia as our random number generator, the correct odds for our bet in which we take numbers beginning with one and our sucker takes numbers that begin with either eight or nine are about 3.9:1. If you risk $200 to his $100, your edge on one bet is about $15.55 if numbers beginning with other digits end the bet or about $38.95 if you keep choosing numbers until someone wins. That’s absurd.
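The two edge figures differ only in what counts as "one bet". As a rough consistency check, the sketch below backs out the probabilities implied by the rounded 3.9:1 odds and the $15.55 per-number edge, then rescales to the case where you keep drawing until someone wins. The function name and the back-solving approach are illustrative assumptions, not the calculation behind the figures above, and because 3.9 is a rounded value the result only approximately matches the $38.95 quoted.

```python
def implied_probabilities(fair_odds, edge_per_number, your_stake, mark_stake):
    """Solve for (p_you, p_mark) given the fair odds and the per-number edge.

    Uses edge = mark_stake * p_you - your_stake * p_mark with p_you = fair_odds * p_mark.
    """
    p_mark = edge_per_number / (mark_stake * fair_odds - your_stake)
    return fair_odds * p_mark, p_mark


p_you, p_mark = implied_probabilities(3.9, 15.55, your_stake=200, mark_stake=100)

# Only a fraction p_you + p_mark of numbers settle the bet, so the edge per
# settled bet is the per-number edge divided by that fraction.
edge_per_settled_bet = 15.55 / (p_you + p_mark)

print(f"p_you ~ {p_you:.3f}, p_mark ~ {p_mark:.3f}")
print(f"edge per settled bet ~ ${edge_per_settled_bet:.2f}")
```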

For completeness, here’s the same data with dates included:

Not surprisingly, Wikipedia has a lot of dates that start with a one or a two.

A First Explanation

So, what’s going on here? Let’s start with a concrete example: Consider the height of all the various man-made buildings in the world in stories. These numbers are certainly not uniformly distributed–There are obviously way more buildings that are between, say, one and ten stories high than there are buildings between eleven and twenty stories high. In fact, they’re probably closer to logarithmically distributed. (Excuse my abuse of vocabulary, fellow nerds.) In other words, there are probably about as many buildings between one and two stories high as there are buildings between two and four stories, and likewise between four and eight, eight and sixteen, etc. This rather intuitive concept is quite universal–There are about as many businesses that earn between $1 million and $2 million per year as there are businesses that earn between $2 million and $4 million; there are about as many hills/mountains between 200 and 800 feet high as there are hills/mountains between 800 and 3200 feet high; there are about as many savings accounts with between $1,000 and $10,000 in them as there are savings accounts with between $10,000 and $100,000, etc.

This type of distribution gives rise to Benford’s law. Indeed, if there are as many one-story buildings as there are two-to-four-story buildings AND as many ten-to-twenty-story buildings as there are twenty-to-forty-story buildings, then obviously if you write down the number of stories in a random building, the number is more likely to start with a one than a two, more likely to start with a two than a three, etc. And, this type of data is pretty damn common.

If you work out the math, here’s the theoretical distribution of first digits from random numbers distributed logarithmically:

Notice how incredibly close this is to the actual distribution that I found on Wikipedia. That’s largely because a ton of the “random numbers” on Wikipedia are logarithmically distributed.
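Working out the math is short. If the logarithm of a number is uniformly distributed, the share of numbers whose first digit is d is the width of the interval from log10(d) to log10(d + 1), which is log10(1 + 1/d). A minimal sketch (the function name is just for illustration):

```python
import math


def benford_probability(digit):
    """Probability that a log-uniformly distributed number starts with `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


for digit in range(1, 10):
    print(f"{digit}: {benford_probability(digit):.1%}")
```

A leading one comes out at about 30.1%, and a leading nine at about 4.6%.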

More Explanation

However, Benford’s law arises in even more situations than that. Earlier, I excluded dates because they felt somehow uniquely unfair–Days of the month are way more likely to start with the numbers one or two than eight or nine, for example. But, this actually isn’t a particularly unique situation.

It’s quite common for a certain type of number to be restricted to values between one and some maximum value (e.g., the day of the month, a social security number, an IP address, a year). We can think of these numbers as being chosen randomly with equal probability from the integers between one and N. When N is nine, 99, 999, 9,999, etc., then the first digit of the randomly chosen number will be evenly distributed. (1/9 of the numbers between one and 999 start with a seven, for example.) However, for any other value of N, the situation is different: If, for example, N is 2,001, then more than half of all possible values (1,111 of them, or about 56%) will start with a one. It turns out that if we choose N in a suitably random way, then we see Benford’s law again. (This is intentionally vague because it’s cumbersome to talk about distributions of distributions. See this paper for a nice proof.)
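The counts for any particular N are easy to check by brute force. A small sketch, assuming nothing beyond "pick an integer from 1 to N uniformly at random" (the function name is hypothetical):

```python
from fractions import Fraction


def first_digit_share(n, digit):
    """Exact share of the integers 1..n whose first digit is `digit`."""
    hits = sum(1 for k in range(1, n + 1) if str(k)[0] == str(digit))
    return Fraction(hits, n)


print(first_digit_share(999, 7))    # 1/9
print(first_digit_share(2001, 1))   # 1111/2001
```

With N = 999 every digit gets an equal 1/9 share, while with N = 2,001 a leading one takes 1,111 of the 2,001 values, or roughly 55.5%.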

That type of distribution comes up all the time. Some sets of numbers simply have fixed limits. For example, days of the month can’t be higher than 31, Social Security numbers can’t be higher than 772-99-9999 (well, prior to June 2011, when the system changed), a Roulette spin can’t be higher than 36. Other sets of numbers don’t quite have a formal maximum, but they still tend to be capped in practice. For example, while Manhattan could presumably have an 8,313,921st street, in practice, the highest-numbered street in Manhattan is 220th Street (although the numbering continues up to 263rd Street in the Bronx). The same idea applies to all sorts of addresses and man-made number schemes, as well as a lot of natural phenomena.

In each of those individual examples, the actual distribution predicted by Benford’s law will not apply, but in all of them, randomly chosen numbers will start with a one MUCH more often than they start with a nine. For example, a random valid Social Security number will start with a one about 14% of the time and will never start with a nine. A random street in Manhattan will start with a one over half the time and start with a nine only about five percent of the time. If you then sample a random number from a large collection of examples like this (which is pretty close to what picking a random number from a Wikipedia article does), Benford’s law will apply almost exactly.

Indeed, Benford’s law is even more common than the above two arguments would imply. The true reason for Benford’s law’s incredible ubiquity is that it is scale invariant. In other words, if a distribution of heights of objects satisfies Benford’s law in feet, it will satisfy it in any units: inches, meters, yards, hands, etc. While Benford’s law seems incredibly asymmetric and unnatural, it actually has an underlying symmetry that is in fact much more natural than the naive idea that the first digit of “randomly chosen numbers” should be evenly distributed.
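Scale invariance is easy to see in a quick simulation. The sketch below is only an illustration under stated assumptions: the "heights in feet" are made-up values drawn log-uniformly between 1 and 1,000,000, the seed is arbitrary, and the helper names are hypothetical. It converts the same heights to inches and meters and compares the first-digit shares with Benford’s law.

```python
import math
import random
from collections import Counter


def leading_digit(x):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])


def digit_shares(values):
    """Share of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(2012)  # fixed seed so the run is repeatable
# Assumption: "heights in feet" drawn log-uniformly between 1 and 1,000,000.
heights_ft = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

for unit, factor in [("feet", 1.0), ("inches", 12.0), ("meters", 0.3048)]:
    shares = digit_shares([h * factor for h in heights_ft])
    print(unit, [round(shares[d], 3) for d in range(1, 10)])

benford = [round(math.log10(1 + 1 / d), 3) for d in range(1, 10)]
print("Benford", benford)
```

All three rows of shares should land close to the Benford row, whichever unit is used.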

So, it’s really not surprising at all that Wikipedia’s data follows Benford’s law so incredibly closely.

In practice, obviously the Wikipedia data is slightly imperfect. I’m guessing that this is primarily due to rounding and the different ways that people write numbers. For example, the number 997,723 is very unlikely to appear in Wikipedia when it could just be rounded to 1,000,000. Likewise, sometimes 1,000,000 is written as one million. There are some other effects that have to do with choosing units: a recipe is more likely to call for one quart than two pints, for example. And, some distributions simply don’t follow Benford’s law, such as the distribution of heights of people.

Fin.

Anyway, that’s the Benford’s Law Hustle™. To review, here’s what you do:

Find a sucker.
Get him to bet with you on the first digit of (pseudo)randomly chosen numbers in any of a wide class of distributions. (Random numbers from the newspaper or Wikipedia will work fine.)
Give yourself low digits (One works best!) and give him high digits (like nine).
Lay him odds that still give you a massive edge according to Benford’s law (see chart below).
Profit.

Here’s a chart with the correct odds for a bunch of different bets that you might try:

Your digits (rows) vs. the mark’s digits (columns)	4	4, 5	4, 5, 6, 7, 8, 9	5	5, 6	5, 6, 7, 8, 9	6	6, 7	6, 7, 8, 9	7	7, 8	7, 8, 9	8	8, 9	9
1	3.1	1.7	0.8	3.8	2.1	1.0	4.5	2.4	1.4	5.2	2.8	1.9	5.9	3.1	6.5
1, 2	4.9	2.7	1.2	6.0	3.3	1.6	7.1	3.8	2.1	8.2	4.4	3.1	9.4	4.9	10.4
1, 2, 3	6.2	3.4	1.5	7.6	4.1	2.0	9.0	4.8	2.7	10.4	5.5	3.9	11.8	6.2	13.1
1, 2, 3, 4	X	X	X	8.8	4.8	2.3	10.4	5.6	3.1	12.1	6.4	4.5	13.7	7.2	15.2
2	1.8	1.0	0.4	2.2	1.2	0.6	2.6	1.4	0.8	3.0	1.6	1.1	3.5	1.8	3.8
2, 3	3.1	1.7	0.8	3.8	2.1	1.0	4.5	2.4	1.4	5.2	2.8	1.9	5.9	3.1	6.5
3	1.3	0.7	0.3	1.6	0.9	0.4	1.9	1.0	0.6	2.2	1.1	0.8	2.5	1.3	2.7
3, 4	X	X	X	2.8	1.5	0.7	3.3	1.8	1.0	3.8	2.0	1.4	4.4	2.3	4.8
4	X	X	X	1.2	0.7	0.3	1.4	0.8	0.4	1.7	0.9	0.6	1.9	1.0	2.1

Each entry is the break-even odds (to one) that you could lay. An X marks a bet where both sides would hold the digit four.

Remember to give yourself a nice cushion, since this bet is so counter-intuitive and real-world distributions won’t be perfect (and because money is fun). For example, you might get someone to bet that the first digit will be a four, five, six, seven, eight, or nine, leaving you with one, two, and three. Looking up in the table, you can see that you would break about even if you laid 1.5:1 on this bet (How cool is that?!), but certainly any reasonable mark would accept 1:1 when you’re giving him six numbers to your three. That’ll give you an edge of about $20 on a $100 bet. If you get him to lay you 1.5:1 (and who wouldn’t accept that?), you’ll have a $50 edge on a $100 bet.
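If you want to price a bet that isn’t in the chart, the arithmetic is simple enough to script. This is a minimal sketch that assumes the numbers follow Benford’s law exactly; the function names are hypothetical, and since the chart’s entries are rounded, results can differ from it in the last digit.

```python
import math


def benford_probability(digits):
    """Chance under Benford's law that the first digit is one of `digits`."""
    return sum(math.log10(1 + 1 / d) for d in digits)


def fair_odds(your_digits, mark_digits):
    """Break-even odds you could lay: your win chance divided by the mark's."""
    return benford_probability(your_digits) / benford_probability(mark_digits)


def edge(your_digits, mark_digits, your_stake, mark_stake):
    """Expected profit per number drawn; other digits are a push."""
    return (mark_stake * benford_probability(your_digits)
            - your_stake * benford_probability(mark_digits))


low, high = [1, 2, 3], [4, 5, 6, 7, 8, 9]
print(f"fair odds: {fair_odds(low, high):.2f}:1")
print(f"even money, $100 each: ${edge(low, high, 100, 100):.2f}")
print(f"mark lays 1.5:1: ${edge(low, high, 100, 150):.2f}")
```

The same edge function covers the opening bet as well: edge([1, 2], [8, 9], 200, 100) gives the roughly $28.30 per number quoted at the top.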

And, of course, be very careful to choose numbers that actually do follow Benford’s law.

Anyway, that’s my rant. If you try this out, please please please let me know how it goes. (And you can make any checks payable to Noah Stephens-Davidowitz.) If you want to see more cool Benford’s law stuff, check out the Wikipedia article, this page that illustrates it in practice on a bunch of different real-world data sets, and this page that has a nice script illustrating what’s happening. If you want to read more stuff by me, you should follow me on Twitter, subscribe to this blog’s RSS, and check out my nerd blog, Solipsist’s Log.


NoahSD's Awesome Poker Blog