The Law of Large Numbers and the Central Limit Theorem: A Polling Simulation
Nov. 24, 2008
This is for those who say: "Math was my worst subject in high school". If you've ever placed a bet at the casino, at the track or played the lottery, you already know the basics. It's about probability. It's about common sense. It's not all that complicated.
It's for Excel spreadsheet users who enjoy creating math models. The Excel model can be downloaded:
It's for reporters, bloggers and politicians who seek the truth: Robert Koehler, Brad Friedman, John Conyers, Barbara Boxer, Mark Miller, Fitrakis, Wasserman, Kathy Dopp, Steve Freeman, Ron Baiman, Jonathan Simon, Alistair Thompson, Paul Krugman, Keith Olbermann, Mike Malloy, Randi Rhodes, Thom Hartman, Stephanie Miller, Joseph Cannon, Sam Seder, Janeane Garofalo, etc.
It's for those who have taken algebra, probability or statistics and want to see how the math is applied to election polling.
It's for graduates with degrees in mathematics, political science, an MBA, etc. who may or may not be familiar with simulation concepts. Simulation is a powerful tool for analyzing uncertainty in simple and complex models. Like in coin flipping and election polling.
It's for browsers who frequent Discussion Forums.
It's for those Corporate Media reporters who are still waiting for editor approval to discuss the documented evidence of election fraud, statistical and anecdotal in all elections since 2000.
In Selection 2000, Gore won the popular vote by 540,000. But Bush won the election by a single vote. SCOTUS voted along party lines: Bush 5, Gore 4. That stopped
It's for the exit poll naysayers who promote faith-based hypothetical arguments in their unrelenting attempts to debunk the accuracy of the pre-election and exit polls.
FALSE RECALL, RELUCTANT RESPONDERS, HOW THEY VOTED IN 2000: IMPLAUSIBLE, CONTRADICTORY AND MATHEMATICALLY IMPOSSIBLE
Naysayers have a problem with the 2004 pre-election and exit polls. Regardless of how many were taken or how large the samples, the results are never good enough for them. They prefer to cite two implausible hypotheticals: Bush non-responders (rBr) and Gore voter memory lapse ("false recall").
How do pollsters handle non-responders? They just increase the sample-size! Furthermore, statistical studies show that there is no discernible correlation between non-response rates and survey results.
How do pollsters handle "false recall"? They know that in a large sample, forgetfulness on the part of Gore and Bush voters will cancel out! There is no evidence that Gore voters forget any more than Bush voters.
On the contrary, if someone you knew robbed you in broad daylight, would you forget who it was four years later? In 2000, Gore and the voters were robbed in broad daylight.
Naysayers claim that bias favored Kerry in the pre-election and exit polls. Yet they offer no evidence to back it up. They claim that Gore voters forgot and told the exit pollsters they voted for Bush in 2000. It's their famous "false recall" hypothetical. They were forced to use it when they could not come up with a plausible explanation for the impossible weightings of Bush and Gore voter turnout in the Final National Exit poll.
According to the final 2004 NEP, which Bush won by 51-48%, 43% of the 13660 respondents voted for Bush in 2000 while only 37% voted for Gore. This contradicts the reluctant Bush responder (rBr) hypothesis. Furthermore, 43% of the 122.3 million who voted in 2004 is 52.57mm, yet Bush only got 50.45 mm votes in 2000. The 43/37% split is a mathematical impossibility.
In addition, approximately 1.75 mm Bush 2000 voters died prior to the 2004 election. Therefore, no more than 48.7 mm of Bush 2000 voters could have turned out to vote in 2004. The Bush 2000 voter share was 48.7/122.3 (or 39.8%), assuming that all of the Bush 2000 voters still living came to the polls. These mathematical facts are beyond dispute. Kerry won the final 1:25pm exit poll by 50.93-48.66%, assuming equal 39.8% weights.
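The turnout arithmetic above can be checked in a few lines. This is only a sketch: every input is a figure quoted in the text, while the variable and function names are made up for illustration. The implied count comes out at roughly 52.6 million, well above the number of Bush 2000 voters who could have returned.

```python
# Figures quoted in the text, in millions of votes.
TOTAL_2004_VOTES = 122.3      # total votes cast in 2004
BUSH_2000_VOTES = 50.45       # Bush's recorded 2000 vote
BUSH_2000_DEATHS = 1.75       # approximate Bush 2000 voters who died before 2004
NEP_BUSH_2000_WEIGHT = 0.43   # final NEP share of respondents who said they voted for Bush in 2000


def implied_returning_voters(weight, total_votes):
    """Number of voters (in millions) implied by an exit poll weight."""
    return weight * total_votes


def max_feasible_weight(votes_2000, deaths, total_votes):
    """Upper bound on the 2004 share, assuming every surviving 2000 voter turned out."""
    return (votes_2000 - deaths) / total_votes


implied = implied_returning_voters(NEP_BUSH_2000_WEIGHT, TOTAL_2004_VOTES)
ceiling = max_feasible_weight(BUSH_2000_VOTES, BUSH_2000_DEATHS, TOTAL_2004_VOTES)

print(f"Bush 2000 voters implied by the 43% weight: {implied:.2f} million")
print(f"Bush 2000 voters still living in 2004:      {BUSH_2000_VOTES - BUSH_2000_DEATHS:.2f} million")
print(f"Maximum feasible Bush 2000 weight:          {ceiling:.1%}")
```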
For the same reason, Kerry must have done even better than his 51.4-47.6% winning margin at the 12:22am timeline (13047 respondents). Here the Bush/Gore mix was 41/39%. But we have just shown that 39.8% was the absolute maximum Bush share. If we apply equal weightings to the 12:22am results, then Kerry won by 52.25-46.77%, a 6.7 million vote margin (63.8-57.1mm).
First-time voters and those who sat out the 2000 election, as well as Nader and Gore 2000 voters, were overwhelmingly Kerry voters. The recorded Bush 2004 vote was 62 million. Where did he get the 13 million new voters from 2000? How do the naysayers explain it? Only by ignoring the mathematical facts and raising new implausible theories.
It's time to put on the defoggers. We've had enough disinformation, obfuscation and misrepresentation. Let the sunshine in. Let's review the basics.
A COIN-FLIP EXPERIMENT
Consider this experiment. Flip a fair coin 10 times. Calculate the percentage of heads. Write it down. Increase to 20 flips. Calculate the new total percentage. Write it down.
Keep flipping. Write down the percentage after every ten flips. Stop at 100. That's our final coin flip sample-size.
When you're all done, check the percentages. Is the sequence converging to 50%? That's the true population mean (average). That's the Law of Large Numbers.
The coin-flip is easily simulated in Excel. Likewise, in the polling simulations which follow, we will analyze the result of polling experiments over a range of trials (sample size).
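For readers without Excel, the same experiment can be sketched in Python. This is not the Excel model, just a minimal stand-in; the function name and the seed are arbitrary choices made here.

```python
import random


def running_heads_percentages(total_flips=100, step=10, seed=None):
    """Flip a fair coin total_flips times; record the cumulative % of heads every `step` flips."""
    rng = random.Random(seed)
    heads = 0
    checkpoints = []
    for flip in range(1, total_flips + 1):
        if rng.random() < 0.5:      # HEADS with probability 0.5
            heads += 1
        if flip % step == 0:        # write down the running percentage
            checkpoints.append((flip, 100.0 * heads / flip))
    return checkpoints


for flips, pct in running_heads_percentages(seed=2008):
    print(f"{flips:4d} flips: {pct:5.1f}% heads")
```

With only 100 flips the percentages can still wander a few points from 50%. Raise total_flips to 10,000 or more and they settle down much closer.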
THE MATHEMATICAL FOUNDATION
This model demonstrates the Law of Large Numbers (LLN). LLN is the foundation and bedrock of statistical analysis. LLN is illustrated through simulations of polling samples. In a statistical context, LLN states that the mean (average) of a random sample taken from a large population is likely to be very close to the (true) mean of the population.
Start of math jargon alert...
In probability theory, several laws of large numbers say that the mean (average) of a sequence of random variables with a common distribution converges to their common mean as the size of the sequence approaches infinity.
The Central Limit Theorem (CLT) is another famous result. The sample means (averages) of an independent series of random samples (i.e. polls) taken from the same population will tend to be normally distributed (the bell curve) as the size of each sample increases. This holds for ALL practical statistical distributions (any distribution with a finite variance).
End of math jargon alert....
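A quick way to see the CLT at work is to run many independent polls of the same size and check how the sample means spread out. The sketch below assumes a poll size of 1000 and 2000 repeated polls; both numbers are arbitrary choices for illustration, as are the function names. If the bell curve applies, about 95% of the poll means should fall within 1.96 standard errors of the true share.

```python
import math
import random


def simulate_poll_means(true_share, sample_size, num_polls, seed=None):
    """Run num_polls independent polls and return each poll's sample mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_polls):
        votes = sum(1 for _ in range(sample_size) if rng.random() < true_share)
        means.append(votes / sample_size)
    return means


def share_within(means, center, half_width):
    """Fraction of sample means that lie within half_width of center."""
    inside = sum(1 for m in means if abs(m - center) <= half_width)
    return inside / len(means)


TRUE_SHARE = 0.526    # Kerry's assumed true 2-party share, as in the text
SAMPLE_SIZE = 1000    # assumption: size of each simulated poll
NUM_POLLS = 2000      # assumption: number of repeated polls

means = simulate_poll_means(TRUE_SHARE, SAMPLE_SIZE, NUM_POLLS, seed=2004)
std_error = math.sqrt(TRUE_SHARE * (1 - TRUE_SHARE) / SAMPLE_SIZE)
coverage = share_within(means, TRUE_SHARE, 1.96 * std_error)

print(f"Average of the poll means: {sum(means) / len(means):.4f}")
print(f"Share of polls within 1.96 standard errors: {coverage:.1%}")
```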
It's really not all that complicated. Naysayers never consider LLN or CLT. They maintain that polls are not random-samples. They would have us believe that professional pollsters are incapable of creating accurate surveys (i.e. effectively random samples) through systematic, clustered or stratified sampling, especially when Bush is running.
LLN and CLT say nothing about bias.
Just like in the above coin-flipping example, the Law of Large Numbers takes effect as poll sample-size increases. That's why the National Exit Poll was designed to survey at least 13000 respondents.
Note the increasing sequence of polling sample size as we go from the pre-election state (600) and national (1000) polls to the state and national exit polls: Ohio (1963), Florida (2846) and the National (13047).
Here is the National Exit Poll Timeline:
Updated; respondents; vote share
3:59pm: 8349; Kerry led 51-48
7:33pm: 11027; Kerry led 51-48
12:22am: 13047; Kerry led 51-48
1:25pm: 13660; Bush led 51-48
The final was matched to the vote.
So much for letting LLN and CLT do their magic.
USING RANDOM NUMBERS TO SIMULATE A SEQUENCE OF POLLS
Random number simulation is the best way to illustrate LLN:
1) Assume a true 2-party vote percentage for Kerry (i.e. 52.6%).
2) Simulate a series of 8 polls of varying sample size.
3) Calculate the sample mean vote share and win probability for each poll.
4) Confirm LLN by noting that as the poll sample size increases,
the sample mean (average) converges to the population mean ("true" vote).
It's just like flipping a coin.
Assume there is a p = 52.6% probability that a random poll respondent voted for Kerry (HEADS).
This represents Kerry's TRUE vote (his population mean).
Bush is TAILS with a 47.4% (1-p) probability.
A random number (RN) between zero and one is generated for each respondent.
If RN is LESS than Kerry's TRUE share, the vote goes to Kerry.
If RN is GREATER than Kerry's TRUE share, the vote goes to Bush.
For example, assume Kerry's TRUE 52.6% vote share (.526).
If RN is less than .526, Kerry's poll count is increased by one.
If RN is greater than .526, Bush's poll count is increased by one.
The sum of Kerry's votes is divided by the poll sample (i.e. 13047). This is Kerry's simulated 2-party vote share. It approaches his TRUE 52.6% vote share as poll samples increase.
The LLN works in polling the same way as in the coin flip experiment.
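The random-number procedure above can be sketched as follows. The eight sample sizes are an assumption: they are simply the poll sizes mentioned earlier in this post, not necessarily the ones used in the Excel model. The function name and seed are also illustrative.

```python
import random

TRUE_KERRY_SHARE = 0.526   # assumed TRUE 2-party share, as in the text

# Assumption: eight sample sizes borrowed from the polls cited in this post.
SAMPLE_SIZES = [600, 1000, 1963, 2846, 8349, 11027, 13047, 13660]


def simulate_poll(true_share, sample_size, rng):
    """Draw one RN per respondent; an RN below true_share is a Kerry vote, otherwise Bush."""
    kerry_votes = 0
    for _ in range(sample_size):
        if rng.random() < true_share:
            kerry_votes += 1
    return kerry_votes / sample_size


rng = random.Random(1104)
for n in SAMPLE_SIZES:
    share = simulate_poll(TRUE_KERRY_SHARE, n, rng)
    print(f"n = {n:6d}: simulated Kerry share = {share:.2%}")
```

Rerun with different seeds: the small polls bounce around, while the large ones stay close to 52.6%.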
THE STATE ELECTORAL VOTE SIMULATION
In addition to simulating Kerry's popular 2-party vote, the model also includes a State Electoral Vote (EV) Simulator. The method is similar to the previous National polling samples, with this exception:
Each simulation consists of 100 election trials.
When the F9 key is pressed, one
The RN is compared to the probability of Kerry winning the state. If RN is less than the probability, the state EV is added to his total. If RN is greater, Bush wins the state.
If Kerry's total EV exceeds 269, he wins the election trial.
1) Assume that Kerry and Bush were tied in the FL exit poll.
Therefore, the probability that Kerry would win FL is 50%.
If RN is less than 0.50, Kerry wins FL's 27 electoral votes.
2) Assume that Kerry won the CA exit poll by 55-45%.
The probability of winning the state was 99.9%.
If RN is less than .999, Kerry wins CA's 55 electoral votes.
Kerry's total number of winning election trials (out of the 100) is his expected (mean) electoral vote win probability. In addition to Kerry's expected mean EV (average), his median (middle), maximum and minimum electoral votes are calculated for the 100 trials.
Kerry's state win probability is calculated using the Excel Normal Distribution Function. Inputs to the NDF:
1) Kerry's 2-party share of the state exit poll
2) the standard deviation Stdev = MoE/1.96
MoE is the poll Margin of Error.
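Here is a rough Python sketch of the electoral vote trials described above. It is not the Excel model. The FL and CA rows use the figures from the examples; the other two rows are hypothetical placeholder blocks that stand in for the remaining states so the toy map adds up to 538 electoral votes.

```python
import random
import statistics


def simulate_election_trials(states, num_trials=100, ev_to_win=270, seed=None):
    """states maps a name to (electoral_votes, kerry_win_probability)."""
    rng = random.Random(seed)
    ev_totals = []
    for _ in range(num_trials):
        kerry_ev = 0
        for electoral_votes, win_prob in states.values():
            if rng.random() < win_prob:   # RN below the win probability: Kerry takes the state
                kerry_ev += electoral_votes
        ev_totals.append(kerry_ev)
    wins = sum(1 for ev in ev_totals if ev >= ev_to_win)   # more than 269 EV wins the trial
    return {
        "win_probability": wins / num_trials,
        "mean_ev": statistics.mean(ev_totals),
        "median_ev": statistics.median(ev_totals),
        "max_ev": max(ev_totals),
        "min_ev": min(ev_totals),
    }


TOY_STATES = {
    "FL": (27, 0.50),                 # tied exit poll, from the example above
    "CA": (55, 0.999),                # 55-45 exit poll, from the example above
    "SAFE_KERRY_BLOCK": (200, 1.0),   # hypothetical placeholder, not real data
    "SAFE_BUSH_BLOCK": (256, 0.0),    # hypothetical placeholder, not real data
}

summary = simulate_election_trials(TOY_STATES, num_trials=100, seed=2004)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy map everything hinges on Florida, so the win probability should hover around 50%. A real run would list all the states with win probabilities derived from their exit polls.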
THE MARGIN OF ERROR
The MoE (at the 95% confidence level) is the half-width of the interval surrounding the sample mean which has a 95% probability of containing the TRUE population mean.
For example, assume a 2% MoE for a state exit poll won by Kerry: 52-48%. The probability is 95% that Kerry's TRUE vote is in the interval from 50% to 54%. The (one tail) probability is 97.5% that Kerry's vote will exceed the interval lower limit of 50%.
This is the standard formula used to calculate the MoE:
MoE = 1.96 * sqrt (p*(1-p)/n) * DE
n is the sample size,
p and 1-p are the 2-party vote shares.
DE is the exit poll "design effect": the ratio of the total number of respondents required using cluster random sampling to the number required using simple random sampling. A cluster randomized trial which has a large design effect will require many more samples. As the number of respondents increases so does the design effect. We can only estimate the impact of the DE on the MoE. But DE is only a factor in exit polls. There is no equivalent adjustment made to the MoE in pre-election or approval polls.
The MoE decreases as the sample-size (n) increases while the sample poll mean approaches the population mean. It's the Law of Large Numbers. For a given sample size n, the MoE is at its maximum value when p = 0.50. As p moves away from 0.50 in either direction, the MoE declines. In the p = 0.50 case, the formula (before the design effect adjustment) can be simplified to: MoE = 1.96 * .5 / sqrt (n) = .98 / sqrt (n)
Let's calculate the MoE for the 12:22am National Exit Poll:
n = 13047 sampled respondents
p = Kerry's true 2-party vote share = .526
1-p = Bush's vote share = .474
Before any adjustment, MoE = 1.96 * sqrt (.526*.474/13047) = 0.86%
Adjusting for an assumed 30% exit poll cluster design effect,
MoE = 1.30*0.86% = 1.12%
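The MoE formula is easy to check outside of Excel. The sketch below uses a made-up function name and treats a 30% cluster effect as a multiplier of 1.30, as in the calculation above. Carrying full precision, the adjusted figure comes out a shade above 1.11%; the 1.12% above comes from rounding the unadjusted MoE to 0.86% first.

```python
import math


def margin_of_error(p, n, design_effect=1.0, z=1.96):
    """MoE = z * sqrt(p*(1-p)/n) * DE, where a 30% cluster effect means DE = 1.30."""
    return z * math.sqrt(p * (1 - p) / n) * design_effect


N_RESPONDENTS = 13047   # 12:22am National Exit Poll
KERRY_SHARE = 0.526     # assumed true 2-party share

unadjusted = margin_of_error(KERRY_SHARE, N_RESPONDENTS)
adjusted = margin_of_error(KERRY_SHARE, N_RESPONDENTS, design_effect=1.30)
print(f"Unadjusted MoE:               {unadjusted:.2%}")
print(f"MoE with 30% cluster effect:  {adjusted:.2%}")

# The MoE shrinks as the sample grows (no cluster adjustment here).
for n in (600, 1000, 13047):
    print(f"n = {n:6d}: MoE = {margin_of_error(0.5, n):.2%}")
```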
Pollsters use proven methodologies, such as cluster sampling, stratified sampling, etc. to attain a near-perfect random sample. Why would a polling firm include the MoE for a poll that was not an effective random sample?
Kerry win probabilities are the main focus of the simulation. They closely match the theoretical probabilities obtained from the Excel Normal Distribution function.
The probabilities are calculated using two methods:
1) running the simulation and counting the votes
2) calculating the Excel Normal Distribution function
Prob = NORMDIST (P, V, Stdev, true)
P = .526 is the mean Kerry poll vote share
V = 0.50 is the majority vote threshold.
Stdev = MoE/1.96. The standard deviation is a measure of dispersion around the mean.
Given that Kerry led by 3% in the 2-party vote (12:22am National Exit Poll), his popular vote win probability was close to 100%. And that assumes a 30% cluster effect!
For a 2% lead (51-49), the win probability is 97.5% (still very high).
For a 1% lead (50.5-49.5), it's 81% (4 out of 5).
For a 50/50 tie, it's 50%.
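The NORMDIST calculation has a direct counterpart in Python's standard library. This is a sketch that mirrors NORMDIST (P, V, Stdev, true) with statistics.NormalDist; the function name is invented here, and the result depends entirely on the MoE you feed in.

```python
from statistics import NormalDist


def win_probability(poll_share, moe, threshold=0.50):
    """Cumulative normal probability, equivalent to NORMDIST(poll_share, threshold, moe/1.96, TRUE)."""
    stdev = moe / 1.96
    return NormalDist(mu=threshold, sigma=stdev).cdf(poll_share)


MOE_WITH_CLUSTER = 0.0112   # 12:22am National Exit Poll MoE from the calculation above

for share in (0.526, 0.51, 0.505, 0.50):
    prob = win_probability(share, MOE_WITH_CLUSTER)
    print(f"Kerry poll share {share:.1%}: win probability {prob:.1%}")
```

Try other MoE values to see how quickly the win probability falls off as the MoE widens.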
The following probabilities are calculated in the model:
1) The confidence level for Kerry's minimum vote share (MVS).
There is a 97.5% probability that Kerry's true vote exceeds MVS.
The MVS increases as the polling sample size grows.
2) The probability of Bush obtaining his recorded two-party vote (51.24%).
The probability is virtually zero that Bush's recorded vote would be almost 4% higher than his 47.4% two-party share.
3) The probability of the state exit poll discrepancy from the recorded vote is a function of the magnitude of the deviation, the MoE and cluster effect. The normal distribution is used to calculate the probability.
4) The probability that the MoE is exceeded in any given state is 1 in 40. The probability that the MoE is exceeded in at least N states is calculated using the binomial distribution function. The cluster effect makes a big difference in the probability calculation. As the cluster effect increases, so does the MoE, which is therefore less likely to be exceeded.
Assuming a 30% cluster effect, the vote discrepancy exceeded the exit poll MoE for Bush in 10 states. The probability of this occurrence is 1 in 2.5 MILLION.
Assuming a 20% cluster effect, the MoE was exceeded in 13 states, a 1 in 4.5 BILLION probability.
For a cluster effect of 12% or less, the MoE was exceeded in 16 states, a 1 in 19 TRILLION probability!
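The binomial calculation can be reproduced with nothing more than math.comb. The sketch below assumes 50 state exit polls (the number of states is not stated above, so treat it as an assumption) and the 1 in 40 chance per state. Under those assumptions the results should land close to the odds quoted above.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial probability of k or more occurrences in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


NUM_STATES = 50    # assumption: number of state exit polls
P_EXCEED = 0.025   # 1 in 40 chance that the MoE is exceeded for Bush in a given state

for k in (10, 13, 16):
    prob = prob_at_least(k, NUM_STATES, P_EXCEED)
    print(f"MoE exceeded in at least {k} states: about 1 in {1 / prob:,.0f}")
```

Note that the cluster effect does not change the 1 in 40 figure. It changes the MoE itself, and therefore how many states end up outside it.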
DOWNLOADING THE EXCEL MODEL AND RUNNING THE SIMULATION
Two inputs drive the state and national vote simulations:
1) Kerry's 2-party true vote share (52.6%)
2) exit poll cluster effect (set to 30%).
Press F9 to run the simulation.
The graphs illustrate polling simulation output based on the inputs:
1- Kerry's 2-party vote (true population mean): 52.60%
2- The Exit Poll Cluster effect (zero for pre-election): 30%
Play "what-if" to see the effect of changing assumptions:
Lower Kerry's 2-party vote share.
Press F9 to run the simulation.
Note how a 1% reduction in Kerry's "true vote" results in a decline of his polling popular and electoral vote shares, corresponding win probabilities and minimum vote at the 97.5% confidence level.
Introduction to Statistics and Probability