Hi, welcome back to this lecture on asymptotics. We're going to talk about the central limit theorem, one of the most important and celebrated theorems in statistics. The idea behind the central limit theorem is very neat: it gives you a way to perform inference with random variables even when we don't know what distribution they come from. What the CLT states is that the distribution of averages of iid variables, properly normalized, becomes that of a standard normal as the sample size increases. It's basically saying that if you want to evaluate error rates associated with averages, it's often enough to compare them to a normal distribution. The CLT applies in an endless variety of settings, and with a collection of other asymptotic tools, people have figured out how to apply it in a huge number of cases. If you've used statistical software and calculated a p-value or a confidence interval, the underlying justification for what you're doing is probably asymptotics; maybe not in all cases, definitely not all cases, but in most cases you're appealing to some use of the central limit theorem, especially if you have larger sample sizes.

Let's actually state the central limit theorem, at least as far as we're going to use it. Let X1 to Xn be a collection of iid random variables from some population with mean mu and variance sigma squared, and let Xn bar be their sample average. Now consider the distribution function of the normalized mean: take Xn bar, subtract off its expected value, mu, and divide by its standard error, sigma over square root n. This whole quantity has mean zero and variance one, and the probability that it is less than or equal to a specific point z converges to the standard normal distribution function evaluated at that point z.

Okay, so what does this say? It says that probabilities associated with sample means look like probabilities associated with normals, and if you standardize the sample mean so that it has mean zero and variance one, the probabilities look like standard normal probabilities. I want to reiterate the form of this normalized quantity: it's Xn bar, an estimate, minus the population mean of the estimate, divided by the standard error. And this is something you can practically bank on: if you take just about any statistical estimate based on iid data, subtract off its population mean, and divide by its standard error, that quantity will most likely end up limiting to a standard normal distribution.

Let's go through an example, and it's kind of a neat one; I'll explain why in a minute. Imagine you were stuck on a desert island and, bear with me, you needed a standard normal random variable. You had to generate one because you were kind of going crazy, and all you had was a die. So you think: I need a standard normal random variable and I have a die, so let's roll the die a bunch of times, take the standardized sample mean, and see how well that works as a standard normal by applying the central limit theorem. Remember that the expected value of a die roll is 3.5; if you don't remember, we went through this calculation a couple of lectures ago when we covered the variance.
The variance of a die roll is 2.92, and the standard error is the square root of 2.92 divided by the square root of the number of die rolls going into the average, so that's 1.71 divided by square root n. The standardized mean is then just the average of the die rolls, minus 3.5, divided by 1.71 over square root n.

On the next slide, what I've done is roll the die, let's say one time, and repeat that over and over again. So an average of one die roll, standardized: for an average of one, it's just a die roll minus 3.5 divided by 1.71. Now we have a distribution that's centered at zero and has variance one. I plotted the standard normal density in gray in the back and the histogram of my die rolls on top. Of course a die roll can only take six possible values, and you see those six spikes at one to six; it's not perfectly discrete only because the software I'm using to plot the histogram assumes the data is continuous. You see the six spikes because a histogram of a bunch of single die rolls is just a bunch of spikes at one to six, and since they were normalized, the spikes sit at the values one to six with 3.5 subtracted off and divided by 1.71.

Now imagine I took two die rolls: rolled the die once, rolled it a second time, took the average, subtracted off 3.5, and divided by 1.71 divided by square root 2. I repeated that process over and over again and plotted a histogram of a lot of averages of two die rolls. The histogram gives a good sense of the distribution of the standardized average of two die rolls. In the background is the normal distribution, and on top of it is the distribution of the average of two die rolls, and I think you'll agree that with just two die rolls it's already looking pretty good. Amazingly good.

Now imagine six die rolls. I rolled the die six times, took the average, subtracted 3.5, and divided by 1.71 divided by square root 6. I did that process over and over again, got lots of normalized averages of six die rolls, and plotted a histogram, and you can't even see the standard normal distribution in the background because the distribution of the average of six die rolls looks so similar. So if I were on my desert island and needed a standard normal, I could probably get away with six die rolls: take the average, subtract 3.5, and divide by 1.71 over square root 6.

The reason I bring this up is an interesting bit of history. The famous statistician Francis Galton, who was quite a character (look him up if you get a chance; he was Charles Darwin's cousin and a very brilliant guy), needed standard normals and had to simulate them before computers existed. So how did he do it? He basically rolled dice and applied the central limit theorem to get standard normals, which is really quite clever. And because it was a pain in the butt (he wasn't on a desert island, but he did have time constraints), he actually invented dice that made the process a little easier. I think he took standard dice and wrote extra values on the corners and such, so there were more values than one to six.
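To make this concrete, here is a minimal sketch in Python of the die-rolling experiment described above. It is illustrative only, not the course's original code (and the lecture's own software is unnamed); the number of simulations, bin count, and plotting choices are my assumptions. It simulates many standardized averages of 1, 2, and 6 die rolls and overlays the standard normal density.

```python
# A rough sketch of the die-rolling simulation; settings are illustrative.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mu, sigma = 3.5, np.sqrt(2.92)       # mean and standard deviation of one die roll
n_sims = 10_000                      # how many standardized averages to simulate

grid = np.linspace(-3, 3, 200)
std_normal = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, n in zip(axes, (1, 2, 6)):                        # averages of 1, 2, 6 rolls
    rolls = rng.integers(1, 7, size=(n_sims, n))          # fair six-sided die
    z = (rolls.mean(axis=1) - mu) / (sigma / np.sqrt(n))  # standardized mean
    ax.hist(z, bins=30, density=True, alpha=0.5)
    ax.plot(grid, std_normal, color="gray")
    ax.set_title(f"average of {n} die roll(s)")
plt.show()
```

Running this reproduces the pattern described in the lecture: spikes at n = 1, a surprisingly good match by n = 2, and a histogram that nearly hides the standard normal curve by n = 6.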
But the basic idea is that the distribution of averages looks like that of a normal distribution, in just about any setting, regardless of the underlying distribution of the data. There are some assumptions: we assume the variance is finite, and a few other things like that. But for the purposes of this class, it works for basically any distribution we're likely to think of.

Let's look at another instance of the central limit theorem, from flipping coins. Now, instead of a die, I have a coin, and I want to evaluate the average of a bunch of coin flips. Let Xi be the zero-or-one result of the ith flip of a possibly unfair coin, where p is the true success probability of the coin. The sample proportion, say p hat, is just the average of the coin flips; p hat is the percentage of ones, and the average of the Xi is the same thing, of course. Remember that the expected value of the Xi is p, the true success probability of the coin, and the variance of the Xi works out to be p times one minus p. The standard error of the mean in this case is the square root of the variance divided by n, so it's the square root of p times one minus p over n. What the CLT says, then, is that if we take the sample proportion p hat, subtract off the population proportion p (the probability of getting a head), and divide by the square root of p times one minus p over n, that will be approximately normally distributed.

Things look a little worse than the die roll in this case. Let's take a fair coin and generate some plots. If you flip a fair coin and record either zero or one depending on whether it's heads or tails, subtract off 0.5, divide by the square root of 0.5 times 1 minus 0.5, and do that over and over again, the histogram of the results has only two possible values (they're not 0 and 1 because you've normalized them), and in this first plot you see it doesn't look normal at all, of course. After 10 coin flips, so now flipping the coin 10 times, taking the sample proportion of heads, subtracting off 0.5, dividing by the square root of 0.5 times 1 minus 0.5 over 10, and repeating that process over and over again, we see that the distribution of the normalized average of ten coin flips looks pretty normally distributed, but it's still very discrete compared to the normal distribution. Once you get to 20 coin flips it's looking pretty good; it overlays the standard normal distribution pretty well.

It turns out that in the coin-flipping example, convergence to normality is quick if the coin is fair. If the coin is unfair, look at the bottom row of plots. Here we have a coin where it's more likely to get a tail than a head; I believe I used 0.7/0.3 for the simulation. You see that with one flip it's much more likely to get the zero value than the one value (these aren't exactly zero and one because they've been normalized). The distribution of averages of ten coin flips doesn't look very normal at all, and at twenty coin flips it still doesn't look very normal. So it takes a lot longer, and this is a problem with the central limit theorem that generally doesn't get a lot of play.
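Here is a similar illustrative sketch for the coin-flipping comparison (again Python, not the lecture's code). I've assumed p = 0.3 for the biased coin, since the lecture recalls roughly a 0.7/0.3 split with tails more likely; the sample sizes 1, 10, and 20 follow the plots described above.

```python
# Standardized sample proportions for a fair coin (p = 0.5) and a biased
# coin (p = 0.3), at n = 1, 10, and 20 flips, compared to the N(0, 1) density.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_sims = 10_000
grid = np.linspace(-3, 3, 200)
std_normal = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)

fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharey=True)
for row, p in enumerate((0.5, 0.3)):
    for col, n in enumerate((1, 10, 20)):
        flips = rng.binomial(1, p, size=(n_sims, n))      # 0/1 coin flips
        p_hat = flips.mean(axis=1)                        # sample proportion
        z = (p_hat - p) / np.sqrt(p * (1 - p) / n)        # standardized proportion
        axes[row, col].hist(z, bins=30, density=True, alpha=0.5)
        axes[row, col].plot(grid, std_normal, color="gray")
        axes[row, col].set_title(f"p = {p}, n = {n}")
plt.show()
```

The top row converges quickly; the bottom row stays visibly skewed at n = 10 and n = 20, which is the point about convergence speed made above.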
The central limit theorem says, basically, that if you have iid random variables and you take an average, the distribution of the normalized average converges to that of a standard normal distribution. But it doesn't tell you how fast that happens, right? It just tells you that it happens eventually. For some distributions it might take thousands of observations going into the sample mean before the distribution of sample means behaves like that of a Gaussian, and for others, like the die roll, we saw that it only took six before it was nearly overlaying the standard normal distribution. So it's an unfortunate fact that the central limit theorem can't give you any guarantees on how quickly things converge to normality.

Oh, and I wanted to point out one last thing on this coin-flipping example. If you've ever been to a science museum, you've seen these machines where a ping-pong ball is dropped and goes through a kind of Pachinko-like collection of random left-right decisions. That's exactly a binomial experiment. Every time the ping-pong ball hits a nail, it can go left or right with 50% probability each. So by the end of the process it has gone through a bunch of coin flips, and the position where it lands at the bottom is, in principle, exactly the sum of a bunch of Bernoulli trials. You have to build the machine so that it approximates coin flipping well; for example, if it were tilted to one side, it wouldn't be a fair coin anymore. But as the ping-pong balls collect at the bottom, each one represents a sum of a bunch of Bernoulli random variables. Since the distribution of averages is approximately Gaussian, the distribution of sums is also approximately Gaussian, because sums are just averages multiplied by n. So what you'll see in the science museum is a Gaussian distribution traced out at the bottom, and if they run the ping-pong machine long enough, the balls tend to fall in a bell-shaped curve. That's just the central limit theorem saying that, for this particular value of n, the normal approximation to the sum of a bunch of coin flips is pretty good. This ping-pong device is actually called a quincunx, and it was invented by Francis Galton, who we talked about earlier, the cousin of Charles Darwin. It's an interesting little tidbit, so the next time you're at the science museum, you can explain it to whoever you're with.

The reason we use the central limit theorem in practice is that it's useful as an approximation: it says the normalized mean has a distribution that's approximately standard normal. Let me give you an example. Remember that 1.96 is a good approximation to the 0.975 quantile of a standard normal, so negative 1.96 is a pretty good approximation to the 0.025 quantile, the 2.5th percentile. What the central limit theorem then says is that the probability that a standardized mean lies between minus 1.96 and plus 1.96, roughly minus two and plus two, is about 95%. Let me just repeat that: the probability that a standardized mean lies between about minus 2 and plus 2 is about 95%. Now let's take the interior of this probability statement and rearrange terms a little, making sure that when we multiply through by a minus sign we flip the inequalities. We get that Xn bar plus 1.96 sigma over square root n is greater than or equal to mu,
which in turn is greater than or equal to Xn bar minus 1.96 sigma over square root n. That probability is about 0.95. What that's saying is that the random interval, Xn bar plus or minus about 2 standard errors, contains mu, the non-random quantity, with about 95% probability. In this case we wanted a 95% interval, so we took 5%, divided it by 2, and used the 2.5th and 97.5th percentiles; basically we added and subtracted that quantile times the standard error around Xn bar. That's 95%, but if we wanted something other than 95%, we just use the standard normal quantile at 1 minus alpha over 2. In this case alpha is 0.05, so alpha over 2 is 0.025, and 1 minus that is 0.975; we need z sub 0.975, which is 1.96, and that's where the 1.96 came from. But we could do it for another value. Imagine you wanted a 90% interval. Then alpha is 0.1, alpha over 2 is 0.05, and 1 minus 0.05 is 0.95, so we would need the 95th percentile to plug in there.

Okay, so this is the idea that we can create so-called confidence intervals: random intervals that contain the quantity they're trying to estimate with 95% probability, or more generally with probability 1 minus alpha. We create such intervals by taking the estimate plus or minus a standard normal quantile times the standard error. In this case the estimate is the sample mean, but we'll find that with the central limit theorem and the law of large numbers we can play some tricks and get this kind of interval to work in a lot of cases. Tons of the intervals you're going to see in your statistics classes are exactly of the form: estimate, plus or minus, standard normal quantile, times standard error. Anyway, this is called a 95% confidence interval; it's a mainstay of statistics, especially so-called frequentist statistics. It's basically an estimate that carries some acknowledgement of the uncertainty arising from the fact that we have data we're treating as random. That's what a confidence interval is. If we just report Xn bar, the sample average, as our estimate, there's no acknowledgement of the random variation we're not accounting for. In this case we're saying that if we're willing to assume the data are independent and identically distributed with a finite variance, then we can apply the central limit theorem, and it says that if the distribution is cooperating and n is large enough, the interval contains mu with probability about 95%.

Now, unfortunately, I'm going to pontificate a little bit. It's an unfortunate fact that confidence intervals are really quite hard to interpret, and this is a by-product of so-called frequentist inference. Once we actually get data and calculate a confidence interval, we just have two numbers, and those two numbers either contain mu or they don't. Standard frequentist logic says that the probability that that particular interval contains mu is either zero or one: it either contains it or it doesn't. So the real interpretation of a confidence interval is that this procedure, applied over and over again, creates confidence intervals that will contain mu 95% of the time. That's the real interpretation of a confidence interval if you are a hardball frequentist.
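Here is a minimal sketch of that repeated-experiment interpretation, under assumptions I'm choosing for illustration: a hypothetical exponential population with mean 1 and standard deviation 1 (so the data are clearly non-normal), 50 observations per experiment, and the known-sigma interval Xn bar plus or minus 1.96 sigma over square root n.

```python
# Coverage check for the 95% interval: the population, sample size, and
# number of experiments are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0          # exponential(rate = 1) has mean 1 and sd 1
n = 50                        # observations per experiment
n_experiments = 10_000

covered = 0
for _ in range(n_experiments):
    x = rng.exponential(scale=1.0, size=n)    # one simulated experiment
    xbar = x.mean()
    half_width = 1.96 * sigma / np.sqrt(n)    # known-sigma CLT interval
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / n_experiments)   # should come out close to 0.95
```

The point is not the particular number printed, but that the 95% is a statement about the procedure across many repetitions, not about any single realized interval.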
And it's unfortunate that that interpretation is so hard. Let me repeat it, because it is kind of tricky. It's saying that the confidence interval procedure, given that the central limit theorem applies and all of our assumptions hold, creates intervals such that, if we were to repeat the procedure over and over again on repeated experiments, then for 95% intervals, about 95% of the time the intervals would contain the value they're trying to estimate. That's a confusing statement, and it's one of the main criticisms of confidence intervals: if you're strict about the interpretation (and not everyone is, because the strict interpretation is so hard), it's genuinely difficult. Maybe I'll try to dig up some examples of a confidence interval making its way into a very important problem, to show you how the press is basically incapable of interpreting intervals this way. And to their credit, it's because it's hard and it's kind of a crazy interpretation. There is, in fact, a different way of creating intervals that gives you a better interpretation.

There's also another trick if you happen to be taking a statistics test, a trick to get around the mental games associated with fictitious repetitions of experiments and so on. The trick is just to say we're 95% confident the interval contains mu, and statisticians have decided that's enough hedging to count as a legitimate instance of the strict definition. So if you're taking a statistics test, don't say there's a 95% chance that the interval you just calculated contains mu, because your teacher might yell at you about that. But if you say you're 95% confident, they'll begrudgingly give you credit.

Let me give you another formula, a super useful instance of the CLT, because it gives a quick back-of-the-envelope calculation. By the way, we're going to spend much more time calculating specific confidence intervals later; right now I just want to give you the theory behind why confidence intervals work. But there's one specific instance that's really quite useful. For sample proportions, remember that the variance is p times one minus p. So, plugging in, the confidence interval takes the form p hat, the sample proportion, plus or minus the standard normal quantile times the square root of p times one minus p over n. But p is what we're trying to estimate, so we obviously can't plug it into this formula; we need to replace it with an estimate. If we replace p with the sample proportion p hat, so the standard error uses p hat times one minus p hat, you get the so-called Wald interval (a small sketch of it follows below). As I stated on the previous slide but didn't really discuss, replacing unknown parameters in the standard error with their estimates is usually justified by something called Slutsky's theorem, and it usually works and creates an interval that's asymptotically valid. We'll talk a lot more about Wald intervals later; right now let's look at a quick back-of-the-envelope bound. Remember that the largest p times one minus p can be is when p is one half, so it's less than or equal to a quarter whenever p is between 0 and 1, which is our restriction since we're talking about a proportion or probability.
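Here is that plug-in (Wald) interval as a small sketch; the function name and the example counts are made up for illustration, and the final comment previews the quick bound derived next.

```python
# Wald interval for a proportion: p-hat plus or minus z times the plug-in
# standard error sqrt(p-hat * (1 - p-hat) / n).
import numpy as np

def wald_interval(successes, n, z=1.96):
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)   # plug-in (estimated) standard error
    return p_hat - z * se, p_hat + z * se

# Hypothetical example: 56 heads in 100 flips.
print(wald_interval(56, 100))
# Since p-hat * (1 - p-hat) is at most 1/4, the half-width is never more
# than z / (2 * sqrt(n)), which is the basis for the quick bound that follows.
```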
So in this case, if we let alpha be 0.05, the standard normal quantile is 1.96, which is close enough to 2; among friends, let's just call it 2. Then the margin-of-error part of the confidence interval, 2 times the square root of p times one minus p over n, is at most 2 times the square root of a quarter over n, and so you wind up with 1 over square root n. So if you want a quick back-of-the-envelope confidence interval for a sample proportion, just take the sample proportion and add and subtract 1 over square root n. That's a really handy little formula, and it tells you, when you calculate a proportion, about how accurate it is. For example, if I have a proportion based on 100 coin flips, the accuracy is going to be about 1 over the square root of 100, or 0.1. So it's a very useful back-of-the-envelope calculation: just remember p hat plus or minus 1 over square root n. Well, that's the end of today's lecture on asymptopia. I hope you enjoyed your visit to asymptopia; it's a very nice place. When you have an infinite amount of data, things tend to work out. Look forward to seeing you next time.