Sampling distributions

The mean of a sample is a relatively good estimate of the mean of the population. More specifically, the mean of a large sample will be a better estimate of the population mean than the mean of a smaller sample. So, the larger the sample, the better the estimate of the population mean. *The histogram of a sample statistic (a mean is one example of a statistic) computed over many samples (reminder: it is computed over our samples, not our populations) is called the sampling distribution of that statistic.* So plotting the sample means of our samples gives us the sampling distribution of the mean.
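As a sketch of how such a distribution can be built, we can simulate it. The die-throw population and the sample size of 30 are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)

# Population: the faces of a fair die, each equally likely.
population = [1, 2, 3, 4, 5, 6]

def sample_mean(n):
    """Mean of a random sample of n die throws."""
    return statistics.mean(random.choices(population, k=n))

# Draw many samples and record each sample's mean: this collection
# of means is the sampling distribution of the mean.
sample_means = [sample_mean(30) for _ in range(10_000)]

# Its centre should sit close to the population mean of 3.5.
print(round(statistics.mean(sample_means), 2))
```

Plotting `sample_means` as a histogram would show the sampling distribution of the mean directly.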

One point to note is that, by the central limit theorem, the sampling distribution of the mean will be approximately normally distributed for sufficiently large samples, regardless of the shape of the population distribution.

Confidence intervals

It was said in the beginning that the mean of a sample is a relatively good estimate of the mean of the population. However, we don’t know how good an estimate it is. That is why we use confidence intervals. *Confidence intervals are interval estimates of where the population mean might be, constructed so that with probability X (e.g. 95%) an interval built this way will contain the population mean.*

So suppose we gave a sample of 5 people a die to throw (side 1 is 1 point, side 2 is 2 points, etc.) and the sample mean score was 3. Since we know that the minimum score is 1 and the maximum is 6, we are 100% confident that the population mean lies between 1 and 6. This is an example of a confidence interval. However, it is too broad to be helpful: the usefulness of a CI increases as it narrows. So let us say that for a 95% CI, the interval is 2 to 4. It is important to note that all this means is that if we repeatedly drew random samples from the population and computed a 95% CI from each, about 95% of those intervals would contain the population mean.
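The repeated-sampling interpretation can be checked with a small simulation. The normal-approximation interval, the die-throw population, and the sample size of 30 are assumptions for illustration:

```python
import random
import statistics

random.seed(0)
POP_MEAN = 3.5  # true mean of a fair die

def ci_95(sample):
    """Normal-approximation 95% CI for the mean of a sample."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

# Repeatedly sample and count how often the interval captures 3.5.
trials = 2_000
hits = 0
for _ in range(trials):
    sample = random.choices(range(1, 7), k=30)
    lo, hi = ci_95(sample)
    if lo <= POP_MEAN <= hi:
        hits += 1

print(hits / trials)  # typically close to 0.95
```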

We know that a normal distribution is fully determined by its mean and standard deviation. We also know that the sampling distribution of the mean is approximately normal: its mean equals the population mean, and its standard deviation equals the population standard deviation divided by the square root of the sample size. Thus, the sampling distribution of the mean can be computed from the population mean and the population standard deviation.
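One property worth noting here, since it reappears below when we meet the standard error: the SD of the sampling distribution of the mean equals the population SD divided by the square root of the sample size. A quick simulation, with a die-throw population assumed for illustration, shows the match:

```python
import random
import statistics

random.seed(1)

# Population SD of a fair die.
sigma = statistics.pstdev([1, 2, 3, 4, 5, 6])  # about 1.708
n = 25

# Build the sampling distribution of the mean for samples of size n.
means = [statistics.mean(random.choices(range(1, 7), k=n))
         for _ in range(20_000)]

# Its SD should be close to sigma / sqrt(n).
print(round(statistics.stdev(means), 3), round(sigma / n ** 0.5, 3))
```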

Normally, we don’t have access to the population mean, but we can draw samples from the population. Since we do not know the population mean, we don’t know where our sample means stand in relation to it (they might fall above, below, or exactly on the population mean). So the question is:

how do we know how close the population mean is to the sample mean?

We know that the sampling distribution is approximately normal and that its mean is a relatively good approximation to the population mean. The procedure to answer the above question goes as follows.

- We plot a normal distribution whose mean is known to be a relatively good approximation to the population mean
- We use what we know about normal distributions to *estimate* how far the sample mean is from the population mean

Consider the probabilities of the standard normal distribution (SND): there is roughly a 95% probability (more precisely, 95.45%) that a value will fall in the -2 to 2 z-score range.

Normally, we are interested in a 95% CI. Checking a table of z-scores shows that 95% of the scores in a normal distribution fall within -1.96 and 1.96 standard deviations of the mean. This means that we can be 95% confident that the sample mean falls within 1.96 standard deviations of the sampling distribution from the population mean. So if the population mean were somewhere below the sample mean, we could be 95% confident that it was no more than 1.96 of those standard deviations below the sample mean.
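Instead of a z-table, these values can be checked with Python's standard library; a minimal sketch using `statistics.NormalDist`:

```python
from statistics import NormalDist

snd = NormalDist()  # standard normal: mean 0, SD 1

# Probability mass between -1.96 and 1.96 z-scores.
print(round(snd.cdf(1.96) - snd.cdf(-1.96), 4))  # 0.95

# Inverse lookup: the z-score that leaves 2.5% in each tail.
print(round(snd.inv_cdf(0.975), 2))              # 1.96
```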

Knowing all the above, we can estimate how far a sample mean lies from the population mean if we know:

- the sample mean
- the standard deviation of the sampling distribution (we do not know this directly, but we can estimate it from the sample)

So how do we estimate that standard deviation?

Standard error

We estimate it with the standard error. *The standard deviation of the sampling distribution of the mean is called the standard error.* It measures the degree to which the individual sample means deviate from the mean of the sample means. Before we go on, it is important to note a few things. The means of small samples have a high SD (so they deviate a lot from the population mean), while the means of big samples have a low SD (they don’t deviate much from the population mean). Since the standard error is the SD of the sampling distribution, the standard error from smaller samples will be higher than the standard error from bigger samples. So a rule of thumb is that low standard errors are more desirable than high standard errors. The standard error for a particular sample can be estimated as follows: *(sample SD) / (square root of sample size)*.
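A minimal sketch of this formula, using hypothetical die-throw values:

```python
import statistics

def standard_error(sample):
    """Standard error estimate: sample SD / sqrt(sample size)."""
    return statistics.stdev(sample) / len(sample) ** 0.5

sample = [4, 2, 5, 3, 1, 6, 3, 4, 2, 5]  # hypothetical die throws
print(round(standard_error(sample), 3))  # 0.5
```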

Computing the CI

Once we have the standard error, for a 95% CI, we can estimate the interval inside which the population mean might fall as follows: for the lower bound, *sample mean – 1.96 × standard error*, and for the upper bound, *sample mean + 1.96 × standard error*. So the 95% CI is a range between a lower value and an upper value, (lower value, upper value), and more specifically: *(sample mean – 1.96 × standard error, sample mean + 1.96 × standard error)*.
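Putting the pieces together, a sketch of the 95% CI computation (the sample values are hypothetical):

```python
import statistics

def ci_95(sample):
    """95% CI for the mean: sample mean ± 1.96 * standard error."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

sample = [4, 2, 5, 3, 1, 6, 3, 4, 2, 5]  # hypothetical die throws
lo, hi = ci_95(sample)
print(round(lo, 2), round(hi, 2))  # 2.52 4.48
```

Here the mean is 3.5 and the standard error is 0.5, so the interval is 3.5 ± 0.98.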

For smaller samples we will get broader CIs, while for bigger samples we will get narrower CIs. The narrower the CI, the more useful it is.