Probability and Sampling

KEY TERMS

  • Population: All members of a certain group.

  • Sample: A subset of the population.

  • Probability Sampling: A sampling technique in which every member of the population has a known, calculable probability of becoming part of the sample.

  • Random Sampling: A kind of probability sampling that uses randomization to select the sample, which helps keep the sample unbiased and representative of the population.

  • Stratified Sampling: A kind of probability sampling where separate samples are taken from each group of a population to ensure that the different groups are represented in the final sample.

When dealing with statistical data, we often come across population data sets and sample data sets. A population data set covers all members of a certain group, while a sample is a subset of that population. For example, the population could be all people living in California, while a sample of this population would be some of the people living in California.

Samples are especially useful when dealing with large data sets. Studying every entry in the data set (the population) can take up too much time and too many resources, in which case it makes more sense to study a sample that is representative of the population. A representative sample has the same characteristics as the actual population, which allows us to draw inferences about the population while working with a manageable data set. Two ways to draw such samples are random sampling and stratified sampling; both are examples of probability sampling.

Probability Sampling

Probability sampling is a technique for drawing samples that can be representative of the population as a whole. A probability sample is any sample for which you can calculate the chance of any subset of the population being sampled. For example, if you have a bag with 5 red marbles and 3 blue marbles, you can calculate the probability of pulling out any possible combination of marbles. Not all outcomes need to have the same chance of being chosen: in this example, a single draw is more likely to produce a red marble (5/8) than a blue one (3/8).
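
The marble example can be worked through numerically. The sketch below assumes, since the text does not say, that 3 marbles are drawn without replacement; the function name prob_red_count and the draw size are illustrative choices, not part of the original example.

```python
from fractions import Fraction
from math import comb

RED, BLUE = 5, 3  # marbles in the bag, as in the example


def prob_red_count(k, draws=3):
    """Probability of exactly k red marbles in `draws` draws without replacement."""
    total = comb(RED + BLUE, draws)                    # all equally likely handfuls
    favourable = comb(RED, k) * comb(BLUE, draws - k)  # handfuls with k red marbles
    return Fraction(favourable, total)


for k in range(4):
    print(f"{k} red, {3 - k} blue: {prob_red_count(k)}")
```

The four probabilities add up to 1, and they are not equal to each other: a handful of 2 red and 1 blue is far more likely than 3 blue.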

Random Sample

A random sample is one drawn using randomization. This can be done in a number of ways under a variety of conditions. One of the easiest ways to draw a random sample is similar to drawing numbers out of a hat. In a simple random sample, every element in the population has the same probability of being part of the sample. For example, when asked to pick a letter from A-Z at random, each letter has an equal probability of being chosen (1/26). Random sampling is the most common form of probability sampling.
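
As a minimal sketch of the letter example, Python's standard random module can stand in for the hat. The seed value and the sample size of 5 are arbitrary assumptions made for illustration.

```python
import random
import string

rng = random.Random(42)  # fixed seed (an arbitrary choice) so the run is repeatable

letters = list(string.ascii_uppercase)  # the population: A-Z

# One letter: each of the 26 letters has probability 1/26 of being picked.
one_letter = rng.choice(letters)

# A simple random sample of 5 distinct letters, like drawing 5 slips from a hat.
hat_draw = rng.sample(letters, k=5)

print(one_letter, hat_draw)
```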

Stratified Sample

A stratified sample divides the population into groups and randomly samples from each of those groups based on the proportion of each group in the overall population. For example, if we want to sample 10 people from a group of 30 women and 70 men, we may want to keep the same proportion of women and men in our final sample. Because the proportion of women in the overall group is 30/100, we want 3 women in our final sample of 10 people. We can then randomly sample 3 women from the 30 women and 7 men from the 70 men.
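
The proportional allocation described above might be sketched as follows. The member labels (W0, M0, ...) and the helper stratified_sample are hypothetical names introduced for illustration, and rounding the group sizes is a simplification that only works cleanly when the proportions divide evenly, as they do here.

```python
import random

rng = random.Random(0)  # arbitrary seed for repeatability

# Hypothetical population matching the example: 30 women and 70 men.
strata = {
    "women": [f"W{i}" for i in range(30)],
    "men": [f"M{i}" for i in range(70)],
}


def stratified_sample(strata, sample_size, rng):
    """Sample from each stratum in proportion to its share of the population."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; round() is a simplification that can make
        # the total drift from sample_size when shares do not divide evenly.
        n = round(sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, n)
    return sample


sample = stratified_sample(strata, 10, rng)
print({name: len(chosen) for name, chosen in sample.items()})
```

With these inputs the printed group sizes are 3 women and 7 men, matching the 30/100 and 70/100 proportions in the population.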

Sampling With/Without Replacement

You select a card from a deck of cards, shuffle it back into the deck, and select another card. It's possible that you will pick the same card more than once, because it has been replaced in the sample space/population. This is sampling with replacement - randomly selecting an item with the possibility of selecting the same item more than once.

Sampling without replacement is when you do not "return" a sampled item. This is similar to distributing a deck of cards among a series of players: nobody can get two of the same card, and no two people can have the same card, since there is only one of that card and it is not replaced.
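
The two card scenarios can be contrasted in a short sketch. In Python's standard library, random.choices draws with replacement and random.sample draws without replacement; the deck encoding, the seed, and the hand size of 5 are assumptions made for illustration.

```python
import random

rng = random.Random(1)  # arbitrary seed for repeatability

# A 52-card deck: 13 ranks in each of 4 suits (Spades, Hearts, Diamonds, Clubs).
ranks = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
suits = "SHDC"
deck = [rank + suit for suit in suits for rank in ranks]

# With replacement: each card goes back into the deck, so repeats are possible.
with_replacement = rng.choices(deck, k=5)

# Without replacement: a drawn card stays out, so all 5 cards are distinct.
without_replacement = rng.sample(deck, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

A draw with replacement can be any size, even larger than the deck, while a draw without replacement can never exceed 52 cards.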

Let's look at another example. We have 3 letters (A, B, and C) and randomly draw two samples of 3 letters each: AAB and BCA. Which sampling methods are being used? The first must be sampling with replacement, because A occurs more than once in the sample but only once in the population. The second could be sampling either with or without replacement. If it is sampling with replacement, the fact that each letter occurs only once could be a coincidence. Alternatively, it could be sampling without replacement, where a sampled letter is not "returned" to the population.
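
To see how plausible that coincidence is, the sketch below enumerates every ordered sample of size 3 drawn with replacement from A, B, and C and counts those with no repeated letter. It assumes each letter is equally likely on every draw.

```python
from fractions import Fraction
from itertools import product

letters = "ABC"

# Every ordered sample of size 3 drawn with replacement (3**3 = 27 outcomes).
outcomes = list(product(letters, repeat=3))

# Outcomes in which no letter repeats, such as ('B', 'C', 'A').
no_repeats = [o for o in outcomes if len(set(o)) == 3]

p_no_repeats = Fraction(len(no_repeats), len(outcomes))
print(p_no_repeats)  # 2/9
```

So a sample like BCA appears about 22% of the time even when sampling with replacement, which is why it cannot rule out either method.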

Despite the name, random samples are not haphazard: they require care and precision to collect.

Do the following sampling methods describe a random sample?

  1. Drawing a sample of 500 people from a population of 2000 people where each person has an equal chance of being selected (1/2000 on any given draw).

  2. Picking every 100th entry in a large data set of 500,000 entries.

  3. Collecting information about popular colors among women by asking every woman who passes you on the street about her favorite color.
