There’s No Perfect Sample, But How Do We Get Close?
A good sample is an unbiased sample. The ideal sample is representative of the entire population, with no bias that would underestimate or overestimate the true characteristics or behavior of the population as a whole. In practice, as long as we conduct an experiment or develop a model using a sample rather than the whole population, some error or bias will be present. This doesn’t make the data useless, but the severity of the bias determines how useful the resulting model will be.
“Bias is disproportionate weight in favor of or against an idea or thing”
Bias in sampling is much like overfitting in machine learning, where a model explains the given sample exceptionally well but fails badly when applied to new data. So how do we obtain a sample that is good enough for statistical analysis and sufficiently represents the general population? What practical constraints have to be considered when we decide on the number of samples required and the sampling techniques to use?
As sampling requirements are usually determined on a case-by-case basis, we will look at the usual constraints faced by statisticians and market surveyors.
1. Number of Samples
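As anyone can guess, the more samples the better, but what exactly constitutes the minimum sample required for a study or survey to support statistically sound conclusions? The answer depends on several factors, including the population size, the desired confidence level and the margin of error. Here, Cochran’s equation can be used to determine the required sample size:
$$n_0 = \frac{Z^2 \, p \, q}{e^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}$$
where:
n_0 = Sample size required for a very large (effectively infinite) population
n = Sample size required after correcting for a finite population
Z = z-value corresponding to the desired confidence level (e.g. 1.96 for 95%)
p = Estimated proportion of the population which has the attribute in question
q = 1-p
e = Margin of error, expressed as a proportion (e.g. 0.05 for 5%)
N = Population Size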
The rule of thumb is that the analysis should have between 100 observations and 10% of the population size, without exceeding 1000 observations. This helps ensure the general population is sufficiently represented.
Where the population is below 100, Cochran’s equation should be used instead and, as expected, the required sample will be a large percentage of the population in small-population studies. For example, keeping the other parameters constant, a population of 100 would require a sample size of 80 (80% of the population), but a population of 1000 would only require a sample size of 278 (about 28%). These figures are consistent with a 95% confidence level, p = 0.5 and a 5% margin of error.
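To make the arithmetic concrete, here is a minimal Python sketch of the calculation. It assumes the finite-population form of Cochran’s formula, n0 = Z²pq/e² followed by n = n0 / (1 + (n0 − 1)/N). The function name `cochran_sample_size` and its default arguments (95% confidence, p = 0.5, 5% margin of error) are illustrative assumptions, chosen because they reproduce the figures quoted above.

```python
import math


def cochran_sample_size(population_size, z=1.96, p=0.5, e=0.05):
    """Sample size from Cochran's formula with finite population correction.

    The defaults are assumptions: z = 1.96 (95% confidence),
    p = 0.5 (most conservative proportion), e = 0.05 (5% margin of error).
    """
    q = 1 - p
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * p * q / (e ** 2)
    # Shrink it to account for the finite population size N
    n = n0 / (1 + (n0 - 1) / population_size)
    # Round up, since we cannot survey a fraction of a person
    return math.ceil(n)


print(cochran_sample_size(100))   # 80
print(cochran_sample_size(1000))  # 278
```

Note how the required sample grows much more slowly than the population: going from 100 to 1000 members only raises the requirement from 80 to 278.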
Cochran’s equation can be tested on this online calculator: Cochran’s Calculator
2. Resources vs Accuracy
The second consideration is the trade-off between available resources and the accuracy desired from the model. As the scale of the study increases, gathering more samples and performing statistical analysis requires more time, money and computational power. Therefore, we have to determine the importance of the analysis and the accuracy required, and then optimize based on the resources available.
For example, research with a limited budget that only requires a rough estimate of the result can be conducted with a simpler analysis and fewer samples. On the other hand, studies of great importance, or those intended for accredited publication, should invest more and give more thought to the sampling process to ensure the results are as accurate as possible.
As described in the engineering rule of metrology:
The perfect resolution for a tool is not the smallest unit of measurement possible, but the smallest unit of measurement needed.
3. Sampling Methods
There are numerous ways of sampling, but they are usually classified into 2 types: Probability Sampling Techniques and Non-Probability Sampling Techniques. We will look at Random Sampling and Stratified Sampling as probability-driven techniques, then at Convenience Sampling and Quota Sampling as non-probability-driven techniques.
Probability Sampling Techniques
I. Random Sampling
This method is performed by selecting each subject independently of the other members of the population, which can be done by running a simple random number generator on a computer. It is often used when both the population and the available sample size are large, since a large number of observations reduces random error.
Advantages:
i) Each member has an equal chance of being chosen for the study
Disadvantages:
i) Accessibility can be an issue when one group of the population is easier to reach for the survey than another, which can lead to sampling bias.
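As a rough illustration of random sampling, the sketch below uses Python’s standard `random` module. The helper name `simple_random_sample`, the population of 10,000 numeric IDs and the seed are all hypothetical. It also assumes sampling without replacement, so no member is picked twice.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each with an equal chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, sample_size)


# Hypothetical population: member IDs 1..10000
population = list(range(1, 10001))
sample = simple_random_sample(population, 500, seed=42)

print(len(sample))       # number of members drawn
print(len(set(sample)))  # same number, since no member is drawn twice
```

The accessibility problem described above is not visible in code like this: the generator picks IDs fairly, but bias can still creep in if some of the selected members cannot actually be reached.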
II. Stratified Sampling
This method builds a scaled-down sample that mirrors the composition of the original population. The population is divided into categories (strata), such as race, religion, gender or education level, and members are then randomly selected from each category in proportion to its share of the population.
For example, suppose 56% of the population are Christians and 44% are non-Christians. In a stratified sample of 100, we would randomly select 56 Christians and 44 non-Christians for the study, so that the sample accurately represents the distribution of Christians in the population.
Advantages:
i) Ensures every category of the population is represented, especially minorities, which random sampling has a high chance of leaving out in a large population.
Disadvantages:
i) It becomes increasingly complex and restrictive as more stratification criteria are introduced to the sampling process.
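The 56/44 example can be sketched in Python as follows. This is a minimal illustration assuming proportional allocation with simple rounding. The function name `stratified_sample`, the tuple layout of each member and the population of 10,000 are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=None):
    """Randomly sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    # Group members by stratum label
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding may shift the total by a member or two
        k = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 56% Christian, 44% non-Christian
population = [("Christian", i) for i in range(5600)]
population += [("non-Christian", i) for i in range(4400)]

sample = stratified_sample(population, lambda member: member[0], 100, seed=0)
counts = Counter(label for label, _ in sample)
print(counts["Christian"], counts["non-Christian"])  # 56 44
```

Adding a second criterion (say, gender) would mean grouping on a pair of labels instead, which quickly multiplies the number of strata to manage.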
Non-Probability Sampling Techniques
I. Convenience Sampling
This sampling method applies no selective filtering and, as the name suggests, prioritizes convenience: samples are selected based on availability and accessibility. It is often used in the early stages of research to confirm a hypothesis or produce a rough estimate with little effort.
Advantages:
i) Useful when sample is needed quickly
Disadvantages:
i) Likely to be a poor representation of the true population, with severe bias and error.
ii) Does not offer the statistical insights provided by probability methods
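To see how this bias can arise, consider the toy Python sketch below. Everything in it is hypothetical: the population is ordered by how easy members are to reach, the first 200 (“on-campus”) members have a value of 30 and the remaining 800 (“off-campus”) members have a value of 50. The convenience sample simply takes whoever comes first.

```python
def convenience_sample(population_in_arrival_order, sample_size):
    """Take the first sample_size members that happen to be available."""
    return population_in_arrival_order[:sample_size]


def mean_value(members):
    """Average of the measured value (second field) over the given members."""
    return sum(value for _, value in members) / len(members)


# Hypothetical population, ordered from most to least accessible
population = [("on-campus", 30)] * 200 + [("off-campus", 50)] * 800

sample = convenience_sample(population, 100)

print(mean_value(population))  # 46.0, the true population mean
print(mean_value(sample))      # 30.0, the biased convenience estimate
```

Because only the most accessible group is ever surveyed, the estimate misses the true mean by a wide margin, and collecting more of the same easy-to-reach members would not fix it.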
II. Quota Sampling
This method operates like Stratified Sampling, except that the quotas are not filled by random selection. Instead, convenience sampling is used until the quota for each subcategory is fulfilled.
Advantages:
i) It provides a more representative view than convenience sampling with the quota introduced
Disadvantages:
i) Still prone to selection bias as it is built on top of convenience sampling
ii) Does not offer the statistical insights provided by probability methods
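A minimal Python sketch of quota sampling is shown below. The function name `quota_sample`, the stream of respondents and the quotas of 56 and 44 are hypothetical. Members are accepted in the order they become available, and anyone whose category quota is already full is skipped.

```python
from collections import Counter


def quota_sample(available_in_order, category_of, quotas):
    """Accept members in arrival order until every category quota is filled."""
    filled = Counter()
    sample = []
    for member in available_in_order:
        category = category_of(member)
        # Accept the member only if their category still has room
        if filled[category] < quotas.get(category, 0):
            sample.append(member)
            filled[category] += 1
        # Stop as soon as all quotas are met
        if all(filled[c] >= q for c, q in quotas.items()):
            break
    return sample


# Hypothetical stream of respondents in the order they happen to show up
stream = [
    ("non-Christian", i) if i % 3 == 0 else ("Christian", i)
    for i in range(1000)
]
quotas = {"Christian": 56, "non-Christian": 44}

sample = quota_sample(stream, lambda member: member[0], quotas)
counts = Counter(label for label, _ in sample)
print(counts["Christian"], counts["non-Christian"])  # 56 44
```

The category proportions come out right, but within each category the members are still simply the first ones to show up, which is why selection bias remains.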
Conclusion
When we gather samples for a statistical analysis, we should consider the number of samples required, the resources available, the accuracy required and the sampling technique appropriate to the situation.