A good choice of study subjects is vital to ensure that the study findings accurately represent the true picture in the population. Since the entire population cannot be included in a study, there is a need to select a sample which is a subset of the population for study. The sample should be large enough so that the study findings can be generalised, but at the same time should be acceptable and feasible in terms of cost, time and resources. There are various methods which are used in epidemiological studies for sampling. They are broadly classified into probability sampling and non-probability sampling.
Probability Sampling 
Probability sampling, the gold standard for ensuring generalisability, uses a random process to guarantee that each unit of the population has a known (specified) probability of being selected for the study. Several methods are commonly used:
A simple random sample is drawn by enumerating the units of the population and selecting a sample at random. This is the ideal method when the investigator wishes to select a representative sample from a population. In simple random sampling, each unit or individual has an equal probability of being selected for the study. Each individual in the sampling frame is assigned a number, and individuals are then selected using a random number table. Other methods, such as the lottery method, can also be used.
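The procedure can be sketched in a few lines of Python, with the standard library's random number generator standing in for a random number table. The sampling frame of 500 numbered individuals, the sample size of 50 and the function name are assumptions made only for illustration.

```python
import random

def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n units without replacement; every unit has the same chance of selection."""
    rng = random.Random(seed)  # the seed only makes this illustration reproducible
    return rng.sample(sampling_frame, n)

# Hypothetical sampling frame: individuals numbered 1 to 500
frame = list(range(1, 501))
sample = simple_random_sample(frame, n=50, seed=1)
print(sorted(sample))
```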
A stratified random sample involves dividing the population into subgroups ("strata") according to characteristics such as age or gender, and taking a random sample from each stratum. The sample is deliberately drawn so that each portion of the sample represents the corresponding stratum of the population. The subsamples in a stratified sample can be weighted to draw disproportionately from subgroups that are less common in the population but of special interest to the investigator. In studying the prevalence of hypertension, for example, it would be possible to stratify the population according to age and then to sample equal numbers from each age group. This would yield the age-specific prevalence of hypertension for each age group. A sample could also be stratified by place of residence, e.g. urban and rural areas, in order to get adequate representation of each subgroup.
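A minimal sketch of the hypertension example, sampling equal numbers from each age group, might look like this in Python. The population of 1,000 people, the three age bands and the figure of 30 per stratum are hypothetical and chosen only to make the sketch runnable.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Group units into strata, then draw a simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Equal numbers per stratum (fewer only if a stratum is smaller than requested)
    return {name: rng.sample(members, min(n_per_stratum, len(members)))
            for name, members in strata.items()}

def age_group(person):
    """Hypothetical age bands used as strata."""
    age = person[1]
    if age < 40:
        return "20-39"
    if age < 60:
        return "40-59"
    return "60-79"

# Hypothetical population of (id, age) pairs
age_rng = random.Random(0)
population = [(i, age_rng.randint(20, 79)) for i in range(1, 1001)]

sample = stratified_sample(population, age_group, n_per_stratum=30, seed=1)
for name in sorted(sample):
    print(name, len(sample[name]))
```

Because equal numbers are drawn from strata of unequal size, stratum-specific estimates can be reported directly, but an overall estimate would need to be weighted by the size of each stratum.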
A cluster sample is a random sample of natural groupings (clusters) of individuals in the population, such as schools, urban wards or villages. Cluster sampling is very useful when the population is widely dispersed and it is impractical to list and sample from all its elements. In community surveys, a random sample of clusters (villages, wards, blocks or schools) is drawn first. In the simplest design, all the eligible subjects within each selected cluster are included. In a two-stage cluster sample, the individuals within each selected cluster are enumerated and a subsample for study is selected by a second random process.
A disadvantage of cluster sampling is the fact that naturally occurring groups are often homogeneous (relative to the population) for the variables of interest; each city ward or village, for example, tends to have people of uniform socioeconomic status.
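Cluster sampling with a second stage of sampling inside each selected cluster can be sketched as follows. The 20 villages of 100 residents each, and the choice of 5 villages and 10 residents per village, are hypothetical values for illustration only.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters at random. Stage 2: sample individuals within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: random clusters
    sample = {}
    for name in chosen:
        members = clusters[name]
        k = min(n_per_cluster, len(members))
        sample[name] = rng.sample(members, k)  # stage 2: random individuals
    return sample

# Hypothetical population: 20 villages with 100 enumerated residents each
villages = {
    f"village_{v:02d}": [f"v{v:02d}_person_{i:03d}" for i in range(100)]
    for v in range(20)
}

sample = two_stage_cluster_sample(villages, n_clusters=5, n_per_cluster=10, seed=1)
for name, members in sample.items():
    print(name, len(members))
```

Setting n_per_cluster to the size of the largest cluster includes every eligible subject in each selected cluster.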
A systematic sample resembles a simple random sample in first enumerating the population, but differs in that the sample is selected by a predetermined periodic process (e.g. in a fever survey, taking every tenth house from a list of town residents). The first unit is selected randomly, and the remaining units are then selected at the predetermined interval. Systematic sampling is susceptible to errors caused by natural periodicities in the population.
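The fever survey example can be sketched as below: a random start within the first ten houses, followed by every tenth house. The list of 200 houses is a hypothetical sampling frame.

```python
import random

def systematic_sample(sampling_frame, interval, seed=None):
    """Pick a random start within the first interval, then take every interval-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random position among the first `interval` units
    return sampling_frame[start::interval]

# Hypothetical list of 200 houses in a town
houses = [f"house_{i}" for i in range(1, 201)]
sample = systematic_sample(houses, interval=10, seed=1)
print(sample[:3], len(sample))
```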
In non-probability sampling methods, the selection of the study subjects is determined by the investigator and does not follow any random procedure.
Convenience Samples/Purposive Samples
In clinical research, the study sample is usually made up of people who meet the entry criteria and are easily accessible to the investigator. This is termed a convenience sample. It has obvious advantages in cost and logistics, and is often the choice for many research questions. An example is including every accessible patient who meets the inclusion criteria among those attending a hospital out-patient department (OPD). Enrolling such patients over a sufficiently long period has the further advantage of capturing seasonal variations of a disease. The disadvantage of convenience sampling is that those who are available and easily accessible may have different characteristics from those who refuse or are unavailable for the study. For example, patients attending a hospital OPD may have a more severe form of a disease than those with a milder form who did not need medical attention.
The use of descriptive statistics and tests of statistical significance to draw inferences about the population from observations in the study sample is based on the assumption that a probability sample has been used. In clinical research, however, a random sample of the target population may not always be possible. Convenience sampling with a consecutive design is a common approach that can be suitable for most clinical research projects. In such cases, the conclusions of the study need to be interpreted cautiously.
Sample Size Estimation
Estimation of sample size is an important step in planning a research study. There are various ways of calculating a sample size, depending on the study design, sampling method, etc. We describe here the method of estimating sample size for a cross-sectional prevalence study. Calculating the sample size requires the following:
1) Expected (estimated) prevalence of the disease
2) Accuracy required: the allowable/permissible relative error
3) The confidence level
The formula for sample size estimation is given below:
Sample size: $n = \dfrac{z^2 p q}{d^2}$
z = standard normal deviate corresponding to the required confidence level. For a 95% confidence level, z = 1.96.
p = estimated prevalence (in %), q = 100 - p (in %)
d = allowable error, in the same units as p, here expressed as a relative error (20% of p)
In this example (Table 1), we can see that as the relative error increases, the sample size decreases. The investigator chooses the margin of error, but it should be kept in mind that the larger the allowable error, the lower the precision of the study.
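The relationship between relative error and sample size can also be checked with a short calculation. The sketch below assumes a hypothetical prevalence of 10% and expresses p as a proportion rather than a percentage (the result is the same in either unit). The figures are illustrative and are not those of Table 1.

```python
import math

def sample_size_prevalence(p, relative_error, z=1.96):
    """n = z^2 * p * q / d^2, with p as a proportion and d = relative_error * p."""
    q = 1.0 - p
    d = relative_error * p  # allowable error as a fraction of the prevalence
    return math.ceil(z ** 2 * p * q / d ** 2)  # round up to a whole subject

# Assumed prevalence of 10%, 95% confidence level (z = 1.96)
for relative_error in (0.05, 0.10, 0.20):
    n = sample_size_prevalence(0.10, relative_error)
    print(f"relative error {relative_error:.0%}: n = {n}")
# relative error 5%: n = 13830
# relative error 10%: n = 3458
# relative error 20%: n = 865
```

Doubling the relative error cuts the required sample size to about a quarter, because d is squared in the denominator.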
When calculating the sample size, we should also take into account the probability of non-response, which could be due to refusal to participate or unavailability of the study subjects. Usually 10-15% is added to the calculated sample size to allow for non-response. The sample size should be reasonable (neither too large nor too small) and the sample should be representative of the population from which it is taken.
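The non-response allowance is simple arithmetic, sketched below for a hypothetical calculated sample size of 865. The function name and the input value are assumptions for illustration. Integer arithmetic is used so that the result is always rounded up to a whole subject.

```python
def add_non_response_allowance(n, percent):
    """Add `percent` (a whole number, e.g. 10 or 15) to n, rounding up."""
    return (n * (100 + percent) + 99) // 100  # integer ceiling of n * (1 + percent/100)

calculated_n = 865  # hypothetical sample size from the formula
for percent in (10, 15):
    print(f"{percent}% allowance: recruit {add_non_response_allowance(calculated_n, percent)}")
# 10% allowance: recruit 952
# 15% allowance: recruit 995
```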
Randomisation means using a random process to decide the assignment of a patient to a study group in an experimental study design. It is the critical element of any randomised controlled trial. This can be achieved by using random number tables or the lottery method.
If randomisation is done properly, the next assignment cannot be predicted. This eliminates the possibility of any subjective bias of the investigators being introduced into the process of selecting patients for one treatment group or the other. Randomisation also increases the likelihood that the groups will be comparable with regard to characteristics in which we may be interested, such as sex, age and residence.
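A minimal sketch of simple randomisation is shown below, with a software random number generator standing in for a random number table. The 20 patient identifiers and the two group names are hypothetical.

```python
import random
from collections import Counter

def randomise(participants, groups=("treatment", "control"), seed=None):
    """Assign each participant to a group independently and with equal probability."""
    rng = random.Random(seed)  # the seed only makes this illustration reproducible
    return {person: rng.choice(groups) for person in participants}

# Hypothetical list of enrolled patients
patients = [f"patient_{i:02d}" for i in range(1, 21)]
allocation = randomise(patients, seed=1)
print(Counter(allocation.values()))
```

Each assignment is independent of the previous ones, so the next assignment cannot be predicted. With this simple scheme the two groups may not end up exactly equal in size, particularly in small studies.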
Apart from randomisation, one way of ensuring the comparability between two groups is by matching. However, we can only match on variables that we know about and that we can measure. Thus, we cannot match on many variables that may affect the study results, such as an individual’s genetic constitution, environmental exposure, or other variables of which we may not even be aware. Randomisation increases the likelihood that the groups will be comparable not only in terms of variables that we recognise and can measure, but also in terms of variables that we may not recognise and may not be able to measure but that nevertheless may affect the study results.
Choosing a good sample from a population is crucial for any research study. Ideally, probability sampling should be used to ensure representation of the population. Besides the method of sampling, it is also essential to have an adequate sample size. There are various ways of calculating the sample size depending on the type of study design and sampling method. Randomisation is the crux of an experimental study: it makes it likely that the groups have similar baseline characteristics, so that meaningful comparisons can be made.
• Descriptive statistics and tests of statistical significance draw inferences about the population from a study sample, based on the assumption that a “probability” sample has been used.
• Sample size should be large enough so that the study findings can be generalised, but at the same time should be acceptable and feasible in terms of cost, time and resources.
• Randomisation is the crux of an experimental study. If done properly, it ensures that the assignment of subjects cannot be predicted, thereby eliminating the possibility of subjective biases.