Sources of Biases in the Consumer Pyramids Household Survey

Salil Sanyal, a retired member of the Indian Statistical Service and a former consultant to the UNDP, writes:

The Consumer Pyramids Household Survey (CPHS), conducted by CMIE since 2014, has of late been criticised as suffering from biases towards the better-off (Jean Dreze & Anmol Somanchi1).

In their article on The India Forum, Jesim Pais & Vikas Rawal2 went into the details of every stage of the survey, inter alia criticising CMIE's random sampling process, but wrongly put the number of urban households into rural and the rural numbers into urban, thereby missing a major source of bias (an error they acknowledged and corrected in a subsequent comment). Vyas's rejoinder3 to Dreze & Somanchi claiming that the CPHS has no bias, his reply4 to Pais & Rawal, and the rejoinder from Pais & Rawal5 form the basis of the present article. No results of the CPHS are discussed, as the premise on which they are based is shaky; the article is therefore limited to the sampling methodology.

Bias in formation of rural and urban strata

In the two-stage stratified design, 640 districts were stratified into 110 homogeneous strata, equivalent to 550 strata (rural strata, plus towns classified into very large, large, medium and small strata). In effect, after deleting the inaccessible ones, there were 100 rural strata as compared to 312 urban strata. Pais & Rawal overlooked this urban-centric tilt conceived by CMIE at the stratification stage itself: the more populous rural sector is relegated, and the better-off sections are given priority in stratification.

Use of weights for estimation

There were no impediments to field operations in 2014, and it would supposedly have been easy to contact the Gram Panchayat/Block headquarters for details of the village/enumeration block (refer to the plea taken by Vyas in his rejoinder to Dreze3), list the households, and arrange them in descending order of consumer expenditure so that all classes were represented in a systematic selection of sample households. Instead, no listing was done, and the initial samples were drawn in the absence of the total number of households in the village/block. A sort of self-weighting design was chosen to avoid listing at the village/block level, with 'weights'6 preferred as a common multiplier, in place of estimating the characteristics through the normal estimation procedure for a multistage design. Pais & Rawal discussed at length the non-response and substitution cases in which the use of these weights without adjustment leads to unreliable estimates. How reliable are these projections, based on growth rates between the 2001 and 2011 Censuses, when they take into account neither the declining trend of growth rates in subsequent periods nor the decline in Total Fertility Rates?
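The danger of applying design weights without a non-response adjustment can be illustrated with a toy simulation (hypothetical numbers, not CPHS data): if poorer households respond less often and the original equal weights are used unchanged, the estimate of mean expenditure is overstated.

```python
import random

# Toy illustration: under an equal-probability sample the design weight is
# constant, but response falls with poverty, so respondents skew richer and
# the unadjusted estimate of mean expenditure is biased upwards.
rng = random.Random(7)

# Hypothetical population: 10,000 households, expenditure drawn uniformly.
population = [rng.uniform(5_000, 50_000) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# Poorer households respond at 50%, richer ones at 90% (assumed rates).
respondents = [x for x in population
               if rng.random() < (0.5 if x < 27_500 else 0.9)]
naive_mean = sum(respondents) / len(respondents)

print(f"true mean {true_mean:,.0f}, unadjusted estimate {naive_mean:,.0f}")
```

The response rates and expenditure range here are arbitrary; the point is only that the unadjusted estimate exceeds the true mean whenever response is correlated with expenditure.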

Evolution of sample design

A probability sample demands that once a sample is drawn, it should not be examined to see which districts or sub-districts are left out: the estimation procedure takes care of that in a well-designed and well-planned sample survey. The CPHS does it the other way around. The note on 'Sample Survival & Response Rates'7 gives a detailed account of how the sample was increased from 1,66,744 households in the first wave (Jan-Apr 2014) to 1,74,405 in the 18th wave (Sept-Dec 2019) through additions to and deletions from the original sample.

These post-hoc additions, along with legacy samples from a 2008 survey, form the story of the evolution of the design: selecting districts or sub-districts at will and enhancing the sample size render the results of different waves of varying reliability, thereby making the fast-frequency data utterly meaningless. If the objective were to select samples from each district, the CPHS could have stratified each district into rural and urban sectors, as the National Sample Survey Organisation did in its 72nd round. Can we call this mishmash of samples probability sampling?

What remains of the panel dataset?

Out of 1,66,744 households, only 1,45,984 were surveyed and accepted past the validation checks; of these, only 24,647 were surveyed and accepted in every wave up to the 18th. That is, only 16.9% of the cohort (refer Table 8, 'Sample Survival & Response Rates'7) survived through all 18 waves. With such a low survival rate, any inference about the behaviour of the panel over the years could be misleading.
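The survival rate quoted above follows directly from the counts in the CMIE note:

```python
# Panel survival across the 18 CPHS waves, using the counts quoted in the
# text (from CMIE's "Sample Survival & Response Rates" note).
initial_sample = 166_744    # households selected in wave 1 (Jan-Apr 2014)
accepted_wave1 = 145_984    # surveyed and accepted past validation checks
survived_all_18 = 24_647    # surveyed and accepted in every wave up to the 18th

survival_rate = survived_all_18 / accepted_wave1
print(f"panel survival rate: {survival_rate:.1%}")  # -> 16.9%
```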

Bias in selection of first stage units

Pais & Rawal have not discussed the allocation of the total sample size between the rural and urban sectors, thereby missing the bias towards the urban sector in the CPHS. For an all-India survey, there has been an undue over-representation of the urban sector among the first-stage units: 7,579 urban enumeration blocks as compared to 2,844 sample villages. The NSSO, on the other hand, in its comparable 72nd round (2014) surveyed 6,072 urban blocks as first-stage units against 8,016 sample villages, even with a double-weighted urban sector. The CPHS argues for a larger urban sample size on the ground of greater dispersion in the urban sector. This could have been taken care of while allocating the total sample size to the towns, sub-stratified by population, using the procedure of probability proportional to size (pps). Instead, 'one or two for each size-class' were perhaps allocated by the investigators, a sort of ad hoc allocation.
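A minimal sketch of what pps selection means (the town names and household counts below are hypothetical, not from the CPHS): each town's chance of being drawn is proportional to its size, so large towns enter the sample more often without any ad hoc quota per size-class.

```python
import random

# Hypothetical stratum of four towns with their household counts.
towns = {"A": 50_000, "B": 20_000, "C": 8_000, "D": 2_000}

def pps_draw(units, rng):
    """Draw one unit with probability proportional to its size."""
    total = sum(units.values())
    r = rng.uniform(0, total)
    cum = 0.0
    for name, size in units.items():
        cum += size
        if r <= cum:
            return name
    return name  # guard against floating-point edge cases

rng = random.Random(42)
draws = [pps_draw(towns, rng) for _ in range(10_000)]
share_A = draws.count("A") / len(draws)
print(f"empirical share of town A: {share_A:.3f}")  # expected near 50000/80000 = 0.625
```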

Tables 3 and 4 of the CPHS note on 'Survey Design and Sample'8 show how very large towns, small in number, were in general given larger weights than medium and small towns (only one per HR), an obvious bias towards the 'haves', taking the urban tilt further through ad hoc allocation.

On the one hand, too many enumeration blocks have been assigned per sample town at the expense of the sample size for the rural sector; on the other, this probably increases non-sampling errors (NSE) in the form of coverage, response and ascertainment errors. NSE tend to increase with sample size. We have already seen that many samples in different waves were not accepted, perhaps due to NSE, though CMIE has not said so in so many words.

While the sample size should be determined with an eye on a pre-fixed precision, in actual practice the sample size of a large-scale sample survey is dictated by the number of enumerators available. It appears that the CPHS does not have enough field staff for the rural sector as compared to the urban sector. Judged by the number of sample villages, the 72nd round of the NSS had 2.82 times as many as CPHS 2014. Rural representation in the CPHS is not adequate.

Systematic sampling

In response to the criticism by Pais & Rawal of its systematic sampling, Vyas writes: "we do not do only linear administration of the sampling. It is circular where households are organized in the form of concentric circles round a centre". Is this an Indian village scenario? Has anybody witnessed such an arrangement of households in concentric circles around a centre in any Indian village? Now, coming to circular systematic sampling (css): the interval is k = N/n and the random start is selected from 1 to N, where N is the total number of households in the village and n is the number to be selected. If the end of the list is reached before all n households are drawn, selection continues from the beginning of the list, maintaining the sequence; hence the name 'circular'. In linear systematic sampling (lss), the random start is chosen from 1 to k.
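The two textbook procedures just described can be sketched as follows (a minimal illustration; the values N = 300 and n = 15 and the seed are hypothetical):

```python
import random

def linear_systematic(N, n, rng):
    """Linear systematic sampling (lss): interval k = N // n, random start
    r drawn from 1..k, then every k-th unit: r, r+k, r+2k, ..."""
    k = N // n
    r = rng.randint(1, k)
    return [r + j * k for j in range(n)]

def circular_systematic(N, n, rng):
    """Circular systematic sampling (css): same interval, but the random
    start is drawn from 1..N and selection wraps past the end of the list."""
    k = N // n
    r = rng.randint(1, N)
    return [(r - 1 + j * k) % N + 1 for j in range(n)]

rng = random.Random(1)
print(sorted(linear_systematic(300, 15, rng)))    # 15 households, spaced 20 apart
print(sorted(circular_systematic(300, 15, rng)))  # 15 households, wrapping round
```

Under either rule every household in the frame of N units can be reached, which is what the restricted-start CPHS variant discussed below fails to guarantee.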

The CPHS has its own rules: no reference to N; an interval k = n rather than k = N/n; arbitrary numbers 5 and 15 bounding the random start; and a focus on the main street for sample selection, going to the interior only if required, rather than following a listed sequence of households. The CMIE may like to tell us where the numbers 5 and 15 come from, as they are responsible for the omission of a large number of households from the selection process, as shown below:

Suppose, in the absence of N, we take Ni = 300 (a village has on average 300 households3); the interval would be 300/16, rounded off to 20. Every unit in the frame should have an equal probability of selection. In the CPHS, with this prior range of random starts given to investigators, the numbers 1 to 4 and 16 to 20 are excluded, and hence these nine units out of every 20 have zero probability of selection. A sampling bias is thus introduced at the selection of households: this peculiar systematic sampling caters to a truncated population of unknown dimensions in the field. A minimum of 27% of households in the rural or urban sector were left out of selection (4 out of every 15 when the interval is 15, the smallest consistent with the start range; 9 out of every 20, or 45%, in the example above). The larger the village, the larger the proportion left out of selection. It is not known how the CPHS has dealt with large villages: more than 60% of the households of a large village (say, 500 households) would be missed during sampling, unless the village were divided into equal-sized hamlets and one drawn with equal probability, with the number of hamlets noted for estimation purposes. In view of this unnoticed omission of households, the weights were not adjusted, leading to overestimates in general.
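A small enumeration makes the exclusion concrete (assuming, as in the worked example, N = 300 and a random start restricted to 5..15; two candidate intervals are tried):

```python
# Households that can never be selected when the random start is restricted
# to 5..15. N = 300 as in the worked example; interval 15 is the smallest
# consistent with the start range, 20 is the rounded value of 300/16.
N = 300
for k in (15, 20):
    reachable = set()
    for start in range(5, 16):                    # restricted random start
        reachable |= set(range(start, N + 1, k))  # start, start+k, start+2k, ...
    excluded = N - len(reachable)
    print(f"interval {k}: {excluded} of {N} households unreachable "
          f"({excluded / N:.0%})")
```

The minimum exclusion of about 27% arises at interval 15 and grows with the interval, which is why larger villages fare worse.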

Vyas further writes, "It is known that systematic random sampling does not render the same probability for all households". If we follow the rules as prescribed in sampling theory, that is, k = N/n, every unit has the same probability of selection. On account of the unequal selection probabilities of households in the CPHS, the estimates generated cannot be unbiased.

Vyas thinks it a conjecture that the choice of systematic sampling in place of simple random sampling, and the failure to do the listing, injected an element of bias in favour of the well-off. Systematic sampling, done the correct way, will never inject a bias towards the well-off. It is the failure to obtain N, combined with the imperfections in the use of systematic sampling pointed out earlier, plus a design made urban-centric in its stratification, allocation and selection, that has positively injected multiple biases.

Vyas further states that the bias in favour of the well-off will be studied during the September-December 2021 wave, and that corrections in sampling, wherever necessary, will be made by expanding the sample in the January-April 2022 wave. The CMIE may note that to obtain unbiased estimates, the whole sampling design has to be overhauled for the reasons noted above. Theoretically, in the absence of N, no lss or css is possible, and the so-called systematic sampling innovated by the CPHS is not random sampling; the CPHS data do not have the validity of a probability sample. Simply expanding the sample size will not do.

The CMIE may like to consider setting up a Sampling Design Unit headed by a person well-versed in sampling theory and its application, as in the NSSO.

(I acknowledge with thanks the interest shown by Jean Dreze in an earlier draft, and thank him for providing me with the CMIE documents without which I could not have gained this insight into the sampling design.)

-- Salil Sanyal

References

  1. Jean Dreze and Anmol Somanchi, 'View: New barometer of India's economy fails to reflect deprivations of poor households', The Economic Times, 21 June 2021
  2. Jesim Pais and Vikas Rawal, 'Consumer Pyramid Household Survey – An Assessment', The India Forum, 13 August 2021
  3. Mahesh Vyas, 'View: There are practical limitations in CMIE's CPHS sampling, but no bias', The Economic Times, 24 June 2021
  4. Mahesh Vyas, 'Consumer Pyramids Household Survey: A response to Pais and Rawal', The India Forum, 23 August 2021
  5. Jesim Pais and Vikas Rawal, 'A Rejoinder to Mahesh Vyas', The India Forum, 3 September 2021
  6. Mahesh Vyas, 'Weights', CMIE, 11 November 2020
  7. Mahesh Vyas, 'Sample Survival and Response Rates', CMIE, 12 March 2020
  8. Mahesh Vyas, 'Survey Design and Sample', CMIE, 12 March 2020