Household clustering by type of residence

The NFHS-3 household dataset includes a constructed Wealth Index variable. Our premise is that wealth should be defined very differently depending on the type of residence: rural, urban, mega-city, and urban excluding mega-cities.

First we subset the full dataset into these separate residence type subpopulations, and select only the household characteristics we are most interested in. Then we cluster each of these subpopulations using k-means. To determine which k value to use, we refer to Sai Pranav’s analysis of within-group sum of squares.
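As a rough illustration of how a within-group sum of squares comparison can guide the choice of k, here is a minimal sketch in plain Python. It is not the code used in the analysis: the toy points, the `kmeans_wss` helper, and the basic Lloyd's-style iteration are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss(points, k, n_iter=50, seed=0):
    """Run a basic k-means and return the total within-group sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as starting centers
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                new_centers.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Toy data (hypothetical): three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

The usual reading is to look for the k after which the within-group sum of squares stops dropping sharply.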

For each residence type subpopulation, we compare our clusters to the NFHS wealth index.
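The comparison amounts to a cross-tabulation of wealth index level against cluster label. A minimal sketch in plain Python, assuming one wealth label and one cluster label per household; the six households below are made up and the `crosstab` helper is hypothetical.

```python
from collections import Counter

WEALTH_LEVELS = ["Richest", "Richer", "Middle", "Poorer", "Poorest"]

def crosstab(wealth_index, clusters):
    """Count households for every (wealth level, cluster) pair."""
    counts = Counter(zip(wealth_index, clusters))
    cluster_ids = sorted(set(clusters))
    return {
        level: [counts[(level, c)] for c in cluster_ids]
        for level in WEALTH_LEVELS
    }

# Made-up labels for six households (not NFHS data).
wealth_index = ["Richest", "Richest", "Richer", "Middle", "Poorer", "Poorer"]
clusters = [2, 2, 1, 3, 3, 3]

table = crosstab(wealth_index, clusters)
for level, row in table.items():
    print(f"{level:8s}", row)
```

Each row is a wealth level and each column a cluster, which is the same layout as the frequency tables in the sections that follow.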

Finally, we look at four variables specifically:

electricity usage (HV206 in the codebook)
access to improved sanitation (HV205)
access to improved drinking water (HV201)
households using clean fuel for cooking (HV226)

Preet Rajdeo has already taken a geographic approach to these variables, and generated district-level choropleths.

Feature engineering

There are about 40 variables that pertain to household characteristics indicating wealth. Most of these have only two levels, but some, like the type of water supply or floor material, have many (up to 20). We bin the many-level variables into 2-3 levels so that the clustering algorithm performs better. One variable (number of rooms for sleeping) is numeric, and will be centered and scaled.

For the variables with many levels, we find guidance in the NFHS factsheets. For example, the factsheets bin drinking water supply into “improved source” and “un-improved source”. Our binning follows these NFHS classifications when possible.
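A minimal sketch of the two preprocessing steps, in plain Python. The water source level names and the `IMPROVED_WATER` set are hypothetical placeholders, not the actual NFHS codes or classification, and the room counts are made up. Scaling here uses the sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical level names; the real NFHS codebook has its own categories.
IMPROVED_WATER = {"piped into dwelling", "public tap", "tube well or borehole"}

def bin_water_source(level):
    """Collapse a many-level water source variable into two levels."""
    return "improved" if level in IMPROVED_WATER else "unimproved"

def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

# Made-up example households.
water = ["public tap", "river", "piped into dwelling", "unprotected well"]
rooms_for_sleeping = [1, 2, 2, 3, 5]

binned = [bin_water_source(w) for w in water]
scaled = center_and_scale(rooms_for_sleeping)

print(binned)
print([round(s, 2) for s in scaled])
```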

The number of clusters to use for each subpopulation was determined by Sai earlier. For each subpopulation, we try both a Euclidean metric and a cosine distance metric.
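To make the difference between the two metrics concrete, here is a small sketch in plain Python. The feature vectors are made up, and how the cosine metric was actually plugged into the clustering is not specified in this report, so this only shows the distances themselves.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two made-up households with the same pattern of assets, one scaled up.
household_a = (1, 1, 0, 1)
household_b = (2, 2, 0, 2)

d_euclid = euclidean_distance(household_a, household_b)
d_cosine = cosine_distance(household_a, household_b)
print(round(d_euclid, 3), round(d_cosine, 3))
```

The Euclidean distance responds to the difference in magnitude, while the cosine distance only responds to the direction of the vectors, so the two metrics can group the same households differently.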

Rural

If we cluster the rural subpopulation with k=3, we get the following frequency heatmap (with values below, annoyingly sorted differently).

##          clustering
## df_WI        1    2    3
##   Richest  902 3940   19
##   Richer  1379 2408 1523
##   Middle   576  115 3013
##   Poorer    76    0 1940
##   Poorest    8    0  626

Key variable distributions in different Rural clusters

This rural clustering does seem to capture three different groups well, based on the distributions of these four important variables. The three clusters also seem consistent with the 5-level wealth index variable.
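One way to check this consistency is to look at how each cluster splits across the wealth index levels. The sketch below, in plain Python, uses the counts from the rural table above; the helper name and the output layout are our own choices for illustration.

```python
WEALTH_LEVELS = ["Richest", "Richer", "Middle", "Poorer", "Poorest"]

# Counts per cluster, copied column by column from the rural table above.
rural_counts = {
    1: [902, 1379, 576, 76, 8],
    2: [3940, 2408, 115, 0, 0],
    3: [19, 1523, 3013, 1940, 626],
}

def shares(counts):
    """Convert raw counts into proportions of the cluster total."""
    total = sum(counts)
    return [c / total for c in counts]

dominant = {}
for cluster, counts in rural_counts.items():
    cluster_shares = shares(counts)
    best = max(range(len(counts)), key=counts.__getitem__)
    dominant[cluster] = WEALTH_LEVELS[best]
    pretty = ", ".join(
        f"{level} {share:.1%}" for level, share in zip(WEALTH_LEVELS, cluster_shares)
    )
    print(f"cluster {cluster}: {pretty}")

print(dominant)
```

Cluster 2 is almost entirely Richest and Richer households, cluster 3 is mostly Middle and below, and cluster 1 sits in between with Richer as its largest group.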

Mega-cities

If we cluster the mega-cities subpopulation with k=5, we get the following frequency heatmap (with values below, annoyingly sorted differently).

Here we can certainly see some of the bias in using one wealth index for all of India. The vast majority of households in mega-cities have been classified as either Richer or Richest. We suspect this is mostly due to the more prevalent infrastructure and resources in a city, such as plumbing and sewage. This is consistent with the criteria used to construct the NFHS wealth index.

Although it is hard to tell for sure from the distribution of the four variables we look at, it is possible that our clustering is picking up a more meaningful classification of these mega-city dwellers.

##          clustering
## df_WI        1    2    3    4    5
##   Richest    3  520 2099  582  985
##   Richer   826  564    2   19  603
##   Middle   609    2    0    0    4
##   Poorer    84    0    0    0    0
##   Poorest    5    0    0    0    0
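To put a number on "vast majority", the row totals of the table above can be summed directly. A short sketch in plain Python, with the counts copied from the table; the variable names are ours.

```python
# Rows of the mega-city table above: wealth level -> counts for clusters 1..5.
mega_city_counts = {
    "Richest": [3, 520, 2099, 582, 985],
    "Richer": [826, 564, 2, 19, 603],
    "Middle": [609, 2, 0, 0, 4],
    "Poorer": [84, 0, 0, 0, 0],
    "Poorest": [5, 0, 0, 0, 0],
}

row_totals = {level: sum(counts) for level, counts in mega_city_counts.items()}
total_households = sum(row_totals.values())
top_two = row_totals["Richest"] + row_totals["Richer"]
top_two_share = top_two / total_households

print(row_totals)
print(f"{top_two} of {total_households} households = {top_two_share:.1%}")
```

By this count, roughly nine in ten mega-city households fall into the top two levels of the national wealth index.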

Key variable distributions in different Mega-city clusters

Urban

If we cluster the urban subpopulation with k=4, we get the following frequency heatmap (with values below, annoyingly sorted differently).

##          clustering
## df_WI         1     2     3     4
##   Richest 11601  6332  5452   174
##   Richer     31  6039   492  5146
##   Middle      0   417    45  3224
##   Poorer      0     8     3   883
##   Poorest     0     0     0   158

Key variable distributions in different Urban clusters

Urban excluding mega-cities

If we exclude mega-cities from the urban subpopulation, and cluster the remainder with k=5, we get the following frequency heatmap (with values below, annoyingly sorted differently).

##          clustering
## df_WI        1    2    3    4    5
##   Richest    0 4797 3225 2124 9224
##   Richer  2216  388 3945 3114   31
##   Middle  2461   36  445  129    0
##   Poorer   797    2   11    0    0
##   Poorest  153    0    0    0    0