The NFHS-3 household dataset includes a constructed Wealth Index variable. Our premise is that wealth should be defined very differently depending on the type of residence region: rural, urban, mega-city, and urban excluding mega-cities.
First we subset the full dataset into these separate residence type subpopulations, and select only the household characteristics we are most interested in. Then we cluster each of these subpopulations using k-means. To determine which k value to use, we refer to Sai Pranav’s analysis of within-group sum of squares.
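The within-group sum of squares analysis referred to above is not reproduced here. As a rough illustration of the idea only, the sketch below runs a plain Lloyd-style k-means on made-up two-dimensional points and reports the within-group sum of squares for several values of k. The function names (`sq_dist`, `kmeans`), the toy data, and the seeds are all assumptions for illustration, not the code or data used in this analysis.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers, within-group SS)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen rows

    def assign(cs):
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda j: sq_dist(p, cs[j])) for p in points]

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers

    labels = assign(centers)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss

# Toy data (hypothetical): three loose blobs, 30 points each.
data_rng = random.Random(1)
toy = [
    (cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(30)
]

# Within-group sum of squares for k = 1..6; the "elbow" suggests a k.
wss_by_k = {k: kmeans(toy, k)[2] for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

The usual reading is to pick the k after which the within-group sum of squares stops dropping sharply. A single random start can land in a poor local optimum, so in practice several starts per k are typically compared.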
For each residence type subpopulation, we compare our clusters to the NFHS wealth index.
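The comparison described here amounts to a cross-tabulation of wealth-index level against cluster label. A minimal sketch of that counting step follows, assuming each household carries a wealth-index label and a cluster number; the function `crosstab` and the five toy households are hypothetical.

```python
from collections import Counter

WEALTH_LEVELS = ["Richest", "Richer", "Middle", "Poorer", "Poorest"]

def crosstab(wealth_index, cluster_labels):
    """Count households for every (wealth level, cluster) pair."""
    counts = Counter(zip(wealth_index, cluster_labels))
    clusters = sorted(set(cluster_labels))
    table = {w: [counts[(w, c)] for c in clusters] for w in WEALTH_LEVELS}
    return table, clusters

# Toy input (hypothetical): five households, two clusters.
wealth = ["Richest", "Richest", "Middle", "Poorer", "Middle"]
cluster = [1, 1, 2, 2, 2]

table, clusters = crosstab(wealth, cluster)
print("clusters:", clusters)
for level in WEALTH_LEVELS:
    print(level, table[level])
```

Cluster numbers are arbitrary labels, so only the pattern of counts across a row or column carries meaning, not the cluster's number.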
Finally, we look at several variables specifically:
electricity usage (HV206 in the codebook)
access to improved sanitation (HV205)
access to improved drinking water (HV201)
households using clean fuel for cooking (HV226)
Preet Rajdeo has already taken a geographic approach to these variables, and generated district-level choropleths.
There are about 40 variables that pertain to household characteristics indicating wealth. Some of these, like the type of water supply or floor material, have many (up to 20) levels, although the majority have only two levels. We need to bin these to 2-3 levels, for better performance with the clustering algorithm. Also, one variable (number of rooms for sleeping) is numeric, and will be centered and scaled.
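To make the two preprocessing steps concrete, here is a minimal sketch of binning a many-level categorical variable and of centering and scaling a numeric one. The water-source labels in `IMPROVED_WATER` are placeholders chosen for illustration and are not the NFHS codebook levels; `bin_water` and `center_and_scale` are hypothetical helper names, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Placeholder labels for illustration only; the real levels and their
# improved/unimproved classification come from the NFHS codebook and factsheets.
IMPROVED_WATER = {"piped into dwelling", "public tap", "tube well or borehole"}

def bin_water(source):
    """Collapse a many-level water source into two levels."""
    return "improved" if source in IMPROVED_WATER else "unimproved"

def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]

# Toy values (hypothetical): water source and sleeping rooms for five households.
water = ["public tap", "river", "piped into dwelling", "unprotected well", "river"]
rooms = [1, 2, 2, 3, 4]

water_binned = [bin_water(w) for w in water]
rooms_scaled = center_and_scale(rooms)
print(water_binned)
print([round(r, 3) for r in rooms_scaled])
```

Scaling matters here because the binned variables take only two or three values, while an unscaled room count could otherwise dominate a distance calculation.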
For the variables with many levels, we find guidance in the NFHS factsheets. For example, the factsheets bin drinking water supply into “improved source” and “un-improved source”. Our binning follows these NFHS classifications where possible.
The number of clusters to use for each subpopulation was determined by Sai earlier. For each subpopulation, we try both a Euclidean metric and a cosine distance metric.
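For reference, the two metrics differ in what they treat as "similar": Euclidean distance compares magnitudes, while cosine distance compares only direction. The sketch below defines both on plain tuples. How the cosine metric was wired into the clustering is not stated in this write-up, so this shows only the distance functions themselves; the names and toy vectors are assumptions.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1 - dot / (norm_a * norm_b)

# Toy vectors (hypothetical): v is u doubled, w points elsewhere.
u = (1, 0, 1)
v = (2, 0, 2)
w = (0, 1, 0)

print(euclidean(u, v), cosine_distance(u, v))
print(euclidean(u, w), cosine_distance(u, w))
```

Note that a household whose indicators are all zero has no defined cosine distance, which is worth checking for when most variables are binary.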
If we cluster the rural subpopulation with k=3, we get the following frequency heatmap (with values below, annoyingly sorted differently).
##          clustering
## df_WI         1     2     3
##   Richest   902  3940    19
##   Richer   1379  2408  1523
##   Middle    576   115  3013
##   Poorer     76     0  1940
##   Poorest     8     0   626
This rural clustering does seem to capture three different groups well, based on the distributions of these four important variables. The three clusters also seem consistent with the 5-level wealth index variable.
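One way to read the consistency with the wealth index is to convert each row of the rural table into shares, so that it is clear which cluster absorbs most of each wealth-index level. The counts below are copied from the rural table above; the variable names are ours, and this is only a reading aid, not part of the clustering itself.

```python
# Rural cross-tabulation, counts copied from the table above.
# Columns are clusters 1, 2, 3.
rural = {
    "Richest": [902, 3940, 19],
    "Richer": [1379, 2408, 1523],
    "Middle": [576, 115, 3013],
    "Poorer": [76, 0, 1940],
    "Poorest": [8, 0, 626],
}

# Share of each wealth-index level that falls in each cluster.
row_shares = {
    level: [count / sum(counts) for count in counts]
    for level, counts in rural.items()
}

# Cluster (1-based) holding the largest share of each level.
dominant = {
    level: shares.index(max(shares)) + 1
    for level, shares in row_shares.items()
}

for level, shares in row_shares.items():
    print(level, [round(s, 3) for s in shares], "-> cluster", dominant[level])
```

Because cluster numbers are arbitrary, the useful signal is whether adjacent wealth-index levels concentrate in the same cluster, not which number that cluster has.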
If we cluster the mega-cities subpopulation with k=5, we get the following frequency heatmap (with values below, annoyingly sorted differently).
Here we can clearly see some of the bias introduced by using one wealth index for all of India. The vast majority of households in mega-cities have been classified as either Richer or Richest. We suspect this is mostly due to the more prevalent infrastructure and resources in a city, such as plumbing and sewage. This is confirmed by looking at the criteria used to construct the NFHS wealth index.
Although it is hard to tell for sure from the distribution of the four variables we look at, it is possible that our clustering is picking up a more meaningful classification of these mega-city dwellers.
##          clustering
## df_WI         1     2     3     4     5
##   Richest     3   520  2099   582   985
##   Richer    826   564     2    19   603
##   Middle    609     2     0     0     4
##   Poorer     84     0     0     0     0
##   Poorest     5     0     0     0     0
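The concentration of mega-city households in the top two wealth-index levels, noted above, can be checked directly from these counts. The sketch below sums each row of the mega-city table and computes the combined share of Richest and Richer; the counts are copied from the table, and the variable names are ours.

```python
# Mega-city cross-tabulation, counts copied from the table above.
# Columns are clusters 1 through 5.
mega = {
    "Richest": [3, 520, 2099, 582, 985],
    "Richer": [826, 564, 2, 19, 603],
    "Middle": [609, 2, 0, 0, 4],
    "Poorer": [84, 0, 0, 0, 0],
    "Poorest": [5, 0, 0, 0, 0],
}

# Households per wealth-index level, summed over clusters.
level_totals = {level: sum(counts) for level, counts in mega.items()}
n_households = sum(level_totals.values())

# Combined share of the two highest wealth-index levels.
top_two_share = (level_totals["Richest"] + level_totals["Richer"]) / n_households

print(level_totals)
print(n_households, round(top_two_share, 3))
```

With nearly all households sitting in two of the five wealth-index levels, the index has little room to discriminate within mega-cities, which is what motivates clustering this subpopulation separately.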
If we cluster the urban subpopulation with k=4, we get the following frequency heatmap (with values below, annoyingly sorted differently).
##          clustering
## df_WI          1      2      3      4
##   Richest  11601   6332   5452    174
##   Richer      31   6039    492   5146
##   Middle       0    417     45   3224
##   Poorer       0      8      3    883
##   Poorest      0      0      0    158
If we separate urban from mega-cities, and cluster this subpopulation with k=5, we get the following frequency heatmap (with values below, annoyingly sorted differently).
##          clustering
## df_WI         1     2     3     4     5
##   Richest     0  4797  3225  2124  9224
##   Richer   2216   388  3945  3114    31
##   Middle   2461    36   445   129     0
##   Poorer    797     2    11     0     0
##   Poorest   153     0     0     0     0