Kenya’s 2010 constitution created 47 counties as part of a devolved system of government. For benchmarking purposes, these counties should be segmented by similarity, and clustering models from machine learning allow us to do just that.
The objective of our analysis is to find out which counties are similar and how they can best be segmented to allow for fair comparison and benchmarking between them. Just consider how unfair it is to compare a county like Mombasa, with an area of 218.86 km² and a counted population of 939,370, with a county like Marsabit, which has an area of 70,961.19 km² and a counted population of 291,166.
To meet this objective, a clustering analysis was performed within a Python Anaconda environment on a local machine. The code was managed in a Jupyter notebook and the following libraries were used:
- Pandas for data manipulation (e.g. read_csv, sort_index, drop)
- Matplotlib for plotting various graphs (e.g. boxplot, line graph, scatter plot)
- Scikit-learn for scaling (RobustScaler), machine learning (KMeans) and high-dimension visualization (TSNE)
- GeoPandas for shapefile manipulation and visualization
- Scipy for hierarchical clustering (linkage) and its visualization (dendrogram)
The analysis consisted of the following key steps:
- Getting the data – county data was obtained from a variety of trustworthy sources, compiled in a spreadsheet and imported into Python.
- Exploring and cleaning the data – visualization and analysis were performed to better understand the data and correct any errors.
- Data modeling – the data was transformed with a scaler and clustered with K-Means and a hierarchical clustering algorithm. The county clusters were visualized as a report, map, scatter plot and dendrogram.
Getting the Data
The county data was cobbled together from different sources and assembled into a single spreadsheet. Here is a description of the different fields:
- County_id – each county in Kenya has a unique number starting with Mombasa (1) and ending with Nairobi (47).
- County_name – most commonly used name of the county.
- Former_province – name of the former province, an administrative division that was disbanded under the new constitution.
- Census_area – the area in km² reported by KNBS in the Kenya Population and Housing Census 2009.
- Population_2009 – total county population reported by the Kenya Population and Housing Census 2009.
- Households_2009 – number of households reported by the Kenya Population and Housing Census 2009.
- Pop_under5 – total population under the age of 5 reported by the Kenya Population and Housing Census 2009.
- Fertility_rate – total fertility rate (number of children per woman) reported by the 2014 Kenya Demographic and Health Survey.
- Stunted_growth – percentage of children under 5 years who are stunted (moderate or severe) reported by the 2014 Kenya Demographic and Health Survey.
- Crude_death_rate – the crude death rates were obtained from the analytical report on mortality of the Kenya Population and Housing Census 2009.
- Pupils_per_teacher – average pupil to teacher ratio for public and private schools. The source data was obtained from the Kenya Open Data Initiative site.
- Secondary_enrollment – total number of pupils enrolled in secondary schools reported by KNBS in a synopsis report that mentions the Statistical Abstract 2013 as the source.
- Pop_urban – percentage of county population living in an urban area reported by the Kenya Population and Housing Census 2009.
- Pop_poor – percentage of the county population living below the poverty line. I don’t remember the source, but this is a good development indicator.
- Road_density – length of roads by area (km/km²), an indicator that I calculated using a national road dataset from the Kenya Roads Board and the county area reported by KNBS.
First, we import the spreadsheet data into a Pandas data frame with the read_csv method, using County_id as the index column. Then we use sort_index to sort the data frame on the row index. Here is what the head of the data frame looks like:
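As a minimal sketch of that import step, the snippet below reads a two-row stand-in for the spreadsheet. Only the Mombasa and Marsabit figures quoted earlier are used; the file name `counties.csv` and Marsabit's County_id of 10 are assumptions for illustration.

```python
import io

import pandas as pd

# Two-row stand-in for the spreadsheet, deliberately out of order.
# Marsabit's County_id (10) is an assumption for illustration.
csv_text = """County_id,County_name,Census_area,Population_2009
10,Marsabit,70961.19,291166
1,Mombasa,218.86,939370
"""

# With the real file this would read something like
# pd.read_csv("counties.csv", index_col="County_id"),
# where "counties.csv" is a hypothetical file name.
counties = pd.read_csv(io.StringIO(csv_text), index_col="County_id")

# Sort the rows on the index so that county 1 comes first.
counties = counties.sort_index()

print(counties.head())
```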
Exploring and Cleaning the Data
To confirm that the data is properly imported we can use the following methods on the data frame.
- columns – to confirm that there are no spaces in the column names.
- shape – to confirm that the data frame has the expected dimensions. In our case this is (47, 14), representing 47 observations (one for each county) and 14 features.
- info() – to confirm that each feature is of the expected data type. In our case, there should be two object (string) fields (County_name, Former_province) and three integer fields (Population_2009, Households_2009, Pop_under5). The remaining fields should be of type float.
- describe() – to confirm that the min, max values are within the expected range.
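The four checks above could look roughly like this. The small data frame is a hypothetical stand-in with three of the fourteen columns, so the printed shape is (2, 3) rather than (47, 14).

```python
import pandas as pd

# Hypothetical stand-in; the real frame has 47 rows and 14 columns.
counties = pd.DataFrame(
    {
        "County_name": ["Mombasa", "Marsabit"],
        "Census_area": [218.86, 70961.19],
        "Population_2009": [939370, 291166],
    },
    index=pd.Index([1, 10], name="County_id"),
)

# columns: list any column names that contain a space.
names_with_spaces = [name for name in counties.columns if " " in name]
print(names_with_spaces)

# shape: (rows, columns).
print(counties.shape)

# info(): data type and non-null count per column.
counties.info()

# describe(): min, max and quartiles of the numeric columns.
summary = counties.describe()
print(summary)
```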
To find out whether there are outliers in the dataset, we now create a boxplot for each of the numeric features.
The open dots on the graph indicate that many of the features have outliers. This suggests that the data is not normally distributed, and we’ll use this information later in our analysis.
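Here is a sketch of how such boxplots can be drawn, one panel per numeric feature. The data is synthetic (random values with one planted extreme), not the county dataset. The count at the end uses the same 1.5 × IQR rule that Matplotlib applies by default when it decides which points to draw as open dots.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for two of the county features.
data = pd.DataFrame(
    {
        "Population_2009": rng.normal(700_000, 200_000, 47),
        "Fertility_rate": rng.normal(4.0, 1.0, 47),
    }
)
data.loc[0, "Population_2009"] = 3_000_000  # planted outlier

numeric = data.select_dtypes("number")

# One boxplot per feature, because the features have very different scales.
fig, axes = plt.subplots(1, len(numeric.columns), figsize=(8, 3))
for ax, name in zip(axes, numeric.columns):
    ax.boxplot(numeric[name])
    ax.set_title(name)
fig.tight_layout()

# Count the values that fall more than 1.5 * IQR outside the quartiles.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
print(outliers)
```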
The K-Means clustering only works with numeric features. String features can be made categorical and converted to numeric features with label encoding or dummy variables. However, here is why we can simply drop both string features (and Census_area) from our analysis:
- County_name – the name is unique to each county, so it shouldn’t be included in the clustering process.
- Former_province – the former province to which a county belonged is arbitrary and therefore irrelevant in the clustering process.
- Census_area – there is a huge variation in county size, but we don’t want it as a distorting factor in the clustering process.
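Dropping the three columns is a one-liner with the Pandas drop method. The sketch below uses a hypothetical two-row frame and keeps the county names in a separate series, since they are handy later for labelling the clusters.

```python
import pandas as pd

# Hypothetical stand-in with the three columns to drop and one to keep.
counties = pd.DataFrame(
    {
        "County_name": ["Mombasa", "Marsabit"],
        "Former_province": ["Coast", "Eastern"],
        "Census_area": [218.86, 70961.19],
        "Population_2009": [939370, 291166],
    },
    index=pd.Index([1, 10], name="County_id"),
)

# Keep the names aside for reporting before they are dropped.
county_names = counties["County_name"]

# drop returns a new frame; the original stays untouched.
features = counties.drop(columns=["County_name", "Former_province", "Census_area"])

print(features.dtypes)
```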
The K-Means clustering algorithm minimizes the distance between the members of a cluster and the cluster center. Consequently, it gives a high weight to features with a high variance, like Population_2009. To remove this bias, we will scale our data in a preprocessing step. There are many scalers that we can choose from (e.g. StandardScaler, MinMaxScaler), but we’ll use Scikit-learn’s RobustScaler. It centers each feature on its median and scales it by its interquartile range, so unlike StandardScaler it doesn’t lean on our data being normally distributed, and it is less affected by outliers than MinMaxScaler. So, let’s go ahead and use RobustScaler to transform our data frame into a new data frame with the fit_transform method.
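A small sketch of the scaling step on made-up numbers: RobustScaler subtracts the median of each column and divides by its interquartile range, so the extreme value of 5000 stays extreme but does not squeeze the other rows together. Wrapping the result in a new data frame keeps the column names, because fit_transform returns a plain NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up values; the last population is an outlier on purpose.
features = pd.DataFrame(
    {
        "Population_2009": [100, 200, 300, 400, 5000],
        "Fertility_rate": [3.0, 4.0, 5.0, 6.0, 7.0],
    }
)

scaler = RobustScaler()

# fit_transform returns an array, so rebuild a data frame around it.
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(scaled)
```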
Now we need to determine the value for the n_clusters argument, which tells KMeans how many clusters it should create. This presents a dilemma since we have no way of telling how many county clusters there are. However, the performance of a cluster model can be measured by model inertia, calculated as the sum of squared distances of the samples to their nearest cluster center. Let’s run KMeans with a range of values and plot the model attribute inertia_ against the n_clusters argument:
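The loop below is a sketch of that elbow plot. It runs on synthetic data (47 points scattered around four made-up centers), so the shape of the curve is illustrative only. The settings n_init=10 and random_state=0 are my own choices to make the runs repeatable.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: 47 points around 4 centers.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
sizes = [12, 12, 12, 11]
scaled = np.vstack(
    [center + rng.normal(0, 0.5, size=(n, 2)) for center, n in zip(centers, sizes)]
)

# Fit one model per candidate n_clusters and record its inertia.
ks = list(range(1, 10))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(scaled)
    inertias.append(model.inertia_)

# The elbow is where the curve stops dropping steeply.
fig, ax = plt.subplots()
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("n_clusters")
ax.set_ylabel("inertia_")
```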
The graph clearly shows that the drop in model inertia (inertia_) gets smaller with each increase in the number of clusters (n_clusters). Since the elbow in the curve best balances model inertia and number of clusters, we choose n_clusters=4 for the clustering of our counties.
Recall that we scaled our data in a preprocessing step. We can now visualize the impact of scaling in a scatter plot with features on the x-axis (Population_2009) and y-axis (Fertility_rate) and dots colored by cluster.
You can see that without scaling, the cluster assignment is largely dependent on population, which has the largest variation. With scaling, fertility rate becomes a determining factor in the clustering process, as shown by the blue and magenta dots, which have similar populations.
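To make the effect tangible, here is a sketch on synthetic data: 40 made-up counties with a low fertility rate and 7 with a high one, all drawn from the same population distribution. K-Means is run with two clusters on the raw and on the scaled values, and both results are plotted on the original axes. The numbers and the two-cluster setting are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two fertility groups with overlapping populations.
population = rng.normal(700_000, 250_000, size=47).clip(min=50_000)
fertility = np.concatenate([rng.normal(3.5, 0.4, 40), rng.normal(7.0, 0.3, 7)])
raw = np.column_stack([population, fertility])
scaled = RobustScaler().fit_transform(raw)


def two_clusters(data):
    """Cluster the rows of data into two groups with K-Means."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)


raw_labels = two_clusters(raw)        # dominated by population
scaled_labels = two_clusters(scaled)  # fertility now carries weight

# Plot both assignments against the original, unscaled values.
fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
panels = [(raw_labels, "Without scaling"), (scaled_labels, "With RobustScaler")]
for ax, (labels, title) in zip(axes, panels):
    ax.scatter(population, fertility, c=labels, cmap="coolwarm")
    ax.set_title(title)
    ax.set_xlabel("Population_2009")
axes[0].set_ylabel("Fertility_rate")
```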
Finally, we are ready to run K-Means on our scaled features with n_clusters=4. Not sure what you were expecting, but here are the clusters that K-Means formed in report format:
- Nairobi (1)
- Kiambu, Kisumu, Machakos, Mombasa, Nakuru (5)
- Baringo, Bomet, Bungoma, Busia, Elgeyo Marakwet, Embu, Homa Bay, Kajiado, Kakamega, Kericho, Kilifi, Kirinyaga, Kisii, Kitui, Kwale, Laikipia, Lamu, Makueni, Meru, Migori, Murang’a, Nandi, Narok, Nyamira, Nyandarua, Nyeri, Siaya, Taita Taveta, Tharaka Nithi, Trans Nzoia, Uasin Gishu, Vihiga (32)
- Garissa, Isiolo, Mandera, Marsabit, Samburu, Tana River, Turkana, Wajir, West Pokot (9)
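A report in this format can be produced with a few lines of Pandas. The sketch below uses synthetic features and placeholder names ("County 01" to "County 47"), so the groups it prints are not the ones listed above. Fixing random_state is my own addition to keep the cluster assignment the same between runs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features, with placeholder names.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
sizes = [12, 12, 12, 11]
scaled = np.vstack(
    [center + rng.normal(0, 0.5, size=(n, 2)) for center, n in zip(centers, sizes)]
)
names = [f"County {i:02d}" for i in range(1, 48)]

model = KMeans(n_clusters=4, n_init=10, random_state=0)

# One cluster label per county, indexed by county name.
clusters = pd.Series(model.fit_predict(scaled), index=names, name="cluster")

# Print one line per cluster: member names followed by the member count.
for label, members in clusters.groupby(clusters):
    print(f"- {', '.join(sorted(members.index))} ({len(members)})")
```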
This information might be helpful if you know the 47 counties and have an idea where they are located. Nonetheless, a picture is worth a thousand words, so let’s create a map. To do this, let’s import GeoPandas and use it for the following:
- Use GeoPandas’ read_file method to import a shapefile of the 47 counties as a geo-data frame.
- Use the columns command to give the fields appropriate names.
- Merge the geo-data frame with a data frame that contains county name and cluster assignment using county name as the join field.
- Visualize the geo-data frame with the plot command using cluster assignment as the column argument.
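The steps above can be sketched as follows. The shapefile path, the name of its county field and the cluster labels are all hypothetical, and I use rename instead of assigning to columns so that only one field needs to be named. The join itself is shown on a plain data frame standing in for the geo-data frame, because merge works the same way on both.

```python
import pandas as pd


def merge_clusters(county_shapes, clusters):
    """Left-join cluster labels onto the county shapes by county name."""
    return county_shapes.merge(clusters, on="County_name", how="left")


def plot_cluster_map(shapefile_path, clusters, name_field):
    """Read a county shapefile, attach cluster labels and draw the map."""
    # Imported here so the join helper can be tried without GeoPandas.
    import geopandas as gpd

    shapes = gpd.read_file(shapefile_path)
    shapes = shapes.rename(columns={name_field: "County_name"})
    merged = merge_clusters(shapes, clusters)
    return merged.plot(column="cluster", categorical=True, legend=True)


# Illustrative labels only; the numbering of K-Means clusters is arbitrary.
clusters = pd.DataFrame(
    {"County_name": ["Nairobi", "Mombasa", "Marsabit"], "cluster": [0, 1, 3]}
)

# Plain data frame standing in for the geo-data frame.
shapes_stub = pd.DataFrame({"County_name": ["Mombasa", "Marsabit", "Nairobi"]})
merged = merge_clusters(shapes_stub, clusters)

# Names that differ between the two sources show up as missing labels.
unmatched = merged["cluster"].isna().sum()
print(merged)

# Hypothetical call with a real shapefile:
# plot_cluster_map("counties.shp", clusters, name_field="COUNTY")
```

Checking for unmatched names before plotting is worthwhile, since spellings such as Murang’a or Elgeyo Marakwet often differ between datasets.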
The map clearly shows that the county clusters are contiguous, with Mombasa and Kisumu (in green) as the only exceptions. This is evidence of a strong geographic trend that we will discuss in a later section.
We can also visualize the county clusters on a two-dimensional scatter plot using t-SNE, which is short for t-distributed Stochastic Neighbor Embedding. Without getting into the details of the algorithm, this is what the t-SNE scatter plot looks like:
Notice that the county clusters are well separated, with some indication that the magenta cluster could be split into 2 or 3 sub-clusters. How to do this could be a story for another day.
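For reference, a t-SNE plot like this can be sketched as below. The input is synthetic (47 rows and 11 features around four made-up centers), and the settings perplexity=10, init="pca" and random_state=0 are assumptions of mine, not necessarily the ones used for the plot above. Note that perplexity has to be smaller than the number of samples, which matters with only 47 counties.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: 47 rows, 11 columns.
centers = rng.normal(0, 4, size=(4, 11))
sizes = [12, 12, 12, 11]
scaled = np.vstack(
    [center + rng.normal(0, 0.5, size=(n, 11)) for center, n in zip(centers, sizes)]
)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# Project the 11 features onto 2 dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(scaled)

# Color each point by its K-Means cluster.
fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10")
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
```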
We’ve been using K-Means, but you might be wondering whether there are other clustering methods and models. To answer that question you might find An introduction to clustering and different methods of clustering a helpful read. K-Means provides a robust model that is highly scalable, but hierarchical clustering is interesting when working with a small dataset for the following reasons:
- Hierarchical clustering works bottom-up, so you don’t need to specify the number of clusters beforehand and can determine the number of clusters from the results.
- Hierarchical clustering gives repeatable results, unlike K-Means, where subsequent runs of the algorithm can produce different results because of its random initialization.
- The results of hierarchical clustering can be visualized in a tree-like dendrogram, which illustrates the entire clustering process.
When running hierarchical clustering on our transformed county dataset and visualizing the clustering process as a dendrogram we get the following:
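A dendrogram of this kind can be produced with SciPy’s linkage and dendrogram functions. The sketch below runs on synthetic data with placeholder county names, and the choice of Ward linkage is an assumption on my part. Rotating the leaf labels and shrinking the font helps a little with legibility when there are 47 leaves.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features, with placeholder names.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
sizes = [12, 12, 12, 11]
scaled = np.vstack(
    [center + rng.normal(0, 0.5, size=(n, 2)) for center, n in zip(centers, sizes)]
)
names = [f"County {i:02d}" for i in range(1, 48)]

# Each row of merges records one merge: the two clusters joined,
# the distance between them and the size of the new cluster.
merges = linkage(scaled, method="ward")

fig, ax = plt.subplots(figsize=(12, 4))
tree = dendrogram(merges, labels=names, leaf_rotation=90, leaf_font_size=7, ax=ax)
ax.set_ylabel("merge distance")
fig.tight_layout()
```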
Isn’t that beautiful? The labels for the counties are hardly legible, but here are a few key observations:
- The vertical distances between the merging of clusters suggest that Kenya has 2 or 3 clusters.
- Notice the similarities and differences with the K-Means results:
- Nairobi can clearly be identified as a cluster on its own
- Mombasa, Kiambu, and Nakuru form a cluster, but Kisumu and Machakos are left out.
- Garissa, Turkana, Tana River, Marsabit, Wajir, Samburu, Mandera, and West Pokot form a cluster, but Isiolo this time decides to team up with Kajiado.
- There is a strong indication that the other counties can be grouped in 3 clusters.
- Nyandarua and Nyamira must be very similar since they are the first two counties to be merged before Embu joins them shortly thereafter.
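Observations like these can also be read from the linkage matrix instead of the picture. In the sketch below (synthetic data, placeholder names, Ward linkage assumed), the first row of the matrix gives the first pair to be merged, fcluster cuts the tree into at most three clusters, and a cross-tabulation compares that cut with a four-cluster K-Means run.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features, with placeholder names.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
sizes = [12, 12, 12, 11]
scaled = np.vstack(
    [center + rng.normal(0, 0.5, size=(n, 2)) for center, n in zip(centers, sizes)]
)
names = [f"County {i:02d}" for i in range(1, 48)]

merges = linkage(scaled, method="ward")

# The first row holds the first merge, i.e. the two most similar rows.
first_a, first_b = merges[0, :2].astype(int)
print(f"First merge: {names[first_a]} and {names[first_b]}")

# Cut the tree so that at most three clusters remain.
hier_labels = fcluster(merges, t=3, criterion="maxclust")

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# Rows: hierarchical clusters, columns: K-Means clusters.
comparison = pd.crosstab(
    hier_labels, kmeans_labels, rownames=["hierarchical"], colnames=["kmeans"]
)
print(comparison)
```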
To find out which of Kenya’s 47 counties are similar and how they can best be segmented we performed K-Means and hierarchical clustering on a county dataset that incorporates vital statistics. Here are some of the key observations:
- Clearly, there is no other county in Kenya like Nairobi. To some extent, this can be attributed to the neglect of other regions, but Nairobi remains the beating heart of Kenya’s economy.
- Mombasa, Kiambu, Nakuru, Machakos, and Kisumu are counties characterized by high urbanization and development. Kisumu and Mombasa are major ports, while Nakuru, Kiambu, and Machakos neighbor Nairobi along major road arteries.
- The northern part of Kenya, consisting of the counties of Garissa, Isiolo, Mandera, Marsabit, Samburu, Tana River, Turkana, Wajir, and West Pokot, appears to be very different from the rest of the country. In general, these counties cover a large area and have a small population, a low road density and a high incidence of poverty. Low rainfall could be a key factor that sets these counties apart.
- The remaining 32 counties appear to be a mixed bunch, but more data and further study are needed to segment them in a meaningful way.