Segmentation of a population described by a very large number of features
PCA analysis and clustering with the k-means method
One visible sign of humanity's technological development is the constantly growing number of features describing particular phenomena and processes. In the past, consumers and natural or economic phenomena were described by only a few simple indicators. Today we see a huge increase in the detail of observations, embodied in a large number of complex indicators.
From the point of view of a data analyst, a large number of indicators is unfavorable for model analysis and clustering. We may be dealing with the so-called curse of dimensionality, a phenomenon in which the number of describing features approaches or even exceeds the number of observations or observed objects. The PCA method comes to the rescue. It allows many features to be aggregated into a few composite features, so-called components. In practice, PCA lets us create several important composite features out of dozens of original ones, which together explain 70–80% of the analyzed phenomenon. It should also be mentioned that the first two components carry the most information. More information on the PCA methodology can be found at this link: https://en.wikipedia.org/wiki/Principal_component_analysis
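As a quick intuition builder, here is a minimal sketch on synthetic data (not the dataset analyzed below) showing how PCA condenses many correlated features into a handful of components that explain most of the variance:
import numpy as np
from sklearn.decomposition import PCA
# synthetic example: 30 features that are mostly noisy mixtures of 3 hidden factors
rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 3))          # 3 underlying signals
mixing = rng.normal(size=(3, 30))           # each feature mixes the signals
X = hidden @ mixing + 0.1 * rng.normal(size=(500, 30))
pca = PCA().fit(X)
# the first few components capture almost all of the variance
print(pca.explained_variance_ratio_[:5].round(3))
print(pca.explained_variance_ratio_[:3].sum().round(3))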
PRACTICAL USE OF THE PCA METHOD TO CREATE COMBINED FEATURES, SO-CALLED COMPONENTS
To illustrate the procedure, I downloaded a free database from the UCI Machine Learning Repository.
The database covers research on communities within the United States. As the UCI description puts it: “The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR”. This is the link to the source: https://archive.ics.uci.edu/dataset/183/communities+and+crime
The database consists of two files: the data file itself and a file containing the feature names. Extracting the feature names turned out not to be a simple task, which is why I simply present the ready-made list of feature names that will be needed when loading the main data file.
First, I import the Python libraries I need.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
The next step is to load the main data file, assigning the feature names that were the subject of the research.
headers = ['county', 'community', 'communityname', 'fold', 'population', 'householdsize',
'racepctblack', 'racePctWhite', 'racePctAsian', 'racePctHisp', 'agePct12t21',
'agePct12t29', 'agePct16t24', 'agePct65up', 'numbUrban', 'pctUrban', 'medIncome',
'pctWWage', 'pctWFarmSelf', 'pctWInvInc', 'pctWSocSec', 'pctWPubAsst', 'pctWRetire',
'medFamInc', 'perCapInc', 'whitePerCap', 'blackPerCap', 'indianPerCap', 'AsianPerCap',
'OtherPerCap', 'HispPerCap', 'NumUnderPov', 'PctPopUnderPov', 'PctLess9thGrade',
'PctNotHSGrad', 'PctBSorMore', 'PctUnemployed', 'PctEmploy', 'PctEmplManu',
'PctEmplProfServ', 'PctOccupManu', 'PctOccupMgmtProf', 'MalePctDivorce',
'MalePctNevMarr', 'FemalePctDiv', 'TotalPctDiv', 'PersPerFam', 'PctFam2Par',
'PctKids2Par', 'PctYoungKids2Par', 'PctTeen2Par', 'PctWorkMomYoungKids',
'PctWorkMom', 'NumIlleg', 'PctIlleg', 'NumImmig', 'PctImmigRecent', 'PctImmigRec5',
'PctImmigRec8', 'PctImmigRec10', 'PctRecentImmig', 'PctRecImmig5', 'PctRecImmig8',
'PctRecImmig10', 'PctSpeakEnglOnly', 'PctNotSpeakEnglWell', 'PctLargHouseFam',
'PctLargHouseOccup', 'PersPerOccupHous', 'PersPerOwnOccHous', 'PersPerRentOccHous',
'PctPersOwnOccup', 'PctPersDenseHous', 'PctHousLess3BR', 'MedNumBR', 'HousVacant',
'PctHousOccup', 'PctHousOwnOcc', 'PctVacantBoarded', 'PctVacMore6Mos', 'MedYrHousBuilt',
'PctHousNoPhone', 'PctWOFullPlumb', 'OwnOccLowQuart', 'OwnOccMedVal', 'OwnOccHiQuart',
'RentLowQ', 'RentMedian', 'RentHighQ', 'MedRent', 'MedRentPctHousInc', 'MedOwnCostPctInc',
'MedOwnCostPctIncNoMtg', 'NumInShelters', 'NumStreet', 'PctForeignBorn', 'PctBornSameState',
'PctSameHouse85', 'PctSameCity85', 'PctSameState85', 'LemasSwornFT', 'LemasSwFTPerPop',
'LemasSwFTFieldOps', 'LemasSwFTFieldPerPop', 'LemasTotalReq', 'LemasTotReqPerPop',
'PolicReqPerOffic', 'PolicPerPop', 'RacialMatchCommPol', 'PctPolicWhite', 'PctPolicBlack',
'PctPolicHisp', 'PctPolicAsian', 'PctPolicMinor', 'OfficAssgnDrugUnits', 'NumKindsDrugsSeiz',
'PolicAveOTWorked', 'LandArea', 'PopDens', 'PctUsePubTrans', 'PolicCars', 'PolicOperBudg',
'LemasPctPolicOnPatr', 'LemasGangUnitDeploy', 'LemasPctOfficDrugUn', 'PolicBudgPerPop',
'ViolentCrimesPerPop']
df = pd.read_csv("communities.data",
header=None,
na_values="?",
skip_blank_lines=True,
names=headers)
df
A first glance at the table shows that the database contains mostly numerical values. So let us check how many columns contain data in text format.
df.select_dtypes('object').columns
df['communityname'].value_counts()
df['communityname'].nunique()
# 1828
df.shape
# (1994, 127)
The database contains only one column in text format. The number of unique values in this column is 1828; several city names appear in the database 4–5 times. Undoubtedly, information about the locality is not of great importance in the clustering process, so I decided to delete this feature.
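The code does not show this step explicitly, but the text column has to be removed before standardization, since StandardScaler accepts only numeric data. A single line along these lines (an assumed step) takes care of it:
# assumed step: drop the only text feature so that later scaling works on numeric data only
df = df.drop('communityname', axis=1)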
DATA COMPLETENESS ANALYSIS
In the process of creating clusters, it is very important to check the completeness of the data (i.e. whether there are gaps). If the information describing an individual observation is not complete, that observation may be assigned to a cluster incorrectly. There is a danger that a large part of the observations will be assigned to clusters only because of missing information, and a missing value does not mean the information would have been the same for all observations lacking it. The heatmap shown further below marks missing information in the database as yellow boxes; we can see that the incomplete features are grouped in blocks (i.e. by source).
def show_missing():
    # list the columns that contain at least one missing value
    missing = df.columns[df.isnull().any()].tolist()
    return missing

df[show_missing()].isnull().sum()
With missing information, one option is to fill the empty cells with some arbitrary value. As I mentioned earlier, there is a danger that the clustering algorithm will be misled by such artificial information, which can result in faulty clustering. Therefore, I decided to delete the features with missing records. As you can see, for the features highlighted in yellow the lion's share of the values is missing.
We keep the 'OtherPerCap' feature, which is missing only a single record, by filling the gap:
# fill the single missing value so this feature is not dropped later
df['OtherPerCap'] = df['OtherPerCap'].replace(np.nan, 0)
plt.figure(figsize=(30,3))
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.xlabel('Features', fontsize = 45)
Now I remove the incomplete features:
# list the features that still contain missing values...
fa = df[show_missing()].isnull().sum().reset_index()
g = fa['index'].tolist()
# ...and drop them all at once
df2 = df.drop(g, axis=1)
len(g)
After removing the 24 features with missing values and the one text feature, 102 features containing complete data remain. These features will form the basis for the clustering process.
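As a quick check (a trivial sketch), the shape of the reduced frame should confirm this:
df2.shape
# expected: (1994, 102)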
STANDARDIZATION
The individual characteristics of the population differ significantly in scale. To apply the PCA method, standardization must be carried out first. Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()
segmentation_std = scaler.fit_transform(df2)
segmentation_std[1:2]
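As a quick sanity check (a minimal sketch), each column of the standardized matrix should now have a mean of roughly 0 and a standard deviation of roughly 1:
# per-feature mean and standard deviation after scaling (expected ~0 and ~1)
print(segmentation_std.mean(axis=0).round(2))
print(segmentation_std.std(axis=0).round(2))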
DIMENSIONALITY REDUCTION USING THE PCA METHOD
At the outset we can ask why we chose this particular method of dimensionality reduction. The answer is quite simple: we do not have a phenomenon with dependent and independent variables. Without a relationship between dependent and independent variables, it is not possible to use filtering or modeling methods to reduce the features. Besides, the point is not to eliminate features but to consolidate them into a few most important sets, the so-called components. We will then be able to point out the most effective components.
So we run the PCA algorithm in its simplest form available in Python.
pca = PCA()
pca.fit(segmentation_std)
pca.explained_variance_ratio_.shape
# (102,)
plt.figure(figsize = (12,4))
plt.plot(range(1, 103),
pca.explained_variance_ratio_.cumsum(),
marker = 'o',
linestyle = '--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained Variance')
The resulting cumulative explained variance curve shows how much information the aggregated variables carry. The x-axis gives the number of components, while the y-axis shows what share of the information about the population (the total variance) a given number of components explains. As you can easily see, about 20 components account for 90% of the information.
It can also be seen that approximately 6 components, built from the aggregated primary features, account for about 70% of the information. In my opinion, it is optimal to take 6 components for further analytical work.
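To confirm this choice programmatically (a small sketch using the full PCA fit from above), we can find the smallest number of components whose cumulative explained variance crosses a chosen threshold, e.g. 70%:
# smallest number of components reaching 70% cumulative explained variance
cum_var = pca.explained_variance_ratio_.cumsum()
n_opt = np.argmax(cum_var >= 0.70) + 1
print(n_opt, cum_var[n_opt - 1].round(3))
With the choice confirmed, we refit PCA keeping only six components: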
pca = PCA(n_components = 6)
pca.fit(segmentation_std)
# PCA(n_components=6)
scores_pca = pca.transform(segmentation_std)
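To see which original features weigh most heavily in each of the six components (a small sketch; this step is not part of the original walk-through), we can inspect the loadings in pca.components_, whose rows correspond to components and whose columns follow the column order of df2:
# loadings: contribution of each original feature to each of the six components
loadings = pd.DataFrame(pca.components_.T,
                        index=df2.columns,
                        columns=['Com_1', 'Com_2', 'Com_3', 'Com_4', 'Com_5', 'Com_6'])
# five strongest contributors (by absolute loading) to the first component
print(loadings['Com_1'].abs().sort_values(ascending=False).head())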
Thus, a dataset consisting of six components is created. It will be the basis for clustering using the k-means method.
K-MEANS CLUSTERING
First of all, we should answer the question of how many clusters there should be. To resolve this, we use the elbow method: the place where the curve bends (the "elbow") indicates the number of clusters.
Sum_of_squared_distances = []
K = range(1, 20)
for k in K:
    # fit K-means on the PCA scores for each candidate number of clusters
    km = KMeans(n_clusters=k)
    km = km.fit(scores_pca)
    Sum_of_squared_distances.append(km.inertia_)
plt.plot(K, Sum_of_squared_distances, "bx-")
plt.xlabel("k")
plt.ylabel("Sum_of_squared_distances")
plt.title("Elbow Method For Optimal k")
plt.show()
The graph shows that the optimal number of clusters is 7.
kmeans_pca = KMeans(n_clusters=7, init = 'k-means++', random_state = 148)
kmeans_pca.fit(scores_pca)
Let's analyze the results of our algorithm.
df_finish = pd.concat([df2.reset_index(drop = True),
pd.DataFrame(scores_pca)],
axis=1)
df_finish.columns.values[-6:] = ['Com_1',
'Com_2',
'Com_3',
'Com_4',
'Com_5',
'Com_6']
df_finish['Segments K means PCA'] = kmeans_pca.labels_
df_finish.head(3)
One small step remains: we should add the names of the segments to the labels.
df_finish['Segment'] = df_finish['Segments K means PCA'].map({0:"1",
1:"2",
2:"3",
3:"4",
4:"5",
5:"6",
6:"7"})
df_finish.head(5)
df_finish['Segment'].value_counts().plot(kind='bar')
The bar chart shows that the clustered objects form two large clusters, 1 and 2, while the fewest objects fall into cluster no. 6. Large differences in the number of objects per cluster are pointed out by many researchers as a sign of weak clustering. In my experience, however, this kind of disproportion between cluster sizes is very typical of the clustering process.
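Beyond the cluster sizes, a simple way to characterize the segments (a sketch; the original analysis stops at visual inspection) is to compare the mean values of a few raw features per segment, for example population, median income and the violent-crime rate:
# mean of selected original features within each segment
df_finish.groupby('Segment')[['population', 'medIncome', 'ViolentCrimesPerPop']].mean()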
THE FINAL RESULT
The conducted analysis arranged the population into very coherent clusters. As I mentioned earlier when discussing the PCA methodology, the first two components are the most important; in the graph shown earlier they carry about 50% of the information about the population. The clusters are ordered along these components. As can be seen in the charts below, the studied population was grouped very effectively into clusters according to the PCA components.
plt.figure(figsize = (12,4))
sns.scatterplot(x=df_finish['Com_2'],
y=df_finish['Com_1'],
hue=df_finish['Segment'],
palette=('husl'))
plt.xlabel('Component_2')
plt.ylabel('Component_1')
plt.title('Cluster_by_PCA_components')
plt.figure(figsize = (12,4))
sns.scatterplot(x=df_finish['Com_3'],
y=df_finish['Com_1'],
hue=df_finish['Segment'],
palette=('husl'))
plt.xlabel('Component_3')
plt.ylabel('Component_1')
plt.title('Cluster_by_PCA_components')
plt.figure(figsize = (12,4))
sns.scatterplot(x=df_finish['Com_4'],
y=df_finish['Com_1'],
hue=df_finish['Segment'],
palette=('husl'))
plt.xlabel('Component_4')
plt.ylabel('Component_1')
plt.title('Cluster_by_PCA_components')