Paul Kiragu
The Pixel District

The Pixel District

Personality Prediction Using K-Means Clustering Algorithm and Django Rest Framework (Article  Partly Written By GPT3)

Personality Prediction Using K-Means Clustering Algorithm and Django Rest Framework (Article Partly Written By GPT3)

Paul Kiragu's photo
Paul Kiragu
·Jan 26, 2022·

13 min read


Artificial Intelligence is revolutionizing the world today as we know it through technical transformations. Machine learning applications are applied in our day-to-day lives, and one of the incredible applications of machine learning is to classify individuals based on their personality traits. Each person on this planet is unique and carries a unique personality. The availability of a high-dimensional and large amount of data has paved the way for increasing marketing campaigns' effectiveness by targeting specific people. Such personality-based communications are highly effective in increasing the popularity and attractiveness of products and services.

In this article, we build a model to predict personality based on 50 questions and deploy the model using Django Rest Framework.

I will also be sharing a snapshot of my code on how I was able to use GPT3 to help me generate content in this article. Every text below, except 'the steps to implement' section, was generated by GPT3, a natural-language processing system that can generate tweets, pens poetry, summarizes emails, answers trivia questions, translates languages and even write its own computer programs.

K-means Clustering Algorithm.

The K-Means clustering algorithm is an algorithm that is used in unsupervised machine learning. To put it simply, it is a clustering algorithm where it groups the data into clusters and then assigns a label to each of the clusters. The K-Means algorithm is an iterative algorithm, it means that it repeats the following steps in a loop until the algorithm is satisfied.

  • Assign the input data to K clusters (there are K clusters, the algorithm sorts the data into K clusters)
  • Assign each input data to the closest cluster (the algorithm calculates the distance between the data and each of the clusters and then assigns each data to the closest cluster)
  • Update the cluster centroids (this is the average of all the data in the cluster)

Applications of K-means Algorithm.

The k-means clustering algorithm is used in a variety of applications. It is used in marketing, advertisement, market segmentation, and customer segmentation. It is also used in various scientific fields, like in determining someones personality.nIn determining someones personality.

Five Personality Traits (OCEAN)

The big five personality traits are:

  • Openness to experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism.

These are the five traits that human behavior is divided into. People can score between 1 and 5 in each of the five traits. For example, let's say that someone scores 3 in openness to experience. This means that he/she is not very open to new experiences, he/she is a person that is very comfortable with their own ways and he/she wants to stick to it. He/she is not very open to new things and he/she tries to avoid it. He/she is a person that doesn't like to try new things. For example, in a party, he/she is not very social and he/she doesn't like to talk a lot. He/she is a person that speaks little and he/she is a person that is not very sociable. In a party, he/she likes to stay at the corner and he/she won't go talk to other people. He/she will just stay and look at other people in the party. This is his/her personality.

Now that we have a basic understanding of the k-means algorithm and the big five personality traits, let us dive deep into building the Big Five personality traits.

Steps to implement.

Let us start by building and saving our model that will be later used to make predictions for our API.


Our dataset consists of 1,015,342 questionnaire answers collected online by Open Psychometrics. Let's look at how the dataset actually appears. The dataset and entire code for this blog can be found on my Github Repo

We first import the necessary dependencies. If you do not have the libraries installed, kindly do so before proceeding.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import joblib

Next, we load the dataset using pandas and then we display the number of participants:

data_raw = pd.read_csv('data-final.csv', sep='\t')
data = data_raw.copy()
pd.options.display.max_columns = 150

data.drop(data.columns[50:107], axis=1, inplace=True)
data.drop(data.columns[51:], axis=1, inplace=True)

print('Number of participants: ', len(data))
Number of participants:  1015341

Exploration of Dataset.

Let us begin by checking for missing values and removing the missing values, like so:

print('Missing value? ', data.isnull().values.any())
print('How many? ', data.isnull().values.sum())
print('Number of participants after eliminating missing values: ', len(data))
Is there any missing value?  True
How many missing values?  89227
Number of participants after eliminating missing values:  1013481

Let us now look at the participant distribution per nationality, by entering the code:

# Participants' nationality distriution
countries = pd.DataFrame(data['country'].value_counts())
countries_5000 = countries[countries['country'] >= 5000]
sns.barplot(data=countries_5000, x=countries_5000.index, y='country')
plt.title('Countries With More Than 5000 Participants')


Let us now visualize the question and answer distribution of the questionnaires, like so:

# Defining a function to visualize the questions and answers distribution
def vis_questions(groupname, questions, color):
    for i in range(1, 11):
        plt.hist(data[groupname[i-1]], bins=14, color= color, alpha=.5)
        plt.title(questions[groupname[i-1]], fontsize=18)
print('Q&As Related to Extroversion Personality')
vis_questions(EXT, ext_questions, 'orange')


print('Q&As Related to Neuroticism Personality')
vis_questions(EST, est_questions, 'pink')


print('Q&As Related to Agreeable Personality')
vis_questions(AGR, agr_questions, 'red')


print('Q&As Related to Conscientious Personality')
vis_questions(CSN, csn_questions, 'purple')


print('Q&As Related to Open Personality')
vis_questions(OPN, opn_questions, 'blue')


Building the Model

We already know the number of clusters that will be present in our model, that is 5, the big five of personality traits. Let us see how we can get this value using code. For ease of calculation, we shall scale all the values between 0-1 and use only a sample of 5000, like so:

df = data.drop('country', axis=1)
columns = list(df.columns)

scaler = MinMaxScaler(feature_range=(0,1))
df = scaler.fit_transform(df)
df = pd.DataFrame(df, columns=columns)
df_sample = df[:5000]

Let us now visualize our elbow curve. In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.

kmeans = KMeans()
visualizer = KElbowVisualizer(kmeans, k=(2,15))


As you can see 5 clusters looks optimum for the data set and we already know this researh is to identify 5 different personalities.

Clustering Participants into 5 Personality Groups

To do this, we use the unscaled data but without the country column.

df_model = data.drop('country', axis=1)

# I define 5 clusters and fit my model
kmeans = KMeans(n_clusters=5)
k_fit =
joblib_file = "BigFivePersonalityTestModel.joblib"
joblib.dump(k_fit, joblib_file)

# Predicting the Clusters
pd.options.display.max_columns = 10
predictions = k_fit.labels_
df_model['Clusters'] = predictions

EXT1    EXT2    EXT3    EXT4    EXT5    ...    OPN7    OPN8    OPN9    OPN10    Clusters
0    4.0    1.0    5.0    2.0    5.0    ...    5.0    3.0    4.0    5.0    3
1    3.0    5.0    3.0    4.0    3.0    ...    4.0    2.0    5.0    3.0    2
2    2.0    3.0    4.0    4.0    3.0    ...    5.0    3.0    4.0    4.0    2
3    2.0    2.0    2.0    3.0    4.0    ...    4.0    4.0    3.0    3.0    1
4    3.0    3.0    3.0    3.0    5.0    ...    5.0    3.0    5.0    5.0    3

Analysing the Model and Predictions

Let us see how many individuals we have for each cluster, like so:

4    227063
2    212816
3    210075
0    200226
1    163301
Name: Clusters, dtype: int64

Let's group the results acording to clusters. That way we can investigate the average answer to the each question for each cluster.

That way we can have an intuition about how our model classifies people.

pd.options.display.max_columns = 150

Let's now sum up the each question groups (EXT, EST ..) and see if we can see a pattern.

# Summing up the different questions groups
col_list = list(df_model)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

data_sums = pd.DataFrame()
data_sums['extroversion'] = df_model[ext].sum(axis=1)/10
data_sums['neurotic'] = df_model[est].sum(axis=1)/10
data_sums['agreeable'] = df_model[agr].sum(axis=1)/10
data_sums['conscientious'] = df_model[csn].sum(axis=1)/10
data_sums['open'] = df_model[opn].sum(axis=1)/10
data_sums['clusters'] = predictions

extroversion    neurotic    agreeable    conscientious    open
0    2.965969    3.645931    3.148628    3.173210    3.245529
1    2.909467    2.525743    2.851802    2.914458    3.120373
2    3.051889    2.984940    3.187544    3.159140    3.243641
3    3.085431    2.423577    3.209064    3.106899    3.327173
4    3.072319    3.426610    3.300147    3.211454    3.352370

Let us now visualize the mean for each of our 5 personality clusters:

# Visualizing the means for each cluster
dataclusters = data_sums.groupby('clusters').mean()
for i in range(0, 5):
    plt.subplot(1,5,i+1), dataclusters.iloc[:, i], color='green', alpha=0.2)
    plt.plot(dataclusters.columns, dataclusters.iloc[:, i], color='red')
    plt.title('Cluster ' + str(i))


Visualizing the Cluster Predictions

pca = PCA(n_components=2)
pca_fit = pca.fit_transform(df_model)

df_pca = pd.DataFrame(data=pca_fit, columns=['PCA1', 'PCA2'])
df_pca['Clusters'] = predictions
sns.scatterplot(data=df_pca, x='PCA1', y='PCA2', hue='Clusters', palette='Set2', alpha=0.8)
plt.title('Personality Clusters after PCA');


Predict Personality

I answered the questions in an Microsoft Excel spread sheet. Then I added that data into this notebook and put my answers to the model to see in which category I will be.

my_data = pd.read_excel('my_personality_test.xlsx')
my_personality = k_fit.predict(my_data)
print('My Personality Cluster: ', my_personality)
My Personality Cluster:  [2]

Visualizing my personality cluster

col_list = list(my_data)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

my_sums = pd.DataFrame()
my_sums['extroversion'] = my_data[ext].sum(axis=1)/10
my_sums['neurotic'] = my_data[est].sum(axis=1)/10
my_sums['agreeable'] = my_data[agr].sum(axis=1)/10
my_sums['conscientious'] = my_data[csn].sum(axis=1)/10
my_sums['open'] = my_data[opn].sum(axis=1)/10
my_sums['cluster'] = my_personality
print('Sum of my question groups')
my_sum = my_sums.drop('cluster', axis=1), my_sum.iloc[0,:], color='green', alpha=0.2)
plt.plot(my_sum.columns, my_sum.iloc[0,:], color='red')
plt.title('Cluster 2')


Now that our model works, let us proceed by turning our model into a Restful API

Turning the Model into an RESTFUL API

Following Python best practices, we will create a virtual environment for our project, and install the required packages.

First, create the project directory.

$ mkdir djangoapp
$ cd djangoapp

Now, create a virtual environment and install the required packages.

For macOS and Unix systems:

$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ pip install django requests numpy joblib scikit-learn xlsxwriter openpyxl

For Windows:

$ python3 -m venv myenv
$ myenv\Scripts\activate
(myenv) $ pip install django requests numpy joblib scikit-learn xlsxwriter openpyxl

Setting Up Your Django Application

First, navigate to the directory djangoapp we created and establish a Django project.

(myenv) $ django-admin startproject mainapp

This will auto-generate some files for your project skeleton:


Now, navigate to the directory you just created (make sure you are in the same directory as and create your app directory.

(myenv) $ python startapp monitor

This will create the following:


On the mainapp/ file, look for the following line and add the app we just created above.

    'monitor', #new line

Ensure you are in the monitor directory then create a new directory called templates then a new file called Your directory structure of monitor application should look like this


Ensure your mainapp/ file, add our monitor app URL to include the URLs we shall create next on the monitor app:

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('', include('monitor.urls')),#monitor app url

Now, on the monitor/ file, add our website there:

from django.urls import path
from .views import *

urlpatterns = [
    path('persona', PersonalityPrediction.as_view(), name='personality'),

Let’s create another directory to store our machine learning model. I’ll also add the dataset to the project for those who want to achieve the whole dataset. (It is not compulsory to create a data folder.) Be sure to move the vectorizer file and the joblib file we created earlier to ml/model folder

(venv)$ mkdir ml
(venv)$ mkdir ml/models
(venv)$ mkdir ml/data

We also need to tell Django where our machine learning model is located. Add these lines to file:

import os
MODELS = os.path.join(BASE_DIR, 'ml/models')

Load Model and Vectorizer through

Load the model we created and saved in so that when the application starts, the trained model is loaded only once. Otherwise, the trained model is loaded each time an endpoint is called, and then the response time will be slower.

Let’s update

import os
import joblib
from django.apps import AppConfig
from django.conf import settings

class ApiConfig(AppConfig):
    name = 'api'
    MODEL_FILE = os.path.join(settings.MODELS, "BigFivePersonalityTestModel.joblib")
    model = joblib.load(MODEL_FILE)


The views will be mainly responsible for:

  • Process incoming POST requests.
from django.shortcuts import render
from rest_framework.views import APIView
from rest_framework.response import Response
import requests
from .apps import *
import pandas as pd
import xlsxwriter

#kmeans personality prediction view
class PersonalityPrediction(APIView):
    def post(self, request):
        #get user input
        EXT1 ='EXT1')
        EXT2 ='EXT2')
        EXT3 ='EXT3')
        EXT4 ='EXT4')
        EXT5 ='EXT5')
        EXT6 ='EXT6')
        EXT7 ='EXT7')
        EXT8 ='EXT8')
        EXT9 ='EXT9')
        EXT10 ='EXT10')

        EST1 ='EST1')
        EST2 ='EST2')
        EST3 ='EST3')
        EST4 ='EST4')
        EST5 ='EST5')
        EST6 ='EST6')
        EST7 ='EST7')
        EST8 ='EST8')
        EST9 ='EST9')
        EST10 ='EST10')

        AGR1 ='AGR1')
        AGR2 ='AGR2')
        AGR3 ='AGR3')
        AGR4 ='AGR4')
        AGR5 ='AGR5')
        AGR6 ='AGR6')
        AGR7 ='AGR7')
        AGR8 ='AGR8')
        AGR9 ='AGR9')
        AGR10 ='AGR10')

        CSN1 ='CSN1')
        CSN2 ='CSN2')
        CSN3 ='CSN3')
        CSN4 ='CSN4')
        CSN5 ='CSN5')
        CSN6 ='CSN6')
        CSN7 ='CSN7')
        CSN8 ='CSN8')
        CSN9 ='CSN9')
        CSN10 ='CSN10')

        OPN1 ='OPN1')
        OPN2 ='OPN2')
        OPN3 ='OPN3')
        OPN4 ='OPN4')
        OPN5 ='OPN5')
        OPN6 ='OPN6')
        OPN7 ='OPN7')
        OPN8 ='OPN8')
        OPN9 ='OPN9')
        OPN10 ='OPN10')
        #load model
        kmeansModel = PersonalityTestConfig.model
        # Predicting the Clusters
        pd.options.display.max_columns = 10
        predictions = kmeansModel.labels_

        workbook = xlsxwriter.Workbook('hello.xlsx')

        worksheet = workbook.add_worksheet()

        worksheet.write('A1', 'EXT1')
        worksheet.write('A2', (EXT1))

        worksheet.write('B1', 'EXT2')
        worksheet.write('B2', (EXT2))

        worksheet.write('C1', 'EXT3')
        worksheet.write('C2', (EXT3))

        worksheet.write('D1', 'EXT4')
        worksheet.write('D2', (EXT4))

        worksheet.write('E1', 'EXT5')
        worksheet.write('E2', (EXT5))

        worksheet.write('F1', 'EXT6')
        worksheet.write('F2', (EXT6))

        worksheet.write('G1', 'EXT7')
        worksheet.write('G2', (EXT7))

        worksheet.write('H1', 'EXT8')
        worksheet.write('H2', (EXT8))

        worksheet.write('I1', 'EXT9')
        worksheet.write('I2', (EXT9))

        worksheet.write('J1', 'EXT10')
        worksheet.write('J2', (EXT10))

        worksheet.write('K1', 'EST1')
        worksheet.write('K2', (EST1))

        worksheet.write('L1', 'EST2')
        worksheet.write('L2', (EST2))

        worksheet.write('M1', 'EST3')
        worksheet.write('M2', (EST3))

        worksheet.write('N1', 'EST4')
        worksheet.write('N2', (EST4))

        worksheet.write('O1', 'EST5')
        worksheet.write('O2', (EST5))

        worksheet.write('P1', 'EST6')
        worksheet.write('P2', (EST6))

        worksheet.write('Q1', 'EST7')
        worksheet.write('Q2', (EST7))

        worksheet.write('R1', 'EST8')
        worksheet.write('R2', (EST8))

        worksheet.write('S1', 'EST9')
        worksheet.write('S2', (EST9))

        worksheet.write('T1', 'EST10')
        worksheet.write('T2', (EST10))

        worksheet.write('U1', 'AGR1')
        worksheet.write('U2', (AGR1))

        worksheet.write('V1', 'AGR2')
        worksheet.write('V2', (AGR2))

        worksheet.write('W1', 'AGR3')
        worksheet.write('W2', (AGR3))

        worksheet.write('X1', 'AGR4')
        worksheet.write('X2', (AGR4))

        worksheet.write('Y1', 'AGR5')
        worksheet.write('Y2', (AGR5))

        worksheet.write('Z1', 'AGR6')
        worksheet.write('Z2', (AGR6))

        worksheet.write('AA1', 'AGR7')
        worksheet.write('AA2', (AGR7))

        worksheet.write('AB1', 'AGR8')
        worksheet.write('AB2', (AGR8))

        worksheet.write('AC1', 'AGR9')
        worksheet.write('AC2', (AGR9))

        worksheet.write('AD1', 'AGR10')
        worksheet.write('AD2', (AGR10))

        worksheet.write('AE1', 'CSN1')
        worksheet.write('AE2', (CSN1))

        worksheet.write('AF1', 'CSN2')
        worksheet.write('AF2', (CSN2))

        worksheet.write('AG1', 'CSN3')
        worksheet.write('AG2', (CSN3))

        worksheet.write('AH1', 'CSN4')
        worksheet.write('AH2', (CSN4))

        worksheet.write('AI1', 'CSN5')
        worksheet.write('AI2', (CSN5))

        worksheet.write('AJ1', 'CSN6')
        worksheet.write('AJ2', (CSN6))

        worksheet.write('AK1', 'CSN7')
        worksheet.write('AK2', (CSN7))

        worksheet.write('AL1', 'CSN8')
        worksheet.write('AL2', (CSN8))

        worksheet.write('AM1', 'CSN9')
        worksheet.write('AM2', (CSN9))

        worksheet.write('AN1', 'CSN10')
        worksheet.write('AN2', (CSN10))

        worksheet.write('AO1', 'OPN1')
        worksheet.write('AO2', (OPN1))

        worksheet.write('AP1', 'OPN2')
        worksheet.write('AP2', (OPN2))

        worksheet.write('AQ1', 'OPN3')
        worksheet.write('AQ2', (OPN3))

        worksheet.write('AR1', 'OPN4')
        worksheet.write('AR2', (OPN4))

        worksheet.write('AS1', 'OPN5')
        worksheet.write('AS2', (OPN5))

        worksheet.write('AT1', 'OPN6')
        worksheet.write('AT2', (OPN6))

        worksheet.write('AU1', 'OPN7')
        worksheet.write('AU2', (OPN7))

        worksheet.write('AV1', 'OPN8')
        worksheet.write('AV2', (OPN8))

        worksheet.write('AW1', 'OPN9')
        worksheet.write('AW2', (OPN9))

        worksheet.write('AX1', 'OPN10')
        worksheet.write('AX2', (OPN10))

        # Finally, close the Excel file
        # via the close() method.

        my_data = pd.read_excel('hello.xlsx', engine='openpyxl')

        my_personality = kmeansModel.predict(my_data)
        print('My Personality Cluster: ', my_personality)

        # Summing up the my question groups
        col_list = list(my_data)
        ext = col_list[0:10]
        est = col_list[10:20]
        agr = col_list[20:30]
        csn = col_list[30:40]
        opn = col_list[40:50]

        my_sums = pd.DataFrame()
        my_sums['extroversion'] = my_data[ext].sum(axis=1)/10
        my_sums['neurotic'] = my_data[est].sum(axis=1)/10
        my_sums['agreeable'] = my_data[agr].sum(axis=1)/10
        my_sums['conscientious'] = my_data[csn].sum(axis=1)/10
        my_sums['open'] = my_data[opn].sum(axis=1)/10
        my_sums['cluster'] = my_personality
        print('Sum of my question groups')

        response_dict = {"Prediction": my_sums}
        return Response(response_dict, status=200)

The code above starts by getting the data appended to the request body. The code then creates a new excel file programmatically which is used in the prediction and returns the cluster as a response.

Testing our model.

To test the system, make the necessary migrations and run the django server. Open POSTMAN and make a POST request to our server like so, with the answers to the questionnaire appended to the body of the request. We should get a sample response showing us our personality cluster and our scores. To test my live system, make a POST request to the URL below:

The code below is what i used to prompt GPT3 to help me generate the content in this blog:

import openai
openai.api_key = "VISIT OPENAI TO GET YOUR KEY"
response = openai.Completion.create(
  prompt="The Pixel District Janury 16, 2022n Title: Personality Segmentation Using K-Means Clustering Algorithm and Django Rest Framework!n tags: machine-learning, kmeans, gpt3, kmeans code sample Summary:  I am sharing my exprience in implementing kmeans clustering algorithmn in determining someones personality. I am explaining why kmeans clustering algorithms is, how it works configuration. I am explaining what the big five personality traits are. I am explaining why I think kmeans algorithm is the best to use in finding someones personality based on the  big five personality traits. I am also adding various example codes of the kmeans clustering to find the big five personality traits.n Full text: ",

That's it for today. Thanks for staying tuned in!

Share this