Case study of an easy recommendation system in an e-commerce store

10 min read · Mar 22, 2024

E-commerce sales have been growing for a long time. What distinguishes e-commerce stores from typical retail stores is the approach to customer identification: customers are usually identified in e-commerce stores, but usually not in retail stores.
I say "usually" because customers in e-commerce stores can, at least in theory, buy anonymously without logging in, while retail stores can identify customers through so-called loyalty cards.

There is no doubt that the e-commerce industry offers great opportunities to analyze customer behavior, purchase motives, preferences and concerns. Even if a customer does not log in, we can often still identify them.

Recommendation Systems

A recommendation system is a tool that aims to increase the range of purchases made by individual customers. It can result in a significant increase in sales and can also help to increase customer loyalty to the store.
The recommendation system suggests products to customers. If the customer receives an interesting offer, he or she can take advantage of it. Each good suggestion increases the likelihood of selling another product. The opposite is also true: every bad suggestion reduces the probability of a sale compared to a situation in which no suggestion appeared at all. Therefore, it is very important to create an effective recommendation system.

Cosine Principle

Cosine similarity is used in recommendation systems to measure how alike two items are. It appears in content-based filtering, where the functional properties of the products are compared, and in item-based collaborative filtering, where items are compared through user ratings. In both cases each item is represented as a vector of numbers derived from ratings or from descriptions entered on the e-commerce platform’s website. For example, if we have two items, each represented by specific parameters, then the cosine of the angle between their vectors in a multidimensional space is a measure of their similarity. The closer the cosine is to 1, the more similar the items are. Ratings or descriptions can be numerical or textual; text has to be converted into numbers first.
To shed some light on this, I will use an example. We have three winter jackets:
• jacket A has a length of 110 cm and a circumference of 85 cm
• jacket B has a length of 45 cm and a circumference of 90 cm
• jacket C has a length of 65 cm and a circumference of 75 cm.

We place the length and circumference of the jacket in a coordinate system.

I have marked where each jacket is located on the chart. The x-axis represents the length of the jacket and the y-axis represents its circumference. I drew a line from each point to the origin of the coordinate system. The smaller the angle between two lines, the more similar the products are. The graph shows that jacket A is very different from jacket B, while the difference between jacket C and jacket A is smaller because the angle between their lines is narrower.
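The same comparison can be done numerically. The sketch below is only an illustration: it assumes NumPy is available, takes the three jackets from the example, and uses a hypothetical helper named cosine_similarity_2d to compute the cosine of the angle between each pair of vectors.

```python
import numpy as np

# Jacket dimensions from the example: (length in cm, circumference in cm)
jackets = {
    "A": np.array([110.0, 85.0]),
    "B": np.array([45.0, 90.0]),
    "C": np.array([65.0, 75.0]),
}


def cosine_similarity_2d(u, v):
    """Cosine of the angle between vectors u and v (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_ab = cosine_similarity_2d(jackets["A"], jackets["B"])
sim_ac = cosine_similarity_2d(jackets["A"], jackets["C"])
sim_bc = cosine_similarity_2d(jackets["B"], jackets["C"])

for pair, sim in [("A-B", sim_ab), ("A-C", sim_ac), ("B-C", sim_bc)]:
    angle = np.degrees(np.arccos(sim))
    print(f"{pair}: cosine = {sim:.3f}, angle = {angle:.1f} degrees")
```

The A-C pair gets the highest cosine (about 0.98) and the A-B pair the lowest (about 0.90), which matches the reading of the chart. Note that the cosine only looks at the direction of a vector, not its length, so a jacket with exactly half the length and half the circumference of jacket A would count as identical to it.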

Similarities through product descriptions?

In a recommendation system based on product similarity, similarity can be found by analyzing photos or by analyzing descriptions. Product similarity obtained through photo analysis is possible thanks to image-recognizing neural networks. The second method, which can be used in parallel, is the identification of similarity based on product descriptions.
Descriptions can be written by customers or by the store. Customers are a relatively unpredictable source. Typically, they decide to post comments on products when they had a strongly positive or strongly negative experience, so their descriptions may range from extreme rage to extreme euphoria. If we were looking for similarities between products based on the degree of emotional involvement of customers, such descriptions would probably be useful. However, such an approach would probably make little sense: customers who viewed a very exciting sports shoe would receive another very exciting sports shoe as a recommendation.
It seems that the most reasonable basis for an effective recommendation system is to look for product similarities in the descriptions provided by the store staff.

Example of a recommendation system for an e-commerce jacket store

Imagine an e-commerce website selling large quantities of clothes. One of the categories sold in this store is jackets and winter coats. The store offers 5,000 different types of jackets. Typically, when a customer looks at a jacket, the website suggests 5 other similar jackets. Sales effectiveness depends on how accurate these suggestions are. The customer can choose from many similar online stores selling winter clothes. If the jackets suggested by the website are not accurate, the customer will simply leave the store and buy the jacket elsewhere.
To look for product similarity, we will use the cosine similarity method described above. We will use the e-commerce store’s database of winter jackets. Below is a list of database fields.

'types' ("hunting", "sporty", "elegant", "recreational")

'gender' ("female", "male")

'size' ("tiny", "small", "medium", "large", "great")

'color' ("black", "white", "yellow", "green", "brown", "blue", "lemon")

'material' ("no-waterproof", "waterproof")

'kind' ("thermoactive", "no-thermoactive")

'pattern' ("flowers", "animals", "planets", "cars", "none")

'use' ("low", "moderate", "high", "intense", "extreme")

'sport' ("skiing", "skating", "hiking", "bicycle", "running")

This database was generated in Python with random values. Anyone can create such a training database. Below is the code I used to create it.

import numpy as np
import pandas as pd

# One random generator for all columns; pass a seed, e.g. default_rng(148), for reproducible data
rng = np.random.default_rng()
n_jackets = 5000

# Draw an integer code for every feature (the upper bound is exclusive)
types = rng.integers(0, 4, size=n_jackets)
gender = rng.integers(0, 2, size=n_jackets)
size = rng.integers(0, 5, size=n_jackets)
color = rng.integers(0, 7, size=n_jackets)  # 7 colors, codes 0..6
material = rng.integers(0, 2, size=n_jackets)
kind = rng.integers(0, 2, size=n_jackets)
pattern = rng.integers(0, 5, size=n_jackets)
use = rng.integers(0, 5, size=n_jackets)
sport = rng.integers(0, 5, size=n_jackets)

# Shuffled identification numbers 0..4999
index2 = rng.permutation(n_jackets)

keys = ["index2", "types", "gender", "size", "color", "material", "kind", "pattern", "use", "sport"]
vals = [index2, types, gender, size, color, material, kind, pattern, use, sport]
df = pd.DataFrame(dict(zip(keys, vals)))

# Replace the integer codes with readable labels
df["gender"] = df["gender"].map({0: "female", 1: "male"})
df["types"] = df["types"].map({0: "hunting", 1: "sporty", 2: "elegant", 3: "recreational"})
df["size"] = df["size"].map({0: "tiny", 1: "small", 2: "medium", 3: "large", 4: "great"})
df["color"] = df["color"].map({0: "black", 1: "white", 2: "yellow", 3: "green", 4: "brown", 5: "blue", 6: "lemon"})
df["material"] = df["material"].map({0: "no-waterproof", 1: "waterproof"})
df["kind"] = df["kind"].map({0: "thermoactive", 1: "no-thermoactive"})
df["pattern"] = df["pattern"].map({0: "flowers", 1: "animals", 2: "planets", 3: "cars", 4: "none"})
df["use"] = df["use"].map({0: "low", 1: "moderate", 2: "high", 3: "intense", 4: "extreme"})
df["sport"] = df["sport"].map({0: "skiing", 1: "skating", 2: "hiking", 3: "bicycle", 4: "running"})

# Jacket id that plays the role of the product name, e.g. "Id_997"
df["index"] = "Id_" + df["index2"].astype(str)
df

The jacket database after being generated from the above code looks like this:

The index field holds the jacket’s identification number. There are no jacket names in the database, so the index plays the role of the name.

Now I need to create a description field. Usually, stores have an extensive description field that includes functional features, dimensions and intended use of the products. Writing such a field for 5,000 jackets by hand would be very laborious. In a real store, it may also happen that such a descriptive field does not exist or is for some reason not suitable for the recommendation system being created. To create an effective description field for the recommendation system, the jacket features should be combined into one string of words. Gender and size are left out here because they are handled separately later. The new database field consisting of the individual features will be called ‘description’.

df['description'] = df['types']+ ' '+df['color']+' '+df['material']+' '+df['kind']+' '+df['pattern']+' '+df['use']+' '+df['sport']
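When the list of features changes often, the same field can be built from a list of column names instead of a long chain of additions. This is only a sketch on a made-up two-row sample; the names sample and feature_columns are assumptions, not part of the original database code.

```python
import pandas as pd

# Two hypothetical jackets with the same feature columns as in the article
sample = pd.DataFrame({
    "types": ["hunting", "sporty"],
    "color": ["green", "black"],
    "material": ["waterproof", "no-waterproof"],
    "kind": ["thermoactive", "no-thermoactive"],
    "pattern": ["none", "cars"],
    "use": ["extreme", "low"],
    "sport": ["hiking", "running"],
})

# Columns that go into the description; gender and size are left out on purpose
feature_columns = ["types", "color", "material", "kind", "pattern", "use", "sport"]

# Join the values of each row with single spaces
sample["description"] = sample[feature_columns].agg(" ".join, axis=1)
print(sample["description"].tolist())
```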

Now, for the system to search for similarities, the descriptions must be tokenized. Tokenization splits each description into individual words (tokens), and every jacket is then represented by the TF-IDF weights of its tokens. The system takes the tokens of the product indicated by the customer in the ‘description’ field and looks for the most similar products, also by the ‘description’ field, in the database of 5,000 jackets. The first step is done with the code below.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

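# Create a TfidfVectorizer that treats every whitespace-separated value as one token.
# The default tokenizer would split "no-waterproof" into "no" and "waterproof", and the
# English stop-word list would then drop "no" and "none", so no stop words are removed here.
tfidf = TfidfVectorizer(token_pattern=r"\S+")

# Fit and transform the data to a tfidf matrix
tfidf_matrix = tfidf.fit_transform(df["description"])

# Print the shape of the tfidf_matrix
tfidf_matrix.shape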
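Which tokens the vectorizer extracts from values such as "no-waterproof" depends on its settings. A minimal sketch, assuming a made-up description string, compares scikit-learn's default tokenizer combined with stop_words="english" against a whitespace-only token pattern:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A hypothetical description in the same format as the 'description' field
text = "hunting black no-waterproof no-thermoactive none extreme hiking"

# build_analyzer() returns the callable a vectorizer uses to turn text into tokens
default_analyzer = TfidfVectorizer(stop_words="english").build_analyzer()
whitespace_analyzer = TfidfVectorizer(token_pattern=r"\S+").build_analyzer()

default_tokens = default_analyzer(text)
whitespace_tokens = whitespace_analyzer(text)

print(default_tokens)
print(whitespace_tokens)
```

With the default settings the hyphenated values are split, and "no" and "none" are on the English stop-word list, so "no-waterproof" ends up as the same token as "waterproof". The whitespace pattern keeps every feature value as one token.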

The result is a TF-IDF matrix in which every row represents one jacket, so the cosine similarity described above can be computed on it. Because TfidfVectorizer normalizes every row to unit length by default, the dot product computed by linear_kernel is equal to the cosine similarity.
Below is the code that creates a 5000 by 5000 matrix of similarity values between individual jackets.
A value of 1 indicates perfect similarity. The closer the cosine value is to zero, the lower the similarity of the jackets.

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim.shape
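The use of linear_kernel for cosine similarity can be checked on a small example. The sketch below assumes three made-up descriptions and compares the dot products with scikit-learn's cosine_similarity function; the whitespace token pattern is an assumption of this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Three hypothetical descriptions in the article's format
descriptions = [
    "hunting green waterproof thermoactive none extreme hiking",
    "hunting brown waterproof thermoactive animals extreme hiking",
    "elegant white no-waterproof no-thermoactive flowers low skating",
]

# TfidfVectorizer uses norm="l2" by default, so every row has unit length
matrix = TfidfVectorizer(token_pattern=r"\S+").fit_transform(descriptions)
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))

dot_products = linear_kernel(matrix, matrix)
cosines = cosine_similarity(matrix, matrix)

print(np.allclose(row_norms, 1.0))         # rows are unit vectors
print(np.allclose(dot_products, cosines))  # dot product equals cosine
print(dot_products.round(3))
```

The first two jackets share most of their features and get a high score, while the third one shares no token with the first and scores zero.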

Creating a lookup from the jacket id to its row number in the similarity matrix.

indices = pd.Series(df.index, index=df["index"]).drop_duplicates()

The get_recommendations function receives the jacket id, the cosine similarity matrix and the number of jackets to recommend as input. It then returns the ids of the jackets most similar to the jacket the customer selected in the purchasing process.

def get_recommendations(index, cosine_sim=cosine_sim, num_recommend=10):
    # Row number of the selected jacket
    idx = indices[index]

    # Get the pairwise similarity scores of all other jackets with that jacket.
    # The jacket itself is skipped explicitly: jackets with identical descriptions
    # also score 1.0, so it is not guaranteed to come first after sorting.
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort the jackets based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the num_recommend most similar jackets
    top_similar = sim_scores[:num_recommend]

    # Get the jacket row numbers
    jacket_indices = [i[0] for i in top_similar]

    # Return the ids of the most similar jackets
    return df["index"].iloc[jacket_indices]
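For larger catalogs, the full jacket-by-jacket matrix does not have to be kept in memory: the similarity of one selected jacket to all others can be computed on demand. The sketch below is an assumption-laden illustration on a made-up four-row catalog; top_similar_rows is a hypothetical helper, not part of the code above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-catalog; in the article this would be df["description"]
descriptions = [
    "hunting green waterproof thermoactive none extreme hiking",        # row 0
    "hunting brown waterproof thermoactive animals extreme hiking",     # row 1
    "elegant white no-waterproof no-thermoactive flowers low skating",  # row 2
    "hunting green waterproof thermoactive none high hiking",           # row 3
]
matrix = TfidfVectorizer(token_pattern=r"\S+").fit_transform(descriptions)


def top_similar_rows(row, matrix, num_recommend=2):
    """Return row numbers of the most similar items, best first (hypothetical helper)."""
    # Similarity of one row against all rows: a vector with one score per item
    scores = linear_kernel(matrix[row], matrix).ravel()
    scores[row] = -np.inf  # never recommend the selected item itself
    order = np.argsort(-scores, kind="stable")
    return order[:num_recommend].tolist()


print(top_similar_rows(0, matrix))
```

Row 3 differs from row 0 in one feature and row 1 in two, so they are returned in that order.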

We can now check how similar two selected jackets are to each other. Just insert the ids of the selected jackets into the code below to find out the degree of similarity of both products. I used the jackets with the ids ‘Id_4019’ and ‘Id_2168’.

a = df.loc[df['index'] =='Id_4019'].index.tolist()
b = df.loc[df['index'] =='Id_2168'].index.tolist()
cosine_sim[a,b]  

# array([0.22425288])

Use of a recommendation system

Now we can finally use our recommendation tool. Let’s assume that a customer who is currently in the process of purchasing a winter jacket has chosen one. As I mentioned, jackets in our recommendation system have ids instead of names.

The customer chose a jacket with the index “Id_997”.

The system does not see the customer, but it can learn the customer’s preferences and match similar products according to the consumer’s choice. We ask the recommendation system to find us 20 similar jackets.

recommended = get_recommendations("Id_997", num_recommend=20)

# Rows of the recommended jackets, ordered from most to least similar
si = df.loc[recommended.index]
si.head(10)

The system found the 20 jackets most similar to the one the customer chose. As we can see, the jackets are quite similar. The customer wanted a hunting jacket, and most of the jackets are for hunting. The use was supposed to be ‘extreme’, and most are ‘extreme’; in the ‘sport’ category the jackets are ‘hiking’. The recommended jackets clearly resemble the customer’s choice: forest jackets for hunting and long hikes. Everything seems to be fine. Unfortunately, in the clothing industry there is such a thing as hard parameters. If the customer clearly indicated a large jacket, you cannot offer a small or tiny one, because it may irritate them. The customer turned out to be a woman, while part of the offer proposed by the recommendation system showed men’s jackets. To avoid this situation, search for similar jackets only within the matching subset: only women’s jackets and only the ‘great’ size.

Separating results for the client

Separation in terms of gender and size can be made before the process of identifying similarities or after identifying similarities. I used the second option.

# The jacket the customer selected
i = "Id_997"
choice = df.loc[df["index"] == i, ["index", "gender", "size"]].values
choice

sex = choice[0, 1]
size = choice[0, 2]

# Keep only the recommendations with the same gender and size as the chosen jacket
dk = si.loc[(si["gender"] == sex) & (si["size"] == size)]
print(dk.shape)
dk.head(5)

After this separation, only women’s jackets in the ‘great’ size remain among the recommendations, so the hard parameters of the customer’s choice are respected. Because I applied the exclusion after identifying similar products, fewer than 20 jackets may remain.
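The first option mentioned above, separating by gender and size before looking for similarities, can be sketched as follows. This is only an illustration on a made-up five-row catalog; the names catalog and recommend_same_gender_and_size are assumptions. Filtering first means the requested number of jackets is returned whenever enough matching jackets exist.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-catalog with the article's column names
catalog = pd.DataFrame({
    "index": ["Id_0", "Id_1", "Id_2", "Id_3", "Id_4"],
    "gender": ["female", "male", "female", "female", "female"],
    "size": ["great", "great", "great", "small", "great"],
    "description": [
        "hunting green waterproof thermoactive none extreme hiking",
        "hunting green waterproof thermoactive none extreme hiking",
        "hunting brown waterproof thermoactive animals extreme hiking",
        "hunting green waterproof thermoactive none extreme hiking",
        "elegant white no-waterproof no-thermoactive flowers low skating",
    ],
})
tfidf_matrix = TfidfVectorizer(token_pattern=r"\S+").fit_transform(catalog["description"])


def recommend_same_gender_and_size(jacket_id, num_recommend=2):
    """Rank only jackets sharing gender and size with the chosen one (hypothetical helper)."""
    row = catalog.index[catalog["index"] == jacket_id][0]
    chosen = catalog.loc[row]
    # Hard parameters first: same gender, same size, and not the chosen jacket itself
    mask = (
        (catalog["gender"] == chosen["gender"])
        & (catalog["size"] == chosen["size"])
        & (catalog.index != row)
    )
    candidates = np.flatnonzero(mask.to_numpy())
    if candidates.size == 0:
        return []
    # Similarity of the chosen jacket to the remaining candidates only
    scores = linear_kernel(tfidf_matrix[row], tfidf_matrix[candidates]).ravel()
    best = candidates[np.argsort(-scores, kind="stable")[:num_recommend]]
    return catalog["index"].iloc[best].tolist()


print(recommend_same_gender_and_size("Id_0"))
```

In this sample the men’s jacket Id_1 and the small jacket Id_3 have exactly the same description as the chosen one, yet neither is recommended because they fail the hard parameters.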

Summary

As we can see, a functional recommendation system, which may increase the probability of selling products by several dozen percent, consists of only a few dozen lines of code. This code needs to be integrated into the system supporting the online store’s website so that the website offers accurate products to its customers.
Online trade in Poland is becoming more and more mature. Even the slightest improvement in the effectiveness of customer communication can contribute to additional sales. A simple recommendation system like the one presented here can help an online store gain a competitive advantage and financial success.
Recently I was in the online store of a large construction company. I was looking for a drill in the power tools section. Each time, the system offered me work gloves and extension cords. This suggested that the website did not have a recommendation system and that the proposed products were simply the most frequently purchased ones.