Building a Product Recommendation System with your Sales Data

Eric Yang
May 19, 2021 · 5 min read


Photo by John Fowler on Unsplash

Introduction

This post shares my end-to-end implementation of a product recommendation system. The recommendation system I built is based on item-item collaborative filtering. We’ll build a multi-dimensional vector representation of each product via a co-occurrence matrix and find similar products by measuring the cosine similarity between all product vectors.

In terms of application, this system was built to power e-commerce product-to-product recommendations. For example, when a customer clicks on a product, most sites will show a product detail page (PDP), and on that page you’ll commonly see more products under headings such as ‘You Might Also Like’ or ‘Similar Products’.

Concepts

Co-occurrence Matrix

I won’t go into the details of a co-occurrence matrix, as this is something that has been written about quite a bit. But I will cover conceptually how a co-occurrence matrix works and why we can use it to create a recommendation system for similar products. The data we are utilizing is simply customer order data, and specifically we are interested in orders where products are purchased with other products.

Simplified Example

In our simplified example we only have two orders in our entire history:

Image by Author

We have three unique items: two different-colored toothbrushes, each of which was independently purchased with toothpaste. Using this information we are able to build a bridge from the blue toothbrush to the green toothbrush via the toothpaste.

Image by Author

Now imagine that we have thousands of orders. You’d probably expect that both colors of toothbrushes co-occur with things like floss, mouthwash, etc. We can utilize these product co-occurrences to build a multi-dimensional vector representation of every product in our catalogue.
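To make that concrete, here is a minimal sketch (with made-up product names) of how the two orders above could be turned into a co-occurrence matrix. The full model code below does the same thing at catalogue scale.

import pandas as pd

# Two hypothetical orders: each toothbrush was bought together with toothpaste
orders = pd.DataFrame({
    'order_id':   [1, 1, 2, 2],
    'product_id': ['blue_toothbrush', 'toothpaste', 'green_toothbrush', 'toothpaste'],
})

# One row per order, one column per product, values are purchase counts
basket = pd.crosstab(orders['order_id'], orders['product_id'])

# Co-occurrence matrix: how often each pair of products shows up in the same order.
# The two toothbrushes never co-occur directly, but their rows both contain
# toothpaste, which is the "bridge" described above.
co_occurrence = basket.T.dot(basket)
print(co_occurrence)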

Cosine Similarity

To determine if a product is similar to another product, we take the cosine similarity between their vector representations from our co-occurrence matrix and receive a score between -1 and 1. A score of 1 means the vectors point in the same direction, while a score of -1 means they point in opposite directions. You can find the details of cosine similarity here.
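As a quick illustration, here is the calculation on two small made-up vectors (the numbers are arbitrary placeholders, not taken from the toothbrush example):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical co-occurrence counts of two products against four other products
product_a = np.array([[0, 1, 1, 0]])
product_b = np.array([[0, 1, 1, 1]])

score = cosine_similarity(product_a, product_b)[0, 0]
print(round(score, 3))  # ~0.816: the vectors point in a similar direction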

Data

As mentioned earlier, our data is customer order data. For most datasets you’ll probably want to join an order-level table to an order-line-level table so you can see all the items that occurred within a specific order ID.

Image by Author
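As a rough sketch, that join might look something like the following. The file names and extra columns here are placeholders; the only columns the model code below relies on are order_id, product_id, and category.

import pandas as pd

orders      = pd.read_csv('orders.csv')       # one row per order: order_id, order_date, ...
order_lines = pd.read_csv('order_lines.csv')  # one row per item: order_id, product_id, category, ...

# Join so each row ties a product (and its category) to the order it was purchased in
sales_df = orders.merge(order_lines, on='order_id', how='inner')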

Model Code

Library Dependencies:

import pandas as pd
import numpy as np
import s3fs
import json
from sklearn.metrics.pairwise import cosine_similarity
import datetime

Begin by loading your customer order data in the format above as a pandas data frame named sales_df.

Once you’ve loaded your data, you’ll need to pivot it so that each row is an order, each column is a product, and the values are the counts of each product in the order. Note: pandas will automatically cast the pivoted values to float64, so you might want to downcast the data if you have memory constraints (a small sketch of this follows the code below).

pivot_df = pd.pivot_table(sales_df, index='order_id', columns='product_id', values='category', aggfunc='count')
pivot_df.reset_index(inplace=True)
pivot_df = pivot_df.fillna(0)
pivot_df = pivot_df.drop('order_id', axis=1)
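If memory is a concern, one option (not part of the original workflow, just a suggestion) is to downcast the pivoted counts to a smaller dtype before any of the matrix math:

# Downcast from the default float64; float32 halves memory while keeping the math safe
pivot_df = pivot_df.astype(np.float32)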

Next we transform our pivot table into a co-occurrence matrix by taking the dot product of the pivot table and its transpose.

co_matrix = pivot_df.T.dot(pivot_df)
np.fill_diagonal(co_matrix.values, 0)  # zero out the diagonal so a product doesn't co-occur with itself

And to transform the co-occurrence matrix into a matrix of cosine similarities between our products we utilize the cosine_similarity function from sklearn.

cos_score_df = pd.DataFrame(cosine_similarity(co_matrix))
cos_score_df.index = co_matrix.index
cos_score_df.columns = np.array(co_matrix.index)

Product x Product Cosine Similarity Scores
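For example, to pull the five most similar products to a single item straight out of this matrix (the product ID here is a hypothetical placeholder):

base_product = 'P12345'  # hypothetical product_id from your catalogue
top_similar = (cos_score_df[base_product]
               .drop(index=base_product)      # exclude the product itself
               .sort_values(ascending=False)
               .head(5))
print(top_similar)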

Model Validation

As with most unsupervised learning models, model validation can be tricky. For our dataset we have a diverse set of product categories. Since we are creating a recommender to show similar products we should expect our model to return recommendations that are in the same category as the original base product.

For Each Product Category:

Count(Products in Category Whose Best Recommendation Is Also in That Category) / Count(Products in Category) = % of Recommendations in Same Category

Example:

We generated recommendations for 735 wellness products, and based on the best cosine similarity score for each wellness product, 720 of those recommendations were also in the wellness category, or 98% same-category recommendations. With such a high percentage of same-category recommendations, we can feel more confident that we have a strong signal in our purchase data to power our model.
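Here is a sketch of how this check could be computed, assuming a product-to-category lookup built from the same sales_df used above (the helper names are mine, not from the original pipeline):

# Map each product_id to its category (assumes one category per product)
product_category = sales_df.drop_duplicates('product_id').set_index('product_id')['category']

# Best (highest-scoring) recommendation for each product, excluding the product itself
best_rec = cos_score_df.apply(lambda col: col.drop(index=col.name).idxmax())

# Share of products whose best recommendation falls in the same category, per category
base_cat = product_category.loc[best_rec.index].values
rec_cat  = product_category.loc[best_rec.values].values
same_cat = pd.Series(base_cat == rec_cat, index=best_rec.index)
pct_same = same_cat.groupby(base_cat).mean()
print(pct_same)  # e.g. wellness -> 0.98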

Next Steps

From here, model validation would continue once we promote the first version of our model to production and commence with an A/B test. Some parameter-tuning considerations as you iterate on your model would be a cosine similarity score threshold or a sample size threshold, to limit recommendations to the ones where we have the highest confidence.
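One way a score threshold could look in practice (the cut-off value here is an arbitrary placeholder, not a tuned number):

MIN_SCORE = 0.2  # placeholder threshold; tune against your own validation metric

def top_recs(product_id, n=5):
    """Top-n recommendations for a product, keeping only sufficiently similar items."""
    scores = cos_score_df[product_id].drop(index=product_id)
    scores = scores[scores >= MIN_SCORE]
    return scores.sort_values(ascending=False).head(n)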

Deployment

Our deployment process was fairly straightforward. We output a JSON file with the top n recommendations to an S3 bucket. This JSON file is then picked up by our platform engineering team and loaded into a Postgres database, which is used to serve products on the front end.

Top Five Highest Scoring Recommendations and JSON Output:

# Take top five scoring recs that aren't the original product
product_recs = []
for i in cos_score_df.index:
    product_recs.append(cos_score_df[cos_score_df.index != i][i].sort_values(ascending=False)[0:5].index)

product_recs_df = pd.DataFrame(product_recs)
product_recs_df['recs_list'] = product_recs_df.values.tolist()
product_recs_df.index = cos_score_df.index
# Semi-colon delimited JSON output
product_recs_df['json_out'] = product_recs_df['recs_list'].apply(lambda x: [str(element) for element in x])
product_recs_df['json_out'] = product_recs_df['json_out'].apply(lambda x: ";".join(x))
product_recs_dict = product_recs_df.to_dict()
json_out = json.dumps(product_recs_dict['json_out'],indent = 4,ensure_ascii = False).encode('UTF-8')

Output JSON to S3 Bucket :

ts = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M")
fs = s3fs.S3FileSystem(key=s3_key, secret=s3_value)
with fs.open('s3://yourbucket/key' + ts + '.json', 'wb') as f:
    f.write(bytes(json_out))

Conclusion

With that, we have created an end-to-end product recommendation system for similar products using nothing but historical sales data. As with all models, the quality of your model outputs will depend on the quality of your data. Typically, the larger the sample of orders we have for each product the better, as we would expect larger sample sizes to reduce the noise from random product co-occurrences. To find the right sample size threshold for your model, you can evaluate the model validation metric (% of Recommendations in Same Category) at different sample thresholds to see at which threshold you start seeing a meaningful drop-off in the evaluation metric.
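A rough sketch of that sweep, reusing the hypothetical product_category and best_rec helpers from the validation section (the threshold values are illustrative only):

# Number of distinct orders each product appears in
order_counts = sales_df.groupby('product_id')['order_id'].nunique()

for threshold in [1, 5, 10, 25, 50]:
    keep = order_counts[order_counts >= threshold].index
    subset = best_rec[best_rec.index.isin(keep)]
    same = (product_category.loc[subset.index].values ==
            product_category.loc[subset.values].values).mean()
    print(f"min orders >= {threshold}: {same:.1%} same-category recommendations")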
