
This post is an overview of my Amazon research project, which I've been working on for a while now. I decided to share some of my thoughts and insights here, hoping they might be inspiring for anyone interested. The score building section in particular is the result of many hours of thinking and discussion about the first principles of competition in a market. I hope you enjoy it.

Table of contents

  1. The idea

  2. Generating Product Niche Titles

  3. Collecting, tagging and cleaning the data

  4. Exploring the data

  5. Building a score for finding the gaps

  6. Results

The Idea

More than 60% of Amazon's retail sales revenue comes from individual third-party merchants; this is the world of Amazon FBA (Fulfillment by Amazon). Think millions of products in different niches. My goal is to first understand the competition in these niches and then to find the gaps in that competition across the market.

There are no public datasets for the niches, products, or sales data, so I had to gather the data from multiple sources. It started with using NLP tools to come up with, and validate, tens of thousands of niche titles. Then I collected data on product features such as category, title, and price, and finally the sales data, which is the most important piece of the puzzle. It all comes together in the score building process: I used simple statistical tools and reward/penalty functions to evaluate the demand and supply in these niches and build a score for each of them.

Generating Product Niche Titles (NER with spaCy and BERT)

Think of all the businesses and individuals selling products on their own websites (Shopify, WooCommerce, …) or on marketplaces like eBay and Amazon. Countless products are uploaded to these platforms every single second.

Most of this data is well structured: there is a product title, description, price, images, reviews, etc. It is all categorized and tagged. And yet it is not that easy to extract the categories and niches of these products; there is no specific field in all that well-structured data that tells us what a product actually is. In this post I'll go through how I went about extracting niche titles out of product titles.

I need these niche titles for my other project. I tried four different ways of getting them: scraping ready-made titles, spaCy, Google's NLP API, and finally building my own NER model with BERT.

Here is a summary of these trials:

Ready-to-use labeled titles


Before using NER to extract niche titles from product titles, let's first see if there's an easier way of finding these niches: https://www.buzzfeed.com/shopping

There is a lazy-loaded feed of countless product reviews on BuzzFeed. Take a close look at these articles and you'll immediately see the pattern.

So there it is: a large number of niche titles, all hand-labeled.

I wrote a script to scrape these for my Amazon data analysis project. Here is a quick guide: start with buzzfeed.com/us/feedpage/feed/shopping-amazon?page={}&page_name=shopping and collect the links to the articles.

import re
from selenium import webdriver

browser = webdriver.Chrome()

def get_buzzfeed_articles():
    # walk the lazy-loaded shopping feed and collect the article links
    url = "https://www.buzzfeed.com/us/feedpage/feed/shopping-amazon?page={}&page_name=shopping"
    articles = []
    for page in range(2, 20):
        browser.get(url.format(page))
        rows = browser.find_elements_by_class_name("js-card__link")
        for row in rows:
            article_link = row.get_attribute("href")
            articles.append(article_link)
    return articles


def check_product_price(product_text):
    # keep only products priced between $15 and $60
    try:
        myprice = float(re.findall(r"(\d+\.\d{1,2})", product_text)[0])
        return 15 < myprice < 60
    except (IndexError, ValueError):
        return False


def get_products_from_article(article_url):
    # extract the Amazon ASIN and the hand-labeled product name from each sub-buzz
    browser.get(article_url)
    product_wrapper = browser.find_elements_by_class_name("subbuzz")
    products = []
    for product in product_wrapper:
        try:
            product_title = product.find_element_by_class_name("js-subbuzz__title-text")
            mylink = product_title.find_element_by_tag_name("a").get_attribute("href")
            if not check_product_price(product.text):
                continue
            source_product_name = product_title.find_element_by_tag_name("a").text
            if mylink.startswith('https://www.amazon.com/dp'):
                amazon_asin = mylink[26:36]  # the 10-character ASIN right after ".../dp/"
                products.append({"asin": amazon_asin,
                                 "source_product_name": source_product_name})
        except Exception:
            pass
    return products

It is always nice to tag the source of the scraped data so that we can use it later in the data cleaning and processing stage.

import requests

# push each scraped product to the tag-management REST API, tagged with its source
for product in products:
    product["source"] = "buzzfeed_price_15_to_60"
    product["scraping_status"] = "queue"
    requests.post(url, json=product)  # url is the REST API endpoint for products

Using spaCy's "en_core_web_lg" model

spaCy tagged 638 "PRODUCT" entities out of 20k product titles with en_core_web_lg, and 129 with en_core_web_sm.

import io
import re
import spacy
import spacy.cli

spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

# "uploaded" comes from google.colab.files.upload(); titles.csv holds the 20k product titles
f = io.BytesIO(uploaded['titles.csv'])
products = f.getvalue().decode("utf-8").split("\n")

mined_niches = open("spacey_ner_niches.csv", "w")
qualified_niches = []
for i in range(len(products)):
    text = products[i].split(",")[1]
    doc = nlp(text)
    for entity in doc.ents:
        if entity.label_ == "PRODUCT":
            title = entity.text
            title_words = len(title.split(" "))
            # keep multi-word titles (2-7 words) that contain no digits, no duplicates
            if 1 < title_words < 8 and not re.search(r"\d", title):
                if title not in qualified_niches:
                    qualified_niches.append(title)
                    mined_niches.write(title + "\n")

Google’s NLP API

Google's NLP API tagged about 3,200 "CONSUMER_GOOD" entities out of the 20k product titles.

import os
import re
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "   "  # path to the service account key
client = language.LanguageServiceClient()

for i in range(len(products)):
    text = products[i].split(",")[1]
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)
    entities = client.analyze_entities(document).entities

    for entity in entities:
        entity_type = enums.Entity.Type(entity.type)
        if entity_type.name == "CONSUMER_GOOD":
            niche = entity.mentions[0].text.content
            niche_words = len(niche.split(" "))
            # same filter as before: 2-7 words, no digits, no duplicates
            if 1 < niche_words < 8 and not re.search(r"\d", niche):
                if niche not in qualified_niches:
                    qualified_niches.append(niche)
                    mined_niches.write(niche + "\n")
    ...

Here are some of the extracted niches:

Protractor Set
Box  Lathe Tool
Hobby Bead Craft Tools
Trigger Spray Bottle
Crystal Plastic Lid Cover
Acrimet Magazine File Holder
Acrimet Premium Metal Bookends
Acrimet Stackable Letter Tray
TimeQplus Proximity Bundle
Acroprint BioTouch
Clear Parts Storage Box
...

Building my own training dataset with product tags

I was curious about training my own NER model to recognize PRODUCT entities, so I started searching for NER training datasets with product tags in them. I found a dataset of Best Buy products tagged with Category, Brand, ModelName, ScreenSize, RAM, Storage, and Price, but I couldn't download it; it seems to have been deleted for some reason.

I did a lot of googling but couldn't find a public dataset with product tags in it. I decided I still wanted to train my own NER model, for the fun of it at least, and then see if I could also build my own dataset using the data I collected from BuzzFeed. I followed this awesome tutorial and trained a model on CoNLL2003 with BERT, using Spark NLP.
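The training setup from that tutorial looks roughly like the sketch below. This is a minimal version: the embedding model name and the hyperparameters here are assumptions for illustration, not necessarily what I ended up with.

# Minimal Spark NLP sketch: train an NER model on CoNLL2003 with BERT embeddings.
# "bert_base_cased" and the hyperparameters are illustrative choices.
import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import BertEmbeddings, NerDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

# CoNLL() parses eng.train into document / sentence / token / pos / label columns
training_data = CoNLL().readDataset(spark, "eng.train")

bert = BertEmbeddings.pretrained("bert_base_cased") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_tagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5) \
    .setLr(0.003)

ner_model = Pipeline(stages=[bert, ner_tagger]).fit(training_data)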

I also found a related paper that uses the Amazon review dataset to annotate products mentioned in the reviews.

I couldn't figure out exactly how they annotated the so-called Components, but I thought of something similar myself: use a model trained on CoNLL2003 to annotate entities in the 20k product titles, even though there are no product tags in that training data, and then provide the product tags using my own labels from the BuzzFeed dataset.

Here is the format of eng.train in CoNLL2003:

-DOCSTART- -X- O O

Goldman NNP I-NP I-ORG
Sachs NNP I-NP I-ORG
sets VBZ I-VP O
warrants NNS I-NP O
on IN I-PP O
Continental NNP I-NP I-ORG
. . O O

LONDON NNP I-NP I-LOC
1996-08-23 CD I-NP O

So, for example, look at this product on BuzzFeed Shopping:


The Amazon link points to a product with the title "Maytex 50681 Mesh Pockets Shower Curtain Or Liner". We can search for it on Amazon and collect hundreds of similar product titles in the same niche. And here are the predictions of the model I trained on CoNLL2003:

Maytex NNP NNP B-ORG
50681 CD CD O
Mesh NNP NNP O
Pockets NNP NNP O
Shower NNP NNP O
Curtain NNP NNP O
Or CC CC O
Liner NNP NNP O


{'document': ['Maytex 50681 Mesh Pockets Shower Curtain Or Liner'],
 'embeddings': ['Maytex',
  '50681',
  'Mesh',
  'Pockets',
  'Shower',
  'Curtain',
  'Or',
  'Liner'],
 'ner': ['B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
 'ner_chunk': ['Maytex'],
 'pos': ['NNP', 'CD', 'NNP', 'NNP', 'NNP', 'NNP', 'CC', 'NNP'],
 'sentence': ['Maytex 50681 Mesh Pockets Shower Curtain Or Liner'],
 'token': ['Maytex',
  '50681',
  'Mesh',
  'Pockets',
  'Shower',
  'Curtain',
  'Or',
  'Liner']}

Now I have the labels for Shower and Curtain, which I can replace with B-PRODUCT and I-PRODUCT. Then I can write it all back out as a new training dataset, which I should be able to use to train the new model. I'm not sure it would be worth the effort, so I stopped here and moved on with my Amazon research project.
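A rough sketch of that relabeling step, using the prediction shown above and the BuzzFeed label ("Shower Curtain") to decide which tokens become product tokens. The helper name and the chunk-tag handling here are my own, purely illustrative:

# Sketch: rewrite the model's output so tokens that belong to the known niche
# title get B-PRODUCT / I-PRODUCT tags, in CoNLL2003 column format.
prediction = {
    "token": ["Maytex", "50681", "Mesh", "Pockets", "Shower", "Curtain", "Or", "Liner"],
    "pos":   ["NNP", "CD", "NNP", "NNP", "NNP", "NNP", "CC", "NNP"],
    "ner":   ["B-ORG", "O", "O", "O", "O", "O", "O", "O"],
}

def to_product_conll(prediction, niche_title):
    niche_tokens = set(niche_title.split())
    lines, inside = [], False
    for token, pos, ner in zip(prediction["token"], prediction["pos"], prediction["ner"]):
        if token in niche_tokens:
            tag = "I-PRODUCT" if inside else "B-PRODUCT"
            inside = True
        else:
            tag, inside = ner, False
        # CoNLL2003 columns: token, POS, chunk, NER (chunk tag kept generic here)
        lines.append(f"{token} {pos} I-NP {tag}")
    return "\n".join(lines)

print(to_product_conll(prediction, "Shower Curtain"))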

Active learning with prodi.gy and BuzzFeed's hand-labeled data

Training NLP models requires lots of labeled data, and I don't know of any available NER datasets with "product" tags. That is where active learning comes in. So here is the idea:

Use active learning to fine-tune a model on the hand-labeled data we got from BuzzFeed. For each niche title we can collect hundreds of related product titles from Amazon, then improve an existing general-purpose spaCy model like "en_core_web_lg" (which does include the PRODUCT tag) by correcting its predictions and adding the entities spaCy is missing.

Prodigy is a pretty expensive tool and I didn't get to try it out (and the Google NLP results were good enough for my Amazon research project), but I think active learning on BuzzFeed's labeled data might have produced the best results of all the methods discussed in this post.

Collecting, tagging and cleaning the data

The data has been collected from multiple sources: Amazon itself, estimates of product sales numbers, and keyword search data. I built a pipeline and a tag-management tool used by crawlers on multiple machines, all working against a Django REST API. The crawlers ask the REST API for a niche, product, or keyword to crawl; scraping tasks are put in a queue, crawled, and finally submitted to the DB.
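A stripped-down sketch of the crawler side of that loop. The host, endpoint paths, and helper here are placeholders for illustration, not the real routes of the Django REST API:

# Sketch of a crawler worker talking to the tag-management REST API.
# The host and endpoint paths are hypothetical placeholders.
import time
import requests

API = "http://tag-manager.local"

def scrape(task):
    # crawler-specific scraping logic (Selenium etc.); returns a dict of product fields
    raise NotImplementedError

def worker_loop():
    while True:
        # ask the API for the next queued niche / product / keyword to crawl
        task = requests.get(f"{API}/api/tasks/next/").json()
        if not task:
            time.sleep(30)
            continue
        result = scrape(task)
        # submit the scraped data back to the DB through the API
        requests.post(f"{API}/api/products/", json=result)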


Product Niche titles

The titles are extracted from hundreds of thousands of product titles; I explained earlier in this post how I tried different approaches for the NER process. Here are some examples of the extracted niches:

Protractor Set
Box Lathe Tool
Hobby Bead Craft Tools
Trigger Spray Bottle
Crystal Plastic Lid Cover
Acrimet Magazine File Holder
Acrimet Premium Metal Bookends
Acrimet Stackable Letter Tray
TimeQplus Proximity Bundle
Acroprint BioTouch
Clear Parts Storage Box
...

I did most of my analysis in Jupyter notebooks. Here I share small pieces of that.

The data is in 3 different tables for niches, products and keywords. Now let’s load them and take a look.

import pandas as pd

# conn is an open connection to the crawler database (e.g. from psycopg2 or sqlite3)
niches = pd.read_sql_query("SELECT * FROM rest_subniche", conn)
products = pd.read_sql_query("SELECT * FROM rest_product", conn)
keywords = pd.read_sql_query("SELECT * FROM rest_keyword", conn)

These niches are picked randomly from a wide spectrum of product titles, which gives us all the major categories.
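The category breakdown below is just a value_counts over the category column; the exact call is my guess, reconstructed from the output format:

# category breakdown of the collected data (the exact table/column is my guess,
# based on the "Name: category" line in the output below)
products["category"].value_counts()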

Beauty & Personal Care       529
Automotive                   508
Clothing, Shoes & Jewelry    242
Tools & Home Improvement     105
Home & Kitchen                88
Health & Household            87
Sports & Outdoors             45
Industrial & Scientific       33
Toys & Games                  30
Office Products               30
Electronics                   24
Cell Phones & Accessories     22
Patio, Lawn & Garden          20
Kitchen & Dining              18
Pet Supplies                  18
Grocery & Gourmet Food        13
Baby                          12
Books                          8
Arts, Crafts & Sewing          8
Musical Instruments            4
Computers & Accessories        3
Camera & Photo                 3
Appliances                     2
Name: category, dtype: int64

Cleaning the data and reliability

After some experimenting and thinking through multiple data cleaning strategies, I decided I needed to filter out a big chunk of the data containing empty and null values. I tried replacing them, but that makes the score less reliable and complicates things. We don't want loosely supported niches leaking into the score building process.

# num_columns holds the numeric product columns listed below
products[num_columns].isna().sum()

price            5830
margin           5779
sales            6609
revenue          6730
bsr              6564
reviews             1
weight           3832
number_images    3940
rank_position    3720
dtype: int64

And finally we end up with 394 niches left after dropping empty and null values.

products.dropna(axis=0,how="any",inplace=True)
products.shape
(1852, 17)

products.mother_niche_id.unique().size
394

Exploring the data

distributions

Looking into the distributions of prices, revenues, and reviews helps us develop some intuition for building the score. Let's look at some plots.
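A minimal sketch of the kind of calls behind the plots below (the column names come from the tables above; the clipping quantile is an illustrative choice):

# Histograms of price, margin, revenue, and reviews to eyeball the distributions.
# Clipping at the 95th percentile keeps the long tails from flattening the plots.
import matplotlib.pyplot as plt

cols = ["price", "margin", "revenue", "reviews"]
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for col, ax in zip(cols, axes.flatten()):
    clipped = products[products[col] < products[col].quantile(0.95)]
    clipped[col].hist(bins=50, ax=ax)
    ax.set_title(col)
plt.tight_layout()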

[Plots: distributions of prices, revenues, and reviews]

Scatter of reviews and revenues

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
revenue_reviews_scatter = products[(products["revenue"] < 100000) & (products["reviews"] < 1000)]
revenue_reviews_scatter[["reviews", "revenue"]].plot(x="reviews", y="revenue", kind="scatter", ax=ax)


Building a score for finding the gaps

The intuition behind the score formula comes from the simple supply and demand economics of a market. We're trying to find the niches with the highest revenues and the lowest number of reviews. Then there are other factors we have to be careful about: we have to watch out for monopolies where a few sellers dominate a niche, we're looking for a high volume of monthly searches, and we need to avoid high variance across sales and review numbers.

Here is a simple list of the main elements going into the score:

  • Counting the products within a specific range of revenue and reviews

  • Revenue/Review ratios

  • Ranking weights and confidence score

  • Penalizing niches for too much variance

  • Penalizing toxic niches where a few sellers hold a monopoly position (a sketch of these two penalties follows this list)

  • Search demand (exact match volumes), sponsored ads, and bids
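Of these, the counting and the ratio comparison get their own sections below. For the variance and monopoly penalties, here is one plausible way to express the idea; this is a sketch only, not the exact functions that went into the final score:

# Sketch of the two penalty terms: illustrative formulations of the ideas above.
def variance_penalty(niche_products):
    # penalize niches whose revenues are all over the place:
    # coefficient of variation of revenue, capped at 1
    cv = niche_products["revenue"].std() / niche_products["revenue"].mean()
    return min(cv, 1.0)

def monopoly_penalty(niche_products, top_n=3):
    # penalize "toxic" niches where a handful of products capture most of the
    # revenue: the revenue share of the top_n products
    top_share = niche_products["revenue"].nlargest(top_n).sum() / niche_products["revenue"].sum()
    return top_share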

Counting products in the range

Counting is the core of the score building strategy: it's simply a (normalized) count of the niche's products that fall within our desired band of revenue and reviews.

def count_score(niche_products):
    ...
    # Count_min_revenue and Count_max_review are preset thresholds for the desired band
    in_band = niche_products[(niche_products["revenue"] > Count_min_revenue) &
                             (niche_products["reviews"] < Count_max_review)]
    return in_band.shape[0] / niche_products.shape[0]

Comparing revenue/review ratios to the median

The revenue/review ratio is the second most important part of the score. We set an upper and a lower band around the median of all revenue/review ratios and compare each product's ratio to those bands, which maps it to -1, 0, or +1 (scaled by a rank weight). We then use the mean of those baselined ratios as our ratio_score.

...
ratio_baseline = products["revenue/reviews"].median()
...

def compare_to_baseline(niche_products):
    baselined_ratios = []
    band = 0.2
    upper_band = (1 + band) * ratio_baseline
    lower_band = (1 - band) * ratio_baseline
    for i, product in niche_products.iterrows():
        ...
        # map each product's ratio to -1 / 0 / +1 relative to the bands,
        # scaled by its rank weight
        if lower_band < product["revenue/reviews"] < upper_band:
            baselined_ratios.append(0)
        elif product["revenue/reviews"] > upper_band:
            baselined_ratios.append(+1 * product_rank_weight)
        elif product["revenue/reviews"] < lower_band:
            baselined_ratios.append(-1 * product_rank_weight)
        ...


Ranking the niches

count_score, ratio_score, and the other building blocks of the score discussed above are aggregated into a single score, which is then used to rank the niches.
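Schematically, the aggregation looks something like the sketch below. The weights and the exact combination are illustrative, and the real score also folds in the search-demand signals:

# Sketch of combining the building blocks into one niche score and ranking by it.
# ratio_score here stands for the mean of compare_to_baseline's output; the
# weights are illustrative, not the ones used in the final version.
def niche_score(niche_products):
    return (2.0 * count_score(niche_products)
            + 1.0 * ratio_score(niche_products)
            - 1.0 * variance_penalty(niche_products)
            - 1.0 * monopoly_penalty(niche_products))

scores = products.groupby("mother_niche_id").apply(niche_score)
ranked_niches = scores.sort_values(ascending=False)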


Feel free to contact me if you find the collected data or the tool useful for your specific use cases.

cheers