Consumer goods Classification - NLP & DIP

Problem Definition

On a marketplace site, sellers offer items to buyers by posting images of the item and providing a detailed description. To keep the experience as smooth as possible for both sellers and buyers, and with a view to scaling up, item classification needs to be automated.

This project studies the feasibility of an engine that classifies items into predefined categories with a sufficient level of precision, based on an image and a description.

The methodology will be as follows:

A pre-processing step on text and image
A step for extracting features from text and image
A dimensionality reduction step
A clustering step based on the reduced variables
Finally, a comparison of the clusters with the real product categories to evaluate the classification model

Data Collection

A data sheet contains all the product information, including the description needed for the NLP part, as well as a product image file.

Exploratory Data Analysis

Single feature analysis

Description

Descriptions are between 13 and 587 words long, with a majority between 13 and 100.

Product Name

Product names are between 2 and 27 words long, with a majority between 4 and 10 words.

These two features will be used for the NLP part of the classification model.

Data Pre-processing

A preprocessing step is required for text and images.

Text

The “Description” and “Product name” variables are grouped together.

The following treatments are then applied:

lower: converts the text to lowercase
expand_contraction: the text being in English, contractions are expanded (for example, “don't” becomes “do not”)
noise_removal: removes URLs, HTML tags, non-ASCII characters…
punctuation removal: removes punctuation
number removal: removes numbers
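
The cleaning steps listed above can be sketched with the Python standard library only. This is a minimal illustration, not the project's actual code: the function name `clean_text` and the tiny `CONTRACTIONS` map are hypothetical, and a real pipeline would use a much fuller contraction list.

```python
import html
import re
import string

# Hypothetical, deliberately tiny contraction map (assumption for illustration).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
_CONTRACTION_RE = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Apply the cleaning steps in the order listed in the text."""
    # lower: convert to lowercase
    text = text.lower()
    # expand_contraction: replace known contractions by their full form
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    # noise_removal: HTML entities, URLs, HTML tags, non-ASCII characters
    text = html.unescape(text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # punctuation removal: replace each punctuation mark by a space
    text = text.translate(_PUNCT_TABLE)
    # number removal: drop digit sequences
    text = re.sub(r"\d+", " ", text)
    # collapse the whitespace left behind by the previous steps
    return " ".join(text.split())
```

Contractions are expanded before punctuation is removed, otherwise the apostrophe would already be gone and “don't” could no longer be recognized.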

We continue with:

Tokenization: splits the text into a list of tokens
Stopwords: removes very frequent words that carry little information
Lemmatization: reduces each word to its dictionary form (lemma), taking the context into account.
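
These three steps can be sketched as follows. In practice a library such as NLTK or spaCy would supply the stopword list and the lemmatizer; the small `STOPWORDS` set and `LEMMAS` dictionary below are hypothetical stand-ins so that the example runs on its own, and a plain dictionary lookup does not take context into account the way a real lemmatizer does.

```python
# Hypothetical stand-ins for a library's stopword list and lemma dictionary.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "for", "with", "this"}
LEMMAS = {"watches": "watch", "straps": "strap", "made": "make"}


def tokenize(text):
    """Split cleaned text (lowercase, no punctuation) on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Drop very frequent words that carry little information."""
    return [tok for tok in tokens if tok not in STOPWORDS]


def lemmatize(tokens):
    """Map each token to its lemma when known, otherwise keep it unchanged."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def to_tokens(text):
    """Tokenization, then stopword removal, then lemmatization."""
    return lemmatize(remove_stopwords(tokenize(text)))
```

For example, `to_tokens("this set of watches is made with leather straps")` keeps only `set`, `watch`, `make`, `leather` and `strap`.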

Image

In order to make feature extraction more efficient, it is necessary to pre-process images.

For Bag-of-Features techniques such as SIFT or ORB, apply in order:

grey scale: converts the image to grayscale
histogram equalization: contrast improvement
histogram stretching: exposure correction
mean filter: noise attenuation by local averaging
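
The four steps above can be sketched with NumPy as follows. This is an illustrative sketch under assumptions, not the project's code: images are assumed to be 8-bit RGB arrays of shape (height, width, 3), the function names are hypothetical, and an image library such as OpenCV would normally provide optimized equivalents.

```python
import numpy as np


def to_grayscale(rgb):
    """Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.clip(np.rint(rgb[..., :3] @ weights), 0, 255).astype(np.uint8)


def equalize_histogram(gray):
    """Spread the grey levels using the cumulative histogram (contrast)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    if cdf_min == gray.size:  # constant image: nothing to equalize
        return gray.copy()
    lut = np.rint((cdf - cdf_min) / (gray.size - cdf_min) * 255)
    return np.clip(lut, 0, 255).astype(np.uint8)[gray]


def stretch_histogram(gray):
    """Linearly rescale grey levels to the full 0-255 range (exposure)."""
    lo, hi = int(gray.min()), int(gray.max())
    if hi == lo:
        return gray.copy()
    return np.rint((gray.astype(np.float64) - lo) / (hi - lo) * 255).astype(np.uint8)


def mean_filter(gray, size=3):
    """Replace each pixel by the mean of its size x size neighbourhood."""
    pad = size // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    out = np.zeros(gray.shape, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return np.rint(out / size**2).astype(np.uint8)


def preprocess(rgb):
    """Apply the four steps in the order given in the text."""
    return mean_filter(stretch_histogram(equalize_histogram(to_grayscale(rgb))))
```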

CNN techniques such as VGG16, however, come with their own pre-processing, implemented in their library.

Model

The text and image parts are developed and evaluated separately. They are then evaluated together, to determine whether coupling the two methods is worthwhile.

Feature Extraction

Text

Two types of methods are available:

Bag-of-words algorithms: give a reduced and simplified representation of a text document as a vector, based on criteria such as word frequency. Examples: CountVectorizer, TF-IDF. Pros: fast, and the vocabulary is built from the corpus itself, so no word of the corpus is unknown. Cons: ignore the position of a word in the sentence and do not capture its meaning.
Embedding methods: give dense numeric vector representations of the semantics of words, including their literal and implicit meaning. These word vectors can thus capture connotation, and they are combined into one dense vector per sentence. Examples: Word2Vec, BERT, USE. Pros: often pre-trained, capture semantics, and contextual models such as BERT take word position into account. Cons: words outside the training vocabulary may not be represented (notably with Word2Vec), more complex, “black box”.
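
To make the bag-of-words family concrete, here is a minimal TF-IDF sketch in pure Python. It is a simplified illustration, not the implementation used in the project: the function name `tfidf` is hypothetical, and the weighting is an assumption (raw term count multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization).

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalized vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors
```

A word that appears in few documents receives a higher weight than one that appears in many. Two documents containing the same words in a different order get identical vectors, which illustrates the main drawback listed above.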

Image

For feature extraction from images, several types of methods are available.

Bag-of-visual-words algorithms: detect key points in an image and describe each of them with a feature vector. Together, these descriptors form a digital fingerprint of the image that is largely invariant to transformations such as scaling and rotation. Examples: SIFT, ORB
CNN transfer learning algorithms: a pre-trained convolutional neural network takes an input image and returns its features, extracting and prioritizing them automatically. Example: VGG16, used as a standalone feature extractor
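
For the bag-of-visual-words approach, the step that turns local descriptors into one fixed-size vector per image can be sketched as below. The assumptions are stated plainly: `descriptors` stands for the key point descriptors already extracted by SIFT or ORB, `codebook` stands for visual words obtained beforehand (typically by clustering the descriptors of all images), and the function name is hypothetical. Euclidean distance is used for simplicity, whereas binary ORB descriptors are usually compared with the Hamming distance.

```python
import numpy as np


def bovw_histogram(descriptors, codebook):
    """Histogram of visual words for one image.

    descriptors: array of shape (n_keypoints, d)
    codebook:    array of shape (n_words, d)
    Returns a normalized histogram of length n_words.
    """
    n_words = codebook.shape[0]
    if descriptors.size == 0:  # image without any key point
        return np.zeros(n_words)
    # squared Euclidean distance between every descriptor and every word
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=n_words).astype(np.float64)
    return hist / hist.sum()
```

Whatever the number of key points detected, every image is represented by a vector of the same length, which is what the reduction and clustering steps require.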

Reduction & Clustering

Dimensionality reduction is done with t-SNE, and clustering with k-means.

We then draw a projection of the products showing both the real and the calculated categories.
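
The clustering step can be illustrated with a short NumPy version of k-means (Lloyd's algorithm). This is only a sketch under assumptions: `points` stands for the features after reduction, the function name and parameters are hypothetical, t-SNE itself is not reproduced here, and a library implementation would normally be used for both steps.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means: returns (labels, centers) for points of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # initialize the centers with k distinct points drawn at random
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(np.float64)
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its members
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The result depends on the random initialization, so k-means is usually run several times and the best run is kept.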

Evaluation

To evaluate the accuracy of the method and compare the feature extraction algorithms tested, the ARI (Adjusted Rand Index) is calculated. It measures the similarity between the calculated categories and the real ones: it equals 1 when the two partitions are identical and is close to 0 for a random assignment. Computation time, another important factor, is also considered.
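
For reference, the ARI can be computed from the contingency table between real and calculated categories. The sketch below uses only the standard library; the function name is hypothetical, and it is meant to show how the index is built rather than to replace a library implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # pairs of items grouped together in both labelings, in the real
    # categories only, and in the calculated clusters only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs  # value expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

The index does not depend on the cluster numbering: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` returns 1.0, because both labelings describe the same partition.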

Text

| Algorithm | ARI | Computation time |
| --- | --- | --- |
| `CountVectorizer` | 0.49 | 19 s |
| `TF-IDF` | 0.50 | 18 s |
| `Word2Vec` | 0.41 | 15 s |
| `BERT` | 0.32 | 2 min 30 s |
| `USE` | 0.63 | 10 s |

USE has both the highest ARI (0.63) and the shortest computation time (10 s), so it appears to be the best method for the NLP part of our problem.

Image

| Algorithm | ARI | Computation time |
| --- | --- | --- |
| `SIFT` | 0.04 | 10 min 45 s |
| `ORB` | 0.03 | 1 min 50 s |
| `VGG16` | 0.45 | 4 min |

VGG16 is by far the most accurate algorithm for the DIP part, with an ARI of 0.45 against 0.04 for SIFT and 0.03 for ORB.

Text & Image

| Algorithm | ARI | Computation time |
| --- | --- | --- |
| `USE + VGG16` | 0.65 | 4 min 20 s |

Conclusion

Combining the image and text classification models makes it possible to reach an ARI of 0.65. The feasibility of the classification engine is therefore demonstrated.

Considering both performance and computation time, using text only is a reasonable option: USE alone reaches an ARI of 0.63 in 10 s, against 0.65 in 4 min 20 s for the combination.
