
MILDNet

This repo contains the training code used during visual similarity research at Fynd. You can easily reproduce our state-of-the-art models MILDNet and RankNet, as well as 2 other research works from the past. 25 configs are included, covering our most critical experiments.

For more details, refer to the Colab Notebook (execute training on free GPUs or TPUs provided by Google in just 2 clicks) or head to our research papers on arXiv.

We have also open-sourced 8 of our top experiment results, with weights, here. To analyze and compare all the results (training), head to this Colab notebook. To get an idea of how to use any of our open-sourced models to find the n most similar items (inference) across your entire dataset, head to this Colab notebook.

Introduction

Visual recommendation is a crucial feature for any e-commerce platform. It gives the platform the power to instantly suggest products similar to the one a user is browsing, capturing their immediate intent, which can result in higher customer engagement (CTR) and hence conversion.

The task of identifying similar products is not trivial: the relevant details (pattern, structure, etc.) are intricately embedded in the product image pixels, and products come in a wide variety even within the same class. CNNs have shown great understanding and strong results in this task.

The backbone of such a system is a CNN that extracts key features from product images and returns a vector representing those features. When these embeddings for all products are mapped into an n-dimensional space, similar products end up closer together than dissimilar ones. The nearest neighbours of a product are then its top visually similar items. The diagram below gives a brief overview:
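To make the lookup step concrete, here is a minimal, illustrative nearest-neighbour search over precomputed embeddings using plain NumPy. The `top_k_similar` helper and the toy vectors are hypothetical; the actual inference notebook uses Annoy for fast approximate search at scale.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=3):
    """Return indices of the k items nearest to the query embedding.

    embeddings: (n_items, dim) array of CNN feature vectors.
    Uses exact Euclidean distance; for large catalogs an approximate
    index such as Annoy is the practical choice.
    """
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    order = np.argsort(dists)
    return [int(i) for i in order if i != query_idx][:k]

# Toy example: items 0 and 1 are close, item 2 is far away.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(top_k_similar(emb, 0, k=2))  # [1, 2]
```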

Repo Overview

  • execute.py: Execute this to run training locally or on Google Cloud ML Engine.
  • MILDNet_on_Colab.ipynb: Google Colaboratory notebook that describes the task and contains training, exploration and inference code.
  • requirements-local-cpu.txt / requirements-local-gpu.txt: Requirements files, needed only when running locally.
  • settings.cfg: Global configs to set up:
    • MILDNET_JOB_DIR (mandatory): Directory path to store training outputs. Pass either a local directory path or a Google Cloud Storage path (gs://.....).
    • MILDNET_REGION (optional): Needed only when running on ML Engine (e.g. us-east1).
    • MILDNET_DATA_PATH (mandatory): Path where the training data is stored. Change only when using custom data.
    • HYPERDASH_KEY: Hyperdash is a handy tool to log system output and track training metrics. You can monitor all running jobs using their Android app or web page.
  • job_configs: Contains 25 configs defining the basic job settings, such as the model architecture, loss function, optimizer, number of epochs, learning rate, etc.
  • trainer: Contains all the scripts needed for training.
  • training_configs: Training-related environment configs, needed only when running on ML Engine. Defines the cluster type, GPU type, etc.

Job Configs

We carried out various experiments to study the performance of 4 research works (including ours). 8 of those variants can be readily tested via this notebook:

  • Multiscale-Alexnet: Multiscale model with AlexNet as the base convnet plus 2 shallow networks. We couldn't find a good AlexNet implementation on TensorFlow, so we used Theano to train this network.
  • Visnet: Multiscale model with VGG16 as the base convnet plus 2 shallow networks, without the LRN2D layer from Caffe.
  • Visnet-LRN2D: Multiscale model with VGG16 as the base convnet plus 2 shallow networks, with the LRN2D layer from Caffe.
  • RankNet: Multiscale model with VGG19 as the base convnet plus 2 shallow networks. Hinge loss is used here.
  • MILDNet: Single VGG16 architecture with 4 skip connections.
  • MILDNet-Contrastive: Single VGG16 architecture with 4 skip connections; uses contrastive loss.
  • MILDNet-512-No-Dropout: Single VGG16 architecture with 4 skip connections; dropouts are not used after feature concatenation.
  • MILDNet-MobileNet: Single MobileNet architecture with 4 skip connections.
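To illustrate the single-trunk-with-skips idea, here is a hedged Keras sketch of a VGG16 with 4 skip connections. The choice of skip points (the pool outputs of blocks 2-5), the pooling operation, and the embedding size are assumptions for illustration, not the exact configuration from the paper or the repo's `trainer` code.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Concatenate, Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

def build_mildnet_sketch(embedding_dim=1024):
    # Single VGG16 trunk; weights=None keeps the sketch self-contained.
    vgg = VGG16(include_top=False, input_shape=(224, 224, 3), weights=None)
    # Assumed skip points: pool outputs of blocks 2-5 (standard Keras layer names).
    skips = ["block2_pool", "block3_pool", "block4_pool", "block5_pool"]
    pooled = [GlobalAveragePooling2D()(vgg.get_layer(name).output) for name in skips]
    merged = Concatenate()(pooled)            # 128 + 256 + 512 + 512 = 1408 features
    embedding = Dense(embedding_dim)(merged)  # final embedding used for similarity
    return Model(vgg.input, embedding)
```

The key point is that features from multiple depths feed the final embedding, so both fine-grained pattern details and higher-level structure contribute to similarity.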

Based on these experiments, below is the list of all the configs available to try out:

  • Default Models:
    • alexnet.cnf
    • ranknet.cnf
    • vanila_vgg16.cnf
    • visnet.cnf
    • mildnet.cnf
    • visnet-lrn2d.cnf
  • Mildnet Ablation Study
    • mildnet_skip_3.cnf
    • mildnet_skip_2.cnf
    • mildnet_skip_4.cnf
    • mildnet_skip_1.cnf
  • Mildnet Low Features
    • mildnet_512_512.cnf
    • mildnet_1024_512.cnf
    • mildnet_512_no_dropout.cnf
  • Mildnet Other Losses
    • mildnet_hinge_new.cnf
    • mildnet_angular_2.cnf
    • mildnet_contrastive.cnf
    • mildnet_lossless.cnf
    • mildnet_angular_1.cnf
  • Mildnet Other Variants
    • mildnet_without_skip_big.cnf
    • mildnet_vgg19.cnf
    • mildnet_vgg16_big.cnf
    • mildnet_without_skip.cnf
    • mildnet_mobilenet.cnf
    • mildnet_all_trainable.cnf
    • mildnet_cropped.cnf

Note that mildnet_contrastive.cnf and the Default Models configs are the models compared in the research paper.
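Several of these configs differ only in the loss function. As a reference point, a hinge-style triplet loss (the kind used by the RankNet config) can be sketched in plain NumPy; this is illustrative, not the repo's exact Keras implementation.

```python
import numpy as np

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors (illustrative).

    Penalizes triplets where the positive is not at least `margin`
    closer to the anchor (in squared distance) than the negative.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # similar item: close to the anchor
n = np.array([2.0, 0.0])   # dissimilar item: far from the anchor
print(triplet_hinge_loss(a, p, n))  # 0.0, the triplet already satisfies the margin
```

Training on such triplets is what pulls similar items together and pushes dissimilar ones apart in the embedding space.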

Training

  • execute.py: Single entry point for running a training job locally or on Google Cloud ML Engine. It asks whether to run the training locally or on ML Engine, then prompts you to select a config from a list. Finally, the script executes gcloud.local.run.keras.sh for local runs, or gcloud.remote.run.keras.sh for runs on Google Cloud ML Engine. Make sure to set up settings.cfg if running on ML Engine.

  • MILDNet_on_Colab.ipynb: Google Colaboratory notebook that gives a brief introduction to the task. You can also execute training in just 2 clicks:
    1. Open the notebook on Google Colab.
    2. From the menu, select Runtime -> Run all.

Installation (only when running locally)

Make sure gsutil is installed and that you are logged in:

curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

Install requirements:

  • Running training on CPU: pip install -r requirements-local-cpu.txt
  • Running training on GPU: pip install -r requirements-local-gpu.txt

Set below configs in settings.cfg:

  • MILDNET_JOB_DIR=gs://....
  • MILDNET_REGION=us-east1
  • MILDNET_DATA_PATH=gs://fynd-open-source/research/MILDNet/
  • HYPERDASH_KEY=your_hyperdash_key

Run Training on Custom Job Config

  • Add a config for the job to be trained to the job_configs folder

  • Run python execute.py

View Logs from ML Engine

  • Stream logs on terminal using:

    gcloud ml-engine jobs stream-logs {{job_name}}
      
  • Check tensorboard of ongoing training using:

    tensorboard --logdir=gs://fynd-open-source/research/MILDNet/top_jobs/{{job_name}} --port=8080
      
  • Hyperdash: Use either the Hyperdash website or the Android/iOS app to monitor logs.

Contributing

Please read CONTRIBUTING.md and CONDUCT.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

Please see HISTORY.md. For the versions available, see the tags on this repository.

Authors

License

This project is licensed under the Apache License; see the LICENSE.txt file for details.

Acknowledgments

  • All the inspiring research works cited in our research paper.
  • Google Colaboratory, for providing free GPUs/TPUs, which helped us a lot with experimentation and reporting.
  • Annoy: An easy-to-use and fast approximate nearest neighbour library.
  • convnets-keras: A GitHub repo containing an AlexNet implementation in Keras, which helped us evaluate AlexNet-based models as presented in this research work.
  • image-similarity-deep-ranking: A GitHub repo that helped us use triplet data in Keras.
  • Hyperdash: A free monitoring tool for ML tasks.


Issues

MILDNet MobileNet: possible typo

Hello,

First of all, thanks for the work you've put into this repo!! It's great.

I wonder if this line is correct or if there is a typo.
Original Line 127: convnet_output = GlobalAveragePooling2D()(vgg_model.output)

What I think it should be:
Proposed Line 127: convnet_output = GlobalAveragePooling2D()(convnet_output)

I think convnet_output is overwritten by the original line, and therefore the previous for-loop concatenation is "lost".

Thank you !!

Why did you use count//batch_size*batch_size?

In the trainer.datagen.Iterator class constructor, you counted the lines in the CSV file and then set the size of the dataset this way:

count = count//batch_size*batch_size

This way you make sure that count is divisible by batch_size, but is it necessary?
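For illustration (hypothetical numbers), the expression simply floors the sample count to the nearest multiple of the batch size, so every batch the iterator yields is full-sized and the few leftover samples are dropped. This matters mainly when a fixed steps_per_epoch, or a loss computed over triplets within a batch, assumes full batches:

```python
count, batch_size = 1003, 32
usable = count // batch_size * batch_size  # floor to a multiple of batch_size
print(usable)               # 992: 31 full batches; the last 11 samples are dropped
print(usable % batch_size)  # 0: every yielded batch is full-sized
```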

Google colab notebook failing. Please find the logs.

WARNING:
Cannot import tensorflow under path /usr/local/bin/python. Using "chief" for cluster setting.
If this is not intended, Please check if tensorflow is installed. Please also
verify if the python path used is correct. If not, to change the python path:
use gcloud config set ml_engine/local_python $python_path
Eg: gcloud config set ml_engine/local_python /usr/bin/python3
INFO:root:Downloading Training Image from path gs://fynd-open-source/research/MILDNet
INFO:root:Building Model: ranknet
2020-07-04 21:29:30.991683: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-04 21:29:31.091798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-04 21:29:31.092072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
totalMemory: 15.90GiB freeMemory: 542.88MiB
2020-07-04 21:29:31.092099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2020-07-04 21:29:31.544583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-04 21:29:31.544639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2020-07-04 21:29:31.544651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2020-07-04 21:29:31.544782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 253 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO:root:Total params: 68,308,224
INFO:root:Trainable params: 68,308,224
INFO:root:Non-trainable params: 0
2020-07-04 21:30:03.416273: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 1.13554 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 7 meaning 'Couldn't connect to server', error details: Failed to connect to metadata port 80: Unknown error 110
2020-07-04 21:30:36.183920: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 1.7801 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 7 meaning 'Couldn't connect to server', error details: Failed to connect to metadata port 80: Unknown error 110
2020-07-04 21:31:09.976024: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 2.38894 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 7 meaning 'Couldn't connect to server', error details: Failed to connect to metadata port 80: Unknown error 110
2020-07-04 21:31:44.279964: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 4.59586 seconds (attempt 4 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 7 meaning 'Couldn't connect to server', error details: Failed to connect to metadata port 80: Unknown error 110
2020-07-04 21:32:20.631952: I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 8.19454 seconds (attempt 5 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 7 meaning 'Couldn't connect to server', error details: Failed to connect to metadata port 80: Unknown error 110

Colab problem with metadata

Hello, I am having a problem running a training step in the Colab, and the same problem occurs even when I run your code locally with the Google Cloud SDK authenticated. I am not using a Hyperdash token.

Do you know what the problem could be?

2019-07-13 09:30:43.978054:
E tensorflow/core/platform/cloud/curl_http_request.cc:596] The transmission of request 0x564e8ccb2900 (URI: http://metadata/computeMetadata/v1/instance/service-accounts/default/token) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.006321 (No error), connect time: 0 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)

2019-07-13 09:30:43.978267:
I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 1.23757 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 42 meaning 'Operation was aborted by an application callback', error details: Callback aborted

CommandException: No URLs matched: /tops.zip

CommandException: No URLs matched: /tops.zip
unzip: cannot find or open dataset/tops.zip, dataset/tops.zip.zip or dataset/tops.zip.ZIP.
The link for downloading the dataset is not working.

RuntimeError: Failed. Model function mildnet_vgg16 not found

I tried to run your model with !bash gcloud.local.run.keras.sh $MILDNET_CONFIG in my Jupyter notebook, but it fails with RuntimeError: Failed. Model function mildnet_vgg16 not found.

Please kindly help to solve this problem.

Best Regards,

More details:
job_configs/mildnet.cnf
Python 2.7.16
TF 2.3.0

In task.py at lines 8-9, I had to comment out:
#from tensorflow import set_random_seed
#set_random_seed(2)

and change them to:
import tensorflow
tensorflow.random.set_seed(2)

and add @tf.function before def main.

Google Colab Notebook Run Failing

The Google Colab with no modifications currently fails.

On the command: !bash gcloud.local.run.keras.sh $MILDNET_CONFIG

This is the output:

job_configs/mildnet.cnf
WARNING: The `gcloud ml-engine` commands have been renamed and will soon be removed. Please use `gcloud ai-platform` instead.
WARNING: Unexpected tensorflow version Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: 'module' object has no attribute '__version__'
, using the default primary node name, aka "master" for cluster settings
<subprocess.Popen object at 0x7feb32f6f990>

The training job itself is never kicked off, and the subsequent commands naturally fail. Any help diagnosing this problem would be greatly appreciated.
