Finding similar images

Introduction

  • For a Data Science project, my team had to collect images from several search engines to build a dataset. We chose Google Sheets for assigning labeling tasks to each member because it is convenient.

  • Crawling the Internet returns many similar images, which biases the dataset. This is my solution for filtering similar images during the Data Preparation step.

Implementation

  1. Get image URLs from search engines. I have a repo for that here.

  2. Copy and paste these URLs into Google Sheets. Here we can see how similar images are arranged next to each other.

  3. Connect to Google Sheets using Python

  4. With only 1 hash value, some images are reported as the same even though they are different. Therefore, we decided to calculate 3 hash values for each pair of images:

    • Average hashing (ahash)
    • Perceptual hashing (phash)
    • Difference hashing (dhash)

  5. If at least 2 of the 3 distances say that the two images are similar (distance ≤ different points), arrange these images next to each other

    distances = [ahash0 - ahash1, phash0 - phash1, dhash0 - dhash1]
    diff_results = sum(dist < args['diff'] for dist in distances)
    
    if diff_results >= 2:
        print(f'|--Similar with url {idx1 + 1}: {url1}')

  6. Decide which images to keep and begin labeling
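
The voting and arranging steps above can be sketched in plain Python. This is a minimal illustration, not the repo's actual code: it assumes each hash is available as an integer and that the `-` in the snippet above yields a Hamming distance (as it does for hash objects in the ImageHash library). The names `hamming`, `is_similar`, and `arrange_similar`, and the sample hash values, are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical container: (ahash, phash, dhash) stored as integers.
Hashes = Tuple[int, int, int]


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def is_similar(h0: Hashes, h1: Hashes, diff: int, votes: int = 2) -> bool:
    """True when at least `votes` of the three distances fall below `diff`."""
    distances = [hamming(x, y) for x, y in zip(h0, h1)]
    return sum(dist < diff for dist in distances) >= votes


def arrange_similar(urls: Sequence[str], hashes: Sequence[Hashes], diff: int) -> List[str]:
    """Reorder urls so each image is followed directly by the images similar to it."""
    ordered: List[str] = []
    placed = [False] * len(urls)
    for i in range(len(urls)):
        if placed[i]:
            continue  # already emitted next to an earlier image
        placed[i] = True
        ordered.append(urls[i])
        for j in range(i + 1, len(urls)):
            if not placed[j] and is_similar(hashes[i], hashes[j], diff):
                placed[j] = True
                ordered.append(urls[j])
    return ordered


# Made-up 8-bit hashes, for illustration only.
urls = ["a.jpg", "b.jpg", "c.jpg"]
hashes = [
    (0b11110000, 0b10101010, 0b11001100),
    (0b00001111, 0b01010101, 0b00110011),  # far from a.jpg on all three hashes
    (0b11110001, 0b10101010, 0b00110011),  # close to a.jpg on ahash and phash only
]
print(arrange_similar(urls, hashes, diff=3))  # ['a.jpg', 'c.jpg', 'b.jpg']
```

Requiring 2 agreeing hashes out of 3 means a single coincidental match on one hash is not enough to flag a pair, which is the motivation given in step 4.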

Usage

  1. Install libraries: pip install -r requirements.txt

  2. Sort similar images in Google Sheets:

  • Example: python sort_similar.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json
usage: sort_similar.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH [-d DIFF]

optional arguments:
-h, --help                                    show this help message and exit
-s SPREADSHEET, --spreadsheet SPREADSHEET     spreadsheet name
-w WORKSHEET, --worksheet WORKSHEET           worksheet name
-r RANGE, --range RANGE                       updated range
-a AUTH, --auth AUTH                          credentials file
-d DIFF, --diff DIFF                          different points

  3. Download images from URLs in Google Sheets:

  • Example: python download_images.py -s "example" -w "Sheet1" -r "B2:C" -a credentials.json -o images/
usage: download_images.py [-h] -s SPREADSHEET -w WORKSHEET -r RANGE -a AUTH -o OUT

optional arguments:
-h, --help                                    show this help message and exit
-s SPREADSHEET, --spreadsheet SPREADSHEET     spreadsheet name
-w WORKSHEET, --worksheet WORKSHEET           worksheet name
-r RANGE, --range RANGE                       updated range
-a AUTH, --auth AUTH                          credentials file
-o OUT, --out OUT                             path to images directory
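
For readers who want to see how a command line like the ones above could be defined, here is a minimal `argparse` sketch that reproduces the `sort_similar.py` options from the help text. It is an assumption about the structure, not the script's actual source. In particular, the help output does not state a default for `-d`, so the value `10` below is only a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the sort_similar.py options as listed in the help text."""
    parser = argparse.ArgumentParser(prog="sort_similar.py")
    parser.add_argument("-s", "--spreadsheet", required=True, help="spreadsheet name")
    parser.add_argument("-w", "--worksheet", required=True, help="worksheet name")
    parser.add_argument("-r", "--range", required=True, help="updated range")
    parser.add_argument("-a", "--auth", required=True, help="credentials file")
    # Placeholder default: the real script's default is not documented here.
    parser.add_argument("-d", "--diff", type=int, default=10, help="different points")
    return parser


# vars() turns the namespace into a dict, matching the args['diff'] lookup
# used in the comparison snippet earlier in this README.
args = vars(build_parser().parse_args(
    ["-s", "example", "-w", "Sheet1", "-r", "B2:C", "-a", "credentials.json"]
))
print(args["spreadsheet"], args["worksheet"], args["range"], args["diff"])
```

`download_images.py` would follow the same pattern, with a required `-o/--out` option in place of `-d/--diff`.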
