
sdhash's Introduction

SDHash

A Python library for computing hashes of images which ignore perceptual differences.

Usage

import sdhash
from PIL import Image

i1 = Image.open('test1.png')
i2 = Image.open('test1_noise.png')
i3 = Image.open('test2.png')

h = sdhash.Hash()

h.test_duplicate(i1, i2) # True
h.test_duplicate(i1, i3) # False
h.hash_image(i1) # [ an md5 output ]

Background

As humans, we find it very easy to spot that two images are "the same". Unfortunately, the same can't be said of computers. A simple approach such as comparing two images pixel by pixel fails in all but the simplest of cases. Even a more sophisticated method, such as computing an L1 or L2 mean distance between the images and taking a small value to imply similarity, is not much better.

For example, given an original image A, the following should produce equivalent images:

  • Scaling with the same aspect ratio.
  • Scaling with a very similar aspect ratio.
  • Adding high frequency noise.
  • Small blurring / high pass filtering.
  • Lossy compression and reconstruction.

The previous set of transformations can be said to be natural, that is, they will almost certainly occur in any system as images get moved around, edited, etc. However, there is an even bigger class of transformations which are adversarial, that is, somebody is trying to trick the system into believing an image is original when it is in fact not. User-generated content sites face this problem, regardless of the mode of the content (text, image, audio, video).

The following adversarial transformations should produce equivalent images as well:

  • Removing a small area from the borders.
  • 90, 180, 270 degrees rotation.
  • Horizontal or vertical flipping.
  • Adding or removing a watermark.
  • Alteration of color planes, but not of the luminance one.
  • Grayscale conversion.
  • Large-scale editing restricted to a small area (replacing some text with other text, for example).

SDHash tries to solve the problem of whether two images are identical or not, modulo all the transformations in the first group, and some (border removal and color plane alterations) from the second.

The API it exposes is simple. The test_duplicate method receives two PIL images as input and returns either True or False, depending on whether it considers the images equivalent. The hash_image method returns a base64-encoded MD5 hash of the "stable" image contents. The test_duplicate method is essentially a test of whether the hashes of its arguments are equal. For more advanced usage, hash_image is the tool of choice.

For example, a database table of the hashes can be used, with the result of hash_image as a primary key. Whenever a new image needs to be added, it can be checked first against the table and inserted only if it is not already there. This allows O(1) comparisons to be performed for each insertion, instead of O(n). Similarly, a MapReduce job can compute the hashes of images in the map stage, which will result in all identical images being grouped together under the same key in the reduce stage. This allows an O(n) algorithm for deduplicating a large dataset, as sketched below.
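
An illustrative sketch of the table-based approach, using only the API shown above (the file names are stand-ins, and the dict stands in for a real database table keyed by the hash):

import sdhash
from PIL import Image

h = sdhash.Hash()
seen = {}  # hash -> first path seen; stands in for a database table keyed by hash

for path in ['test1.png', 'test1_noise.png', 'test2.png']:
    key = h.hash_image(Image.open(path))
    if key in seen:
        print('%s duplicates %s' % (path, seen[key]))
    else:
        seen[key] = path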

As a bonus, SDHash works with GIF animations. It treats them as a sequence of frames, of which only the first, fifth, tenth, etc. are considered. The same basic approach is used, but all the sampled frames are hashed together.
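
For illustration, this is roughly how such frame sampling looks with PIL; the exact stride is internal to SDHash, so treat the every-fifth sampling below as an assumption:

from PIL import Image, ImageSequence

gif = Image.open('animation.gif')
# Keep roughly every fifth frame (the first, fifth, tenth, ...); copy each
# frame, since the iterator reuses its buffer.
frames = [frame.copy()
          for index, frame in enumerate(ImageSequence.Iterator(gif))
          if index % 5 == 0]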

Algorithm

The core algorithm is straightforward:

  • The input image is converted to a single plane of high-precision grayscale.
  • This plane is resized to a standard width, while maintaining the aspect ratio.
  • The DCT is computed on the plane.
  • An MD5 hash is computed from a stream which starts with the width and height of the image, and continues with the most significant (top-left) DCT coefficients, in row-major order, clamped to an interval and precision.
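
A minimal sketch of these four steps follows. The parameter names and values (width, coeffs, clamp, buckets) are hypothetical stand-ins for the hasher's actual constructor arguments:

import hashlib

import numpy as np
from PIL import Image
from scipy.fftpack import dct

def sketch_hash(image, width=128, coeffs=8, clamp=1024, buckets=128):
    # 1. Convert to a single plane of high-precision (32-bit float) grayscale.
    plane = np.asarray(image.convert('F'))
    # 2. Resize to a standard width, maintaining the aspect ratio.
    height = max(1, round(plane.shape[0] * width / plane.shape[1]))
    plane = np.asarray(Image.fromarray(plane).resize((width, height)))
    # 3. Compute the 2-D DCT of the plane.
    freq = dct(dct(plane, axis=0, norm='ortho'), axis=1, norm='ortho')
    # 4. Hash the dimensions, then the top-left DCT coefficients in
    #    row-major order, clamped to an interval and quantized into buckets.
    hasher = hashlib.md5()
    hasher.update(('%dx%d' % (width, height)).encode('ascii'))
    block = np.clip(freq[:coeffs, :coeffs], -clamp, clamp - 1)
    quantized = np.floor((block + clamp) * buckets / (2 * clamp)).astype(np.int64)
    hasher.update(quantized.tobytes())
    return hasher.hexdigest()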

All details of the process can be controlled via arguments to the hasher object constructor, although their effect is somewhat esoteric. Good defaults have been provided.

Installation

The main dependencies are the Python Imaging Library (PIL) and NumPy/SciPy. As such, depending on SDHash brings along a lot of baggage. Ideally, since it depends on only a very small piece of that functionality, the library would include the DCT code directly. Feel free to contribute a well-tested implementation.

Installation is simple, via pip:

pip install sdhash


sdhash's Issues

Add mirroring and rotation invariance

Mirroring (both horizontal and vertical) and rotation invariance (by 90 and 270 degrees) seem like an easy win, and they're a pretty important adversarial setup.

The basic approach would be to compute all the transformations of the original image, compute a hash from each one like we do now for the original, and then combine them in such a way as to be indifferent to order (lexicographic sorting or something like that). That would mean that an image and its mirror would have the same representation, as in the sketch below.
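
A hypothetical sketch of that combination, where hash_one stands in for something like sdhash.Hash().hash_image:

import hashlib

from PIL import ImageOps

def orientation_invariant_hash(image, hash_one):
    # The extra variants cover 90/180/270-degree rotations and
    # horizontal/vertical mirroring.
    variants = [
        image,
        image.rotate(90, expand=True),
        image.rotate(180),
        image.rotate(270, expand=True),
        ImageOps.mirror(image),  # horizontal flip
        ImageOps.flip(image),    # vertical flip
    ]
    # Sorting the per-variant hashes makes the combination indifferent to
    # order, so an image and its mirror get the same final hash.
    parts = sorted(hash_one(variant) for variant in variants)
    return hashlib.md5(''.join(parts).encode('ascii')).hexdigest()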

encoding error

Running the sample code under a newer Python version (3.7), we get:

  File ".../lib/python3.7/site-packages/sdhash/__init__.py", line 90, in _hash_image
    hasher.update('IMAGE')
TypeError: Unicode-objects must be encoded before hashing

I tried to pass a base64-encoded string instead of a PIL image, but then I get other errors...
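
The traceback comes from Python 3's hashlib, whose update method only accepts bytes; presumably the string literals inside sdhash need to be encoded, along the lines of:

import hashlib

hasher = hashlib.md5()
# Python 3 hashers reject str; encode literals before updating.
hasher.update('IMAGE'.encode('utf-8'))  # or, equivalently: hasher.update(b'IMAGE')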

Implement adaptive DCT coefficient quantization

Currently, DCT coefficients are quantized by clamping them to [-1024, 1023] and dividing them into 128 buckets. This is a uniform quantization scheme, applied identically to all coefficients.

However, not all coefficients are equally important. We should have coarser-grained quantization farther away from [0, 0]. Basically, implement the JPEG quantization procedure.

This would also require switching from specifying a bucketing to specifying a quality, and using one of several quantization masks.

Perceptually similar images end up being non-similar under the standard quantization setup, and while I can hand-tune a coarser bucketing, that would skyrocket the false-positive rate (FPR) to 0.2% (the lower bound).
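
A rough sketch of what nonuniform quantization could look like; the distance-based step below is illustrative only, not the actual JPEG quantization tables:

import numpy as np

def nonuniform_quantize(coeffs, clamp=1024, quality=50):
    # Step size grows with distance from [0, 0], so low-frequency
    # coefficients keep more precision than high-frequency ones. A real
    # implementation would select a JPEG-style quantization mask based on
    # the requested quality.
    rows, cols = np.indices(coeffs.shape)
    step = 1.0 + (rows + cols) * (100.0 - quality) / 25.0
    clamped = np.clip(coeffs, -clamp, clamp - 1)
    return np.round(clamped / step).astype(np.int64)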
