nsfw_data_scrapper's Introduction

NSFW Data Scrapper

Description

This is a set of scripts that automatically collects tens of thousands of images for the following (loosely defined) categories, to be used later for training an image classifier:

  • porn - pornography images
  • hentai - hentai images, but also includes pornographic drawings
  • sexy - sexually explicit images, but not pornography. Think nude photos, Playboy, bikini, beach volleyball, etc.
  • neutral - safe for work neutral images of everyday things and people
  • drawings - safe for work drawings (including anime)

Here is what each script (located under the scripts directory) does:

  • 1_get_urls.sh - iterates through the text files under scripts/source_urls and downloads the URLs of images for each of the 5 categories above. The Ripme application does all the heavy lifting. The source URLs are mostly links to various subreddits, but can point to any website that Ripme supports. Note: I already ran this script for you, and its outputs are located in the raw_data directory. There is no need to rerun it unless you edit the files under scripts/source_urls.
  • 2_download_from_urls.sh - downloads the actual images for the URLs found in the text files in the raw_data directory
  • 3_optional_download_drawings.sh - (optional) script that downloads SFW anime images from the Danbooru2018 database
  • 4_optional_download_neutral.sh - (optional) script that downloads SFW neutral images from the Caltech256 dataset
  • 5_create_train.sh - creates the data/train directory and copies all *.jpg and *.jpeg files into it from raw_data. Also removes corrupted images
  • 6_create_test.sh - creates the data/test directory and moves N=2000 random files for each class from data/train to data/test (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each run moves another N images for each class from data/train to data/test.
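
The train/test split described for 6_create_test.sh can be sketched roughly as follows. This is a minimal illustration and not the script's actual contents: the function name move_random_to_test and its arguments are hypothetical, and it assumes the GNU versions of find, shuf, xargs and mv.

```shell
# Sketch only: move up to N random files per class from a train directory
# to a test directory. The function name and arguments are hypothetical and
# are not taken from 6_create_test.sh.
move_random_to_test() {
    local train_dir="$1"
    local test_dir="$2"
    local n="$3"
    local class_dir class
    for class_dir in "$train_dir"/*/; do
        # Skip the unexpanded glob when train_dir has no subdirectories
        [ -d "$class_dir" ] || continue
        class="$(basename "$class_dir")"
        mkdir -p "$test_dir/$class"
        # NUL-separated names keep files with spaces intact;
        # shuf -n picks at most N of them at random
        find "$class_dir" -maxdepth 1 -type f -print0 \
            | shuf -z -n "$n" \
            | xargs -0 -r mv -t "$test_dir/$class"
    done
}
```

Calling move_random_to_test ../data/train ../data/test 2000 a second time would move another 2000 files per class, which matches the "run it multiple times" behavior described above.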

Prerequisites

  • Python 3 environment: conda env create -f environment.yml
  • Java runtime environment:
    • Debian and Ubuntu: sudo apt-get install default-jre
  • Linux command line tools: wget, convert (imagemagick suite of tools), rsync, shuf
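
Before running the scripts, it may help to confirm that the tools listed above are on your PATH. The helper below is a small sketch; check_tools is a hypothetical name and is not part of this repository.

```shell
# Sketch: report which of the given command-line tools are missing from PATH.
# check_tools is a hypothetical helper, not one of the repository's scripts.
check_tools() {
    local missing=""
    local tool
    for tool in "$@"; do
        # command -v succeeds only if the tool can be resolved
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing tools:$missing"
        return 1
    fi
    echo "All tools found"
}
```

For the prerequisites listed here, a call could look like: check_tools java wget convert rsync shuf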

For Windows users

  • option 1: download a Linux distro from the Windows 10 store and run the scripts there

  • option 2:

    • download and install Git from here. Git also installs Bash on your PC
    • download and install wget from here and add it to PATH
    • run the scripts

How to run

Change the working directory to scripts and execute each script in the sequence indicated by the number in its file name, e.g.:

$ bash 1_get_urls.sh # has already been run
$ find ../raw_data -name "urls_*.txt" -exec sh -c "echo Number of URLs in {}: ; cat {} | wc -l" \;
Number of URLs in ../raw_data/drawings/urls_drawings.txt:
   25732
Number of URLs in ../raw_data/hentai/urls_hentai.txt:
   45228
Number of URLs in ../raw_data/neutral/urls_neutral.txt:
   20960
Number of URLs in ../raw_data/sexy/urls_sexy.txt:
   19554
Number of URLs in ../raw_data/porn/urls_porn.txt:
  116521
$ bash 2_download_from_urls.sh
$ bash 3_optional_download_drawings.sh # optional
$ bash 4_optional_download_neutral.sh # optional
$ bash 5_create_train.sh
$ bash 6_create_test.sh
$ cd ../data
$ ls train
drawings hentai neutral porn sexy
$ ls test
drawings hentai neutral porn sexy
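
To check how many images ended up in each class after the split, a small helper along these lines could be used. It is a sketch under assumptions: count_per_class is a hypothetical name, and it counts every regular file under each class directory.

```shell
# Sketch: print the number of files in each class directory of a split.
# count_per_class is a hypothetical helper, not one of the repository's scripts.
count_per_class() {
    local class_dir
    for class_dir in "$1"/*/; do
        # Skip the unexpanded glob when the directory has no subdirectories
        [ -d "$class_dir" ] || continue
        # tr strips the padding that some wc implementations add
        printf '%s: %s\n' "$(basename "$class_dir")" \
            "$(find "$class_dir" -type f | wc -l | tr -d ' ')"
    done
}
```

From the data directory, running count_per_class train and count_per_class test prints one line per class for each split.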

I was able to train a CNN classifier to 91% accuracy with the following confusion matrix: [figure: confusion matrix]

As expected, anime and hentai are confused with each other more frequently than with other classes.

The same is true of the porn and sexy categories.

Note: the anime category was later renamed to drawings.

Contributors

  • alexkim-gh
  • parmusingh
