fastclass's Introduction

FastClass

A little set of tools to batch download images and weed through, delete and classify them into groups for building deep learning image datasets.

I wrote up a small blog post on my site www.christianwerner.net.

Installation

pip install git+https://github.com/cwerner/fastclass.git#egg=fastclass

The installer will also place the executables fcc and fcd in your $PATH.

The package currently contains the following tools:

Download images

Use fcd to crawl search engines (Google, Bing, Baidu, Flickr) and pull all images for a defined set of queries. In addition, files are renamed, scaled and checked for duplicates.

You provide queries and terms that should be excluded when naming the category folders. There is an example (guitars.csv) provided in the repository.

Usage

Call the script from the command line. If you call it without input parameters, it shows the help page.

Usage: fcd [OPTIONS] INFILE

Options:
  -c, --crawler [ALL|GOOGLE|BING|BAIDU|FLICKR]
                                  selection of crawler (multiple invocations
                                  supported)  [default: ALL] (Note: BAIDU and FLICKR are not included in ALL option)
  -k, --keep                      keep original results of crawlers  [default:
                                  False]
  -m, --maxnum                    maximum number of images per crawler [default: 1000]
  -s, --size INTEGER              image size for rescaling  [default: 299]
  -o, --outpath TEXT              name of output directory  [default: dataset]
  -h, --help                      Show this message and exit.

  ::: FastClass fcd :::

  ...an easy way to crawl the net for images when building a dataset for
  deep learning.

  Example: fcd -c GOOGLE -c BING -s 224 example/guitars.csv

If you specify the -k, --keep flag, a second folder called outpath.raw is created that contains the original, unscaled images.

Search file format

The CSV file currently requires two columns separated by a comma (,). Each row defines an image class you want to download (see the guitars.csv file in the example folder). The first row contains a header, which is skipped.

Column 1 contains the search terms. You can specify multiple search terms by separating them with spaces. To require a search term, enclose it in quotation marks ("). You can use the query syntax you would normally use in a Google search (e.g. filetype:jpg). In column 2 you can specify terms that should not be included in the final class names. For example, you might add guitar to your search terms to help the search but not want that term in the final class folder names. If you do not want to use this column, leave it blank (i.e., end the line with a comma).
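As a rough illustration of the file format described above, the following sketch reads such a CSV and derives a class folder name for each row. It is not FastClass's own parser: the function names, the sample rows, whole-word removal of excluded terms, and joining the remaining words with underscores are all assumptions made for this example.

```python
import csv
import io


def folder_name(search, exclude_terms):
    """Derive a class folder name (assumed convention: words joined by '_')."""
    excluded = {term.lower() for term in exclude_terms}
    words = [w for w in search.split() if w.lower() not in excluded]
    # Fall back to the full query if every word was excluded.
    return "_".join(words or search.split())


def read_queries(handle):
    """Return (search terms, folder name) pairs from a two-column CSV."""
    reader = csv.reader(handle, skipinitialspace=True)
    next(reader, None)  # the first row is a header and is skipped
    queries = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines and rows without search terms
        search = row[0].strip()
        exclude = row[1].split() if len(row) > 1 else []
        queries.append((search, folder_name(search, exclude)))
    return queries


# Hypothetical sample rows, not the contents of guitars.csv.
sample = "\n".join([
    "searchterm,exclude",
    "gibson les paul guitar,guitar",
    "fender stratocaster guitar,guitar",
    "ukulele,",
])
queries = read_queries(io.StringIO(sample))
for search, folder in queries:
    print(f"{search!r} -> {folder}")
```

Skipping rows with an empty first column is a design choice of this sketch; it keeps a trailing blank line from turning into an empty search.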

Clean image sets

Once the images are downloaded, use fcc to quickly inspect the files and rate or classify them. You can also mark them for deletion.

FastClass cleaner: fcc

Usage

Call the script from the command line. If you call it without input parameters, it shows the help page.

Usage: fcc [OPTIONS] INFOLDER [OUTFOLDER]

  FastClass fcc

Options:
  --nocopy TEXT  disable filecopy for cleaned image set  [default: False]
  -h, --help     Show this message and exit.

  ::: FastClass fcc ::: ...a fast way to cleanup/ sort your images when
  building a dataset for deep learning.

  Note: In the application use the following keys: <1>, <2>, ... <9> for
  class assignments or quality ratings <space> assigns <1> <d> to mark a
  deletion <x> to terminate the app/ write output

  Use the buttons to navigate back and forth without changing the
  classification. The current classification of an image is given in the
  title bar (X indicated a mark for deletion). The counter in the titlebar
  gives number of classified images vs the total number in the input folder.

  In the output csv file 1,2 depcit class assignments/ ratings,  -1
  indicates files marked for deletion (if not excluded with -d).

Flickr Crawler

The Flickr crawler requires an API key. FastClass looks for the key in an environment variable called FLICKR_API_KEY. Request one from the Flickr API key application page.

FLICKR_API_KEY=asdf1234asdf456 fcd -c FLICKR my_project.csv
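A minimal sketch of how a script could read the key from the environment and fail with a clear message when it is missing. The helper name flickr_api_key and the error text are made up for this example; only the variable name FLICKR_API_KEY comes from the text above.

```python
import os


def flickr_api_key(env=None):
    """Return the Flickr API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("FLICKR_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FLICKR_API_KEY is not set; the Flickr crawler requires an API key"
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(flickr_api_key({"FLICKR_API_KEY": "asdf1234asdf456"}))
```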

fastclass's People

Contributors

cwerner, h4dr1en, mpoisot, mraggi, v-raja

fastclass's Issues

Download images more than 1000?

Hey, thanks for the wonderful tool! It's a great help.
I wanted to ask how to change the limit of -m, which has a maximum of 1000 right now.
If we want to download more than 1000 images, what should we do?

NotADirectoryError with fcd

I encountered the following error:

Searching: >> Skechers Skech-Air: Porter - Zevelo <<
(1) Crawling ...
Traceback (most recent call last):
  File "...\envs\py3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "...\envs\py3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "...\envs\py3\Scripts\fcd.exe\__main__.py", line 9, in <module>
  File "...\envs\py3\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "...\envs\py3\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "...\envs\py3\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "...\envs\py3\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "...\envs\py3\lib\site-packages\fastclass\fc_download.py", line 168, in cli
    main(infile, size, crawler, keep, maxnum, outpath)
  File "...\envs\py3\lib\site-packages\fastclass\fc_download.py", line 117, in main
    source_urls = crawl(raw_folder, search_term, maxnum, crawlers=crawler)
  File "...\envs\py3\lib\site-packages\fastclass\fc_download.py", line 46, in crawl
    os.makedirs(folder, exist_ok=True)
  File "...\envs\py3\lib\os.py", line 220, in makedirs
    mkdir(name, mode)
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\\Users\\x\\AppData\\Local\\Temp\\tmpi3pqkz6f\\Skechers_Skech-Air:_Porter_-_Zevelo'

The bug comes from this line because the character ":" of the string Skechers Skech-Air: Porter - Zevelo is not escaped and is not valid inside a Windows path.

I would suggest using Django's filename sanitizer for this.

I implemented it in #16
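For readers who want to see the idea, here is a minimal sketch of such a sanitizer, loosely modeled on the approach of Django's get_valid_filename (replace spaces with underscores, then drop every character that is not a word character, hyphen, or dot). It is not necessarily the code from the pull request, and the name sanitize_folder_name is hypothetical.

```python
import re


def sanitize_folder_name(name):
    """Return a folder name without characters that are invalid in Windows paths."""
    # Trim the name and use underscores instead of spaces.
    cleaned = name.strip().replace(" ", "_")
    # Keep only word characters, hyphens and dots; this removes ':' among others.
    cleaned = re.sub(r"[^-\w.]", "", cleaned)
    if cleaned in {"", ".", ".."}:
        raise ValueError(f"could not derive a valid folder name from {name!r}")
    return cleaned


print(sanitize_folder_name("Skechers Skech-Air: Porter - Zevelo"))
```

With the search term from the traceback, the colon is dropped and the remaining name can be created on Windows.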

UnicodeEncodeError while writing to log file

The following error occurred when writing to the log file:

Searching: >> Nike Moon Racer <<
(1) Crawling ...
    -> GOOGLE
    -> BING
    Number of duplicate image files: 1. Removing...
(2) Resizing images to (299, 299)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 32.20it/s]                
Traceback (most recent call last):
  File "...\envs\py3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "...\envs\py3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "...\envs\py3\Scripts\fcd.exe\__main__.py", line 9, in <module>
  File "...\envs\py3\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "...\envs\py3\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "...\envs\py3\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "...\envs\py3\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "...\envs\py3\lib\site-packages\fastclass\fc_download.py", line 163, in cli
    main(infile, size, crawler, keep, maxnum, outpath)
  File "...\envs\py3\lib\site-packages\fastclass\fc_download.py", line 132, in main
    log.write(','.join([item, source_urls[item]]) + '\n')
  File "...\envs\py3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 102-106: character maps to <undefined>

The error comes from here.

To reproduce the bug, simply enter in a csv file:

searchterm,exclude
Nike Moon Racer, "Nike"

and run fcd as follows:

fcd -m 25 label.csv

The error is caused by characters in the URL of this picture that the cp1252 codec cannot encode.

I fixed it and will attach the fix to #16
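One common way to avoid this class of error is to open the log file with an explicit UTF-8 encoding instead of relying on the platform default (cp1252 in the traceback above). The sketch below illustrates that idea and is not necessarily the fix that went into the pull request; write_url_log, the file name, and the example URL are hypothetical, while the item-to-URL mapping mirrors the source_urls usage visible in the traceback.

```python
import os
import tempfile


def write_url_log(path, source_urls):
    """Write 'item,url' lines using UTF-8 regardless of the platform default."""
    with open(path, "w", encoding="utf-8", newline="") as log:
        for item, url in source_urls.items():
            log.write(",".join([item, url]) + "\n")


# A URL with characters that cp1252 cannot represent (made up for the demo).
urls = {"000001.jpg": "https://example.com/ナイキ.jpg"}
log_path = os.path.join(tempfile.mkdtemp(), "log.csv")
write_url_log(log_path, urls)

with open(log_path, encoding="utf-8") as handle:
    logged = handle.read()
print(logged, end="")
```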

Package downloading too few images

Thanks for this helpful tool.
I am using this and it works wonderfully; the only problem is that it downloads just 20-40 images for each class.
Am I doing something wrong, or is there any way I could make it download more images?

I used the command below:
fcd -c GOOGLE -s 224 C:/Users/admin/Downloads/cartypes.csv

Store image data and classification categories in db?

Maybe it would be a good idea to store image info and classification details in a small SQL (SQLite) database.

Potential entries:

  • name
  • search query
  • source-url
  • class (incl. subclasses)
  • comment
  • link to raw file
  • link to scaled file
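To make the proposal concrete, here is one possible schema using Python's built-in sqlite3 module. The table name, column names, and types are assumptions derived from the list above (class is stored as class_label here, and subclasses are not modeled); nothing like this exists in FastClass according to this thread.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS images (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    search_query TEXT,
    source_url   TEXT,
    class_label  TEXT,
    comment      TEXT,
    raw_path     TEXT,  -- link to raw file
    scaled_path  TEXT   -- link to scaled file
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


conn = open_db()
conn.execute(
    "INSERT INTO images (name, search_query, source_url, class_label,"
    " comment, raw_path, scaled_path) VALUES (?, ?, ?, ?, ?, ?, ?)",
    (
        "000001.jpg",
        "gibson les paul guitar",
        "https://example.com/000001.jpg",  # placeholder URL
        "1",
        "",
        "dataset.raw/gibson_les_paul/000001.jpg",
        "dataset/gibson_les_paul/000001.jpg",
    ),
)
conn.commit()
rows = conn.execute("SELECT name, class_label FROM images").fetchall()
print(rows)
```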

Add source URL to log/ record file

To check bad files it would be helpful to investigate the source of the images manually. Therefore the URL of the crawler instance needs to be recorded.

ValueError: images do not match (I think during cropping)

I get an error using !fcd -c GOOGLE -s 224 query.csv with a simple query dictionary

searchterm,exclude
keys, key
glasses, glasses
sunglasses, glasses
shades, glasses
remote controll, remote controll
phone, phone
smartphone, phone
mobilephone, phone
wallet, wallet
cash cart, cash cart
atm card, cash cart
credit card, cash cart

when it searches for glasses
Searching: >> glasses <<
I get:
..... trace till imageprocessing.py line 35 then
  File "/anaconda3/envs/fastai/lib/python3.7/site-packages/PIL/Image.py", line 1442, in paste
    self.im.paste(im, box)
ValueError: images do not match

From what I found on Stack Overflow, it seems to be this error:
https://stackoverflow.com/questions/12291641/python-pil-valueerror-images-do-not-match

I guess ignoring such an image would be enough (just a try/catch). It is 2am, so I just quickly put a try/catch inside the for loop of imageprocessing.py to resolve it for me and go to bed xD. Maybe there is a better solution to that?

Edit: Later, while writing this post, I also had an IsADirectoryError in the download, so I brainlessly put a try/catch around im = Image.open(f) as well....

Nice lib btw.!

Resizing results from multiple searchengines (-c ALL) overwrites images

out = os.path.join(outpath, fname + '.jpg')

It looks like resize overwrites the output files when multiple crawlers are used. For example, when resizing it goes through the Google results first and resizes 000001.jpg from the Google results to the output folder. Then it resizes 000001.jpg from the Bing results and saves it to the same folder, overwriting the image from Google. Finally, it resizes 000001.jpg from Baidu and also saves it to the output folder, overwriting the image from Bing.

So:
tmp/searchterm.google/000001.jpeg -> dataset/searchterm/000001.jpg
tmp/searchterm.bing/000001.jpeg -> dataset/searchterm/000001.jpg
tmp/searchterm.baidu/000001.jpeg -> dataset/searchterm/000001.jpg

Leaving only the image from the Baidu search in the output folder.
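One way to avoid the collision is to include the crawler name in the output file name. The sketch below shows that idea under stated assumptions: output_path is a hypothetical helper, not the project's resize code, and a running counter across all crawlers would work just as well.

```python
import os


def output_path(outpath, crawler, fname):
    """Build an output file name that stays unique across crawlers."""
    stem = os.path.splitext(os.path.basename(fname))[0]
    return os.path.join(outpath, f"{crawler}_{stem}.jpg")


sources = [
    ("google", "tmp/searchterm.google/000001.jpeg"),
    ("bing", "tmp/searchterm.bing/000001.jpeg"),
    ("baidu", "tmp/searchterm.baidu/000001.jpeg"),
]
targets = [output_path("dataset/searchterm", crawler, f) for crawler, f in sources]
for (_, src), dst in zip(sources, targets):
    print(f"{src} -> {dst}")
```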

Add tests

Now that other people are using the tool, it really should have tests.

!fcc /content/tiger/Bali_tigris

self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: no display name and no $DISPLAY environment variable

TypeError: 'NoneType' object is not iterable

Hi
Just installed FastClass as per instructions on your blog post

Created a simple query file and ran the command as per your blog post but got the following error:

fcd -c GOOGLE -k -o surfers surfers.csv
INFO: final dataset will be located in surfers
[1/2] Searching: >> surfer aerial view <<
(1) Crawling ...
    -> GOOGLE
Number of duplicate image files: 0. Removing...
(2) Resizing images to (299, 299)
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/tim/miniconda3/envs/ml/bin/fcd", line 8, in <module>
    sys.exit(cli())
  File "/home/tim/miniconda3/envs/ml/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tim/miniconda3/envs/ml/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tim/miniconda3/envs/ml/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tim/miniconda3/envs/ml/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tim/miniconda3/envs/ml/lib/python3.6/site-packages/fastclass/fc_download.py", line 170, in cli
    main(infile, size, crawler, keep, maxnum, outpath)
  File "/home/tim/miniconda3/envs/ml/lib/python3.6/site-packages/fastclass/fc_download.py", line 138, in main
    for item in source_urls:
TypeError: 'NoneType' object is not iterable

I'm running Ubuntu 19.10 using conda environment with Python 3.6. I also tried a new install of FastClass in a new conda environment with python 3.7 and got the following error:

fcd -c ALL -k -o surfers surfers.csv
INFO: final dataset will be located in surfers
[1/2] Searching: >> surfer aerial view <<
(1) Crawling ...
    -> GOOGLE
    -> BING
Number of duplicate image files: 1. Removing...
(2) Resizing images to (299, 299)
100%|█████████████████████████████████████████| 521/521 [00:10<00:00, 52.08it/s]
[2/2] Searching: >>  <<
(1) Crawling ...
    -> GOOGLE
    -> BING
Number of duplicate image files: 1. Removing...
(2) Resizing images to (299, 299)
  0%|                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/tim/miniconda3/envs/py3.7/bin/fcd", line 8, in <module>
    sys.exit(cli())
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/fastclass/fc_download.py", line 170, in cli
    main(infile, size, crawler, keep, maxnum, outpath)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/fastclass/fc_download.py", line 133, in main
    source_urls = resize(files, outpath=out_resized, size=SIZE, urls=source_urls)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/fastclass/imageprocessing.py", line 32, in resize
    im = Image.open(f)
  File "/home/tim/miniconda3/envs/py3.7/lib/python3.7/site-packages/PIL/Image.py", line 2843, in open
    fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmpct1gtla5/surfer'

However it did run fine when run from a conda environment with python 3.7 on my windows 10 box.

Any suggestions?
Thanks, and thanks so much for developing this app!
Kind regards
Tim

Remove Baidu from ALL

Baidu seems to be returning a lot of incorrect file types which cause the program to take time. Also, the -m tag works well when crawling through Google and Bing at the same time, but doesn't work so well when crawling through ALL. I suspect that it might have something to do with Baidu but I'm not sure.

So I think Baidu should be removed from the ALL option and then since the -m tag will work with the all option, we can add it to the readme.

Provide labels in fcc

I think it would be better if we could provide custom labels as input to fcc, defaulting to the usual 1, 2, 3, 4, etc. when none are provided.
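A small sketch of what such a mapping could look like: keys 1 to 9 default to their own number, and any custom labels replace the defaults in order. The function build_labels and its behavior are hypothetical and only illustrate the request.

```python
def build_labels(custom=None, n_keys=9):
    """Map the keys '1'..'9' to labels, falling back to the key itself."""
    custom = list(custom or [])
    if len(custom) > n_keys:
        raise ValueError(f"at most {n_keys} labels are supported")
    keys = [str(i) for i in range(1, n_keys + 1)]
    labels = dict(zip(keys, keys))  # default: key '1' -> label '1'
    labels.update(zip(keys, custom))  # custom labels override in order
    return labels


print(build_labels()["3"])
print(build_labels(["acoustic", "electric"])["2"])
```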

Size of the image

Hi... Is there any way to specify the size of the images to be downloaded for each class (like 400*400)? This would be super helpful to avoid junk images...

Scraping Pinterest?

First of all, nicely done (good docs, easy install, works like a charm). I would find Pinterest scraping quite useful.

using navigation keys with fcc throws KeyError 'c'

Using the navigation keys (Left, Up, Down, Right) that are not on the num pad throws the following error:

Traceback (most recent call last):
  File "...\envs\py3\lib\tkinter\__init__.py", line 1705, in __call__
    return self.func(*args)
  File "...\envs\py3\lib\site-packages\fastclass\fc_clean.py", line 133, in callback
    button_action(event.char)
  File "...\envs\py3\lib\site-packages\fastclass\fc_clean.py", line 129, in button_action
    self._class[f'c{char}'].add(self.filelist[self._index])
KeyError: 'c'

The callback function should always use event.keysym to avoid such problems. I fixed it and will open a PR.
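Arrow keys produce an empty event.char in Tk, which is why the lookup above ends up with the bare key 'c'. A sketch of keysym-based dispatch, using the key bindings listed in the fcc help text, might look like this. It is not the code from the PR; action_for_key and the action tuples are invented for the illustration.

```python
def action_for_key(keysym):
    """Map a Tk keysym to an (action, value) pair, or None for ignored keys."""
    if keysym in {"1", "2", "3", "4", "5", "6", "7", "8", "9"}:
        return ("class", int(keysym))
    if keysym == "space":
        return ("class", 1)  # <space> assigns class 1
    if keysym == "d":
        return ("delete", None)
    if keysym == "x":
        return ("exit", None)
    return None  # arrow keys and anything else are ignored


for key in ["3", "space", "d", "Left"]:
    print(key, "->", action_for_key(key))
```

In a Tk callback this would be called as action_for_key(event.keysym), and a None result would simply return without touching the classification.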

broken images stop the whole program

I saw this error:

  File "/miniconda/bin/fcd", line 10, in <module>
    sys.exit(cli())                             
  File "/miniconda/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)                                                                                                                                                                                
  File "/miniconda/lib/python3.6/site-packages/click/core.py", line 717, in main                                                                                                                                    
    rv = self.invoke(ctx)
  File "/miniconda/lib/python3.6/site-packages/click/core.py", line 956, in invoke                                                                                                                                   
    return ctx.invoke(self.callback, **ctx.params)
  File "/miniconda/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/miniconda/lib/python3.6/site-packages/fastclass/fc_download.py", line 163, in cli
    main(infile, size, crawler, keep, maxnum, outpath)
  File "/miniconda/lib/python3.6/site-packages/fastclass/fc_download.py", line 126, in main
    source_urls = resize(files, outpath=out_resized, size=SIZE, urls=source_urls)                                                                                                                                    
  File "/miniconda/lib/python3.6/site-packages/fastclass/imageprocessing.py", line 45, in resize
    bg = bg.convert('RGB')
  File "/miniconda/lib/python3.6/site-packages/PIL/Image.py", line 912, in convert
    self.load()
  File "/miniconda/lib/python3.6/site-packages/PIL/ImageFile.py", line 239, in load
    len(b))        
OSError: image file is truncated (24 bytes not processed) 

and the whole program stopped of course.
I suggest catching this exception and skipping the file.
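The suggestion can be sketched as a small wrapper that runs a per-file step and records failures instead of aborting. The names process_all and fake_resize are hypothetical, and the real resize function in imageprocessing.py may be structured differently. OSError covers truncated files as well as IsADirectoryError, and ValueError covers the "images do not match" case reported in another issue.

```python
def process_all(files, process_one, skip_on=(OSError, ValueError)):
    """Apply process_one to each file, skipping files that raise skip_on errors."""
    done, skipped = [], []
    for f in files:
        try:
            done.append(process_one(f))
        except skip_on as err:
            skipped.append((f, str(err)))  # remember why the file was skipped
    return done, skipped


def fake_resize(path):
    """Stand-in for a real resize step; fails for one 'broken' file."""
    if "broken" in path:
        raise OSError("image file is truncated (24 bytes not processed)")
    return path


done, skipped = process_all(["a.jpg", "broken.jpg", "b.jpg"], fake_resize)
print("processed:", done)
print("skipped:", skipped)
```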

Better explanation of keep flag

Change readme to better explain keep flag. I crawled with and without using the flag; however, I was unable to find any differences in the results. What is the flag's purpose?
