Code Monkey home page Code Monkey logo

i_am_a_robot's Introduction

I am a robot

I'am a robot is a bachelor project to help pentesters to automatize web attacks. This project targets to solve recaptcha v2 without user interaction. The project use machine learning to solve captcha.

Google use 2 types of captcha. The first use 9 distinct pictures, the second split in 16 part a single picture. The way to solve is very different. The first captcha can be split in 9 pictures and trying to know if this has or not the label. For the second the model have to draw a mask and identify in which sub picture the mask is.

The following data have been collect between 21.05.21 and 28.05.21. The ratio can change in the future

The ratio between 3x3 and 4x4 is not balanced and it is about 70% 3x3 and 30% 4x4.

Google recaptcha challenge ratio are not balanced. Google use about 15 different labels in both type of captcha but not in the same distribution.

The five most used labels for the 3x3 recaptcha represent 99.2 % of whole 3x3 captchas.

Label %
Bus 28.09
Crosswalk 21.05
Fire hydrant 18.13
Bicycle 16.54
Car 15.42

Recaptcha 4x4 has two main labels represent 67.69 %.

Label %
Traffic light 47.02
Crosswalk 20.67
Bicycle 6.26
Bus 6.02
Fire hydrant 5.55
Motorbike 4.13

I'am a robot already have multiple model to identify different labels:

The percentage represent for one splitted image. A captcha is 9 images but sometimes can reload some image. To calculate passing captcha ratio : %9 or 12

prerequisites

  • Python, pip
  • Cityscapes for learning step
  • Datasets for learning step ( if not created before)
  • cuda setup installation for learning step and solver step

Downloader

This part of the program can download google recaptcha from website you want. This can be use to improve dataset by adding new pictures.

Captcha will be save in 4_4 or 3_3 folder. The file will be name as : i_labelOfCaptcha.png Where i is the iteration value and labelOfCaptcha the label of the captcha

Notice: tests did not show any difference between different website or location from where the script is run.

install downloader

The following command will install pypeteer and pypeteer-stealth with their dependences.

run pip install -r requirement/downloader_requirements.txt

use downloader

downloader.py [-h] [--dest DEST] [--sleeping_time SLEEPING_TIME] [--threshold THRESHOLD] [--batch_name BATCH_NAME] url
  • url: the url from where you want download captcha, must have a captcha on it
  • --dest: the location to save you captcha, in downloader directory. Default : downloaded_captcha
  • --sleeping_time : time to wait between reloading captcha in second . If the value is too small can produce white picture, Google can ban your because your are detecting as a bot. Default : 10
  • --threshold : Number of captcha to download, if you are not banned before. Default 100
  • --batch_name : The name for saving batch in --dest directoy. You will have a warning if the batch already exist. Default dataset

improvement downloader

  • Download only targeted captcha label
  • Continue download after ban
    • Waiting time
    • Changing IP
    • Check user argument ( negative sleeping_time, etc)

Annotator

This part of the project is use to create dataset. Currently, the project allow to annotate only 3x3 captcha. Models at the moment are use for only one label. This choice has been made for helping learning. When a model has a good score, it is kept and he does not to have good score for every labels.

install annotator

The following command will install pillow package

run ``pip install -r requirement/annotator_requirements.txt `

use annotator

Annotator has 2 parts located in annotator directory.

  • copy_captcha_with_condition is a utils command to copy file from download step and split it to get ready for annotating phase. Pictures will be store splitted and unsplitted to check if the script works. Will ignore folder 4_4 to keep only 3x3 captcha. Picture will be rename with random lowercase string.

    copy_captcha_with_condition.py [-h] [--captcha_format CAPTCHA_FORMAT] --dest DEST [--dir DIR] --src SRC --target TARGET
    • --captcha_format : Number of column and line to split. Default 3
    • --dest : Directory to store unsplitted picture
    • --dir : Directory to store splitted picture. Default dataset
    • --src : Source directory to copy file to dir
    • --target : Captcha to copy, label you want
  • annotate: is the script to create dataset. The value annotated will be store in the picture filename. The first character is the class. The script will show a file name ( the first present in the dataset directory) and will wait for a value. You have to open the image or see it in the folder and if the picture has the target label enter the value_dataset else just type enter.

    annotate.py [-h] [--dest_save DEST_SAVE] [--value_dataset VALUE_DATASET] [--ratio RATIO] [--src_dataset DATASET] dataset_path
    
    • dataset_path: root directory of the dataset
    • --dest_save : directory to save annotated dataset in dataset_path
    • --value_dataset: Value to set to elements represent your target. default 1. Currently use values:
      • crosswalk : 1
      • fire_hydrant : 3
      • bus : 4
      • car : 5
      • bicycle : 6
    • --ratio : Ratio to keep image that not contain your target. recaptcha 3x3 has usually 3 pictures with label and 6 without. Default 50
    • --src_dataset : Dataset name, same as --dir for copy_captcha_with_condition. Default dataset

improvement annotator

  • Add question if user want to see picture ( currently not possible because of loosing focus when opening a picture)
  • Change the value to annotate dataset, replace it with a understandable value
  • Checking user_arguments

Learner

This part of the project as for objective to improve the different model. Currently the project has 2 learning ways. Both use pytorch vision.

install learner

The following command will install the base to do learning, but to improve performance, it is recommended to install pytorch in this way :

https://pytorch.org/get-started/locally/

Then you can run

  • captcha 3x3

  • pip install -r requirement/learner_captcha_3_requirements.txt

  • You can manually build your dataset from Downloader and Annotator or a dataset already annotated can be downloaded at captcha3 drive dataset

  • captcha 4x4 : pip install -r requirement/learner_captcha_4_requirements.txt

    • The dataset use is the Cityscapes dataset with little modification like deleting pictures without any label in captcha. The modified dataset and their corresponding mask can be found at captcha4 drive dataset (unzip 6.6Go )

    The directory tree must be

    root_directory
    └───pictures
    │   │.png
    │   
    └───masks
    │   |.json
    

use learner

Captcha 3 command located at learner/captcha3 folder:

captcha_3.py [-h] [--batch_size BATCH_SIZE] [--ratio RATIO] [--num_workers NUM_WORKERS] [--lr LR]
                    [--momentum MOMENTUM] [--num_epochs NUM_EPOCHS] [--num_classes NUM_CLASSES] --name NAME
                    dataset_path
  • dataset_path : Path to the dataset you create or download

  • --batch_size: The size of the batch to do learning. Can be reduce if you have not enough memory. Default 32

  • --ratio : How to divide the dataset between picture use for learning and picture to keep for testing. Default 75. The value represent percentage of picture for learning

  • --num_workers : Can be used to make evolution on multiple GPU. Default 0

  • --lr : Learning rate at the beginning of the learning. Default 0.001

  • --momentum : momentum use for the learning. Default 0.9

  • --num_epochs : Number of epoch before ending learning. Default 25

  • --num_classes : Number of classes to identify in the dataset. Should be the same as the annotated dataset. Data should be name of class_filename where class start at 0 and increment by 1 for each different class. If using given dataset the value should not be changed. If using other dataset can be changed. Default 2

  • --name : Filename for the dataset without extension. The filename will finish with .pt

Captcha 4 command located at learner/captcha4 folder:

captcha_4.py [-h] [--batch_size BATCH_SIZE] [--ratio RATIO] [--num_workers NUM_WORKERS] [--lr LR] [--momentum MOMENTUM] [--num_epochs NUM_EPOCHS] [--weight_decay WEIGHT_DECAY] [--step_size STEP_SIZE] [--gamma GAMMA] --name
                    NAME
                    dataset_path

  • dataset_path : Path to the dataset you create or download

  • --batch_size: The size of the batch to do learning. Can be reduce if you have not enough memory. Default 2

  • --ratio : How to divide the dataset between picture use for learning and picture to keep for testing. Default 75. The value represent percentage of picture for learning

  • --num_workers : Can be used to make evolution on multiple GPU. Default 0

  • --lr : Learning rate at the beginning of the learning. Default 0.001

  • --momentum : momentum use for the learning. Default 0.9

  • --num_epochs : Number of epoch before ending learning. Default 25

  • --weight_decay : Weight decay adds a penalty term to the error function. Default 0.0005

  • --step_size : Number of epoch to wait to reduce learning rate. Default 3

  • -- gamma : Factor to reduce learning rate. Default 0.1

  • --name : Filename for the dataset without extension. The filename will finish with .pt

improvement learner

  • Captcha 4 accept other dataset
  • check user arguments
  • add --verbose

Solver

This part of the project as for objective to automatically solve Recaptcha v2 from given website. It can automatically fill require form input or accept user input.

install solver

The following command will install the base to solve, but to improve performance, it is recommended to install pytorch in this way(same as how to install learner step ) :

https://pytorch.org/get-started/locally/

Then run pip install -r requirement/solver_cpu_requirements.txt

use solver

solver.py [-h] [--form FORM] [--params [PARAMS ...]] [--values [VALUES ...]] [--num_form NUM_FORM] [--ascii] [--headed] url
  • url : url of the website to solve captcha. This url has to have a captcha
  • --form : how to find the form to fill in the page. - for a html id, need to add # before idname - for a html class, need to add . before classname - for the form tag, nothing to add
    • --num-form : select the nth form in the page. Default value 0
    • --params : form field to fill. If none search required one. If at least one is given, doesn't fill other field, optionnal or required
    • --values : fill the --params field with user values. If values is not given or not enough values given, fill randomly. Produce error if params is smaller than values.
    • --ascii : remove message (type of captcha, google result...) and replace it with ascii art ⚠️ clear terminal ⚠️
    • --headed : run in front pypeteer, usally use for debugging

improvement solver

  • ameliorate exception handling
  • find better way to fill select form
  • find call to know when reloaded picture is printed. Currently wait 6 seconds

credits

Cityscape : M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

Quentin Saucy

Feel free to ask questions and adding scripts to improve the project

I'am not a native english speaker, feel free to pull request with any typo fix

i_am_a_robot's People

Contributors

qsaucy avatar

Stargazers

 avatar Gwen Dossegger avatar Dusk avatar Gil Balsiger avatar Chris Barros Henriques avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.