
dataset-scrapper-pinterest's Introduction

Pinterest-scraper

Tool Description

This tool scrapes images from Pinterest. It works in four stages:

Stage 1 - Board Search

Given a `search term`, the crawler searches for boards matching that term and stores the collected board links in a `sqlite` database, to be used for collecting the pin URLs in the second stage.

Stage 2 - Board Url Scraping

Given the board URLs stored in the database during `stage 1`, the crawler goes through those links, collects the pin links, and stores them in the `sqlite` database so the pin images can be scraped and downloaded in `stage 4`.

Stage 3 - Get Unique Pins

Before moving on to `stage 4`, this stage simply excludes any duplicate pin URLs so that only unique pins are downloaded in the fourth and last stage.
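
A minimal sketch of what this de-duplication step might look like with `sqlite3`; the table and column names (`stage2`, `stage3`, `pin_url`) are assumptions for illustration, not the tool's actual schema:

    import sqlite3

    # Sketch only: table/column names are assumed, not taken from the real schema.
    conn = sqlite3.connect("pinterest.db")
    cur = conn.cursor()

    # Copy only the distinct pin URLs collected in stage 2 into the stage 3 table.
    cur.execute("CREATE TABLE IF NOT EXISTS stage3 (pin_url TEXT UNIQUE)")
    cur.execute("INSERT OR IGNORE INTO stage3 (pin_url) SELECT DISTINCT pin_url FROM stage2")
    conn.commit()
    conn.close()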

Stage 4 - Download Images

Given the pin URLs stored in `stage 2`, with duplicates excluded in `stage 3`, this last stage goes through those pin links, downloads the images inside those pins, then compresses the downloaded images and uploads them to `Mega Upload`.

Requirements

This tool uses the Chrome web driver. After installing it, run the following command to install the required dependencies for the tool:

pip install -r ./requirements.txt

Example Usages

  • This command executes all four stages, searching for bear images:
python ./pinterest_scraper.py --search_term='bears'
  • This command executes only the 3rd and 4th stages, using the URL links already stored in the sqlite database:
python ./pinterest_scraper.py --stages_to_execute=[3,4]

CLI Arguments and Options

  • search_term [string] - [optional] - If stage 1 is to be executed, this must be a valid string to search boards with; otherwise the tool will raise an error.

  • stages_to_execute [list[int]] - [optional] - A list of the stage numbers to execute. Default is a list containing all 4 stages: [1,2,3,4].

  • maximum_scrape_theads [int] - [optional] - Maximum number of threads used when scraping pins. Default is 2 threads.
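
A minimal sketch of how these options might be parsed; the flag names mirror this README, but the actual script may differ (for example in how the stage list is parsed):

    import argparse

    def parse_stage_list(value):
        # Accepts forms like "[3,4]" or "3,4" and returns a list of ints.
        return [int(x) for x in value.strip("[]").split(",") if x]

    parser = argparse.ArgumentParser(description="Pinterest scraper")
    parser.add_argument("--search_term", type=str, default=None,
                        help="Board search term (required if stage 1 is executed)")
    parser.add_argument("--stages_to_execute", type=parse_stage_list, default=[1, 2, 3, 4],
                        help="Stages to run, e.g. [3,4]")
    parser.add_argument("--maximum_scrape_theads", type=int, default=2,  # flag name as documented above
                        help="Maximum number of scraping threads")
    args = parser.parse_args()

    if 1 in args.stages_to_execute and not args.search_term:
        parser.error("--search_term is required when stage 1 is executed")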

dataset-scrapper-pinterest's People

Contributors

skittoo, vilerareza, kenje4090, haltingstate


dataset-scrapper-pinterest's Issues

Update log output

Update the log output to show the following information:

  1. Board pins - pins reported by the board during the board scraping stage
  2. Scraped pin URLs / board pin URLs - number of pin URLs scraped for the board
  3. Unique board pins
  4. Downloaded board pins
  5. Missing board pins - pins we know are unique but have not been downloaded
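
A sketch of what such a per-board log line could look like; the function and counter names are illustrative, not the tool's current output:

    import logging

    log = logging.getLogger("pinterest_scraper")

    def log_board_summary(board_url, board_pins, scraped, unique, downloaded):
        """Log the per-board counters listed above; 'missing' is unique minus downloaded."""
        log.info(
            "%s | board pins: %d | scraped pin urls: %d | unique: %d | downloaded: %d | missing: %d",
            board_url, board_pins, scraped, unique, downloaded, unique - downloaded,
        )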

Migration to Scrapy

You can use Playwright and Scrapy for the Pinterest scraper.

  • page HTML has to be saved in WARC format
  • the current SQLite schema is OK, but add a job status column (not started, in-progress, completed, failed/error) so we can use workers (see the schema sketch after this list)

https://github.com/kk-digital/dataset-scrapper-pinterest

This needs to be refactored:

  • first cleaned up

  • then moved to Scrapy

  • HTML files saved as WARC
    -- in case we need to extract text later

  • all three stages should write "jobs" to the database

  • then workers should pick up the jobs (can be the same process)
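
A sketch of how the job-status idea could look in the SQLite schema; the table and column names are assumptions, only the status values come from the list above:

    import sqlite3

    conn = sqlite3.connect("pinterest.db")
    # Sketch only: a jobs table with a status column so workers can claim work.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            stage INTEGER NOT NULL,              -- which stage produced / owns this job
            url TEXT NOT NULL,                   -- board or pin URL to process
            status TEXT NOT NULL DEFAULT 'not started'
                CHECK (status IN ('not started', 'in-progress', 'completed', 'failed/error'))
        )
    """)
    conn.commit()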


Run with the example search "kcg-character".

Add time from start to finish for board

Add the time from start to finish for each board:

  • and write it to the log file
  • board URL, number of pins
  • total time taken
  • seconds per pin

After each board, append to the log file

  • can have a separate log file for each stage
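
A sketch of the timing and log-append idea; the function and variable names (scrape_board, pin_count, the log file name) are placeholders, not the tool's actual API:

    import time

    def scrape_board_timed(board_url, scrape_board, log_path="stage2_timing.log"):
        """Run scrape_board(board_url), then append timing stats to a log file.
        scrape_board is a placeholder for the real per-board scraping function
        and is expected to return the number of pins scraped."""
        start = time.monotonic()
        pin_count = scrape_board(board_url)
        total = time.monotonic() - start
        with open(log_path, "a", encoding="utf8") as log:
            log.write(f"{board_url}\t{pin_count} pins\t{total:.1f}s total\t"
                      f"{total / max(pin_count, 1):.2f}s per pin\n")
        return pin_count, total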

Ability to Rerun board search + pin scraping jobs

Demonstrate the ability to run stage 1 (board search) multiple times if required

1> If board search is run more than once

  • scrape boards
  • if the URL of a board is already in the database, then do not add it again (see the sketch after this list)

2> Demonstrate the ability to run pin scraping for each board found in the board search

  • should be able to run multiple times if needed
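
A sketch of the "do not add again" check; the column names follow the stage1 snippet quoted in the SQL-escape issue below, while the UNIQUE constraint is an assumption made for illustration:

    import sqlite3

    def add_board_url(conn, search_term, board_url):
        """Insert a board URL only if it is not already stored (sketch; schema assumed)."""
        conn.execute("CREATE TABLE IF NOT EXISTS stage1 (search_term TEXT, board_url TEXT UNIQUE)")
        # INSERT OR IGNORE skips the row if board_url is already present.
        conn.execute("INSERT OR IGNORE INTO stage1 (search_term, board_url) VALUES (?, ?)",
                     (search_term, board_url))
        conn.commit()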

Refactor: Use Standard SQL escape functions

Stage1, Stage2, Stage3

        cmd = "insert into stage1(search_term, board_url) values ('" + \
            arg1.replace("'", "''")+"', '"+arg2+"')"

Use Python's standard parameterized-query mechanism instead of building the SQL string by hand and escaping quotes manually.

Should be:

        cursor.execute(
            "insert into stage1(search_term, board_url) values (?, ?)",
            (search_term, board_url),
        )

With the sqlite3 module the ? placeholders handle escaping automatically (MySQL-style drivers use %s placeholders instead).

Refactor: The CSV save can be commented out or removed

    with open(file_out_path, "w", encoding="utf8") as f:
        for url in all_data:
            data = all_data[url]
            f.write(str(data[0]))
            f.write(Separator_for_csv)
            f.write(str(url))
            f.write(Separator_for_csv)
            f.write(str(data[1]))
            f.write(Separator_for_csv)
            f.write(str(data[2]))
            f.write("\n")
            insert_data_into_database(search_term, str(url))

Check/Verify: Verify if we add links to database during scroll, or only at end of scroll


1> If we only add at the end of the scroll, it may fail, because RAM usage will keep increasing and older elements may also get removed if we scroll too long

2> So scraping should extract URLs every few scrolls and at the end

3> However, scraping should not add a URL to the list if it is already in the list
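
A sketch of extracting URLs every few scrolls with Selenium, using a set so a URL is never added twice; the CSS selector, scroll count, and sleep interval are placeholders, not the tool's actual values:

    import time
    from selenium.webdriver.common.by import By

    def collect_pin_urls(driver, num_scrolls=30, extract_every=5):
        """Scroll the page, extracting pin links every few scrolls instead of only at the end."""
        seen = set()
        for i in range(1, num_scrolls + 1):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
            if i % extract_every == 0 or i == num_scrolls:
                # Placeholder selector; the real one depends on Pinterest's markup.
                for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='/pin/']"):
                    href = a.get_attribute("href")
                    if href and href not in seen:
                        seen.add(href)
        return seen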

Enhancement: Show number of missing pin URLs at end of stage 4

Stage 4:

  • pin downloads

If pins failed or were not downloaded, we should be able to see the number in the output logs for the run, at the end of the run.

  • check if the image exists
  • if the image does not exist, assume it failed

If stage 4 is rerun

  • try to download any pins that haven't been downloaded yet
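
A sketch of counting missing pins by checking whether each expected image file exists; the directory layout and the URL-to-filename convention are assumptions:

    import os

    def find_missing_pins(pin_urls, image_dir):
        """Return the pin URLs whose image file was never written (assumed naming scheme)."""
        missing = []
        for url in pin_urls:
            # Assumed convention: the image is saved under a name derived from the pin URL.
            filename = url.rstrip("/").split("/")[-1] + ".jpg"
            if not os.path.exists(os.path.join(image_dir, filename)):
                missing.append(url)
        return missing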

Refactor: def create_database():

1> There should be a separate create-database function for each stage

Stage1.CreateDatabase()
Stage2.CreateDatabase()
Stage3.CreateDatabase()

2> Separate database clear function for each stage

Stage1.ClearDatabase()
Stage2.ClearDatabase()
Stage3.ClearDatabase()
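
A sketch of what one such per-stage module could look like; the module layout, table definition, and database path are assumptions (the path follows the ./output examples later in this page):

    # stage1.py (sketch)
    import sqlite3

    DB_PATH = "./output/stage1_boards.db"

    def create_database():
        """Create the stage 1 table if it does not exist yet."""
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS stage1 (search_term TEXT, board_url TEXT UNIQUE)")

    def clear_database():
        """Remove all rows collected by stage 1."""
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("DELETE FROM stage1")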

Optional. Enhancement: OpenVPN Proxy Upgrade

1> Directory where the OpenVPN proxy configuration goes

2> Ability to choose a random VPN profile from the folder and use it

3> Each proxy should have its own cookie profile

  • Each proxy should clear cookies / its profile when changing VPN

4> Should have a function to test the proxy, such as accessing an HTTP endpoint
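
A sketch of the profile selection and proxy test; the ./vpn_profiles directory and the test endpoint are assumptions, and actually launching OpenVPN or managing per-proxy cookie profiles is out of scope here:

    import glob
    import os
    import random
    import requests

    def pick_random_profile(profile_dir="./vpn_profiles"):
        """Pick a random .ovpn profile from the configuration directory (assumed layout)."""
        profiles = glob.glob(os.path.join(profile_dir, "*.ovpn"))
        return random.choice(profiles) if profiles else None

    def test_proxy(test_url="https://httpbin.org/ip", timeout=10):
        """Check connectivity through the current VPN by hitting an HTTP endpoint."""
        try:
            return requests.get(test_url, timeout=timeout).ok
        except requests.RequestException:
            return False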

./outputs directory

  1. All stages take an output directory parameter

  2. Defaults to ./output (relative to the repo root)

  3. Put ./output/ into .gitignore

  4. All stages need to have an output_directory parameter

  5. Database files are relative to ./output

  • example ./output/stage1_boards.db
  • example ./output/stage2_board_pins.db
  • example ./output/stage3_pins.db
  • example ./output/stage5_images.db
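
A sketch of an output_directory parameter defaulting to ./output, with the database paths built relative to it; the file names mirror the examples above:

    from pathlib import Path

    def resolve_db_paths(output_directory="./output"):
        """Build the per-stage database paths relative to the output directory."""
        out = Path(output_directory)
        out.mkdir(parents=True, exist_ok=True)
        return {
            "stage1": out / "stage1_boards.db",
            "stage2": out / "stage2_board_pins.db",
            "stage3": out / "stage3_pins.db",
            "images": out / "stage5_images.db",  # name as given in the examples above
        }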

Use URL for output folder name

For example:
For the board kcg-character with URL https://www.pinterest.ph/synth3840/kcg-character/:
Instead of naming the folder kcg-character, the folder name would be synth3840/kcg-character, but since we can't use / in a folder name, we URL-encode it. So synth3840/kcg-character becomes synth3840%2Fkcg-character.

Then, since we can't use % in a folder name either, we have to convert any existing _ to __ and then % to _. So synth3840%2Fkcg-character becomes synth3840_2Fkcg-character.

Old folder name: kcg-character
New Folder name: synth3840_2Fkcg-character
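
A sketch of that transformation using the standard library's URL quoting; the function name is illustrative, and the escaping order (underscores first, then %) follows the description above:

    from urllib.parse import quote

    def board_folder_name(board_path):
        """Turn e.g. 'synth3840/kcg-character' into 'synth3840_2Fkcg-character'."""
        encoded = quote(board_path, safe="")      # '/' becomes '%2F'
        # Escape existing underscores first, then map '%' to '_'.
        return encoded.replace("_", "__").replace("%", "_")

    print(board_folder_name("synth3840/kcg-character"))  # synth3840_2Fkcg-character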
