
dataset-scrapper-pinterest's Introduction

Pinterest-scraper

Tool Description

This tool scrapes images from Pinterest. It works in four stages:

Stage 1 - Board Search

Given a `search term`, the crawler searches for boards matching that term and stores the collected board links in a `sqlite` database, to be used for collecting the pin URLs in the second stage.

Stage 2 - Board Url Scraping

Given the board URLs stored in the database during `stage 1`, the crawler goes through those links, collects the pin links, and stores them in the `sqlite` database so the pin images can be scraped and downloaded in `stage 4`.

Stage 3 - Get Unique Pins

Before moving on to `stage 4`, this stage simply excludes any duplicate pin URLs so that only unique pins are downloaded in the fourth and last stage.
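
A minimal sketch of what this de-duplication step might look like with `sqlite3`; the table and column names (`stage2`, `stage3`, `pin_url`) are assumptions for illustration, not the tool's actual schema:

    import sqlite3

    # Sketch only: table/column names are assumed, not taken from the real schema.
    conn = sqlite3.connect("pinterest.db")
    cur = conn.cursor()

    # Copy only the distinct pin URLs collected in stage 2 into the stage 3 table.
    cur.execute("CREATE TABLE IF NOT EXISTS stage3 (pin_url TEXT UNIQUE)")
    cur.execute("INSERT OR IGNORE INTO stage3 (pin_url) SELECT DISTINCT pin_url FROM stage2")
    conn.commit()
    conn.close()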

Stage 4 - Download Images

Given the pin URLs stored in `stage 2`, with duplicates excluded in `stage 3`, this last stage goes through those pin links, downloads the images inside those pins, then compresses the downloaded images and uploads them to `Mega Upload`.

Requirements

This tool uses the Chrome web driver. After installing it, run the following command to install the required dependencies for the tool:

pip install -r ./requirements.txt

Example Usages

  • This command executes all four stages, searching for bear images:
python ./pinterest_scraper.py --search_term='bears'
  • This command executes only the 3rd and 4th stages, using the URL links already stored in the sqlite database:
python ./pinterest_scraper.py --stages_to_execute=[3,4]

CLI Arguments and Options

  • search_term [string] - [optional] - If stage 1 is to be executed, this must be a valid string to search boards with; otherwise the tool will raise an error.

  • stages_to_execute [list[int]] - [optional] - A list of the stage numbers to execute. Default is a list containing all 4 stages: [1,2,3,4].

  • maximum_scrape_theads [int] - [optional] - Maximum number of threads used when scraping pins. Default is 2 threads.
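
A minimal sketch of how these options might be parsed; the flag names mirror this README, but the actual script may differ (for example in how the stage list is parsed):

    import argparse

    def parse_stage_list(value):
        # Accepts forms like "[3,4]" or "3,4" and returns a list of ints.
        return [int(x) for x in value.strip("[]").split(",") if x]

    parser = argparse.ArgumentParser(description="Pinterest scraper")
    parser.add_argument("--search_term", type=str, default=None,
                        help="Board search term (required if stage 1 is executed)")
    parser.add_argument("--stages_to_execute", type=parse_stage_list, default=[1, 2, 3, 4],
                        help="Stages to run, e.g. [3,4]")
    parser.add_argument("--maximum_scrape_theads", type=int, default=2,  # flag name as documented above
                        help="Maximum number of scraping threads")
    args = parser.parse_args()

    if 1 in args.stages_to_execute and not args.search_term:
        parser.error("--search_term is required when stage 1 is executed")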

dataset-scrapper-pinterest's People

Contributors

skittoo, vilerareza, kenje4090, haltingstate


dataset-scrapper-pinterest's Issues

Update log output

Update the log output to show the following information:

  1. Board pins - pins reported by the board during the board scraping stage
  2. Scraped pin URLs / board pin URLs - number of pin URLs scraped for the board
  3. Unique board pins
  4. Downloaded board pins
  5. Missing board pins - pins we know are unique but have not been downloaded
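
A sketch of what such a per-board log line could look like; the function and counter names are illustrative, not the tool's current output:

    import logging

    log = logging.getLogger("pinterest_scraper")

    def log_board_summary(board_url, board_pins, scraped, unique, downloaded):
        """Log the per-board counters listed above; 'missing' is unique minus downloaded."""
        log.info(
            "%s | board pins: %d | scraped pin urls: %d | unique: %d | downloaded: %d | missing: %d",
            board_url, board_pins, scraped, unique, downloaded, unique - downloaded,
        )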

Migration to Scrapy

You can use Playwright and Scrapy for the Pinterest scraper.

  • page HTML has to be saved in WARC format
  • the current SQLite schema is OK, but add a job status column (not started, in-progress, completed, failed/error) so we can use workers (see the schema sketch after this list)

https://github.com/kk-digital/dataset-scrapper-pinterest

This needs to be refactored:

  • first cleaned up

  • then moved to Scrapy

  • HTML files saved as WARC
    -- in case we need to extract text later

  • all three stages should write "jobs" to the database

  • then workers should pick up the jobs (can be the same process)
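
A sketch of how the job-status idea could look in the SQLite schema; the table and column names are assumptions, only the status values come from the list above:

    import sqlite3

    conn = sqlite3.connect("pinterest.db")
    # Sketch only: a jobs table with a status column so workers can claim work.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            stage INTEGER NOT NULL,              -- which stage produced / owns this job
            url TEXT NOT NULL,                   -- board or pin URL to process
            status TEXT NOT NULL DEFAULT 'not started'
                CHECK (status IN ('not started', 'in-progress', 'completed', 'failed/error'))
        )
    """)
    conn.commit()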


Run with the example search "kcg-character".

Add time from start to finish for board

Add the time from start to finish for each board:

  • and write it to the log file
  • board URL, number of pins
  • total time taken
  • seconds per pin

After each board, append to the log file

  • can have a separate log file for each stage
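
A sketch of the timing and log-append idea; the function and variable names (scrape_board, pin_count, the log file name) are placeholders, not the tool's actual API:

    import time

    def scrape_board_timed(board_url, scrape_board, log_path="stage2_timing.log"):
        """Run scrape_board(board_url), then append timing stats to a log file.
        scrape_board is a placeholder for the real per-board scraping function
        and is expected to return the number of pins scraped."""
        start = time.monotonic()
        pin_count = scrape_board(board_url)
        total = time.monotonic() - start
        with open(log_path, "a", encoding="utf8") as log:
            log.write(f"{board_url}\t{pin_count} pins\t{total:.1f}s total\t"
                      f"{total / max(pin_count, 1):.2f}s per pin\n")
        return pin_count, total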

Ability to Rerun board search + pin scraping jobs

Demonstrate the ability to run stage 1 (board search) multiple times if required

1> If board search is run more than once

  • scrape boards
  • if the URL of a board is already in the database, then do not add it again (see the sketch after this list)

2> Demonstrate the ability to run pin scraping for each board found in the board search

  • should be able to run multiple times if needed
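
A sketch of the "do not add again" check; the column names follow the stage1 snippet quoted in the SQL-escape issue below, while the UNIQUE constraint is an assumption made for illustration:

    import sqlite3

    def add_board_url(conn, search_term, board_url):
        """Insert a board URL only if it is not already stored (sketch; schema assumed)."""
        conn.execute("CREATE TABLE IF NOT EXISTS stage1 (search_term TEXT, board_url TEXT UNIQUE)")
        # INSERT OR IGNORE skips the row if board_url is already present.
        conn.execute("INSERT OR IGNORE INTO stage1 (search_term, board_url) VALUES (?, ?)",
                     (search_term, board_url))
        conn.commit()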

Refactor: Use Standard SQL escape functions

Stage1, Stage2, Stage3

        cmd = "insert into stage1(search_term, board_url) values ('" + \
            arg1.replace("'", "''")+"', '"+arg2+"')"

Use Python's standard parameterized-query mechanism instead of building the SQL string by hand and escaping quotes manually.

Should be:

        cursor.execute(
            "insert into stage1(search_term, board_url) values (?, ?)",
            (search_term, board_url),
        )

With the sqlite3 module the ? placeholders handle escaping automatically (MySQL-style drivers use %s placeholders instead).

Refactor: The CSV save can be commented out or removed

    with open(file_out_path, "w", encoding="utf8") as f:
        for url in all_data:
            data = all_data[url]
            f.write(str(data[0]))
            f.write(Separator_for_csv)
            f.write(str(url))
            f.write(Separator_for_csv)
            f.write(str(data[1]))
            f.write(Separator_for_csv)
            f.write(str(data[2]))
            f.write("\n")
            insert_data_into_database(search_term, str(url))

Check/Verify: Verify if we add links to database during scroll, or only at end of scroll


1> If we only add at the end of the scroll, it may fail, because RAM usage will keep increasing and older elements may also get removed if we scroll too long

2> So scraping should extract URLs every few scrolls and at the end

3> However, scraping should not add a URL to the list if it is already in the list
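
A sketch of extracting URLs every few scrolls with Selenium, using a set so a URL is never added twice; the CSS selector, scroll count, and sleep interval are placeholders, not the tool's actual values:

    import time
    from selenium.webdriver.common.by import By

    def collect_pin_urls(driver, num_scrolls=30, extract_every=5):
        """Scroll the page, extracting pin links every few scrolls instead of only at the end."""
        seen = set()
        for i in range(1, num_scrolls + 1):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
            if i % extract_every == 0 or i == num_scrolls:
                # Placeholder selector; the real one depends on Pinterest's markup.
                for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='/pin/']"):
                    href = a.get_attribute("href")
                    if href and href not in seen:
                        seen.add(href)
        return seen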

Enhancement: Show number of missing pin URLs at end of stage 4

Stage 4:

  • pin downloads

If pins failed or were not downloaded, we should be able to see the number in the output logs for the run, at the end of the run.

  • check if the image exists
  • if the image does not exist, assume it failed

If stage 4 is rerun

  • try to download any pins that haven't been downloaded yet
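
A sketch of counting missing pins by checking whether each expected image file exists; the directory layout and the URL-to-filename convention are assumptions:

    import os

    def find_missing_pins(pin_urls, image_dir):
        """Return the pin URLs whose image file was never written (assumed naming scheme)."""
        missing = []
        for url in pin_urls:
            # Assumed convention: the image is saved under a name derived from the pin URL.
            filename = url.rstrip("/").split("/")[-1] + ".jpg"
            if not os.path.exists(os.path.join(image_dir, filename)):
                missing.append(url)
        return missing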

Refactor: def create_database():

1> There should be a separate create-database function for each stage

Stage1.CreateDatabase()
Stage2.CreateDatabase()
Stage3.CreateDatabase()

2> Separate database clear function for each stage

Stage1.ClearDatabase()
Stage2.ClearDatabase()
Stage3.ClearDatabase()
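
A sketch of what one such per-stage module could look like; the module layout, table definition, and database path are assumptions (the path follows the ./output examples later in this page):

    # stage1.py (sketch)
    import sqlite3

    DB_PATH = "./output/stage1_boards.db"

    def create_database():
        """Create the stage 1 table if it does not exist yet."""
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS stage1 (search_term TEXT, board_url TEXT UNIQUE)")

    def clear_database():
        """Remove all rows collected by stage 1."""
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("DELETE FROM stage1")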

Optional. Enhancement: OpenVPN Proxy Upgrade

1> Directory where the OpenVPN proxy configuration goes

2> Ability to choose a random VPN profile from the folder and use it

3> Each proxy should have its own cookie profile

  • Each proxy should clear cookies / its profile when changing VPN

4> Should have a function to test the proxy, such as accessing an HTTP endpoint
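
A sketch of the profile selection and proxy test; the ./vpn_profiles directory and the test endpoint are assumptions, and actually launching OpenVPN or managing per-proxy cookie profiles is out of scope here:

    import glob
    import os
    import random
    import requests

    def pick_random_profile(profile_dir="./vpn_profiles"):
        """Pick a random .ovpn profile from the configuration directory (assumed layout)."""
        profiles = glob.glob(os.path.join(profile_dir, "*.ovpn"))
        return random.choice(profiles) if profiles else None

    def test_proxy(test_url="https://httpbin.org/ip", timeout=10):
        """Check connectivity through the current VPN by hitting an HTTP endpoint."""
        try:
            return requests.get(test_url, timeout=timeout).ok
        except requests.RequestException:
            return False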

./outputs directory

  1. All stages take an output directory parameter

  2. Defaults to ./output (relative to the repo root)

  3. Put ./output/ into .gitignore

  4. All stages need to have an output_directory parameter

  5. Database files are relative to ./output

  • example ./output/stage1_boards.db
  • example ./output/stage2_board_pins.db
  • example ./output/stage3_pins.db
  • example ./output/stage5_images.db
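
A sketch of an output_directory parameter defaulting to ./output, with the database paths built relative to it; the file names mirror the examples above:

    from pathlib import Path

    def resolve_db_paths(output_directory="./output"):
        """Build the per-stage database paths relative to the output directory."""
        out = Path(output_directory)
        out.mkdir(parents=True, exist_ok=True)
        return {
            "stage1": out / "stage1_boards.db",
            "stage2": out / "stage2_board_pins.db",
            "stage3": out / "stage3_pins.db",
            "images": out / "stage5_images.db",  # name as given in the examples above
        }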

Use URL for output folder name

For example:
For the board kcg-character with URL https://www.pinterest.ph/synth3840/kcg-character/:
Instead of naming the folder kcg-character, the folder name would be synth3840/kcg-character, but since we can't use / in a folder name, we URL-encode it. So synth3840/kcg-character becomes synth3840%2Fkcg-character.

Then, since we can't use % in a folder name either, we have to convert any existing _ to __ and then % to _. So synth3840%2Fkcg-character becomes synth3840_2Fkcg-character.

Old folder name: kcg-character
New Folder name: synth3840_2Fkcg-character
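
A sketch of that transformation using the standard library's URL quoting; the function name is illustrative, and the escaping order (underscores first, then %) follows the description above:

    from urllib.parse import quote

    def board_folder_name(board_path):
        """Turn e.g. 'synth3840/kcg-character' into 'synth3840_2Fkcg-character'."""
        encoded = quote(board_path, safe="")      # '/' becomes '%2F'
        # Escape existing underscores first, then map '%' to '_'.
        return encoded.replace("_", "__").replace("%", "_")

    print(board_folder_name("synth3840/kcg-character"))  # synth3840_2Fkcg-character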
