
wildberries.ru parser

The script is designed to parse various sections of the www.wildberries.ru marketplace. It can clean its own tables, but its main purpose is to accumulate information about how sales and prices of goods in different sections change over time.

Requirements

  • Python 3.6+
  • MySQL server 5.7+

How to get it on Ubuntu 18.04+:

  1. You need MySQL server and Python pip, if you don't have them yet:
    @ sudo apt install python3-pip
    @ sudo apt install mysql-server
    
  2. Install the requests module and the Python connector for MySQL:
    @ pip3 install requests
    @ pip3 install mysql-connector-python
    
  3. Check the MySQL installation and root password, and choose the default database:
    @ mysql -u root -p
    	mysql> USE sys;
    	mysql> exit
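
To confirm that Connector/Python can reach the server, a minimal check looks like this (a sketch; substitute your own root password):

import mysql.connector

# Connect with the same credentials the crawler will use.
conn = mysql.connector.connect(
    user="root",
    password="<mysql root password>",
    host="localhost",
    database="sys",
)
print(conn.is_connected())  # True means the Python side is ready
conn.close()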
    

How to get it on Windows:

  1. You need the MySQL 5.7.xx community server, if you don't have it yet: https://dev.mysql.com/get/Downloads/MySQLInstaller/mysql-installer-community-5.7.28.0.msi

When you see "Choosing a setup type", choose the "Custom" type and click Next. On the next screen all you need is MySQL Server, MySQL Workbench, MySQL Notifier and Connector/Python.

  2. Run MySQL Workbench and choose SYS as the default database.

  3. Install the requests module and the Python connector for MySQL:

    @ pip install requests
    @ pip install mysql-connector-python
    

Usage

The first thing you need for real work with this script is a whole pool of proxy servers; otherwise www.wildberries.ru will ban you for an hour or so after you view 100 positions in a short time. The script is designed to work with a premium account on www.hidemyname.org, which costs $8 per month. After payment you must send the feedback form on https://hidemyname.org/en/feedback/ with a request for API access. You'll get an e-mail with some keys and options.

Now you need to initialize the API. It's a one-time procedure for every key. Go to http://hidemyna.me/ru/proxy-list/

There will be a form where you must click the IP:port link below it. Next you'll see a small window with just one field and a button. Copy-paste your key into this field and click "Login". If you see a list with some IP addresses, you're ready to work; there is no need to copy them.
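
After that the script can pull the proxy list over the API and route requests through it. A minimal sketch of the idea, where PROXY_API_URL is a placeholder (the real endpoint and parameters come in the hidemyname e-mail), assuming the API returns one IP:port per line:

import random
import requests

# Placeholder URL: the real endpoint is described in the hidemyname e-mail.
PROXY_API_URL = "https://example.org/proxylist?key=<your 15-digit key>"

resp = requests.get(PROXY_API_URL, timeout=10)
proxies = resp.text.split()  # assumes one IP:port per line

# Route a catalog request through a randomly chosen proxy.
proxy = random.choice(proxies)
page = requests.get(
    "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki",
    proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
    timeout=10,
)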

Next you need to edit the config.ini file. It has fewer than ten lines:

[mysql]
mysql_user = root
mysql_pass = <mysql root password>
mysql_host = localhost
mysql_base = sys

[params]
expiry_hours = 24
timeout_between_urls = 3
proxy_list_key = <your hidemyname 15-digit key>
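
Inside the script these values can be read with Python's standard configparser module; roughly:

import configparser

config = configparser.ConfigParser()
config.read("config.ini")

mysql = config["mysql"]  # mysql_user, mysql_pass, mysql_host, mysql_base
expiry_hours = config["params"].getint("expiry_hours")
timeout_between_urls = config["params"].getint("timeout_between_urls")
proxy_list_key = config["params"]["proxy_list_key"]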

Parameters

expiry_hours - each record in the MySQL table has a timestamp; if you run the script more than once a day, positions that are still too "young" will not be updated. 24 hours is a good option, but anything from 1 hour to 168 hours (one week) works as well.

timeout_between_urls - delay between requests, in seconds. The actual delay is a random time from 0 to this value. Don't choose too low a delay; 3 seconds is fine. Both parameters are sketched below.
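
To make both parameters concrete, here is a rough sketch of how they act during a crawl (is_fresh and polite_sleep are illustrative names, not the script's actual identifiers):

import random
import time
from datetime import datetime, timedelta

def is_fresh(last_seen: datetime, expiry_hours: int = 24) -> bool:
    # Positions younger than expiry_hours are skipped, not re-parsed.
    return datetime.now() - last_seen < timedelta(hours=expiry_hours)

def polite_sleep(timeout_between_urls: int = 3) -> None:
    # Random delay between 0 and timeout_between_urls seconds between requests.
    time.sleep(random.uniform(0, timeout_between_urls))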

Command line interface

Usage: crawler.py [-h] [-c] [-d] [-u] [-s] [-n] [-m] database source

Required arguments:

  • database = Table name
  • source = Page URL from which crawling starts. The URL must be in quotation marks.

Optional arguments:

  • -c = Cleans the table before starting, if you want to rebuild a table from scratch.
  • -d = Debug mode.
  • -u = Updates the http proxies table.
  • -s = Uses https proxies too. Overrides the -u argument.
  • -n = Doesn't use proxies. Useful in debug mode or if you want to parse fewer than 100 positions.
  • -m = Fills empty material cells. Sometimes a position is parsed without its "material" cell; in this case the cell contains just "---". At the end of the whole parse the script checks and refills such cells, but the argument is kept just in case you want to try refilling them again. It only works while expiry_hours has not yet passed.
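
For reference, the interface above maps onto a standard argparse definition. A sketch (not necessarily the script's exact code):

import argparse

parser = argparse.ArgumentParser(prog="crawler.py")
parser.add_argument("database", help="table name")
parser.add_argument("source", help="page URL to start crawling from")
parser.add_argument("-c", action="store_true", help="clean the table before starting")
parser.add_argument("-d", action="store_true", help="debug mode")
parser.add_argument("-u", action="store_true", help="update the http proxies table")
parser.add_argument("-s", action="store_true", help="use https proxies too (overrides -u)")
parser.add_argument("-n", action="store_true", help="don't use proxies")
parser.add_argument("-m", action="store_true", help="refill empty material cells")
args = parser.parse_args()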

So, how to use it.

The first time you start this crawler on a given day, you must use the -u or -s argument, because the proxies change constantly. Example:

python crawler.py -u backpacks https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki

When you see that some section has more than 20000 positions, you must narrow your request, or many positions in the middle will go unnoticed. You can split it into two or more queries. From here on you need quotes around the URL, because the shell would otherwise interpret the ?, ; and & characters. Example:

python crawler.py backpacks "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki?price=200;2000"
python crawler.py backpacks "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki?price=2000;20000"

Here I split the backpacks section into "cheaper than 2000 rub" and "from 2000 rub to 20000 rub". You can split and narrow these sections using three or even more categories. Example:

python crawler.py backpacks "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki?price=3000;10000&color=0&kind=1&consists=6"

if you need only black polyester men's backpacks priced from 3000 rub to 10000 rub. And so on.
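
If you automate such splits, a small wrapper can run the bands one after another. A sketch, reusing the example price bands above (passing the URL as a list element sidesteps the shell quoting issue):

import subprocess

BASE = "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki"
PRICE_BANDS = [(200, 2000), (2000, 20000)]

for low, high in PRICE_BANDS:
    # Each price band becomes its own crawler run against the same table.
    subprocess.run(
        ["python", "crawler.py", "backpacks", f"{BASE}?price={low};{high}"],
        check=True,
    )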

Contacts

Telegram: @berenzorn

Email: [email protected]

Feel free to contact me if you have any questions. Have a nice day :)
