
mengmeng12 / crawl_company_info


I was assigned a data collection task: gather information on 2,000 companies from [owler](www.owler.com) and [crunchbase](www.crunchbase.com). Collecting structured data by hand and entering it into an Excel table is tedious and time-consuming, so I used web crawling to do it and save some time (an estimated 50 hours of work).


crawl_company_info's Introduction

I crawled company information from crunchbase and owler. Both websites use some basic anti-crawling techniques, so I used chromewhip to get past them and crawled the information successfully.


I needed the following information from the two websites: company name (on both websites), company business category, company revenue, competitor count (capped at 10), total competitor revenue, HHI, and the revenue of the top 10 competitors.
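The field list above can be pictured as one record per company. The following is only a sketch: the class name, the field names, and the choice to keep the ten largest competitor revenues are my assumptions for illustration, not the layout used in the notebooks or the CSV.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_COMPETITORS = 10  # the task caps competitors at 10


@dataclass
class CompanyRecord:
    """Hypothetical container for one company's collected fields."""
    name_owler: str               # company name as shown on owler
    name_crunchbase: str          # company name as shown on crunchbase
    business_category: str
    revenue: Optional[float]      # None when the site shows no revenue
    competitor_revenues: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: "top 10" means the ten largest revenues.
        ranked = sorted(self.competitor_revenues, reverse=True)
        self.competitor_revenues = ranked[:MAX_COMPETITORS]

    @property
    def competitor_count(self) -> int:
        # Never exceeds MAX_COMPETITORS because of the truncation above.
        return len(self.competitor_revenues)

    @property
    def competitor_revenue_total(self) -> float:
        return sum(self.competitor_revenues)
```

Building a record with twelve competitor revenues would leave ten in `competitor_revenues`, and `competitor_count` would report 10.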

The websites use some anti-crawling techniques, so basic crawling methods do not work.

At first I tried a simple approach: https://github.com/mengmeng12/crawl_company_info/blob/master/company_info_crawl.ipynb It retrieved some information initially, but the websites have anti-robot checks that require human verification.

Then I used chromewhip to mimic a real human's interaction with the website, in https://github.com/mengmeng12/crawl_company_info/blob/master/chromewhip1.ipynb

Example of mimicking a human:
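One common ingredient of human-like browsing is irregular pacing between page visits. The sketch below shows only that idea and does not use chromewhip. The function names, the default timings, and the `fetch` callback are placeholders I made up, not code from the notebooks.

```python
import random
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def human_delay(base: float = 2.0, jitter: float = 1.5,
                rng: Optional[random.Random] = None) -> float:
    """Return a pause length in seconds: a fixed base plus random jitter."""
    rng = rng or random.Random()
    return base + rng.uniform(0.0, jitter)


def visit_with_pauses(urls: Iterable[str],
                      fetch: Callable[[str], T],
                      base: float = 2.0,
                      jitter: float = 1.5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> List[T]:
    """Call fetch(url) for each URL, pausing a random interval after each one.

    fetch is whatever actually loads the page (a placeholder here);
    sleep is injectable so the pacing can be checked without waiting.
    """
    rng = rng or random.Random()
    results: List[T] = []
    for url in urls:
        results.append(fetch(url))
        sleep(human_delay(base, jitter, rng))
    return results
```

Passing a real page-loading function as `fetch` would space the requests out by two to three and a half seconds each with the defaults. Whether pacing alone is enough depends on the checks each site runs.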

I also wrote some functions to calculate indices and ratios, and added a few other measures to counter the websites' anti-crawling checks.
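HHI (the Herfindahl-Hirschman Index) is one of the indices in the field list. Its standard definition is the sum of squared market shares, with shares expressed as percentages. Below is a minimal sketch of that formula. Treating the company plus its listed competitors as the whole market is my assumption, and the function name is hypothetical; the notebooks may compute it differently.

```python
from typing import Sequence


def hhi(revenues: Sequence[float]) -> float:
    """Herfindahl-Hirschman Index on the 0..10000 scale.

    revenues: one revenue figure per firm in the market. Here the market
    is assumed to be the company plus its (up to 10) competitors.
    """
    if any(r < 0 for r in revenues):
        raise ValueError("revenues must be non-negative")
    total = sum(revenues)
    if total <= 0:
        raise ValueError("total revenue must be positive")
    # Each share is a percentage of total revenue; HHI sums the squares.
    return sum((100.0 * r / total) ** 2 for r in revenues)
```

For example, two firms with equal revenue give an HHI of 5000, and a single firm with all the revenue gives the maximum of 10000.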

The final result is here: https://github.com/mengmeng12/crawl_company_info/blob/master/data/whole_info_v4.csv

The code saved me from a lot of heavy work and helped me understand web crawling better.
