
company-crawler's Introduction

A crawler for company information from Tianyancha (天眼查) and Qichacha (企查查).

Usage

  1. Set up user credentials

    Capture the Tianyancha and Qichacha mini-program traffic with a packet-capture tool, then fill the request-header authentication info into each module's __init__.py file. A random User-Agent can also be configured there; see the fake_useragent project.
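
    For illustration, a minimal sketch of such a header setup (REQUEST_HEADERS is the name visible in the project's tracebacks under Issues; the authentication field below is a placeholder, so copy the real header names and values from your own capture):

    from fake_useragent import UserAgent

    # Sketch of a tianyancha/__init__.py header block. The auth header name
    # is a placeholder; take the real one from your packet capture.
    REQUEST_HEADERS = {
        'User-Agent': UserAgent().random,  # random UA via fake_useragent
        'Authorization': '<token captured from the mini-program>',
    }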

  2. Configure the data source

    MYSQL_CONFIG = {
        'develop': {
            'host': '192.168.1.103',
            'port': 3306,
            'db': 'enterprise',
            'username': 'root',
            'password': 'root@123'
        }
    }
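
    A rough sketch of how this config might be consumed (it assumes pymysql, which the dependency tree quoted under Issues pins at 0.9.3; the open_connection helper is hypothetical):

    import pymysql

    def open_connection(env='develop'):
        # Hypothetical helper: open a connection using MYSQL_CONFIG above.
        cfg = MYSQL_CONFIG[env]
        return pymysql.connect(
            host=cfg['host'],
            port=cfg['port'],
            user=cfg['username'],
            password=cfg['password'],
            db=cfg['db'],
            charset='utf8mb4',
        )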
    
  3. Run db/data.sql to create the database schema
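
    For example, with the mysql command-line client and the develop config above (assuming the enterprise database itself already exists):

    mysql -h 192.168.1.103 -P 3306 -u root -p enterprise < db/data.sql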

  4. Configure the IP proxy in config/settings. Before turning on the global proxy, deploy an IP proxy pool yourself; see the proxy_pool project.

    # Global proxy switch
    GLOBAL_PROXY = True
    PROXY_POOL_URL = "http://localhost:5010"
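
    The crawler fetches an address from the pool's /get endpoint (visible in the traceback under Issues). A standalone sketch of that call, assuming proxy_pool returns JSON with the address under a "proxy" key:

    import requests

    def fetch_proxy():
        # Ask the proxy pool for one address; the response shape is assumed
        # to be JSON like {"proxy": "1.2.3.4:8080"}.
        addr = requests.get(f"{PROXY_POOL_URL}/get").json().get('proxy')
        if not addr:
            return None
        return {'http': f'http://{addr}', 'https': f'http://{addr}'}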
    
  5. Set the crawl keywords (qichacha & tianyancha)

    keys = ['Google'] # list of keywords to crawl
    crawler.load_keys(keys)
    crawler.start()
    

Schedule List

| Feature | Date | Status | Notes |
| ------- | ---- | ------ | ----- |
| Auth token extraction | | To do | |
| Built-in IP proxy | | To do | |
| Anti-ban strategy | | To do | |
| Containerized deployment | | To do | |


Please Note

A Telegram group for programmers to talk tech. Everyone is welcome!

Inside: tech discussion, job referrals, remote work, and part-time/freelance gigs.

Telegram group link (程序员社区): https://t.me/+iZK2y8zMUiE0NDE1


company-crawler's Issues

How do I get around the request limit on a single cookie?

I added a config table to the program to store multiple cookies; when one cookie expires, a new one is fetched from the table, but the program still cannot retrieve data with the new cookie.
Is anyone familiar with how the cookies work here?

Does each run only save 20 records?

I'm a beginner. Each run only saves 20 records; which part should I change to crawl more data? Thanks!

Crawled data cannot be saved to the MySQL database

Hi! I updated the MySQL configuration and the program runs without errors, but nothing is written to the database. What could be the problem? Any help would be appreciated, thanks.

Installation fails due to conflicting urllib3 version

Hi, users are unable to run webinfo-crawler due to a dependency conflict with the urllib3 package.
As shown in the full dependency graph of webinfo-crawler below, webinfo-crawler requires urllib3==1.25.2, while requests==2.21.0 requires urllib3>=1.21.1,<1.25.

According to pip's "first found wins" installation strategy, urllib3==1.25.2 is the version that actually gets installed.
However, urllib3==1.25.2 does not satisfy urllib3>=1.21.1,<1.25.

Dependency tree

webinfo-crawler-master
| +-dbutils(version range:==1.3)
| +-pymysql(version range:==0.9.3)
| +-redis(version range:==3.2.0)
| +-requests(version range:==2.21.0)
| | +-certifi(version range:>=2017.4.17)
| | +-chardet(version range:<3.1.0,>=3.0.2)
| | +-idna(version range:>=2.5,<2.9)
| | +-urllib3(version range:>=1.21.1,<1.25)
| +-urllib3(version range:==1.25.2)

Thanks for your help.
Best,
Neolith
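
One possible fix (a suggestion, assuming nothing else in the project needs urllib3 >= 1.25) is to pin urllib3 inside the range that requests 2.21.0 accepts, e.g. in requirements.txt:

urllib3==1.24.3  # satisfies requests 2.21.0's constraint urllib3>=1.21.1,<1.25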

Has the Tianyancha mini-program added new encryption?

Hi! I'm not sure whether Tianyancha has updated its encryption, but today my capture tool showed the prompt below and crawling no longer works. Does this mean I have to forge the device identifier, decompile the mini-program, and crack it? Could you please update the program? If so, many thanks 🙏! Best wishes!

[screenshot of the capture-tool prompt]

Endpoints

Where can I find the endpoint information for each feature?

Organized some fellow students and online friends to accumulate data and algorithms (trading focus). Looking to exchange ideas!

Hello, let me introduce the situation:
I have organized some fellow students and friends from the internet to accumulate data and algorithms (trading focus).
Our amateur setup is more capable than a company and can try many innovative things!
Stock strategies are already guiding live trading, and other strategies are moving forward.
The team's rough directions are:
1. traditional stock data (continuously growing; to be open-sourced in the future);
2. crypto server (data and order placement);
3. strategy research;
4. web user management;
5. deep learning.
One person's strength is limited; together we can do much more. Interested friends are welcome to get in touch, thanks!
My WeChat: jtyd008
Please add the note: 爬虫 (crawler)

2 questions

Can a VIP account get around account bans and font decryption?

Searches return no data after the latest update

05/02/2019 12:52:39 crawler.py[line:45] INFO 开始搜索关键字[火锅]
05/02/2019 12:52:41 crawler.py[line:84] ERROR [tyc]api error, warn-无数据
05/02/2019 12:52:41 crawler.py[line:47] INFO 开始解析
05/02/2019 12:52:42 crawler.py[line:125] INFO no companies available
05/02/2019 12:52:42 crawler.py[line:49] INFO 数据已保存
05/02/2019 12:52:42 crawler.py[line:50] INFO 结束

Tianyancha crawling fails

01/10/2023 02:00:52 crawler.py[line:20] INFO 正在采集[谷歌]...
01/10/2023 02:00:52 crawler.py[line:20] INFO 正在采集[谷歌]...
Traceback (most recent call last):
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
raise err
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/www/server/panel/pyenv/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/www/server/panel/pyenv/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/www/server/panel/pyenv/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/www/server/panel/pyenv/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/www/server/panel/pyenv/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f794904ee50>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
timeout=timeout,
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5010): Max retries exceeded with url: /get (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f794904ee50>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/www/wwwroot/company-crawler-master/tianyancha.py", line 19, in
crawler.start()
File "/www/wwwroot/company-crawler-master/tianyancha/crawler.py", line 21, in start
companies = TycClient().search(key).companies
File "/www/wwwroot/company-crawler-master/tianyancha/client.py", line 37, in search
data = Request(url, self.payload, proxy=True, headers=REQUEST_HEADERS).data
File "/www/wwwroot/company-crawler-master/util/httpclient.py", line 24, in init
self.get(**kwargs)
File "/www/wwwroot/company-crawler-master/util/httpclient.py", line 27, in get
p = proxy() if GLOBAL_PROXY and self.proxy else None
File "/www/wwwroot/company-crawler-master/util/httpclient.py", line 47, in proxy
p = Request(f"{PROXY_POOL_URL}/get").data
File "/www/wwwroot/company-crawler-master/util/httpclient.py", line 24, in init
self.get(**kwargs)
File "/www/wwwroot/company-crawler-master/util/httpclient.py", line 28, in get
resp = requests.get(self.url, params=self.params, verify=False, proxies=p, **kwargs)
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/www/wwwroot/company-crawler-master/spider_venv/lib/python3.7/site-packages/requests/adapters.py", line 565, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5010): Max retries exceeded with url: /get (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f794904ee50>: Failed to establish a new connection: [Errno 111] Connection refused'))
