qianlitp / crawlergo Goto Github PK

A powerful browser crawler for web vulnerability scanners

License: GNU General Public License v3.0

Go 98.89% Makefile 0.64% Shell 0.48%

headless-chrome crawler chrome-devtools golang vulnerability-scanner headless web-vulnerability-scanners chromedp crawlergo blackhat

crawlergo's Introduction

crawlergo

A powerful browser crawler for web vulnerability scanners

English Document | 中文文档

crawlergo is a browser crawler that uses chrome headless mode for URL collection. It hooks key positions of the whole web page with DOM rendering stage, automatically fills and submits forms, with intelligent JS event triggering, and collects as many entries exposed by the website as possible. The built-in URL de-duplication module filters out a large number of pseudo-static URLs, still maintains a fast parsing and crawling speed for large websites, and finally gets a high-quality collection of request results.

crawlergo currently supports the following features:

chrome browser environment rendering
Intelligent form filling, automated submission
Full DOM event collection with automated triggering
Smart URL de-duplication to remove most duplicate requests
Intelligent analysis of web pages and collection of URLs, including javascript file content, page comments, robots.txt files and automatic Fuzz of common paths
Support Host binding, automatically fix and add Referer
Support browser request proxy
Support pushing the results to passive web vulnerability scanners

Screenshot

Installation

Please read and confirm disclaimer carefully before installing and using。

Build

compilation for current platform

make build

compile for all platforms

make build_all

crawlergo relies only on the chrome environment to run, go to download for the new version of chromium.
Go to download page for the latest version of crawlergo and extract it to any directory. If you are on linux or macOS, please give crawlergo executable permissions (+x).
Or you can modify the code and build it yourself.

If you are using a linux system and chrome prompts you with missing dependencies, please see TroubleShooting below

Quick Start

Go！

Assuming your chromium installation directory is /tmp/chromium/, set up 10 tabs open at the same time and crawl the testphp.vulnweb.com:

bin/crawlergo -c /tmp/chromium/chrome -t 10 http://testphp.vulnweb.com/

Docker usage

You can also use this with docker without headache:

git clone https://github.com/Qianlitp/crawlergo
docker build . -t crawlergo
docker run crawlergo http://testphp.vulnweb.com/

Using Proxy

bin/crawlergo -c /tmp/chromium/chrome -t 10 --request-proxy socks5://127.0.0.1:7891 http://testphp.vulnweb.com/

Calling crawlergo with python

By default, crawlergo prints the results directly on the screen. We next set the output mode to json, and the sample code for calling it using python is as follows:

#!/usr/bin/python3
# coding: utf-8

import simplejson
import subprocess


def main():
    target = "http://testphp.vulnweb.com/"
    cmd = ["bin/crawlergo", "-c", "/tmp/chromium/chrome", "-o", "json", target]
    rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = rsp.communicate()
	#  "--[Mission Complete]--"  is the end-of-task separator string
    result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
    req_list = result["req_list"]
    print(req_list[0])


if __name__ == '__main__':
    main()

Crawl Results

When the output mode is set to json, the returned result, after JSON deserialization, contains four parts:

all_req_list： All requests found during this crawl task, containing any resource type from other domains.
req_list：Returns the current domain results of this crawl task, pseudo-statically de-duplicated, without static resource links. It is a subset of all_req_list .
all_domain_list：List of all domains found.
sub_domain_list：List of subdomains found.

Examples

crawlergo returns the full request and URL, which can be used in a variety of ways:

Used in conjunction with other passive web vulnerability scanners

First, start a passive scanner and set the listening address to: http://127.0.0.1:1234/

Next, assuming crawlergo is on the same machine as the scanner, start crawlergo and set the parameters:

--push-to-proxy http://127.0.0.1:1234/
Host binding (not available for high version chrome) (example)
Custom Cookies (example)
Regularly clean up zombie processes generated by crawlergo (example) , contributed by @ring04h

Bypass headless detect

crawlergo can bypass headless mode detection by default.

https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html

TroubleShooting

'Fetch.enable' wasn't found

Fetch is a feature supported by the new version of chrome, if this error occurs, it means your version is too low, please upgrade the chrome version.

chrome runs with missing dependencies such as xxx.so

// Ubuntu
apt-get install -yq --no-install-recommends \
     libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 \
     libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 \
     libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libgbm1 \
     libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 libnss3
     
// CentOS 7
sudo yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 \
     libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 \
     ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc -y

sudo yum update nss -y

Run prompt Navigation timeout / browser not found / don't know correct browser executable path

Make sure the browser executable path is configured correctly, type: chrome://version in the address bar, and find the executable file path:

Parameters

Required parameters

--chromium-path Path, -c Path The path to the chrome executable. (Required)

Basic parameters

--custom-headers Headers Customize the HTTP header. Please pass in the data after JSON serialization, this is globally defined and will be used for all requests. (Default: null)
--post-data PostData, -d PostData POST data. (Default: null)
--max-crawled-count Number, -m Number The maximum number of tasks for crawlers to avoid long crawling time due to pseudo-static. (Default: 200)
--filter-mode Mode, -f Mode Filtering mode, simple: only static resources and duplicate requests are filtered. smart: with the ability to filter pseudo-static. strict: stricter pseudo-static filtering rules. (Default: smart)
--output-mode value, -o value Result output mode, console: print the glorified results directly to the screen. json: print the json serialized string of all results. none: don't print the output. (Default: console)
--output-json filepath Write the result to the specified file after JSON serializing it. (Default: null)
--request-proxy proxyAddress socks5 proxy address, all network requests from crawlergo and chrome browser are sent through the proxy. (Default: null)

Expand input URL

--fuzz-path Use the built-in dictionary for path fuzzing. (Default: false)
--fuzz-path-dict Customize the Fuzz path by passing in a dictionary file path, e.g. /home/user/fuzz_dir.txt, each line of the file represents a path to be fuzzed. (Default: null)
--robots-path Resolve the path from the /robots.txt file. (Default: false)

Form auto-fill

--ignore-url-keywords, -iuk URL keyword that you don't want to visit, generally used to exclude logout links when customizing cookies. Usage: -iuk logout -iuk exit. (default: "logout", "quit", "exit")
--form-values, -fv Customize the value of the form fill, set by text type. Support definition types: default, mail, code, phone, username, password, qq, id_card, url, date and number. Text types are identified by the four attribute value keywords id, name, class, type of the input box label. For example, define the mailbox input box to be automatically filled with A and the password input box to be automatically filled with B, -fv mail=A -fv password=B.Where default represents the fill value when the text type is not recognized, as "Cralwergo". (Default: Cralwergo)
--form-keyword-values, -fkv Customize the value of the form fill, set by keyword fuzzy match. The keyword matches the four attribute values of id, name, class, type of the input box label. For example, fuzzy match the pass keyword to fill 123456 and the user keyword to fill admin, -fkv user=admin -fkv pass=123456. (Default: Cralwergo)

Advanced settings for the crawling process

--max-tab-count Number, -t Number The maximum number of tabs the crawler can open at the same time. (Default: 8)
--tab-run-timeout Timeout Maximum runtime for a single tab page. (Default: 20s)
--wait-dom-content-loaded-timeout Timeout The maximum timeout to wait for the page to finish loading. (Default: 5s)
--event-trigger-interval Interval The interval when the event is triggered automatically, generally used in the case of slow target network and DOM update conflicts that lead to URL miss capture. (Default: 100ms)
--event-trigger-mode Value DOM event auto-triggered mode, with async and sync, for URL miss-catching caused by DOM update conflicts. (Default: async)
--before-exit-delay Delay exit to close chrome at the end of a single tab task. Used to wait for partial DOM updates and XHR requests to be captured. (Default: 1s)

Other

--push-to-proxy The listener address of the crawler result to be received, usually the listener address of the passive scanner. (Default: null)
--push-pool-max The maximum number of concurrency when sending crawler results to the listening address. (Default: 10)
--log-level Logging levels, debug, info, warn, error and fatal. (Default: info)
--no-headless Turn off chrome headless mode to visualize the crawling process. (Default: false)

Follow me

Weibo：@9ian1i Twitter: @9ian1i

crawlergo's People

Contributors

Stargazers

Watchers

Forkers

boz14676 d4wner unclejim tempbottle qing-q mitooooo leezp xiaobaidebai j4ckzh0u 0xa-saline huangyuan666 1f3lse gh-jy rezaduty makiraid rajivraj captainbarber99 songxiaomo9708 useafter bravery9 pt001 devsecops-src willshion thetraker jack51706 qingxina wooyunvip pigcanfly2018 allinonecyberteamindia greg-wu gh0stshell p4yche kenanat keyman9848 ingramali wolfking2 xingkong123600 listenquiet niummm xxnbyy foxhack gkfnf kpax00 lw1988 morns0 icysun om1jerry trevor3000 wuwoweishui 020monkey saomase ilkkai 5l1v3r1 htesiege test2504 laowang1026 xinchuangfu zk953874391 rfma shamo666 haoirui moon3r this-is-y xhackff allblue147 wxtlco l0ading-x handlemail lonehand l34rner simon0105 xiaohutuchong coding-rgb rv0p111 kenuoseclab ykankaya xalxl popmedd echocipher sunpeak quinn-yan s0ulhun43r liao10086 lzy11332211 datougui2020 slowmistio yut0u wxy5566 xuanxuan3000 dawson0x00 deep-webs omiter javalangclass 5683xc v0re lengghd mrquiet todolin hkzck liu-ruikang

crawlergo's Issues

出现./crawlergo: cannot execute binary file: Exec format error的原因是什么呢

某网站爬取时间过长

目标站点为:https://www.che168.com/
爬取了两天了,还未结束, 所以希望作者能帮忙看一下是什么原因.
因为crawlergo是串联在自己写的一个程序中的,程序一直在爬,导致无法结束.
后续应该如何约束最大爬取时间,或深度?

部分爬取URL如下:

http://www.che168.com/suihua/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
https://www.che168.com/china/baoma/baoma5xi/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
http://www.che168.com/jiangsu/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/
http://www.che168.com/nanjing/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
https://www.che168.com/china/aodi/aodia6l/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
http://www.che168.com/xuzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
http://www.che168.com/wuxi/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
https://www.che168.com/china/baoma/baoma3xi/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
http://www.che168.com/changzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
http://www.che168.com/suzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=1009

取消默认 header 中的 Spider-Name

默认 header 中存在"Spider-Name": "crawlergo-0KeeTeam"，这个标识很容易被规则拦截。

crawlergo 在大批量抓取时卡死。

crawlergo 在抓取的时候卡死了... 环境是Windows10，cpu跟内存消耗也不大，temp目录下一直清空着着也不是空间的原因...

支持带cookie爬取吗

这个参数是不是不能添加cookie的？
--custom-headers Headers 自定义HTTP头，使用传入json序列化之后的数据，这个是全局定义，将被用于所有请求

一个奇怪的链接

cmd = ["E:/exploit/spider/crawlergo/crawlergo", "-c", "E:/exploit/spider/crawlergo/chrome-win/chrome.exe","-t", "5","-f","smart", "-m", "1", "--output-mode", "json", 'https://www.baidu.com']

rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])

result["all_req_list"][9]['url']

'https://,Wn=/'

'https://,Wn=/'请问这个链接是怎么触发的？

可否支持设置默认的parameter value

从目前的测试来看，如果我设置postdata 是'username=admin&password=password'，那只尝试一次，而且忽略其他在页面里面一同出现的paramter，后续的username password 继续使用默认的KeeTeam 等等。能否支持设定username=admin 以后，所有在username 出现的地方都使用admin 而不用KeeTeam？ password 类似。

mac下运行报“navigate timeout fork/exec /Applications/Chrome.app: permission ”错误

具体报错信息如下：
$ ./crawlergo -c /Applications/Chrome.app -t 20 https://www.baidu.com
INFO[0000] Init crawler task, host: www.baidu.com, max tab count: 20, max crawl count: 200.
INFO[0000] filter mode: smart
INFO[0000] Start crawling.
INFO[0000] filter repeat, target count: 2
INFO[0000] Crawling GET http://www.baidu.com/
INFO[0000] Crawling GET https://www.baidu.com/
WARN[0000] navigate timeout fork/exec /Applications/Chrome.app: permission deniedhttp://www.baidu.com/
WARN[0000] navigate timeout fork/exec /Applications/Chrome.app: permission deniedhttps://www.baidu.com/
INFO[0000] closing browser.

crawlergo和Chrome.app已加执行权限，Chrome.app为“Google Chrome.app”更改而来。
Mac版本10.14.5，Chrome版本 80.0.3987.87（正式版本）（64 位），python3.7.6

导航超时错误

navigate timeout context deadline exceeded
想本地做个dedecms的爬虫测试，直接就报了这个错误是哪里操作不当嘛？

爬取特定站点总是出现 Mission Complete

crawlergo.exe -c "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --tab-run-timeout 120s   -f strict   -m 1  --wait-dom-content-loaded-timeout 30s --output-mode console   https://cn.vuejs.org/

navigate timeout context deadline exceeded

执行
./crawlergo -c /usr/bin/google-chrome-stable -t 20 http://testphp.vulnweb.com/

传参的url只爬到一个
GET http://testphp.vulnweb.com/search.php?test=query
release

Distributor ID:	CentOS
Description:	CentOS Linux release 7.6.1810 (Core) 
Release:	7.6.1810
Codename:	Core

对表单支持不太好

自动填的0kee不一定符合该input的type，导致一些表单触发不了，就抓不到目标页面。

直接报错

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x16b9e8f]

goroutine 975 [running]:
ioscan-ng/src/tasks/crawlergo/engine.(*Tab).InterceptRequest(0xc00062c1c0, 0xc0005e5d80)
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/intercept_request.go:42 +0x25f
created by ioscan-ng/src/tasks/crawlergo/engine.NewTab.func1
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:90 +0x2e8

求开源爬虫

核心是爬虫，求开源...😂

爬虫爬取的链接问题

看fate0大佬发的视频中测试http://testphp.vulnweb.com/AJAX/index.php 这个网站可以获取到很多链接，但crawlergo只能获取到4个，页面a标签的js发出的请求是否可以考虑获取到。

针对新版本的event-trigger-mode,event-trigger-intervalbefore-exit-delay,选项, 有推荐的默认参数设置吗?

如题, 因为对这个几个新增选项的最佳配置不太熟悉, 在希望能最大限度的爬取接口的目标下, 怎么配置参数比较合理, 谢谢.

在Mac下运行出现WaitGroup重用的问题

环境:
Darwin ZBMAC-C02VQ02-5.local 17.2.0 Darwin Kernel Version 17.2.0: Fri Sep 29 18:27:05 PDT 2017; root:xnu-4570.20.62~3/RELEASE_X86_64 x86_64

命令:
./crawlergo -c /Applications/Chromium.app/Contents/MacOS/Chromium -f smart -o json -t 5 http://www.baidu.com

报错:
panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 93421 [running]:
sync.(*WaitGroup).Wait(0xc0093f59a0)
C:/Go/src/sync/waitgroup.go:132 +0xae
ioscan-ng/src/tasks/crawlergo/engine.(*Tab).Start.func3(0xc0093f5800)
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:229 +0x34
created by ioscan-ng/src/tasks/crawlergo/engine.(*Tab).Start
D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:227 +0x4f1

除此之外，选择-o json并未输出结果，所有请求均为timeout

navigate timeout Cannot navigate to invalid URL (-32000)target

ERRO[0000] navigate timeout 'Fetch.enable' wasn't found (-32601)

CentOS Linux release 7.6.1810 (Core)

[root@VM_0_17_centos data]# ./crawlergo -c /root/.local/share/pyppeteer/local-chromium/575458/chrome-linux/chrome -t 10 http://testphp.vulnweb.com
Crawling GET https://testphp.vulnweb.com/
Crawling GET http://testphp.vulnweb.com/
ERRO[0000] navigate timeout 'Fetch.enable' wasn't found (-32601)
ERRO[0000] https://testphp.vulnweb.com/
ERRO[0000] navigate timeout 'Fetch.enable' wasn't found (-32601)
ERRO[0000] http://testphp.vulnweb.com/
--[Mission Complete]--
GET http://testphp.vulnweb.com/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Spider-Name: crawlergo-0KeeTeam


GET https://testphp.vulnweb.com/ HTTP/1.1
Spider-Name: crawlergo-0KeeTeam
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36

能否增加一个爬虫入口（url列表）像awvs一样，因为有一些页面爬虫爬不到？

新版本报redis error

centos 6.5使用crawlergo缺少依赖问题

我使用你们提供的那个依赖解决后，把需要的安装完了，还是显示缺少依赖。是centos 6.5版本太低了不支持嘛。

针对IP的url地址会加很多//

http://11.2.3.2//////////////////////////////////////////////////////////////////////////////////shoppingcart.php?a=empty
正常的url应该是
http://11.2.3.2/shoppingcart.php?a=empty

长时间爬取会产生极大的临时文件

在windows下连续跑了4-5天，产生了40G+的chrome临时文件，看了几个文件名好像是CrashpadMetrics-active.pma
感觉需要处理一下生成的临时文件。

--push-to-proxy的建议

--push-to-proxy的建议
如果是文件名，内容是
http://56.67.8.0:9900
socks5://35.88.324.9:8080

目的是可以同时推给多个被动代理

系统调用py脚本输出结果问题

使用系统调用脚本如下:

输出结果如下:

支持代理配置

希望可以支持代理配置，这样可以方便在不同网络环境下进行测试，虽可以通过 proxychains 等方法实现，但是不如原生支持来的方便：）

python调用过程中会出现卡住的情况

部分LOG如下:

alling _exit(1). Core file will not be generated.
http:///components
WARN[0006] navigate timeout context deadline exceededhttp://A
WARN[0006] navigate timeout context deadline exceededhttp://A
INFO[0009] Crawling GET http://A/api.php
INFO[0009] Crawling GET http://A/uc_client/
WARN[0021] navigate timeout unable to execute *log.EnableParams: context deadline exceededhttp://A/connect.php
WARN[0021] navigate timeout unable to execute *log.EnableParams: context deadline exceededhttp://A/*?mod=misc*
INFO[0021] closing browser.
> 此次卡住了,无输出,按下enter也没反应,卡了24小时都无反应

这种情况已经出现了3次了, 无法定位出原因, 因为同样的python代码有时候]没有任何问题,有的时候就会卡住.

ERRO[0005] navigate timeout context deadline exceeded

试了几个网站都提示timeout，网络没问题

2052 ◯ ./crawlergo -c /opt/bugbounty/chrome-linux/chrome -t 20 http://testphp.vulnweb.com/
Crawling GET https://testphp.vulnweb.com/
Crawling GET http://testphp.vulnweb.com/
ERRO[0005] navigate timeout context deadline exceeded
ERRO[0005] http://testphp.vulnweb.com/
--[Mission Complete]--
GET http://testphp.vulnweb.com/ HTTP/1.1
Spider-Name: crawlergo-0KeeTeam
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36

GET https://testphp.vulnweb.com/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Spider-Name: crawlergo-0KeeTeam

一次性添加多个 target

如果有大量的页面要爬取，即有很多 target，每次开子进程运行 ./crawlergo target 开销有点大，能不能一次性加入爬取列表？

navigate timeout unable to execute *log.EnableParams: context deadline exceeded

环境：17.7.0 Darwin Kernel Version 17.7.0: Fri Oct 4 23:08:59 PDT 2019; root:xnu-4570.71.57~1/RELEASE_X86_64 x86_64

用的是最新版的crawlergo，chromium版本是18年的

请问win中怎么写路径

crawlergo -c /pachong/chrome -t 20 http://testphp.vulnweb.com/
crawlergo -c \pachong\chrome -t 20 http://testphp.vulnweb.com/
在win环境下都报错

windows下运行，所有站点都会报错timeout

$ crawlergo.exe -c .\GoogleChromePortable64\GoogleChromePortable.exe http://www.baidu.com
Crawling GET https://www.baidu.com/
Crawling GET http://www.baidu.com/
time="2019-12-31T10:56:43+08:00" level=error msg="navigate timeout chrome failed to start:\n"
time="2019-12-31T10:56:43+08:00" level=error msg="https://www.baidu.com/"
time="2019-12-31T10:56:43+08:00" level=debug msg="all navigation tasks done."
time="2019-12-31T10:56:43+08:00" level=error msg="navigate timeout chrome failed to start:\n"
time="2019-12-31T10:56:43+08:00" level=error msg="http://www.baidu.com/"
time="2019-12-31T10:56:43+08:00" level=debug msg="get comment nodes err"
time="2019-12-31T10:56:43+08:00" level=debug msg="all navigation tasks done."
time="2019-12-31T10:56:43+08:00" level=debug msg="invalid target"
time="2019-12-31T10:56:43+08:00" level=debug msg="get comment nodes err"
time="2019-12-31T10:56:43+08:00" level=debug msg="invalid target"
--[Mission Complete]--
GET http://www.baidu.com/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
Spider-Name: crawlergo-0KeeTeam

GET https://www.baidu.com/ HTTP/1.1
Spider-Name: crawlergo-0KeeTeam
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36

带端口的url瞬间返回结果

./crawlergo_linux -c chrome-linux/chrome -output-mode json http://A.B.com:80/
执行后: 瞬间返回如下:

--[Mission Complete]--
{"req_list":null,"all_domain_list":[xxxxx],"all_req_list":[xxxxx]}

但是:
./crawlergo_linux -c chrome-linux/chrome -output-mode json http://A.B.com/

Crawling GET http://A.B.com/
DEBU[0000] 
DEBU[0006] context deadline exceeded
--[Mission Complete]--
{"req_list":[xxxxx],"all_domain_list":[xxxxx],"sub_domain_list":[xxxxx]}

被检测出来爬虫问题

请问如何在发送爬虫请求的时候走代理池，避免被封ip

【需求】机器资源和爬虫属性限制

有些基准的爬虫需求，主要考虑到可能在爬取大型网站会存在一些问题。
内存大小限制；速率限制；chrome数量限制；cpu限制；

资源使用限制：如内存大小使用限制，CPU限制（3个tab就吃不消）。不知道这里调用的chrome headless能否控制，性能好的机器反而吃的资源多；
爬取速度限制：支持低速、中速？有些网站需要低频扫描。 tab数量和qps之前的关系还是很模糊的，性能好的机器1个tab对应很多chrome进程，速度也很快；
爬取层数限制；

Ubuntu运行出错问题

root@ubuntu:~/Desktop/crawlergo# ./crawlergo -c /Desktop/crawlergo/chrome-linux/chrome -t 20  http://testphp.vulnweb.com/
INFO[0000] Init crawler task, host: testphp.vulnweb.com, max tab count: 20, max crawl count: 200. 
INFO[0000] filter mode: smart                           
INFO[0000] Start crawling.                              
INFO[0000] filter repeat, target count: 2               
INFO[0000] Crawling GET https://testphp.vulnweb.com/    
WARN[0000] navigate timeout fork/exec /Desktop/crawlergo/chrome-linux/chrome: no such file or directoryhttps://testphp.vulnweb.com/ 
INFO[0000] Crawling GET http://testphp.vulnweb.com/     
WARN[0000] navigate timeout fork/exec /Desktop/crawlergo/chrome-linux/chrome: no such file or directoryhttp://testphp.vulnweb.com/ 
INFO[0000] closing browser.

运行操作如上已经赋予crawlergo文件的+x 权限请问这是什么情况？

新需求: 截图+负载爬虫模式

希望能对某些页面进行截图, 方便人工验证,(尤其是匹配到关键字"后台","管理"等.
希望能支持多子域+多fuzzpath的场景, 并且请求能均衡发送.

针对第二点,目前的方法是先用dirsearch fuzz出来的path, 筛选成list , 然后list加到一个string里面再用subprocess调用crawlergo, 这样的弊端也很显然.

不知道作者后期有无这方面的规划,thanks!

爬取链接错误。

yH5BAEAAAAALAAAAAABAAEAAAIBRAA7，感觉像图片里的。

crawlergo.exe -c "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --tab-run-timeout 120s   -f strict   -m 1  --wait-dom-content-loaded-timeout 30s --output-mode console   https://cn.vuejs.org/

crawlergo 直接退出了

0.12的版本下载后只有5.1M，不知道是精简过多了。执行后就直接退出了
➜ crawlergo mv ~/Downloads/crawlergo ./
➜ crawlergo chmod +x crawlergo
➜ crawlergo ./crawlergo
[1] 9838 killed ./crawlergo
➜ crawlergo ./crawlergo -h
[1] 9845 killed ./crawlergo -h
➜ crawlergo ./crawlergo
[1] 9852 killed ./crawlergo
➜ crawlergo

怎么获取表单中的参数？

POST表单中的参数怎么获取？

感谢分享如此好用的爬虫工具, 希望可以提供chromium中localStrorage 和附加数据的支持!

感谢大佬分享如此好用的爬虫工具
在使用过程中我发现在需要对一些需要认证页面现在的爬取有一点无力的感觉, 提供的Header的客制化只能应付一些利用Cookie作为凭据的场景, 在一些SPA的场景中, 作为凭据的Token往往会放在浏览器的LocalStorage 或者作为一个固定的数据附加在提交的Body中, 希望可以提供这两块地方的客制化. 希望大佬可以将上述的特性放在后期的更新中, 因为现在的很多页面大部分都需要认证的功能, 如果只是单一的爬取非认证的页面能得到信息比较有限.
再次更新感谢大佬的分享🙏 🙏 🙏 🙏 !

macos下运行crawlergo时浏览器路径有问题

您好,
我在使用crawlergo时，使用chrome浏览器的路径/Applications/Google\ Chrome.app好像没反应。
想问一下，macos的chrome浏览器路径什么样才是正确的。

还有就是我去你们提供的chromium浏览器下载了。路径是/Users/mac/Downloads/chrome-mac/Chromium.app

好像两个都不行。
最后，我也给crawlergo提供了权限了。chmod +x crawlergo

还是显示permission denied

如果已知路径比较多, 手工拼接比较麻烦
这种拼接传参的方法和分开一个个执行得到的结果是一样? 还是说有差别,没有进行验证.

当然后期能有参数支持多路径作为入口最好不过.

Originally posted by @djerrystyle in #31 (comment)