
crawley's Issues

cookies loaded but login page detected

Hello, when I use this command:
crawley -depth -1 -dirs only -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt
crawley scrapes the forum's login page instead of the thread I selected, and it doesn't return any errors:

2023/03/20 16:08:58 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/20 16:08:58 [*] crawling url: https://XXXXXX.net/viewtopic.php?p=5018859
2023/03/20 16:09:00 [*] complete

I tried with different user agents and headers; every time the result is the same: the login page of the forum. The cookies are OK, I copied them using the EditThisCookie Google Chrome extension. I'm not using any VPN/proxy.
I also tested other forums, and the result is always the same: I can't log in.
Do you know if there is a problem loading cookies?
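One way to narrow this down is to replay the exact Cookie header outside crawley and check whether the response is still the login page. The sketch below is hedged: the `ucp.php?mode=login` marker is an assumption about phpBB's login form, and the URL/cookie values are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// looksLikeLoginPage is a rough heuristic: phpBB login forms typically
// post to ucp.php?mode=login, so its presence suggests the session
// cookies were not accepted. (The marker string is an assumption.)
func looksLikeLoginPage(body string) bool {
	return strings.Contains(body, "ucp.php?mode=login")
}

// fetchWithCookies sends the same Cookie header a browser would,
// so the cookies can be tested independently of crawley.
func fetchWithCookies(target, cookieHeader string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Cookie", cookieHeader)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// Demonstrate the heuristic on a sample snippet; for a real check,
	// call fetchWithCookies with your forum URL and cookie string.
	sample := `<form action="./ucp.php?mode=login" method="post">`
	fmt.Println("login page detected:", looksLikeLoginPage(sample))
}
```

If this prints `login page detected: true` for the real forum response too, the cookies themselves are being rejected and the problem is not specific to crawley.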

Deadlock error

I want to scrape the forum www.invitehawk.com, but I get this deadlock error. I know it's a cookie error; in this case I can scrape even without cookies, but I'd like to know the correct cookie syntax. I'm using the Netscape cookie format.

➜ crawley --cookie invitehawk_ck.txt -depth -1 -dirs only "https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/" | grep -E '.*20[0-9][0-9]-review' > url.list.ih.txt

2023/03/19 10:17:11 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/19 10:17:11 [*] crawling url: https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x0?)
	/opt/hostedtoolcache/go/1.20.0/x64/src/runtime/sema.go:62 +0x27
sync.(*WaitGroup).Wait(0xc0000b01e0?)
	/opt/hostedtoolcache/go/1.20.0/x64/src/sync/waitgroup.go:116 +0x4b
github.com/s0rg/crawley/pkg/crawler.(*Crawler).close(0xc0000b2840)
	/home/runner/work/crawley/crawley/pkg/crawler/crawler.go:201 +0x65
panic({0x6c1960, 0xc0000c20c0})
	/opt/hostedtoolcache/go/1.20.0/x64/src/runtime/panic.go:884 +0x213
github.com/s0rg/crawley/pkg/client.parseOne({0x7ffc5e21c176?, 0x11?})
	/home/runner/work/crawley/crawley/pkg/client/cookie.go:35 +0x13d
github.com/s0rg/crawley/pkg/client.prepareCookies({0xc00009e430?, 0x1, 0xc000170000?})
	/home/runner/work/crawley/crawley/pkg/client/cookie.go:17 +0x13c
github.com/s0rg/crawley/pkg/client.New({0xc0000cc280, 0x3e}, 0x4, 0x0, {0xc00009e440, 0x1, 0x1}, {0xc00009e430, 0x1, 0x1})
	/home/runner/work/crawley/crawley/pkg/client/http.go:57 +0x1f8
github.com/s0rg/crawley/pkg/crawler.(*Crawler).Run(0xc0000b2840, {0x7ffc5e21c19d, 0x50}, 0x70ee68)
	/home/runner/work/crawley/crawley/pkg/crawler/crawler.go:99 +0x29b
main.crawl({0x7ffc5e21c19d, 0x50}, {0xc0000b4200?, 0x0?, 0xc0000406c8?})
	/home/runner/work/crawley/crawley/cmd/crawley/main.go:94 +0xe5
main.main()
	/home/runner/work/crawley/crawley/cmd/crawley/main.go:235 +0x188
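The trace shows the panic originating in crawley's cookie parser (`client/cookie.go`), which fits the command above passing the file path `invitehawk_ck.txt` as the cookie string itself. Assuming `-cookie` wants `key=value` strings rather than a file, a hedged sketch for converting a standard Netscape `cookies.txt` (tab-separated: domain, subdomain flag, path, secure, expiry, name, value) into such pairs:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// netscapeToPairs converts Netscape cookies.txt content into
// "key=value" strings, one per cookie.
func netscapeToPairs(data string) []string {
	var pairs []string

	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip comments and blank lines
		}
		// Fields: domain, flag, path, secure, expiry, name, value.
		fields := strings.Split(line, "\t")
		if len(fields) < 7 {
			continue // not a valid Netscape cookie line
		}
		pairs = append(pairs, fields[5]+"="+fields[6])
	}

	return pairs
}

func main() {
	sample := "# Netscape HTTP Cookie File\n" +
		".www.invitehawk.com\tTRUE\t/\tTRUE\t1700000000\tsession_id\tabc123\n"
	for _, p := range netscapeToPairs(sample) {
		fmt.Println(p)
	}
}
```

Each resulting pair could then be passed via its own `-cookie` flag instead of the file path.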

cannot parse string as cookie

I tried several commands:

  • (with spaces) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (without spaces) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;phpbb3_ddu4final_sid=XXXXX;phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (single value) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (separated values) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;" -cookie "phpbb3_ddu4final_sid=XXXXX;" -cookie "phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (without other parameters) crawley -cookie "phpbb3_ddu4final_k=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
Every time, the result is "cannot parse the string as cookie":
2023/03/29 23:36:18 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/29 23:36:18 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/03/29 23:36:18 cannot parse 'phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;' as cookie, expected format: 'key=value;' as in curl
2023/03/29 23:36:20 [*] complete

It works when I don't use a semicolon:

  • (single value) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt

Crawley v1.5.12-a1f6de2 (archlinux)
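Given that only the semicolon-free `key=value` form parsed above, one workaround is to split a browser-copied Cookie header into individual values before passing them. A minimal sketch (not crawley's own parser):

```go
package main

import (
	"fmt"
	"strings"
)

// splitCookieHeader breaks a browser-style Cookie header such as
// "a=1; b=2;" into separate "key=value" strings with no trailing
// semicolons, the only form that parsed successfully above.
func splitCookieHeader(header string) []string {
	var out []string
	for _, part := range strings.Split(header, ";") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // drop the empty piece after a trailing ";"
		}
		out = append(out, part)
	}
	return out
}

func main() {
	h := "phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;"
	for _, c := range splitCookieHeader(h) {
		fmt.Printf("-cookie %q ", c)
	}
	fmt.Println()
}
```

The printed fragments can be pasted onto the crawley command line, one `-cookie` flag per value.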

Support ignoring URL params

Add a flag to skip scraping the same URL when only its query params differ.
For example, while scraping the website https://abc.com, the flag would disable scraping both https://abc.com/something.php?lang=en and https://abc.com/something.php?lang=ru, since they are the same page but with different params.
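Such de-duplication could key each page on its URL minus the query string. A sketch using Go's net/url (an illustration of the idea, not crawley's implementation):

```go
package main

import (
	"fmt"
	"net/url"
)

// dedupKey returns a URL stripped of its query string and fragment,
// so variants that differ only in parameters collapse to one key.
func dedupKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	a, _ := dedupKey("https://abc.com/something.php?lang=en")
	b, _ := dedupKey("https://abc.com/something.php?lang=ru")
	fmt.Println(a == b, a) // true https://abc.com/something.php
}
```

A crawler tracking visited pages in a `map[string]struct{}` keyed this way would fetch `something.php` only once.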

Thanks!
