
crawley's Issues

cookies loaded but login page detected

Hello, when I use this command:
crawley -depth -1 -dirs only -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt
crawley scrapes the forum's login page instead of the thread I selected, and it doesn't return any errors:

2023/03/20 16:08:58 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/20 16:08:58 [*] crawling url: https://XXXXXX.net/viewtopic.php?p=5018859
2023/03/20 16:09:00 [*] complete

I tried with different user agents and headers; every time the result is the same: the login page of the forum. The cookies are OK, I copied them using the EditThisCookie Google Chrome extension. I'm not using any VPN/proxy.
I also tested other forums, and the result is always the same: I can't log in.
Do you know if there is a problem loading cookies?
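One way to narrow this down is to replay the exact Cookie header outside crawley and check whether the response is still the login page. The sketch below is hedged: the `ucp.php?mode=login` marker is an assumption about phpBB's login form, and the URL/cookie values are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// looksLikeLoginPage is a rough heuristic: phpBB login forms typically
// post to ucp.php?mode=login, so its presence suggests the session
// cookies were not accepted. (The marker string is an assumption.)
func looksLikeLoginPage(body string) bool {
	return strings.Contains(body, "ucp.php?mode=login")
}

// fetchWithCookies sends the same Cookie header a browser would,
// so the cookies can be tested independently of crawley.
func fetchWithCookies(target, cookieHeader string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Cookie", cookieHeader)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// Demonstrate the heuristic on a sample snippet; for a real check,
	// call fetchWithCookies with your forum URL and cookie string.
	sample := `<form action="./ucp.php?mode=login" method="post">`
	fmt.Println("login page detected:", looksLikeLoginPage(sample))
}
```

If this prints `login page detected: true` for the real forum response too, the cookies themselves are being rejected and the problem is not specific to crawley.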

Deadlock error

I want to scrape the forum www.invitehawk.com, but I get this deadlock error. I know it's a cookie error; in this case I can scrape even without cookies, but I'd like to know the correct cookie syntax. I'm using the Netscape cookie format.

➜ crawley --cookie invitehawk_ck.txt -depth -1 -dirs only "https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/" | grep -E '.*20[0-9][0-9]-review' > url.list.ih.txt

2023/03/19 10:17:11 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/19 10:17:11 [*] crawling url: https://www.invitehawk.com/topic/126147-tracker-review-index-sorted-by-category/
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x0?)
	/opt/hostedtoolcache/go/1.20.0/x64/src/runtime/sema.go:62 +0x27
sync.(*WaitGroup).Wait(0xc0000b01e0?)
	/opt/hostedtoolcache/go/1.20.0/x64/src/sync/waitgroup.go:116 +0x4b
github.com/s0rg/crawley/pkg/crawler.(*Crawler).close(0xc0000b2840)
	/home/runner/work/crawley/crawley/pkg/crawler/crawler.go:201 +0x65
panic({0x6c1960, 0xc0000c20c0})
	/opt/hostedtoolcache/go/1.20.0/x64/src/runtime/panic.go:884 +0x213
github.com/s0rg/crawley/pkg/client.parseOne({0x7ffc5e21c176?, 0x11?})
	/home/runner/work/crawley/crawley/pkg/client/cookie.go:35 +0x13d
github.com/s0rg/crawley/pkg/client.prepareCookies({0xc00009e430?, 0x1, 0xc000170000?})
	/home/runner/work/crawley/crawley/pkg/client/cookie.go:17 +0x13c
github.com/s0rg/crawley/pkg/client.New({0xc0000cc280, 0x3e}, 0x4, 0x0, {0xc00009e440, 0x1, 0x1}, {0xc00009e430, 0x1, 0x1})
	/home/runner/work/crawley/crawley/pkg/client/http.go:57 +0x1f8
github.com/s0rg/crawley/pkg/crawler.(*Crawler).Run(0xc0000b2840, {0x7ffc5e21c19d, 0x50}, 0x70ee68)
	/home/runner/work/crawley/crawley/pkg/crawler/crawler.go:99 +0x29b
main.crawl({0x7ffc5e21c19d, 0x50}, {0xc0000b4200?, 0x0?, 0xc0000406c8?})
	/home/runner/work/crawley/crawley/cmd/crawley/main.go:94 +0xe5
main.main()
	/home/runner/work/crawley/crawley/cmd/crawley/main.go:235 +0x188
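The trace shows the panic originating in crawley's cookie parser (`client/cookie.go`), which fits the command above passing the file path `invitehawk_ck.txt` as the cookie string itself. Assuming `-cookie` wants `key=value` strings rather than a file, a hedged sketch for converting a standard Netscape `cookies.txt` (tab-separated: domain, subdomain flag, path, secure, expiry, name, value) into such pairs:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// netscapeToPairs converts Netscape cookies.txt content into
// "key=value" strings, one per cookie.
func netscapeToPairs(data string) []string {
	var pairs []string

	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip comments and blank lines
		}
		// Fields: domain, flag, path, secure, expiry, name, value.
		fields := strings.Split(line, "\t")
		if len(fields) < 7 {
			continue // not a valid Netscape cookie line
		}
		pairs = append(pairs, fields[5]+"="+fields[6])
	}

	return pairs
}

func main() {
	sample := "# Netscape HTTP Cookie File\n" +
		".www.invitehawk.com\tTRUE\t/\tTRUE\t1700000000\tsession_id\tabc123\n"
	for _, p := range netscapeToPairs(sample) {
		fmt.Println(p)
	}
}
```

Each resulting pair could then be passed via its own `-cookie` flag instead of the file path.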

cannot parse string as cookie

I tried several commands:

  • (with spaces) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (without spaces) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;phpbb3_ddu4final_sid=XXXXX;phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (single value) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (separated values) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX;" -cookie "phpbb3_ddu4final_sid=XXXXX;" -cookie "phpbb3_ddu4final_u=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
  • (without other parameters) crawley -cookie "phpbb3_ddu4final_k=XXXXX;" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt
Every time, the result is "cannot parse the string as cookie":
2023/03/29 23:36:18 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/29 23:36:18 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/03/29 23:36:18 cannot parse 'phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;' as cookie, expected format: 'key=value;' as in curl
2023/03/29 23:36:20 [*] complete

It works when I don't use a semicolon:

  • (single value) crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXXXX" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt

Crawley v1.5.12-a1f6de2 (archlinux)
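Given that only the semicolon-free `key=value` form parsed above, one workaround is to split a browser-copied Cookie header into individual values before passing them. A minimal sketch (not crawley's own parser):

```go
package main

import (
	"fmt"
	"strings"
)

// splitCookieHeader breaks a browser-style Cookie header such as
// "a=1; b=2;" into separate "key=value" strings with no trailing
// semicolons, the only form that parsed successfully above.
func splitCookieHeader(header string) []string {
	var out []string
	for _, part := range strings.Split(header, ";") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // drop the empty piece after a trailing ";"
		}
		out = append(out, part)
	}
	return out
}

func main() {
	h := "phpbb3_ddu4final_k=XXXXX; phpbb3_ddu4final_sid=XXXXX; phpbb3_ddu4final_u=XXXXX;"
	for _, c := range splitCookieHeader(h) {
		fmt.Printf("-cookie %q ", c)
	}
	fmt.Println()
}
```

The printed fragments can be pasted onto the crawley command line, one `-cookie` flag per value.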

Support ignoring URL params

Add a flag to skip scraping the same URL when only its query params differ.
For example, while scraping the website https://abc.com, the flag would disable scraping both https://abc.com/something.php?lang=en and https://abc.com/something.php?lang=ru, since they are the same page but with different params.
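Such de-duplication could key each page on its URL minus the query string. A sketch using Go's net/url (an illustration of the idea, not crawley's implementation):

```go
package main

import (
	"fmt"
	"net/url"
)

// dedupKey returns a URL stripped of its query string and fragment,
// so variants that differ only in parameters collapse to one key.
func dedupKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	a, _ := dedupKey("https://abc.com/something.php?lang=en")
	b, _ := dedupKey("https://abc.com/something.php?lang=ru")
	fmt.Println(a == b, a) // true https://abc.com/something.php
}
```

A crawler tracking visited pages in a `map[string]struct{}` keyed this way would fetch `something.php` only once.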

Thanks!
