spider-rs / spider Goto Github PK

View Code? Open in Web Editor NEW

631.0 10.0 66.0 1.5 MB

The fastest web crawler written in Rust. Maintained by @a11ywatch.

Home Page: https://spider.cloud

License: MIT License

Rust 100.00%

crawler indexer rust spider scraping headless-chrome

spider's People

Contributors

Stargazers

Watchers

Forkers

zoosky leiz-me dracher dragnucs dotslashrun parallaxisjones blanksteer liinns kirinse xiaobin0860 quietlychris mhmtbsbyndr ahcm hechler juanparati linbuxiao balroggg knassar702 huqingwei311 5l1v3r1 captainlazarus benmarsden njirubryance sarroutbi kelison zyycharles skeptrunedev devflowinc roniemartinez bspaceinc oyelowo shubhamg1995 davidftv dxvid-pts liupeitao akhiln28 bodhi-labs thiagoguislotti beckend marlonbaeten iancormac84 kulak nosterx humanely viecks adaschevici jonaslh4 randomn4me felixengl javiermonton iamazy jackyvan danloh julianzhang ayourtch nfkt wittech xinmi88 mu-l azwhale cxapython forsummer houseme lxlpp123 orionhunts-ai emilsivervik

spider's Issues

error[E0061]: this function takes 2 arguments but 1 argument was supplied

   Compiling spider v1.42.1
error[E0061]: this function takes 2 arguments but 1 argument was supplied
   --> /Users/roniemartinez/.cargo/registry/src/index.crates.io-6f17d22bba15001f/spider-1.42.1/src/website.rs:857:37
    |
857 |                         Some(cb) => cb(u),
    |                                     ^^--- an argument of type `std::option::Option<string_concat::String>` is missing
    |
help: provide the argument
    |
857 |                         Some(cb) => cb(u, /* std::option::Option<string_concat::String> */),
    |                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

error[E0308]: `match` arms have incompatible types
   --> /Users/roniemartinez/.cargo/registry/src/index.crates.io-6f17d22bba15001f/spider-1.42.1/src/website.rs:858:30
    |
856 |                       let link_result = match self.on_link_find_callback {
    |  _______________________________________-
857 | |                         Some(cb) => cb(u),
    | |                                     -----
    | |                                     |  |
    | |                                     |  here the type of `u` is inferred to be `CaseInsensitiveString`
    | |                                     this is found to be of type `(CaseInsensitiveString, std::option::Option<string_concat::String>)`
858 | |                         _ => u,
    | |                              ^ expected `(CaseInsensitiveString, ...)`, found `CaseInsensitiveString`
859 | |                     };
    | |_____________________- `match` arms have incompatible types
    |
    = note: expected tuple `(CaseInsensitiveString, std::option::Option<string_concat::String>)`
              found struct `CaseInsensitiveString`

Some errors have detailed explanations: E0061, E0308.
For more information about an error, try `rustc --explain E0061`.
error: could not compile `spider` (lib) due to 2 previous errors

Scraping timeout Issue

I am encountering an issue while scraping a list of websites (~400). I have implemented a timeout of 5 minutes for each entry, and if the crawling process takes longer, it should skip to the next website. However, sometimes execution gets stuck at some entry and does not proceed further, even after implementing the timeout.

code:
` async fn fetch_and_store(companies: Vec, output_file: &str,) -> Result<(), Box> {

let timeout_duration = Duration::from_secs(300);

let result: Vec<CompanyInfo> = companies.par_iter().map(|company| {
    info!("crawling:: {}", company.website);
    let link_start_time = Instant::now();
    let rt = Runtime::new().unwrap();

    let website = rt.block_on(async {
        
        let mut website: Website = Website::new(&company.website);
        
        let mut headers = header::HeaderMap::new();
        headers.insert(
            header::USER_AGENT,
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
                .parse()
                .unwrap(),
        );

        website.with_headers(Some(headers));
        website.with_respect_robots_txt(true);
        // website.scrape().await;
        match timeout(timeout_duration, website.scrape()).await {
            Ok(_) => {
                // success
                info!("crawl completed:: {}", company.website)
            }
            Err(_) => {
                info!("crawl timed out after {:?} seconds", timeout_duration);
            }
        }
        website
    });

    let links: Vec<CaseInsensitiveString> = website.get_links().iter().cloned().collect();
    let pages: Vec<PageWrapper> = website
        .get_pages()
        .map_or_else(|| Vec::<PageWrapper>::new(), |pages| pages.iter().map(|page| (page).into()).collect());

    let link_elapsed = link_start_time.elapsed();
    CompanyInfo {
        company: company.company.to_owned(),
        link: company.website.to_owned(),
        links,
        pages,
    }
}).collect();

info!("crawled all companies");

let elapsed = start_time.elapsed();

let json_result = serde_json::to_string_pretty(&result).unwrap();
let mut output_file = File::create(output_file)?;
output_file.write_all(json_result.as_bytes())?;

info!("Results stored in {:?}", output_file);
info!("Total time taken: {:?}", elapsed);

Ok(())

Is it possible to extract broken links from the crawl?

Currently it seems like the crawler only returns working links, but I'm interested in all links on the page, especially the ones that do not work.

[Bug] Follows external website on redirect (302, 301, 3XX)

First of all kudos and thank you for fixing the bugs that I report @j-mendez

It seems that the spider still follow redirects even when the domain is not same to the input domain.

Example: (not actual websites)

Sequence	GET	Response	Comment
1	https://example.com/redirect-to-another	302 http://choosealicense.com	Should have stopped here immediately
2	http://choosealicense.com	301 https://choosealicense.com	Outside of target domain
3	https://choosealicense.com	200	Outside of target domain

only let me spider one url

I'd like to use the spider cli to extract the links in or url exactly. no following links.

thank you for considering.

Remove `.DS_Store` and add it to `.gitignore`

Async runtime

When this module is used in an API manner it would be good to not have blocking request.
Opting into using an optional tokio runtime or if a runtime is detected on the parent installation to use it.

Start with looking into basic switch to async if set with feature = [async']`.

With this change we can lower the default crawl delay willingly. ( crawl delay is still good for avoiding being detected as a bot in situations )

Some pages have 0 bytes from scraped page. After rerunning, different pages have 0 bytes

Running the following, I see 24 pages of 187 that have size 0. Is there a way to retry pages that have no body?

// crawl_and_scrape_urls("https://rsseau.fr").await;

pub async fn crawl_and_scrape_urls(webpage: &str) {
    let mut website: Website = Website::new(webpage)
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_caching(cfg!(feature = "cache"))
        .with_delay(200)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    let start = Instant::now();
    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("found {:?}, size: {}, is_some:{}, status:{:?}, {:?}", page.get_url(), page.get_bytes().map(|b| b.len()).unwrap_or_default(), page.get_bytes().is_some(), page.status_code, start.elapsed());
        }
    });

    website.crawl().await;
}

Output

/usr/bin/cargo run --color=always --package scraper --bin scraper
    Finished dev [unoptimized + debuginfo] target(s) in 0.08s
     Running `target/debug/scraper`
found "https://rsseau.fr", size: 18790, is_some:true, status:200, 1.334018279s
found "https://rsseau.fr/en/resume", size: 12010, is_some:true, status:200, 2.726973448s
found "https://rsseau.fr/books", size: 298, is_some:true, status:200, 2.740495469s
found "https://rsseau.fr/en/books", size: 10999, is_some:true, status:200, 3.046203878s
found "https://rsseau.fr/en/blog", size: 25008, is_some:true, status:200, 3.066285103s
found "https://rsseau.fr/en", size: 25954, is_some:true, status:200, 4.052980776s
found "https://rsseau.fr/blog", size: 25954, is_some:true, status:200, 4.062893583s
found "https://rsseau.fr/fr", size: 25954, is_some:true, status:200, 4.07402141s
found "https://rsseau.fr/en/blog/change-domain-without-kill-seo", size: 26395, is_some:true, status:200, 5.016811423s
found "https://rsseau.fr/en/tag/seo", size: 8482, is_some:true, status:200, 5.647036121s
found "https://rsseau.fr/en/tag/nodejs", size: 8606, is_some:true, status:200, 5.653839305s
found "https://rsseau.fr/en/blog/hack-password-wpa-wifi", size: 22011, is_some:true, status:200, 6.29619579s
found "https://rsseau.fr/en/blog/stripe", size: 99792, is_some:true, status:200, 6.47515529s
found "https://rsseau.fr/en/tag/hack", size: 8507, is_some:true, status:200, 6.92102344s
found "https://rsseau.fr/en/tag/gcloud", size: 8540, is_some:true, status:200, 6.924143442s
found "https://rsseau.fr/en/blog/migrate-from-jekyll-to-gatsby", size: 30831, is_some:true, status:200, 6.931287066s
found "https://rsseau.fr/en/tag/curl", size: 0, is_some:false, status:200, 7.254878261s
found "https://rsseau.fr/en/blog/google-oauth", size: 0, is_some:false, status:200, 7.255088685s
found "https://rsseau.fr/en/tag/jekyll", size: 0, is_some:false, status:200, 7.255194972s
found "https://rsseau.fr/en/tag/capistrano", size: 0, is_some:false, status:200, 7.255290269s
found "https://rsseau.fr/en/tag/apache", size: 9463, is_some:true, status:200, 8.193197751s
found "https://rsseau.fr/en/tag/rails", size: 11183, is_some:true, status:200, 8.196521026s
found "https://rsseau.fr/en/blog/lazy-load-components", size: 28556, is_some:true, status:200, 8.203452536s
found "https://rsseau.fr/en/tag/Wi-fi", size: 8514, is_some:true, status:200, 8.836947024s
found "https://rsseau.fr/en/tag/leanpub", size: 8470, is_some:true, status:200, 8.839509473s
found "https://rsseau.fr/en/tag/vuejs", size: 8362, is_some:true, status:200, 8.992576543s
found "https://rsseau.fr/en/tag/scrum", size: 8488, is_some:true, status:200, 8.996547947s
found "https://rsseau.fr/en/tag/kali", size: 8507, is_some:true, status:200, 9.483346134s
found "https://rsseau.fr/en/blog/setup-typeorm-and-inversify", size: 131390, is_some:true, status:200, 9.668443071s
found "https://rsseau.fr/en/tag/javascript", size: 0, is_some:false, status:200, 9.802099715s
found "https://rsseau.fr/en/blog/deploy-static-website-like-netlify-in-bash", size: 0, is_some:false, status:200, 9.802235151s
found "https://rsseau.fr/en/blog/typescript-user-defined-type-guard-copy", size: 0, is_some:false, status:200, 9.802343697s
found "https://rsseau.fr/en/books/", size: 10999, is_some:true, status:200, 10.586840931s
found "https://rsseau.fr/en/tag/bash", size: 9052, is_some:true, status:200, 10.739109977s
found "https://rsseau.fr/en/tag/docker", size: 9603, is_some:true, status:200, 10.744324133s
found "https://rsseau.fr/en/blog/gcloud-deploy-with-gitlabci", size: 34133, is_some:true, status:200, 10.756706391s
found "https://rsseau.fr/en/tag/activestorage", size: 8458, is_some:true, status:200, 11.37978689s
found "https://rsseau.fr/en/tag/oauth", size: 8450, is_some:true, status:200, 11.382555012s
found "https://rsseau.fr/en/blog/daily-stand-up-with-markdown", size: 14456, is_some:true, status:200, 11.39337682s
found "https://rsseau.fr/en/blog/express-typescript", size: 149343, is_some:true, status:200, 11.591208215s
found "https://rsseau.fr/en/tag/devops", size: 8540, is_some:true, status:200, 12.344568176s
found "https://rsseau.fr/en/tag/sequelize", size: 8572, is_some:true, status:200, 12.349304596s
found "https://rsseau.fr/en/blog/api-on-rails", size: 20730, is_some:true, status:200, 12.985909389s
found "https://rsseau.fr/en/tag/organisation", size: 8537, is_some:true, status:200, 12.997187653s
found "https://rsseau.fr/en/tag/wpa", size: 8500, is_some:true, status:200, 13.000430931s
found "https://rsseau.fr/en/blog/debug-nodejs-with-vscode", size: 35983, is_some:true, status:200, 13.006235807s
found "https://rsseau.fr/en/tag/google", size: 9417, is_some:true, status:200, 13.62642731s
found "https://rsseau.fr/en/tag/typescript", size: 11504, is_some:true, status:200, 13.631146821s
found "https://rsseau.fr/en/tag/gatsby", size: 38297, is_some:true, status:200, 14.319829988s
found "https://rsseau.fr/en/tag/zip", size: 38297, is_some:true, status:200, 14.34285717s
found "https://rsseau.fr/en/tag/typeorm", size: 38297, is_some:true, status:200, 14.363938214s
found "https://rsseau.fr/en/tag/vscode", size: 38297, is_some:true, status:200, 14.379952857s
found "https://rsseau.fr/en/tag/gitlabci", size: 38297, is_some:true, status:200, 14.390048828s
found "https://rsseau.fr/en/tag/book", size: 8449, is_some:true, status:200, 14.728185969s
found "https://rsseau.fr/en/tag/express", size: 8558, is_some:true, status:200, 14.735000263s
found "https://rsseau.fr/en/tag/plaintext", size: 8516, is_some:true, status:200, 15.687281919s
found "https://rsseau.fr/en/tag/api", size: 8423, is_some:true, status:200, 15.692422077s
found "https://rsseau.fr/en/tag/quick", size: 8412, is_some:true, status:200, 15.695330775s
found "https://rsseau.fr/en/blog/zip-active-storage", size: 66668, is_some:true, status:200, 15.723135836s
found "https://rsseau.fr/en/blog/deploy-rails", size: 8608, is_some:true, status:200, 16.518250342s
found "https://rsseau.fr/en/tag/stripe", size: 8608, is_some:true, status:200, 16.520390223s
found "https://rsseau.fr/en/blog/gnu-parallel", size: 8608, is_some:true, status:200, 16.522275453s
found "https://rsseau.fr/fr/blog/mise-en-place-typeorm-et-inversify", size: 225485, is_some:true, status:200, 17.699066835s
found "https://rsseau.fr/fr/blog/deployer-rails-avec-capistrano", size: 41982, is_some:true, status:200, 18.124187429s
found "https://rsseau.fr/fr/blog/stripe", size: 102611, is_some:true, status:200, 18.141562479s
found "https://rsseau.fr/fr/blog/daily-stand-up-avec-markdown", size: 14740, is_some:true, status:200, 18.912935428s
found "https://rsseau.fr/fr/blog/changer-de-domain-sans-tuer-le-seo", size: 28030, is_some:true, status:200, 18.919045855s
found "https://rsseau.fr/fr/blog/migration-de-jekyll-a-gatsby", size: 34258, is_some:true, status:200, 19.570102397s
found "https://rsseau.fr/fr/blog/deboguer-nodejs-avec-vscode", size: 36334, is_some:true, status:200, 19.588353609s
found "https://rsseau.fr/fr/blog/mise-en-place-express-typescript", size: 152031, is_some:true, status:200, 19.776439555s
found "https://rsseau.fr/fr/blog/retour-experience-api-on-rails", size: 69344, is_some:true, status:200, 20.423467775s
found "https://rsseau.fr/fr/blog/zip-active-storage", size: 69344, is_some:true, status:200, 20.445701542s
found "https://rsseau.fr/fr/2020-03-02-rapport-automatique-avec-goaccess/", size: 69344, is_some:true, status:200, 20.466012771s
found "https://rsseau.fr/fr/blog/deployer-sur-gcloud-deploy-avec-gitlabci", size: 69344, is_some:true, status:200, 20.486064527s
found "https://rsseau.fr/fr/blog/hacker-password-wpa-wifi", size: 69344, is_some:true, status:200, 20.506047966s
found "https://rsseau.fr/fr/blog/reverse-proxy-apache", size: 16276, is_some:true, status:200, 21.608815143s
found "https://rsseau.fr/fr/resume", size: 12074, is_some:true, status:200, 22.247370263s
found "https://rsseau.fr/fr/tag/twitter-bootstrap-3", size: 8897, is_some:true, status:200, 22.261193406s
found "https://rsseau.fr/fr/tag/ruby", size: 12934, is_some:true, status:200, 22.892540364s
found "https://rsseau.fr/fr/blog/migrer-une-application-rails-vers-mariadb", size: 22349, is_some:true, status:200, 22.896557347s
found "https://rsseau.fr/fr/tag/docker", size: 10863, is_some:true, status:200, 23.531697766s
found "https://rsseau.fr/fr/tag/activestorage", size: 8503, is_some:true, status:200, 23.545107492s
found "https://rsseau.fr/fr/tag/sequelize", size: 8639, is_some:true, status:200, 23.547406859s
found "https://rsseau.fr/fr/tag/google", size: 9542, is_some:true, status:200, 24.17552635s
found "https://rsseau.fr/fr/blog", size: 46385, is_some:true, status:200, 24.205842962s
found "https://rsseau.fr/fr/tag/networking", size: 9366, is_some:true, status:200, 24.825658916s
found "https://rsseau.fr/fr/blog/typescript-generateur", size: 33246, is_some:true, status:200, 24.837728815s
found "https://rsseau.fr/fr/tag/stripe", size: 8484, is_some:true, status:200, 25.461133317s
found "https://rsseau.fr/fr/tag/scrum", size: 8537, is_some:true, status:200, 25.465922365s
found "https://rsseau.fr/fr/tag/expressjs", size: 8847, is_some:true, status:200, 26.105561874s
found "https://rsseau.fr/fr/tag/zip", size: 8433, is_some:true, status:200, 26.111223205s
found "https://rsseau.fr/fr/tag/seo", size: 9413, is_some:true, status:200, 26.750228634s
found "https://rsseau.fr/fr/tag/typeorm", size: 8833, is_some:true, status:200, 26.754363334s
found "https://rsseau.fr/fr/books", size: 11133, is_some:true, status:200, 26.760243558s
found "https://rsseau.fr/fr/tag/benchmark", size: 8684, is_some:true, status:200, 27.390576402s
found "https://rsseau.fr/fr/tag/plaintext", size: 8565, is_some:true, status:200, 27.397710367s
found "https://rsseau.fr/fr/tag/api", size: 8463, is_some:true, status:200, 28.036268311s
found "https://rsseau.fr/fr/tag/inversify", size: 8847, is_some:true, status:200, 28.039097112s
found "https://rsseau.fr/fr/tag/mariadb", size: 8715, is_some:true, status:200, 28.674843866s
found "https://rsseau.fr/fr/tag/raspberrypi", size: 8743, is_some:true, status:200, 28.683088186s
found "https://rsseau.fr/fr/tag/capistrano", size: 8569, is_some:true, status:200, 29.31537255s
found "https://rsseau.fr/fr/tag/rails", size: 14574, is_some:true, status:200, 29.317693106s
found "https://rsseau.fr/fr/blog/comparaison-server-apache-ruby-rpi2-vs-rp3", size: 16725, is_some:true, status:200, 30.786770496s
found "https://rsseau.fr/fr/blog/go-back-to-jekyll", size: 16725, is_some:true, status:200, 30.793385397s
found "https://rsseau.fr/fr/tag/mysql", size: 16725, is_some:true, status:200, 30.798532784s
found "https://rsseau.fr/fr/tag/express", size: 16725, is_some:true, status:200, 30.801998355s
found "https://rsseau.fr/fr/tag/javascript", size: 16725, is_some:true, status:200, 30.805447475s
found "https://rsseau.fr/fr/tag/apache", size: 16725, is_some:true, status:200, 30.808880987s
found "https://rsseau.fr/fr/blog/daily-stand-up-with-markdown", size: 16725, is_some:true, status:200, 30.812327328s
found "https://rsseau.fr/fr/tag/raspberry", size: 16725, is_some:true, status:200, 30.81575131s
found "https://rsseau.fr/fr/tag/typescript", size: 16725, is_some:true, status:200, 30.819181851s
found "https://rsseau.fr/fr/tag/organisation", size: 16725, is_some:true, status:200, 30.822607393s
found "https://rsseau.fr/fr/tag/jekyll", size: 16725, is_some:true, status:200, 30.826098744s
found "https://rsseau.fr/fr/tag/jquery", size: 16725, is_some:true, status:200, 30.829595233s
found "https://rsseau.fr/fr/tag/gatsby", size: 16725, is_some:true, status:200, 30.834674862s
found "https://rsseau.fr/fr/tag/vscode", size: 16725, is_some:true, status:200, 30.838163752s
found "https://rsseau.fr/fr/blog/afficher-les-erreurs-d-un-formulaire-en-ajax-avec-twitter-bootstrap-et-rails", size: 16725, is_some:true, status:200, 30.841690321s
found "https://rsseau.fr/fr/tag/nodejs", size: 16725, is_some:true, status:200, 30.84517312s
found "https://rsseau.fr/fr/tag/analytics", size: 8439, is_some:true, status:200, 32.446239693s
found "https://rsseau.fr/fr/tag/selfhosted", size: 8365, is_some:true, status:200, 32.60177853s
found "https://rsseau.fr/fr/tag/symfony", size: 8467, is_some:true, status:200, 32.603505456s
found "https://rsseau.fr/fr/tag/slim", size: 0, is_some:false, status:200, 32.804250504s
found "https://rsseau.fr/fr/tag/gitlabci", size: 8782, is_some:true, status:200, 33.737470697s
found "https://rsseau.fr/fr/blog/kvm", size: 14545, is_some:true, status:200, 33.740319227s
found "https://rsseau.fr/fr/tag/postgres", size: 8579, is_some:true, status:200, 33.74432176s
found "https://rsseau.fr/fr/tag/optimization", size: 8683, is_some:true, status:200, 33.746227429s
found "https://rsseau.fr/fr/blog/rapport-automatique-avec-goaccess", size: 32387, is_some:true, status:200, 33.748175758s
found "https://rsseau.fr/fr/tag/kali", size: 0, is_some:false, status:200, 34.089999871s
found "https://rsseau.fr/fr/tag/Wi-fi", size: 8572, is_some:true, status:200, 35.036508595s
found "https://rsseau.fr/fr/blog/lire-les-logs-avec-go-access", size: 14914, is_some:true, status:200, 35.038509911s
found "https://rsseau.fr/fr/blog/typescript-user-defined-type-guard", size: 39823, is_some:true, status:200, 35.045317337s
found "https://rsseau.fr/fr/blog/rust-web-spider-crate", size: 65789, is_some:true, status:200, 35.056255361s
found "https://rsseau.fr/fr/tag/admin", size: 8417, is_some:true, status:200, 35.186236176s
found "https://rsseau.fr/fr/blog/setup-phinx", size: 32223, is_some:true, status:200, 35.194905372s
found "https://rsseau.fr/fr/tag/qemu", size: 8419, is_some:true, status:200, 35.665102682s
found "https://rsseau.fr/fr/tag/lxc", size: 8461, is_some:true, status:200, 35.669967437s
found "https://rsseau.fr/fr/tag/bash", size: 11014, is_some:true, status:200, 35.676669776s
found "https://rsseau.fr/fr/tag/devops", size: 8768, is_some:true, status:200, 35.681116745s
found "https://rsseau.fr/fr/tag/virtualization", size: 8489, is_some:true, status:200, 36.318233503s
found "https://rsseau.fr/fr/blog/mettre-a-jour-un-package-sur-pipy", size: 13367, is_some:true, status:200, 36.324154685s
found "https://rsseau.fr/fr/tag/vagrant", size: 8467, is_some:true, status:200, 37.007201523s
found "https://rsseau.fr/fr/blog/rust", size: 85654, is_some:true, status:200, 37.036087941s
found "https://rsseau.fr/fr/tag/curl", size: 8536, is_some:true, status:200, 37.656499798s
found "https://rsseau.fr/fr/blog/optimiser-apache", size: 30485, is_some:true, status:200, 37.664812565s
found "https://rsseau.fr/fr/blog/installer-bridge-sfr-box-4k", size: 21396, is_some:true, status:200, 38.295006832s
found "https://rsseau.fr/fr/tag/thread", size: 8591, is_some:true, status:200, 38.306079543s
found "https://rsseau.fr/fr/blog/gateway", size: 26420, is_some:true, status:200, 38.309145495s
found "https://rsseau.fr/fr/tag/pipy", size: 8377, is_some:true, status:200, 38.925674446s
found "https://rsseau.fr/fr/tag/sql", size: 9462, is_some:true, status:200, 38.939839748s
found "https://rsseau.fr/fr/tag/crawler", size: 9469, is_some:true, status:200, 39.565235569s
found "https://rsseau.fr/fr/blog/syncthing", size: 16969, is_some:true, status:200, 39.573231857s
found "https://rsseau.fr/fr/tag/wpa", size: 8558, is_some:true, status:200, 40.198726785s
found "https://rsseau.fr/fr/tag/performance", size: 8676, is_some:true, status:200, 40.20141249s
found "https://rsseau.fr/fr/tag/hack", size: 8565, is_some:true, status:200, 40.838352946s
found "https://rsseau.fr/fr/tag/sync", size: 8323, is_some:true, status:200, 40.842826356s
found "https://rsseau.fr/fr/tag/rust", size: 10163, is_some:true, status:200, 41.48078013s
found "https://rsseau.fr/fr/blog/reproduire-netlify-avec-un-raspberry", size: 18346, is_some:true, status:200, 41.483941691s
found "https://rsseau.fr/fr/tag/network", size: 8431, is_some:true, status:200, 42.13793719s
found "https://rsseau.fr/fr/blog/benchmark-templates", size: 40244, is_some:true, status:200, 42.143059129s
found "https://rsseau.fr/fr/blog/mise-en-place-postgres-replication-avec-docker", size: 59323, is_some:true, status:200, 42.166506708s
found "https://rsseau.fr/fr/blog/v%C3%A9rifier-la-syntaxe-php-a-chaque-commit", size: 24282, is_some:true, status:200, 42.795060822s
found "https://rsseau.fr/fr/blog/rust-threaded-crawler", size: 52008, is_some:true, status:200, 42.807865087s
found "https://rsseau.fr/fr/tag/oauth", size: 8543, is_some:true, status:200, 43.425957892s
found "https://rsseau.fr/fr/blog/installer-apache", size: 24573, is_some:true, status:200, 43.430563466s
found "https://rsseau.fr/fr/tag/linux", size: 0, is_some:false, status:200, 43.779819269s
found "https://rsseau.fr/fr/tag/kvm", size: 0, is_some:false, status:200, 43.780043342s
found "https://rsseau.fr/fr/tag/gcloud", size: 0, is_some:false, status:200, 43.780160648s
found "https://rsseau.fr/fr/tag/haml", size: 0, is_some:false, status:200, 43.780258885s
found "https://rsseau.fr/fr/tag/routing", size: 0, is_some:false, status:200, 43.780367112s
found "https://rsseau.fr/fr/blog/new-symfony-project-with-vagrant", size: 0, is_some:false, status:200, 43.780466349s
found "https://rsseau.fr/fr/tag/sfr", size: 0, is_some:false, status:200, 43.780585205s
found "https://rsseau.fr/fr/tag/goaccess", size: 0, is_some:false, status:200, 43.780708991s
found "https://rsseau.fr/fr/tag/leanpub%20book", size: 0, is_some:false, status:200, 43.780823147s
found "https://rsseau.fr/fr/tag/python", size: 0, is_some:false, status:200, 43.780952813s
found "https://rsseau.fr/fr/tag/git", size: 0, is_some:false, status:200, 43.78106027s
found "https://rsseau.fr/fr/blog/authentification-google-oauth", size: 0, is_some:false, status:200, 43.781174106s
found "https://rsseau.fr/fr/tag/php", size: 0, is_some:false, status:200, 43.781321242s
found "https://rsseau.fr/fr/tag/crate", size: 0, is_some:false, status:200, 43.781446788s
found "https://rsseau.fr/fr/tag/phinx", size: 8409, is_some:true, status:200, 44.845781423s
found "https://rsseau.fr/fr/blog/kill-rails-n1-queries", size: 0, is_some:false, status:200, 45.207587511s
found "https://rsseau.fr/fr/tag/quick", size: 8621, is_some:true, status:200, 46.334896682s
found "https://rsseau.fr/en/blog/typescript-user-defined-type-guard", size: 271, is_some:true, status:200, 48.004604031s
found "https://rsseau.fr/fr/blog/2018-06-22-kill-rails-n1-queries", size: 271, is_some:true, status:200, 48.005367177s
found "https://rsseau.fr/fr/blog/2019-11-2019-11-07-lire-les-logs-avec-go-access", size: 271, is_some:true, status:200, 48.005952388s
found "https://rsseau.fr/fr/blog/2018-02-07-rust.html", size: 271, is_some:true, status:200, 48.006487242s
found "https://rsseau.fr/2018/02/07/rust-web-spider-crate.html", size: 271, is_some:true, status:200, 48.006988466s
found "https://rsseau.fr/fr/blog/2017-11-16-installer-apache", size: 271, is_some:true, status:200, 48.00748828s

Process finished with exit code 0

CLI - Not including the schema in -d parameter results in critical error

I installed the CLI via cargo in Ubuntu 22.04.
If I run spider -v --domain example.com crawl I get

thread 'main' panicked at /home/inelemento/.cargo/registry/src/index.crates.io-6f17d22bba15001f/spider_cli-1.48.3/src/main.rs:57:10:
called `Result::unwrap()` on an `Err` value: Kind(NotFound)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

the crawler launches if I use http://example.com as argument.

Since the parameter is called domain and not URL this can be misleading.
Also I would expect the error to be reported with an user-friendly error message.

Cheers and keep up the good work.

crawling nested anchors are not found

Currently, if you run the test against a website that has nested anchor tags the crawler will not pick up the links. It seems that we need to retry the selector to get all the links. You can use www.drake.com as an example to test against.

Great package btw.

[Feature request] Sitemap

It would be nice to have a sitemap crawl option like the one setup for https://github.com/a11ywatch/crawler. We can migrate the code over and release under a feature flag. The sitemap can crawl either pre or mid crawls to extend the reach of found pages.

Allow subdomain crawling

Issue
It would be nice to have the ability to crawl all subdomains for a domain. Having this option would make it more realistic to gather everything attached to a live website. An optional flag on crawl or config defaulting to false so prior behavior does not change unless it becomes more mandatory for it being the default.

search

how to find any information, a word for example on a multi-page site ?

Getting URL after redirect

Thank you for this amazing crate!

I have a quick question and possibly a feature request if it's currently not possible.

Suppose that crawling starts from https://example.com, which has a link to /a. It seems that if there exists an HTTP redirect from /a to /b, page.get_html() will give the content served at https://example.com/b, but page.get_url() will return https://example.com/a. This might very well be the preferable behavior, but it would also be nice to extract the URL after the redirect, i.e. https://example.com/b. Is that possible? Also, it would be great to know when a redirect has occurred.

cli parameters

There seems to be an parameter order that I have to follow.

spider -t -s -D 200 --domain https://www.ard.de -v scrape --output-links

Maybe make order of arguments not a thing, except for scrape being the first one

Already crawled URL attempted as % encoded

Hi,
I have the following code:

let mut website: Website = Website::new(webpage)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();
    let start = Instant::now();
    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!(
                "found {:?}, size: {}, is_some:{}, status:{:?}, {:?}",
                page.get_url(),
                page.get_bytes().map(|b| b.len()).unwrap_or_default(),
                page.get_bytes().is_some(),
                page.status_code,
                start.elapsed()
            );
});
website.scrape().await;

After this is running for a bit, I will start to see log info like the following:
**First from spider::utils**
�[90m[�[0m2024-03-18T15:48:12Z �[32mINFO �[0m spider::utils�[90m]�[0m fetch - https://www.cprime.com/%22https:////www.cprime.com//resources//blog//how-to-develop-a-hospital-management-system///%22

Then, from the `println!`
found "https://www.cprime.com/%22https:////www.cprime.com//resources//blog//how-to-develop-a-hospital-management-system///%22", size: 0, is_some:false, status:404, 4028.537054042s

This pattern will occur for quite a few URLs that don't exist. I can confirm that the URL appended to the base `https://www.cprime.com/` has already been crawled. So I'm not missing pages but seems to be lots of redundancy, and 404 are being generated. 

This happens for various sites that I have tested it on.


Any thoughts on how to track this down?

Support ignoring SSL errors

Hi, is it possible to support disabling SSL verification?

Add the ability to download not only html, but also all site assets: css, js, imgs, etc

cli tutorial store crawls result as json

The cli tutorial kind of hints as crawl results stored as JSON,

I checked and it is not json. can I get JSON output somehow, or can I know the format please?

Chrome flag chrome_intercept page hang.

We attached the interception to the browser thinking it would emit the events per page. We need to add the interception handler at the page level instead.

This is being worked on actively.

Add --help option to spider_cli

Upon installing spider_cli, I pretty immediately tried to check usage information using spider --help, which then leads to an immediate panic. I also typed man spider (in some ways, I wish the command was actually spider man, because that would be delightful). Anyway, of course no man page exists (mostly not for CLI tools installed through cargo), so I ended up checking usage on crates.io.

I'd like to recommend adding in a -help option to the CLI, because it's a pretty standard feature, and makes using the tool easier on a day-to-day basis, especially for potentially more complicated configuration settings. To that end, it might be worth adding clap as a spider_cli dependency--in my experience, it makes building fully-featured CLI tools much easier and easy to update as the main repository evolves, and is a commonly-used library in the Rust ecosystem. If you're interested in this, I'd be happy to put together a draft pull request with those changes.

Fix tld catching when subdomains and TLD options declared

Currently when both subdomains and TLDs are set some of the TLDs are missed.
Re-visiting the conditional filter for the logic on handling both conditions.

Running with decentralized feature

Trying to run a decentralized worker example.

First, as per the documentation, running a worker:

RUST_LOG=info SPIDER_WORKER_PORT=3030 spider_worker
[2024-03-14T18:59:22Z INFO  spider::utils] Spider_Worker starting at 0.0.0.0: - 3030
[2024-03-14T18:59:22Z INFO  warp::server] Server::run; addr=0.0.0.0:3030
[2024-03-14T18:59:22Z INFO  warp::server] listening on http://0.0.0.0:3030

Then two variants, both do not result in the expected behavior:

cd examples
SPIDER_WORKER=http://127.0.0.1:3030 cargo run --example example --features decentralized

Gives

error: none of the selected packages contains these features: decentralized

Another, running from the project root gives the following in the worker's log

...
[2024-03-14T19:10:57Z INFO  spider::utils] - error parsing html text {} - https://rsseau.fr/fr/tag/linux
[2024-03-14T19:10:57Z INFO  spider::utils] - error parsing html text {} - https://rsseau.fr/fr/tag/inversify
[2024-03-14T19:10:57Z INFO  spider::utils] - error parsing html text {} - https://rsseau.fr/fr/blog/setup-phinx
[2024-03-14T19:10:57Z INFO  spider::utils] - error parsing html text {} - https://rsseau.fr/fr/blog/reproduire-netlify-avec-un-raspberry
[2024-03-14T19:10:58Z ERROR hyper::server::tcp] accept error: Too many open files (os error 24)
[2024-03-14T19:10:59Z ERROR hyper::server::tcp] accept error: Too many open files (os error 24)
...

After killing the worker, the example process prints out

...
- "https://rsseau.fr/fr/tag/gitlabci"
- "https://rsseau.fr/fr/tag/curl"
- "https://rsseau.fr/fr/tag/mariadb"
Time elapsed in website.crawl() is: 6.187290167s for total pages: 183

Also extract urls that are pointing to other domains? [CLI]

I only get 'internal links'.

Is there a way to get external links too?

spider_cli doesn't install using cargo

Hello,

Per the README, I've tried installing spider_cli on two different machines, using $ cargo install spider_cli. Each time, I receive an error

error: could not find `spider_cli` in registry `crates-io` with version `*`

Similarly, while other command-line programs distributed in this manner (like lychee) have listings on crates.io for both lychee and lychee-lib, I can not find a corresponding spider_cli crate available. This may indicate that publishing was not actually successful.

I have tested on rustc/cargo 1.60.0 for both stable and nightly, and actively develop other projects in Rust, so I don't think ti's because of anything exotic or uncommon about my machines' setup.

Add better documentation to get started

It is useful to tell folks what they need to do, probably starting with:

% git clone https://github.com/spider-rs/spider.git 
cd spider

Easy to do get some results by:
cargo run --example example

Harder to get this working
spider [OPTIONS] --domain <DOMAIN> [SUBCOMMAND]

Where is the --help to give me the [OPTIONS] I'm looking for. How about a [SUBCOMMAND]?

Is with or without the https://? Does it matter?

Having a general INSTALL.txt file is always helpful.

When you are able to get a spider to work, where does the data go?

I can get cargo run --example example to scan https://rsseau.fr as configured in the example.rs file, but not sure how to customize that? I should be able to just copy the example.rs file and run something that points to that config, but not sure what that is.

This is all good info to put in an INSTALL.txt file.

Is it possible to dynamicall add links to crawl?

Hello and first off, thank you for developing this fantastic software!

I am currently working on a project that involves crawling pages with dynamically generated content. To navigate through these pages, I've managed to generate direct links to the content I'm interested in. However, I've encountered a limitation when trying to integrate these links into the crawler.

The issue arises with the website.on_link_find_callback function. From my understanding, this callback is designed to process and return only one link at a time. My use case requires the ability to add multiple dynamically generated links into the crawler's queue for processing.

Is it possible?

I see there is extra_links but it's private.

Should make the media selector more configurable?

I'm using spider-rs to build a simple system which make a snapshot of my website, but it can't scrape the css/js/image.. and so on.
Will it be supported in future?

Blacklist entire url tree?

Hello,

I was using this library today, and noticed that I was having a lot of issues with one site that had comments enabled. As a result, while the fetch/parsing of the main content was fairly quick, there were many cases where I was also crawling the https://example.com/article/comment/<comment_number> as well. When I added in

website.configuration.blacklist_url.push("https://www.example.com/comment/".to_string());

to my configuration, I believe it successfully stopped that particular page from coming up, but pages that contains that url (although didn't match it exactly) still seemed to appear. It would be nice to be able to avoid this tree entirely (I'm not interested in the comments of the site at all).

I believe this is because the is_allowed() function uses the following internally

if self.configuration.blacklist_url.contains(link) {
            return false;
        }

which checks for exact matches of the intended link in the blacklisted urls, rather than checks for a pattern match of the blacklisted url on the link itself. Is there a way that might make sense for helping to prevent this?

I'm using spider = "1.5.1" in my Cargo.toml file, with rustc 1.60.0-nightly. I've actually seen this behavior on a couple of sites, but at the risk of annoying him by running too many crawlers over it, one example of this might be https://www.jeffgeerling.com

Blacklist regex for CLI does not seem to work

I've been trying to crawl a website and ignore paths like /news or whatever using -b, but nothing seems to work.

I looked at #36 and from what I can see the feature regex is not enabled for the CLI part of this project.

Is this on purpose? I also can't seem to find any example or test that tests that the CLI can actually do this.

Hope I'm just blind and overlooking something.

Thanks for the tool! :D

Change API to builder pattern

It looks like as we add more features to Website, it gets more complicated. Would it be an option to change it to builder pattern, instead, so it would be much simpler to use?

let mut website: Website = Website::new("https://choosealicense.com");

website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.tld = false;
website.configuration.delay = 0; // Defaults to 0 ms due to concurrency handling
website.configuration.request_timeout = None; // Defaults to 15000 ms
website.configuration.http2_prior_knowledge = false; // Enable if you know the webserver supports http2
website.configuration.channel_buffer = 100; // Defaults to 50 - tune this depending on on_link_find_callback
website.configuration.user_agent = Some("myapp/version".into()); // Defaults to using a random agent
website.on_link_find_callback = Some(|s| { println!("link target: {}", s); s }); // Callback to run on each link find
website.configuration.blacklist_url.get_or_insert(Default::default()).push("https://choosealicense.com/licenses/".into());
website.configuration.proxies.get_or_insert(Default::default()).push("socks5://10.1.1.1:12345".into()); // Defaults to none - proxy list.

website.crawl().await;

When setting up with builder pattern (suggestion only):

let mut website: Website = Website::new("https://choosealicense.com")
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_tld(false)
    .with_delay(0)
    .with_request_timeout(None)
    .with_http2_prior_knowledge(false)
    .with_channel_buffer(100)
    .with_user_agent(Some("myapp/version".into()))
    .with_on_link_find_callback( Some(|s| { println!("link target: {}", s); s }))
    .with_blacklist_url("https://choosealicense.com/licenses/".into())
    .with_blacklist_url( .... ) // can add more
    .with_proxy("socks5://10.1.1.1:12345".into())
    .with_proxy( ... ); // can add more

website.crawl().await;

`with_on_link_find_callback` doesn't exist

As of 323d03f:

spider$ rg -F with_on_link_find_callback
spider/README.md
78:    .with_on_link_find_callback(Some(|link, html| {

Looks like it did exist at 9bc8888 in at https://github.com/spider-rs/spider/blob/9bc8888/spider/src/website.rs#L1136-L1148:

    /// Perform a callback to run on each link find.
    pub fn with_on_link_find_callback(
        &mut self,
        on_link_find_callback: Option<
            fn(CaseInsensitiveString, Option<String>) -> (CaseInsensitiveString, Option<String>),
        >,
    ) -> &mut Self {
        match on_link_find_callback {
            Some(callback) => self.on_link_find_callback = Some(callback.into()),
            _ => self.on_link_find_callback = None,
        };
        self
    }

Am I—and thus the README—missing an import; was this mistakenly removed; renamed; hidden behind a feature flag; or should it be removed from the README?

Add Documentation about new features of the spider

Looking at https://github.com/spider-rs/spider/blob/main/spider_cli/README.md

I think this got brought into the code #79

Or it got bumped off the todo list. Either is fine, but would be nice if it was built if folks knew how to use it.

CPU usage is exceptionally high, reaching up to 1450%

It's an impressive spider that exhibits remarkable speed.
However, I've encountered certain domains where the CPU usage is exceptionally high, reaching up to 1450%.

Extract text from Html

Hi there @j-mendez initially thanks a lot for this spider awesome crate

i am using you example scrape.rs to scrape website and along with sublinks of that domain present in it
but it gives me raw html

i want to extract text from raw html provided via scrape any idea or advice is accepted

[Bug] Trailing slash breaking with websites that don't allow it

Tried crawling a website but it failed because trailing slash is appended even though it is not included in the original URL.

Some websites do not accept this so they return 500. A trailing slash shouldn't be forced into the URL.

Cause of error:

spider/spider/src/website.rs

Lines 72 to 76 in 64a6807

    
           let domain = if domain.ends_with('/') { 
        
               domain.into() 
        
           } else { 
        
               string_concat!(domain, "/") 
        
           };

full-resource feature seems to be missing Javascript

Hello again!

Thanks for the super quick fix in #129!

I am now at a stage that I can successfully see my little script use spider to crawl and download the website I am testing with.

However, I noticed a peculiar bug.

First off, all files, regardless of actual filetype, are stored as .html. This isn't that much of a problem and one can simply run a script to fix that.

Secondly, I don't see any .js files being downloaded at all. I'm testing this on various sites that I definitely know have Javascript (I check using my browser to verify there's an actual file available) and I have installed spider using the feature full_resources. I also expected other files to be downloaded like images and whatnot, but that's less of a problem for my personal use case. I'm just mentioning it in case that's unexpected behaviour.

Thank you again for getting back to me so quickly on the last one. I hope I'm not spamming issues haha :D

Extracting all urls on a page

Hi,
This is related to issue #135 . Perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:

spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape

I never see any URLs from the external domain, even though on one of the pages crawled https://www.theconsortium.cloud/application-consulting-services-page there is a button that links to a pdf on Hubspot, the HTML looks like this:

<a class="hs-button " href="https://39834791.fs1.hubspotusercontent-na1.net/hubfs/39834791/Application%20Rationalization.pdf" id="hs-button_widget_1686407295591" target="_blank" rel="noopener "> Download our one-pager for more information </a>

the output from the scape command looks like this for that page:
{
"html": "",
"links": [],
"url": "https://www.theconsortium.cloud/application-consulting-services-page"
},

Is there a way, programmatically or via the CLI, to have a spider detect all of the links on a page? Or is that post-processing that needs to happen, whereby I would need to parse each page to find the links?

WARN Connection header illegal in HTTP/2: connection

I tried the sample code on the example, but the warning(WARN Connection header illegal in HTTP/2: connection) has occurred to me.

When executing the sample code without the cargo lambda, the warning didn't show.
Could someone tell me what the warning means?

use lambda_http::{run, service_fn, Body, Error, Request, RequestExt, Response};
use spider::tokio;
use spider::website::Website;

/// This is the main body for the function.
/// Write your code inside it.
/// There are some code example in the following URLs:
/// - https://github.com/awslabs/aws-lambda-rust-runtime/tree/main/examples
async fn function_handler(_event: Request) -> Result<Response<Body>, Error> {
    // Extract some useful information from the request
    let url = "https://choosealicense.com";
    let mut website: Website = Website::new(&url);
    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }

    // Return something that implements IntoResponse.
    // It will be serialized to the right response event automatically by the runtime
    let resp = Response::builder()
        .status(200)
        .header("content-type", "text/html")
        .body("Hello AWS Lambda HTTP request".into())
        .map_err(Box::new)?;
    Ok(resp)
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        // disable printing the name of the module in every log line.
        .with_target(false)
        // disabling time is handy because CloudWatch will add the ingestion time.
        .without_time()
        .init();

    run(service_fn(function_handler)).await
}

Concurrency configuration not available in latest release 1.2.1

README.md suggests there's a concurrency configuration option available but is not available in 1.2.1. It's also not listed in docs: https://docs.rs/spider/1.2.1/spider/configuration/struct.Configuration.html

[Feature request] URL Globbing

I would like to propose an implementation of curl's URL globbing: https://everything.curl.dev/cmdline/globbing

Sure, this can be done via loop, but it is useful and easier to control if there is one URL to use to Pause, Resume or Shutdown the spider

The result of get_html is garbled in case of Shift_JIS html

Here's a minimum reproduction code.

let url = "https://hoken.kakaku.com/health_check/blood_pressure/";
    let mut website: Website = Website::new(&url);
    website.with_budget(Some(spider::hashbrown::HashMap::from([("*", 10)])));

    website.scrape().await;

    let mut lock = stdout().lock();

    let separator = "-".repeat(url.len());

    for page in website.get_pages().unwrap().iter() {
        writeln!(
            lock,
            "{}\n{}\n\n{}\n\n{}",
            separator,
            page.get_url_final(),
            page.get_html(),
            separator
        )
        .unwrap();
    }

Here's the result for get_html().

<!doctype html>
<html lang="ja" prefix="og: http://ogp.me/ns# fb: http://www.facebook.com/2008/fbml">
<head>
<!--[if IE]>
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">
<![endif]-->
<meta charset="shift_jis">

<title>���i.com - HbA1c�̒l�������Ƃǂ��Ȃ�H��l�Ƌ^����a�C�ɂ���Ĉ�t�����</title>
<meta name="description" content="HbA1c�̒l�������Ƒz�肳���a�C�A�a�C�̉��P���@�ɂ���Ĉ�t���ڂ�������������܂��B���N�f�f�Ő��l���C�ɂȂ������͎Q�l�ɂ��Ă݂܂��傤�B">
<link rel="canonical" href="https://hoken.kakaku.com/health_check/hba1c/">

Could someone tell me how to deal with the garbled HTML above?

Get a list of images and their alt text

It seems like I should be able to do some functions of Screaming frog with this.

What I'm most interested in is:

a list of images (and all the pages that feature that image), with their alt text (and variations if that isn't consistently used).
a list of file types (html, pdf, wepb, svg) would also be useful.
also useful to produce a sitemap.xml file from a crawl.

Use version from Cargot.toml for user agent

I think that user_agent in src/configuration.rs can use crate version from cargo.toml file. Maybe it is possible to use something like that :

const VERSION: &'static str = env!("CARGO_PKG_VERSION");

[Feature Request] Option to use HTTP/SOCKS proxy

Thanks for this awesome tool!

I would like to know if it is possible to use a HTTP/SOCKS proxy.

Cheers!

Edit: Fix

Scraped html does not match the url - chrome [with_wait_for_idle_network]

Using 1.82.4, when running the code below, the url doesn't match the page contents. It seems to mix up urls for different pages when inspecting the contents. So far it works fine on https://rsseau.fr, but it has trouble on the url below. Do I need to use website.subscribe_guard()?

//    crawl_and_scrape_urls("https://docs.drift.trade").await;
pub async fn crawl_and_scrape_urls(webpage: &str) {
    let mut website: Website = Website::new(webpage)
        .with_chrome_intercept(cfg!(feature = "chrome_intercept"), true)
        .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
        .with_caching(cfg!(feature = "cache"))
        .with_delay(200)
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    // website.subscribe_guard()
    let start = Instant::now();
    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            println!("found {:?}, size: {}, is_some:{}, status:{:?}, {:?}", page.get_url(), page.get_bytes().map(|b| b.len()).unwrap_or_default(), page.get_bytes().is_some(), page.status_code, start.elapsed());
            fs::write(page.get_url_final().replace("/", "__"), page.get_html()).expect("Unable to write file");
        }
    });

    // crawl the site first
    website.crawl().await;
    // persist links to the next crawl
    website.persist_links();
    // scrape all discovered links
    website.scrape().await;
}

CLI -d flag is duplicated

I installed via cargo install and the -d flag is defined twice. One for delay the other for domain. If I'm not mistaken, clap only supports lower case flags?

Here's the output of spider --help

spider_cli 1.8.2
madeindjs <[email protected]>, j-mendez <[email protected]>
Multithreaded web crawler written in Rust.

USAGE:
    spider [OPTIONS] --domain <DOMAIN> [SUBCOMMAND]

OPTIONS:
    -b, --blacklist-url <BLACKLIST_URL>
            Comma seperated string list of pages to not crawl or regex with feature enabled

    -c, --concurrency <CONCURRENCY>
            How many request can be run simultaneously

    -d, --domain <DOMAIN>
            Domain to crawl

    -d, --delay <DELAY>
            Polite crawling delay in milli seconds

    -h, --help
            Print help information

    -r, --respect-robots-txt
            Respect robots.txt file

    -u, --user-agent <USER_AGENT>
            User-Agent

    -v, --verbose
            Print page visited on standard output

    -V, --version
            Print version information

SUBCOMMANDS:
    crawl     crawl the website extracting links
    help      Print this message or the help of the given subcommand(s)
    scrape    scrape the website extracting html and links

Trailing slash appended on link visited

When trying to look at the links visited the root domain returns with a trailing slash. https://github.com/madeindjs/spider/pull/25/files#diff-7ebe7ce46b5a73fb72ebc85803edab9f4b4bf36726629990f59fd25ad1f187c2R184 is a test case showing it. It might make sense to make trailing slash a config property that either returns on all links or none.

	let domain = if domain.ends_with('/') {
	domain.into()
	} else {
	string_concat!(domain, "/")
	};