Comments (8)

anjackson commented on June 10, 2024

I've been trying to create an ExtractorHTML test case for this. Although the extractor does extract the data URI, it doesn't seem to treat it as a relative path and construct an HTTP URL from it. Are you using a different extractor? Or perhaps I'm missing something?

from heritrix3.

anjackson commented on June 10, 2024

Bump. @csrster, are any more details available?

csrster commented on June 10, 2024

ato commented on June 10, 2024

Adding the following to ExtractorHtmlTest:

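    // Fetches an archived copy of the problem page and prints every extracted link, sorted.
    public void test() throws IOException {
        String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
        CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
        // Download the page body; this needs network access to web.archive.org.
        String content = IOUtils.toString(new URL(url).openStream());
        // Run the extractor under test; discovered links are added to curi.
        getExtractor().extract(curi, content);

        // Sort the out-links so the output is stable, then print each URI.
        CrawlURI[] links = curi.getOutLinks().toArray(new CrawlURI[0]);
        Arrays.sort(links);
        for (CrawlURI link: links) {
            System.out.println(link.getURI());
        }
    }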

Yields a lot of log errors like this one:

Mar 16, 2019 4:31:26 PM org.archive.modules.extractor.UnitTestUriLoggerModule logUriError
INFO: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt
org.apache.commons.httpclient.URIException: Created (escaped) uuri > 2083: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/%22
	at org.archive.url.UsableURIFactory.validityCheck(UsableURIFactory.java:327)
	at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:310)
	at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
	at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
	at org.archive.modules.extractor.ExtractorHTML.addLinkFromString(ExtractorHTML.java:663)
	at org.archive.modules.extractor.ExtractorHTML.processEmbed(ExtractorHTML.java:695)
	at org.archive.modules.extractor.ExtractorHTML.processGeneralTag(ExtractorHTML.java:459)
	at org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:855)

Because of the exception, though, these URIs are not returned as extracted links.
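For illustration only, here is a minimal sketch of the kind of guard that could run before a link string is resolved against the base URI. It assumes the extracted value begins with a stray quote (as the `%22` in the logged URL suggests) followed by a `data:` URI. `InlineUriGuard` and its methods are hypothetical names, not part of Heritrix, and the length check on the raw string only approximates the check Heritrix applies to the escaped URI.

```java
/** Hypothetical helper for illustration; not part of Heritrix. */
final class InlineUriGuard {
    // Same limit that appears in the logged exception ("uuri > 2083").
    // Assumption: applied here to the raw string, not the escaped URI.
    static final int MAX_URI_LENGTH = 2083;

    private InlineUriGuard() {}

    /** Returns true if the candidate should not be resolved against the base URI. */
    static boolean shouldSkip(String candidate) {
        if (candidate == null) {
            return true;
        }
        String s = stripLeadingQuotes(candidate);
        if (s.length() > MAX_URI_LENGTH) {
            return true;
        }
        // Case-insensitive check for the "data:" scheme prefix.
        return s.regionMatches(true, 0, "data:", 0, 5);
    }

    /** Drops leading whitespace and quote characters left over from sloppy markup. */
    static String stripLeadingQuotes(String raw) {
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '"' || c == '\'' || Character.isWhitespace(c)) {
                i++;
            } else {
                break;
            }
        }
        return raw.substring(i);
    }
}
```

With a guard like this, `InlineUriGuard.shouldSkip("\"data:image/png;base64,AAAA")` is true while `InlineUriGuard.shouldSkip("images/logo.png")` is false, so ordinary relative links would still be resolved.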

csrster commented on June 10, 2024

Hi again,
It seems we agree that there's a bug here. IIRC, our problem wasn't so much Heritrix queueing these URLs as the Heritrix error logs becoming enormous. So we would still be interested in seeing our pull request accepted.
cheers!
Colin

ato commented on June 10, 2024

Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it?

csrster commented on June 10, 2024

Digging through our issue history in our private Jira I found this comment:

2018-10-10 07:06:00.376 INFO thread-62 org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate() Timeout matching regex '.*[a-zA-Z0-9\W-]+\.dk.*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\1(?=\/).*\2(?=\/).*\3(?=\/)|\1(?=\/).*\3(?=\/).*\2(?=\/)|\2(?=\/).*\1(?=\/).*\3(?=\/)|\2(?=\/).*\3(?=\/).*\1(?=\/)|\3(?=\/).*\2(?=\/).*\1(?=\/)|\3(?=\/).*\1(?=\/).*\2(?=\/)).*' to url 'http://ryd-lortet.dk/%22 ...

So what happened here is that the giant URL was constructed from the inline data. We have modified MatchesListRegexDecideRule to include a timeout on the regex matching, and the logging from that modified rule shows that matching the giant URL against our hideous regex was causing extra problems on top of the error-log inflation. I think that must mean that at least some of these inline URLs get past the validityCheck.

We'll be submitting a separate pull request for the timeout on the decide rule soon.
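For readers unfamiliar with the idea, one common way to bound regex matching time in Java is to wrap the input in a `CharSequence` that checks a deadline on every character access. The sketch below shows that general technique under stated assumptions. It is not the modification described above, whose code is not shown in this thread, and `TimedRegex` and `RegexTimeoutException` are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of a deadline-bounded regex match; not Heritrix code. */
final class TimedRegex {

    /** Thrown when matching runs past the deadline. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence view that checks the deadline each time a character is read. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads input through charAt, so a runaway
            // backtracking match ends up here repeatedly.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    private TimedRegex() {}

    /** Full-match check that throws RegexTimeoutException once the deadline has passed. */
    static boolean matches(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }
}
```

A caller would catch `RegexTimeoutException`, log the pattern and URL, and treat the URI as not matched. The deadline is only checked when the engine reads a character, so this bounds the match time approximately rather than exactly.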

cheers again!
Colin

csrster commented on June 10, 2024

You must be right, Alex. I thought we'd actually made a pull request, but now I see it was only a bug report. Give me a minute!
Colin
