Comments (8)
I've been attempting to create an ExtractorHTML test case for this, and although it does extract the data URI, it doesn't seem to treat it as a relative path and construct an HTTP URL from it. Are you using a different extractor? Or perhaps I'm missing something?
from heritrix3.
Bump. @csrster, are any more details available?
from heritrix3.
Adding the following to ExtractorHtmlTest:
    public void test() throws IOException {
        // Archived page whose HTML contains large inline data: URIs
        String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
        CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
        String content = IOUtils.toString(new URL(url).openStream());
        getExtractor().extract(curi, content);
        // Print the extracted outlinks in sorted order
        CrawlURI[] links = curi.getOutLinks().toArray(new CrawlURI[0]);
        Arrays.sort(links);
        for (CrawlURI link : links) {
            System.out.println(link.getURI());
        }
    }
Yields a lot of log errors like this one:
Mar 16, 2019 4:31:26 PM org.archive.modules.extractor.UnitTestUriLoggerModule logUriError
INFO: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt
org.apache.commons.httpclient.URIException: Created (escaped) uuri > 2083: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/%22data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz ...
    at org.archive.url.UsableURIFactory.validityCheck(UsableURIFactory.java:327)
    at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:310)
    at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
    at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
    at org.archive.modules.extractor.ExtractorHTML.addLinkFromString(ExtractorHTML.java:663)
    at org.archive.modules.extractor.ExtractorHTML.processEmbed(ExtractorHTML.java:695)
    at org.archive.modules.extractor.ExtractorHTML.processGeneralTag(ExtractorHTML.java:459)
    at org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:855)
It doesn't return them as extracted links because of the exception though.
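For illustration only, here is a minimal sketch of one way such errors could be avoided: skip attribute values that are inline data: URIs before they are resolved against the base URI. InlineDataLinkFilter and its methods are hypothetical names, not part of Heritrix, and this is not necessarily how an actual fix would look. The handling of a leading quote character is an assumption based on the %22data: prefix visible in the log above.

```java
/** Hypothetical helper for illustration; not part of Heritrix. */
final class InlineDataLinkFilter {

    private InlineDataLinkFilter() {
    }

    /**
     * Returns true if the candidate link text is an inline data: URI.
     * Leading whitespace and stray quote characters are skipped first,
     * because the logged URL starts with an escaped quote (%22data:...).
     */
    static boolean isInlineData(String candidate) {
        if (candidate == null) {
            return false;
        }
        int start = 0;
        while (start < candidate.length()) {
            char c = candidate.charAt(start);
            if (Character.isWhitespace(c) || c == '"' || c == '\'') {
                start++;
            } else {
                break;
            }
        }
        // Case-insensitive comparison of the scheme prefix
        return candidate.regionMatches(true, start, "data:", 0, 5);
    }

    /** Returns true if the candidate is worth resolving against the base URI. */
    static boolean shouldResolve(String candidate) {
        return candidate != null
                && !candidate.isEmpty()
                && !isInlineData(candidate);
    }
}
```

An extractor could call shouldResolve(value) before trying to build an absolute URI, so that inline image data never reaches the URI validity check or the error log.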
from heritrix3.
Hi again,
It seems like we agree that there's a bug here. IIRC our problem wasn't so much with Heritrix queueing these URLs but with the Heritrix error logs becoming enormous. So we would still be interested in seeing our pull request accepted.
cheers!
Colin
from heritrix3.
Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it?
from heritrix3.
Digging through our issue history in our private Jira I found this comment:
2018-10-10 07:06:00.376 INFO thread-62 org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate() Timeout matching regex '.*[a-zA-Z0-9\W-]+\.dk.*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\1(?=\/).*\2(?=\/).*\3(?=\/)|\1(?=\/).*\3(?=\/).*\2(?=\/)|\2(?=\/).*\1(?=\/).*\3(?=\/)|\2(?=\/).*\3(?=\/).*\1(?=\/)|\3(?=\/).*\2(?=\/).*\1(?=\/)|\3(?=\/).*\1(?=\/).*\2(?=\/)).*' to url 'http://ryd-lortet.dk/%22data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAEnCAYAAACHcBUB ...
So what happened here is that the giant URL was constructed from the inline data. We have actually modified MatchesListRegexDecideRule to include a timeout on the regex matching, and logging from the modified rule shows that matching the giant URL against our hideous regex was giving us extra problems on top of the error-log inflation. I think that must mean that at least some of these inline URLs get past the validityCheck.
We'll be submitting a separate pull request for the timeout on the decide rule real soon now.
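As a rough sketch of the general idea of putting a timeout on regex matching (not the actual change to MatchesListRegexDecideRule, which is not shown in this thread), one common approach wraps the input in a CharSequence that checks a deadline on every character access. TimeLimitedRegex, RegexTimeoutException and matchesWithin are hypothetical names, and treating a timeout as "no match" is an assumption made here for simplicity.

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class TimeLimitedRegex {

    /** Thrown from charAt() once the deadline has passed. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence wrapper that aborts matching after a deadline. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt(), so a
            // runaway match ends up here repeatedly and can be cut short.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    private TimeLimitedRegex() {
    }

    /** Returns true only if the whole input matches within the time limit. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (RegexTimeoutException e) {
            // Assumption: a timed-out match is treated as "no match".
            return false;
        }
    }
}
```

With this kind of wrapper, a pathological pattern applied to a very long URL gives up after the configured limit instead of tying up a thread indefinitely.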
cheers again!
Colin
from heritrix3.
You must be right, Alex. I thought we'd actually made a pull request, but now I see it was only a bug report. Give me a minute!
Colin
from heritrix3.