Comments (5)
Apparently this happens a lot with og:facebook-tags attributes.
Perhaps given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?
from heritrix3.
However, looking at the code in question, it appears that the ExtractorHTML extracts links that might be URLs from any <meta content="..."> attribute except for property="robots" or property="refresh".
I think, in general, this won't happen with textual content attributes, but in this case the domain-name form appears to be causing this to be judged isVeryLikelyUri(...) == true.
heritrix3/commons/src/main/java/org/archive/util/UriUtils.java
Lines 394 to 469 in 0581170
Hence, I'm not sure how often this problem will really turn up - it may not be worth worrying about.
However, for common properties that are known not to be used for absolute or relative URLs of any sort, the ExtractorHTML class could be modified to skip this speculative link extraction.
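A minimal sketch of what such a skip list might look like. This is not Heritrix code: the class name, the method name, and the list of meta names are all assumptions chosen for illustration, and where exactly ExtractorHTML would call such a check is left open.

```java
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Heritrix): decides whether the content
 * attribute of a meta tag should be scanned for speculative links.
 */
final class MetaContentFilter {

    // Assumed, illustrative list of meta names/properties whose content
    // is not expected to hold an absolute or relative URL.
    private static final Set<String> NON_URL_META_NAMES = Set.of(
            "publisher", "author", "description", "keywords",
            "twitter:domain", "og:site_name");

    private MetaContentFilter() {}

    /** Returns false for meta names on the skip list, true otherwise. */
    static boolean shouldTrySpeculativeExtraction(String nameOrProperty) {
        if (nameOrProperty == null) {
            return true; // no name/property: keep the existing behaviour
        }
        String key = nameOrProperty.trim().toLowerCase(Locale.ROOT);
        return !NON_URL_META_NAMES.contains(key);
    }
}
```

With a check like this, a tag such as <meta name="publisher" content="iNetWorker.at"/> would be skipped before any URI-likeness test runs, while meta names not on the list would still be handled as they are today.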
from heritrix3.
This really happens very often, and a fix would save a lot of bandwidth and trouble. E.g. when crawling www.klausenstein.at, the host creates an automatic abuse report because of this line in the page source:
<meta name="publisher" content="iNetWorker.at"/>
This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced lots of similar situations with something like:
<meta name="publisher" content="domain.com"/>
...
from heritrix3.
Unfortunately, the problems keep increasing. This tag also causes problems:
<meta name="twitter:domain" content="Drivingthenation.com" />
It is placed on every page of the domain and generates an additional invalid request of the form "current URL + Drivingthenation.com" for every single page request, which leads to thousands of additional requests that return 404. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on. None of these "linked" pages exist.
It would be very helpful if a solution could be found for this problem in the near future. These incorrectly extracted URLs lead to great frustration for webmasters. It is always a content="domain.com" attribute, which is most likely never a link.
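The "current URL + domain" pattern described above is what standard relative-reference resolution produces when a bare domain name is treated as a relative path. The sketch below reproduces it with java.net.URI; Heritrix uses its own URI classes, so this only illustrates the resolution rule, and the class and method names are made up for the example.

```java
import java.net.URI;

/** Illustrative only: resolves a meta content value against the page URL. */
final class RelativeMetaLinkDemo {

    private RelativeMetaLinkDemo() {}

    /** Treats metaContent as a relative reference, as a link extractor would. */
    static String resolve(String pageUrl, String metaContent) {
        return URI.create(pageUrl).resolve(metaContent).toString();
    }

    public static void main(String[] args) {
        // A bare domain has no scheme, so it is parsed as a relative path
        // and appended to the directory of the current page.
        System.out.println(resolve(
                "http://www.drivingthenation.com/category/automobilesandenergy/",
                "Drivingthenation.com"));
        System.out.println(resolve(
                "http://www.klausenstein.at/",
                "iNetWorker.at"));
    }
}
```

Because the value has no scheme and no leading slashes, nothing in the resolution step can tell it apart from a relative file name, which is why the decision has to be made earlier, when choosing which attributes to extract from.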
from heritrix3.
In my opinion, this approach of guessing URLs by parsing JavaScript content must die completely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.
from heritrix3.