
norconex / crawlers


Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Home Page: https://opensource.norconex.com/crawlers

License: Apache License 2.0

Java 99.06% Shell 0.07% HTML 0.77% Batchfile 0.09%
search-engine web-crawler java collector-http flexible crawler crawlers filesystem-crawler collector-fs

crawlers' People

Contributors

brian-yuen, dependabot[bot], essiembre, ohtwadi


crawlers' Issues

Adding unique logo to each project

Hello,

I suggest adding a logo to each project (committer, importer, collector).
When I click a link and it takes me to a different project, it's hard to notice that I've actually left the project I was browsing. The pages look almost identical.

Also, it would be really helpful if we could add a graphic illustration on the collector homepage to show how the collector, importer, and committer are used together.

Just thought I would share my experience.

Thanks,
Khalid

Some URLs are not extracted

Links that have the href attribute at the beginning of a line (like the one below) are not extracted correctly, because DefaultURLExtractor.URL_PATTERN does not match in this case (due to the trailing \W):

<a alt="foo"
href="foo.html">Foo</a>

Since you're using Tika anyway, you could consider using its LinkContentHandler to collect links.

Cannot run the demo test

I followed the instructions in the HOW_TO_RUN_EXAMPLE file and here is what I got on a Windows 7 machine.

I am using the following version norconex-collector-http-1.0.0-20130530.150731-5

C:\Program Files (x86)\norconex-collector-http-1.0.0-SNAPSHOT>collector-http.bat -a start -c examples/minimum/minimum-config.xml
INFO [HttpCrawlerConfigLoader] URL filter loaded: RegexURLFilter [regex=http://en.wikipedia.org/wiki/.*, caseSensitive=false, onMatch=INCLUDE]
INFO [HttpCollectorConfigLoader] Configuration loaded: id=Minimal Config HTTP Collector; logsDir=./logs; progressDir=./progress
INFO [HttpCollector] Suite of 1 HTTP crawler jobs created.
FATAL [JobRunner] Job suite execution failed: Minimal Config Wikipedia 1-page Crawl
java.io.FileNotFoundException: .\logs\latest\Minimal Config Wikipedia 1-page Crawl.log (The system cannot find the path specified)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
    at org.apache.log4j.FileAppender.<init>(FileAppender.java:110)
    at org.apache.log4j.FileAppender.<init>(FileAppender.java:121)
    at com.norconex.jef.log.FileLogManager.createAppender(FileLogManager.java:87)
    at com.norconex.jef.JobRunner.runSuite(JobRunner.java:76)
    at com.norconex.collector.http.HttpCollector.crawl(HttpCollector.java:172)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:139)
C:\Program Files (x86)\norconex-collector-http-1.0.0-SNAPSHOT>

Checksum metadata is always kept

Hi,

Since version 1.3.0, I've noticed that there is a new metadata called "collector.http.checksum-doc". It is fine, except that it breaks the expected behaviour when using KeepOnlyTagger in the Importer. Since this new metadata is added in the HTTPDocumentChecksumStep, which is after the ImportModuleStep where the KeepOnlyTagger is processed, the metadata will be kept even if we wanted to exclude it.

Java application for crawling purpose with collector-http

Hi,

I need to use collector-http to get data from several sites whose URLs match some regular expression and store it in a database via a Java application. Is this possible with collector-http, and if so, how? Is there a simple app to start with?

Regards,
Derek

Orphan deletion doesn't consider depth

If we start a crawl (resume=false) with a lower depth value over an existing DB, and there are orphans to process at the end of the crawl, orphans whose depth is too deep for the new, lower depth value will still get fetched.

Endless Loop while Crawling

I am crawling one of our customers' sites, and the webshop has a list of products with 3 links to the next pages.
Once the crawler hits those pages, it ends up in an endless loop producing longer and longer URLs.

Entry-URL: http://kaffeeroester.de/verkaufshits
Page 2: http://kaffeeroester.de/verkaufshits?&amp;p=2
Page 3: http://kaffeeroester.de/verkaufshits?&amp;p=3

URLs in endless loop:

INFO [CrawlStatus] OK > (3) http://kaffeeroester.de/verkaufshits?amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (3) http://kaffeeroester.de/verkaufshits?amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (3) http://kaffeeroester.de/verkaufshits?amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (3) http://kaffeeroester.de/verkaufshits?amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=3&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=3&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=3&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (4) http://kaffeeroester.de/verkaufshits?amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (5) http://kaffeeroester.de/verkaufshits?amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=3&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=3&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=3&amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=3&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=3&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=2&amp;p=2
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=2&amp%25253Bp=2&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=2&amp;p=3
INFO [CrawlStatus] OK > (6) http://kaffeeroester.de/verkaufshits?amp%252525253Bp=2&amp%2525253Bp=2&amp%25253Bp=3&amp%253Bp=2&amp;p=2

Any idea what's going wrong?

Kind regards,

Daniel

RegexURLFilter - On match include

When adding multiple filters, it seems that a URL gets rejected when it does not match one of the criteria. For example, if we have the following filters configured:

<!-- At a minimum make sure you stay on your domain. -->
<httpURLFilters>
  <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter" caseSensitive="false" onMatch="include">
    http://site/eng/Folder1/.*
  </filter>
  <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter" caseSensitive="false" onMatch="include">
    http://site/eng/Folder2/.*
  </filter>
</httpURLFilters>

We would expect that any link whose path matches ".../eng/Folder1/.*" or ".../eng/Folder2/.*" would be included and crawled. Currently, if we start crawling the landing page, "http://site/eng/Folder2/index.html", it gets rejected using the above filters. Hence, when using onMatch="include", a document should be rejected only if it does not match any of the "include" statements.
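
A possible workaround, shown below as a sketch only (it assumes the RegexURLFilter above accepts standard Java regex alternation), is to collapse the two include patterns into a single filter so that matching either folder is enough:

<!-- Workaround sketch: one include filter covering both folders.
     Assumes Java regex alternation is supported, as in the filters above. -->
<httpURLFilters>
  <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter" caseSensitive="false" onMatch="include">
    http://site/eng/(Folder1|Folder2)/.*
  </filter>
</httpURLFilters>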

Collector's Log file name format

Please find here some questions related to logging, as feedback after first using collector 2.*. I don't expect everything noted here to be classified as bugs or feature requests; I just hope my experience helps increase collector-http's usability and makes it more "friendly".

To automate some processes, I'd like to read and parse the collector's logs to monitor how the processes run.

I found a case where the crawler failed to init (I know that's normal when it doesn't start because, for example, the importer or committer isn't set correctly): the collector doesn't create or update the workdir, including logs, so the automation is unable to detect that the collector hasn't started at all. Or it picks up logs from a previous successful launch, so I can make mistakes when monitoring running collector instances. Is there a way to automatically detect whether the collector launched at all? How can I catch such rare but real cases?

Let's continue :)

I found the collector's logs in the dir .../${workdir}/logs/latest/logs/. The last "logs" part of the path appeared after updating to the 2.* branch. You may keep the path as you like, but in my opinion it is redundant - there is already a "logs" dir one level up.

I expected the log to have the ${crawlerId}.log name format, but, for example, if the crawler id is site.com_rss then the log file name is site.com_95_rss.log. Where does the 95 come from? What is it?

Could you tell me how to recognize a crawler's latest log? Can I build the log name from the ${crawlerId} variable alone, without listing the .../latest/logs/ directory?

Thanks a lot for your help.

Stopping collector-http in the right manner

As far as I know, to stop a running collector I need to use the 'stop' action, but it isn't clear how and where.

I tried running collector-http.sh -a stop -c %a config file% in one console while collector-http.sh -a resume -c %a config file% was running in another.

It did stop, but threw several exceptions:

INFO  [HttpCollector] Suite of 1 HTTP crawler jobs created.
INFO  [FileStopRequestHandler$1] STOP request received.
INFO  [JobRunner] Running www.site.com: BEGIN (Tue Jun 24 17:47:05 EEST 2014)
INFO  [HttpCrawler] Stopping the crawler "www.site.com".
INFO  [MapDBCrawlURLDatabase] www.site.com: Initializing crawl database...
Exception in thread "Thread-0" java.lang.NoSuchMethodError: com.norconex.commons.lang.map.Properties.setBoolean(Ljava/lang/String;Z)V
    at com.norconex.jef.progress.JobProgressPropertiesFileSerializer.serialize(JobProgressPropertiesFileSerializer.java:95)
    at com.norconex.jef.suite.JobSuite$1.serialize(JobSuite.java:192)
    at com.norconex.jef.suite.JobSuite$1.jobStateChanged(JobSuite.java:184)
    at com.norconex.jef.progress.JobProgressStateChangeAdapter.jobStopping(JobProgressStateChangeAdapter.java:58)
    at com.norconex.jef.progress.JobProgress.stopRequestReceived(JobProgress.java:477)
    at com.norconex.jef.JobRunner.fireStopRequested(JobRunner.java:331)
    at com.norconex.jef.JobRunner.access$0(JobRunner.java:328)
    at com.norconex.jef.JobRunner$1.stopRequestReceived(JobRunner.java:90)
    at com.norconex.jef.suite.FileStopRequestHandler$1.run(FileStopRequestHandler.java:83)
INFO  [MapDBCrawlURLDatabase] www.site.com: Done initializing databases.
INFO  [DefaultSitemapResolver] Resolving sitemap: http://www.site.com/sitemap.xml
INFO  [DefaultSitemapResolver]          Resolved: http://www.site.com/sitemap.xml
INFO  [HttpCrawler] www.site.com: RobotsTxt support enabled
INFO  [HttpCrawler] www.site.com: RobotsMeta support enabled
INFO  [HttpCrawler] www.site.com: Sitemap support enabled
INFO  [HttpCrawler] www.site.com: Crawling URLs...
INFO  [HttpCrawler] www.site.com: Crawler "www.site.com" stopping: committing documents.
INFO  [HttpCrawler] Deleting crawler downloads directory: /home/anton/projects/wsas/collectors/www.site.com/workdir/downloads/www.site.com
INFO  [HttpCrawler] www.site.com: 0 URLs processed in 0:00:00.004 for "www.site.com".
INFO  [HttpCrawler] www.site.com: Crawler "www.site.com" stopped.
INFO  [MapDBCrawlURLDatabase] www.site.com: Closing crawl database...
INFO  [JobRunner] Running www.site.com: END (Tue Jun 24 17:47:05 EEST 2014)
FATAL [JobRunner] Job suite execution failed: www.site.com
java.lang.NoSuchMethodError: com.norconex.commons.lang.map.Properties.setBoolean(Ljava/lang/String;Z)V
    at com.norconex.jef.progress.JobProgressPropertiesFileSerializer.serialize(JobProgressPropertiesFileSerializer.java:95)
    at com.norconex.jef.suite.JobSuite$1.serialize(JobSuite.java:192)
    at com.norconex.jef.suite.JobSuite$1.jobStateChanged(JobSuite.java:184)
    at com.norconex.jef.progress.JobProgressStateChangeAdapter.jobCompleted(JobProgressStateChangeAdapter.java:62)
    at com.norconex.jef.JobRunner.fireJobCompleted(JobRunner.java:366)
    at com.norconex.jef.JobRunner.runJob(JobRunner.java:209)
    at com.norconex.jef.JobRunner.runSuite(JobRunner.java:94)
    at com.norconex.collector.http.HttpCollector.crawl(HttpCollector.java:198)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:167)
Exception in thread "activityTracker_www.site.com" java.lang.NoSuchMethodError: com.norconex.commons.lang.map.Properties.setBoolean(Ljava/lang/String;Z)V
    at com.norconex.jef.progress.JobProgressPropertiesFileSerializer.serialize(JobProgressPropertiesFileSerializer.java:95)
    at com.norconex.jef.suite.JobSuite$1.serialize(JobSuite.java:192)
    at com.norconex.jef.suite.JobSuite$1.jobRunningVerified(JobSuite.java:188)
    at com.norconex.jef.JobRunner.fireJobRunningVerified(JobRunner.java:352)
    at com.norconex.jef.JobRunner.access$2(JobRunner.java:348)
    at com.norconex.jef.JobRunner$2.run(JobRunner.java:254)


After that I can't resume crawling, only start a new crawl.

Please let me know whether there is a way to stop running crawlers in a more graceful manner, and whether it is possible to resume crawling after such a stop.

Thank you.

"Cumulative" strategy for urlFilter

Hello,

Hope I'm not too annoying with my questions.

In the default collector settings I apply the following urlFilter:

<filter class="$extension" onMatch="exclude" >
        jpg,gif,png,ico,swf,css,js,pdf,doc,txt,odt</filter>

in the file shared-collector.xml, to keep all common settings together. It is included in config.xml this way:

<crawlerDefaults>
        #parse("${apppath}/shared-configs/shared-collector.xml")
</crawlerDefaults>

I also added a crawler-specific urlFilter to config.xml:

<httpURLFilters>
          <filter class="$urlRegex">http://.*somesite\.com.*</filter>
</httpURLFilters>

hoping that a URL would have to pass both rules to be added to the download queue.

But a URL passes the filter if it matches ANY of these rules. Is there some misconfiguration in my files, or is there an option to force a URL to pass all rules before being queued?

Thanks a lot.
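
One thing that might be worth trying, purely as a sketch and under the assumption that crawler-level <httpURLFilters> could replace rather than merge with the ones in <crawlerDefaults>, is declaring both rules together at the crawler level (using the same Velocity variables as the config above):

<!-- Sketch only: both rules declared in the same crawler-level block,
     in case crawler-level filters replace the defaults instead of merging. -->
<httpURLFilters>
  <filter class="$extension" onMatch="exclude">
      jpg,gif,png,ico,swf,css,js,pdf,doc,txt,odt</filter>
  <filter class="$urlRegex" onMatch="include">http://.*somesite\.com.*</filter>
</httpURLFilters>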

DefaultRobotsTxtProvider fails parsing robots.txt

com.norconex.collector.http.robot.impl.DefaultRobotsTxtProvider has no handling for comments and only adds the last line in a robots.txt as a restriction.

An alternative fix for this will be provided soon.

Running collector-http example

I've downloaded collector-http to try it. Executing the following example:
collector-http.bat -a start -c examples/minimum/minimum-config.xml
I get the error (in Spanish, so I don't know the proper English message):
"class not loaded or not found: com.norconex.collector.http.HttpCollector"
What can I do?
Thanks
Carlos

MongoDB - cannot authenticate connection

I can't see any invocation of the authenticate method on the instance of com.mongodb.DB in com.norconex.collector.http.db.impl.mongo.MongoCrawlURLDatabase.
Is there any reason why authentication is not supported?

Make all in-link metadata and anchor texts accessible from a crawled document

Is it possible to retrieve the anchor text and metadata of all crawled links pointing to one crawled document?

The problem I'm facing is setting a readable name on a crawled document, and the only human-readable name of the document is the actual link text pointing at it. This feature could also be useful for increasing the recall of a document in a search solution.

One place where this information could be stored is the link database. However, the CrawlURL object does not include link metadata or anchor text. Also, a link is supposed to be ignored if the status of the URL is processed, but maybe the link's metadata could be stored even if the state is processed.

If one would like to implement such feature within the crawler, would it be best to include the functionality in the link database?

Unsupported Content-Coding: xml

After upgrading from 1.2.0 to 1.3.1 I get the following exception when Norconex tries to crawl my sitemaps:

[non-job]: 2014-04-23 18:03:26,612 [pool-2-thread-2] ERROR ector.http.crawler.HttpCrawler  - www.secrethost.local: Could not process document: http://www.secrethost.local/solr-article-content-map.xml (org.apache.http.client.ClientProtocolException)
com.norconex.collector.http.HttpCollectorException: org.apache.http.client.ClientProtocolException
    at com.norconex.collector.http.fetch.impl.DefaultDocumentFetcher.fetchDocument(DefaultDocumentFetcher.java:147)
    at com.norconex.collector.http.crawler.DocumentProcessor$DocumentFetcherStep.processURL(DocumentProcessor.java:195)
    at com.norconex.collector.http.crawler.DocumentProcessor.processURL(DocumentProcessor.java:118)
    at com.norconex.collector.http.crawler.HttpCrawler.processNextQueuedURL(HttpCrawler.java:400)
    at com.norconex.collector.http.crawler.HttpCrawler.processNextURL(HttpCrawler.java:321)
    at com.norconex.collector.http.crawler.HttpCrawler.access$0(HttpCrawler.java:303)
    at com.norconex.collector.http.crawler.HttpCrawler$ProcessURLsRunnable.run(HttpCrawler.java:582)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.http.client.ClientProtocolException
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:188)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at com.norconex.collector.http.fetch.impl.DefaultDocumentFetcher.fetchDocument(DefaultDocumentFetcher.java:99)
    ... 9 more
Caused by: org.apache.http.HttpException: Unsupported Content-Coding: xml
    at org.apache.http.client.protocol.ResponseContentEncoding.process(ResponseContentEncoding.java:98)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:139)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:199)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:85)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
    ... 13 more

The sitemap is also the start url:

<crawler id="${name}">
  <startURLs>
    <url>${sitemap}</url>
  </startURLs>

  <sitemap ignore="false">
    <location>${sitemap}</location>
  </sitemap>
<!-- ... -->

'Writer fails' Exception

I get the following error during a test crawler execution, and despite information being collected, it wasn't committed. Can you help me understand the reason and fix it?

[non-job]: 2014-10-14 14:19:06,669 [pool-1-thread-2] ERROR ector.http.crawler.HttpCrawler  - online.wsj.com_xml_rss_3_7085.xml: Could not process document: http://online.wsj.com/articles/australian-prime-minister-softens-rhetoric-on-vladimir-putin-1413268865?mod=World_newsreel_4 (Writer thread failed)
java.lang.RuntimeException: Writer thread failed
    at org.mapdb.AsyncWriteEngine.checkState(AsyncWriteEngine.java:328)
    at org.mapdb.AsyncWriteEngine.update(AsyncWriteEngine.java:428)
    at org.mapdb.Caches$WeakSoftRef.update(Caches.java:495)
    at org.mapdb.EngineWrapper.update(EngineWrapper.java:65)
    at org.mapdb.HTreeMap.putInner(HTreeMap.java:504)
    at org.mapdb.HTreeMap.put(HTreeMap.java:459)
    at com.norconex.collector.http.db.impl.mapdb.MapDBCrawlURLDatabase.processed(MapDBCrawlURLDatabase.java:199)
    at com.norconex.collector.http.crawler.URLProcessor.processURL(URLProcessor.java:330)
    at com.norconex.collector.http.crawler.URLProcessor.processURL(URLProcessor.java:98)
    at com.norconex.collector.http.crawler.DocumentProcessor$URLExtractorStep.processURL(DocumentProcessor.java:291)
    at com.norconex.collector.http.crawler.DocumentProcessor.processURL(DocumentProcessor.java:118)
    at com.norconex.collector.http.crawler.HttpCrawler.processNextQueuedURL(HttpCrawler.java:399)
    at com.norconex.collector.http.crawler.HttpCrawler.processNextURL(HttpCrawler.java:321)
    at com.norconex.collector.http.crawler.HttpCrawler.access$100(HttpCrawler.java:66)
    at com.norconex.collector.http.crawler.HttpCrawler$ProcessURLsRunnable.run(HttpCrawler.java:582)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOError: java.io.IOException: File too large
    at org.mapdb.Volume$FileChannelVol.putData(Volume.java:783)
    at org.mapdb.StoreWAL.replayLogFile(StoreWAL.java:833)
    at org.mapdb.StoreWAL.commit(StoreWAL.java:630)
    at org.mapdb.EngineWrapper.commit(EngineWrapper.java:93)
    at org.mapdb.AsyncWriteEngine.access$201(AsyncWriteEngine.java:74)
    at org.mapdb.AsyncWriteEngine.runWriter(AsyncWriteEngine.java:245)
    at org.mapdb.AsyncWriteEngine$WriterRunnable.run(AsyncWriteEngine.java:170)
    ... 1 more
Caused by: java.io.IOException: File too large
    at sun.nio.ch.FileDispatcherImpl.pwrite0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.pwrite(FileDispatcherImpl.java:66)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.FileChannelImpl.writeInternal(FileChannelImpl.java:730)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:716)
    at org.mapdb.Volume$FileChannelVol.writeFully(Volume.java:709)
    at org.mapdb.Volume$FileChannelVol.putData(Volume.java:781)
    ... 7 more

FileNotFoundException on mapdb.t

I'm running on

  • norconex-collector 1.2.0
  • norconex-committer-solr-1.0.1
  • Oracle JRE 1.7.0_51
  • Windows Server 2008 R2

This exception (see below) occurs quite frequently in the logs.
It seems to happen fairly randomly and I'm not sure how to reproduce it.

    java.io.IOError: java.io.FileNotFoundException: ..\crawler-output\XYZ\crawldb\XYZ\mapdb.t (Access is denied)
            at org.mapdb.Volume$FileChannelVol.<init>(Volume.java:671)
            at org.mapdb.Volume.volumeForFile(Volume.java:183)
            at org.mapdb.Volume$1.createTransLogVolume(Volume.java:218)
            at org.mapdb.StoreWAL.openLogIfNeeded(StoreWAL.java:108)
            at org.mapdb.StoreWAL.put(StoreWAL.java:215)
            at org.mapdb.Caches$WeakSoftRef.put(Caches.java:429)
            at org.mapdb.Queues$Queue.add(Queues.java:373)
            at com.norconex.collector.http.db.impl.mapdb.MappedQueue.add(MappedQueue.java:157)
            at com.norconex.collector.http.db.impl.mapdb.MappedQueue.add(MappedQueue.java:1)
            at com.norconex.collector.http.db.impl.mapdb.MapDBCrawlURLDatabase.queue(MapDBCrawlURLDatabase.java:146)
            at com.norconex.collector.http.crawler.HttpCrawler.deleteCacheOrphans(HttpCrawler.java:255)
            at com.norconex.collector.http.crawler.HttpCrawler.handleOrphans(HttpCrawler.java:221)
            at com.norconex.collector.http.crawler.HttpCrawler.execute(HttpCrawler.java:173)
            at com.norconex.collector.http.crawler.HttpCrawler.startExecution(HttpCrawler.java:147)
            at com.norconex.jef.AbstractResumableJob.execute(AbstractResumableJob.java:52)
            at com.norconex.jef.JobRunner.runJob(JobRunner.java:193)
            at com.norconex.jef.JobRunner.runSuite(JobRunner.java:94)
            at com.norconex.collector.http.HttpCollector.crawl(HttpCollector.java:198)
            at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:165)
    Caused by: java.io.FileNotFoundException: ..\crawler-output\XYZ\crawldb\XYZ\mapdb.t  (Access is denied)
            at java.io.RandomAccessFile.open(Native Method)
            at java.io.RandomAccessFile.<init>(Unknown Source)
            at org.mapdb.Volume$FileChannelVol.<init>(Volume.java:668)
            ... 18 more     

Here's my main configuration:

<httpcollector id="${name}">

  <!-- Decide where to store generated files. -->
  <progressDir>../crawler-output/${name}/progress</progressDir>
  <logsDir>../crawler-output/${name}/logs</logsDir>

  <crawlers>
    <crawler id="${name}">
      <startURLs>
        <url>${sitemap}</url>
      </startURLs>

      <sitemap ignore="false" class="com.norconex.collector.http.sitemap.impl.DefaultSitemapResolver">
        <location>${sitemap}</location>
      </sitemap>

      <!-- Where the crawler default directory to generate files is. -->
      <workDir>../crawler-output/${name}/</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- Remove pages that are no longer linked to by the sitemap -->
      <deleteOrphans>true</deleteOrphans>

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="100" />

    <committer class="com.norconex.committer.solr.SolrCommitter">
      <solrURL>http://localhost:8984/solr/sites</solrURL>
        <batchSize>100</batchSize>
        <solrBatchSize>100</solrBatchSize>
        <contentTargetField>content_${language}</contentTargetField>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

1.34 to 2.0 Conversion

I moved my configuration over from 1.34 to 2.0 and I receive the following error:
ERROR com.norconex.importer.Importer - Unsupported Import Handler: null

Is there additional configuration that I need to do for 2.0?

Additional features for the stop action of the collector

Currently, if several instances of the collector are running at the same time, we can identify which one to stop by providing the config path as a UID.

It may be useful to stop a collector by providing its id attribute. We can discuss the use case for this feature here and estimate how valuable it may be.

Starting/stopping separate crawlers in the context of a single collector instance may also be interesting, because in log files we operate by crawler names, not config paths. It could be helpful for automation or for designing a management GUI.

running collector-http examples with Solr

I'd like to test the "minimum" and "complex" examples with Solr, but I'm not sure what changes to make to minimum-config.xml and complex-config.xml. I'm trying out Solr at the same time, so my repository is collection1 (C:\solr\example\solr\collection1).
I've downloaded "norconex-committer-solr-2.0.0" and copied its bin directory onto collector-http's.
I'd appreciate some advice.
Thanks
Carlos
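
For reference, a minimal sketch of a committer element that could be added inside the <crawler> section of minimum-config.xml, based on the SolrCommitter block shown earlier on this page. The URL and core name are assumptions matching the default Solr example layout, and element support may differ between committer versions:

<!-- Sketch only: committer added to the crawler section of minimum-config.xml.
     The URL assumes the default Solr example (port 8983, core "collection1"). -->
<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8983/solr/collection1</solrURL>
</committer>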

Force encoding when fetching content

It seems that the crawler currently takes the encoding to use from the HTTP header. There may be cases where the HTTP header information is inaccurate and where the encoding is instead given in the HTML code.

A simple setting to force the encoding to be used would be valuable. Maybe this already exists; otherwise it's a feature that would be appreciated.

Exception when trying to create folder

(Platform: Windows)
If there is a URL segment that is less than 25 characters then a file is created in the downloads folder rather than a folder:
Example: www.xyz.com/segment1/page.html creates a file named "page" in the "segment1" folder.

If a subsequent URL is below this URL, then collector.core attempts to create a folder, but in Windows you cannot have a folder and a file with the same name:
Example: page.html contains the link www.xyz.com/segment1/page/view

Exception:
java.io.IOException: Directory '.\segment1\page' could not be created

I believe that 1.34 used to put a .raw extension on the files, which would prevent this issue.

Configuration which allows reading a limited list of URLs

In some cases I need collector-http to download exactly the pages listed in the 'urls' configuration section.

Reading the logs, I see that the crawler reads the sitemaps of a large number of sites that are not listed in that section, but are probably linked from the pages I need to download.

The crawler configuration I use:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="${machinereadablename}">

    #set($workdir = "${crawlerdir}/workdir")
    #set($configdir = "${crawlerdir}/config")
    #set($basepkg = "com.norconex.collector.http")
    #set($extension = "${basepkg}.filter.impl.ExtensionURLFilter")
    #set($urlRegex = "${basepkg}.filter.impl.RegexURLFilter")
    #set($robotsTxt = "${basepkg}.robot.impl.DefaultRobotsTxtProvider")
    #set($robotsMeta = "${basepkg}.robot.impl.DefaultRobotsMetaProvider")
    #set($headersRegex = "${basepkg}.filter.impl.RegexHeaderFilter")
    #set($docFetcher = "${basepkg}.fetch.impl.DefaultDocumentFetcher")
    #set($urlExtractor = "${basepkg}.url.impl.DefaultURLExtractor")
    #set($sitemap = "${basepkg}.sitemap.impl.DefaultSitemapResolver")
    #set($headersChecksummer = "${basepkg}.checksum.impl.DefaultHttpHeadersChecksummer")
    #set($docChecksummer = "${basepkg}.checksum.impl.DefaultHttpDocumentChecksummer")

    <progressDir>${workdir}/progress</progressDir>

    <logsDir>${workdir}/logs</logsDir>

    <crawlerDefaults>
        #parse("${apppath}/shared-configs/shared-collector.xml")
    </crawlerDefaults>

    <crawlers>

    <maxdepth>1</maxdepth>

        <crawler id="${machinereadablename}">
            <httpURLFilters>
                #parse("${configdir}/config-url-filters.xml")
            </httpURLFilters>

            <startURLs>
                #parse("${configdir}/config-urls.xml")
            </startURLs>                       
        </crawler>

    </crawlers>

</httpcollector>

shared-collector.xml is here

<userAgent>Norconex Collector-Http</userAgent>

<urlNormalizer class="${basepkg}.url.impl.GenericURLNormalizer">
    <normalizations>
        lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort 
    </normalizations>
    <replacements>
        <replace>
            <match>&amp;view=print</match>
            <replacement>&amp;view=html</replacement>
        </replace>
    </replacements>
</urlNormalizer>

<numThreads>2</numThreads>

<maxDepth>3</maxDepth>

<maxURLs>-1</maxURLs>

<workDir>$workdir</workDir>

<keepDownloads>false</keepDownloads>

<deleteOrphans>true</deleteOrphans>

<crawlURLDatabaseFactory class="${basepkg}.db.impl.DefaultCrawlURLDatabaseFactory" />

<robotsTxt ignore="false" class="$robotsTxt"/>

<httpHeadersFetcher class="${basepkg}.fetch.impl.SimpleHttpHeadersFetcher"  validStatusCodes="200" />

<httpHeadersChecksummer class="$headersChecksummer" field="collector.http.url" />

<httpDocumentFetcher class="$docFetcher" validStatusCodes="200" />

<httpDocumentChecksummer class="$docChecksummer"/>

<robotsMeta ignore="false" class="$robotsMeta" />

<urlExtractor class="$urlExtractor" />

Please advise how to tune the crawler to download only the pages listed in the 'url' sections, without visiting other links or crawling sitemaps.

Thank you.
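
A configuration sketch, assuming the 1.x elements already used in the configs on this page, for restricting the crawl to the listed start URLs (no link following, no sitemap resolution) might look like this inside the crawler:

<!-- Sketch: crawl only the start URLs (elements as used elsewhere in this config) -->
<maxDepth>0</maxDepth>                     <!-- do not follow links found on pages -->
<sitemap ignore="true" class="$sitemap" /> <!-- do not resolve sitemap.xml files -->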

Possible temporary working directories bug

I have the following config for a crawler:

.xml

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="${machinereadablename}">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")

  #set($workdir = "${crawlerdir}/workdir")
  #set($configdir = "${crawlerdir}/config")


  #parse("${apppath}/shared-configs/collector-defaults.xml")

  <crawlers>

    <crawler id="${machinereadablename}">
      <startURLs>
        #parse("${configdir}/config-urls.xml")        
      </startURLs>

      #parse("${configdir}/config-importer.xml")

      <progressDir>${workdir}/progress</progressDir>
      <logsDir>${workdir}/logs</logsDir>
      <workDir>$workdir</workDir>

      <referenceFilters>
            #parse("${apppath}/shared-configs/collector-shared-reference-filters.xml")
            #parse("${configdir}/config-reference-filters.xml")
      </referenceFilters>

      <urlNormalizer class="$urlNormalizer" />

      <maxDepth>${maxdepth}</maxDepth> 

    </crawler>

  </crawlers>

</httpcollector>

.properties

apppath = /an/absolute/path
crawlerdir = /an/absolute/path/collectors/test-collector
machinereadablename = test-collector
maxdepth = 2

When I run the collector (-a start), it puts the crawlstore and sitemaps into $workdir, and the logs and progress into the collector app dir.

What is wrong with my config? Or is this a bug in the collector configuration procedure?
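
Judging from the other configurations on this page, where <progressDir> and <logsDir> sit at the collector level rather than inside <crawler>, a sketch of that placement (same values, only moved) would be:

<!-- Sketch: progressDir/logsDir declared on the collector, not the crawler,
     matching the placement used in the other configs on this page -->
<httpcollector id="${machinereadablename}">
  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>
  <crawlers>
    <crawler id="${machinereadablename}">
      <workDir>$workdir</workDir>
      <!-- ...rest of the crawler configuration unchanged... -->
    </crawler>
  </crawlers>
</httpcollector>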

Stopping of the collector isn't working

I run the collector this way: ./collector-http.sh -a start /a/config/path/config.xml

Then I try to stop it: ./collector-http.sh -a stop /a/config/path/config.xml

It returns the following:

INFO  [AbstractCrawlerConfig] Reference filter loaded: com.norconex.collector.core.filter.impl.ExtensionReferenceFilter@7fa98a66[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,swf,css,js,pdf,doc,xls,ppt,txt,odt,zip,rar,gz,swf,xlsx,docx,pptx,mp3,wav,mid,caseSensitive=false]
INFO  [AbstractCollectorConfig] Configuration loaded: id=somename; logsDir=/somwdir/workdir/logs; progressDir=/somedir/workdir/progress


An ERROR occured:

This collector cannot be stopped since it is NOT running.


Details of the error has been stored at: ...

The file with details contains

com.norconex.collector.core.CollectorException: This collector cannot be stopped since it is NOT running.
    at com.norconex.collector.core.AbstractCollector.stop(AbstractCollector.java:126)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:73)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

I highly appreciate your help and hope my experience is useful for making this nice software better.

Please provide a sample setup to crawl a website and store the content in Solr repo.

Please provide a sample setup to crawl a website and store the content in a Solr repo. We also have other requirements, like indexing metadata, skipping certain URLs, parsing only part of a content page, and parsing data from an Oracle database.

Is it possible to give a good example to help me implement the above requirements? We are trying to decide between Apache Nutch and Norconex. I have no experience with Norconex, as I only read about it yesterday. It would be helpful if you could provide some input so that I can build a showcase and decide on a crawler.

Thanks,
Ravi
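
For reference, a stripped-down sketch assembled from the configuration fragments shown elsewhere on this page. The ids, URLs, paths, and Solr core below are placeholders, and element support may vary between collector and committer versions:

<!-- Sketch only: minimal collector with a Solr committer, assembled from
     fragments shown elsewhere on this page. All values are placeholders. -->
<httpcollector id="sample-collector">
  <progressDir>./output/progress</progressDir>
  <logsDir>./output/logs</logsDir>
  <crawlers>
    <crawler id="sample-crawler">
      <startURLs>
        <url>http://www.example.com/</url>
      </startURLs>
      <workDir>./output</workDir>
      <maxDepth>2</maxDepth>
      <delay default="1000" />
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/collection1</solrURL>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>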

Site rejected

Using http-collector 1.3.3-SNAPSHOT for the URL www.ft.com. The site is rejected for an unknown reason. Could you help me figure out why it isn't being crawled?

Please find log and config attached:

www.ft.com: 2014-08-07 03:36:33,669 [main] INFO     com.norconex.jef.JobRunner  - Running www.ft.com: BEGIN (Thu Aug 07 03:36:33 CDT 2014)
www.ft.com: 2014-08-07 03:36:33,676 [main] INFO pl.mapdb.MapDBCrawlURLDatabase  - www.ft.com: Initializing crawl database...
www.ft.com: 2014-08-07 03:36:34,624 [main] INFO pl.mapdb.MapDBCrawlURLDatabase  - www.ft.com: Done initializing databases.
www.ft.com: 2014-08-07 03:36:35,397 [main] INFO p.crawler.CrawlStatus.REJECTED  -   REJECTED >   (0) http://www.ft.com
www.ft.com: 2014-08-07 03:36:35,400 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: RobotsTxt support enabled
www.ft.com: 2014-08-07 03:36:35,400 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: RobotsMeta support enabled
www.ft.com: 2014-08-07 03:36:35,400 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: Sitemap support enabled
www.ft.com: 2014-08-07 03:36:35,400 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: Crawling URLs...
www.ft.com: 2014-08-07 03:36:35,404 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: Deleting orphan URLs (if any)...
www.ft.com: 2014-08-07 03:36:35,406 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: Deleted 0 orphan URLs...
www.ft.com: 2014-08-07 03:36:35,406 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: Crawler "www.ft.com" finishing: committing documents.
www.ft.com: 2014-08-07 03:36:35,486 [main] INFO ector.http.crawler.HttpCrawler  - Deleting crawler downloads directory: /home/demo/www/collectors/www.ft.com/workdir/downloads/www.ft.com
www.ft.com: 2014-08-07 03:36:35,488 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: 1 URLs processed in 0:00:00.086 for "www.ft.com".
www.ft.com: 2014-08-07 03:36:35,503 [main] INFO ector.http.crawler.HttpCrawler  - www.ft.com: Crawler "www.ft.com" completed.
www.ft.com: 2014-08-07 03:36:35,503 [main] INFO pl.mapdb.MapDBCrawlURLDatabase  - www.ft.com: Closing crawl database...
www.ft.com: 2014-08-07 03:36:35,618 [main] INFO     com.norconex.jef.JobRunner  - Running www.ft.com: END (Thu Aug 07 03:36:33 CDT 2014)

config.xml

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="${machinereadablename}">

    #set($workdir = "${crawlerdir}/workdir")
    #set($configdir = "${crawlerdir}/config")
    #set($basepkg = "com.norconex.collector.http")
    #set($extension = "${basepkg}.filter.impl.ExtensionURLFilter")
    #set($urlRegex = "${basepkg}.filter.impl.RegexURLFilter")
    #set($robotsTxt = "${basepkg}.robot.impl.DefaultRobotsTxtProvider")
    #set($robotsMeta = "${basepkg}.robot.impl.DefaultRobotsMetaProvider")
    #set($headersRegex = "${basepkg}.filter.impl.RegexHeaderFilter")
    #set($docFetcher = "${basepkg}.fetch.impl.DefaultDocumentFetcher")
    #set($urlExtractor = "${basepkg}.url.impl.DefaultURLExtractor")
    #set($sitemap = "${basepkg}.sitemap.impl.DefaultSitemapResolver")
    #set($headersChecksummer = "${basepkg}.checksum.impl.DefaultHttpHeadersChecksummer")
    #set($docChecksummer = "${basepkg}.checksum.impl.DefaultHttpDocumentChecksummer")

    <progressDir>${workdir}/progress</progressDir>

    <logsDir>${workdir}/logs</logsDir>

    <crawlerDefaults>
        #parse("${apppath}/shared-configs/shared-collector.xml")
    </crawlerDefaults>

    <crawlers>

        <crawler id="${machinereadablename}">
            <httpURLFilters>
                #parse("${configdir}/config-url-filters.xml")
            </httpURLFilters>

            <startURLs>
                <url>${url}</url>
            </startURLs>

        </crawler>

    </crawlers>

</httpcollector>

shared-collector.xml

<userAgent>Web Crawler</userAgent>

<urlNormalizer class="${basepkg}.url.impl.GenericURLNormalizer">
    <normalizations>
        lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort 
    </normalizations>
    <replacements>
        <replace>
            <match>&amp;view=print</match>
            <replacement>&amp;view=html</replacement>
        </replace>
    </replacements>
</urlNormalizer>

<numThreads>2</numThreads>

<maxDepth>3</maxDepth>

<maxURLs>-1</maxURLs>

<workDir>$workdir</workDir>

<keepDownloads>false</keepDownloads>

<deleteOrphans>true</deleteOrphans>

<crawlURLDatabaseFactory class="${basepkg}.db.impl.DefaultCrawlURLDatabaseFactory" />

<robotsTxt ignore="false" class="$robotsTxt"/>

<httpHeadersFetcher class="${basepkg}.fetch.impl.SimpleHttpHeadersFetcher"  validStatusCodes="200" />

<httpHeadersChecksummer class="$headersChecksummer" field="collector.http.url" />

<httpDocumentFetcher class="$docFetcher" validStatusCodes="200" />

<httpDocumentChecksummer class="$docChecksummer"/>

<robotsMeta ignore="false" class="$robotsMeta" />

<urlExtractor class="$urlExtractor" />

Extracted URLs are wrong when page has been redirected

The extracted URLs of pages that have been redirected are wrong because the original URL is taken as the basis to build the extracted URLs and not the current redirected location.

E.g. www.example.com/foo is redirected to www.example.com/foo/. The page on www.example.com/foo/ contains a relative link to page1.html as <a href="page1.html">Page 1</a>

The extracted URL is now www.example.com/page1.html instead of www.example.com/foo/page1.html, because the UrlParts.relativeBase member in DefaultURLExtractor.extractUrl() points to www.example.com instead of the newly redirected location www.example.com/foo/. A redirect should be detected and the new location URL used for further processing.

Schedule the collector for iterative crawls

How do you go about scheduling the crawler for iterative crawls? E.g., I want the crawler to run at midnight every day, and it should handle the case where the crawler is already running.

Trying to start 2.0.1 with old configs.

Hello!

Haven't visited you for a long time :)

I cleaned the workdir and tried to launch the collector via the command line, and I get these exceptions:

WARN  [ConfigurationUtil] Could not instantiate object from configuration for node: crawlerDefaults key: robotsTxt
com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.collector.http.robot.impl.DefaultRobotsTxtProvider".
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:177)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:292)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:258)
    at com.norconex.collector.http.crawler.HttpCrawlerConfig.loadCrawlerConfigFromXML(HttpCrawlerConfig.java:285)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:325)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:72)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:176)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:65)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: java.lang.ClassNotFoundException: com.norconex.collector.http.robot.impl.DefaultRobotsTxtProvider
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:260)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:175)
    ... 10 more
WARN  [ConfigurationUtil] Could not instantiate object from configuration for node: crawlerDefaults key: robotsMeta
com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.collector.http.robot.impl.DefaultRobotsMetaProvider".
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:177)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:292)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:258)
    at com.norconex.collector.http.crawler.HttpCrawlerConfig.loadCrawlerConfigFromXML(HttpCrawlerConfig.java:309)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:325)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:72)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:176)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:65)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: java.lang.ClassNotFoundException: com.norconex.collector.http.robot.impl.DefaultRobotsMetaProvider
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:260)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:175)
    ... 10 more
Exception in thread "main" java.lang.NoClassDefFoundError: com/norconex/committer/AbstractMappedCommitter
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:260)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:175)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:292)
    at com.norconex.commons.lang.config.ConfigurationUtil.newInstance(ConfigurationUtil.java:258)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:318)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:123)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:83)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:176)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:65)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: java.lang.ClassNotFoundException: com.norconex.committer.AbstractMappedCommitter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 24 more

What am I doing wrong, and are old configs still good for the new version of the collector?

Possibly the collector adds artefacts to the end of the URI

Hello!

I used the generic URL normalizer the way it is described in the sample config file:

<urlNormalizer class="${basepkg}.url.impl.GenericURLNormalizer">
            <normalizations>
                lowerCaseSchemeHost, upperCaseEscapeSequence, decodeUnreservedCharacters, removeDefaultPort 
            </normalizations>
            <replacements>
                <replace>
                    <match>&amp;view=print</match>
                    <replacement>&amp;view=html</replacement>
                </replace>
            </replacements>
        </urlNormalizer>

After launching the collector I get this message:

INFO  [DefaultSitemapResolver]          Resolved: http://www.somehost.com/sitemap.xml
ERROR [URLProcessor] Could not process URL: http://www.somehost.com/hook/"
com.norconex.commons.lang.url.URLException: Invalid URL syntax: http://www.somehost.com/hook/"
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:176)
    at com.norconex.collector.http.url.impl.GenericURLNormalizer.normalizeURL(GenericURLNormalizer.java:163)
    at com.norconex.collector.http.crawler.URLProcessor$URLNormalizerStep.processURL(URLProcessor.java:202)
    at com.norconex.collector.http.crawler.URLProcessor.processURL(URLProcessor.java:310)
    at com.norconex.collector.http.crawler.URLProcessor.processURL(URLProcessor.java:98)
    at com.norconex.collector.http.crawler.DocumentProcessor$URLExtractorStep.processURL(DocumentProcessor.java:292)
    at com.norconex.collector.http.crawler.DocumentProcessor.processURL(DocumentProcessor.java:118)
    at com.norconex.collector.http.crawler.HttpCrawler.processNextQueuedURL(HttpCrawler.java:400)
    at com.norconex.collector.http.crawler.HttpCrawler.processNextURL(HttpCrawler.java:321)
    at com.norconex.collector.http.crawler.HttpCrawler.access$0(HttpCrawler.java:303)
    at com.norconex.collector.http.crawler.HttpCrawler$ProcessURLsRunnable.run(HttpCrawler.java:582)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Illegal character in path at index 29: http://www.somehost.com/hook/"
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:588)
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:174)
    ... 13 more
INFO  [CrawlStatus]      ERROR >   (1) http://www.somehost.com/hook/"

startURLs at config doesn't contain double quotes as XML sitemap also.

Reading sources of GenericURLNormalizer, URLNormalizer and documentation of URI I found that an error possibly were because double quotes symbol added at the end of URI: http://www.somehost.com/hook/" (position 29, leading 0).

I can't locate where this happens, nor can I find a misconfiguration on my side, so your help would be highly appreciated.

PS: in some places in the sample configuration from ./examples I had to add {} symbols manually to attributes like ... class="{$basepackage}.db.im.... to get it working. Should I file that as a separate issue, or am I just configuring it wrong?

Stop/Start issue

Hello. I'm glad you keep publishing new releases; I'm using 1.3.4 with pleasure.

If you call collector-http.sh -a stop -c ... on an existing but not running crawler, you may never be able to launch that crawler again with -a start|resume, as it always receives the STOP signal and tries to shut down. The only cure is to delete the working directory, which isn't a good option for me, as it means a large re-crawl.

I expected that launching the collector once with the 'start' or 'resume' action would "clear" that stop signal so that the second launch would succeed, but even the 3rd and 4th launches receive the STOP signal again and again.

Could you check this issue, please?

Stop action's INFO and ERROR messages related to UTF-8

Every time I stop the collector from the shell, it shows the following INFO and ERROR messages in the log:

INFO  [FileStopRequestHandler$1] STOP request received.
INFO  [HttpCrawler] Stopping the crawler "www.site.com".
ERROR [DefaultSitemapResolver] Cannot fetch sitemap: http://www.ifb.unisg.ch/sitemap_index.xml (ParseError at [row,col]:[5,21]
Message: Invalid byte 2 of 3-byte UTF-8 sequence.)
INFO  [DefaultSitemapResolver] Resolving sitemap: http://www.unisg.ch/Sitemaps/Sitemap.xml
ERROR [DefaultSitemapResolver] Cannot fetch sitemap: http://www.ifb.unisg.ch/sitemap_index.xml (ParseError at [row,col]:[5,21]
Message: Invalid byte 2 of 3-byte UTF-8 sequence.)
INFO  [DefaultSitemapResolver] Resolving sitemap: http://www.unisg.ch/Sitemaps/Sitemap.xml

The collector produces them for 1-3 minutes and then halts normally.

My guess is that streams or other resources are closed too early, while running threads are still using them; after several unsuccessful tries the threads then close normally.

Crawled page advanced logic

I have the following task:

  1. Filter the collected page of HTML markup (menus, etc.), keeping only the content. The default importer seems to have such logic, but I need something more advanced, including whitespace/line-end collapsing, removing elements whose CSS class names match a set of patterns, etc.
  2. Check the text for keywords/key phrases and, only if they are present, allow the document to be committed.

For this purpose I plan to write a custom Importer. Am I right to pick that interface? What would you recommend I use for this?
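To make the second step concrete, here is a minimal plain-Java sketch of the keyword gate I have in mind; the class and method names are mine, and it is written independently of whatever Importer interface would eventually wrap it:

import java.util.Arrays;
import java.util.List;

public class KeywordGate {

    // Collapse runs of whitespace (including line ends) into single spaces.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the cleaned text contains at least one of the key phrases.
    static boolean containsAnyKeyword(String text, List<String> keywords) {
        String lower = text.toLowerCase();
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String extracted = "Quarterly   results\nand outlook";
        List<String> keywords = Arrays.asList("quarterly results", "earnings call");

        String cleaned = collapseWhitespace(extracted);
        // Only documents passing this check would be handed to the committer.
        System.out.println(containsAnyKeyword(cleaned, keywords)); // true
    }
}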

using a proxy debugging tool

Thanks for the fast response on the last one. I do have another question now. I'm trying to use a proxy debugging tool (Charles) to watch the traffic from the crawler. So I've set up the "proxyPort" and "proxyHost" for my client initializer. But when I try to crawl an https site, I get the error below. I suspect it is because I'm using SSL proxying, and httpClient is complaining about a possible man-in-the-middle attack. Is there any way to configure Norconex to ignore the certificate differences?

com.norconex.collector.http.HttpCollectorException: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
at com.norconex.collector.http.handler.impl.DefaultHttpClientInitializer.authenticateUsingForm(DefaultHttpClientInitializer.java:330)
at com.norconex.collector.http.handler.impl.DefaultHttpClientInitializer.initializeHTTPClient(DefaultHttpClientInitializer.java:132)
at com.norconex.collector.http.crawler.HttpCrawler.initializeHTTPClient(HttpCrawler.java:369)
at com.norconex.collector.http.crawler.HttpCrawler.execute(HttpCrawler.java:143)
at com.norconex.collector.http.crawler.HttpCrawler.startExecution(HttpCrawler.java:130)
at com.norconex.jef.AbstractResumableJob.execute(AbstractResumableJob.java:52)
at com.norconex.jef.JobRunner.runJob(JobRunner.java:193)
at com.norconex.jef.JobRunner.runSuite(JobRunner.java:94)
at com.norconex.collector.http.HttpCollector.crawl(HttpCollector.java:172)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:139)
Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
at sun.security.ssl.SSLSessionImpl.getPeerCertificates(Unknown Source)
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:126)
at org.apache.http.conn.ssl.SSLSocketFactory.createLayeredSocket(SSLSocketFactory.java:493)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.updateSecureConnection(DefaultClientConnectionOperator.java:232)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.layerProtocol(ManagedClientConnectionImpl.java:401)
at org.apache.http.impl.client.DefaultRequestDirector.establishRoute(DefaultRequestDirector.java:841)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:648)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at com.norconex.collector.http.handler.impl.DefaultHttpClientInitializer.authenticateUsingForm(DefaultHttpClientInitializer.java:328)
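For context, the workaround I have in mind would relax certificate checks on the underlying HttpClient before crawling. Below is a minimal sketch against the HttpClient 4.x API visible in the stack trace; wiring it into a custom client initializer is my assumption, and trusting every certificate is of course only acceptable for local debugging with Charles:

import java.security.cert.X509Certificate;

import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.DefaultHttpClient;

public class TrustAllHttpsClient {
    public static void main(String[] args) throws Exception {
        DefaultHttpClient client = new DefaultHttpClient();

        // Accept every certificate chain, including the one Charles re-signs.
        TrustStrategy trustAll = new TrustStrategy() {
            @Override
            public boolean isTrusted(X509Certificate[] chain, String authType) {
                return true;
            }
        };

        // ALLOW_ALL_HOSTNAME_VERIFIER also skips hostname verification.
        SSLSocketFactory sslFactory = new SSLSocketFactory(
                trustAll, SSLSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
        client.getConnectionManager().getSchemeRegistry()
                .register(new Scheme("https", 443, sslFactory));

        // The client would then be handed to the crawler, e.g. from a custom
        // client initializer (assumed; I don't know the intended extension point).
    }
}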

Tune up collector-http for specific features

I would like to know whether collector-http can fit the requirements of the following task (I appreciate your precious time; I read the documentation first but couldn't find some of the nuances):

A set of hundreds of web sites needs to be monitored for new (not updated) content that appears after a fixed date (starting from now, or the near future), and the detected pages' text (text only, stripped as much as possible of markup and ads) should be delivered in some reasonable way to another piece of software: CSV, JSON, socket transmission, etc. Monitoring starts from a given moment and continues for a fixed time period; no real-time reaction is needed. The list of sites may grow over time, but newly added sites should not be replicated in full; as described above, only pages that are new as of the date the site was added to the list are needed.

I don't expect configuration files/plugins/etc. to be written for me, but please point me in the right direction and tell me whether this is possible at all with collector-http. It is a very specific task, and none of the crawlers I checked provide this functionality out of the box, nor could I figure it out from their documentation.

Thank you for your assistance; I hope this nice software helps me with my research tasks.

Issue when temporary folders are created (Unix/Mac)

Hello,

When a crawler's working directory is specified in the configuration file, the working directories are not properly created on Unix-based systems.

Instead of having one directory with two children:
crawlerName/
--> logs/
--> progress/

It is creating two sister directories with the names:
crawlerName\logs/
crawlerName\progress/

To reproduce the issue, run the complex configuration under examples in Linux.

It looks like a backslash ("\") is used as the path separator, which only works on Windows.
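A small illustration of the difference, assuming the directories are currently built by concatenating a hard-coded backslash:

import java.io.File;

public class WorkdirPaths {
    public static void main(String[] args) {
        // Concatenating "\" by hand yields a single oddly named sibling
        // directory on Unix instead of a child directory.
        System.out.println("crawlerName" + "\\" + "logs");       // crawlerName\logs

        // Letting java.io.File (or File.separator) join the segments is portable.
        System.out.println(new File("crawlerName", "logs"));     // crawlerName/logs on Unix
        System.out.println(new File("crawlerName", "progress")); // crawlerName/progress on Unix
    }
}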

Thanks,
Khalid

form based authentication

I noticed on the overview doc page it says "Supports various website authentication schemes - Supports standard/popular implementations out-of-the-box". But looking at the rest of the doc, I don't see how to configure any type of authentication. Can you provide an example of how to post to a form-based login on a website before crawling?
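For context, this is the kind of form POST I mean, sketched with the raw HttpClient 4.x API; the login URL and field names below are placeholders rather than anything taken from the collector's configuration:

import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class FormLoginSketch {
    public static void main(String[] args) throws Exception {
        DefaultHttpClient client = new DefaultHttpClient();

        // Placeholder login URL and form field names.
        HttpPost post = new HttpPost("https://example.com/login");
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("username", "myuser"));
        form.add(new BasicNameValuePair("password", "mypass"));
        post.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));

        // The session cookie from the response stays in the client's cookie
        // store and would be reused for the crawl requests that follow.
        HttpResponse response = client.execute(post);
        EntityUtils.consume(response.getEntity());
    }
}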

Thanks.

Adding Solr parameters to the committer

Hello,

Sometimes we need to pass parameters with the Solr Committer URL. This is very useful, especially when dealing with something like LucidWorks.

Here is an example of what would be done:

http://127.0.0.1:8888/solr/collection1
par1, par2 <---- to be added
support_url
body
5
5

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1234");
doc.addField("body", "test");
SolrServer server = new CommonsHttpSolrServer("http://localhost:8888/solr/collection1");
UpdateRequest req = new UpdateRequest();
req.setParam("par1", someValue); // <-- to be added: extra URL parameters on the update request
req.add(doc);
req.process(server);

Thanks,
Khalid

Collector tasks queue after stopping

Hello!

After stopping the collector I get a log full of messages like these, and stopping takes a very long time (hours).

[non-job]: 2014-10-03 15:59:08,822 [pool-1-thread-2] ERROR ap.impl.DefaultSitemapResolver  - Cannot fetch sitemap: http://finance.yahoo.com/sitemap/videos/94.xml (Timeout waiting for connection from pool)
[non-job]: 2014-10-03 15:59:38,823 [pool-1-thread-2] ERROR ap.impl.DefaultSitemapResolver  - Cannot fetch sitemap: http://finance.yahoo.com/sitemap/videos/95.xml (Timeout waiting for connection from pool)
[non-job]: 2014-10-03 16:00:08,824 [pool-1-thread-2] ERROR ap.impl.DefaultSitemapResolver  - Cannot fetch sitemap: http://finance.yahoo.com/sitemap/videos/96.xml (Timeout waiting for connection from pool)
[non-job]: 2014-10-03 16:00:38,825 [pool-1-thread-2] ERROR ap.impl.DefaultSitemapResolver  - Cannot fetch sitemap: http://finance.yahoo.com/sitemap/videos/97.xml (Timeout waiting for connection from pool)
[non-job]: 2014-10-03 16:01:08,826 [pool-1-thread-2] ERROR ap.impl.DefaultSitemapResolver  - Cannot fetch sitemap: http://finance.yahoo.com/sitemap/videos/98.xml (Timeout waiting for connection from pool)
[non-job]: 2014-10-03 16:01:38,827 [pool-1-thread-2] ERROR ap.impl.DefaultSitemapResolver  - Cannot fetch sitemap: http://finance.yahoo.com/sitemap/videos/99.xml (Timeout waiting for connection from pool)

As I understand the collector's flow, all already-queued tasks must finish before the instance stops, so from the moment the stop command is received the app schedules no new tasks but executes everything already planned.

However, after receiving the stop command the app also clears the connection pool and does not allow new connections to be opened, so ALL remaining queued tasks cannot finish normally and are killed by timeout. This means a crawler with a large queue does not drain its task queue normally and shuts down extremely slowly.
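To illustrate, the same "Timeout waiting for connection from pool" message can be produced with the HttpClient 4.2 pooling manager alone once no more connections are being handed out; whether the collector really shuts the pool down this way is my guess, but the symptom matches:

import java.util.concurrent.TimeUnit;

import org.apache.http.HttpHost;
import org.apache.http.conn.ClientConnectionRequest;
import org.apache.http.conn.ConnectionPoolTimeoutException;
import org.apache.http.conn.routing.HttpRoute;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

public class PoolTimeoutDemo {
    public static void main(String[] args) throws Exception {
        PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
        cm.setMaxTotal(1);
        cm.setDefaultMaxPerRoute(1);

        HttpRoute route = new HttpRoute(new HttpHost("finance.yahoo.com", 80));

        // The first lease takes the only slot (no socket is opened yet)...
        ClientConnectionRequest first = cm.requestConnection(route, null);
        first.getConnection(1, TimeUnit.SECONDS);

        // ...so the second lease waits and then fails, just like the queued
        // sitemap tasks once the pool stops handing out connections.
        ClientConnectionRequest second = cm.requestConnection(route, null);
        try {
            second.getConnection(1, TimeUnit.SECONDS);
        } catch (ConnectionPoolTimeoutException e) {
            System.out.println(e.getMessage()); // Timeout waiting for connection from pool
        }
    }
}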

Could we discuss this possible issue?

RenameTagger only works with collector.http.* fields

Hello,

For some reason, RenameTagger only works when the field name starts with collector.http.*, such as collector.http.MIMETYPE. It didn't work for me when the field names were "support_url" and "dc:title".

Here is how I have it set up:
.
.
.

Please let me know if you need my config to reproduce the issue.

Thanks,
Khalid

DeleteTagger creating crawl errors

Hello,

With the following setup in the configuration file:


.
.
.

I get the following errors while the URLs are being crawled:
.
.
.
.
ERROR [HttpCrawler] Could not process document: http://some/url (null)
INFO [CrawlStatus] ERROR > (1) http://some/url
ERROR [HttpCrawler] Could not process document: http://some/url (null)
INFO [CrawlStatus] ERROR > (1) http://some/url
ERROR [HttpCrawler] Could not process document: http://some/url (null)
INFO [CrawlStatus] ERROR > (1) http://some/url
ERROR [HttpCrawler] Could not process document: http://some/url (null)
INFO [CrawlStatus] ERROR > (1) http://some/url
ERROR [HttpCrawler] Could not process document: http://some/url (null)
INFO [CrawlStatus] ERROR > (1) http://some/url
.
.
.
.

I can provide you with my config file if needed.

Thanks,
Khalid
