I'm trying to run this script, but I get an error while indexing the data (job status: FAILED).
https://github.com/chrismattmann/nutch-python/wiki
Script:
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch
sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls=('http://espn.go.com','http://www.espn.com')
sd= sc.create('espn-seed',seed_urls)
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    print("---------------- Printing JOb Progress ------- : ", cc.progress)
    job = cc.progress() # gets the current job if no progress, else iterates and makes progress
    print("---------------- Printing JOb Progess ------- : ", job)
    if job == None:
        break
End output:
`---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'RUNNING', 'msg': 'OK', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : RUNNING
---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
File "test.py", line 18, in <module>
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
raise NutchCrawlException
nutch.nutch.NutchCrawlException
`
**Nutch Hadoop log file:**
`2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
<title>Error 404 Not Found</title>
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
<title>Error 404 Not Found</title>
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)`
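
One possible reading of the log: the indexer posts to `/solr/update` directly under `http://127.0.0.1:8983/solr`, and Solr answers 404. Recent Solr versions serve the update handler per core or collection, at `/solr/<core>/update`, so a base URL without a core name could produce exactly this response. The sketch below only illustrates the difference between the two paths and offers an optional probe. The core name `nutch` and the helper names are hypothetical and do not come from the logs, so treat this as a diagnostic aid under those assumptions, not a confirmed fix.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def solr_update_url(base_url, core=None):
    """Build the update URL that a client would hit for a given Solr base URL.

    `core` is a hypothetical core/collection name; the logs do not show one.
    """
    base = base_url.rstrip("/")
    if core:
        base = f"{base}/{core}"
    return f"{base}/update"


def update_path(base_url, core=None):
    """Return only the path part, which is what the 404 page reports."""
    return urlsplit(solr_update_url(base_url, core)).path


def probe(url, timeout=5):
    """Return the HTTP status for `url` (needs a reachable Solr; not called here)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Base URL as it appears in the log: the path matches the one in the 404 page.
print(update_path("http://127.0.0.1:8983/solr"))
# Same base URL with an assumed core name appended.
print(update_path("http://127.0.0.1:8983/solr", "nutch"))
```

Calling `probe(solr_update_url("http://127.0.0.1:8983/solr", "<your-core>"))` against the running Solr instance would show whether the per-core endpoint responds. If it does, the Solr URL in the Nutch configuration probably needs the core name appended.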