chrismattmann / nutch-python
Nutch-Python is a Python binding to the Apache Nutch™ REST services, allowing Nutch to be called natively from Python.
License: Apache License 2.0
I'm trying to run this script, but I get an error while indexing the data (job status: FAILED).
https://github.com/chrismattmann/nutch-python/wiki
Script:

```python
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv = Server('http://localhost:8081')
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)

while True:
    print("---------------- Printing JOb Progress ------- : ", cc.progress)
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    print("---------------- Printing JOb Progess ------- : ", job)
    if job is None:
        break
```
End output:

```
---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'RUNNING', 'msg': 'OK', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : RUNNING
---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
  File "test.py", line 18, in
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
  File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
    raise NutchCrawlException
nutch.nutch.NutchCrawlException
```
**Nutch Hadoop log file:**

```
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
Problem accessing /solr/update. Reason:
Not Found
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
<title>Error 404 Not Found</title>Problem accessing /solr/update. Reason:
Not Found
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
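The `Not Found` on `/solr/update` is the real failure: Nutch is posting to the Solr server root (`http://127.0.0.1:8983/solr`) rather than to a core's update handler, which usually means `solr.server.url` lacks a core name or the core was never created. As a quick sanity check before launching a crawl, you can build and probe the per-core update URL (a sketch; the core name `nutch` is an assumption, substitute your own):

```python
# Sketch: build and sanity-check the per-core Solr update URL.
# Assumption: the core is named "nutch"; substitute your own core name.
from urllib.error import HTTPError
from urllib.request import urlopen

def solr_update_url(base, core):
    """Return the update-handler URL for a named core, e.g. .../solr/nutch/update."""
    return "{}/solr/{}/update".format(base.rstrip("/"), core)

def core_reachable(base, core):
    """True if a GET on the core's update handler does not 404."""
    try:
        urlopen(solr_update_url(base, core), timeout=5)
        return True
    except HTTPError as e:
        return e.code != 404   # a 404 here reproduces the indexing failure above
    except OSError:
        return False           # Solr not running / unreachable

if __name__ == "__main__":
    print(solr_update_url("http://127.0.0.1:8983", "nutch"))
    # → http://127.0.0.1:8983/solr/nutch/update
```

If the probe 404s, create the core (on recent Solr, for example `bin/solr create -c nutch`) or point `solr.server.url` in `nutch-site.xml` at an existing core.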
Here is the issue for Nutchpy - ContinuumIO/nutchpy#12
Docs for the Reader endpoint - http://docs.nutchpytonutchrestapi.apiary.io/
Hi,
I was able to run the following code on my own Linux machine without a problem:

```python
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv = Server('http://localhost:8081')
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)

while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break
```
However, when I run the same code on AWS (Ubuntu 14.04), it gives a runtime error. Here is the runtime log:

```
nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test', u'args': {u'url_dir': u'/tmp/1456875353316-0'}, u'state': u'IDLE', u'result': None, u'msg': u'idle', u'type': u'GENERATE', u'id': u'test-default-GENERATE-1140031758', u'confId': u'default'}
nutch.py: GET Endpoint: /job/test-default-GENERATE-1140031758
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Tue, 01 Mar 2016 23:36:35 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 204
Traceback (most recent call last):
  File "main.py", line 22, in
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 531, in progress
    jobInfo = currentJob.info()
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 201, in info
    return self.server.call('get', '/job/' + self.id)
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 160, in call
    raise error
```
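The traceback shows `cc.progress()` re-raising whatever error `nutch.py` got from the server; here the trigger is the bodyless 204 response to the job GET. A more forgiving polling loop can tolerate a few such hiccups before giving up. This is a sketch written against a plain callable (standing in for `cc.progress`), not a patch to nutch-python itself; `max_retries` and `delay` are made-up knobs:

```python
# Sketch: poll a progress callable, tolerating transient server errors.
# progress_fn stands in for cc.progress; max_retries/delay are assumptions.
import time

def poll_until_done(progress_fn, max_retries=3, delay=1.0):
    """Call progress_fn until it returns None, retrying on exceptions."""
    failures = 0
    while True:
        try:
            job = progress_fn()
        except Exception:
            failures += 1
            if failures > max_retries:
                raise  # persistent failure: surface the original error
            time.sleep(delay)
            continue
        failures = 0  # a successful call resets the error budget
        if job is None:  # crawl finished
            return
```

With the same objects as in the script above, `poll_until_done(cc.progress)` would replace the bare `while True` loop.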
In order to run the Python code, I started Nutch with `/bin/nutch startserver`. Here is the server-side run log:

```
Injector: starting at 2016-03-01 23:35:53
Injector: crawlDb: test/crawldb
Injector: urlDir: /tmp/1456875353316-0
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 2
Injector: Total new urls injected: 2
Injector: finished at 2016-03-01 23:36:34, elapsed: 00:00:40
Generator: starting at 2016-03-01 23:36:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test/segments/20160301233638
Generator: finished at 2016-03-01 23:36:40, elapsed: 00:00:05
```
I would appreciate it if you could help.
Thanks
The Seed API was refactored according to https://issues.apache.org/jira/browse/NUTCH-2090.
The Python code needs to be refactored to match the specification of the new API.
Or we need backward compatibility in Nutch for the Seed API.
Unfortunately I think I've left the CLI a bit behind with the last few weeks of work. The best examples of usage are in the test_nutch.py file right now. I've been waiting for the dust to settle a bit around the refactor before updating the documentation, but this is a reasonable time to do so.
What are the primary goals of the CLI? Should it expose the full RESTful interface, or are there specific tasks you want to enable? Personally, I'm a bit more interested in the Pythonic interface, but I want to make sure we keep our use cases covered.
I used the following command to initialize the Nutch object.
```python
nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')
```
But it gave me the following error:
```
nutch.py: GET Endpoint: /config/crawlTest
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 10:26:13 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {}
Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 609, in __init__
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 302, in __getitem__
KeyError
```
Ideally, the above should have worked, because it should have fallen back to the default configuration and been able to find it. Unfortunately it doesn't, and throws the KeyError.
I even tried explicitly passing the default config (although it shouldn't matter, since it's the default parameter), but in vain.
```python
nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')
```
The above gave me the following error.
```
Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')
TypeError: __init__() got multiple values for keyword argument 'confId'
```
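That TypeError is consistent with `confId` being the first positional parameter of `Nutch.__init__`: `'crawlTest'` already binds to `confId`, so the explicit `confId='default'` supplies it a second time. The mechanics can be reproduced with a minimal stand-in class (`Demo` below is hypothetical, not the real signature):

```python
# Sketch: how "multiple values for keyword argument" arises.
# Demo is a hypothetical class whose first parameter is confId.
class Demo:
    def __init__(self, confId='default', urlDir=None):
        self.confId = confId
        self.urlDir = urlDir

d = Demo('crawlTest')  # fine: 'crawlTest' binds positionally to confId
try:
    Demo('crawlTest', confId='default')  # confId supplied twice
except TypeError as e:
    print(type(e).__name__)  # → TypeError
```

If that reading is right, the KeyError above comes from the server having no configuration named `crawlTest`; the scripts earlier in this thread pass the crawl id to `JobClient` and only the config id to `Nutch`.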
When I run the Python code:

```python
def get_seed(seedListData=('http://www.zidaho.com', 'http://www.iguntrade.com')):
    sc = get_seed_client()
    return sc.create('test_seed', seedListData)
```
it fails, with the following printed to stdout:
```
nutch.py: POST Request data: {'id': '12345', 'seedUrls': [{'url': 'http://www.zidaho.com', 'seedList': None, 'id': 0}, {'url': 'http://www.iguntrade.com', 'seedList': None, 'id': 1}], 'name': 'test_seed'}
nutch.py: POST Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Sat, 10 Oct 2015 18:00:58 GMT', 'content-length': '0', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 500
Traceback (most recent call last):
  File "nutchPythonCrawl.py", line 79, in
    test_crawl_client()
  File "nutchPythonCrawl.py", line 73, in test_crawl_client
    cc = get_crawl_client()
  File "nutchPythonCrawl.py", line 30, in get_crawl_client
    seed = get_seed()
  File "nutchPythonCrawl.py", line 21, in get_seed
    return sc.create('test_seed', seedListData)
  File "/Users/rrgirish/anaconda/lib/python2.7/site-packages/nutch/nutch.py", line 433, in create
    seedPath = self.server.call('post', "/seed/create", seedListData, JsonAcceptHeader, forceText=True)
  File "/Users/rrgirish/anaconda/lib/python2.7/site-packages/nutch/nutch.py", line 160, in call
    raise error
```
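A bare 500 with an empty body (`content-length: 0` in the response headers) is hard to act on from the Python side. One way to dig further is to replay the logged request outside nutch-python and print whatever body the server returns. Below is a standard-library sketch: the payload shape is copied from the logged request data, and the endpoint and port are taken from the scripts earlier in this thread, so adjust them to your setup:

```python
# Sketch: rebuild the seed-create payload logged above and replay it
# against the Nutch REST server to see the server-side error body.
import json
from urllib import error, request

def build_seed_payload(name, urls):
    """Mirror the request body shown in the nutch.py log."""
    return {
        "id": "12345",  # literal id taken from the logged request
        "name": name,
        "seedUrls": [{"id": i, "url": u, "seedList": None}
                     for i, u in enumerate(urls)],
    }

if __name__ == "__main__":
    payload = build_seed_payload(
        "test_seed", ["http://www.zidaho.com", "http://www.iguntrade.com"])
    req = request.Request("http://localhost:8081/seed/create",
                          data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json",
                                   "Accept": "application/json"})
    try:
        with request.urlopen(req, timeout=10) as resp:
            print(resp.status, resp.read().decode())
    except error.HTTPError as e:
        print(e.code, e.read().decode())  # a 500 body often names the exception
    except OSError as e:
        print("server unreachable:", e)
```

If the replayed request also 500s, the response body (or the server log) should name the failing Java class, which helps decide whether this is the NUTCH-2090 Seed API change mentioned below.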
Trying to start the command-line tool fails with 'Bad args':
```
➜ nutch git:(master) ✗ nutch-python
nutch.py: Error: Bad args
nutch.py:
A simple python client for Nutch using the Nutch server REST API.
Most commands return results in JSON format by default, or plain text.
To control Nutch, use:
-- from nutch.nutch import Nutch
-- nt = Nutch(crawlId,                          # name your crawl
         confId='default',                      # pick a known config. file
         urlDir='url/',                         # directory containing the seed URL list
         serverEndpoint='localhost:8001',       # endpoint where the Nutch server is running
         **args                                 # additional key=value pairs to submit as args
       )
-- response, status = nt.<command>(**args)      # where command is in the set
                                                # ['INJECT', 'GENERATE', 'FETCH', 'PARSE', 'UPDATEDB']
or
-- status, response = nt.crawl(**args)          # will run the commands in order with echoing of responses

Commands (which become Hadoop jobs):
  inject   - inject a URL seed list into a named crawl
  generate - generate URL list
  fetch    - fetch initial set of web pages
  parse    - parse web pages and invoke Tika metadata extraction
  updatedb - update the crawl database

To get/set the configuration of the Nutch server, use:
-- nt.configGetList()                  # get list of named configurations
-- nt.configGetInfo(id)                # get parameters in named config.
-- nt.configCreate(id, parameterDict)  # create a new named config.

To see the status of jobs, use:
-- nt.jobGetList()   # get list of running jobs
-- nt.jobGetInfo(id) # get metadata for a job id
-- nt.jobStop(id)    # stop a job, DANGEROUS!!, may corrupt segment files
```
It looks like this is directly related to #4; I will update the documentation and main method accordingly if no one's on it.