
nutch-python's Introduction

nutch-python

A Python client library for Apache Nutch that makes Nutch 1.x capabilities available through the Nutch REST server.

See https://wiki.apache.org/nutch/NutchTutorial for instructions on installing Nutch 1.x and, alternatively, operating it from the command line.

This Python client library for Nutch can be installed via Setuptools, pip, or Easy Install.

Installation (with pip)

  1. pip install nutch

Installation (without pip)

  1. python setup.py build
  2. python setup.py install
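After either installation method, a quick sanity check is to import the package and confirm where it was installed (a minimal check; it only assumes the package is importable under the name nutch, as used above):

import nutch
print(nutch.__file__)   # path of the installed package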

Wiki Documentation

See the wiki for instructions on how to use Nutch-Python and its API.
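For a quick orientation before diving into the wiki, here is a minimal sketch of the Pythonic interface, pieced together from the examples further down this page; it assumes a Nutch REST server is already running on localhost:8081 and uses a placeholder seed URL:

from nutch.nutch import Nutch, Server, SeedClient, JobClient

sv = Server('http://localhost:8081')        # running Nutch REST server (assumed endpoint)
sc = SeedClient(sv)
sd = sc.create('example-seed', ('http://example.com',))   # placeholder seed list

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress()   # current job, or None once the crawl is finished
    if job is None:
        break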

New Command Line Tool

When you install Nutch-Python, you also get a new command-line client tool, nutch-python, installed in your /path/to/python/bin directory.

The options and help for the command line tool can be seen by typing nutch-python without any arguments.

Questions, comments?

Send them to Chris A. Mattmann.

Contributors

  • Brian D. Wilson, JPL
  • Chris A. Mattmann, JPL
  • Aron Ahmadia, Continuum Analytics

License

Apache License, version 2


nutch-python's Issues

Cannot start command line tool

Starting the command-line tool fails due to 'bad args':

➜  nutch git:(master) ✗ nutch-python
nutch.py: Error: Bad args
nutch.py:
A simple python client for Nutch using the Nutch server REST API.
Most commands return results in JSON format by default, or plain text.

To control Nutch, use:

-- from nutch.nutch import Nutch
-- nt = Nutch(crawlId,                           # name your crawl
              confId='default',                  # pick a known config. file
              urlDir='url/',                     # directory containing the seed URL list
              serverEndpoint='localhost:8001',   # endpoint where the Nutch server is running
              **args                             # additional key=value pairs to submit as args
             )
-- response, status = nt.<command>(**args)       # where command is in the set
                                                 # ['INJECT', 'GENERATE', 'FETCH', 'PARSE', 'UPDATEDB']
 or
-- status, response = nt.crawl(**args)    # will run the commands in order with echoing of responses

Commands (which become Hadoop jobs):
  inject   - inject a URL seed list into a named crawl
  generate - generate URL list
  fetch    - fetch initial set of web pages
  parse    - parse web pages and invoke Tika metadata extraction
  updatedb - update the crawl database

To get/set the configuration of the Nutch server, use:
-- nt.configGetList()                    # get list of named configurations
-- nt.configGetInfo(id)                  # get parameters in named config.
-- nt.configCreate(id, parameterDict)    # create a new named config.

To see the status of jobs, use:
-- nt.jobGetList()                       # get list of running jobs
-- nt.jobGetInfo(id)                     # get metadata for a job id
-- nt.jobStop(id)                        # stop a job, DANGEROUS!!, may corrupt segment files

It looks like this is directly related to #4; I will update the documentation and the main method accordingly if no one else is on it.
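For reference, a small sketch of the configuration and job inspection calls listed in the help text above; it assumes a running Nutch REST server and that these methods behave as documented there:

from nutch.nutch import Nutch

nt = Nutch('default')
print(nt.configGetList())           # named configurations known to the server
print(nt.configGetInfo('default'))  # parameters in the 'default' configuration
for job in nt.jobGetList():         # currently running jobs
    print(job)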

Runtime error on AWS

Hi,

I was able to run the following code on my own Linux machine without a problem:


from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls=('http://espn.go.com','http://www.espn.com')
sd= sc.create('espn-seed',seed_urls) 

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress() # gets the current job if no progress, else iterates and makes progress
    if job == None:
        break

However, when I run the same code on AWS (Ubuntu 14.04), it gives a runtime error. Here is the runtime log:


nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test', u'args': {u'url_dir': u'/tmp/1456875353316-0'}, u'state': u'IDLE', u'result': None, u'msg': u'idle', u'type': u'GENERATE', u'id': u'test-default-GENERATE-1140031758', u'confId': u'default'}
nutch.py: GET Endpoint: /job/test-default-GENERATE-1140031758
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Tue, 01 Mar 2016 23:36:35 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 204
Traceback (most recent call last):
File "main.py", line 22, in
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 531, in progress
jobInfo = currentJob.info()
File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 201, in info
return self.server.call('get', '/job/' + self.id)
File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 160, in call
raise error

nutch.nutch.NutchException: Unexpected server response: 204

To run the Python code, I started Nutch with /bin/nutch startserver. Here is the corresponding crawl log:

Injector: starting at 2016-03-01 23:35:53
Injector: crawlDb: test/crawldb
Injector: urlDir: /tmp/1456875353316-0
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 2
Injector: Total new urls injected: 2
Injector: finished at 2016-03-01 23:36:34, elapsed: 00:00:40
Generator: starting at 2016-03-01 23:36:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test/segments/20160301233638
Generator: finished at 2016-03-01 23:36:40, elapsed: 00:00:05


I would appreciate it if you could help.

Thanks
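One possible workaround, purely a sketch and not confirmed against this report: catch the NutchException raised on the 204 and retry after a short delay, on the assumption that the job record is simply not available yet on the server side.

import time
from nutch.nutch import NutchException

def progress_with_retry(cc, retries=5, delay=2.0):
    # Retry cc.progress() a few times, treating an unexpected 204 as
    # "job info not ready yet" rather than a fatal error (an assumption).
    last_err = None
    for _ in range(retries):
        try:
            return cc.progress()
        except NutchException as err:
            last_err = err
            time.sleep(delay)
    raise last_err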

Unable to initialize the Nutch object

I used the following command to initialize the Nutch object.

nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')

But it gave me the following error

nutch.py: GET Endpoint: /config/crawlTest
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 10:26:13 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {}
Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 609, in __init__
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 302, in __getitem__
KeyError

Ideally, the above should have worked: it should have fallen back to the default configuration and been able to find it. Unfortunately, it doesn't, and throws the KeyError instead.

I even tried passing the default config explicitly (although it shouldn't matter, since it is the default parameter), but in vain.

nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')

The above gave me the following error.

Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>

    nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')

TypeError: __init__() got multiple values for keyword argument 'confId'

Error: raise NutchCrawlException nutch.nutch.NutchCrawlException while indexing

I'm trying to run this script but get an error while indexing the data (job status: FAILED).

https://github.com/chrismattmann/nutch-python/wiki

Script:

from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls=('http://espn.go.com','http://www.espn.com')
sd= sc.create('espn-seed',seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')

cc = nt.Crawl(sd, sc, jc)
while True:
    print("---------------- Printing JOb Progress ------- : ", cc.progress)
    job = cc.progress() # gets the current job if no progress, else iterates and makes progress
    print("---------------- Printing JOb Progess ------- : ", job)
    if job == None:
        break

End output:

---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'RUNNING', 'msg': 'OK', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : RUNNING

---------------- Printing JOb Progess ------- : <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing JOb Progress ------- : <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
File "test.py", line 18, in
job = cc.progress() # gets the current job if no progress, else iterates and makes progress
File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
raise NutchCrawlException
nutch.nutch.NutchCrawlException

Nutch Hadoop log file:

2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.

<title>Error 404 Not Found</title>

HTTP ERROR 404

Problem accessing /solr/update. Reason:

    Not Found

at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)

Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.

<title>Error 404 Not Found</title>

HTTP ERROR 404

Problem accessing /solr/update. Reason:

    Not Found

at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
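The 404 on /solr/update usually means the Solr URL is missing a core or collection name. Below is a hedged sketch of one way to point Nutch at a concrete core using the REST config API described earlier on this page; the config id 'solr-fix' and the core name 'nutch' are assumptions, not values taken from this report.

from nutch.nutch import Nutch

nt = Nutch('default')
# Create a named configuration whose solr.server.url includes the core name,
# so index updates go to /solr/<core>/update instead of /solr/update.
nt.configCreate('solr-fix', {'solr.server.url': 'http://127.0.0.1:8983/solr/nutch'})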

update `nutch-python` usage/documentation

Unfortunately I think I've left the CLI a bit behind with the last few weeks of work. The best examples of usage are in the test_nutch.py file right now. I've been waiting for the dust to settle a bit around the refactor to update the documentation but this is a reasonable time to do so.

What are the primary goals of the CLI? Should it expose the full RESTful interface, or are there specific tasks you want to enable? Personally, I'm a bit more interested in the Pythonic interface, but I want to make sure we keep our use cases covered.

Unable to create a seed object via Python

When I run the Python code:

def get_seed(seedListData=('http://www.zidaho.com','http://www.iguntrade.com')):
    sc = get_seed_client()
    return sc.create('test_seed', seedListData)

it fails, with the following displayed on stdout:

nutch.py: POST Request data: {'id': '12345', 'seedUrls': [{'url': 'http://www.zidaho.com', 'seedList': None, 'id': 0}, {'url': 'http://www.iguntrade.com', 'seedList': None, 'id': 1}], 'name': 'test_seed'}
nutch.py: POST Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Sat, 10 Oct 2015 18:00:58 GMT', 'content-length': '0', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 500
Traceback (most recent call last):
File "nutchPythonCrawl.py", line 79, in
test_crawl_client()
File "nutchPythonCrawl.py", line 73, in test_crawl_client
cc = get_crawl_client()
File "nutchPythonCrawl.py", line 30, in get_crawl_client
seed = get_seed()
File "nutchPythonCrawl.py", line 21, in get_seed
return sc.create('test_seed', seedListData)
File "/Users/rrgirish/anaconda/lib/python2.7/site-packages/nutch/nutch.py", line 433, in create
seedPath = self.server.call('post', "/seed/create", seedListData, JsonAcceptHeader, forceText=True)
File "/Users/rrgirish/anaconda/lib/python2.7/site-packages/nutch/nutch.py", line 160, in call
raise error
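For comparison, here is a self-contained version of the failing snippet, assuming (as in the other reports on this page) that the Nutch REST server is running on localhost:8081; whether the 500 reproduces depends on the server side.

from nutch.nutch import Server, SeedClient

sv = Server('http://localhost:8081')   # assumed REST endpoint
sc = SeedClient(sv)
seed_urls = ('http://www.zidaho.com', 'http://www.iguntrade.com')
sd = sc.create('test_seed', seed_urls)  # returns the seed path on success
print(sd)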
