continuumio / nutchpy Goto Github PK

For interacting with nutch via Python

License: Apache License 2.0

Shell 6.49% Python 32.00% Java 61.12% Batchfile 0.38%

nutchpy's Issues

JVM gateway process remains alive even after the termination of python's reader process

Request :

Auto shutdown gateway process when the python process exits.

How to reproduce:

Run the sequence_reader examples a few times, then inspect the process list.
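One way to approach the request is to register a cleanup hook when the gateway process is started. The following is a minimal standard-library sketch, not nutchpy's actual launcher: `launch_with_cleanup` is a hypothetical helper, and the command passed to it stands in for whatever command nutchpy uses to start the JVM.

```python
import atexit
import subprocess


def launch_with_cleanup(cmd):
    """Start a child process and terminate it when the interpreter exits.

    Hypothetical helper: `cmd` stands in for the real JVM gateway command.
    """
    proc = subprocess.Popen(cmd)

    def _stop():
        # poll() returns None while the child is still running.
        if proc.poll() is None:
            proc.terminate()
            proc.wait()

    # Runs on normal interpreter shutdown; it does not cover hard kills
    # of the Python process (for example SIGKILL).
    atexit.register(_stop)
    return proc, _stop
```

Newer py4j releases also document a `die_on_exit` option on `launch_gateway`, which may be a simpler route if the installed py4j version supports it.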

Compilation Error on Mac OS X 10.10

I'm getting a compilation error while trying to build the latest version. Here's the output:

➜  nutchpy git:(master) sudo python setup.py install
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building seqreader-app 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ seqreader-app ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/ayberk/nutchpy/seqreader-app/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ seqreader-app ---
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 6 source files to /Users/ayberk/nutchpy/seqreader-app/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/ayberk/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/RecordIterator.java:[19,8] com.continuumio.seqreaderapp.RecordIterator is not abstract and does not override abstract method remove() in java.util.Iterator
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.077 s
[INFO] Finished at: 2015-09-29T17:16:43-07:00
[INFO] Final Memory: 17M/115M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project seqreader-app: Compilation failure
[ERROR] /Users/ayberk/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/RecordIterator.java:[19,8] com.continuumio.seqreaderapp.RecordIterator is not abstract and does not override abstract method remove() in java.util.Iterator
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Traceback (most recent call last):
  File "setup.py", line 132, in <module>
    shutil.copy(jar_file,java_lib_dir)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'seqreader-app/target/seqreader-app-1.0-SNAPSHOT-jar-with-dependencies.jar'

I use java version "1.7.0_79" and python version "2.7.10".
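The Python traceback at the end of this log is a side effect: the Maven build failed, so the jar that setup.py tries to copy was never produced. A hedged sketch of how the copy step could report this more clearly is shown below. `copy_built_jar` is a hypothetical helper, not code from nutchpy's setup.py; only the `shutil.copy(jar_file, java_lib_dir)` call is taken from the traceback.

```python
import os
import shutil


def copy_built_jar(jar_file, java_lib_dir):
    """Copy the Maven-built jar, failing with a readable message if it is missing."""
    if not os.path.isfile(jar_file):
        # A missing jar almost always means the Maven step failed earlier.
        raise RuntimeError(
            "Expected jar not found: %s. The Maven build probably failed; "
            "check the [ERROR] lines in the build output." % jar_file
        )
    if not os.path.isdir(java_lib_dir):
        os.makedirs(java_lib_dir)
    return shutil.copy(jar_file, java_lib_dir)
```

With a check like this, the user sees a pointer to the compilation error instead of an unrelated IOError.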

Generate Webgraph

Ask for the web graph of the currently crawled sites.

sequence_reader.slice() is skipping the first record in sequence file

Test case:

import os
import nutchpy

path = os.path.dirname(nutchpy.__file__)
path = os.path.join(path,"ex_data", "crawldb_data")

first_slice = nutchpy.sequence_reader.slice(0, 2, path)
head = nutchpy.sequence_reader.head(2, path)
print("Head = %s\nSlice = %s" % (head[0][0], first_slice[0][0]))
assert first_slice[0][0] == head[0][0]
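For reference, the behaviour the test case expects can be sketched in pure Python. This is only an illustration of the intended semantics over a plain iterable, not the Java implementation behind `sequence_reader`; the function names below are hypothetical.

```python
from itertools import islice


def head_records(records, n):
    """Return the first n records of an iterable."""
    return list(islice(records, n))


def slice_records(records, start, stop):
    """Return records with index start <= i < stop (0-based, stop exclusive)."""
    return list(islice(records, start, stop))
```

Under these semantics, `slice_records(data, 0, 2)` and `head_records(data, 2)` return the same records, which is what the assertion above checks. `islice` also simply stops when the input runs out, so a range that extends past the end yields a shorter list rather than an error.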

conda install apache-maven / nutchpy with blaze doesn't work with linux 32-bit versions

I consistently get an error when I try to install apache-maven or nutchpy with conda on 32-bit Ubuntu 12.04. I also checked whether Anaconda has any 32-bit-compatible build of nutchpy, but didn't find one.

presha@presha-Inspiron-N5110:~/IR$ conda install -c blaze apache-maven
Fetching package metadata: ......
Error: No packages found in current linux-32 channels matching: apache-maven

You can search for this package on anaconda.org with

anaconda search -t conda apache-maven

presha@presha-Inspiron-N5110:~/IR$ conda install -c blaze nutchpy
Fetching package metadata: ......
Error: No packages found in current linux-32 channels matching: nutchpy

You can search for this package on anaconda.org with

anaconda search -t conda nutchpy

presha@presha-Inspiron-N5110:~/IR$ anaconda search -t conda nutchpy
Using binstar api site https://api.anaconda.org
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
Name | Version | Package Types | Platforms
------------------------- | ------ | --------------- | ---------------
blaze/nutchpy | 0.1 | conda | linux-64, win-64, osx-64
Found 1 packages

Could someone please suggest an alternative way to install it on 32-bit Ubuntu 12.04?

[renovate on-prem migration] Obsolete Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

chore(deps): update dependency org.apache.nutch:nutch to v2

Detected dependencies

maven

seqreader-app/pom.xml

junit:junit 4.13.2

org.apache.commons:commons-jexl3 3.2.1

org.apache.nutch:nutch 1.18

org.apache.hadoop:hadoop-common 3.3.2

net.sf.py4j:py4j 0.10.9.5

org.apache.maven.plugins:maven-compiler-plugin 3.10.1

org.apache.maven.plugins:maven-jar-plugin 3.2.2

org.apache.maven.plugins:maven-dependency-plugin 3.3.0

org.apache.maven.plugins:maven-assembly-plugin 3.3.0

pip_setup

setup.py

py4j >=0.8.2.1

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

maven

seqreader-app/pom.xml

junit:junit 4.13.2

org.apache.commons:commons-jexl3 3.2.1

org.apache.nutch:nutch 1.18

org.apache.hadoop:hadoop-common 3.3.2

net.sf.py4j:py4j 0.10.9.5

org.apache.maven.plugins:maven-compiler-plugin 3.10.1

org.apache.maven.plugins:maven-jar-plugin 3.2.2

org.apache.maven.plugins:maven-dependency-plugin 3.3.0

org.apache.maven.plugins:maven-assembly-plugin 3.3.0

pip_setup

setup.py

py4j >=0.8.2.1

Check this box to trigger a request for Renovate to run again on this repository

Unable to use LinkReader

I tried to use link_reader instead of sequence_reader with the following code:

import os
import nutchpy

path = os.path.dirname(nutchpy.__file__)
path = os.path.join(path,"ex_data", "crawldb_data")
data = nutchpy.link_reader.read(path)

It crashed with the following output:

Traceback (most recent call last):
  File "/Users/Antrromet/Documents/LiClipse Workspace/NutchPy/src/seq_reader.py", line 16, in <module>
    data = nutchpy.link_reader.read(path)
  File "/Library/Python/2.7/site-packages/nutchpy/readers.py", line 25, in read
    data = self.reader.read(path)
  File "/Library/Python/2.7/site-packages/py4j-0.9-py2.7.egg/py4j/java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Library/Python/2.7/site-packages/py4j-0.9-py2.7.egg/py4j/protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.LinkReader.read.
: java.io.IOException: wrong value class: url: null, anchor: , score: 0.0, timestamp: 0, link type: unknown is not class org.apache.nutch.crawl.CrawlDatum
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1874)
    at com.continuumio.seqreaderapp.LinkReader.read(LinkReader.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Out of all the readers mentioned here, only SequenceReader works.

java.lang.OutOfMemoryError: Java heap space | SequenceReader.read

I am getting the error below while parsing a data file in crawldb. The file is 136 MB and contains a large number of URLs.

As far as I can tell, SequenceReader.read loads all of the extracted data into the Java heap before processing it. It therefore works for small inputs (given my allotted heap space) and fails for large ones.

Question: is it possible to process the data incrementally (in parts) and release memory along the way? Perhaps something like StreamingSequenceReader.read?

_Error Details_
Traceback (most recent call last):
  File "get_domains.py", line 41, in <module>
    main(sys.argv[1])
  File "get_domains.py", line 18, in main
    parse_data = nutchpy.sequence_reader.read(data)
  File "/root/anaconda/lib/python2.7/site-packages/nutchpy/readers.py", line 113, in read
    data = self.reader.read(path)
  File "/root/anaconda/lib/python2.7/site-packages/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/root/anaconda/lib/python2.7/site-packages/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.read.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at org.apache.nutch.crawl.CrawlDatum.toString(CrawlDatum.java:409)
at com.continuumio.seqreaderapp.SequenceReader.read(SequenceReader.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
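Until a streaming reader exists, one possible workaround for the question above is to read the file in fixed-size windows. The sketch below assumes a reader object exposing `slice(start, stop, path)` with the argument order used in the slice test case elsewhere on this page, and assumes that a window past the end of the file comes back empty or short. Neither assumption is confirmed here, and `iter_chunks` itself is a hypothetical helper.

```python
def iter_chunks(reader, path, chunk_size=1000):
    """Yield records one at a time, fetching them in windows of chunk_size.

    Assumes reader.slice(start, stop, path) returns a sequence that is
    empty or shorter than requested once the end of the file is reached.
    """
    start = 0
    while True:
        chunk = reader.slice(start, start + chunk_size, path)
        if len(chunk) == 0:
            break
        for record in chunk:
            yield record
        if len(chunk) < chunk_size:
            # A short window means the end of the file was reached.
            break
        start += chunk_size
```

Usage would look like `for record in iter_chunks(nutchpy.sequence_reader, path, 5000): ...`. Only one window is held in memory at a time, though each call may rescan the file from the beginning on the Java side, so this trades speed for a bounded heap.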

Evaluate Scala as Java replacement

Begin to integrate Nutch REST services

We should start integrating Nutch's new REST API:

https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI

This can follow the same approach I used for Tika-Python.

Bad error without pom.xml

Bug report from Shadi Saleh.

When installing, the following errors occur:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building seqreader-app 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ seqreader-app ---
[WARNING] Using platform encoding (ANSI_X3.4-1968 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /root/py_nutch/nutchpy/seqreader-app/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.0.2:compile (default-compile) @ seqreader-app ---
[INFO] Compiling 5 source files to /root/py_nutch/nutchpy/seqreader-app/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.213s
[INFO] Finished at: Sat Jan 10 03:23:39 EST 2015
[INFO] Final Memory: 9M/153M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project seqreader-app: Compilation failure: Compilation failure:
[ERROR] /root/py_nutch/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/LinkReader.java:[21,25] error: generics are not supported in -source 1.3
[ERROR]
[ERROR] (use -source 5 or higher to enable generics)
[ERROR] /root/py_nutch/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/SequenceReader.java:[47,12] error: generics are not supported in -source 1.3
[ERROR]
[ERROR] (use -source 5 or higher to enable generics)
[ERROR] /root/py_nutch/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/NodeReader.java:[21,25] error: generics are not supported in -source 1.3
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Traceback (most recent call last):
  File "setup.py", line 132, in <module>
    shutil.copy(jar_file,java_lib_dir)
  File "/usr/lib/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'seqreader-app/target/seqreader-app-1.0-SNAPSHOT-jar-with-dependencies.jar'

But the build worked after adding the following to nutchpy/seqreader-app/pom.xml:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <compilerVersion>1.5</compilerVersion>
        <source>1.5</source>
        <target>1.5</target>
      </configuration>
    </plugin>

can't find the file

Hi,
I downloaded nutchpy and tried to run the example, but it failed with the error "FileNotFoundError: [WinError 2] The system cannot find the file specified".
My operating system is Windows 10, and the full console output is shown below.

Traceback (most recent call last):

  File "C:\cygwin64\home\nutchpy\test.py", line 1, in <module>
    import nutchpy
  File "C:\cygwin64\home\nutchpy\nutchpy\__init__.py", line 6, in <module>
    from .JVM import gateway
  File "C:\cygwin64\home\nutchpy\nutchpy\JVM.py", line 91, in <module>
    gateway = NutchJavaGateway().gateway
  File "C:\cygwin64\home\nutchpy\nutchpy\JVM.py", line 88, in gateway
    self._gateway = launch_gateway()
  File "C:\cygwin64\home\nutchpy\nutchpy\JVM.py", line 38, in launch_gateway
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "C:\Python35-32\lib\subprocess.py", line 950, in __init__
    restore_signals, start_new_session)
  File "C:\Python35-32\lib\subprocess.py", line 1220, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

I don't know how to fix this, and I have no idea which file the system is looking for. I am sure the path to the crawl result data is correct.

Thank you.
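The traceback shows the error coming from `subprocess` inside `launch_gateway`, which on Windows usually means the program being launched could not be found, not the crawl data. A quick diagnostic sketch follows; it assumes (without confirmation from the source) that the gateway is started by invoking a `java` executable from PATH, and `require_executable` is a hypothetical helper.

```python
import shutil


def require_executable(name):
    """Return the full path of an executable on PATH, or raise a clear error."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            "Could not find %r on PATH; install it or add its directory to PATH." % name
        )
    return path
```

Running `require_executable("java")` before `import nutchpy` would show whether a missing Java installation is the cause.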

Error when trying to run nutch crawl

(memex-explorer)cdoig@066-cdoig:~$ crawl ~/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/  ~/work/memex/memex/court_docs/crawl_test 4
JAVA_HOME is set to '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home'
~/anaconda/envs/memex-explorer/lib/nutch ~
No SOLRURL specified. Skipping indexing.
Injecting seed URLs
/Users/cdoig/anaconda/envs/memex-explorer/lib/nutch/bin/nutch inject /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/
Injector: starting at 2015-03-27 09:55:10
Injector: crawlDb: /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb
Injector: urlDir: /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir
Injector: Converting injected urls to crawl db entries.
Injector: java.net.UnknownHostException: 066-cdoig: 066-cdoig: nodename nor servname provided, or not known
    at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:960)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:324)
    at org.apache.nutch.crawl.Injector.run(Injector.java:380)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:370)
Caused by: java.net.UnknownHostException: 066-cdoig: nodename nor servname provided, or not known
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
    at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
    ... 12 more

Error running:
  /Users/cdoig/anaconda/envs/memex-explorer/lib/nutch/bin/nutch inject /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/
Failed with exit value 255.
~
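The `UnknownHostException` in this log is raised by `InetAddress.getLocalHost`, so the machine's own hostname (`066-cdoig`) does not resolve to an address. The same condition can be checked from Python with a small sketch like the one below; `hostname_resolves` is a hypothetical helper and not part of nutchpy or Nutch.

```python
import socket


def hostname_resolves(name=None):
    """Return True if the given hostname (default: this machine's) resolves."""
    name = name or socket.gethostname()
    try:
        socket.gethostbyname(name)
    except socket.error:
        # Covers socket.gaierror, raised when the lookup fails.
        return False
    return True
```

If `hostname_resolves()` returns False, adding the hostname to the system's hosts file is a common remedy for this class of error.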

Update CrawlDB from Python

If you are in the middle of a crawl, you should be able to update the crawldb.

EOF

How should EOF be handled for slicing and head?

Static Py4J gateway

The Py4J gateway is currently limited to port 25333. I believe this is a result of the current packaging with Maven. Will continue to investigate.
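One common way to avoid a fixed port is to ask the operating system for a free one and pass it to the gateway at launch. The sketch below shows only the port selection with the standard library; `find_free_port` is a hypothetical helper, and how the chosen port would be handed to the Java side of nutchpy is left open.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Binding to port 0 lets the OS pick an ephemeral port.
        sock.bind((host, 0))
        return sock.getsockname()[1]
    finally:
        sock.close()
```

There is a small race between closing the socket and the gateway binding the port. py4j's `launch_gateway` is documented as accepting a `port` argument where 0 requests an ephemeral port, which would avoid that race if the packaging allows using it.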

Solr stop command not working

The solr command solr stop is not working for me. This is the error I'm getting:

$:~/anaconda/envs/topic_space/solr_pkg/solr$ solr stop
Solr server directory /Users/cdoig/anaconda/envs/topic_space/example not found!
$:~/anaconda/envs/topic_space/solr_pkg/solr$ solr stop -help
Usage: solr stop [-k key] [-p port] [-V]
  -k <key>      Stop key; default is solrrocks
  -p <port>     Specify the port the Solr HTTP listener is bound to; default is 8983
  -all          Find and stop all running Solr servers on this host
  -V            Verbose messages from this script

NOTE: If port is not specified, then all running Solr servers are stopped.

@quasiben

https://github.com/ContinuumIO/nutchpy/blob/master/solr_recipe/build.sh#L17

Version upgrade

I ran into a problem reading the data I crawled; the trace follows. I checked the newest version of Nutch, and the record version is already 8 instead of 7. It might be a good time to upgrade the dependency version.

Traceback (most recent call last):
  File "readStatistics.py", line 96, in <module>
    main()
  File "readStatistics.py", line 19, in main
    count = nutchpy.sequence_reader.count(crawlDbFile)
  File "/Users/Taichi1/miniconda3/lib/python3.4/site-packages/nutchpy/readers.py", line 173, in count
    count = self.reader.count(path)
  File "/Users/Taichi1/miniconda3/lib/python3.4/site-packages/py4j/java_gateway.py", line 813, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/Taichi1/miniconda3/lib/python3.4/site-packages/py4j/protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.count.
: A record version mismatch occured. Expecting v7, found v8
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:246)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
    at com.continuumio.seqreaderapp.SequenceReader.count(SequenceReader.java:198)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
