continuumio / nutchpy
For interacting with Nutch via Python
License: Apache License 2.0
I'm getting a compilation error while trying to build the latest version. Here's the output:
➜ nutchpy git:(master) sudo python setup.py install
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building seqreader-app 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ seqreader-app ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/ayberk/nutchpy/seqreader-app/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ seqreader-app ---
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 6 source files to /Users/ayberk/nutchpy/seqreader-app/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /Users/ayberk/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/RecordIterator.java:[19,8] com.continuumio.seqreaderapp.RecordIterator is not abstract and does not override abstract method remove() in java.util.Iterator
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.077 s
[INFO] Finished at: 2015-09-29T17:16:43-07:00
[INFO] Final Memory: 17M/115M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project seqreader-app: Compilation failure
[ERROR] /Users/ayberk/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/RecordIterator.java:[19,8] com.continuumio.seqreaderapp.RecordIterator is not abstract and does not override abstract method remove() in java.util.Iterator
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Traceback (most recent call last):
File "setup.py", line 132, in <module>
shutil.copy(jar_file,java_lib_dir)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 119, in copy
copyfile(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 82, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'seqreader-app/target/seqreader-app-1.0-SNAPSHOT-jar-with-dependencies.jar'
I use java version "1.7.0_79" and python version "2.7.10".
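The error above is consistent with compiling under JDK 7: `Iterator.remove()` only gained a default implementation in Java 8, so on 1.7 every class implementing `Iterator` has to define it. Below is a minimal sketch for checking a reported version string before running the build. The helper name `java_major_version` is hypothetical and not part of nutchpy.

```python
import re


def java_major_version(version_string):
    """Return the major Java version from a string such as "1.7.0_79" or "11.0.2"."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string.strip())
    if match is None:
        raise ValueError("unrecognised Java version: %r" % version_string)
    first = int(match.group(1))
    second = match.group(2)
    # Releases up to Java 8 are reported as 1.x; later releases as x.y.z.
    if first == 1 and second is not None:
        return int(second)
    return first


if __name__ == "__main__":
    reported = "1.7.0_79"  # the version quoted in this report
    if java_major_version(reported) < 8:
        print("JDK %s: Iterator.remove() must be implemented explicitly" % reported)
```

If the check reports a version below 8, either building with a newer JDK or adding a `remove()` method to `RecordIterator` should get past this particular compiler error.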
Create a seed list and start crawling
Ask for the web graph of currently crawled sites
Test case:
import os
import nutchpy
path = os.path.dirname(nutchpy.__file__)
path = os.path.join(path,"ex_data", "crawldb_data")
first_slice = nutchpy.sequence_reader.slice(0, 2, path)
head = nutchpy.sequence_reader.head(2, path)
print("Head = %s\nSlice=%s" % (head[0][0], first_slice[0][0]))
assert first_slice[0][0] == head[0][0]
I keep getting an error when I try to install apache-maven or nutchpy with conda on the 32-bit version of Ubuntu 12.04. I also checked whether Anaconda has a nutchpy build compatible with 32-bit systems, but didn't find one.
presha@presha-Inspiron-N5110:~/IR$ conda install -c blaze apache-maven
Fetching package metadata: ......
Error: No packages found in current linux-32 channels matching: apache-maven
You can search for this package on anaconda.org with
anaconda search -t conda apache-maven
presha@presha-Inspiron-N5110:~/IR$ conda install -c blaze nutchpy
Fetching package metadata: ......
Error: No packages found in current linux-32 channels matching: nutchpy
You can search for this package on anaconda.org with
anaconda search -t conda nutchpy
presha@presha-Inspiron-N5110:~/IR$ anaconda search -t conda nutchpy
Using binstar api site https://api.anaconda.org
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
Name | Version | Package Types | Platforms
------------------------- | ------ | --------------- | ---------------
blaze/nutchpy | 0.1 | conda | linux-64, win-64, osx-64
Found 1 packages
Could someone please suggest an alternative way to install it on the 32-bit version of Ubuntu 12.04?
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.
seqreader-app/pom.xml
junit:junit 4.13.2
org.apache.commons:commons-jexl3 3.2.1
org.apache.nutch:nutch 1.18
org.apache.hadoop:hadoop-common 3.3.2
net.sf.py4j:py4j 0.10.9.5
org.apache.maven.plugins:maven-compiler-plugin 3.10.1
org.apache.maven.plugins:maven-jar-plugin 3.2.2
org.apache.maven.plugins:maven-dependency-plugin 3.3.0
org.apache.maven.plugins:maven-assembly-plugin 3.3.0
setup.py
py4j >=0.8.2.1
When I tried to use link_reader instead of sequence_reader with the following code,
import os
import nutchpy
path = os.path.dirname(nutchpy.__file__)
path = os.path.join(path,"ex_data", "crawldb_data")
data = nutchpy.link_reader.read(path)
it crashed with the following log:
Traceback (most recent call last):
File "/Users/Antrromet/Documents/LiClipse Workspace/NutchPy/src/seq_reader.py", line 16, in <module>
data = nutchpy.link_reader.read(path)
File "/Library/Python/2.7/site-packages/nutchpy/readers.py", line 25, in read
data = self.reader.read(path)
File "/Library/Python/2.7/site-packages/py4j-0.9-py2.7.egg/py4j/java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Library/Python/2.7/site-packages/py4j-0.9-py2.7.egg/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.LinkReader.read.
: java.io.IOException: wrong value class: url: null, anchor: , score: 0.0, timestamp: 0, link type: unknown is not class org.apache.nutch.crawl.CrawlDatum
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1874)
at com.continuumio.seqreaderapp.LinkReader.read(LinkReader.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Out of all the readers mentioned here, only SequenceReader works.
I am getting the error below while parsing a data file in the crawldb. Note that the file is 136 MB and contains a lot of URLs.
As far as I can tell, SequenceReader.read loads all of the extracted data into the Java heap before processing it. It therefore works for small inputs (given my allotted heap space) and fails for large ones.
Question: is it possible to process the data incrementally (in parts) and release memory along the way? Perhaps a StreamingSequenceReader.read?
_Error Details_
Traceback (most recent call last):
File "get_domains.py", line 41, in <module>
main(sys.argv[1])
File "get_domains.py", line 18, in main
parse_data = nutchpy.sequence_reader.read(data)
File "/root/anaconda/lib/python2.7/site-packages/nutchpy/readers.py", line 113, in read
data = self.reader.read(path)
File "/root/anaconda/lib/python2.7/site-packages/py4j/java_gateway.py", line 538, in __call__
self.target_id, self.name)
File "/root/anaconda/lib/python2.7/site-packages/py4j/protocol.py", line 300, in get_return_value
format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.read.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at org.apache.nutch.crawl.CrawlDatum.toString(CrawlDatum.java:409)
at com.continuumio.seqreaderapp.SequenceReader.read(SequenceReader.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
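One way to approach the incremental processing asked about above is to read the file in fixed-size windows with `slice` instead of a single `read`. The sketch below rests on assumptions: it takes `slice(start, stop, path)` to return the records in the half-open range `[start, stop)`, as the test case elsewhere on this page suggests, and to return a short or empty result once the end of the file is reached. The end-of-file behaviour is not confirmed anywhere in this tracker, and the generator name `iter_chunks` is hypothetical.

```python
def iter_chunks(reader, path, chunk_size=1000):
    """Yield lists of records from `path`, at most `chunk_size` at a time.

    `reader` is any object exposing slice(start, stop, path), for example
    nutchpy.sequence_reader (assumed to behave as described above).
    """
    start = 0
    while True:
        chunk = reader.slice(start, start + chunk_size, path)
        if not chunk:
            break  # nothing left to read
        yield chunk
        if len(chunk) < chunk_size:
            break  # a short chunk means the end of the file was reached
        start += chunk_size
```

A caller would loop with `for chunk in iter_chunks(nutchpy.sequence_reader, path): ...` and discard each chunk after use. Whether this actually lowers peak heap usage depends on how `slice` is implemented on the Java side, which is not shown here. If each call rescans the file from the start, the total running time grows with the number of chunks.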
We should start integrating Nutch's new REST API:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
The same way that I did this with Tika-Python.
Bug report from Shadi Saleh
When installing, the following errors occur:
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building seqreader-app 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ seqreader-app ---
[WARNING] Using platform encoding (ANSI_X3.4-1968 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /root/py_nutch/nutchpy/seqreader-app/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.0.2:compile (default-compile) @ seqreader-app ---
[INFO] Compiling 5 source files to /root/py_nutch/nutchpy/seqreader-app/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.213s
[INFO] Finished at: Sat Jan 10 03:23:39 EST 2015
[INFO] Final Memory: 9M/153M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project seqreader-app: Compilation failure: Compilation failure:
[ERROR] /root/py_nutch/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/LinkReader.java:[21,25] error: generics are not supported in -source 1.3
[ERROR]
[ERROR] (use -source 5 or higher to enable generics)
[ERROR] /root/py_nutch/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/SequenceReader.java:[47,12] error: generics are not supported in -source 1.3
[ERROR]
[ERROR] (use -source 5 or higher to enable generics)
[ERROR] /root/py_nutch/nutchpy/seqreader-app/src/main/java/com/continuumio/seqreaderapp/NodeReader.java:[21,25] error: generics are not supported in -source 1.3
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
Traceback (most recent call last):
File "setup.py", line 132, in <module>
shutil.copy(jar_file,java_lib_dir)
File "/usr/lib/python2.7/shutil.py", line 119, in copy
copyfile(src, dst)
File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'seqreader-app/target/seqreader-app-1.0-SNAPSHOT-jar-with-dependencies.jar'
But the build worked after adding the following to nutchpy/seqreader-app/pom.xml:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<compilerVersion>1.5</compilerVersion>
<source>1.5</source>
<target>1.5</target>
</configuration>
</plugin>
Hi,
I downloaded nutchpy and tried to run the example,
but it failed with the error "FileNotFoundError: [WinError 2] The system cannot find the file specified".
My operating system is Windows 10, and the full console output is shown below.
Traceback (most recent call last):
File "C:\cygwin64\home\nutchpy\test.py", line 1, in <module>
import nutchpy
File "C:\cygwin64\home\nutchpy\nutchpy\__init__.py", line 6, in <module>
from .JVM import gateway
File "C:\cygwin64\home\nutchpy\nutchpy\JVM.py", line 91, in <module>
gateway = NutchJavaGateway().gateway
File "C:\cygwin64\home\nutchpy\nutchpy\JVM.py", line 88, in gateway
self._gateway = launch_gateway()
File "C:\cygwin64\home\nutchpy\nutchpy\JVM.py", line 38, in launch_gateway
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "C:\Python35-32\lib\subprocess.py", line 950, in __init__
restore_signals, start_new_session)
File "C:\Python35-32\lib\subprocess.py", line 1220, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
I don't know how to fix this, and I have no idea which file the system is looking for. I am sure the path to the crawl result data is correct.
Thank you.
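The traceback fails inside `subprocess` while `launch_gateway` is starting a child process. On Windows that raises `[WinError 2]` when the program being launched cannot be found, so the missing file is probably the executable rather than the crawl data. Which command `JVM.py` runs is not visible in this report. Assuming it is `java`, the following sketch checks whether that executable is on `PATH`. The helper name `find_executable` is made up for this example, and `shutil.which` needs Python 3.3 or later.

```python
import shutil


def find_executable(name="java"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_executable("java")
    if location is None:
        print("java was not found on PATH")
    else:
        print("java found at %s" % location)
```

If this prints that `java` was not found, installing a JDK or adding its `bin` directory to `PATH` would be the first thing to try.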
(memex-explorer)cdoig@066-cdoig:~$ crawl ~/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/ ~/work/memex/memex/court_docs/crawl_test 4
JAVA_HOME is set to '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home'
~/anaconda/envs/memex-explorer/lib/nutch ~
No SOLRURL specified. Skipping indexing.
Injecting seed URLs
/Users/cdoig/anaconda/envs/memex-explorer/lib/nutch/bin/nutch inject /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/
Injector: starting at 2015-03-27 09:55:10
Injector: crawlDb: /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb
Injector: urlDir: /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir
Injector: Converting injected urls to crawl db entries.
Injector: java.net.UnknownHostException: 066-cdoig: 066-cdoig: nodename nor servname provided, or not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:960)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.crawl.Injector.inject(Injector.java:324)
at org.apache.nutch.crawl.Injector.run(Injector.java:380)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:370)
Caused by: java.net.UnknownHostException: 066-cdoig: nodename nor servname provided, or not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
Error running:
/Users/cdoig/anaconda/envs/memex-explorer/lib/nutch/bin/nutch inject /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/
Failed with exit value 255.
~
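The `UnknownHostException` above is raised from `InetAddress.getLocalHost`, which means the machine's own hostname (`066-cdoig` here) does not resolve to an address. The same condition can be checked from Python. This is a small sketch with a hypothetical helper name, `can_resolve`.

```python
import socket


def can_resolve(hostname):
    """Return True if `hostname` resolves to an IPv4 address, otherwise False."""
    try:
        socket.gethostbyname(hostname)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    name = socket.gethostname()
    if not can_resolve(name):
        print("local hostname %r does not resolve" % name)
```

When the local hostname does not resolve, a common remedy is to map it to `127.0.0.1` in `/etc/hosts`. Whether that is appropriate depends on the machine's network setup.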
If you are in the middle of a crawl, you should be able to update the crawldb
How to handle EOF for slicing and head
The Py4J gateway is currently limited to port 25333. I believe this is a result of the current packaging with Maven. Will continue to investigate.
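Port 25333 is Py4J's default gateway port, so with a fixed port a second gateway on the same machine cannot start while the first is running. The sketch below only checks whether the port is currently free and does not change how nutchpy launches its gateway. The names `PY4J_DEFAULT_PORT` and `is_port_free` are illustrative.

```python
import socket

PY4J_DEFAULT_PORT = 25333  # Py4J's default gateway port


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        return False
    finally:
        sock.close()
    return True


if __name__ == "__main__":
    if not is_port_free(PY4J_DEFAULT_PORT):
        print("port %d is already in use" % PY4J_DEFAULT_PORT)
```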
The solr command solr stop is not working for me. This is the error I'm getting:
$:~/anaconda/envs/topic_space/solr_pkg/solr$ solr stop
Solr server directory /Users/cdoig/anaconda/envs/topic_space/example not found!
$:~/anaconda/envs/topic_space/solr_pkg/solr$ solr stop -help
Usage: solr stop [-k key] [-p port] [-V]
-k <key> Stop key; default is solrrocks
-p <port> Specify the port the Solr HTTP listener is bound to; default is 8983
-all Find and stop all running Solr servers on this host
-V Verbose messages from this script
NOTE: If port is not specified, then all running Solr servers are stopped.
https://github.com/ContinuumIO/nutchpy/blob/master/solr_recipe/build.sh#L17
I ran into a problem reading the data I crawled; the trace follows. I checked the newest version of Nutch, and the record version there is already 8 instead of 7. It might be a good time to upgrade the dependency version.
Traceback (most recent call last):
File "readStatistics.py", line 96, in <module>
main()
File "readStatistics.py", line 19, in main
count = nutchpy.sequence_reader.count(crawlDbFile)
File "/Users/Taichi1/miniconda3/lib/python3.4/site-packages/nutchpy/readers.py", line 173, in count
count = self.reader.count(path)
File "/Users/Taichi1/miniconda3/lib/python3.4/site-packages/py4j/java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/Taichi1/miniconda3/lib/python3.4/site-packages/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.count.
: A record version mismatch occured. Expecting v7, found v8
at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:246)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
at com.continuumio.seqreaderapp.SequenceReader.count(SequenceReader.java:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
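The message in this trace indicates that the Nutch classes bundled with nutchpy expect record version 7, while the crawl data was written as version 8. As a small sketch, the two numbers can be pulled out of the error text so a script can report the mismatch clearly. The function name `parse_version_mismatch` is hypothetical, and the message format is taken from the trace above rather than from any documented contract.

```python
import re

_VERSION_MISMATCH = re.compile(r"Expecting v(\d+), found v(\d+)")


def parse_version_mismatch(message):
    """Return (expected, found) record versions, or None if `message` has no mismatch."""
    match = _VERSION_MISMATCH.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    text = "A record version mismatch occured. Expecting v7, found v8"
    versions = parse_version_mismatch(text)
    if versions is not None:
        print("reader expects v%d but the data is v%d" % versions)
```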