Comments (4)
I get the same error when I try to read the entire file at once. Reading it in chunks (say, a few rows at a time) works. Any pointers on how to resolve this?
Error as seen in the log window (I have miniconda3 installed with Python 3.4):
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.head.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
at java.lang.StringBuilder.<init>(StringBuilder.java:89)
at org.apache.nutch.crawl.CrawlDatum.toString(CrawlDatum.java:408)
at com.continuumio.seqreaderapp.SequenceReader.head(SequenceReader.java:73)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
from nutchpy.
Hi @Pramodnagarajarao
As a workaround, maybe you can write a script that reads over the segments iteratively (instead of one crawldb data file).
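A minimal sketch of that idea, assuming the segment data lives in files named `data` inside `part-*` directories (that layout is an assumption here, not something confirmed in this thread). The helper names `iter_data_files` and `read_one_at_a_time` are hypothetical, and `read_fn` stands in for whichever reader call you use:

```python
from pathlib import Path


def iter_data_files(root):
    """Yield each file named 'data' that sits directly inside a part-* directory.

    The part-*/data layout is an assumption about how the segments are stored.
    """
    for path in sorted(Path(root).rglob("data")):
        if path.is_file() and path.parent.name.startswith("part-"):
            yield path


def read_one_at_a_time(root, read_fn):
    """Call read_fn on one data file at a time and yield (path, result).

    read_fn is a placeholder for the reader call; because this is a generator,
    only one file's records need to be held in memory at once.
    """
    for data_file in iter_data_files(root):
        yield data_file, read_fn(str(data_file))
```

Each result can be processed and discarded before the next file is opened. This only helps if every individual data file is small enough to read on its own.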
Thanks @karanjeets.
Do you mean that we should read from the data file under each segment directory rather than the crawldb data file, still using the sequence_reader.read() methods? If so, I have tried that too, without success.
@karanjeets @Pramodnagarajarao
There is a better way, using a stream/iterator reader: example
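For illustration only (this is not the linked example), here is a rough sketch of the streaming idea as a plain Python generator. It assumes a slice-style callable `slice_fn(start, stop, path)` that returns the records in the half-open range `[start, stop)`. The function name `iter_records` and the `batch_size` parameter are hypothetical:

```python
def iter_records(slice_fn, path, batch_size=1000):
    """Yield records one by one, fetching at most batch_size at a time.

    slice_fn(start, stop, path) is assumed to return the records in
    [start, stop) and an empty result once start is past the end.
    """
    start = 0
    while True:
        batch = slice_fn(start, start + batch_size, path)
        if not batch:
            break
        for record in batch:
            yield record
        if len(batch) < batch_size:
            # A short batch means the end of the file was reached.
            break
        start += batch_size
```

With this shape, the caller never holds more than `batch_size` records at once. If the underlying reader re-scans the file from the beginning on every call, a larger `batch_size` reduces repeated work at the cost of more memory per batch.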