The wikixmlj from delip

wikixmlj's Introduction

WikiXMLJ provides easy access to Wikipedia XML dumps.

Features

Easy access to important elements of a Wikipedia page
Also provides interfaces for Wiki text parsing.
Memory efficient
SAX interface for parsing
Lazy loading of files for DOM
Callback support with DOM
Directly operate on compressed Wikipedia dumps (gzip, bzip2, and native XML supported)

Note: gzip streams are much faster to read than bzip2 streams, at the cost of slightly larger files.
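
To illustrate what reading a compressed dump involves, here is a minimal sketch, assuming the compression format can be inferred from the file extension. It is not WikiXMLJ's code: `DumpStreamSketch` and `open` are hypothetical names, and only gzip is handled because bzip2 decoding is not part of the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class DumpStreamSketch {

    /** Hypothetical helper: choose a decompressing stream from the file extension. */
    public static InputStream open(Path dump) throws IOException {
        InputStream raw = Files.newInputStream(dump);
        String name = dump.getFileName().toString();
        if (name.endsWith(".gz")) {
            return new GZIPInputStream(raw); // gzip support ships with the JDK
        }
        if (name.endsWith(".bz2")) {
            raw.close();
            // bzip2 is not in the JDK; a third-party decoder would be plugged in here.
            throw new IOException("bzip2 needs a third-party decoder: " + name);
        }
        return raw; // plain, uncompressed XML
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny gzip-compressed stand-in for a dump, then read it back.
        Path tmp = Files.createTempFile("dump", ".xml.gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write("<mediawiki/>".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(open(tmp), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine()); // prints <mediawiki/>
        }
        Files.delete(tmp);
    }
}
```

The stream returned by `open` could then be handed to any XML parser, so the parsing code does not need to know whether the dump was compressed.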

DOMParser Example

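    import edu.jhu.nlp.wikipedia.*;

    public class DOMParserExample {
        public static void main(String[] args) {
            // args[0]: the Wikipedia XML dump file to parse
            WikiXMLParser wxp = WikiXMLParserFactory.getDOMParser(args[0]);
            try {
                wxp.parse();
                // Walk the parsed pages and print each title
                WikiPageIterator it = wxp.getIterator();
                while (it.hasMorePages()) {
                    WikiPage page = it.nextPage();
                    System.out.println(page.getTitle());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }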

SAXParser Example

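    import edu.jhu.nlp.wikipedia.*;

    public class SAXParserExample {
        public static void main(String[] args) {
            // args[0]: the Wikipedia XML dump file to parse
            WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(args[0]);
            try {
                // The callback is invoked for each page as it is parsed
                wxsp.setPageCallback(new PageCallbackHandler() {
                    public void process(WikiPage page) {
                        System.out.println(page.getTitle());
                    }
                });
                wxsp.parse();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }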
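
The callback style used above avoids holding the whole dump in memory, because each page is handed to the handler as soon as it has been read. The following is a minimal sketch of that idea using only the JDK's built-in SAX parser. It is not WikiXMLJ's implementation: the class name `StreamingTitleSketch`, the method `forEachTitle`, and the tiny in-memory dump are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingTitleSketch {

    /** Streams the XML and passes each <title> text to the callback when the element closes. */
    public static void forEachTitle(InputStream xml, Consumer<String> onTitle) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <title> element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so accumulate them.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    onTitle.accept(text.toString());
                    text = null; // nothing is retained once the callback returns
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-page dump, kept in memory only for the demonstration.
        String dump = "<mediawiki><page><title>Alpha</title></page>"
                    + "<page><title>Beta</title></page></mediawiki>";
        List<String> titles = new ArrayList<>();
        forEachTitle(new ByteArrayInputStream(dump.getBytes(StandardCharsets.UTF_8)), titles::add);
        System.out.println(titles); // prints [Alpha, Beta]
    }
}
```

Memory use in this sketch depends on what the callback keeps: collecting every title in a list, as `main` does for the demo, would grow with the dump, whereas printing or writing each title immediately would not.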

Notes

1. The DOM parser is known to run out of memory on very large Wikipedia dumps (such as the English one), despite lazily loading the DOM tree. Until this issue is fixed, using the SAX parser interface is highly recommended. If you really want to use the DOM parser, try it with the callback interface.

2. This should not be confused with the Java Wikipedia Parser, which converts wiki-text to HTML.

Dependencies

All dependencies are packaged into this source. The dependencies might have different licensing terms.

Apache Xerces DOM parser for lazy loading. Optionally uses bzip2 code refactored by Kohsuke Kawaguchi from Apache's Ant project.

Contributors

Jason Smith
Itamar Syn-Hershko (@synhershko)
Alan Said (@alansaid)
Victor Olivares

wikixmlj's Issues

Maintained?

Is this project maintained, i.e. will pull requests get answered?

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Hi, I'm getting an OutOfMemoryError while running the program. Below is the complete exception:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.json.simple.parser.Yylex.zzUnpackCMap(Yylex.java:302)
        at org.json.simple.parser.Yylex.<clinit>(Yylex.java:40)
        at org.json.simple.parser.JSONParser.<init>(JSONParser.java:34)
        at edu.jhu.nlp.language.Language.getJsonObject(Language.java:70)
        at edu.jhu.nlp.language.Language.<init>(Language.java:47)
        at edu.jhu.nlp.wikipedia.WikiTextParser.<init>(WikiTextParser.java:45)
        at edu.jhu.nlp.wikipedia.WikiPage.setWikiText(WikiPage.java:39)
        at edu.jhu.nlp.wikipedia.SAXPageCallbackHandler.endElement(SAXPageCallbackHandler.java:55)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:597)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(XMLNSDocumentScannerImpl.java:676)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1646)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:324)
        at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:845)
        at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
        at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1201)
        at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
        at com.accenture.WikiLogic.main(WikiLogic.java:222)

Can you please assist regarding this? Thank you.
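
One way to start narrowing down an error like the one reported here is to check how much heap the JVM was actually given, since the trace alone does not show what filled it. The sketch below is not part of WikiXMLJ, and `HeapReport` is a hypothetical name. It uses only the standard `Runtime` API.

```java
public class HeapReport {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. */
    public static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

If the reported maximum is small, it can be raised with the standard `-Xmx` JVM option (for example `java -Xmx2g ...`). If usage keeps growing while parsing, it may be worth checking whether the page callback retains references to pages it has already processed.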
