
wikixmlj's Introduction

WikiXMLJ provides easy access to Wikipedia XML dumps.

Features

  • Easy access to important elements of a Wikipedia page
  • Interfaces for wiki-text parsing
  • Memory efficient
  • SAX interface for parsing
  • Lazy loading of files for DOM
  • Callback support with DOM
  • Operates directly on compressed Wikipedia dumps (gzip, bzip2, and plain XML are supported)

Note: gzip streams are much faster to read than bzip2 streams, at the cost of slightly larger files.
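
To illustrate what reading a compressed dump as a stream involves, here is a minimal sketch that uses only the JDK. It is not WikiXMLJ's implementation: the class `GzipDumpSketch` and its methods are hypothetical names, the input is a tiny in-memory sample rather than a real dump, and it only counts lines containing `<page>`. bzip2 is left out because the JDK has no bzip2 decoder.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; not part of WikiXMLJ.
class GzipDumpSketch {

    // Gzip a string in memory so the example needs no file on disk.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Count lines containing "<page>" while decompressing on the fly.
    // Only one line is held in memory at a time.
    static int countPages(InputStream compressed) throws IOException {
        int pages = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("<page>")) {
                    pages++;
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<mediawiki>\n<page>\n<title>A</title>\n</page>\n"
                   + "<page>\n<title>B</title>\n</page>\n</mediawiki>\n";
        System.out.println(countPages(new ByteArrayInputStream(gzip(xml))));  // prints 2
    }
}
```

For a real dump, the `ByteArrayInputStream` would be replaced by a `FileInputStream` on the `.gz` file.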

DOMParser Example

    import edu.jhu.nlp.wikipedia.*;

    // args[0]: path to the Wikipedia XML dump
    WikiXMLParser wxp = WikiXMLParserFactory.getDOMParser(args[0]);
    try {
        wxp.parse();
        // Walk the parsed pages one at a time
        WikiPageIterator it = wxp.getIterator();
        while (it.hasMorePages()) {
            WikiPage page = it.nextPage();
            System.out.println(page.getTitle());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

SAXParser Example

    import edu.jhu.nlp.wikipedia.*;

    // args[0]: path to the Wikipedia XML dump
    WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(args[0]);
    try {
        // Register the callback before parsing; it is invoked once per page
        wxsp.setPageCallback(new PageCallbackHandler() {
            public void process(WikiPage page) {
                System.out.println(page.getTitle());
            }
        });
        wxsp.parse();
    } catch (Exception e) {
        e.printStackTrace();
    }

Notes

  1. The DOM parser is known to run out of memory on very large Wikipedia dumps (such as the English one), despite lazily loading the DOM tree. Until this is fixed, using the SAX parser interface is highly recommended. If you really want to use the DOM parser, try it with the callback interface.

  2. This should not be confused with the Java Wikipedia Parser, which converts wiki-text to HTML.
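
Note 1 recommends the SAX interface for large dumps. The sketch below shows why an event-driven parse with a per-item callback stays small in memory, using only the JDK's built-in SAX parser. It is an illustration under assumptions, not WikiXMLJ's code: the class `SaxTitleSketch` and the `TitleCallback` interface are hypothetical, and the sample XML contains only `<page>` and `<title>` elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch class; not part of WikiXMLJ.
class SaxTitleSketch {

    // Hypothetical callback type, standing in for a per-page handler.
    interface TitleCallback {
        void process(String title);
    }

    // Stream the document and hand each <title> to the callback as soon as
    // its closing tag is seen. No tree of the whole document is ever built.
    static void parse(InputSource source, final TitleCallback callback) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text;  // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    callback.process(text.toString());
                    text = null;  // drop the buffer so nothing accumulates
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
    }

    // Convenience wrapper for the demo: collect the titles into a list.
    static List<String> titles(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        parse(new InputSource(new StringReader(xml)), out::add);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>Alpha</title></page>"
                   + "<page><title>Beta</title></page></mediawiki>";
        System.out.println(titles(xml));  // prints [Alpha, Beta]
    }
}
```

The `titles` wrapper collects results only to keep the demo short. On a full dump, the callback would process each item and discard it, which is what keeps memory use flat.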

Dependencies

All dependencies are packaged into this source. The dependencies might have different licensing terms.

Apache Xerces DOM parser for lazy loading. Optionally uses bzip2 code refactored by Kohsuke Kawaguchi from Apache's Ant project.

Contributors

Jason Smith
Itamar Syn-Hershko (@synhershko)
Alan Said (@alansaid)
Victor Olivares



wikixmlj's Issues

Maintained?

Is this project maintained, i.e. will pull requests get answered?

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Hi, I'm getting an OutOfMemoryError while running the program. Below is the complete exception:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.json.simple.parser.Yylex.zzUnpackCMap(Yylex.java:302)
        at org.json.simple.parser.Yylex.<clinit>(Yylex.java:40)
        at org.json.simple.parser.JSONParser.<init>(JSONParser.java:34)
        at edu.jhu.nlp.language.Language.getJsonObject(Language.java:70)
        at edu.jhu.nlp.language.Language.<init>(Language.java:47)
        at edu.jhu.nlp.wikipedia.WikiTextParser.<init>(WikiTextParser.java:45)
        at edu.jhu.nlp.wikipedia.WikiPage.setWikiText(WikiPage.java:39)
        at edu.jhu.nlp.wikipedia.SAXPageCallbackHandler.endElement(SAXPageCallbackHandler.java:55)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:597)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(XMLNSDocumentScannerImpl.java:676)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1646)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:324)
        at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:845)
        at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
        at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1201)
        at edu.jhu.nlp.wikipedia.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
        at com.accenture.WikiLogic.main(WikiLogic.java:222)

Can you please assist regarding this? Thank you.
