
readabilityBUNDLE

Main content extraction from HTML, written in Java. It extracts the article text without the surrounding clutter.

Extracting the main article content from HTML pages remains a challenging open problem. Many open source algorithms and implementations are available. This project aims to bring together some of the best content extraction algorithms implemented in Java.

My focus is mainly on tuning parameters and on customizing or modifying these algorithms' features to fit my requirements.

readabilityBUNDLE does everything the other algorithms do, plus the extras listed below.

What's extra in readabilityBUNDLE

  • Preserves the HTML tags in the extracted content.
  • Keeps all possible images in the content instead of picking a single best image.
  • Keeps all available videos.
  • Better extraction of li, ul, and ol tags.
  • Content normalization of the extracted content.
  • Incorporates three popular extraction algorithms; you can choose one based on your requirements.
  • Provision to append the extracted content of next pages and create a consolidated output.
  • Many cleaner / formatter measures added.
  • Some core changes to the algorithms.

The main challenge I faced was extracting the main content while keeping all the images, videos, HTML tags, and some related div tags that most of the algorithms use to distinguish content from non-content.

readabilityBUNDLE borrows much code and many concepts from Project Goose, Snacktory, and Java-Readability. My intention was just to fine-tune and modify the algorithms to work with my requirements.

Some HTML pages work very well with a particular algorithm and some do not. This is the main reason I put all the available algorithms under one roof. You can choose the algorithm that suits you best.
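
Because results vary from page to page, one option is to run several algorithms on the same HTML and keep the most promising result. The sketch below is only an illustration: `BestOfExtractors` and `pickLongest` are hypothetical names that are not part of readabilityBUNDLE, the extractors are stand-in functions, and "longest trimmed text wins" is a naive heuristic assumed here rather than anything the project prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of readabilityBUNDLE.
final class BestOfExtractors {

    // Runs every extractor on the same HTML and returns the name of the one
    // whose trimmed output is longest. Returns null if the map is empty.
    // On a tie, the extractor that was registered first wins.
    static String pickLongest(String html, Map<String, Function<String, String>> extractors) {
        String bestName = null;
        int bestLength = -1;
        for (Map.Entry<String, Function<String, String>> entry : extractors.entrySet()) {
            String text = entry.getValue().apply(html);
            int length = (text == null) ? 0 : text.trim().length();
            if (length > bestLength) {
                bestLength = length;
                bestName = entry.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        Map<String, Function<String, String>> extractors = new LinkedHashMap<>();
        // Stand-in lambdas; in real use each one would wrap a call to the library.
        extractors.put("A", html -> "short");
        extractors.put("B", html -> "a much longer extracted article body");
        System.out.println(pickLongest("<html></html>", extractors));
    }
}
```

Text length is a crude signal, since a longer result may simply contain more clutter, so treat the winner as a candidate to inspect rather than a guaranteed best extraction.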

Author citations are included in each Java file.

Dependency Projects

Usage

You need to specify which extraction algorithm to use. The three extraction algorithms are ReadabilitySnack, ReadabilityCore, and ReadabilityGoose. The default is ReadabilitySnack.

  • Without next page finding
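
Since the algorithm is selected by name, a caller may want to validate that name before passing it on. The following is a minimal sketch under stated assumptions: `AlgorithmChoice` and `resolve` are hypothetical and not part of readabilityBUNDLE, and falling back to the default for an unrecognized name is a caller-side choice made here, not documented library behavior.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of readabilityBUNDLE.
final class AlgorithmChoice {

    // The three algorithm names listed in this README.
    static final List<String> SUPPORTED =
            Arrays.asList("ReadabilitySnack", "ReadabilityCore", "ReadabilityGoose");

    // The README names ReadabilitySnack as the default.
    static final String DEFAULT = "ReadabilitySnack";

    // Returns the requested name if it is one of the three supported names
    // (after trimming whitespace), otherwise the default name.
    static String resolve(String requested) {
        if (requested == null) {
            return DEFAULT;
        }
        String trimmed = requested.trim();
        return SUPPORTED.contains(trimmed) ? trimmed : DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ReadabilityGoose"));
        System.out.println(resolve("SomethingElse"));
    }
}
```

The resolved name can then be passed as the second argument of `extractContent`, as in the sample usage.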

Sample Usage

// Fetch the raw HTML of the page.
HtmlFetcher htmlFetcher = new HtmlFetcher();
String html = htmlFetcher.getHtml("http://blogmaverick.com/2012/11/19/what-i-really-think-about-facebook/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Counterparties+%28Counterparties%29", 0);

// Extract the main content with the chosen algorithm.
ContentExtractor ce = new ContentExtractor();
Article article = ce.extractContent(html, "ReadabilitySnack");

System.out.println("Content : " + article.getCleanedArticleText());

  • With next page HTML sources

If you also need to extract and append content from next pages:

  • You can use NextPageFinder (https://github.com/srijiths/NextPageFinder) to find all the next page links.

  • Get the HTML of each next page as a List of String using Network.

  • Pass it to the content extractor like this:

    article = ce.extractContent(firstPageHtml, extractionAlgorithm, nextPagesHtmlSources);
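
The list of next-page HTML sources can be assembled in several ways. The sketch below shows one possibility and is not the project's own code: `NextPageSources` and `collect` are hypothetical names, and the `fetcher` function is a stand-in for whatever HTTP client you use. Skipping pages that come back null or empty is also an assumption made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of readabilityBUNDLE.
final class NextPageSources {

    // Fetches each next-page URL in order and collects the HTML sources,
    // skipping pages for which the fetcher returns null or an empty string.
    static List<String> collect(List<String> nextPageUrls, Function<String, String> fetcher) {
        List<String> sources = new ArrayList<>();
        for (String url : nextPageUrls) {
            String html = fetcher.apply(url);
            if (html != null && !html.isEmpty()) {
                sources.add(html);
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        // Example URLs and a fake fetcher, used for illustration only.
        List<String> urls = Arrays.asList("http://example.com/page/2", "http://example.com/page/3");
        List<String> sources = collect(urls, url -> "<html>" + url + "</html>");
        System.out.println(sources.size());
    }
}
```

The resulting list is what would be passed as `nextPagesHtmlSources` in the call above.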

Build

Using Maven: mvn clean package

Contributors

srijiths
