essence

An automatic web page content extractor for Kotlin and Java.

Given an HTML document, essence automatically extracts the main text content (and much more).

Try out the demo - a simple webapp to demonstrate essence.

This library is inspired by node-unfluff and its lineage.

Usage

Java

import io.github.cdimascio.essence.Essence;

EssenceResult data = Essence.extract(html);
System.out.println(data.getText());

Kotlin

import io.github.cdimascio.essence.Essence

val data = Essence.extract(html)
println(data.text)

See Extracted data elements for additional extracted metadata.

Install

Maven

<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
</dependency>

Gradle

implementation 'io.github.cdimascio:essence:0.13.0'

Try the Essence web demo

Essence web is a simple web page that fetches content at a given url and passes the HTML to this essence library.

The essence web project lives here
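The fetch-then-extract flow described above can be sketched in plain Java. This is a minimal illustration, not the code of the essence web project: the class `FetchAndExtractSketch` and its method names are hypothetical, and the extraction step is passed in as a function so the sketch compiles without essence on the classpath. With the library available, the extractor would be `html -> Essence.extract(html).getText()`, using the calls shown under Usage.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

// Hypothetical helper; not part of essence or the essence web project.
class FetchAndExtractSketch {

    // Builds a plain GET request for the page URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Downloads the page body as a string using the Java 11+ HttpClient.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Fetches the page and hands the raw HTML to the supplied extractor.
    static String fetchAndExtract(String url, Function<String, String> extractor)
            throws IOException, InterruptedException {
        return extractor.apply(fetchHtml(url));
    }

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "https://example.com";
        // Stand-in extractor that only reports the HTML length;
        // swap in the essence call when the library is on the classpath.
        System.out.println(fetchAndExtract(url, html -> "fetched " + html.length() + " characters"));
    }
}
```

Passing the extractor as a `Function<String, String>` keeps the network code separate from the extraction step, so either part can be replaced or tested on its own.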

Extracted data elements

essence attempts to extract the following content:

  • title - The document's title
  • softTitle - A version of title with less truncation
  • date - The document's publication date
  • copyright - The document's copyright line, if present
  • author - The document's author
  • publisher - The document's publisher (website name)
  • text - The main text of the document with all the junk thrown away
  • image - The main image for the document (what's used by facebook, etc.)
  • videos (coming soon) - An array of videos that were embedded in the article. Each video has src, width and height.
  • tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.
  • canonicalLink - The canonical url of the document, if given.
  • lang - The language of the document, either detected or supplied by you.
  • description - The description of the document, from <meta> tags
  • favicon - The url of the document's favicon.
  • links - An array of links embedded within the article text. (text and href for each)

Credits

License

Apache 2.0

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Clément P.

💻

This project follows the all-contributors specification. Contributions of any kind welcome!

essence's Issues

DocumentScorer.kt stopwords.size > 2 seems to be wrong

I'm not sure whether I've understood this correctly, but I think the condition in DocumentScorer.kt is wrong:

class DocumentScorer(private val stopWords: StopWords) : Scorer {

    override fun score(doc: Document): ScoredElement? {
        val nodesWithText = mutableListOf<Element>()
        val nodesToCheck = doc.select("p, pre, td")
        nodesToCheck.forEach { node ->
            val text = node.text()
            val wordStats = stopWords.statistics(text)
            val hasHighLinkDensity = NodeHeuristics.hasHighLinkDensity(node)
            // if stopWords.size is bigger than 2, this node should be ignored, rather than added to nodesWithText?
            // this should be changed to: wordStats.stopWords.size <= 2
            if (wordStats.stopWords.size > 2 && !hasHighLinkDensity) {
                nodesWithText.add(node)
            }
        }
        ......
    }
}

I think we meant to find the nodes with good text, i.e. those that do not contain a lot of stopwords, right?
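For context, extractors in the node-unfluff lineage generally treat stop words as a sign of natural-language prose: a paragraph with several of them is more likely to be article text than a menu, caption, or link list. Under that reading, `> 2` keeps the prose-like nodes rather than discarding them. The Java sketch below illustrates the idea only. The class `StopWordFilterSketch`, its tiny stop-word set, and its method names are hypothetical and are not essence's implementation, and the link-density check is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the "more than two stop words" filter.
class StopWordFilterSketch {

    // Tiny illustrative list; a real extractor would load a per-language stop-word list.
    static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "the", "of", "to", "in", "is", "it", "that", "on", "for", "with", "was");

    // Counts how many tokens in the text are stop words.
    static int countStopWords(String text) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (STOP_WORDS.contains(token)) {
                count++;
            }
        }
        return count;
    }

    // Keeps only candidates with more than two stop words, mirroring the `> 2` check.
    static List<String> proseLike(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (countStopWords(candidate) > 2) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
                "Home | Products | Contact",
                "The author of the report said that it was published in March.");
        // The navigation line has no stop words and is dropped; the sentence is kept.
        System.out.println(proseLike(candidates));
    }
}
```

Flipping the comparison to `<= 2` in this sketch would keep the navigation line and drop the sentence, which is the opposite of what a content extractor wants.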

Wrong content from some sites

Is essence still maintained?

It looks like essence has a problem with the content of the following pages:

It looks like these are specifically problems with these pages?
At least, that's the only place I've had a problem so far. But of course, I'm only using a few feeds so far.

In general essence is really good and does exactly what it is supposed to. Ingenious!

If the project is still being maintained, it would be great to have a look and see if you can fix the problem or if there is a workaround.
