cdimascio / essence

Automatically extract the main text content (and more) from an HTML document

License: Apache License 2.0

Kotlin 98.22% Java 1.78%

html-extractor webpage-extractor website-extractor extractor web-content-extractor scraper hacktoberfest

essence's Introduction

essence

An automatic web page content extractor for Kotlin and Java.

Given an HTML document, essence automatically extracts the main text content (and much more).

Try out the demo - a simple webapp to demonstrate essence.

This library is inspired by node-unfluff and its lineage.

Usage

Java

import io.github.cdimascio.essence.Essence;

EssenceResult data = Essence.extract(html);
System.out.println(data.getText());

Kotlin

val data = Essence.extract(html)
println(data.text)

See Extracted data elements for additional extracted metadata.

Install

Maven

<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
</dependency>

Gradle

implementation 'io.github.cdimascio:essence:0.13.0'

Try the Essence web demo

Essence web is a simple web page that fetches the content at a given URL and passes the HTML to the essence library.

The essence web project lives here

Extracted data elements

essence attempts to extract the following content:

title - The document's title
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by Facebook, etc.)
videos (coming soon) - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href URLs.
canonicalLink - The canonical URL of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The URL of the document's favicon.
links - An array of links embedded within the article text (text and href for each).

Credits

node-unfluff by https://github.com/ageitgey
python-goose by Xavier Grangier
goose by Gravity Labs

License

Apache 2.0

Contributors ✨

Thanks go to these wonderful people (emoji key):

Clément P. 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

essence's People

nuhmanp zaksoliman bejean cleymax hippalus chengniu kokorins mayankpunetha007 webdevuc 15th-ave-ne sammlung metacheck raulguo sknull mohdaiman linisme

essence's Issues

DocumentScorer.kt stopwords.size > 2 seems to be wrong

I'm not sure whether I've understood this correctly, but I think the condition in DocumentScorer.kt is wrong:

class DocumentScorer(private val stopWords: StopWords) : Scorer {

    override fun score(doc: Document): ScoredElement? {
        val nodesWithText = mutableListOf<Element>()
        val nodesToCheck = doc.select("p, pre, td")
        nodesToCheck.forEach { node ->
            val text = node.text()
            val wordStats = stopWords.statistics(text)
            val hasHighLinkDensity = NodeHeuristics.hasHighLinkDensity(node)
            // if stopWords.size is bigger than 2, this node should be ignored, rather than added to nodesWithText?
            // this should be changed to: wordStats.stopWords.size <= 2
            if (wordStats.stopWords.size > 2 && !hasHighLinkDensity) {
                nodesWithText.add(node)
            }
        }
        ......
    }
}

I think we meant to find the nodes with good text that do not contain a lot of stopwords, right?
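For context, one reading of the quoted check is that stop words are treated as a sign of ordinary prose: the Goose-style extractors credited in the README keep nodes with several stop words as content candidates, because navigation bars and link lists rarely contain words like "the" or "is". The sketch below illustrates that reading only. The class name, the tiny stop-word list, and both helper methods are hypothetical and are not taken from essence's DocumentScorer.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

public class StopWordHeuristic {
    // Tiny illustrative list (assumption); a real extractor ships a much larger, per-language list.
    static final Set<String> STOP_WORDS = Set.of(
        "a", "an", "and", "are", "in", "is", "it", "of", "on",
        "that", "the", "this", "to", "was", "with");

    // Counts how many whitespace-separated tokens are stop words, ignoring case and punctuation.
    public static long countStopWords(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
            .map(token -> token.replaceAll("[^a-z]", ""))
            .filter(STOP_WORDS::contains)
            .count();
    }

    // Same shape as the quoted condition: more than two stop words means "looks like prose".
    public static boolean looksLikeProse(String text) {
        return countStopWords(text) > 2;
    }

    public static void main(String[] args) {
        String paragraph = "The airline said that it is ready to expand in the spring.";
        String navigation = "Home News Sports Finance Login";
        System.out.println(countStopWords(paragraph) + " " + looksLikeProse(paragraph));
        System.out.println(countStopWords(navigation) + " " + looksLikeProse(navigation));
    }
}
```

Under this reading, the article sentence passes the check and the navigation text does not, which is consistent with keeping `> 2` rather than switching to `<= 2`.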

demo page link not working

In the README, the "Try out the demo" link points to https://essence.mybluemix.net/, which no longer works.

Unable to parse text from yahoo html

Hi,
I tested essence against the Yahoo Finance website, but it is unable to extract the text. For example, this post is not parsed and I get empty text: https://finance.yahoo.com/news/3-airline-stocks-ready-takeoff-080507562.html

Improve core logic to support non-space-tokenized languages like Japanese

The tokenization logic should be more generic.

Could use something like https://www.atilika.org/ to tokenize Japanese.
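A minimal sketch of what "more generic" could look like: put tokenization behind a small interface so a language-specific implementation can be swapped in. Everything here is an assumption for illustration. `Tokenizer`, `WhitespaceTokenizer`, and `BigramTokenizer` are hypothetical names, not part of essence, and the character-bigram tokenizer is only a dictionary-free stand-in for a real morphological analyzer, which would implement the same interface.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {
    // Hypothetical extension point: scoring code would depend on this instead of splitting on spaces.
    public interface Tokenizer {
        List<String> tokenize(String text);
    }

    // Works for space-delimited languages; returns a whole Japanese sentence as a single token.
    public static final class WhitespaceTokenizer implements Tokenizer {
        @Override
        public List<String> tokenize(String text) {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? List.of() : List.of(trimmed.split("\\s+"));
        }
    }

    // Emits overlapping two-character tokens, a simple fallback for text without word spacing.
    public static final class BigramTokenizer implements Tokenizer {
        @Override
        public List<String> tokenize(String text) {
            // Work on code points so characters outside the BMP are not split in half.
            int[] codePoints = text.replaceAll("\\s+", "").codePoints().toArray();
            List<String> tokens = new ArrayList<>();
            for (int i = 0; i + 1 < codePoints.length; i++) {
                tokens.add(new String(codePoints, i, 2));
            }
            return tokens;
        }
    }

    public static void main(String[] args) {
        String japanese = "私は東京に住んでいます";
        System.out.println(new WhitespaceTokenizer().tokenize(japanese).size());
        System.out.println(new BigramTokenizer().tokenize(japanese));
    }
}
```

With the whitespace tokenizer the Japanese sentence collapses into one token, so no stop word can ever match; any tokenizer that produces finer-grained tokens makes stop-word statistics possible again, provided the stop-word list is built with the same tokenization.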

Wrong content from some sites

Is essence still maintained?

It looks like essence has a problem with the content of the following pages:

https://www.business-standard.com/companies/news/city-gas-distributors-optimistic-about-long-term-growth-prospects-123091301205_1.html (essence result: Are you sure you want to Log out from Business Standard)
https://www.business-standard.com/markets/ipo/signature-global-to-hit-capital-markets-with-rs-730-crore-ipo-on-sep-20-123091300863_1.html (essence result: Are you sure you want to Log out from Business Standard)
https://www.theverge.com/2023/9/13/23871857/ev-charger-broken-fix-biden-usdot-funding-infrastructure (essence result: / Sign up for Verge Deals to get deals on products we've tested sent to your inbox daily.)
https://www.theverge.com/2023/9/13/23871712/california-right-to-repair-act-sb-244 (essence result: / Sign up for Verge Deals to get deals on products we've tested sent to your inbox daily.)
https://koreatimes.co.kr/www/nation/2023/09/501_359160.html?utm_source=fl (essence result: A cart holding groceries is pushed through a supermarket in Bellflower, Calif., in this Feb. 13 file photo. AP-Yonhap)

It looks like these are specifically problems with these pages?
At least, that's the only place I've had a problem so far. But of course, I'm only using a few feeds so far.

In general essence is really good and does exactly what it is supposed to. Ingenious!

If the project is still being maintained, it would be great to have a look and see if you can fix the problem or if there is a workaround.