chimbori / crux
Crux offers a flexible plugin-based API & implementation to extract interesting information from Web pages.
License: Apache License 2.0
I've been running crux over several sites and noticed the following bug.
Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816
A test based on the README example that reproduces the problem:
```kotlin
@Test
fun broken() {
  val crux = Crux()
  val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()
  val document = Jsoup.connect(httpUrl.toString()).get()
  val resource = runBlocking {
    crux.extractFrom(httpUrl, document)
  }
  assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
}
```
The sequence of events is: Crux.extractFrom uses Resource.plus to merge the resources, overwriting the title with "BBC".
If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor, like this:
```kotlin
public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck behind CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any remaining tracking parameters.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace an existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)
```
it will produce the correct results.
This is the simplest way to resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor, or can we reorder them?
If reordering is not possible, then we might need to consider a new way to handle merging the fields.
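If reordering isn't an option, one hypothetical merge policy that would fix the symptom regardless of plugin order is "first non-blank value wins". This is only a sketch with illustrative field names and types, not Crux's actual Resource.plus implementation:

```kotlin
// Sketch only: a merge where an already-populated field is never overwritten
// by a later plugin's value. Field names and the Map-based representation are
// illustrative assumptions, not Crux's real Resource type.
fun mergeFields(
  existing: Map<String, String?>,
  incoming: Map<String, String?>,
): Map<String, String?> =
  (existing.keys + incoming.keys).associateWith { key ->
    existing[key]?.takeIf { it.isNotBlank() } ?: incoming[key]
  }

fun main() {
  val merged = mergeFields(
    mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia'"),
    mapOf("title" to "BBC", "siteName" to "BBC"),
  )
  // The earlier, more specific title survives; fields that were still
  // missing are filled in from the later plugin.
  println(merged["title"])
  println(merged["siteName"])
}
```

With this policy, WebAppManifestParser's generic "BBC" could never clobber the article title extracted earlier, no matter where it sits in the plugin list.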
Sites such as the New Yorker use span elements to make the first letter of an article more prominent. These are supposed to appear inline with the rest of the paragraph, but since Crux replaces spans with paragraph tags in the post-process step, the single character ends up as its own paragraph in the output.
We could either start retaining span tags in the output without a minimum length requirement (spans in these cases are usually very short), or remove the span tag and keep only its content.
I can create a PR if you want.
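For illustration, the second option (dropping the tag, keeping the content) is roughly what Jsoup's `unwrap()` does; a minimal sketch, with the length threshold picked arbitrarily:

```kotlin
import org.jsoup.Jsoup

fun main() {
  // Illustrative drop-cap markup of the kind described above.
  val doc = Jsoup.parse("""<p><span class="drop-cap">O</span>nce upon a time there was a span.</p>""")
  // Remove short span tags but keep their text inline, so a later
  // post-process step cannot turn the single character into its own paragraph.
  doc.select("span")
    .filter { it.text().length <= 2 }  // threshold is arbitrary for this sketch
    .forEach { it.unwrap() }
  println(doc.body().html())
}
```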
I just re-created the pull request that was lost in the transition, mostly to let you know I see your spiffy new repo and I'm tracking. :)
The pull request is here: #1
I'd recommend against merging that PR for now. I'll continue to work on that branch / PR and will let you know as soon as we're ready to deploy to Maven Central. It'll take some work from you -- e.g., you'll need to do some DNS/webpage stuff to take ownership of the "com.chimbori" group in Maven Central -- but I'll try to make it as turnkey as possible!
I tried extracting https://us13.campaign-archive.com/?u=67bd06787e84d73db24fb0aa5&id=c3e998f811&e=7bc177b38a
and then rendering it. Looks broken since the
are extracted. Can I send a PR to add them back? What's the best way to do this?
Hi guys,
I've been using Crux for a week now and it looks great.
One thing I really miss is extracting the publication date of an article.
Any chance you'd be able to add such a feature?
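For reference, a sketch of where such a feature would likely look first. This is not Crux API; the selectors below are common metadata conventions, and their priority order here is an assumption:

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Sketch of a publication-date lookup over common metadata conventions
// (Open Graph / article tags, schema.org microdata, and <time> elements).
fun extractPublishedDate(doc: Document): String? = sequenceOf(
  doc.selectFirst("meta[property=article:published_time]")?.attr("content"),
  doc.selectFirst("meta[itemprop=datePublished]")?.attr("content"),
  doc.selectFirst("time[datetime]")?.attr("datetime"),
).firstOrNull { !it.isNullOrBlank() }

fun main() {
  val doc = Jsoup.parse(
    """<html><head><meta property="article:published_time" content="2019-04-05T10:00:00Z"></head></html>"""
  )
  println(extractPublishedDate(doc))
}
```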
Hello!
The parsed DOM does not seem to include images (nor iframes, and hence videos).
The article object holds a list of Article.Image objects, but the Image class is private, so its properties are inaccessible (unless one parses its string representation back into an object).
Is there any way to:
a) make Image public?
b) add a new property to the Image object that points to its location (DOM position, text position, ...)?
c) create a similar object for iframes?
This link includes many images (not even lazy-loaded ones), but none appear after parsing, even though article.images holds them all:
https://www.slashgear.com/apple-airpods-2-review-price-performance-wireless-charging-04572129/
I really don't understand how we should write custom plugins if we can't even extend the Plugin interface, since it's sealed.
I have a need for an article extractor, but I don't want to do anything with URLs; I just want to pass it a Jsoup Document instance and have it spit out an Article when it's done. I figured I'd write my own minimal plugin that uses some of Crux's extraction functions, but without any HttpUrl or OkHttpClient params in the constructor. It seems I can't really do this, since:
1. I can't extend Plugin, meaning I can't pass my class as part of activePlugins to a Crux constructor, and
2. I'd have to copy/paste the extractContent method (which is already copy/pasted inside your ArticleExtractor plugin), which is obviously doable but feels unnecessary.
Can you explain the logic behind hiding all of this? What should I do?
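For what it's worth, the kind of URL-free entry point being asked for could look something like this. Everything here (the names, the selectors, the "first article element wins" heuristic) is a hypothetical sketch standing in for Crux's real extraction and scoring, not its actual API:

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical shape of the requested API: Document in, fields out,
// with no HttpUrl or OkHttpClient anywhere in the signature.
data class SimpleArticle(val title: String, val text: String)

fun extractFromDocument(doc: Document): SimpleArticle {
  // Prefer the Open Graph title, falling back to the <title> element.
  val title = doc.selectFirst("meta[property=og:title]")?.attr("content")
    ?.takeIf { it.isNotBlank() } ?: doc.title()
  // Crude stand-in for Crux's content scoring: take the first <article>
  // element if present, otherwise the whole body.
  val container = doc.selectFirst("article") ?: doc.body()
  return SimpleArticle(title, container.select("p").joinToString("\n") { it.text() })
}

fun main() {
  val doc = Jsoup.parse(
    """<html><head><title>Fallback</title></head>
       <body><article><p>Hello.</p><p>World.</p></article></body></html>"""
  )
  println(extractFromDocument(doc))
}
```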
Instead of making everything initialize to null, it would be nice to use default values instead.
Instead of:
```kotlin
data class Article(
  var canonicalUrl: HttpUrl,
  var title: String? = null,
  var description: String? = null,
  var siteName: String? = null,
  var themeColor: String? = null,
  var ampUrl: HttpUrl? = null,
  var imageUrl: HttpUrl? = null,
  var videoUrl: HttpUrl? = null,
  var feedUrl: HttpUrl? = null,
  var faviconUrl: HttpUrl? = null,
  /** Estimated reading time, in minutes. This is not populated unless explicitly requested by the caller. */
  var estimatedReadingTimeMinutes: Int? = null,
  var document: Document? = null,
  var keywords: List<String>? = null,
  var images: List<Image>? = null) {

  /** Encapsulates the data from an image found under an element */
  data class Image(
    var srcUrl: HttpUrl? = null,
    var weight: Int = 0,
    var title: String? = null,
    var height: Int = 0,
    var width: Int = 0,
    var alt: String? = null,
    var noFollow: Boolean = false,
    var element: Element? = null) {

    companion object {
      fun from(baseUrl: HttpUrl, imgElement: Element) = Image().apply {
        element = imgElement
        // Some sites use data-src to load images lazily, so prefer the data-src attribute if it exists.
        srcUrl = if (imgElement.attr("data-src").isNotEmpty()) {
          baseUrl.resolve(imgElement.attr("data-src"))
        } else {
          baseUrl.resolve(imgElement.attr("src"))
        }
        width = parseAttrAsInt(imgElement, "width")
        height = parseAttrAsInt(imgElement, "height")
        alt = imgElement.attr("alt")
        title = imgElement.attr("title")
        noFollow = imgElement.parent()?.attr("rel")?.contains("nofollow") == true
      }
    }
  }
}
```
This forces us to import OkHttp and deal with nulls everywhere. We could use the following instead:
```kotlin
data class Article(
  var canonicalUrl: String,
  var title: String = "",
  var description: String = "",
  var siteName: String = "",
  var themeColor: String = "",
  var ampUrl: String = "",
  var imageUrl: String = "",
  var videoUrl: String = "",
  var feedUrl: String = "",
  var faviconUrl: String = "",
  /** Estimated reading time, in minutes. This is not populated unless explicitly requested by the caller. */
  var estimatedReadingTimeMinutes: Int = 0,
  var document: Document? = null,
  var keywords: List<String> = emptyList(),
  var images: List<Image> = emptyList()) {

  /** Encapsulates the data from an image found under an element */
  data class Image(
    var srcUrl: String = "",
    var weight: Int = 0,
    var title: String = "",
    var height: Int = 0,
    var width: Int = 0,
    var alt: String = "",
    var noFollow: Boolean = false,
    var element: Element? = null) {

    companion object {
      fun from(baseUrl: HttpUrl, imgElement: Element) = Image().apply {
        element = imgElement
        // Some sites use data-src to load images lazily, so prefer the data-src attribute if it exists.
        val resolved = if (imgElement.attr("data-src").isNotEmpty()) {
          baseUrl.resolve(imgElement.attr("data-src"))
        } else {
          baseUrl.resolve(imgElement.attr("src"))
        }
        // resolve() returns HttpUrl?, so store the String form (empty when unresolvable).
        srcUrl = resolved?.toString().orEmpty()
        width = parseAttrAsInt(imgElement, "width")
        height = parseAttrAsInt(imgElement, "height")
        alt = imgElement.attr("alt")
        title = imgElement.attr("title")
        noFollow = imgElement.parent()?.attr("rel")?.contains("nofollow") == true
      }
    }
  }
}
```
https://cooking.nytimes.com/recipes/1018068-chicken-paprikash
The extracted text doesn't include the ingredients or the instructions.
Hi. In NYT articles, text after the first ad is not extracted.
For example: https://www.nytimes.com/2018/12/06/us/politics/huawei-meng-china-iran.html?action=click&module=Top%20Stories&pgtype=Homepage
I tried extracting it in Hermit and the result is the same.
As the title says: provisions for KMP support by moving from OkHttp to Ktor, and from Jsoup to https://github.com/fleeksoft/ksoup.
I would be happy to provide a PR and contribute to this.
Would anyone be interested in helping out?
Hi, I use Crux version 2.0.1 or 2.0.2. When I build my application and run it on the command line:

```shell
java -cp spider-text-extraction-0.1.0.jar com.persian.spider.App
```

```
Exception in thread "main" java.lang.NoClassDefFoundError: com/chimbori/crux/articles/ArticleExtractor
    at com.persian.spider.SpiderArticle.textExtraction(SpiderArticle.java:20)
    at com.persian.spider.App.main(App.java:12)
Caused by: java.lang.ClassNotFoundException: com.chimbori.crux.articles.ArticleExtractor
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 2 more
```

Java version:

```
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
```

Can you help me? Thank you.
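For context on errors like this: a NoClassDefFoundError at runtime usually means the Crux jar and its transitive dependencies are missing from the runtime classpath, even though compilation succeeded. The jar names and versions below are illustrative only:

```shell
# Option 1: list every dependency jar on the classpath explicitly
# (file names and versions here are illustrative, not exact).
java -cp "spider-text-extraction-0.1.0.jar:crux.jar:jsoup.jar:okhttp.jar" com.persian.spider.App

# Option 2: build a single "fat" jar that bundles all dependencies,
# e.g. with the Gradle Shadow plugin, and run that instead.
java -jar spider-text-extraction-0.1.0-all.jar
```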
Any plans on pushing a new release? It seems like there have been a lot of changes since 2.2.0, which is the last release on Maven Central.
Must OkHttp be added as a Gradle dependency for the sole reason of passing an HttpUrl object to the constructor?
Wouldn't it be simpler to use Uri.parse or just a plain String?
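One argument for HttpUrl is that it validates and normalizes the URL eagerly, but a String-accepting convenience could be layered on top. A sketch using OkHttp's own extension (toHttpUrlOrNull is a real OkHttp 4.x API; the wrapper function itself is hypothetical):

```kotlin
import okhttp3.HttpUrl
import okhttp3.HttpUrl.Companion.toHttpUrlOrNull

// Hypothetical convenience wrapper: callers pass a plain String, and the
// conversion to OkHttp's HttpUrl happens in one place. toHttpUrlOrNull()
// returns null for anything that is not a valid http/https URL.
fun parseUrlOrThrow(url: String): HttpUrl =
  url.toHttpUrlOrNull() ?: throw IllegalArgumentException("Not a valid http/https URL: $url")

fun main() {
  println(parseUrlOrThrow("https://chimbori.com/crux").host)
}
```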
Hi there.
First let me say that Crux is a very useful framework, and that it is the only one I have found so far that can deal with CJK.
I have found one instance of a website so far where the main article content is in a div with "article" class, which even contains an article element, also with "article" in the class, and yet a hidden div with text for creating a profile is always chosen as the article content instead. I think perhaps the weight of the class/tag "article" is not high enough, or child elements of something that is hidden on the page don't have their scores lowered correctly.
An example article that always fails to be extracted is: https://www.news24.com/World/News/watch-indonesia-frees-bali-nine-drug-smuggler-lawrence-from-prison-20181121
I have been looking but I can't seem to find a way to customise the scoring without forking the project. Is there a way to do it that I just haven't found?
On the following page: https://www.cnbc.com/2020/01/07/how-to-set-a-family-member-with-a-disability-on-a-great-financial-path.html
Crux fails badly: it only extracts some text from the middle of the page.
I realize that working well on all pages is a very difficult task, but I'm hoping you can figure something out nonetheless.