crux's People

Contributors

arunkumar9t2, bejean, chimbori, ciferkey, dajac, gomiguchi, hnrc, ifesdjeen, jloomis, jonathansantilli, karussell, kinow, kireet, manastungare, nzv8fan, pyr, sigpwned, soebbing, tjerkw, xiangronglin

crux's Issues

Crux replaces page title with site title.

I've been running crux over several sites and noticed the following bug.

Problem

Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816

Test based off the README example to verify the problem:

  @Test
  fun broken() {
    val crux = Crux()

    val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()

    val document = Jsoup.connect(httpUrl.toString()).get()

    val resource = runBlocking {
      crux.extractFrom(httpUrl, document)
    }

    assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
  }

The sequence of events is:

  • HtmlMetadataExtractor correctly extracts the right title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"
  • WebAppManifestParser extracts the title "BBC"
  • The fold operation in Crux.extractFrom uses Resource.plus to merge the resources overwriting the title with "BBC"
    fields = if (anotherResource?.fields == null) fields else fields + anotherResource.fields,
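The overwrite in the last step follows from how Kotlin's `Map.plus` treats duplicate keys: the value from the right-hand map wins. A minimal sketch using plain string maps as stand-ins for the resource fields (the helper name `mergeLastWins` is made up for illustration and is not part of Crux):

```kotlin
/** Illustrative stand-in for the merge: plain maps, not Crux's Resource type. */
fun mergeLastWins(first: Map<String, String>, second: Map<String, String>): Map<String, String> =
  // Map.plus keeps the right-hand value when both maps contain the same key.
  first + second

fun main() {
  val fromHtmlMetadata = mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News")
  val fromWebAppManifest = mapOf("title" to "BBC")

  val merged = mergeLastWins(fromHtmlMetadata, fromWebAppManifest)
  println(merged["title"]) // prints "BBC"
}
```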

Possible solutions

If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor like this:

public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck into CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any tracking parameters remaining.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)

It will produce the correct results.

This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor or can we reorder it?

If that is not possible then we might need to consider a new way to handle merging the fields.
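One possible shape for such a merge, sketched with plain string maps rather than Crux's `Resource` type: keep a value that an earlier plugin already set, and only fill in keys that are missing or blank. The function name `mergeKeepingExisting` and the `theme-color` value are hypothetical and only serve the illustration.

```kotlin
/** Hypothetical helper, not part of Crux: earlier non-blank values win over later ones. */
fun mergeKeepingExisting(
  existing: Map<String, String>,
  incoming: Map<String, String>,
): Map<String, String> {
  val result = existing.toMutableMap()
  for ((key, value) in incoming) {
    // Only take the incoming value when nothing useful has been set yet.
    if (result[key].isNullOrBlank()) {
      result[key] = value
    }
  }
  return result
}

fun main() {
  val fromHtmlMetadata = mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News")
  val fromWebAppManifest = mapOf("title" to "BBC", "theme-color" to "#ffffff")

  val merged = mergeKeepingExisting(fromHtmlMetadata, fromWebAppManifest)
  println(merged["title"])       // the article title from the HTML metadata is kept
  println(merged["theme-color"]) // filled in because it was missing
}
```

A blanket rule like this would also stop later plugins from replacing values on purpose (the plugin comments note that the manifest may replace an existing favicon URL), so a real fix would probably need a per-field policy.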

[NEWYORKER] Span tags are replaced with a paragraph tag

Sites such as the New Yorker use span elements to make the first letter of their articles more prominent. The letter is supposed to appear inline with the rest of the paragraph, but since Crux replaces spans with paragraph tags in the post-processing step, the single character ends up as its own paragraph in the output.

https://www.newyorker.com/news/our-columnists/putin-and-trumps-ominous-nostalgia-for-the-second-world-war

We can start retaining span tags in the output without a minimum length (because spans in these cases are usually really short) or remove the span tag and only keep the content.
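The second option (dropping the span tag but keeping its content) maps onto Jsoup's `unwrap()`. A rough sketch on a made-up HTML fragment, assuming Jsoup is on the classpath; the function name `unwrapSpans` is hypothetical, and this is not how Crux's post-processing is currently written:

```kotlin
import org.jsoup.Jsoup

/** Removes every <span> wrapper in the fragment while keeping its children inline. */
fun unwrapSpans(html: String): String {
  val document = Jsoup.parseBodyFragment(html)
  // Elements.unwrap() replaces each matched element with its own child nodes.
  document.select("span").unwrap()
  return document.body().html()
}

fun main() {
  // Made-up markup imitating a drop cap; not taken from the New Yorker page.
  println(unwrapSpans("<p><span class=\"drop-cap\">T</span>he quick brown fox</p>"))
}
```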

I can create a PR if you want.

Release to Maven Central

I just re-created the pull request that was lost in the transition, mostly to let you know I see your spiffy new repo and I'm tracking. :)

The pull request is here: #1

I'd recommend against merging that PR for now. I'll continue to work on that branch / PR and will let you know as soon as we're ready to deploy to Maven Central. It'll take some work from you -- e.g., you'll need to do some DNS/webpage stuff to take ownership of the "com.chimbori" group in Maven Central -- but I'll try to make it as turnkey as possible!

Article publication date

Hi Guys,

I have been using crux for a week now and it looks great.
One thing I really miss is extracting the publication date of an article.
Any chance you'd be able to add such a feature?

images, videos, iframes

Hello!
The parsed DOM does not seem to include images (or iframes, and hence videos).
The article object holds a list of Article.Image objects, but the Image class is private, so its properties cannot be read (unless one parses its string form into an object).

Is there any way to:
a) make Image public?
b) add a new property to the Image object that points to its location (DOM position, text position...)?
c) create a similar object for iframes?

This link includes many images (not even lazy-loaded ones), but none appear after parsing, even though article.images holds them all:
https://www.slashgear.com/apple-airpods-2-review-price-performance-wireless-charging-04572129/

Unable to write custom plugin since Plugin interface is sealed

I really don't understand how we should write custom plugins if we can't even extend the Plugin interface since it's sealed.

I have a need for an article extractor, but don't want to do anything with urls, I just want to pass it a jsoup Document instance and for it to spit out an Article when it does its business. I figured I'd write my own minimal plugin which uses some of Crux's extraction functions, but without any HttpUrl or OkHttpClient params inside the constructor. It seems that I can't really do this since:

  1. I can't actually extend Plugin, meaning I can't pass it as part of activePlugins to a Crux constructor,
  2. I have to copy/paste the contents of your extractContent method (which is already copy/pasted inside your ArticleExtractor plugin), which is obviously doable but feels unnecessary
  3. I can't use any pre or post process helpers since the damn things are internal!

Can you explain the logic behind hiding all of this? What should I do?

Use default values with Article data class

Instead of making everything initialize to null, it would be nice to use default values.

Instead of

data class Article(
    var canonicalUrl: HttpUrl,
    var title: String? = null,
    var description: String? = null,
    var siteName: String? = null,
    var themeColor: String? = null,
    var ampUrl: HttpUrl? = null,
    var imageUrl: HttpUrl? = null,
    var videoUrl: HttpUrl? = null,
    var feedUrl: HttpUrl? = null,
    var faviconUrl: HttpUrl? = null,

    /** Estimated reading time, in minutes. This is not populated unless explicitly requested by the caller. */
    var estimatedReadingTimeMinutes: Int? = null,
    var document: Document? = null,
    var keywords: List<String>? = null,
    var images: List<Image>? = null) {

  /** Encapsulates the data from an image found under an element */
  data class Image(
      var srcUrl: HttpUrl? = null,
      var weight: Int = 0,
      var title: String? = null,
      var height: Int = 0,
      var width: Int = 0,
      var alt: String? = null,
      var noFollow: Boolean = false,
      var element: Element? = null) {
    companion object {
      fun from(baseUrl: HttpUrl, imgElement: Element) = Image().apply {
        element = imgElement
        // Some sites use data-src to load images lazily, so prefer the data-src attribute if it exists.
        srcUrl = if (imgElement.attr("data-src").isNotEmpty()) {
          baseUrl.resolve(imgElement.attr("data-src"))
        } else {
          baseUrl.resolve(imgElement.attr("src"))
        }
        width = parseAttrAsInt(imgElement, "width")
        height = parseAttrAsInt(imgElement, "height")
        alt = imgElement.attr("alt")
        title = imgElement.attr("title")
        noFollow = imgElement.parent()?.attr("rel")?.contains("nofollow") == true
      }
    }
  }
}

which forces us to import OkHttp and deal with nulls everywhere, we could use the following:

data class Article(
    var canonicalUrl: String,
    var title: String = "",
    var description: String = "",
    var siteName: String = "",
    var themeColor: String = "",
    var ampUrl: String = "",
    var imageUrl: String = "",
    var videoUrl: String = "",
    var feedUrl: String = "",
    var faviconUrl: String = "",

    /** Estimated reading time, in minutes. This is not populated unless explicitly requested by the caller. */
    var estimatedReadingTimeMinutes: Int = 0,
    var document: Document? = null,
    var keywords: List<String> = emptyList(),
    var images: List<Image> = emptyList()) {

  /** Encapsulates the data from an image found under an element */
  data class Image(
      var srcUrl: String = "",
      var weight: Int = 0,
      var title: String = "",
      var height: Int = 0,
      var width: Int = 0,
      var alt: String = "",
      var noFollow: Boolean = false,
      var element: Element? = null) {
    companion object {
      fun from(baseUrl: HttpUrl, imgElement: Element) = Image().apply {
        element = imgElement
        // Some sites use data-src to load images lazily, so prefer the data-src attribute if it exists.
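        // resolve() returns a nullable HttpUrl, so convert it to a String and fall back to "".
        srcUrl = if (imgElement.attr("data-src").isNotEmpty()) {
          baseUrl.resolve(imgElement.attr("data-src"))?.toString().orEmpty()
        } else {
          baseUrl.resolve(imgElement.attr("src"))?.toString().orEmpty()
        }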
        width = parseAttrAsInt(imgElement, "width")
        height = parseAttrAsInt(imgElement, "height")
        alt = imgElement.attr("alt")
        title = imgElement.attr("title")
        noFollow = imgElement.parent()?.attr("rel")?.contains("nofollow") == true
      }
    }
  }
}
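To illustrate the difference for callers, here is a cut-down sketch with two stand-in classes (`NullableArticle`, `DefaultedArticle`, and `headingFor` are hypothetical names, and each class has only two fields); it is not the actual `Article` class:

```kotlin
// Hypothetical, cut-down stand-ins for the two styles discussed above.
data class NullableArticle(val canonicalUrl: String, val title: String? = null)
data class DefaultedArticle(val canonicalUrl: String, val title: String = "")

fun headingFor(article: NullableArticle): String =
  // Callers must handle the null case explicitly.
  article.title?.uppercase() ?: ""

fun headingFor(article: DefaultedArticle): String =
  // No null handling needed; an absent title is just the empty string.
  article.title.uppercase()

fun main() {
  println(headingFor(NullableArticle("https://example.com")).isEmpty())  // prints "true"
  println(headingFor(DefaultedArticle("https://example.com")).isEmpty()) // prints "true"
}
```

The trade-off is that with defaults, an empty string can no longer be told apart from a value that was never set.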

Exception in thread "main" java.lang.NoClassDefFoundError: com/chimbori/crux/articles/ArticleExtractor

Hi, I use crux version 2.0.1 or 2.0.2. When I build the application and run it on the command line:

java -cp spider-text-extraction-0.1.0.jar com.persian.spider.App

Exception in thread "main" java.lang.NoClassDefFoundError: com/chimbori/crux/articles/ArticleExtractor
    at com.persian.spider.SpiderArticle.textExtraction(SpiderArticle.java:20)
    at com.persian.spider.App.main(App.java:12)
Caused by: java.lang.ClassNotFoundException: com.chimbori.crux.articles.ArticleExtractor
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 2 more

java version

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

Can you help me? Thank you.
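A `NoClassDefFoundError` like this generally means the class was available at compile time but is missing from the runtime classpath; with `-cp` pointing at a single jar, that jar would have to bundle its dependencies. A small diagnostic sketch (the helper name `isOnClasspath` is made up for illustration) that could be run with the same `-cp` argument to check whether a class can be located:

```kotlin
/** Returns true if the named class can be located by the current class loader. */
fun isOnClasspath(className: String): Boolean =
  try {
    // initialize = false: only locate the class, do not run its static initializers.
    Class.forName(className, false, Thread.currentThread().contextClassLoader)
    true
  } catch (e: ClassNotFoundException) {
    false
  }

fun main() {
  println(isOnClasspath("java.lang.String")) // prints "true"
  // Prints "false" when the crux jar is not on the runtime classpath.
  println(isOnClasspath("com.chimbori.crux.articles.ArticleExtractor"))
}
```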

New Release?

Any plans on pushing a new release? It seems there have been a lot of changes since 2.2.0, which is the latest version on Maven Central.

Hidden popup chosen as article

Hi there.

First let me say that Crux is a very useful framework, and that it is the only one I have found so far that can deal with CJK.

I have found one instance of a website so far where the main article content is in a div with "article" class, which even contains an article element, also with "article" in the class, and yet a hidden div with text for creating a profile is always chosen as the article content instead. I think perhaps the weight of the class/tag "article" is not high enough, or child elements of something that is hidden on the page don't have their scores lowered correctly.

An example article that always fails to be extracted is: https://www.news24.com/World/News/watch-indonesia-frees-bali-nine-drug-smuggler-lawrence-from-prison-20181121

I have been looking but I can't seem to find a way to customise the scoring without forking the project. Is there a way to do it that I just haven't found?
