chimbori / crux
Crux offers a flexible plugin-based API & implementation to extract interesting information from Web pages.
License: Apache License 2.0
I've been running crux over several sites and noticed the following bug.
Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816
A test based on the README example that reproduces the problem:
```kotlin
@Test
fun broken() {
  val crux = Crux()
  val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()
  val document = Jsoup.connect(httpUrl.toString()).get()
  val resource = runBlocking {
    crux.extractFrom(httpUrl, document)
  }
  assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
}
```
The sequence of events is: Crux.extractFrom uses Resource.plus to merge the resources, overwriting the title with "BBC".
If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor, like this:
```kotlin
public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck behind CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any remaining tracking parameters.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace an existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)
```
it will produce the correct results.
This is the simplest way to resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor, or can we reorder them?
If reordering is not possible, then we might need to consider a new way to handle merging the fields.
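If reordering isn't an option, one hypothetical merge policy that would fix the symptom regardless of plugin order is "first non-blank value wins". This is only a sketch with illustrative field names and types, not Crux's actual Resource.plus implementation:

```kotlin
// Sketch only: a merge where an already-populated field is never overwritten
// by a later plugin's value. Field names and the Map-based representation are
// illustrative assumptions, not Crux's real Resource type.
fun mergeFields(
  existing: Map<String, String?>,
  incoming: Map<String, String?>,
): Map<String, String?> =
  (existing.keys + incoming.keys).associateWith { key ->
    existing[key]?.takeIf { it.isNotBlank() } ?: incoming[key]
  }

fun main() {
  val merged = mergeFields(
    mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia'"),
    mapOf("title" to "BBC", "siteName" to "BBC"),
  )
  // The earlier, more specific title survives; fields that were still
  // missing are filled in from the later plugin.
  println(merged["title"])
  println(merged["siteName"])
}
```

With this policy, WebAppManifestParser's generic "BBC" could never clobber the article title extracted earlier, no matter where it sits in the plugin list.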
Sites such as the New Yorker use span elements to make the first letter of an article more prominent. These are supposed to appear inline with the rest of the paragraph, but since Crux replaces spans with paragraph tags in the post-process step, the single character ends up as its own paragraph in the output.
We could either start retaining span tags in the output without a minimum length requirement (spans in these cases are usually very short), or remove the span tag and keep only its content.
I can create a PR if you want.
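For illustration, the second option (dropping the tag, keeping the content) is roughly what Jsoup's `unwrap()` does; a minimal sketch, with the length threshold picked arbitrarily:

```kotlin
import org.jsoup.Jsoup

fun main() {
  // Illustrative drop-cap markup of the kind described above.
  val doc = Jsoup.parse("""<p><span class="drop-cap">O</span>nce upon a time there was a span.</p>""")
  // Remove short span tags but keep their text inline, so a later
  // post-process step cannot turn the single character into its own paragraph.
  doc.select("span")
    .filter { it.text().length <= 2 }  // threshold is arbitrary for this sketch
    .forEach { it.unwrap() }
  println(doc.body().html())
}
```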
I just re-created the pull request that was lost in the transition, mostly to let you know I see your spiffy new repo and I'm tracking. :)
The pull request is here: #1
I'd recommend against merging that PR for now. I'll continue to work on that branch / PR and will let you know as soon as we're ready to deploy to Maven Central. It'll take some work from you -- e.g., you'll need to do some DNS/webpage stuff to take ownership of the "com.chimbori" group in Maven Central -- but I'll try to make it as turnkey as possible!
I tried extracting https://us13.campaign-archive.com/?u=67bd06787e84d73db24fb0aa5&id=c3e998f811&e=7bc177b38a
and then rendering it. Looks broken since the
are extracted. Can I send a PR to add them back? What's the best way to do this?
Hi guys,
I've been using Crux for a week now and it looks great.
One thing I really miss is extracting the publication date of an article.
Any chance you'd be able to add such a feature?
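For reference, a sketch of where such a feature would likely look first. This is not Crux API; the selectors below are common metadata conventions, and their priority order here is an assumption:

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Sketch of a publication-date lookup over common metadata conventions
// (Open Graph / article tags, schema.org microdata, and <time> elements).
fun extractPublishedDate(doc: Document): String? = sequenceOf(
  doc.selectFirst("meta[property=article:published_time]")?.attr("content"),
  doc.selectFirst("meta[itemprop=datePublished]")?.attr("content"),
  doc.selectFirst("time[datetime]")?.attr("datetime"),
).firstOrNull { !it.isNullOrBlank() }

fun main() {
  val doc = Jsoup.parse(
    """<html><head><meta property="article:published_time" content="2019-04-05T10:00:00Z"></head></html>"""
  )
  println(extractPublishedDate(doc))
}
```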
Hello!
The parsed DOM does not seem to include images (nor iframes, and hence videos).
The article object holds a list of Article.Image objects, but the Image class is private, so its properties are inaccessible (unless one parses its string representation back into an object).
Is there any way to:
a) make Image public?
b) add a new property to the Image object that points to its location (DOM position, text position, ...)?
c) create a similar object for iframes?
This link includes many images (not even lazy-loaded ones), but none appear after parsing, even though article.images holds them all:
https://www.slashgear.com/apple-airpods-2-review-price-performance-wireless-charging-04572129/
I really don't understand how we should write custom plugins if we can't even extend the Plugin interface, since it's sealed.
I have a need for an article extractor, but I don't want to do anything with URLs; I just want to pass it a Jsoup Document instance and have it spit out an Article when it's done. I figured I'd write my own minimal plugin that uses some of Crux's extraction functions, but without any HttpUrl or OkHttpClient params in the constructor. It seems I can't really do this, since:
1. I can't extend Plugin, meaning I can't pass my class as part of activePlugins to a Crux constructor, and
2. I'd have to copy/paste the extractContent method (which is already copy/pasted inside your ArticleExtractor plugin), which is obviously doable but feels unnecessary.
Can you explain the logic behind hiding all of this? What should I do?
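For what it's worth, the kind of URL-free entry point being asked for could look something like this. Everything here (the names, the selectors, the "first article element wins" heuristic) is a hypothetical sketch standing in for Crux's real extraction and scoring, not its actual API:

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical shape of the requested API: Document in, fields out,
// with no HttpUrl or OkHttpClient anywhere in the signature.
data class SimpleArticle(val title: String, val text: String)

fun extractFromDocument(doc: Document): SimpleArticle {
  // Prefer the Open Graph title, falling back to the <title> element.
  val title = doc.selectFirst("meta[property=og:title]")?.attr("content")
    ?.takeIf { it.isNotBlank() } ?: doc.title()
  // Crude stand-in for Crux's content scoring: take the first <article>
  // element if present, otherwise the whole body.
  val container = doc.selectFirst("article") ?: doc.body()
  return SimpleArticle(title, container.select("p").joinToString("\n") { it.text() })
}

fun main() {
  val doc = Jsoup.parse(
    """<html><head><title>Fallback</title></head>
       <body><article><p>Hello.</p><p>World.</p></article></body></html>"""
  )
  println(extractFromDocument(doc))
}
```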
Instead of making everything initialize to null, it would be nice to use default values instead.
Instead of:
```kotlin
data class Article(
  var canonicalUrl: HttpUrl,
  var title: String? = null,
  var description: String? = null,
  var siteName: String? = null,
  var themeColor: String? = null,
  var ampUrl: HttpUrl? = null,
  var imageUrl: HttpUrl? = null,
  var videoUrl: HttpUrl? = null,
  var feedUrl: HttpUrl? = null,
  var faviconUrl: HttpUrl? = null,
  /** Estimated reading time, in minutes. This is not populated unless explicitly requested by the caller. */
  var estimatedReadingTimeMinutes: Int? = null,
  var document: Document? = null,
  var keywords: List<String>? = null,
  var images: List<Image>? = null) {

  /** Encapsulates the data from an image found under an element */
  data class Image(
    var srcUrl: HttpUrl? = null,
    var weight: Int = 0,
    var title: String? = null,
    var height: Int = 0,
    var width: Int = 0,
    var alt: String? = null,
    var noFollow: Boolean = false,
    var element: Element? = null) {

    companion object {
      fun from(baseUrl: HttpUrl, imgElement: Element) = Image().apply {
        element = imgElement
        // Some sites use data-src to load images lazily, so prefer the data-src attribute if it exists.
        srcUrl = if (imgElement.attr("data-src").isNotEmpty()) {
          baseUrl.resolve(imgElement.attr("data-src"))
        } else {
          baseUrl.resolve(imgElement.attr("src"))
        }
        width = parseAttrAsInt(imgElement, "width")
        height = parseAttrAsInt(imgElement, "height")
        alt = imgElement.attr("alt")
        title = imgElement.attr("title")
        noFollow = imgElement.parent()?.attr("rel")?.contains("nofollow") == true
      }
    }
  }
}
```
This forces us to import OkHttp and deal with nulls everywhere. We could use the following instead:
```kotlin
data class Article(
  var canonicalUrl: String,
  var title: String = "",
  var description: String = "",
  var siteName: String = "",
  var themeColor: String = "",
  var ampUrl: String = "",
  var imageUrl: String = "",
  var videoUrl: String = "",
  var feedUrl: String = "",
  var faviconUrl: String = "",
  /** Estimated reading time, in minutes. This is not populated unless explicitly requested by the caller. */
  var estimatedReadingTimeMinutes: Int = 0,
  var document: Document? = null,
  var keywords: List<String> = emptyList(),
  var images: List<Image> = emptyList()) {

  /** Encapsulates the data from an image found under an element */
  data class Image(
    var srcUrl: String = "",
    var weight: Int = 0,
    var title: String = "",
    var height: Int = 0,
    var width: Int = 0,
    var alt: String = "",
    var noFollow: Boolean = false,
    var element: Element? = null) {

    companion object {
      fun from(baseUrl: HttpUrl, imgElement: Element) = Image().apply {
        element = imgElement
        // Some sites use data-src to load images lazily, so prefer the data-src attribute if it exists.
        val resolved = if (imgElement.attr("data-src").isNotEmpty()) {
          baseUrl.resolve(imgElement.attr("data-src"))
        } else {
          baseUrl.resolve(imgElement.attr("src"))
        }
        // resolve() returns HttpUrl?, so store the String form (empty when unresolvable).
        srcUrl = resolved?.toString().orEmpty()
        width = parseAttrAsInt(imgElement, "width")
        height = parseAttrAsInt(imgElement, "height")
        alt = imgElement.attr("alt")
        title = imgElement.attr("title")
        noFollow = imgElement.parent()?.attr("rel")?.contains("nofollow") == true
      }
    }
  }
}
```
https://cooking.nytimes.com/recipes/1018068-chicken-paprikash
The extracted text doesn't include the ingredients or the instructions.
Hi. In NYT articles, text after the first ad is not extracted.
For example: https://www.nytimes.com/2018/12/06/us/politics/huawei-meng-china-iran.html?action=click&module=Top%20Stories&pgtype=Homepage
I tried extracting it in Hermit and the result is the same.
As the title says: provisions for KMP support by moving from OkHttp to Ktor, and from Jsoup to https://github.com/fleeksoft/ksoup.
I would be happy to provide a PR and contribute to this.
Would anyone be interested in helping out?
Hi, I use Crux version 2.0.1 or 2.0.2. When I build my application and run it on the command line:

```shell
java -cp spider-text-extraction-0.1.0.jar com.persian.spider.App
```

```
Exception in thread "main" java.lang.NoClassDefFoundError: com/chimbori/crux/articles/ArticleExtractor
    at com.persian.spider.SpiderArticle.textExtraction(SpiderArticle.java:20)
    at com.persian.spider.App.main(App.java:12)
Caused by: java.lang.ClassNotFoundException: com.chimbori.crux.articles.ArticleExtractor
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 2 more
```

Java version:

```
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
```

Can you help me? Thank you.
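For context on errors like this: a NoClassDefFoundError at runtime usually means the Crux jar and its transitive dependencies are missing from the runtime classpath, even though compilation succeeded. The jar names and versions below are illustrative only:

```shell
# Option 1: list every dependency jar on the classpath explicitly
# (file names and versions here are illustrative, not exact).
java -cp "spider-text-extraction-0.1.0.jar:crux.jar:jsoup.jar:okhttp.jar" com.persian.spider.App

# Option 2: build a single "fat" jar that bundles all dependencies,
# e.g. with the Gradle Shadow plugin, and run that instead.
java -jar spider-text-extraction-0.1.0-all.jar
```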
Any plans on pushing a new release? It seems like there have been a lot of changes since 2.2.0, which is the last release on Maven Central.
Must OkHttp be added as a Gradle dependency for the sole reason of passing an HttpUrl object to the constructor?
Wouldn't it be simpler to use Uri.parse or just a plain String?
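One argument for HttpUrl is that it validates and normalizes the URL eagerly, but a String-accepting convenience could be layered on top. A sketch using OkHttp's own extension (toHttpUrlOrNull is a real OkHttp 4.x API; the wrapper function itself is hypothetical):

```kotlin
import okhttp3.HttpUrl
import okhttp3.HttpUrl.Companion.toHttpUrlOrNull

// Hypothetical convenience wrapper: callers pass a plain String, and the
// conversion to OkHttp's HttpUrl happens in one place. toHttpUrlOrNull()
// returns null for anything that is not a valid http/https URL.
fun parseUrlOrThrow(url: String): HttpUrl =
  url.toHttpUrlOrNull() ?: throw IllegalArgumentException("Not a valid http/https URL: $url")

fun main() {
  println(parseUrlOrThrow("https://chimbori.com/crux").host)
}
```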
Hi there.
First let me say that Crux is a very useful framework, and that it is the only one I have found so far that can deal with CJK.
I have found one instance of a website so far where the main article content is in a div with "article" class, which even contains an article element, also with "article" in the class, and yet a hidden div with text for creating a profile is always chosen as the article content instead. I think perhaps the weight of the class/tag "article" is not high enough, or child elements of something that is hidden on the page don't have their scores lowered correctly.
An example article that always fails to be extracted is: https://www.news24.com/World/News/watch-indonesia-frees-bali-nine-drug-smuggler-lawrence-from-prison-20181121
I have been looking but I can't seem to find a way to customise the scoring without forking the project. Is there a way to do it that I just haven't found?
On the following page: https://www.cnbc.com/2020/01/07/how-to-set-a-family-member-with-a-disability-on-a-great-financial-path.html
Crux fails badly: it only extracts some text from the middle of the page.
I realize that working well on all pages is a very difficult task, but I'm hoping you can figure something out nonetheless.