
Comments (7)

akolonin commented on August 17, 2024

@dagims - look at https://www.nytimes.com/ - it has a few tens of articles, and each article may be the source of a news item. In such a case, neither title, nor og:title, nor h1 may work, so you need all possible candidates from the html title down to the "text" excerpt. What is suggested in #15 is to use the "spatially closest title candidate preceding the text position", like it is done with images (although images are looked up for the closest both before and after). Levenshtein distance may (or may not) be used as an alternative to spatial distance, to select the most similar title candidate preceding the text body by the spatial index instead of the most spatially close one, but: A) you should measure the difference between the "text" and the "title candidate"; B) should the similarity be based on words instead of letters?
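
To make the two options concrete, here is a minimal sketch, not the code from the PR. The class `TitleCandidates`, the `Candidate` type, and the idea of using a character offset as the "spatial index" are all assumptions made for illustration. It shows picking the closest candidate that precedes the text position, and a Levenshtein distance computed over words rather than letters.

```java
import java.util.List;
import java.util.Locale;

final class TitleCandidates {
    /** Hypothetical candidate: some title-like text and its offset in the page. */
    static final class Candidate {
        final String text;
        final int position;
        Candidate(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    /** Returns the candidate with the largest position still before textPosition, or null if none. */
    static Candidate closestPreceding(List<Candidate> candidates, int textPosition) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.position < textPosition && (best == null || c.position > best.position)) {
                best = c;
            }
        }
        return best;
    }

    /** Levenshtein distance where the edited units are whitespace-separated words. */
    static int wordDistance(String a, String b) {
        String[] x = tokens(a);
        String[] y = tokens(b);
        int[] prev = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j; // turning an empty sequence into j words costs j insertions
        }
        for (int i = 1; i <= x.length; i++) {
            int[] cur = new int[y.length + 1];
            cur[0] = i;
            for (int j = 1; j <= y.length; j++) {
                int cost = x[i - 1].equals(y[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[y.length];
    }

    private static String[] tokens(String s) {
        String t = s.trim().toLowerCase(Locale.ROOT);
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }
}
```

With candidates "Market news" at offset 0 and "Bitcoin is rising" at offset 100, a text body at offset 150 would get "Bitcoin is rising" rather than the page-level title.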

from aigents-java.

dagiopia commented on August 17, 2024

@akolonin respectable sites such as nytimes.com actually do have all of them. These sites set the og:title meta to the correct title of an article. The og:title metadata is the most reliable one because it is used by professional sites and CMSs. Because of CMSs, this applies to a significant number of websites out there. String similarity on letters is needed because of the specific way the strings obtained from the tags mentioned above differ from each other. That is, when looking at a certain page, say this one, the following is the title tag:

<title data-rh="true">As Job Losses Mount, Lawmakers Face a Make-or-Break Moment - The New York Times</title>

and the following is the meta tag:

<meta data-rh="true" property="og:title" content="As Job Losses Mount, Lawmakers Face a Make-or-Break Moment"/>

and finally the following is the header

<h1 id="link-69abe703" class="css-1s4ffep e1h9rw200" itemProp="headline" data-test-id="headline">As Job Losses Mount, Lawmakers Face a Make-or-Break Moment</h1>

As you can see, all three contain the proper title, but the <title> tag contains something extra, which is the name of the website. In this case and similar ones it is the name of the site, but sometimes it's the specific section of the site the article is in, or some generic text. Comparing the strings letter by letter was needed only to get rid of the extra part. I'm sure word similarity can be used instead and would perform equally well if not better, but I was trying to minimize complexity and use an application-specific solution. Going with word similarity will require parsing into words, which might require some sort of dictionary lookup or a similarly more complex algorithm than calculating edit distance.
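
As an illustration only (this is a sketch, not the code in the PR), the extra part can be stripped with a letter-level edit distance as follows. The class `TitleTrimmer` and the method `closestPrefix` are hypothetical names. The method returns the prefix of the <title> text that has the smallest edit distance to a reference string such as the og:title or h1 content, which it reads off the last column of the usual Levenshtein table.

```java
final class TitleTrimmer {
    /**
     * Returns the prefix of title whose character-level edit distance
     * to reference is smallest (the shortest such prefix on ties).
     */
    static String closestPrefix(String title, String reference) {
        int n = reference.length();
        int[] prev = new int[n + 1];
        for (int j = 0; j <= n; j++) {
            prev[j] = j; // distance from the empty prefix to reference[0..j)
        }
        int bestLen = 0;
        int bestDist = prev[n];
        for (int i = 1; i <= title.length(); i++) {
            int[] cur = new int[n + 1];
            cur[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = title.charAt(i - 1) == reference.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // cur[n] is the distance between title[0..i) and the whole reference
            if (cur[n] < bestDist) {
                bestDist = cur[n];
                bestLen = i;
            }
            prev = cur;
        }
        return title.substring(0, bestLen).trim();
    }
}
```

For the <title> and og:title values quoted above, `closestPrefix` returns the title without " - The New York Times". It only handles extra text at the end; a site that puts its name first would need the same idea applied to suffixes.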

On a separate note, I have a question about how and when topic matching is done. My question is which of the following two is correct? (if even one of them is a correct assumption 😆)

  1. Topics are matched while performing crawling. That is, while each web page is being read, it is searched for the topics the user trusts and if there is a match, the page data is stored for presenting to the user later on.
  2. First, all pages from trusted sites are crawled and the extracted text, links, images, etc. are stored; matching then follows, going through the stored text looking for the topics the user is interested in.

akolonin commented on August 17, 2024

@dagims 1) I am not saying "don't use og:title", but "use title candidates which may be spatially and semantically closer to the piece of content than og:title, title, or h1". 2) We work not with "respectable sites" but with "any html pages", which is a big difference. 3) We are matching not "articles" but "news items": "bitcoin is rising because of..." and "dollar is sinking because of ..." are news items which may appear in the same article titled "market news", and such a title may not be the most precise title candidate for these news items.

akolonin commented on August 17, 2024

The "matching" reality is much more complex than either "just 1" or "just 2", because multiple users have multiple trusted topics while the crawl process is shared between the users, and new topics may appear between the crawls. Logically, "just 1" is the right view, but: A) for each site, it considers all users trusting this site and collects topics from all of these users: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Siter.java#L215, B) there is a page cache to prevent redundant re-reads of the same pages: https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/self/Siter.java#L442
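
A rough sketch of points A and B, under stated assumptions: this is not how Siter.java is written, all names here (`SharedCrawlSketch`, `topicsForSite`, `matchedTopics`, `read`) are hypothetical, and matching is reduced to a case-insensitive substring test purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

final class SharedCrawlSketch {
    private final Map<String, String> pageCache = new HashMap<>();
    private final Function<String, String> fetcher; // hypothetical page reader: url -> text
    int fetches = 0; // counts real reads, to show the effect of the cache

    SharedCrawlSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** A) Union of the topics of every user who trusts the given site. */
    static Set<String> topicsForSite(String site,
                                     Map<String, Set<String>> sitesByUser,
                                     Map<String, Set<String>> topicsByUser) {
        Set<String> topics = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : sitesByUser.entrySet()) {
            if (e.getValue().contains(site)) {
                topics.addAll(topicsByUser.getOrDefault(e.getKey(), Collections.emptySet()));
            }
        }
        return topics;
    }

    /** B) Read a page through the cache so the same URL is fetched only once. */
    String read(String url) {
        return pageCache.computeIfAbsent(url, u -> {
            fetches++;
            return fetcher.apply(u);
        });
    }

    /** Simplified matching: the topics that occur in the text, ignoring case. */
    static List<String> matchedTopics(String text, Set<String> topics) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String topic : topics) {
            if (lower.contains(topic.toLowerCase(Locale.ROOT))) {
                found.add(topic);
            }
        }
        return found;
    }
}
```

The point of the sketch is the order of operations: topics are gathered per site across users first, then each page is read once and matched against that combined set while crawling.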

akolonin commented on August 17, 2024

What still needs to be done in PR
A) Eliminate unused symbols:
https://github.com/aigents/aigents-java/pull/18/files#diff-3b52764c5dc3c3c1b07f581c6b0ab39fR59
B) Take care of too LOOOOOOONG titles
https://github.com/aigents/aigents-java/pull/18/files#diff-c6a4981e6efdd02e8b67fb5cc85a19c9R549
So we need to strip the first sentence using a new function String Siter.shortTitle(String longtitle) { ... }
which would (for the simplest implementation) use the symbols in AL.punctuation
https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/al/AL.java#L142
to tokenize the longtitle and then take the first token (in other words, get the first string which neither starts nor ends with punctuation but may contain spaces).
Or you may invent something smarter :-)
The other point: creation of such a "default stupidly intelligent title" should happen AFTER the attempt to create the title using your main logic based on HTML, because if we overload shortTitle in the future with AGI math, it will be too expensive to do that math in advance.
In other words, here is what to do in Siter:

  1. title = null;
  2. try to get title from HTML titler structures
  3. if (AL.empty(title)) title = siter.shortTitle(nl_text);
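
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the committed implementation: `PUNCTUATION` is a stand-in for AL.punctuation (whose actual contents may differ), `empty` is a local stand-in for AL.empty, and `chooseTitle` is a hypothetical wrapper for the flow in Siter.

```java
final class ShortTitleSketch {
    // Assumption: stand-in for AL.punctuation; the real symbol set may differ.
    static final String PUNCTUATION = ".,;:!?";

    /** Local stand-in for AL.empty: true for null or whitespace-only strings. */
    static boolean empty(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Returns the first non-empty run of text between punctuation symbols, or null. */
    static String shortTitle(String longTitle) {
        if (empty(longTitle)) {
            return null;
        }
        StringBuilder token = new StringBuilder();
        for (int i = 0; i < longTitle.length(); i++) {
            char c = longTitle.charAt(i);
            if (PUNCTUATION.indexOf(c) >= 0) {
                if (token.toString().trim().length() > 0) {
                    break; // the first token is complete
                }
                token.setLength(0); // skip leading punctuation and stray spaces
            } else {
                token.append(c);
            }
        }
        String result = token.toString().trim();
        return result.isEmpty() ? null : result;
    }

    /** Steps 1-3: prefer the HTML-derived title, fall back to shortTitle of the text. */
    static String chooseTitle(String htmlTitle, String text) {
        String title = null;
        if (!empty(htmlTitle)) {
            title = htmlTitle.trim();
        }
        if (empty(title)) {
            title = shortTitle(text);
        }
        return title;
    }
}
```

Because the fallback is only reached when the HTML-based title is empty, a more expensive shortTitle could later be swapped in without paying its cost on every page. A plain punctuation split will also cut abbreviations such as "U.S." short, which is one place where "something smarter" would help.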

dagiopia commented on August 17, 2024

@akolonin I decided to modify the regex that checks for AL.punctuation rather than using ParseS because it's much more efficient.

akolonin commented on August 17, 2024

Completed in a88d597.
Many improvements may come later.
