
reader's Issues

Reader only returning the cookie banner

Whenever I attempt to access a page from the unibo.it domain that includes a cookie banner, Jina only returns the content of the banner itself.

Here's an example: https://r.jina.ai/https://www.unibo.it/it/ateneo/organizzazione-e-sedi/servizi-di-ateneo/servizi-online/servizi-online-per-studenti/guida-servizi-online-studenti/liste-di-distribuzione-docenti-studenti

However, when utilizing the x-respond-with header (with any type), all the page content is properly returned.
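For debugging, the two behaviors can be compared by building the request explicitly; a minimal Python sketch (the x-respond-with header name is taken from this report, and "html" stands in for "any type"):

```python
import urllib.request

def reader_request(target_url: str, respond_with: str = "html") -> urllib.request.Request:
    """Build a request to the public reader that forces a response
    type via the x-respond-with header mentioned above."""
    return urllib.request.Request(
        "https://r.jina.ai/" + target_url,
        headers={"x-respond-with": respond_with},
    )

req = reader_request("https://www.unibo.it/it/ateneo")
# Sending is left to the caller, e.g.:
#   body = urllib.request.urlopen(req, timeout=30).read()
```

Omitting the header reproduces the cookie-banner-only result; including it returns the full page content.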

Option to toggle the usage of Readability?

As I was experimenting with your API I noticed that it was a bit "too aggressive" on some pages, removing sections that I would want to keep in the final Markdown.

So I looked around in the project code and also set up an isolated test that used turndown directly, and finally found that the "culprit" was @mozilla/readability.

While Readability does a great job of removing "irrelevant" content before the Markdown conversion in most cases, I can definitely see how its cleanup strategy might be a bit too greedy/aggressive in other cases too (i.e. not only in my specific one). Since I couldn't find any combination of Readability config options that kept the specific "hero" section on the page I was testing with, I'd instead like to suggest adding the ability to simply enable/disable Readability completely.

As I'm not using the project by hosting the actual code locally or on my own server, but rather just using your public API, the ideal scenario would therefore be if this toggle could even exist as e.g. an extra parameter or alternative API endpoint.

Of course turndown should still be configured to remove things like <script> and <style> when not using Readability (if you don't already explicitly do this), but other than that I really think this alternative parsing option could be a very valuable addition!
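As an illustration of that suggested pre-clean step, dropping <script> and <style> before conversion can be sketched with the Python standard library (a stand-in for the idea only; the project's actual pipeline is TypeScript with turndown):

```python
from html.parser import HTMLParser

class ScriptStyleStripper(HTMLParser):
    """Drops <script> and <style> blocks, keeps all other text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # >0 while inside a script/style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_scripts(html: str) -> str:
    """Return the document text with script/style content removed."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    return "".join(parser.parts)
```

A conversion step that skips Readability could run this first, then hand the result to the Markdown converter.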

An error occurred when accessing a page

Accessing the page produced this error: {"data":null,"cause":{},"code":422,"name":"AssertionFailureError","status":42206,"message":"Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173","readableMessage":"AssertionFailureError: Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173"}

I made the request with the Python requests library; this is the request code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
}

url = "https://r.jina.ai/https://www.ddjjjc.gov.cn/pages/news.asp?id=4173"
response = requests.get(url, headers=headers)
print(response.text)

URL loses information in the conversion

Converting the URL https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 results in the loss of several headings (Date and Time, Location, Refund Policy, etc.) in the Jina reader. The rendered result is this:

Title: Brian McLaren | “Wisdom and Courage for a World Falling Apart"

URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317

Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.

Childcare for ages 2-12 will be provided by reservation through April 15.

Not pulling image links correctly

Testing Jina on our Norwegian websites, we see that the images are not pulled correctly:

example URL: https://mikalsenutvikling.no/
Jina URL: https://r.jina.ai/https://mikalsenutvikling.no

First image on the website by Jina is given as ![Image 1: Daglig Leder - André mikalsen](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%201200%201307'%3E%3C/svg%3E)

But in the HTML you can clearly see there is a src attribute with a .png that should be the URL given in the Jina reader version:

src="https://mikalsenutvikling.no/wp-content/uploads/2022/11/Andre-Mikalsen-optimalisert.png"

This has been the same on all sites we have tested.

Support Docker deployment

Could you please add support for Docker deployment to streamline setting up and running the project?

Data Retrieval After Initial Page Load for Timely Updates

When using the API, there's a limitation where it only captures data loaded during the initial page load. However, certain data is loaded dynamically after a brief delay, making it inaccessible through the current implementation. Is there any option available to wait for or retrieve this late-loading data?

Timeout for both twitter and reddit happening frequently

Around 50% of the time, this tweet and the few others I tried show a timeout error: https://r.jina.ai/https://twitter.com/EntreEden/status/1780771887624417315.

The response I get is this:

{"name":"TimeoutError",
"domainThrown":true,
"message":"Timed out after 10000 ms while waiting for the WS endpoint URL to appear in stdout!"}

Same with a reddit post I tried: https://www.reddit.com/r/cscareerquestions/comments/ufdyhd/after_4_years_of_working_im_slowly_learning_how/

I am in Canada, all on my home wifi and computer.

If you need help with scaling, I have a $5 Hetzner vps that I can pitch in. If I get guidance, I'm willing to help out with the issue as well.

DOMException 500 Error

@hanxiao Could you please give some advice for this issue?

`https://r.jina.ai/https://www.msn.com/en-us/news/technology/the-best-ai-search-engines-and-tools-you-can-use-to-search-the-web/ar-BB1kbzFL`

{"code":500,"status":50000,"message":"Failed to execute 'setAttribute' on 'Element': '}' is not a valid attribute name.","name":"DOMException"}

Issue: Jina reader fails to parse URLs containing Chinese characters

Description:

We have encountered an issue where the Jina reader fails to parse URLs that contain Chinese characters. This issue is causing our application to throw errors and prevents us from properly extracting content from certain websites.

Steps to Reproduce:

  1. Make a request to the Jina reader API with a URL containing Chinese characters, such as https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB.
  2. Observe that the Jina reader fails to parse the URL and returns an error.

Expected Behavior:

We expect the Jina reader to properly handle URLs containing Chinese characters and successfully parse the corresponding web pages. The reader should be able to decode the URL, retrieve the web page content, and return it as expected.

Actual Behavior:

When a URL containing Chinese characters is passed to the Jina reader, it fails to parse the URL and throws an error. The error message typically indicates that the reader is unable to read properties of undefined, specifically the 'parentNode' property.

Example Error Message:

Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: {"code":500,"status":50000,"message":"Cannot read properties of undefined (reading 'parentNode')","name":"TypeError"}

Impact:

This issue prevents our application from properly extracting content from websites that have URLs containing Chinese characters. It limits the functionality of our application and affects the user experience when dealing with such websites.

Potential Causes:

  • The Jina reader may not be properly decoding the URL before making the request, leading to an invalid URL being passed to the underlying parsing logic.
  • The parsing logic within the Jina reader may not be handling URLs with Chinese characters correctly, resulting in the "Cannot read properties of undefined" error.

Workaround:

As a temporary workaround, we have implemented a filtering mechanism in our application to skip URLs that contain Chinese characters. However, this is not an ideal solution as it limits the functionality and coverage of our application.
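Rather than skipping such URLs outright, one client-side mitigation worth trying is percent-encoding any non-ASCII characters before handing the URL to the reader; a sketch with the Python standard library (illustrative, not a confirmed fix):

```python
from urllib.parse import quote

def ascii_safe(url: str) -> str:
    """Percent-encode non-ASCII characters in a URL while leaving the
    usual reserved characters intact, so the reader receives a
    pure-ASCII URL. Keeping '%' in safe preserves sequences that are
    already encoded, such as %E5."""
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")

print(ascii_safe("https://zh.wikipedia.org/wiki/周杰倫"))
# -> https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB
```

This would at least let such pages stay in scope while the upstream fix is investigated.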

Request:

We kindly request the Jina team to investigate this issue and provide a fix that allows the Jina reader to properly handle URLs containing Chinese characters. It would be greatly appreciated if you could provide an update on the progress and an estimated timeline for the resolution.

Additional Information:

  • We are currently using the official Jina reader API, not the open-source service.
  • We are in the process of setting up our own service, but we are unsure if the open-source service also has this issue.

Please let us know if you require any further information or if there are any specific details you need to investigate and resolve this issue.

Thank you for your attention to this matter. We look forward to your response and resolution.

Best regards,
Loki.W

Please advise on how to filter out unwanted content

I studied the source code and found that it mainly uses turndown to convert HTML into Markdown.

What puzzles me is that the content returned by your interface is very clean and does not contain the useless parts of the page; I'm very curious how you do this. For example, the navigation bar and comment section of some pages do not appear in the content your interface returns.

May I ask what kind of processing you do? Or is there some source code I haven't seen?

Error fetching webpage content via code but successful via browser

When attempting to fetch webpage content via code, I encounter an error consistently. However, I've noticed that the same URL can be successfully accessed via a browser. The error message I'm receiving is as follows:

Error data from response: {
data: null,
path: 'url',
code: 400,
name: 'ParamValidationError',
status: 40001,
message: 'TypeError: Invalid URL',
readableMessage: 'ParamValidationError(url): TypeError: Invalid URL'
}

URL: https://www.trustpilot.com/review/cortexi.io

I've tried troubleshooting this issue, but so far, I haven't been able to pinpoint the exact cause. Any insights or suggestions on how to resolve this would be greatly appreciated. Thank you!

npm run build failed because shared files are not found

Error:

$ npm run build

> build
> tsc -p .

src/cloud-functions/crawler.ts:3:79 - error TS2307: Cannot find module '../shared' or its corresponding type declarations.

3 import { CloudHTTPv2, Ctx, Logger, OutputServerEventStream, RPCReflect } from '../shared';
                                                                                ~~~~~~~~~~~

src/db/crawled.ts:2:33 - error TS2307: Cannot find module '../shared/lib/firestore' or its corresponding type declarations.

2 import { FirestoreRecord } from '../shared/lib/firestore';
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~

src/db/crawled.ts:9:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

9     static override collectionName = 'crawled';
                      ~~~~~~~~~~~~~~

src/db/crawled.ts:11:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

11     override _id!: string;
                ~~~

src/db/crawled.ts:36:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

36     static override from(input: any) {
                       ~~~~

src/db/crawled.ts:46:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

46     override degradeForFireStore() {
                ~~~~~~~~~~~~~~~~~~~

src/index.ts:6:50 - error TS2307: Cannot find module './shared' or its corresponding type declarations.

6 import { loadModulesDynamically, registry } from './shared';
                                                   ~~~~~~~~~~

src/services/puppeteer.ts:4:24 - error TS2307: Cannot find module '../shared/services/logger' or its corresponding type declarations.

4 import { Logger } from '../shared/services/logger';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:43 - error TS2339: Property 'fromFirestoreQuery' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                              ~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:70 - error TS2339: Property 'COLLECTION' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                                                         ~~~~~~~~~~

src/services/puppeteer.ts:232:25 - error TS2339: Property 'save' does not exist on type 'typeof Crawled'.

232                 Crawled.save(
                            ~~~~

src/services/puppeteer.ts:240:26 - error TS7006: Parameter 'err' implicitly has an 'any' type.

240                 ).catch((err) => {
                             ~~~


Found 12 errors in 4 files.

Errors  Files
     1  src/cloud-functions/crawler.ts:3
     5  src/db/crawled.ts:2
     1  src/index.ts:6
     5  src/services/puppeteer.ts:4

To reproduce, go to backend/functions and run npm run build from an account that doesn't have access to thinapps-shared/backend.

It looks like the thinapps-shared backend mostly handles logging, monitoring, and caching, but at this point the project doesn't build without it.

Pile in reader format

Hi! This looks interesting. I wonder if the Pile dataset could be converted, from its respective URLs, into the Jina reader format to experiment with LLM pre-training?

Set custom timeout?

First of all thank you very much for this library, it's immensely helpful and I'm glad to be a Jina customer.

When we use this with llms, we'd like to enforce a much stricter timeout if possible.

Are there any flags or headers we can pass to enforce a shorter timeout than 30 seconds?

How to start?

npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: undefined,
npm WARN EBADENGINE required: { node: '20' },
npm WARN EBADENGINE current: { node: 'v21.7.3', npm: '10.5.0' }
npm WARN EBADENGINE }

up to date, audited 1005 packages in 2s

146 packages are looking for funding
run npm fund for details

4 critical severity vulnerabilities

To address all issues (including breaking changes), run:
npm audit fix --force

Run npm audit for details.

Respect robots.txt and identify your system

Recently, some AI companies have given website administrators the option of opting out of AI training by using configuration options in robots.txt.

While this project is for prompting and RAG rather than training, I still think you should provide an option for website users to prevent their websites from becoming ad-hoc databases for or components of AI systems. It seems like you have made your software default to evading detection by using puppeteer's stealth plugin; the user-agent configuration that would allow website owners to identify your project's bots is commented out.

I think this default is deceptive and irresponsible. You should make sure users of your project respect these preferences by incorporating them into the software's defaults. Web administrators may not be inclined to support the additional traffic generated by people using their websites as a component of AI systems.

Optimize real src recognition of img with lazy loading

For web pages that lazy-load images, the src of some images is recognized incorrectly.

I'd like the reader to support the data-src attribute of img tags and to validate src: when src is not a legal URL, data-src can be used instead, and if data-src is also not a legal URL, the element should be deleted.

Example: Pictures in articles from WeChat public accounts

![Image 4: Image](data:image/svg+xml,%3C%3Fxmlversion='1.0'encoding='UTF-8'%3F%3E%3Csvgwidth='1px'height='1px'viewBox='0011'version='1.1'xmlns='http://www.w3.org/2000/svg'xmlns:xlink='http://www.w3.org/1999/xlink'%3E%3Ctitle%3E%3C/title%3E%3Cgstroke='none'stroke-width='1'fill='none'fill-rule='evenodd'fill-opacity='0'%3E%3Cgtransform='translate(-249.000000,-126.000000)' fill='%23FFFFFF'%3E%3Crect x='249' y='126' width='1' height='1'%3E%3C/rect%3E%3C/g%3E%3C/g%3E%3C/svg%3E)

Isn't this an incomplete open-source project?

There is no local deployment documentation; nothing is complete. Why does the README still include install commands? Do you actually want to open-source this or not? If you do, you shouldn't release something this unfinished.

a misspelled word

2024-05-08: Image capion is off by default for better latency. To turn it on, set x-with-generated-alt in the request header.


image capion -> image caption

Published time to Json mode

Please add "Published Time" to JSON mode. We are investigating how to incorporate the published time to check for updated content downstream and replace vectors based on whether the published time has changed.

Add parameters to request full text (i.e. don't parse with @mozilla/readability)

Great tool, and I'd love to make more use of it. However, I scrape a lot of website home pages, and those pages wind up having far too much info removed by readability.js. I'd love to have a parameter I can pass that would let me use html-to-text instead. The code to add that is very simple:

import { convert } from 'html-to-text';

const options = {
	wordwrap: false,
	selectors: [
		{ selector: 'a', options: { hideLinkHrefIfSameAsText: true, noAnchorUrl: true, ignoreHref: true, linkBrackets: false } },
		{ selector: 'img', format: 'skip' },
		{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
		{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
	]
};

const text = convert(html, options);

That will get you clean, well-structured text from the HTML, though much, much longer for home pages like https://www.tanium.com. For that home page, the current reader produces text with 327 tokens (using the GPT-4 tiktoken tokenizer), while html-to-text gives back 2554 tokens. That's far too much information loss for my use case, especially given that much of the lost information is critical to understanding what the business does.

Finally (though it wouldn't work in JS), if you're willing to connect a vision model to interpret images, perhaps you'd consider implementing Trafilatura for articles and similar pages, as it slightly outperforms readability.js according to a 2023 analysis from this paper: https://downloads.webis.de/publications/papers/bevendorff_2023b.pdf


While adding a parameter like this conflicts a bit with the convenience of just dropping the URL at the end of https://r.jina.ai/, I think the added info is really critical. It's the difference between me being able to use this for my use case (which I don't think is an extreme edge case) and not being able to use it at all.
