
reader's Issues

Reader only returning the cookie banner

Whenever I attempt to access a page from the unibo.it domain that includes a cookie banner, Jina only returns the content of the banner itself.

Here's an example: https://r.jina.ai/https://www.unibo.it/it/ateneo/organizzazione-e-sedi/servizi-di-ateneo/servizi-online/servizi-online-per-studenti/guida-servizi-online-studenti/liste-di-distribuzione-docenti-studenti

However, when utilizing the x-respond-with header (with any type), all the page content is properly returned.
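For debugging, the two behaviors can be compared by building the request explicitly; a minimal Python sketch (the x-respond-with header name is taken from this report, and "html" stands in for "any type"):

```python
import urllib.request

def reader_request(target_url: str, respond_with: str = "html") -> urllib.request.Request:
    """Build a request to the public reader that forces a response
    type via the x-respond-with header mentioned above."""
    return urllib.request.Request(
        "https://r.jina.ai/" + target_url,
        headers={"x-respond-with": respond_with},
    )

req = reader_request("https://www.unibo.it/it/ateneo")
# Sending is left to the caller, e.g.:
#   body = urllib.request.urlopen(req, timeout=30).read()
```

Omitting the header reproduces the cookie-banner-only result; including it returns the full page content.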

Option to toggle the usage of Readability?

As I was experimenting with your API I noticed that it was a bit "too aggressive" on some pages, removing sections that I would want to keep in the final Markdown.

So I looked around in the project code and also set up an isolated test that used turndown directly, and finally found that the "culprit" was @mozilla/readability.

While Readability does a great job of removing "irrelevant" content before the Markdown conversion in most cases, I can definitely see how its cleanup strategy might be a bit too greedy/aggressive in other cases too (i.e. not only in my specific one). Since I couldn't find any combination of Readability config options that kept the specific "hero" section on the page I was testing with, I'd instead like to suggest adding the ability to simply enable/disable Readability completely.

As I'm not using the project by hosting the actual code locally or on my own server, but rather just using your public API, the ideal scenario would therefore be if this toggle could even exist as e.g. an extra parameter or alternative API endpoint.

Of course turndown should still be configured to remove things like <script> and <style> when not using Readability (if you don't already explicitly do this), but other than that I really think this alternative parsing option could be a very valuable addition!
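As an illustration of that suggested pre-clean step, dropping <script> and <style> before conversion can be sketched with the Python standard library (a stand-in for the idea only; the project's actual pipeline is TypeScript with turndown):

```python
from html.parser import HTMLParser

class ScriptStyleStripper(HTMLParser):
    """Drops <script> and <style> blocks, keeps all other text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # >0 while inside a script/style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_scripts(html: str) -> str:
    """Return the document text with script/style content removed."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    return "".join(parser.parts)
```

A conversion step that skips Readability could run this first, then hand the result to the Markdown converter.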

An error occurred when accessing a page

Accessing the page produced this error: {"data":null,"cause":{},"code":422,"name":"AssertionFailureError","status":42206,"message":"Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173","readableMessage":"AssertionFailureError: Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173"}

I made the request with the Python requests library; this is the request code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
}

url = "https://r.jina.ai/https://www.ddjjjc.gov.cn/pages/news.asp?id=4173"
response = requests.get(url, headers=headers)
print(response.text)

URL loses information in the conversion

Converting the URL https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 results in the loss of several headings (Date and Time, Location, Refund Policy, etc.) in the Jina reader. The rendered result is this:

Title: Brian McLaren | “Wisdom and Courage for a World Falling Apart"

URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317

Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.

Childcare for ages 2-12 will be provided by reservation through April 15.

Not pulling image links correctly

Testing Jina on our Norwegian websites, we see that the images are not pulled correctly:

example URL: https://mikalsenutvikling.no/
Jina URL: https://r.jina.ai/https://mikalsenutvikling.no

First image on the website by Jina is given as ![Image 1: Daglig Leder - André mikalsen](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%201200%201307'%3E%3C/svg%3E)

But in the HTML you can clearly see there is a src attribute with a .png that should be the URL given in the Jina reader version:

src="https://mikalsenutvikling.no/wp-content/uploads/2022/11/Andre-Mikalsen-optimalisert.png"

This has been the same on all sites we have tested.

Support Docker deployment

Could you please add support for Docker deployment to streamline setting up and running the project?

Data Retrieval After Initial Page Load for Timely Updates

When using the API, there's a limitation where it only captures data loaded during the initial page load. However, certain data is loaded dynamically after a brief delay, making it inaccessible through the current implementation. Is there any option available to wait for or retrieve this late-loading data?

Timeout for both twitter and reddit happening frequently

Around 50% of the time, this tweet and the few others I tried show a timeout error: https://r.jina.ai/https://twitter.com/EntreEden/status/1780771887624417315.

The response I get is this:

{"name":"TimeoutError",
"domainThrown":true,
"message":"Timed out after 10000 ms while waiting for the WS endpoint URL to appear in stdout!"}

Same with a reddit post I tried: https://www.reddit.com/r/cscareerquestions/comments/ufdyhd/after_4_years_of_working_im_slowly_learning_how/

I am in Canada, all on my home wifi and computer.

If you need help with scaling, I have a $5 Hetzner vps that I can pitch in. If I get guidance, I'm willing to help out with the issue as well.

DOMException 500 Error

@hanxiao Could you please give some advice for this issue?

`https://r.jina.ai/https://www.msn.com/en-us/news/technology/the-best-ai-search-engines-and-tools-you-can-use-to-search-the-web/ar-BB1kbzFL`

{"code":500,"status":50000,"message":"Failed to execute 'setAttribute' on 'Element': '}' is not a valid attribute name.","name":"DOMException"}

Issue: Jina reader fails to parse URLs containing Chinese characters

Description:

We have encountered an issue where the Jina reader fails to parse URLs that contain Chinese characters. This issue is causing our application to throw errors and prevents us from properly extracting content from certain websites.

Steps to Reproduce:

  1. Make a request to the Jina reader API with a URL containing Chinese characters, such as https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB.
  2. Observe that the Jina reader fails to parse the URL and returns an error.

Expected Behavior:

We expect the Jina reader to properly handle URLs containing Chinese characters and successfully parse the corresponding web pages. The reader should be able to decode the URL, retrieve the web page content, and return it as expected.

Actual Behavior:

When a URL containing Chinese characters is passed to the Jina reader, it fails to parse the URL and throws an error. The error message typically indicates that the reader is unable to read properties of undefined, specifically the 'parentNode' property.

Example Error Message:

Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: {"code":500,"status":50000,"message":"Cannot read properties of undefined (reading 'parentNode')","name":"TypeError"}

Impact:

This issue prevents our application from properly extracting content from websites that have URLs containing Chinese characters. It limits the functionality of our application and affects the user experience when dealing with such websites.

Potential Causes:

  • The Jina reader may not be properly decoding the URL before making the request, leading to an invalid URL being passed to the underlying parsing logic.
  • The parsing logic within the Jina reader may not be handling URLs with Chinese characters correctly, resulting in the "Cannot read properties of undefined" error.

Workaround:

As a temporary workaround, we have implemented a filtering mechanism in our application to skip URLs that contain Chinese characters. However, this is not an ideal solution as it limits the functionality and coverage of our application.
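Rather than skipping such URLs outright, one client-side mitigation worth trying is percent-encoding any non-ASCII characters before handing the URL to the reader; a sketch with the Python standard library (illustrative, not a confirmed fix):

```python
from urllib.parse import quote

def ascii_safe(url: str) -> str:
    """Percent-encode non-ASCII characters in a URL while leaving the
    usual reserved characters intact, so the reader receives a
    pure-ASCII URL. Keeping '%' in safe preserves sequences that are
    already encoded, such as %E5."""
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")

print(ascii_safe("https://zh.wikipedia.org/wiki/周杰倫"))
# -> https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB
```

This would at least let such pages stay in scope while the upstream fix is investigated.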

Request:

We kindly request the Jina team to investigate this issue and provide a fix that allows the Jina reader to properly handle URLs containing Chinese characters. It would be greatly appreciated if you could provide an update on the progress and an estimated timeline for the resolution.

Additional Information:

  • We are currently using the official Jina reader API, not the open-source service.
  • We are in the process of setting up our own service, but we are unsure if the open-source service also has this issue.

Please let us know if you require any further information or if there are any specific details you need to investigate and resolve this issue.

Thank you for your attention to this matter. We look forward to your response and resolution.

Best regards,
Loki.W

Please advise on how to filter out unwanted content

I studied the source code and found that it mainly uses turndown to convert HTML into Markdown.

What puzzles me is that the content returned by your interface is very clean and does not contain the useless parts of the page; I'm very curious how you do this. For example, the navigation bar and comment section of some pages do not appear in the content your interface returns.

May I ask what kind of processing you do? Or is there some source code I haven't seen?

Error fetching webpage content via code but successful via browser

When attempting to fetch webpage content via code, I encounter an error consistently. However, I've noticed that the same URL can be successfully accessed via a browser. The error message I'm receiving is as follows:

Error data from response: {
data: null,
path: 'url',
code: 400,
name: 'ParamValidationError',
status: 40001,
message: 'TypeError: Invalid URL',
readableMessage: 'ParamValidationError(url): TypeError: Invalid URL'
}

URL: https://www.trustpilot.com/review/cortexi.io

I've tried troubleshooting this issue, but so far, I haven't been able to pinpoint the exact cause. Any insights or suggestions on how to resolve this would be greatly appreciated. Thank you!

npm run build failed because shared files are not found

Error:

$ npm run build

> build
> tsc -p .

src/cloud-functions/crawler.ts:3:79 - error TS2307: Cannot find module '../shared' or its corresponding type declarations.

3 import { CloudHTTPv2, Ctx, Logger, OutputServerEventStream, RPCReflect } from '../shared';
                                                                                ~~~~~~~~~~~

src/db/crawled.ts:2:33 - error TS2307: Cannot find module '../shared/lib/firestore' or its corresponding type declarations.

2 import { FirestoreRecord } from '../shared/lib/firestore';
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~

src/db/crawled.ts:9:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

9     static override collectionName = 'crawled';
                      ~~~~~~~~~~~~~~

src/db/crawled.ts:11:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

11     override _id!: string;
                ~~~

src/db/crawled.ts:36:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

36     static override from(input: any) {
                       ~~~~

src/db/crawled.ts:46:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.

46     override degradeForFireStore() {
                ~~~~~~~~~~~~~~~~~~~

src/index.ts:6:50 - error TS2307: Cannot find module './shared' or its corresponding type declarations.

6 import { loadModulesDynamically, registry } from './shared';
                                                   ~~~~~~~~~~

src/services/puppeteer.ts:4:24 - error TS2307: Cannot find module '../shared/services/logger' or its corresponding type declarations.

4 import { Logger } from '../shared/services/logger';
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:43 - error TS2339: Property 'fromFirestoreQuery' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                              ~~~~~~~~~~~~~~~~~~

src/services/puppeteer.ts:178:70 - error TS2339: Property 'COLLECTION' does not exist on type 'typeof Crawled'.

178             const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
                                                                         ~~~~~~~~~~

src/services/puppeteer.ts:232:25 - error TS2339: Property 'save' does not exist on type 'typeof Crawled'.

232                 Crawled.save(
                            ~~~~

src/services/puppeteer.ts:240:26 - error TS7006: Parameter 'err' implicitly has an 'any' type.

240                 ).catch((err) => {
                             ~~~


Found 12 errors in 4 files.

Errors  Files
     1  src/cloud-functions/crawler.ts:3
     5  src/db/crawled.ts:2
     1  src/index.ts:6
     5  src/services/puppeteer.ts:4

To reproduce, go to backend/functions and run npm run build from an account that doesn't have access to thinapps-shared/backend.

It looks like the thinapps-shared backend mostly handles logging, monitoring, and caching, but at this point the project doesn't build without it.

Pile in reader format

Hi! This looks interesting. I wonder if the Pile dataset could be converted, from its respective URLs, into the Jina reader format to experiment with LLM pre-training?

Set custom timeout?

First of all thank you very much for this library, it's immensely helpful and I'm glad to be a Jina customer.

When we use this with llms, we'd like to enforce a much stricter timeout if possible.

Are there any flags or headers we can pass to enforce a shorter timeout than 30 seconds?

How to start?

npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: undefined,
npm WARN EBADENGINE required: { node: '20' },
npm WARN EBADENGINE current: { node: 'v21.7.3', npm: '10.5.0' }
npm WARN EBADENGINE }

up to date, audited 1005 packages in 2s

146 packages are looking for funding
run npm fund for details

4 critical severity vulnerabilities

To address all issues (including breaking changes), run:
npm audit fix --force

Run npm audit for details.

Respect robots.txt and identify your system

Recently, some AI companies have given website administrators the option of opting out of AI training by using configuration options in robots.txt.

While this project is for prompting and RAG rather than training, I still think you should provide an option for website users to prevent their websites from becoming ad-hoc databases for or components of AI systems. It seems like you have made your software default to evading detection by using puppeteer's stealth plugin; the user-agent configuration that would allow website owners to identify your project's bots is commented out.

I think this default is deceptive and irresponsible. You should make sure users of your project respect these preferences by incorporating them into the software's defaults. Web administrators may not be inclined to support the additional traffic generated by people using their websites as a component of AI systems.

Optimize real src recognition of img with lazy loading

For web pages that lazy-load images, the src of some images is recognized incorrectly.

I'd like the reader to support the data-src attribute of img tags and to validate src: when src is not a legal URL, data-src can be used instead, and if data-src is also not a legal URL, the element should be deleted.

Example: Pictures in articles from WeChat public accounts

![Image 4: Image](data:image/svg+xml,%3C%3Fxmlversion='1.0'encoding='UTF-8'%3F%3E%3Csvgwidth='1px'height='1px'viewBox='0011'version='1.1'xmlns='http://www.w3.org/2000/svg'xmlns:xlink='http://www.w3.org/1999/xlink'%3E%3Ctitle%3E%3C/title%3E%3Cgstroke='none'stroke-width='1'fill='none'fill-rule='evenodd'fill-opacity='0'%3E%3Cgtransform='translate(-249.000000,-126.000000)' fill='%23FFFFFF'%3E%3Crect x='249' y='126' width='1' height='1'%3E%3C/rect%3E%3C/g%3E%3C/g%3E%3C/svg%3E)

Isn't this an incomplete open-source project?

There is no local deployment documentation; nothing is complete. Why does the README still include install commands? Do you actually want to open-source this or not? If you do, you shouldn't release something this unfinished.

a misspelled word

2024-05-08: Image capion is off by default for better latency. To turn it on, set x-with-generated-alt in the request header.


image capion -> image caption

Published time to Json mode

Please add "Published Time" to JSON mode. We are investigating how to incorporate the published time to check for updated content downstream and replace vectors based on whether the published time has changed.

Add parameters to request full text (i.e. don't parse with @mozilla/readability)

Great tool, and I'd love to make more use of it. However, I scrape a lot of website home pages, and those pages wind up having far too much info removed by readability.js. I'd love to have a parameter I can pass that would let me use html-to-text instead. The code to add that is very simple:

import { convert } from 'html-to-text';

const options = {
	wordwrap: false,
	selectors: [
		{ selector: 'a', options: { hideLinkHrefIfSameAsText: true, noAnchorUrl: true, ignoreHref: true, linkBrackets: false } },
		{ selector: 'img', format: 'skip' },
		{ selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
		{ selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
	]
};

const text = convert(html, options);

That will get you clean, well-structured text from the HTML, though much, much longer for home pages like https://www.tanium.com. For that home page, the current reader produces text with 327 tokens (using the GPT-4 tiktoken tokenizer), while html-to-text gives back 2554 tokens. That's far too much information loss for my use case, especially given that much of the lost information is critical to understanding what the business does.

Finally (though it wouldn't work in JS), if you're willing to connect a vision model to interpret images, perhaps you'd consider implementing Trafilatura for articles and similar pages, as it slightly outperforms readability.js according to a 2023 analysis from this paper: https://downloads.webis.de/publications/papers/bevendorff_2023b.pdf


While adding a parameter like this conflicts a bit with the convenience of just dropping the URL at the end of https://r.jina.ai/, I think the added info is really critical. It's the difference between me being able to use this for my use case (which I don't think is an extreme edge case) and not being able to use it at all.
