jina-ai / reader
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
Home Page: https://jina.ai/reader
License: Apache License 2.0
Request to add an interface that receives HTML strings. This can avoid passing cookies and other information to access web pages, which is more secure and flexible.
Hi,
I understand you probably have a common library for shared tasks.
But is there a plan to open-source thinapps-shared?
Or what's the open source plan for reader as a whole?
Thanks.
Whenever I attempt to access a page from the unibo.it domain that includes a cookie banner, Jina only returns the content of the banner itself.
However, when utilizing the x-respond-with header (with any type), all the page content is properly returned.
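The workaround described above can be scripted. Here is a minimal Python sketch, assuming the x-respond-with header is sent on an ordinary GET to the r.jina.ai prefix. The value "html" and the helper names build_reader_request and fetch_page are illustrative assumptions; the report only says that any type works.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(target_url: str, respond_with: str = "html") -> urllib.request.Request:
    """Build (but do not send) a Reader request carrying the x-respond-with header.

    The default value "html" is an assumption made for this sketch.
    """
    return urllib.request.Request(
        READER_PREFIX + target_url,
        headers={"x-respond-with": respond_with},
    )


def fetch_page(target_url: str, respond_with: str = "html", timeout_s: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    request = build_reader_request(target_url, respond_with)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_page("https://www.unibo.it/") would then request the page with the header set, which is the case the report says returns the full content.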
As I was experimenting with your API I noticed that it was a bit "too aggressive" on some pages, removing sections that I would want to keep in the final Markdown.
So I looked around both in the project code, as well as setting up an isolated test that only used turndown directly, but finally I found that the "culprit" was @mozilla/readability.
While this seems to do a great job at removing "irrelevant" content before it's converted to Markdown in most cases, I can definitely see how it might be a bit too greedy/aggressive in its cleanup strategy (i.e. not only in my specific case), and since I couldn't really find any combinations of config options for Readability that kept the specific "hero" section on the page I was trying with, I instead wanted to suggest that you might add the ability to simply enable/disable Readability completely?
As I'm not using the project by hosting the actual code locally or on my own server, but rather just using your public API, the ideal scenario would therefore be if this toggle could even exist as e.g. an extra parameter or alternative API endpoint.
Of course turndown should still be configured to remove things like <script> and <style> when not using Readability (if you don't already explicitly do this), but other than that I really think this alternative parsing option could be a very valuable addition!
How does your package deal with client-side rendered websites? It seems to fail to parse a YouTube video page.
Accessing the page produced an error: {"data":null,"cause":{},"code":422,"name":"AssertionFailureError","status":42206,"message":"Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173","readableMessage":"AssertionFailureError: Failed to goto https://ddjjjc.gov.cn/pages/news.asp?id=4173: Error: net::ERR_CERT_COMMON_NAME_INVALID at https://ddjjjc.gov.cn/pages/news.asp?id=4173"}
I made the request with requests. This is my request code:
import requests

# Browser-like headers sent along with the Reader request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
}
# Reader prefix followed by the target page (named url to avoid shadowing the urllib module)
url = "https://r.jina.ai/https://www.ddjjjc.gov.cn/pages/news.asp?id=4173"
response = requests.get(url, headers=headers)
print(response.text)
I tried it on some public pages and it worked fine, but it doesn't seem to work for pages that require login/authentication.
Anyway, thank you.
Converting the URL at https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317 results in the loss of several headings (Date and Time, Location, Refund Policy, etc.) in Jina reader. Rendered result is this:
Title: Brian McLaren | “Wisdom and Courage for a World Falling Apart"
URL Source: https://www.eventbrite.com/e/brian-mclaren-wisdom-and-courage-for-a-world-falling-apart-tickets-823891721317
Markdown Content:
Brian McLaren, author, speaker, activist, and public theologian notes that the challenge of living well and maintaining resilience in turbulent times requires new ways of thinking, becoming, and belonging. Facing nations, ecosystems, economies, religions, and other institutions in disarray, we are called to a spiritual transformation in our own lives that will express itself in transformation in our world.
Childcare for ages 2-12 will be provided by reservation through April 15.
Testing Jina on our Norwegian websites, we see that the images are not pulled correctly:
example URL: https://mikalsenutvikling.no/
Jina URL: https://r.jina.ai/https://mikalsenutvikling.no
First image on the website by Jina is given as ![Image 1: Daglig Leder - André mikalsen](data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%201200%201307'%3E%3C/svg%3E)
But the HTML clearly contains a src pointing to a .png file, which is the URL the Jina reader output should give:
src="https://mikalsenutvikling.no/wp-content/uploads/2022/11/Andre-Mikalsen-optimalisert.png"
This has been the same on all sites we have tested.
First of all, this is amazing! I tested it on a website I was looking at converting and it didn't work.
https://r.jina.ai/https://repairpal.com/honda/civic/radiator-fan-not-working
Original: https://repairpal.com/honda/civic/radiator-fan-not-working
Thoughts?
Fantastic project. Thank you!
Here is a page that (one would think) is straightforward to parse: https://access.redhat.com/security/cve/CVE-2023-45853 . However, none of the relevant information in the page makes it to the parsed version, only the corporate links and "scaffolding".
I figured I'd report it in case this can highlight some areas of improvement. Thanks again!
Could you please add support for docker deployment to streamline setting up and running the project?
I think this tool is quite useful, but when I try it, TimeoutError occurs frequently.
I tried webpages from https://baijiahao.baidu.com/, which are not large. For example, https://baijiahao.baidu.com/s?id=1796552710224293134&wfr=spider&for=pc
When using the API, there's a limitation: it only captures data loaded during the initial page load. However, certain data is loaded dynamically after a brief delay, which makes it inaccessible through the current implementation. Is there any option available for this?
Around 50% of the time, this tweet and the few others I tried show a timeout error: https://r.jina.ai/https://twitter.com/EntreEden/status/1780771887624417315.
The response I get is this:
{"name":"TimeoutError",
"domainThrown":true,
"message":"Timed out after 10000 ms while waiting for the WS endpoint URL to appear in stdout!"}
Same with a reddit post I tried: https://www.reddit.com/r/cscareerquestions/comments/ufdyhd/after_4_years_of_working_im_slowly_learning_how/
I am in Canada, all on my home wifi and computer.
If you need help with scaling, I have a $5 Hetzner vps that I can pitch in. If I get guidance, I'm willing to help out with the issue as well.
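Until the intermittent timeouts are resolved on the server side, a caller can retry on its own. This is a minimal Python sketch of retrying with exponential backoff, assuming the failure surfaces as an exception (an OSError subclass such as TimeoutError) from whatever function performs the request. The name retry_with_backoff and its parameters are hypothetical and not part of the reader project.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts are used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    Only OSError (which includes TimeoutError and urllib's URLError) is retried.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep argument is injectable so the backoff schedule can be checked without actually waiting.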
Hello, I think this project of yours is very useful. May I ask whether you use an LLM to do the processing?
@hanxiao Could you please give some advice for this issue?
`https://r.jina.ai/https://www.msn.com/en-us/news/technology/the-best-ai-search-engines-and-tools-you-can-use-to-search-the-web/ar-BB1kbzFL`
{"code":500,"status":50000,"message":"Failed to execute 'setAttribute' on 'Element': '}' is not a valid attribute name.","name":"DOMException"}
We have encountered an issue where the Jina reader fails to parse URLs that contain Chinese characters. This issue is causing our application to throw errors and prevents us from properly extracting content from certain websites.
https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB
We expect the Jina reader to properly handle URLs containing Chinese characters and successfully parse the corresponding web pages. The reader should be able to decode the URL, retrieve the web page content, and return it as expected.
When a URL containing Chinese characters is passed to the Jina reader, it fails to parse the URL and throws an error. The error message typically indicates that the reader is unable to read properties of undefined, specifically the 'parentNode' property.
Failed to fetch https://zh.wikipedia.org/wiki/%E5%91%A8%E6%9D%B0%E5%80%AB: {"code":500,"status":50000,"message":"Cannot read properties of undefined (reading 'parentNode')","name":"TypeError"}
This issue prevents our application from properly extracting content from websites that have URLs containing Chinese characters. It limits the functionality of our application and affects the user experience when dealing with such websites.
As a temporary workaround, we have implemented a filtering mechanism in our application to skip URLs that contain Chinese characters. However, this is not an ideal solution as it limits the functionality and coverage of our application.
We kindly request the Jina team to investigate this issue and provide a fix that allows the Jina reader to properly handle URLs containing Chinese characters. It would be greatly appreciated if you could provide an update on the progress and an estimated timeline for the resolution.
Please let us know if you require any further information or if there are any specific details you need to investigate and resolve this issue.
Thank you for your attention to this matter. We look forward to your response and resolution.
Best regards,
Loki.W
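To rule out client-side encoding as a factor, the target URL can be normalized before it is appended to the Reader prefix. This is a minimal Python sketch using only the standard library. It does not address the server-side TypeError quoted above, and the helper name encode_target_url is hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_target_url(url: str) -> str:
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Existing %XX escapes are kept as-is ("%" is treated as safe), so an
    already-encoded URL passes through unchanged. Non-ASCII host names are
    not handled in this sketch.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


raw = "https://zh.wikipedia.org/wiki/周杰倫"
encoded = encode_target_url(raw)
print(encoded)
print(unquote(urlsplit(encoded).path))
```

The encoded form matches the URL from the report, so the failing request was already percent-encoded when it reached the Reader.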
Hey!
How can I configure/use an HTTP Proxy?
For example: https://docs.zyte.com/zyte-api/usage/proxy-mode.html
I studied the source code as a whole and found that you mainly use turndown to convert HTML into Markdown.
What puzzles me is that the output returned by your interface is very clean and free of useless page content. For example, the navigation bar and comment area of some pages do not appear in the returned content. I am very curious about how you do it.
What kind of processing have you done? Or is there some source code that I haven't seen?
When attempting to fetch webpage content via code, I encounter an error consistently. However, I've noticed that the same URL can be successfully accessed via a browser. The error message I'm receiving is as follows:
Error data from response: {
data: null,
path: 'url',
code: 400,
name: 'ParamValidationError',
status: 40001,
message: 'TypeError: Invalid URL',
readableMessage: 'ParamValidationError(url): TypeError: Invalid URL'
}
URL:https://www.trustpilot.com/review/cortexi.io
I've tried troubleshooting this issue, but so far, I haven't been able to pinpoint the exact cause. Any insights or suggestions on how to resolve this would be greatly appreciated. Thank you!
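The cause of this particular error is not established here. One thing worth checking on the client side is that the string appended to the Reader prefix is a clean absolute URL, with no stray whitespace or label text. This is a minimal Python sketch of such a check; the helper name validate_target_url is hypothetical.

```python
from urllib.parse import urlsplit


def validate_target_url(raw: str) -> str:
    """Return the trimmed URL, or raise ValueError if it is not absolute http(s)."""
    url = raw.strip()
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {raw!r}")
    return url


# Trailing newlines or spaces (common when URLs are read from a file) are removed.
print(validate_target_url("https://www.trustpilot.com/review/cortexi.io\n"))
```

A string such as "www.trustpilot.com/review/cortexi.io" (no scheme) is rejected locally by this check instead of being sent on.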
Error:
$ npm run build
> build
> tsc -p .
src/cloud-functions/crawler.ts:3:79 - error TS2307: Cannot find module '../shared' or its corresponding type declarations.
3 import { CloudHTTPv2, Ctx, Logger, OutputServerEventStream, RPCReflect } from '../shared';
~~~~~~~~~~~
src/db/crawled.ts:2:33 - error TS2307: Cannot find module '../shared/lib/firestore' or its corresponding type declarations.
2 import { FirestoreRecord } from '../shared/lib/firestore';
~~~~~~~~~~~~~~~~~~~~~~~~~
src/db/crawled.ts:9:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.
9 static override collectionName = 'crawled';
~~~~~~~~~~~~~~
src/db/crawled.ts:11:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.
11 override _id!: string;
~~~
src/db/crawled.ts:36:21 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.
36 static override from(input: any) {
~~~~
src/db/crawled.ts:46:14 - error TS4112: This member cannot have an 'override' modifier because its containing class 'Crawled' does not extend another class.
46 override degradeForFireStore() {
~~~~~~~~~~~~~~~~~~~
src/index.ts:6:50 - error TS2307: Cannot find module './shared' or its corresponding type declarations.
6 import { loadModulesDynamically, registry } from './shared';
~~~~~~~~~~
src/services/puppeteer.ts:4:24 - error TS2307: Cannot find module '../shared/services/logger' or its corresponding type declarations.
4 import { Logger } from '../shared/services/logger';
~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/services/puppeteer.ts:178:43 - error TS2339: Property 'fromFirestoreQuery' does not exist on type 'typeof Crawled'.
178 const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
~~~~~~~~~~~~~~~~~~
src/services/puppeteer.ts:178:70 - error TS2339: Property 'COLLECTION' does not exist on type 'typeof Crawled'.
178 const cached = (await Crawled.fromFirestoreQuery(Crawled.COLLECTION.where('urlPathDigest', '==', digest).orderBy('createdAt', 'desc').limit(1)))?.[0];
~~~~~~~~~~
src/services/puppeteer.ts:232:25 - error TS2339: Property 'save' does not exist on type 'typeof Crawled'.
232 Crawled.save(
~~~~
src/services/puppeteer.ts:240:26 - error TS7006: Parameter 'err' implicitly has an 'any' type.
240 ).catch((err) => {
~~~
Found 12 errors in 4 files.
Errors Files
1 src/cloud-functions/crawler.ts:3
5 src/db/crawled.ts:2
1 src/index.ts:6
5 src/services/puppeteer.ts:4
To reproduce, go to backend/functions and run npm run build from an account which doesn't have access to the thinapps-shared/backend.
It looks like the thinapps-shared backend mostly handles logging, monitoring, and caching, but at this point, the project doesn't build without it.
Hi! This looks interesting. I wonder if you could convert the Pile dataset taken from respective urls in the jina reader format to experiment with LLM pre-training?
Both are Product Hunt links; one result is complete while the other is incomplete. Trying x-no-cache does not help either. Can you please explain what's going on?
https://r.jina.ai/https://www.producthunt.com/posts/itnk-app
https://r.jina.ai/https://www.producthunt.com/posts/clipwing
First of all thank you very much for this library, it's immensely helpful and I'm glad to be a Jina customer.
When we use this with LLMs, we'd like to enforce a much stricter timeout if possible.
Are there any flags or headers we can pass to enforce a shorter timeout than 30 seconds?
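Whether the service accepts a timeout flag or header is not established in this report. Independently of that, the caller can enforce a stricter limit locally. This is a minimal Python sketch, assuming a plain GET against the r.jina.ai prefix; the helper name fetch_with_timeout and the injectable opener parameter are hypothetical. Note that the urllib timeout applies to each blocking socket operation, not to the request as a whole.

```python
import urllib.request
from typing import Callable, Optional

READER_PREFIX = "https://r.jina.ai/"


def fetch_with_timeout(
    target_url: str,
    timeout_s: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> Optional[str]:
    """Return the Reader output, or None if the request times out or fails.

    OSError covers TimeoutError as well as urllib's URLError and HTTPError,
    so HTTP error statuses also yield None in this sketch.
    """
    try:
        with opener(READER_PREFIX + target_url, timeout=timeout_s) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        return None
```

Returning None lets the calling LLM pipeline skip or fall back quickly instead of waiting for the full 30 seconds.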
npm WARN EBADENGINE Unsupported engine {
npm WARN EBADENGINE package: undefined,
npm WARN EBADENGINE required: { node: '20' },
npm WARN EBADENGINE current: { node: 'v21.7.3', npm: '10.5.0' }
npm WARN EBADENGINE }
up to date, audited 1005 packages in 2s
146 packages are looking for funding
  run `npm fund` for details
4 critical severity vulnerabilities
To address all issues (including breaking changes), run:
  npm audit fix --force
Run `npm audit` for details.
Recently, some AI companies have given website administrators the option of opting out of AI training by using configuration options in robots.txt.
While this project is for prompting and RAG rather than training, I still think you should provide an option for website users to prevent their websites from becoming ad-hoc databases for or components of AI systems. It seems like you have made your software default to evading detection by using puppeteer's stealth plugin; the user-agent configuration that would allow website owners to identify your project's bots is commented out.
I think this default is deceptive and irresponsible. You should make sure users of your project respect these preferences by incorporating them into the software's defaults. Web administrators may not be inclined to support the additional traffic generated by people using their websites as a component of AI systems.
For example, this page: https://m.ke.com/bj/ershoufang/101120972798.html
This site redirects requests to a default page, and you need to click "Continue" on that page before later requests go directly to the target page.
I found that the reader gets the content of the default page every time. Can you please see if there is a good way to fix this?
For web pages that lazy-load images, the src of some images is recognized incorrectly.
I would like the reader to support the data-src attribute of img tags and to check whether src is valid.
When src is not a valid URL, data-src can be used instead. If data-src is not a valid URL either, the element should be dropped.
Example: pictures in articles from WeChat public accounts
![Image 4: Image](data:image/svg+xml,%3C%3Fxmlversion='1.0'encoding='UTF-8'%3F%3E%3Csvgwidth='1px'height='1px'viewBox='0011'version='1.1'xmlns='http://www.w3.org/2000/svg'xmlns:xlink='http://www.w3.org/1999/xlink'%3E%3Ctitle%3E%3C/title%3E%3Cgstroke='none'stroke-width='1'fill='none'fill-rule='evenodd'fill-opacity='0'%3E%3Cgtransform='translate(-249.000000,-126.000000)' fill='%23FFFFFF'%3E%3Crect x='249' y='126' width='1' height='1'%3E%3C/rect%3E%3C/g%3E%3C/g%3E%3C/svg%3E)
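The fallback rule proposed above can be sketched as follows. This is a minimal Python illustration, not the reader's implementation. It assumes that a "valid URL" means an absolute http(s) URL, so data: placeholders and relative paths are rejected; the function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_absolute_http_url(value: Optional[str]) -> bool:
    """True for absolute http(s) URLs; False for data: URIs, relative paths, or empty values."""
    if not value:
        return False
    parts = urlsplit(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def resolve_image_src(src: Optional[str], data_src: Optional[str]) -> Optional[str]:
    """Prefer src, fall back to data-src, and return None when the image should be dropped."""
    for candidate in (src, data_src):
        if is_absolute_http_url(candidate):
            return candidate.strip()
    return None


placeholder = "data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%3E%3C/svg%3E"
print(resolve_image_src(placeholder, "https://example.com/real-image.png"))
print(resolve_image_src(placeholder, None))
```

A real implementation would likely also resolve relative URLs against the page URL before validating them, which this sketch leaves out.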
In this page https://creator.poe.com/docs/quick-start
all the (bold and big) headers are wrongly removed by Jina.
Example (html on the left, jina markdown (rendered) on the right)
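There is no documentation for local deployment. Nothing is complete, so why does the README still list installation commands? Do you actually want to open-source this or not? If you do, you shouldn't release something this unfinished.
The page I want to read requires a logged-in session to access.
How do I call the reader with curl?
Can I add the cookie with curl -H?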
I cannot get complete information for this URL:
https://r.jina.ai/https://help.webex.com/29odsb/
Please help check.
Jina Reader does not extract the job postings from this website:
https://r.jina.ai/https://www.globalrelay.com/company/careers/jobs/?gh_jid=5117891004
https://www.neu.edu.cn/xygk/lrld.htm
From this university leadership page, only 宁恩承 (Ning Encheng) is extracted.
Hi, friends in the Jina community,
We have created a client for Jina and released it in the latest version of x-cmd.
Here is the demo:
https://www.x-cmd.com/mod/jina
We combine the Jina client with the Stack Exchange and Wikipedia clients in x-cmd to prepare context for our GPT/Gemini/Kimi/... LLM clients.
We are still exploring more uses of Jina in the x-cmd ecosystem. Thank you for providing this service.
2024-05-08: Image capion is off by default for better latency. To turn it on, set x-with-generated-alt in the request header.
image capion -> image caption
Please add "Published Time" to JSON mode. We are investigating how to incorporate the published time to check for updated content downstream and replace vectors based on whether the published time has changed.
How do I deploy it locally with docker?
Great tool, would love to make more use of this, however, I scrape a lot of home pages of websites and those pages wind up having far too much info removed by readability.js. I'd love to have a parameter I can pass that would allow me to use html-to-text instead. The code to add that is very simple:
import { convert } from 'html-to-text';

// Keep link text but drop hrefs, skip images, and use single line breaks around blocks.
const options = {
  wordwrap: false,
  selectors: [
    { selector: 'a', options: { hideLinkHrefIfSameAsText: true, noAnchorUrl: true, ignoreHref: true, linkBrackets: false } },
    { selector: 'img', format: 'skip' },
    { selector: 'p', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'pre', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
  ],
};

// html is the page's HTML string
const text = convert(html, options);
That will get you clean text that's well structured from the html though much much longer for home pages like https://www.tanium.com. For that home page, the current reader produces text with: 327 tokens (using the gpt4 tiktoken tokenizer) and html-to-text gives back 2554 tokens. That's far too much information loss for my use case, esp given that much of that information is critical to understanding what the business does.
Finally, while it wouldn't work in js - if you're willing to connect a vision model to interpret images, perhaps you'd consider implementing Trafilatura for articles and similar pages as it slightly outperforms readability.js based on a 2023 analysis from this paper: https://downloads.webis.de/publications/papers/bevendorff_2023b.pdf
Here's one image from the analysis:
While adding a parameter like this conflicts a bit with prioritizing the convenience of just dropping the url at the end of https://r.jina.ai/ I think the added info is really critical. It's the difference between me being able to use this for my use case (which I don't think is an extreme edge case) and not being able to use this...