
feed-extractor's Introduction

Installation

pnpm i @extractus/extractus

Usage

import { extract } from '@extractus/extractus'

extract(htmlString, options)

Reference

Extractor

Extract all strings from the HTML. Example: packages/defaults/extractors.ts

type Extractor =
  | ((input: string, context?: ExtractContext) => string | undefined)
  | ((input: string) => string | undefined)

Transformer

Transform the extracted strings, e.g. normalize URLs or filter out blank strings. Example: packages/defaults/transformer.ts

type Transformer =
  | ((input: Iterable<string | undefined>, context?: ExtractContext) => Iterable<string | undefined>)
  | ((input: Iterable<string | undefined>) => Iterable<string | undefined>)

Selector

Select one value from the transformed values, e.g. the first title, or a string converted to a Date object. Example: packages/defaults/selector.ts

type Selector<T> =
  | ((input: Iterable<string>, context?: ExtractContext) => T)
  | ((input: Iterable<string>) => T)
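To make the three shapes concrete, here is a minimal sketch of how an extractor, a transformer, and a selector could be chained. The functions and the local type aliases are illustrative examples, not the library's built-ins.

```typescript
// Illustrative pipeline: extract -> transform -> select.
// These local aliases mirror the documented shapes; ExtractContext is
// assumed here to be an arbitrary context object.
type ExtractContext = Record<string, unknown>
type Extractor = (input: string, context?: ExtractContext) => string | undefined
type Transformer = (input: Iterable<string | undefined>) => Iterable<string | undefined>
type Selector<T> = (input: Iterable<string>) => T

// Hypothetical extractor: pull the <title> text out of an HTML string.
const titleExtractor: Extractor = (html) =>
  /<title>(.*?)<\/title>/.exec(html)?.[1]

// Hypothetical transformer: trim values and drop blanks.
const trimBlanks: Transformer = function* (values) {
  for (const v of values) {
    const t = v?.trim()
    if (t) yield t
  }
}

// Hypothetical selector: take the first remaining value.
const first: Selector<string | undefined> = (values) => {
  for (const v of values) return v
  return undefined
}

function runPipeline (html: string): string | undefined {
  const extracted = [titleExtractor].map((e) => e(html))
  const cleaned = [...trimBlanks(extracted)].filter((v): v is string => v !== undefined)
  return first(cleaned)
}
```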

Development

Use pnpm to manage the workspace:

  • Clone the repo
  • Open the project in a terminal or IDE
  • Run pnpm i at the root of the project

Roadmap

https://github.com/orgs/extractus/projects/2/views/1

feed-extractor's People

Contributors

almis90, ekoeryanto, eviltik, kahosan, m4rc3l05, ndaidong, neizod, olsonpm, turt2live


feed-extractor's Issues

RSS Results Structure Changes Depending on Normalization

Hello!

I am glad to have found your module, it looks like it will make handling feeds easy.


The structure of the result from fetching an RSS feed depends on whether the normalization option is set.

If it is false, the object containing the feed items is called item; if it is true, it is called entries.

I don't know whether this is intended behaviour, but the documentation doesn't mention it either way.
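Until the behaviour is documented, a small guard like the following can smooth over the difference. This is a sketch based purely on this report; the item/entries key names come from the report, not from a documented contract.

```javascript
// Read feed items regardless of the `normalization` setting described above.
// Key names follow the issue report, not a documented API guarantee.
function feedItems (result, normalization) {
  const items = normalization ? result.entries : result.item
  // Some XML parsers return a bare object for single-item feeds.
  return Array.isArray(items) ? items : items ? [items] : []
}
```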

Support Get favicon

Hi, can feed-extractor and oEmbed Extractor support crawling the favicon URL, like article-extractor does?

[Feature Request] need more fields, want the result can be customized via options

I use this tool to parse RSS feeds, but some fields I need, such as image and owner, are not in the result.

I'd like two options, extraFeedFields and extraEntryFields, used as functions whose return values are merged into the feed and entry fields, so everyone can customize the result.

const feedData = await read('https://some-rss-feed-xml/', {
    extraFeedFields: (channel) => {
        return {
            image: channel['itunes:image'],
            owner: channel['itunes:owner']
        }
    },
})

result:

{
  "title": "xxx",
  "link": "xxx",
  "description": "xxx",
  "language": "",
  "generator": "",
  "published": "",
  "entries": [...],
  "image": {...},
  "owner": {...}
}

The link cannot be resolved when the hostname is not included

example:

<channel>
  <link>/</link>
  <language>en</language>
  <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
  <item>
    <link>/posts/2023/06/piem/</link>
    <guid>/posts/2023/06/piem/</guid>
  </item>
</channel>

When the link is in the above format, it will be resolved as null:

{
  "link": null,
  "language": "en",
  "atom:link": {
    "@_href": "/index.xml",
    "@_rel": "self",
    "@_type": "application/rss+xml"
  },
  "item": [
    {
      "link": null,
      "guid": "/posts/2023/06/piem/"
    }
  ]
}

CDATA in description not parsed as desired

Hi Team,

Thanks for building this open-source tool. I'm new to dealing with RSS feeds and wanted an easy way to parse the data into typed objects. I'm having an issue with one feed that embeds a lot of CDATA in the description, containing HTML with styles, links to images, etc.

Here is an example:
(NOTE: the browser hides some of this; opening this issue in Edit view should show all the data. If there is a way to prevent it from rendering as HTML in this issue, I don't know it.)

<description><![CDATA[<a href="https://someorg.org/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/" title="Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs" rel="nofollow"><img width="300" height="157" src="https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-300x157.jpg" class="webfeedsFeaturedVisual wp-post-image" alt="German Flag over building" decoding="async" style="float: left; margin-right: 5px;" link_thumbnail="1" loading="lazy" srcset="https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-300x157.jpg 300w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-1024x536.jpg 1024w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI-768x402.jpg 768w, https://someorg.org/wp-content/uploads/2022/11/Blog-German-DD-FI.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a><p>The German Supply Chain <span class="glossaryLink"  aria-describedby="tt"  data-cmtooltip="&#38;lt;!-- wp:paragraph --&#38;gt;Often the second stage in the third-party risk management life cycle. Due diligence involves conducting a review of a potential third party prior to signing a contract. This review should involve developing a deeper understanding of the third party&#8217;s ownership, operations, resources, financial status, relevant employees, risk and control framework, business continuity program, third-party risk management program, and other factors important to the third-party relationship. Due diligence helps ensure the organization selects an appropriate third party to partner with, and that the organization understands both the inherent and residual risks posed by the relationship. 
These residual risks should be within the organization&#8217;s risk appetite.&#38;lt;br/&#38;gt;&#38;lt;!-- /wp:paragraph --&#38;gt;"  data-gt-translate-attributes='[{"attribute":"data-cmtooltip", "format":"html"}]'>Due Diligence</span> Act goes into effect January 2023 and is already making waves within supply chain, risk management, and compliance communities. [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://someorg.org/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/">Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs</a> appeared first on <a rel="nofollow" href="https://someorg.org">Aravo</a>.</p>
]]></description>

Options: { descriptionMaxLen: 20000, xmlParserOptions: { /* I've tried a bunch; nothing worked */ } }

Output:

description:  "The German Supply Chain Due Diligence Act goes into effect January 2023 and is already making waves within supply chain, risk management, and compliance communities. [&#8230;] The post Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs appeared first on Aravo."
link:  "https://aravo.com/blog/meeting-the-obligations-of-the-german-supply-chain-due-diligence-act-faqs/"
published:  "2022-12-01T14:33:07.000Z"
title:  "Meeting the Obligations of the German Supply Chain Due Diligence Act: FAQs"

Desired output: All contents of the description CDATA

Questions:

  • Is this something that can be supported?
  • How unusual (to you) is this use of the description field (all CDATA of HTML)?

Disable item description trimming?

Is it possible to skip truncating the description and returning full contents in any way?

Right now I pass descriptionMaxLen with some impossibly large value (999999), but it's a bit of a hacky workaround.

It would be great if I could pass -1 or false to skip description truncation altogether.
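A sketch of the requested semantics (this is the proposal, not how the library currently works): treat false or a negative descriptionMaxLen as "no limit".

```javascript
// Proposed semantics: `false` or a negative limit disables truncation.
// Sketch of the feature request above, not the library's implementation.
function truncateDescription (text, descriptionMaxLen) {
  const disabled = descriptionMaxLen === false || descriptionMaxLen < 0
  return disabled ? text : text.slice(0, descriptionMaxLen)
}
```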

Add support for fetch options

I needed to be able to pass options to the underlying fetch to adjust timeout, etc. Here's a patch to enable that, in case it's of use to anyone else:

@@ -15,9 +15,9 @@
 var isArray = bella.isArray;
 var isObject = bella.isObject;

-var toJSON = (source) => {
+var toJSON = (source, opts) => {
   return new Promise((resolve, reject) => {
-    fetch(source).then((res) => {
+    fetch(source, opts).then((res) => {
       if (res.ok && res.status === 200) {
         return res.text();
       }
@@ -174,9 +174,9 @@
 };


-var parse = (url) => {
+var parse = (url, opts = {}) => {
   return new Promise((resolve, reject) => {
-    toJSON(url).then((o) => {
+    toJSON(url, opts).then((o) => {
       let result;
       if (o.rss && o.rss.channel) {
         let t = o.rss.channel;

Hardcoded attributeNamePrefix value in xmlParserOptions

Hi!

First and foremost, thanks for your work!

I've been using the library in my GitHub action and I tried to change attributeNamePrefix property in xmlParserOptions but it didn't work. I had a look at the code and noticed it's hardcoded and thus impossible to change:

attributeNamePrefix: '@_',

Is there any reasoning behind this decision I'm not aware of? Is it possible to make it modifiable via xmlParserOptions like the rest of the properties?

I can provide a pull request for this if you don't mind.

Thanks a lot!

Issue with package types

Hi,

There seems to be a problem with the package types, something along the lines of:

There are types at '.../node_modules/@extractus/feed-extractor/index.d.ts', but this result could not be resolved when respecting package.json "exports". The '@extractus/feed-extractor' library may need to update its package.json or typings.

I was able to fix this locally by adding "types": "./index.d.ts" to the exports section of package.json.
I can make a PR for this.

Add `id` property to entries

Nice library!

It would be helpful to have an 'id' property added to each entry. This allows an entry to be uniquely tracked, and ensures that if the URL of a feed item updates, it's still considered the same entry.

  • JSON Feed has a required id property.
  • RSS has guid, but it is optional. If it's not set the recommendation is generally to just use the URL instead as the unique identifier.
  • Atom has the id field for each entry, which is also required.

So it should be pretty easy to normalize this into an id field and make it non-optional in the type definition.
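A rough sketch of that normalization, following the priority the issue suggests. Field names are illustrative, and the '#text' lookup assumes an XML parser that wraps a guid with attributes into an object.

```javascript
// Pick a stable id: explicit id (JSON Feed / Atom), then RSS guid,
// then fall back to the entry URL as the unique identifier.
function normalizeId (entry) {
  const guid = entry.guid && typeof entry.guid === 'object'
    ? entry.guid['#text']
    : entry.guid
  return entry.id || guid || entry.link
}
```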

The atom feed works, but the rss2 and rss feeds throw an error

Site: https://abikw.nvii-dev.de
Fetching the atom feed works, but fetching the rss2 or rss feed gives the following error:

TypeError: item.map is not a function
parseRSS webpack-internal:///./node_modules/feed-reader/src/utils/parser.js:89
read webpack-internal:///./node_modules/feed-reader/src/main.js:39
getFeedFile webpack-internal:///./node_modules/cache-loader/dist/cjs.js?!./node_modules/babel-loader/lib/index.js!./node_modules/cache-loader/dist/cjs.js?!./node_modules/vue-loader-v16/dist/index.js?!./src/pages/News.vue?vue&type=script&lang=js:28
created webpack-internal:///./node_modules/cache-loader/dist/cjs.js?!./node_modules/babel-loader/lib/index.js!./node_modules/cache-loader/dist/cjs.js?!./node_modules/vue-loader-v16/dist/index.js?!./src/pages/News.vue?vue&type=script&lang=js:38
callWithErrorHandling webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6824
callWithAsyncErrorHandling webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6833
callHook webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:2419
applyOptions webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:2321
finishComponentSetup webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6561
setupStatefulComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6473
setupComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6403
mountComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4258
processComponent webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4233
patch webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:3837
patchKeyedChildren webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4722
patchChildren webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4541
patchElement webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4057
processElement webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:3917
patch webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:3834
componentUpdateFn webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:4443
run webpack-internal:///./node_modules/@vue/reactivity/dist/reactivity.esm-bundler.js:195
callWithErrorHandling webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:6824
flushJobs webpack-internal:///./node_modules/@vue/runtime-core/dist/runtime-core.esm-bundler.js:7060
cjs.js:31:17

Code I'm using:

getFeedFile() {
  const url = 'https://abikw.nvii-dev.de/feed/rss';

  this.read(url)
    .then((feed) => {
      console.log('News - getFeedFile - feed', feed);
    })
    .catch((err) => {
      console.log('News - getFeedFile - error: ', err);
    });
},

CORS

Hi,
thx for this awesome library.

Unfortunately, I can't fetch 90% of the RSS sources because of CORS issues.
Do you have any suggestions on how to solve it?
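CORS is enforced by the browser, not by the library, so the usual workaround is to fetch feeds through a server you control (or a CORS proxy) rather than directly from the page. A sketch, where the proxy base URL is a placeholder you would have to host yourself:

```javascript
// Placeholder proxy endpoint -- replace with a server you control that
// fetches the feed server-side and returns it with permissive CORS headers.
const PROXY_BASE = 'https://your-proxy.example.com/fetch?url='

function proxiedFeedUrl (feedUrl) {
  return PROXY_BASE + encodeURIComponent(feedUrl)
}
```

You would then pass the proxied URL to the library instead of the original feed URL.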

Cannot define extra entry fields to fetch

Hi,
The RSS feed I fetch includes, for each item, an illustration image for the published article. As this field is not provided by default, I tried to define it in the parser options as explained in the documentation, but I get the following error when executing my script:

TypeError: Cannot read properties of undefined (reading '@_url')

Here is the definition of my options (I use typescript):

const options = {
    getExtraEntryFields: (entryData: any /* What is the expected type ?? */) => {
        const { enclosure } = entryData
        return {
            enclosure: {
                url: enclosure['@_url'], // enclosure is undefined ...
                type: enclosure['@_type'], // enclosure is undefined ...
            }
        }
    }
}

const rss = await extract(url, options)

I'm new to using RSS feeds; can you help me understand the error and how to fix it?
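The error means enclosure is simply absent on at least one item, so reading enclosure['@_url'] throws. A defensive version of the options in this issue (same shape as the snippet above, with a guard added) would be:

```javascript
const options = {
  getExtraEntryFields: (entryData) => {
    const { enclosure } = entryData
    if (!enclosure) return {} // some items have no enclosure at all
    return {
      enclosure: {
        url: enclosure['@_url'],
        type: enclosure['@_type']
      }
    }
  }
}
```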

IDE type error when adding optional FeedEntry fields like category

In the index.d.ts file

export interface FeedEntry {
  /**
   * id, guid, or generated identifier for the entry
   */
  id: string;
  link?: string;
  title?: string;
  description?: string;
  published?: Date;
}

Since a feed entry allows custom extra keys like categories and enclosure, adding them as optional properties would stop the error that pops up in VSCode.


In my case, since I am using strings for tags (fetching a Medium RSS feed), I add a line like:

interface FeedEntry {
  ...
  category?: Array<string>;
}

But I believe this would fail again if we had a custom object for categories, with text and domain fields, as mentioned in the examples on npm.

CERT_HAS_EXPIRED

Hey,
I am using your npm package for a project.
I am getting the following error:

Error
at Function.createFromInputFallback (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:320:98)
at configFromString (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2385:15)
at configFromInput (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2611:13)
at prepareConfig (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2594:13)
at createFromConfig (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2561:44)
at createLocalOrUTC (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2648:16)
at createLocal (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:2652:16)
at hooks (/Users/wellington/Developer/zuntaz-bots/node_modules/moment/moment.js:12:29)
at normalize (/Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:62:16)
at modify (/Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:139:14)
at Array.map ()
at toRSS (/Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:142:20)
at /Users/wellington/Developer/zuntaz-bots/node_modules/feed-reader/src/main.js:224:18
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:85:5)
FetchError: request to https://www.muywindows.com/feed failed, reason: certificate has expired
at ClientRequest. (/Users/wellington/Developer/zuntaz-bots/node_modules/node-fetch/index.js:133:11)
at ClientRequest.emit (events.js:209:13)
at TLSSocket.socketErrorListener (_http_client.js:406:9)
at TLSSocket.emit (events.js:209:13)
at emitErrorNT (internal/streams/destroy.js:91:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
at processTicksAndRejections (internal/process/task_queues.js:77:11) {
name: 'FetchError',
message: 'request to https://www.muywindows.com/feed failed, reason: certificate has expired',
type: 'system',
errno: 'CERT_HAS_EXPIRED',
code: 'CERT_HAS_EXPIRED'
}

fetch error: cert invalid

Type:     FetchError
Message:  request to https://www.logseqtimes.com/rss/ failed, reason: Hostname/IP does not match certificate's altnames: Host: www.logseqtimes.com. is not in the cert's altnames: DNS:fallback.tls.fastly.net

ref: avelino/bots.clj.social#103


There should be an option to bypass the certificate check.

Empty description when content is wrapped in CDATA

Hi!

When I pass a feed that contains content wrapped in CDATA tags, the normalized feed entry contains an empty description.

Sample feeds:

For now I use a dirty workaround using getExtraEntryFields and some custom code to process HTML:

getExtraEntryFields: (feedEntry) => {
  const cdataDescription = feedEntry.description.includes("<![CDATA[")
    ? stripAndTruncateHTML(
        feedEntry.description
          .replaceAll("<![CDATA[", "")
          .replaceAll("]]>", ""),
        siteConfig.maxPostLength
      )
    : "";

  return { cdataDescription };
}

Also - do you have a donation link or something? I'd love to buy you a coffee because this project ROCKS. ❤️

missing optionals entry fields ?

Hi again :)

Below is a raw feed entry:

<entry>
    <author>
        <name>/u/0xdea</name>
        <uri>https://www.reddit.com/user/0xdea</uri>
    </author>
    <category term="netsec" label="r/netsec"/>
    <content type="html">&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/0xdea&quot;&gt; /u/0xdea &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://security.humanativaspa.it/automating-binary-vulnerability-discovery-with-ghidra-and-semgrep/&quot;&gt;[link]&lt;/a&gt;&lt;/span&gt; &amp;#32; &lt;span&gt;&lt;a href=&quot;https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;</content>
    <id>t3_vtcsdv</id>
    <link href="https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/" />
    <updated>2022-07-07T07:27:52+00:00</updated>
    <published>2022-07-07T07:27:52+00:00</published>
    <title>Automating binary vulnerability discovery with Ghidra and Semgrep</title>
</entry>

Below are the attributes returned by feed-reader; some fields are missing:

{
  title: 'Automating binary vulnerability discovery with Ghidra and Semgrep',
  link: 'https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/',
  description: 'submitted by /u/0xdea [link] [comments]',
  published: '2022-07-07T07:27:52.000Z',
}

We should expect something like this:

{
  id:'t3_vtcsdv',
  author: {
    name:'/u/0xdea',
    uri:'https://www.reddit.com/user/0xdea'
  },
  category: {
      term:'netsec',
      label:'r/netsec'
  },
  content:{
      type: 'html',
      rawValue:'&amp;#32; submitted by &amp;#32; &lt;a href=&quot;https://www.reddit.com/user/0xdea&quot;&gt; /u/0xdea &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href=&quot;https://security.humanativaspa.it/automating-binary-vulnerability-discovery-with-ghidra-and-semgrep/&quot;&gt[link]&lt;/a&gt;&lt;/span&gt;&amp;#32;&lt;span&gt;&lt;ahref=&quot;https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/&quot;&gt;[comments]&lt;/a&gt;&lt;/span&gt;'
  },
  title: 'Automating binary vulnerability discovery with Ghidra and Semgrep',
  link: 'https://www.reddit.com/r/netsec/comments/vtcsdv/automating_binary_vulnerability_discovery_with/',
  description: 'submitted by /u/0xdea [link] [comments]',
  published: '2022-07-07T07:27:52.000Z',
  updated: '2022-07-07T07:27:52.000Z',
}

see #36
see #13

So, before I start coding on my side, I'd like to know why you didn't implement all the fields. Missed opportunity? Lack of time? Or a deliberate choice for good reasons?

Your module could be a good one, because many of the others use the "request" module, which has been deprecated for a long time now. Good opportunity. But if we cannot access all the other fields, your module will stay invisible.

What do you think? Thank you!

Add options to get specifics fields

Hi,

Your library is really simple and great, but I have a problem: I want to get the guid field of RSS items, but your library doesn't return it.

Would it be possible to add an option to the parser to set the fields to return?

Thanks

Some items being ignored due to hardcoded limits

Can I ask what the purpose behind the length checks at the beginning of normalize() are? Specifically this:

  if (!link || !title ||
    !isString(link) || !isString(title) ||
    link.length < 10 || title.length < 10) { // <-- these length checks
    return false;
  }

It's taken me ages to track down why some items in a feed are returning as undefined, and it's because the title was short.

Add content:encoded to FeedEntry

Thanks for a great tool. So far I've been using feed-extractor to get feed items, then passing each item's link to article-extractor to get the full article. However, I've noticed that most of my feeds already include the full text of the article under the content:encoded tag. Is there a way to get this data with feed-extractor, so I wouldn't need a second call to article-extractor? It would be cool if encoded were added as a property on FeedEntry, so that when it exists we have access to it after parsing the feed. Is there a better way to do this?
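Until (or unless) this lands on FeedEntry, one possible workaround is getExtraEntryFields, which receives the raw item object. The 'content:encoded' key below assumes the underlying XML parser keeps the namespaced tag name as-is, which may not hold for every configuration.

```javascript
const options = {
  getExtraEntryFields: (entryData) => {
    // Assumes the parser exposes the namespaced tag under its literal name.
    const encoded = entryData['content:encoded']
    return encoded ? { content: encoded } : {}
  }
}
```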

fast-xml-parser regex vulnerability patch could be improved from a safety perspective

Summary

This is a comment on GHSA-6w63-h3fj-q4vw and the patches fixing it.

ref GHSA-gpv5-7x3g-ghjv

Details

The code which validates a name calls the validator:
https://github.com/NaturalIntelligence/fast-xml-parser/blob/ecf6016f9b48aec1a921e673158be0773d07283e/src/xmlparser/DocTypeReader.js#L145-L153
This checks for the presence of an invalid character. Such an approach is always risky, as it is so easy to forget to include an invalid character in the list. A safer approach is to validate entity names against the XML specification: https://www.w3.org/TR/xml11/#sec-common-syn - an ENTITY name is a Name:

[4]   NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                        [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
                        [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]  NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name ::= NameStartChar (NameChar)*

so the safest way to validate an entity name is to build a regex to represent this expression and check whether the name given matches the regex. (Something along the lines of /^[name start char class][name char class]*$/.) There's probably a nice way to simplify the explicit list rather than typing it out verbatim using Unicode character properties, but I don't know enough to do so.
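For illustration, a sketch of such a regex in JavaScript, transcribing the ranges above into unicode-mode character classes. The transcription is manual, so it should be checked against the spec before use.

```javascript
// Character classes transcribed from the XML 1.1 Name production quoted above.
const nameStartChar =
  ':A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02FF' +
  '\\u0370-\\u037D\\u037F-\\u1FFF\\u200C-\\u200D\\u2070-\\u218F' +
  '\\u2C00-\\u2FEF\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD' +
  '\\u{10000}-\\u{EFFFF}'
const nameChar = nameStartChar + '\\-.0-9\\u00B7\\u0300-\\u036F\\u203F-\\u2040'

// Name ::= NameStartChar (NameChar)*
const xmlName = new RegExp(`^[${nameStartChar}][${nameChar}]*$`, 'u')
```

An entity name would then be accepted only if xmlName.test(name) is true, instead of rejecting a hand-maintained list of known-bad characters.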

Minor regression in v7.0.3

Hi!

After fixing #105 I noticed a small regression among RSS feeds that serve both content:encoded and description in their items.

content:encoded is used first (even though human-friendly description is available) and it results in a pile of HTML/CSS code being served.

Feed that shows this problem: https://turystyka-niecodzienna.pl/rss

I suspect it may be possible to fix by switching the order from content || description into description || content (and perhaps htmlContent || description into description || htmlContent?) in 3e1d612#diff-79bdb3bf907b1dc8f0ca3b16390b8e93716d86d536837a4cbda4d9b0b2b19ee7

better axios error handler

Hi @ndaidong,

Thank you for your work.

I've forked your project; I'd like to improve error handling. Currently you return null on every axios error.

What do you think about that?

Making a PR right now

Thank you.
