
Comments (5)

daveajones commented on July 20, 2024

For more clarity on the duplicates issue, we require a "write" enabled developer key (must be approved) to submit new feeds to the index. Quite a few networks and platforms are auto-submitting their shows to us now, so the index is very clean. We aren't going to make this a free-for-all where someone can go rogue and just dump 5000 clone feeds in there. Nobody wants that. I've just finished up a roll-back feature where every addition gets attributed to a key and can be rolled back in batches if necessary. We want things very clean and we're moving slow to make sure that happens.
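As a rough illustration of what key-attributed submissions with batch rollback could look like, here is a minimal Python/SQLite sketch. The schema and function names are hypothetical; the comment above doesn't describe the actual implementation.

```python
# A minimal sketch of key-attributed feed submissions with batch rollback,
# under an assumed schema; not the Podcast Index's actual code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feeds (
        id INTEGER PRIMARY KEY,
        url TEXT UNIQUE,            -- one row per feed URL
        submitted_by TEXT NOT NULL, -- developer key that added the feed
        batch_id TEXT NOT NULL      -- submission batch, for rollback
    )
""")

def rollback_batch(batch_id):
    # Remove every feed a single submission batch added.
    conn.execute("DELETE FROM feeds WHERE batch_id = ?", (batch_id,))

def rollback_key(key):
    # Broader option: remove everything a rogue key ever submitted.
    conn.execute("DELETE FROM feeds WHERE submitted_by = ?", (key,))
```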


adamc199 commented on July 20, 2024

I'll leave the technical explanations to Dave, but we have 10 years of RSS aggregation experience and have written many exception handlers and the like to combat this.


commented on July 20, 2024

I don't see duplicates being a really big issue; they just have to be managed. There are several ways to do it, and of course you're never going to catch them all.

Regarding Broken Feeds: I'm not sure if Dave shared this with everyone, but I believe broken feeds will be removed after a specific number of tries over a period of time.

Regarding Duplicates: The goal of the whole project isn't to become a gatekeeper of the podcasting world; in fact, it's just the opposite. Duplicate feeds can arise from many factors. For example, when iTunes cracked down on keyword stuffing, many podcasts were removed and then re-added. Some directories never removed the deleted feeds, which created a large number of duplicates.

Then there are intentional duplicates, where the entire feed is pretty much the same but the feed URL is different. These are fairly easy to identify from the episode titles, release dates, and episode lengths. The easiest way to detect them would be to set the feed URL as a unique key. That's not a perfect solution, and of course people can add query strings to circumvent the system.

Finally, there are what I call partial duplicates. I learned this weekend that there are just about 96,000 duplicated episodes all related to Dungeons and Dragons. I suspect these are fans creating custom feeds based on favorite game action. They don't duplicate the whole feed, just selected episodes. Should they be removed? I think not. But to answer your question: yes, duplicate and dead feeds will be addressed.
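As a rough sketch of the checks described above, here is what a feed-URL unique key plus an episode-based fingerprint comparison might look like in Python. The data shapes (title, publish date, duration) and the threshold are assumptions for illustration, not the index's actual schema or code.

```python
# Hypothetical duplicate checks: URL normalization for a unique key, and
# an episode fingerprint for spotting clones with different URLs.
from urllib.parse import urlsplit

def normalize_feed_url(url):
    # Dropping the query string defeats the simplest circumvention of a
    # unique key on the feed URL, at the risk of merging feeds that
    # legitimately differ only by query parameters.
    parts = urlsplit(url.strip().lower())
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

def episode_fingerprint(episodes):
    # Fingerprint a feed by its episodes' titles, release dates, and lengths.
    return frozenset(
        (ep["title"].strip().lower(), ep["pub_date"], ep["duration_secs"])
        for ep in episodes
    )

def looks_like_full_duplicate(eps_a, eps_b, threshold=0.9):
    fa, fb = episode_fingerprint(eps_a), episode_fingerprint(eps_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb) / min(len(fa), len(fb))
    # Full clones score near 1.0; curated "partial duplicates" (like the
    # Dungeons and Dragons fan feeds) score lower and can be left alone.
    return overlap >= threshold
```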


daveajones commented on July 20, 2024

Thanks for addressing this, Mike. Correct on all fronts.

Broken feeds: Each feed has an "errors" counter that is incremented each time the feed is pulled (downloaded) and an error occurs, weighted by the severity of the error; there is also a "parse_errors" counter serving the same purpose on the parser side. The worse the error, the faster the counter climbs: most errors increment it by 1, ENOTFOUND and ECONNREFUSED increment it by 10, 4xx HTTP statuses by 4, and 5xx statuses by 5. When the puller error count tops 100, the feed gets marked as "dead", the aggregators stop pulling it regularly, and it gets relegated to a single best-effort "error" aggregator. If that best-effort aggregator ever brings a feed back from the dead, all the counters are reset to zero.
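The counter logic described above maps naturally to a small state machine. Below is a minimal Python sketch of it; the class and field names are hypothetical, since the actual aggregator code isn't shown here.

```python
# Severity-weighted error counter, per the weights in the comment above:
# most errors +1, ENOTFOUND/ECONNREFUSED +10, 4xx +4, 5xx +5.
DEAD_THRESHOLD = 100

def error_weight(status=None, errno_name=None):
    if errno_name in ("ENOTFOUND", "ECONNREFUSED"):
        return 10
    if status is not None:
        if 400 <= status < 500:
            return 4
        if 500 <= status < 600:
            return 5
    return 1

class Feed:
    def __init__(self, url):
        self.url = url
        self.errors = 0        # puller-side counter
        self.parse_errors = 0  # parser-side counter
        self.dead = False

    def record_pull_error(self, status=None, errno_name=None):
        self.errors += error_weight(status, errno_name)
        if self.errors > DEAD_THRESHOLD:
            # Relegated to the single best-effort "error" aggregator.
            self.dead = True

    def revive(self):
        # A successful best-effort pull resets everything to zero.
        self.errors = 0
        self.parse_errors = 0
        self.dead = False
```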

Duplicates: I haven't worried too much about this so far. As Mike says, they should be fairly easy to spot just by doing comparisons. We'll have a script at some point that sweeps across and checks for the obvious ones. I'm also about to create a new API endpoint listing feeds recently added to the index; that'll be a good firehose for checking for shenanigans too. I'm open to any and all bright ideas on this front.


ByteHamster commented on July 20, 2024

Thank you very much for all the replies.

"We want things very clean and we're moving slow to make sure that happens."

This is what I hoped to hear. Duplicates like unofficial mirrors or old feeds without a redirect (that still return a valid feed) can make the search function pretty much unusable for average users, at least in my experience with the gpodder.net search feature.

