
Comments (5)

daveajones commented on July 20, 2024

For more clarity on the duplicates issue, we require a "write" enabled developer key (must be approved) to submit new feeds to the index. Quite a few networks and platforms are auto-submitting their shows to us now, so the index is very clean. We aren't going to make this a free-for-all where someone can go rogue and just dump 5000 clone feeds in there. Nobody wants that. I've just finished up a roll-back feature where every addition gets attributed to a key and can be rolled back in batches if necessary. We want things very clean and we're moving slow to make sure that happens.
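As a rough illustration of what key-attributed submissions with batch rollback could look like, here is a minimal Python/SQLite sketch. The schema and function names are hypothetical; the comment above doesn't describe the actual implementation.

```python
# A minimal sketch of key-attributed feed submissions with batch rollback,
# under an assumed schema; not the Podcast Index's actual code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feeds (
        id INTEGER PRIMARY KEY,
        url TEXT UNIQUE,            -- one row per feed URL
        submitted_by TEXT NOT NULL, -- developer key that added the feed
        batch_id TEXT NOT NULL      -- submission batch, for rollback
    )
""")

def rollback_batch(batch_id):
    # Remove every feed a single submission batch added.
    conn.execute("DELETE FROM feeds WHERE batch_id = ?", (batch_id,))

def rollback_key(key):
    # Broader option: remove everything a rogue key ever submitted.
    conn.execute("DELETE FROM feeds WHERE submitted_by = ?", (key,))
```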


adamc199 commented on July 20, 2024

I'll leave the technical explanations to Dave, but we have 10 years of RSS aggregation experience and have written many exception handlers and the like to combat this.


commented on July 20, 2024

I don't see duplicates being a really big issue; they just have to be managed. There are several ways to do it, and of course you're never going to catch them all.

Regarding Broken Feeds: I'm not sure if Dave shared this with everyone, but I believe broken feeds will be removed after a specific number of tries over a period of time.

Regarding Duplicates: The goal of the whole project isn't to become a gatekeeper of the podcasting world; in fact, it's just the opposite. Duplicate feeds can arise from many factors. For example, when iTunes cracked down on keyword stuffing, many podcasts were removed and then re-added. Some directories never removed the deleted feeds, which created a large number of duplicates.

Then there are intentional duplicates, where the entire feed is pretty much the same but the feed URL is different. These are fairly easy to identify from the episode titles, release dates, and episode lengths. The easiest way to detect them would be to set the feed URL as a unique key. That's not a perfect solution, and of course people can add query strings to circumvent the system.

Finally, there are what I call partial duplicates. I learned this weekend that there are just about 96,000 duplicated episodes all related to Dungeons and Dragons. I suspect these are fans creating custom feeds based on favorite game action. They don't duplicate the whole feed, just selected episodes. Should they be removed? I think not. But to answer your question: yes, duplicate and dead feeds will be addressed.
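As a rough sketch of the checks described above, here is what a feed-URL unique key plus an episode-based fingerprint comparison might look like in Python. The data shapes (title, publish date, duration) and the threshold are assumptions for illustration, not the index's actual schema or code.

```python
# Hypothetical duplicate checks: URL normalization for a unique key, and
# an episode fingerprint for spotting clones with different URLs.
from urllib.parse import urlsplit

def normalize_feed_url(url):
    # Dropping the query string defeats the simplest circumvention of a
    # unique key on the feed URL, at the risk of merging feeds that
    # legitimately differ only by query parameters.
    parts = urlsplit(url.strip().lower())
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

def episode_fingerprint(episodes):
    # Fingerprint a feed by its episodes' titles, release dates, and lengths.
    return frozenset(
        (ep["title"].strip().lower(), ep["pub_date"], ep["duration_secs"])
        for ep in episodes
    )

def looks_like_full_duplicate(eps_a, eps_b, threshold=0.9):
    fa, fb = episode_fingerprint(eps_a), episode_fingerprint(eps_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb) / min(len(fa), len(fb))
    # Full clones score near 1.0; curated "partial duplicates" (like the
    # Dungeons and Dragons fan feeds) score lower and can be left alone.
    return overlap >= threshold
```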


daveajones commented on July 20, 2024

Thanks for addressing this, Mike. Correct on all fronts.

Broken feeds: Each feed has an "errors" counter that is incremented each time the feed is pulled (downloaded) and an error occurs, weighted by the severity of the error; there is also a "parse_errors" counter serving the same purpose on the parser side. The worse the error, the faster the counter climbs: most errors increment it by 1, ENOTFOUND and ECONNREFUSED increment it by 10, 4xx HTTP statuses by 4, and 5xx statuses by 5. When the puller error count tops 100, the feed gets marked as "dead", the aggregators stop pulling it regularly, and it gets relegated to a single best-effort "error" aggregator. If that best-effort aggregator ever brings a feed back from the dead, all the counters are reset to zero.
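The counter logic described above maps naturally to a small state machine. Below is a minimal Python sketch of it; the class and field names are hypothetical, since the actual aggregator code isn't shown here.

```python
# Severity-weighted error counter, per the weights in the comment above:
# most errors +1, ENOTFOUND/ECONNREFUSED +10, 4xx +4, 5xx +5.
DEAD_THRESHOLD = 100

def error_weight(status=None, errno_name=None):
    if errno_name in ("ENOTFOUND", "ECONNREFUSED"):
        return 10
    if status is not None:
        if 400 <= status < 500:
            return 4
        if 500 <= status < 600:
            return 5
    return 1

class Feed:
    def __init__(self, url):
        self.url = url
        self.errors = 0        # puller-side counter
        self.parse_errors = 0  # parser-side counter
        self.dead = False

    def record_pull_error(self, status=None, errno_name=None):
        self.errors += error_weight(status, errno_name)
        if self.errors > DEAD_THRESHOLD:
            # Relegated to the single best-effort "error" aggregator.
            self.dead = True

    def revive(self):
        # A successful best-effort pull resets everything to zero.
        self.errors = 0
        self.parse_errors = 0
        self.dead = False
```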

Duplicates: I haven't worried too much about this so far. As Mike says, they should be fairly easy to spot just by doing comparisons. We'll have a script at some point that sweeps across and checks for the obvious ones. I'm also about to create a new API endpoint listing feeds recently added to the index; that'll be a good firehose for checking for shenanigans too. I'm open to any and all bright ideas on this front.


ByteHamster commented on July 20, 2024

Thank you very much for all the replies.

"We want things very clean and we're moving slow to make sure that happens."

This is what I hoped to hear. Duplicates like unofficial mirrors or old feeds without a redirect (that still return a valid feed) can make the search function pretty much unusable for average users, at least in my experience with the gpodder.net search feature.

