
TalkSearch scraper

This scraper is a command-line tool that extracts information from YouTube playlists and pushes it to Algolia.

Usage

yarn index {config_name}

How it works

The ./configs/ folder contains custom configs, each defining a list of playlists to index.

The command will use the YouTube API to fetch data about the defined playlists and push them to Algolia.

Captions are extracted from the videos when available. Each record in Algolia represents one caption and also contains .video, .playlist and .channel keys. Algolia's distinct feature is used to group records of the same video together, so that only the most relevant caption of each video is displayed.

Each channel gets its own index, named {channel_name}_{channel_id}. All videos of all playlists are saved in this index, but they can be filtered through the channel.id and playlist.id keys of the records.
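To make the record layout concrete, here is a hypothetical sketch of one caption record and the corresponding Algolia search parameters. The field names come from the README above; the actual values (IDs, titles) are invented for illustration.

```javascript
// Hypothetical shape of one caption record (one record per caption line).
const record = {
  caption: { content: 'Welcome to the conference', start: 12, duration: 4 },
  video: { id: 'dQw4w9WgXcQ', title: 'Opening keynote' },
  playlist: { id: 'PL12345', title: '2018 talks' },
  channel: { id: 'UC12345', name: 'ExampleConf' },
};

// Filtering one playlist inside the shared per-channel index would use
// an Algolia filter string on the keys mentioned above:
const searchParams = {
  filters: `playlist.id:${record.playlist.id}`,
  distinct: true, // keep only the best-matching caption of each video
};

console.log(searchParams.filters); // → 'playlist.id:PL12345'
```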

Development

Start with yarn install to load all the dependencies.

The project needs the following environment variables to connect to the external services:

  • ALGOLIA_APP_ID and ALGOLIA_API_KEY for pushing records to Algolia
  • YOUTUBE_API_KEY to connect to the YouTube API
  • GOOGLE_APPLICATION_CREDENTIALS, pointing to the path of your google.service-account-file.json (create one here)

We suggest using a tool like direnv to load those variables through a .envrc file.
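A minimal .envrc could look like the following; all values are placeholders to replace with your own credentials.

```shell
# .envrc — loaded automatically by direnv when entering the project directory
export ALGOLIA_APP_ID=YourAppID
export ALGOLIA_API_KEY=YourAdminApiKey
export YOUTUBE_API_KEY=YourYouTubeKey
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/google.service-account-file.json
```

Remember to run `direnv allow` after creating or editing the file.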

Once everything is installed, you can run yarn index {config_name}.

Debug calls

yarn run index:cache

This will read data from a disk cache of previous requests instead of making actual HTTP calls. If there is no cache hit for a request, the real HTTP call is made.

This should be the preferred way of running the command for debugging purposes.

yarn run index:logs

This will log the raw responses of all HTTP calls to disk. This is useful when debugging, as it allows digging into the responses of the APIs called.


talksearch-scraper's Issues

Move indexing functionality to a background task

Whether on Heroku or elsewhere, indexing any channel with 30+ videos on it is a task that will run for too long to be hanging off a response being returned to the web view. The indexing functionality should be moved to a background job and enqueued on submission.

Indexing tags

YouTube uploaders can add their own tags to the videos. As those are user-generated, they cannot really be trusted and we should not display them in our demos.

But if the owners put correct data into them, users should still be able to filter on them. For example, a JavaScript conference may want to filter talks about React or VueJS.

Those tags should be added to the records and marked as attributesForFaceting.
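A sketch of the index settings this would require. Only the idea of faceting on tags comes from this issue; the attribute names besides video.tags are the filtering keys mentioned earlier in the README, and the helper function is purely illustrative.

```javascript
// Build the settings object that would expose user-generated tags as
// facets so demos can filter on them without displaying them as text.
function facetSettings(extraFacets = []) {
  return {
    attributesForFaceting: [
      'video.tags',   // user-generated tags, filterable but not displayed
      'playlist.id',  // existing filtering keys from the README
      'channel.id',
      ...extraFacets,
    ],
  };
}
// These settings would then be pushed with the Algolia client's setSettings().
```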

Caption language is not correct when several languages and manual input

For videos that have transcripts available in several languages, all manually entered, the extractor gets confused about which one to use.

https://www.youtube.com/watch?v=byva0hOj8CU&t=3633s for example is extracted with its German subtitles.

The check should pick captions in this order of priority:

  • Manual caption in the language of the video
  • Automatic caption in the language of the video
  • First caption in the language of the video
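The priority order above can be sketched as a small selection function. Caption objects are assumed to carry a language code and an isAutomatic flag; those field names are illustrative, not the project's actual ones.

```javascript
// Pick the best caption track following the priority list above.
function pickCaption(captions, videoLanguage) {
  const inLanguage = captions.filter(c => c.language === videoLanguage);
  return (
    inLanguage.find(c => !c.isAutomatic) || // 1. manual, video language
    inLanguage.find(c => c.isAutomatic) ||  // 2. automatic, video language
    inLanguage[0] ||                        // 3. first in the video language
    null
  );
}
```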

Broaden the search surface

This can be found with the query recognize who was, but not with recognize human.


The current implementation creates one record per line of transcript. As those keywords are on two different lines, they are not found. The solution might be to index more than one line in each record (line, line +1 and line -1), to broaden the search surface and fix this issue.

We would still need to create one record per line (not one record per 3 lines), to keep the correct time to jump to.
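The proposed widening could look like this: each record keeps its own line's timestamp as the jump target, but indexes the previous and next lines' text as well. Field names are assumptions for illustration.

```javascript
// One record per caption line, but each record's searchable content
// spans the previous line, the line itself, and the next line.
function widenCaptions(lines) {
  return lines.map((line, i) => ({
    start: line.start, // jump target stays the current line's timestamp
    content: [lines[i - 1], line, lines[i + 1]]
      .filter(Boolean) // drop out-of-range neighbors at the edges
      .map(l => l.text)
      .join(' '),
  }));
}
```

A query whose words straddle two adjacent lines then matches the overlapping record.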

Indexing captions in various languages

If a video has captions in several languages, we should be able to index them all. This would add a caption.language field to each caption that could then be used for filtering.

Allowing fine-grain inclusion/exclusion of content

Today, all videos of the specified playlists are added. Conference organizers might want finer-grained control over what to include or exclude.

The current v2 branch uses a config file for each conference. This file contains an array of playlistId, but we could also imagine a blockList of videoIds to exclude.
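A hypothetical shape for such a config, extended with the suggested blockList (all IDs are placeholders, and the exact config format is an assumption):

```javascript
// Per-conference config: playlists to index, plus videos to always skip.
const config = {
  playlists: ['PLexampleAAAA', 'PLexampleBBBB'], // playlistId array, as in v2
  blockList: ['dQw4w9WgXcQ'], // excluded even if present in a playlist
};

// The indexing step would then drop any blocked video:
const keepVideo = video => !config.blockList.includes(video.id);
```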

When using a single custom index, all YouTube URLs used must be stored

Currently, one YouTube URL is stored in the METADATA index for each new channel/playlist/video added. This is fine when a 1:1 relationship exists, but custom indexes allow several playlists to live in a single index. In that case only the last channel/playlist indexed is stored, which will cause issues if there is ever a reindex.

What to do?

Check METADATA for the existence of the custom index; if it exists, push the new channel into an array under the youtubeURL attribute.

The reindexing method must also allow for this change, and loop through all channels when indexing back into the custom index.
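A pure-function sketch of the merge step described above. The record shape and the migration of old single-URL records are assumptions; the actual read/write against the METADATA index would happen around this function.

```javascript
// Merge a channel's YouTube URL into an existing METADATA record,
// turning the youtubeURL attribute into an array instead of overwriting it.
function mergeMetadata(existing, youtubeURL) {
  if (!existing) {
    return { youtubeURL: [youtubeURL] }; // first channel for this index
  }
  const urls = Array.isArray(existing.youtubeURL)
    ? existing.youtubeURL
    : [existing.youtubeURL]; // migrate old single-URL records
  if (!urls.includes(youtubeURL)) {
    urls.push(youtubeURL);
  }
  return { ...existing, youtubeURL: urls };
}
```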

Unable to extract speaker name when language is Russian

When trying to index the OdessaJS conference, the Google NLP API chokes with the following error:

{ Error: 3 INVALID_ARGUMENT: The language ru is not supported for entity analysis.
    at Object.exports.createStatusError (/home/tim/local/www/algolia/talksearch/talksearch-scraper/node_modules/grpc/src/common.js:87:15)
    at Object.onReceiveStatus (/home/tim/local/www/algolia/talksearch/talksearch-scraper/node_modules/grpc/src/client_interceptors.js:1188:28)
    at InterceptingListener._callNext (/home/tim/local/www/algolia/talksearch/talksearch-scraper/node_modules/grpc/src/client_interceptors.js:564:42)
    at InterceptingListener.onReceiveStatus (/home/tim/local/www/algolia/talksearch/talksearch-scraper/node_modules/grpc/src/client_interceptors.js:614:8)
    at callback (/home/tim/local/www/algolia/talksearch/talksearch-scraper/node_modules/grpc/src/client_interceptors.js:841:24)
  code: 3,
  metadata: Metadata { _internal_repr: {} },
  details: 'The language ru is not supported for entity analysis.',
  note: 'Exception occurred in retry method that was not classified as transient' }

We might want to either:

  • handle this gracefully by skipping it
  • have a whitelist of languages that can have NLP
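The whitelist option could be a simple guard before calling entity analysis. The language list below is illustrative only, not Google's actual NLP coverage; check the Natural Language API documentation for the real list.

```javascript
// Languages assumed (for illustration) to support entity analysis.
const NLP_LANGUAGES = ['en', 'es', 'fr', 'de', 'it', 'ja', 'ko', 'pt', 'zh'];

// Only run entity analysis when the language is known to be supported,
// avoiding the INVALID_ARGUMENT error seen with 'ru'.
function canRunEntityAnalysis(language) {
  return NLP_LANGUAGES.includes(language);
}
```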

Recognise failed videos as videos with no subtitles

The array of failed IDs that can appear during indexing occurs because no subtitles are available for those videos in the specified language. When this happens, we should record a new noSubs attribute set to true, so the front end can display an icon that lets people differentiate such videos.

Attempt to split multiple speakers into an array

Example: SaaStr runs panels at their conference that include more than one person speaking on the same topic. While the title is the same, the speakers field is a list of one or more people, which doesn't look good as a search facet.

What to do?

Check the speakers string for common separators such as , or & and split on them into an array.
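The separator-based split could be sketched as below; the separator list (comma, ampersand, the word "and") is a guess at common panel-title formats, not an exhaustive set.

```javascript
// Split a raw speakers string into an array of individual names.
function splitSpeakers(raw) {
  return raw
    .split(/\s*(?:,|&|\band\b)\s*/i) // split on ',', '&', or the word 'and'
    .map(s => s.trim())
    .filter(s => s.length > 0); // drop empty fragments from trailing separators
}
```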
