
archive_news_cc's Introduction

Closed Captions of News Videos from Archive.org

This repository provides scripts for downloading the data and links to the datasets that were built using the scripts.


Downloading the Data from Archive.org

Download closed caption transcripts of nearly 1.3M news shows from http://archive.org.

There are three steps to downloading the transcripts:

  1. We start by searching https://archive.org/advancedsearch.php with the query collection:"tvarchive". This gets us a unique identifier for each news show. An identifier is a simple string that combines channel_name, show_name, time, and date. The current final list of identifiers (2009 to Nov. 2017) is posted here.

  2. Next, we use the identifier to build the URL where the metadata file and the HTML file with the closed captions are posted. The general base URL is http://archive.org/download followed by the identifier.

  3. The third script parses the downloaded metadata and HTML closed caption files and creates a CSV that holds the caption text along with the metadata.
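The search in step 1 can be sketched as a URL builder. This is a minimal illustration, not the code in get_news_identifiers.py: the function name build_search_url is hypothetical, and the query-string parameters (q, fl[], rows, page, output) are our assumption about how the advanced search endpoint is called. No request is sent; the sketch only constructs the URL.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_search_url(page=1, rows=1000):
    """Return a search URL for one page of tvarchive identifiers.

    Hypothetical helper: the parameter names are assumed, not taken
    from the repository's scripts.
    """
    params = [
        ("q", 'collection:"tvarchive"'),  # the collection query from step 1
        ("fl[]", "identifier"),           # ask only for the identifier field
        ("rows", rows),                   # results per page
        ("page", page),                   # 1-based page number
        ("output", "json"),               # response format
    ]
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url(page=1, rows=50))
```

Paging through the results by increasing page until no rows come back would yield the full identifier list.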

For instance, for the identifier CSPAN_20090604_230000, we go to http://archive.org/download/CSPAN_20090604_230000. From http://archive.org/download/CSPAN_20090604_230000/CSPAN_20090604_230000_meta.xml, we read the link http://archive.org/details/CSPAN_20090604_230000, from which we get the closed caption text in the HTML file. We also store the metadata from the meta XML file.
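The URL pattern in this example can be written as three small helpers. The function names are hypothetical and are not taken from scrape_archive_org.py; the URL shapes are the ones shown above.

```python
BASE_URL = "http://archive.org/download"
DETAILS_URL = "http://archive.org/details"


def download_url(identifier):
    """Folder that holds the files for one show."""
    return f"{BASE_URL}/{identifier}"


def meta_url(identifier):
    """Metadata XML file for one show."""
    return f"{BASE_URL}/{identifier}/{identifier}_meta.xml"


def details_url(identifier):
    """HTML page that carries the closed caption text."""
    return f"{DETAILS_URL}/{identifier}"


for build in (download_url, meta_url, details_url):
    print(build("CSPAN_20090604_230000"))
```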

Scripts

  1. Get Show Identifiers

  2. Download Metadata and HTML Files

    • Download the Metadata and HTML Files
    • Saves the metadata and HTML files to two separate folders, specified with --meta and --html. The default folder names are meta and html, respectively.
  3. Parse Metadata and HTML Files
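As a rough sketch of the metadata half of the parsing step, the snippet below flattens a meta XML document into a dictionary that could become one CSV row. The sample XML is invented for illustration, its field names are assumptions about what a _meta.xml file contains, and parse_meta is a hypothetical helper rather than the logic in parse_archive.py.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real _meta.xml file may carry different fields.
SAMPLE_META = """<metadata>
  <identifier>CSPAN_20090604_230000</identifier>
  <collection>tvarchive</collection>
  <subject>example topic one</subject>
  <subject>example topic two</subject>
</metadata>"""


def parse_meta(xml_text):
    """Map each child tag of the root element to its text.

    Tags that repeat are joined with ';' so each tag fills one column.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        value = (child.text or "").strip()
        fields.setdefault(child.tag, []).append(value)
    return {tag: ";".join(values) for tag, values in fields.items()}


print(parse_meta(SAMPLE_META))
```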

Running the Scripts

  1. Get all TV Archive identifiers from archive.org.

    python get_news_identifiers.py -o ../data/search.csv
    
  2. Download metadata and HTML files for all the shows in the sample input file

    python scrape_archive_org.py ../data/search-test.csv
    

    This creates two directories, meta and html, in the same folder as the script by default. We have included the first 25 metadata files and the first 25 HTML files.

    You can change the folder for meta by using the --meta flag. To change the directory for html, use the --html flag and specify the new directory. For instance,

    python scrape_archive_org.py --meta meta-foxnews --html html-foxnews ../data/search-test.csv
    

    Use the -c/--compress option to store and parse the downloaded files in compressed (gzip) format.

  3. Parse and extract meta fields and text from sample metadata and HTML files.

    python parse_archive.py ../data/search-test.csv
    

    A sample output file.
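To illustrate what storing and parsing files in gzip format involves, here is a minimal sketch using Python's standard gzip module. The helper names and the choice to append .gz to the file name are assumptions made for this example; the scripts may name and handle compressed files differently.

```python
import gzip
from pathlib import Path


def save_text(path, text, compress=False):
    """Write text to path, gzip-compressed if requested; return the final path."""
    path = Path(path)
    if compress:
        # Assumed convention: append .gz to the original file name.
        path = path.with_name(path.name + ".gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(text)
    else:
        path.write_text(text, encoding="utf-8")
    return path


def load_text(path):
    """Read text back, decompressing when the file name ends in .gz."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return f.read()
```

A parser built this way can read compressed and uncompressed downloads through the same load_text call.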

Data

The data are hosted on Harvard Dataverse.

Dataset Summary:

  1. 500k Dataset from 2014:

    • CSV: archive-cc-2014.csv.xza* (2.7 GB, split into 2GB files)
    • HTML: html-2014.7za* (10.4 GB, split into 2GB files)
  2. 860k Dataset from 2017:

    • CSV: archive-cc-2017.csv.gza* (10.6 GB, split into 2GB files)
    • HTML: html-2017.tar.gza* (20.2 GB, split into 2GB files)
    • Meta: meta-2017.tar.gza* (2.6 GB, split into 2GB files)
  3. 917k Dataset from 2022:

    • CSV: archive-cc-2022.csv.gza* (12.6 GB, split into 2GB files)
    • HTML: html-2022.tar.gza* (41.1 GB, split into 2GB files)
    • Meta: meta-2022.tar.gz (2.1 GB)
  4. 179k Dataset from 2023:

    • CSV: archive-cc-2023.csv.gz (1.7 GB)
    • HTML: html-2023.tar.gza* (7.3 GB, split into 2GB files)
    • Meta: meta-2023.tar.gz (317 MB)

Please note that the file sizes and splitting information mentioned above are approximate.
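The files listed with a trailing a* are split into parts. Assuming the parts are plain byte-wise chunks whose suffixes sort in order (aa, ab, and so on, as the Unix split tool produces), they can be joined by concatenating them in sorted order. The helper below is a hypothetical sketch under that assumption, not a script from this repository.

```python
import shutil
from pathlib import Path


def join_parts(directory, prefix, output):
    """Concatenate every file in directory whose name starts with prefix.

    Parts are joined in sorted name order and written to output.
    Returns the list of part paths that were joined.
    """
    parts = sorted(Path(directory).glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream, so large parts fit in memory
    return parts
```

For example, join_parts(".", "archive-cc-2017.csv.gza", "archive-cc-2017.csv.gz") would rebuild a single gzip file from the downloaded parts in the current directory, if the parts follow the assumed naming.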

License

We are releasing the scripts under the MIT License.

Suggested Citation

Please credit Internet Archive for the data.

If you want to refer to this particular corpus so that your research is reproducible, you can cite it as:

archive.org TV News Closed Caption Corpus. Laohaprapanon, Suriyan and Gaurav Sood. 2017. https://github.com/notnews/archive_news_cc/     


