
cs325_project4's Introduction


News

CBS News Article Scraper

This web scraper takes ten URLs from a .txt file and scrapes each page for the relevant information, ignoring advertisements, images, and links to other articles. It then writes the extracted text to separate output files (story1.txt, story2.txt, and so on).



About The Project

The file "main.py" has the code to read in links to ten CBS new articles and convert them to raw text- no ads, images, or links to other articles. It utilizes beautifulsoup, a python package, and requests, a python library, to simplify the scraping process. Main calls a few different functions- one to read in links, to scrape the text, then to write to a new file. Each of these (along with main) include exception handling such as link not working or being unretrievable.

The file "url.txt" contains a list of CBS World articles, each article is separated by a new line. You substitute any article for another CBS one, however anything exceeding ten will not be read.

The files such as "story1.txt" (also 2,3,4, etc) are the articles scraped by the program. They contain a summary of each article in plain text. Upon running the program with the links, it will create them (if they are not already created, else it will overwrite them). Not necessary to run, just here to show the results of program running.

The file "requirements.yml" contains the conda environment needed to run the code in this repository. You will need to take this file and import into your virtual environment using conda.

Getting Started

To get the web scraper up and running, follow the steps below. They assume that you already have conda installed locally.

  • Clone the repo

    git clone https://github.com/sbar821/CBS-Article-Scraper
  • To create the conda environment, type the following:

    conda env create -f requirements.yml
  • Find ten CBS articles (or use the ones already in the file) and put them into url.txt, with each URL terminated by a new line.

  • Run main.py using the conda environment, or use Visual Studio Code with conda selected as the interpreter (see the example after this list).

  • Watch as story1.txt and the other output files are filled with the selected news articles! (Please note this only works with CBS News articles due to the specific HTML tags used.)
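For example, assuming the conda environment defined in requirements.yml is named cs325 (check the name field in that file, as the actual name may differ), a run from the terminal could look like:

    conda activate cs325
    python main.py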

Utilizing OpenAI's GPT-3.5-Turbo API for Generating Summaries

  • Before running the code, you need to set up an account on OpenAI and get your API key (KEEP THIS SAFE!!). For this to work, make sure you have enough credits on your account; creating a new account will ensure you have $5 worth of credits. Then, create a .env file in the CS325_p2 folder and put your key in that file (see the sketch after this list for one way to load it).

  • The .gitignore prevents the .env file from being committed to your repository, so your key stays safe.

  • After proper setup, you can query the API by calling:

    import openai  # requires the openai package; the API key must be configured (e.g. openai.api_key or the OPENAI_API_KEY environment variable)

    completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write your prompt here"}],
    )
  • You can use the result of the query by referring to completion.choices[0].message.content in your code. You can substitute 0 with another index if your request returns multiple choices.

  • In the case of this article scraper, we return the generated summary and pass it, along with a filename, to a function that writes the summaries to a .txt file.
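As a minimal sketch of how the key loading and the summary request can fit together (this assumes the python-dotenv package is installed; summarize_article is a hypothetical helper name, not necessarily the one used in this repository):

    import os

    import openai
    from dotenv import load_dotenv  # assumes python-dotenv is installed

    # Load OPENAI_API_KEY from the .env file so the key is never hard-coded.
    load_dotenv()
    openai.api_key = os.getenv("OPENAI_API_KEY")

    def summarize_article(article_text):
        """Ask GPT-3.5-Turbo for a short summary of the scraped article text."""
        completion = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": f"Summarize this news article:\n\n{article_text}"}
            ],
        )
        return completion.choices[0].message.content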

UML Diagram


cs325_project4's Issues

Test Case Implementation

This issue lists the specific locations and functions where the test cases are implemented in Project 4. Note that some of the implementations differ from the original Test Case List, mostly because I found better ways to test than I originally planned. A rough sketch of a couple of these checks follows the list.

  1. Verifying that urls.txt is non-empty:
  • Location: mod1_funct1.py, lines 20-22, inside the get_url_from_file function.
  • Throws an exception if no URLs are found in the input file.
  2. Check whether the URLs in the input file are accessible:
  • Location: mod1_funct1.py, lines 51-66.
  • Created a new function called test_urls that ensures all URLs return status code 200 (meaning they exist and can be scraped for actual content).
  3. Check that write_stories_to_file works correctly in writing a file:
  • Location: mod2_funct1.py, lines 29-34, inside the write_stories_to_file function.
  • If an error occurs when writing to an output file, an exception is thrown.
  4. Verify that the summary that is written exists/is non-empty:
  • Location: run.py, lines 55-58.
  • If write_summary_to_file creates an empty file, indicate that the specific file is empty.
  5. Verify raw data retrieval from get_raw_from_file:
  • Location: mod1_funct1.py, lines 40-44, inside get_raw_from_file.
  • If the content from BeautifulSoup (in raw form) is an empty array, throw an error saying so.
  6. Ensure necessary directories exist for saving files:
  • Location: run.py, lines 47-49.
  • If a necessary /processed directory does not exist, it will be created.
  7. Ensure a consistent CBS News domain name for article URLs:
  • Location: mod1_funct1.py, lines 40-49, in a new function called ensure_cbs_news_domain. This is also called in run.py (line 28).
  • If a URL does not match the CBS News domain, we halt execution of the program.
  8. Verify proper handling of duplicate URLs in urls.txt:
  • Location: mod1_funct1.py, lines 22-30.
  • If there are any duplicate input URLs, skip them and return a new array with the duplicates removed.
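As a rough sketch of how checks 7 and 8 might look in Python (the exact code in mod1_funct1.py may differ; the function bodies below only mirror the descriptions above):

    from urllib.parse import urlparse

    def ensure_cbs_news_domain(urls):
        """Halt the program if any URL is not from the cbsnews.com domain (check 7)."""
        for url in urls:
            host = urlparse(url).netloc
            if not host.endswith("cbsnews.com"):
                raise SystemExit(f"Non-CBS News URL found, stopping: {url}")

    def remove_duplicate_urls(urls):
        """Return a new list with duplicate URLs removed, preserving order (check 8)."""
        seen = set()
        unique = []
        for url in urls:
            if url not in seen:
                seen.add(url)
                unique.append(url)
        return unique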

Test Case List

Test Case List - Project 4

This issue represents the list of test cases we're going to tackle and implement for Project 4.
We already have a couple of good ones that Sydney implemented, so we're going to reuse them for simplicity's sake.

  1. Check whether the url.txt file has at least one URL:
  • Location: mod1_funct1.py, lines 20-22
  • Test to ensure the function get_url_from_file returns a list of strings of at least size 1 (at least one URL).
  • Expected outcome: the function should produce a non-empty list of article URLs.
  • This is already mostly implemented, so making any changes shouldn't be too difficult.
  2. Check whether the URLs in url.txt are accessible URLs that can later be used for HTML parsing:
  • Location: mod1_funct1.py, lines 56-70 (test_urls function)
  • Test to ensure that the URLs returned from get_url_from_file can be pinged and return a successful response (status code 200).
  • If any URL is inaccessible or returns an error, it should be flagged for further investigation.
  3. Check whether sanitized or unsanitized files are correctly written to their corresponding output files:
  • Location: mod2_funct1.py, lines 29-34
  • Test to ensure the function write_stories_to_file appends the text from an array to the given filename.
  • Expected outcome: each URL's sanitized/unsanitized article content is successfully written.
  • Already implemented for the most part.
  4. Verify summary writing:
  • Location: run.py, lines 57-59
  • Test to ensure that the function write_summary_to_file in mod3_function3.py correctly writes summaries to files.
  • Expected outcome: summaries should be written to files without errors.
  5. Verify raw data retrieval:
  • Location: mod1_funct1.py, get_raw_from_file function (see also ensure_cbsnews_domain, lines 48-52).
  • Test to ensure that the function get_raw_from_file in mod1_funct1.py correctly retrieves raw data from the specified URLs.
  • Expected outcome: the function should return raw data from the URLs without errors.
  6. Ensure necessary directories exist for saving files:
  • Location: run.py, lines 47-49
  • Test to ensure that the directories each type of processed file will be saved to exist.
  • Expected outcome: all processed files should have their necessary filepath directories generated before attempting to save a specific file.
  7. Ensure that a consistent domain name is used for input URLs:
  • Location: mod1_funct1.py, ensure_cbsnews_domain, lines 46-52
  • Because news sites vary in the structure of their HTML tags, we need to make sure that users only use CBS News articles so the web scraper behaves consistently.
  • Expected outcome: all files that are processed can only come from the cbsnews.com domain.
  8. Verify proper handling of duplicate URLs:
  • Location: mod1_funct1.py, within the get_url_from_file function, line 25
  • Ensure that the program handles cases where duplicate URLs are provided in the url.txt file.
  • The function should remove duplicate URLs and return a list containing only unique URLs; no duplicate URLs should be present in the output.

We will be creating separate issues for each of the 8 test cases and detailing their progress until each of them is successfully completed/closed.
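As a pytest-style sketch of how test cases 1 and 2 might be written (assuming get_url_from_file is importable from mod1_funct1 as described above; the repository's actual test code may differ):

    import requests
    from mod1_funct1 import get_url_from_file  # name taken from the test case descriptions above

    def test_urls_file_is_non_empty_and_accessible():
        """url.txt should contain at least one URL, and every URL should return HTTP 200."""
        urls = get_url_from_file("url.txt")
        assert len(urls) > 0, "url.txt should contain at least one URL (test case 1)"
        for url in urls:
            response = requests.get(url, timeout=10)
            assert response.status_code == 200, f"URL not accessible: {url} (test case 2)"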
