elmahsieh / udn_newsscrapper_gpt_categorizer

This project automates the scraping of news articles from the United Daily News (UDN) website, filters and processes them using specified keywords and OpenAI's GPT for Named Entity Recognition (NER), and exports the categorized data into a CSV file.

Jupyter Notebook 56.73% Python 43.27%
api apscheduler beautifulsoup data-export data-extraction data-processing named-entity-recognition natural-language-processing pandarallel pandas web-scraping openai-gpt-models

UDN News Scrapper & GPT Categorizer

Overview

This project scrapes news articles from the United Daily News (UDN) website, filters them by specified keywords, runs Named Entity Recognition (NER) on them with OpenAI's GPT models, and saves the categorized articles to a CSV file for further analysis. It combines web scraping, data extraction, and GPT-based text processing so that gathering, categorizing, and saving news data requires no manual steps.

Features

  • Web Scraping: Uses requests and BeautifulSoup to scrape news articles from UDN.
  • Keyword Filtering: Filters articles based on specified keywords related to corruption and fraud.
  • Data Extraction: Extracts article titles, content, publication dates, and URLs.
  • Date Filtering: Filters articles based on their publication dates to only include recent news.
  • CSV Export: Saves the filtered articles into a CSV file for easy visualization and analysis.
  • Scheduling: Automates the scraping process to run at a specified time every day using APScheduler.
  • GPT Processing: Uses OpenAI's GPT models to perform NER and categorize the scraped articles.
  • Parallel Processing: Utilizes pandarallel for efficient parallel processing of DataFrame operations.
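
The keyword filtering, date filtering, and CSV export steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`filter_articles`, `write_csv`), the keyword list, the date format, the one-day window, and the sample records are all assumptions made for the example. It uses only the Python standard library so it runs without dependencies, whereas the project itself lists pandas for its data handling.

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical keywords; the README only says "corruption and fraud".
KEYWORDS = ["貪污", "詐欺"]  # "corruption", "fraud"

# Assumed column layout, following the fields named under Data Extraction.
FIELDS = ["title", "content", "date", "url"]


def filter_articles(articles, keywords, now, max_age_days=1):
    """Keep articles that mention any keyword and are at most max_age_days old."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for article in articles:
        text = article["title"] + article["content"]
        # The date format is an assumption for this sketch.
        published = datetime.strptime(article["date"], "%Y-%m-%d %H:%M")
        if published >= cutoff and any(keyword in text for keyword in keywords):
            kept.append(article)
    return kept


def write_csv(articles, handle):
    """Write the articles to an open text handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(articles)


# Made-up sample records with placeholder URLs, standing in for scraped data.
SAMPLE = [
    {"title": "前官員涉貪污遭起訴", "content": "檢方今日起訴。",
     "date": "2024-05-02 09:30", "url": "https://example.com/a"},
    {"title": "詐欺集團落網", "content": "警方破獲詐欺案。",
     "date": "2024-04-20 08:00", "url": "https://example.com/b"},
    {"title": "天氣預報", "content": "明日多雲。",
     "date": "2024-05-02 10:00", "url": "https://example.com/c"},
]

if __name__ == "__main__":
    # A fixed "now" keeps the example deterministic.
    recent = filter_articles(SAMPLE, KEYWORDS, now=datetime(2024, 5, 2, 12, 0))
    buffer = io.StringIO()
    write_csv(recent, buffer)
    print(buffer.getvalue())
```

With the fixed timestamp used here, only the first sample record passes both filters: the second matches a keyword but is too old, and the third is recent but matches no keyword. To write to disk instead of a buffer, open the file with `newline=""` and `encoding="utf-8"` before passing it to `write_csv`.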

Requirements

  • Python 3.6+
  • Requests
  • BeautifulSoup4
  • Pandas
  • APScheduler
  • tqdm
  • openai
  • tiktoken
  • jieba
  • numpy
  • pandarallel

Installation

pip install requests beautifulsoup4 pandas apscheduler tqdm openai tiktoken jieba numpy pandarallel
