
engage-scraper's Introduction

Engage Scraper

Installation

pip install engage-scraper

About

The Engage Scraper is a standalone library that can be included in any service. Its purpose is to catalog a municipality's council meeting agendas in a usable format for consumers such as the engage-client and engage-backend.

To extend this library for your municipality, subclass the base class from the scraper_core/ directory, override its methods, and put your module in scraper_logics/, prefixing the file name with your municipality's name. For a worked example, see the Santa Monica, CA scraper in the scraper_logics/ directory. It uses htmlutils.py because its sources require HTML scraping. Feel free to open PRs with new utilities (for example, PDF scraping, RSS scraping, or JSON parsing). The Santa Monica example also uses SQLAlchemy for its models, which is the preferred approach in dbutils.py, although you can use anything. ORMs are preferred over vanilla psycopg2 or the like.
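The extension pattern described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: `BaseScraper` is a hypothetical stand-in for whatever base class scraper_core/ really defines, and `ExampleTownScraper` with its in-memory storage is invented for the example. Only the two method names and the constructor options are taken from this README.

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Hypothetical stand-in for the base class in scraper_core/ (real name and methods may differ)."""

    def __init__(self, tz_string="America/Los_Angeles", years=None, committee=""):
        self.tz_string = tz_string
        self.years = years if years is not None else ["2019"]
        self.committee = committee

    @abstractmethod
    def get_available_agendas(self):
        """Discover which agendas exist for the configured years."""

    @abstractmethod
    def scrape(self):
        """Process the discovered agendas and store their contents."""


class ExampleTownScraper(BaseScraper):
    """Hypothetical municipality logic; would live in scraper_logics/ as exampletown_scraper_logic.py."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.agendas = []
        self.stored = {}

    def get_available_agendas(self):
        # A real implementation would fetch and parse the municipality's agenda index here.
        self.agendas = [f"{year}-agenda-1" for year in self.years]
        return self.agendas

    def scrape(self):
        # A real implementation would parse each agenda and write it to the database.
        for agenda_id in self.agendas:
            self.stored[agenda_id] = {"committee": self.committee}
        return self.stored
```

Check the actual base class in scraper_core/ for the real method names and signatures before overriding anything.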

To use the Postgres dbutils.py, set these five environment variables (check dev.env and see the docker-compose usage below):

  • POSTGRES_HOST (optional): a resolvable host or hostname. Defaults to localhost.
  • POSTGRES_USER (required)
  • POSTGRES_PASSWORD (required)
  • POSTGRES_PORT (optional): defaults to 5432.
  • POSTGRES_DB (required): the database used for cataloging your municipality's agendas.
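To show how the required and optional variables above fit together, here is a minimal sketch that reads them and builds a PostgreSQL connection URL. The helper name `postgres_url` is hypothetical, and how dbutils.py actually consumes these variables is not shown in this README, so treat this as an assumption about typical usage rather than the library's implementation.

```python
import os
from urllib.parse import quote_plus


def postgres_url(env=None):
    """Build a PostgreSQL URL from the five variables listed above (hypothetical helper)."""
    env = os.environ if env is None else env

    # Required variables: fail early with a clear message if any is missing or empty.
    required = ("POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    # Optional variables fall back to the defaults documented above.
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")

    # Percent-encode the credentials so characters like '@' do not break the URL.
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    return f"postgresql://{user}:{password}@{host}:{port}/{env['POSTGRES_DB']}"
```

A URL in this form can be passed to SQLAlchemy's `create_engine`, provided a PostgreSQL driver such as psycopg2 is installed.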

An example of using the Santa Monica scraper library

from engage_scraper.scraper_logics import santamonica_scraper_logic

scraper = santamonica_scraper_logic.SantaMonicaScraper(committee="Santa Monica City Council")
scraper.get_available_agendas()
scraper.scrape()

For SantaMonicaScraper instantiation

For Twitter utils used in SantaMonicaScraper

To use the Santa Monica logic, you must create an app on Twitter (we will work to make this optional). After creating the app, follow the structure of the dev.env file to supply the appropriate parameters. Do not change the repository's copy of the file: copy it up one directory and edit it there. After editing, use docker-compose.yml for testing. You can add examples to examples/ and run them from the script in scripts/ using the Docker container.

The SantaMonicaScraper class's init accepts these options:

  • tz_string="America/Los_Angeles" # defaulted string
  • years=["2019"] # defaulted array of strings of years
  • committee="Santa Monica City Council" # defaulted string of council name

The exposed API methods for scraper are

  • .get_available_agendas() # To get available agendas, no arguments
  • .scrape() # To process agendas and store contents

Feel free to expose more

  • Write wrappers for internal functions if you want to expose them
  • Write extra functions to handle more complex municipality-specific tasks

engage-scraper's People

Contributors

eselkin, teddycr


engage-scraper's Issues

Santa Monica Scraper: Spacing

Spacing is an issue in both body and recommendation scraping. It stems from the source's heavy use of assorted Unicode spacing characters, spans that have width but no spacing, and multiple spans within a single word. Currently we strip Unicode to produce ASCII, but there may be a better way. We are left with results like:

computer system s ha ve become operationally

and

Authorize the City Man ager to negotiate a nd

or

$452,75 0(in clud esa contingency);
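One possible alternative to stripping Unicode down to ASCII is to normalize the spacing characters instead. The sketch below is an assumption about what might help, not the scraper's current behavior, and `normalize_spacing` is a hypothetical helper. It folds compatibility spaces (such as the no-break space) into ordinary spaces, drops zero-width characters, and collapses whitespace runs. It does not address words split across multiple spans, which would need handling at the HTML level.

```python
import re
import unicodedata

# Invisible characters with no visible width: zero-width space, non-joiner, joiner,
# word joiner, byte order mark, and soft hyphen. Mapping to None deletes them.
_INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff\u00ad"))


def normalize_spacing(text):
    """Normalize Unicode spacing without discarding other non-ASCII characters."""
    # NFKC folds compatibility characters, e.g. NO-BREAK SPACE and THIN SPACE become ' '.
    text = unicodedata.normalize("NFKC", text)
    # Remove invisible characters so they cannot split a word.
    text = text.translate(_INVISIBLE)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_spacing("City\u00a0Man\u200bager")` joins the word that a zero-width space had split while keeping the real space between the two words.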

Successful scrape triggers tweet

Each time we successfully scrape new agenda content and post it to the feed, we should post a new tweet to the Engage Santa Monica Twitter account: @EngageStaMonica

The tweet should read:

Agenda Items for the next @santamonicacity City Council meeting is open for public feedback until 11:59am [Month | Date]. Head to http://sm.engage.town now to voice your opinion!

The [Month | Date] text should be formatted like "Feb 26th".
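The date format requested above needs an English ordinal suffix, which Python's `strftime` does not provide. A minimal sketch of one way to build the tweet text follows. The function names are hypothetical, and the tweet wording is copied from this issue rather than from any existing code.

```python
from datetime import date

# Fixed abbreviations avoid depending on the process locale (unlike strftime("%b")).
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def ordinal(day):
    """Return the day with its English ordinal suffix, e.g. 26 -> '26th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th are exceptions to the last-digit rule
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def format_deadline(d):
    """Format a date as '[Month] [Date]', e.g. 'Feb 26th'."""
    return f"{MONTHS[d.month - 1]} {ordinal(d.day)}"


def build_tweet(deadline):
    """Fill the deadline into the tweet wording given in this issue."""
    return (
        "Agenda Items for the next @santamonicacity City Council meeting is open "
        f"for public feedback until 11:59am {format_deadline(deadline)}. "
        "Head to http://sm.engage.town now to voice your opinion!"
    )
```

Calling `build_tweet(date(2019, 2, 26))` yields the tweet with "Feb 26th" as the deadline.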
