fbo-scraper (AKA Smartie)

FBO is the U.S. government's system of record for opportunities to do business with the government. Each night, the FBO system posts all updated opportunities as a pseudo-xml file that is made publicly available via the File Transfer Protocol (FTP), a standard network protocol for transferring files between a client and a server.
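
For illustration, here's a minimal sketch of pulling a nightly file over FTP with Python's ftplib; the host and file name below are placeholders, not the real FBO endpoints.

from ftplib import FTP

def download_nightly_file(host="ftp.example.gov", remote_name="FBOFeedYYYYMMDD"):
    # anonymous login is assumed here; the real feed's host and naming may differ
    with FTP(host) as ftp:
        ftp.login()
        with open(remote_name, "wb") as out:
            ftp.retrbinary(f"RETR {remote_name}", out.write)
    return remote_name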

This project uses supervised machine learning to determine whether or not the solicitation documents of Information Communications Technology (ICT) notices contain appropriate Section 508 accessibility language.

Following a service-oriented architecture, this repository, along with a forthcoming API, provides a back-end to a UI that GSA policy experts will use to review ICT solicitations for 508 compliance; notify deficient solicitation owners; monitor changes in historical compliance; and validate predictions to improve model performance.

The application is designed to be run as a cron daemon within a Docker image on cloud.gov. This is tricky to achieve as traditional cron daemons need to run as root and have opinionated defaults for logging and error notifications. This usually makes them unsuitable for running in a containerized environment. So, instead of a system cron daemon, we're using supercronic to run the crontab.
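
As a rough sketch, the crontab that supercronic consumes could look like the following (the schedule and path are illustrative only, not necessarily what this repo uses), with the container's command then being something like supercronic /app/crontab:

# run the scraper nightly at 02:00 UTC
0 2 * * * python /app/fbo.py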

Here's what happens every time the job is triggered:

  1. Download the pseudo-xml from the FBO FTP
  2. Convert that pseudo-xml to JSON
  3. Extract solicitations from the Information Communications Technology (ICT) categories
  4. Scrape each ICT solicitation's documents from their official FBO URLs
  5. Extract the text from each of those documents using textract
  6. Feed the text of each document into a binary classifier to predict whether or not the document is 508 compliant (the classifier was built and binarized using sklearn based on approximately 1,000 hand-labeled solicitations; see the sketch after this list)
  7. Insert data into a PostgreSQL database
  8. Retrain the classifier if there is a sufficient number of human-validated predictions in the database (validation will occur via the UI)
  9. If the new model is an improvement, save it and carry on.
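
Here's a simplified sketch of steps 5 and 6, using hypothetical helper and path names; see fbo.py and the pickled binaries in this repo for the real implementation.

import pickle
import textract  # pulls text out of .pdf, .docx, etc.

def predict_508_compliance(doc_path, model_path="binaries/estimator.pkl"):
    # model_path is a placeholder, not the actual location in this repo
    # Step 5: extract raw text from a downloaded solicitation document
    text = textract.process(doc_path).decode("utf-8", errors="ignore")
    # Step 6: load the pickled sklearn pipeline and score the text;
    # we assume the pipeline bundles its own vectorizer, so it accepts raw strings
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    return clf.predict([text])[0]  # 1 = compliant language found, 0 = not (assumed encoding)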

Getting Started

Prerequisites

This project uses:

  • Python 3.6.6
  • Docker
  • PostgreSQL 9.6.8

Below, we suggest venv for creating a virtual environment if you wish to run the scan locally.

To push to cloud.gov or interact with the app there, you'll need a cloud.gov account.

There are two docker images for this project: fbo-scraper and fbo-scraper-test. The former contains the application that can be pushed to cloud.gov (see instructions below), while the latter is strictly for testing during CI.

Local Implementation

If you have PostgreSQL, you can run the scan locally. Doing so will create a database with the following connection string: postgresql+psycopg2://localhost/test. To run it locally (using FBO data from the day before yesterday), do the following:

cd path/to/this/locally/cloned/repo
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
# now you can run the scan, with logs written to fbo.log
python fbo.py
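
Once the scan finishes, you can poke at the resulting database, assuming the project talks to Postgres through SQLAlchemy (which the psycopg2-style connection string above suggests). This snippet just lists the tables the run created:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://localhost/test")
with engine.connect() as conn:
    # list the tables the scan created in the public schema
    tables = conn.execute(text(
        "SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public'"
    ))
    for (name,) in tables:
        print(name)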

Running the tests

To run the tests, set up the environment like before but instead run:

python3 -W ignore -m unittest discover tests -p '*_test.py'

Several warnings and exceptions will print out. These are by design, as they come from conditions that are deliberately mocked in the tests.

Deployment

Deployment requires a cloud.gov account and access to the application's org. If those prerequisites are met, you can log in with:

cf login -a api.fr.cloud.gov --sso

Then target the appropriate org and space by following the instructions.

Then push the app, creating the service first:

cf create-service <service> <service-tag>
#wait a few minutes for the service to be provisioned
cf create-service-key <service-tag> <service-key-name>    #if this returns an OK, then your service has been provisioned  
cf push srt-fbo-scraper --docker-image csmcallister/fbo-scraper
cf bind-service srt-fbo-scraper <service-tag>  
cf restage srt-fbo-scraper

Above, <service> is the name and plan of a Postgres service (e.g. aws-rds shared-psql), while <service-tag> is whatever you want to call this service instance.

Since services can sometimes take up to 60 minutes to be provisioned, we use cf create-service-key to ensure the service has been provisioned. See this for more details.

Logs

We don't do anything special with logging: logs are simply written to STDOUT/STDERR, and we use https://login.fr.cloud.gov/login to view and search them.

A TODO is to log in JSON using python-json-logger, which will make the logs more easily searchable.
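
A rough sketch of that TODO using python-json-logger's JsonFormatter (the field names and extra values here are just examples) might look like:

import logging
from pythonjsonlogger import jsonlogger

handler = logging.StreamHandler()  # keep writing to STDOUT/STDERR for cloud.gov
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("fbo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# "notices_processed" is an illustrative field, not one the app currently emits
logger.info("scan finished", extra={"notices_processed": 42})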

Contributing

Please read CONTRIBUTING for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the Creative Commons Zero v1.0 Universal License - see the LICENSE file for details.

Acknowledgments
