servidores-download-transparencia

Introduction

This is a script I wrote to download employee data from the API of Portal da Transparência (https://api.portaldatransparencia.gov.br/).

It uses asynchronous coroutines with asyncio to make simultaneous API requests, saving each agency's list of employees as a pickle file using joblib.

Challenges

  • Using asynchronous code to speed up data acquisition
  • Dealing with the API itself:
    • sometimes it simply stops responding;
    • response speed varies drastically with the time of day;
    • there appears to be a limit of 400 calls per minute (700 between 0h and 6h);
  • Controlling the execution speed to ensure some progress before the API stops responding and/or the call limit is reached.

Results

The script expects a key.py file containing a function that simply returns the API key as a string. You can obtain your own key by following the steps at https://www.portaldatransparencia.gov.br/api-de-dados/cadastrar-email
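A minimal key.py might look like the sketch below. The function name `key` follows the description above; the returned string is a placeholder, not a real key:

```python
# key.py -- keep this file out of version control (e.g. add it to .gitignore),
# since it holds your personal Portal da Transparencia API key.

def key() -> str:
    """Return the API key as a plain string."""
    return "YOUR-API-KEY-HERE"  # placeholder: paste your own key here
```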

It uses a list of agencies (órgãos) that was acquired through the same API; the list is included in this repository in .csv format. For each agency, the script then tries to fetch all the employee data.

An asyncio.Semaphore limits the number of agencies whose employees are being requested simultaneously. Once an agency's employee list is complete, its contents are stored in a pickle file inside the ./pickle folder. A timeout is needed because the API sometimes simply stops responding; when that happens, the script moves on to the next agency code, leaving the timed-out one for a later iteration. An HTTP 429 response is also expected: the script then stops making calls for the entire iteration and sleeps for some time before attempting the whole list again. It checks which pickle files have already been downloaded and skips those agencies.
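The control flow described above can be sketched roughly as follows. This is an illustrative sketch, not the repository's actual code: `download_agency`, `fetch`, and the constants are hypothetical names, and `fetch` stands in for the real HTTP request:

```python
import asyncio

MAX_CONCURRENT_AGENCIES = 3   # plays the role of the ASYNC_AGENCIES global
REQUEST_TIMEOUT = 10          # seconds before a hanging agency is skipped

class RateLimited(Exception):
    """Raised when the API answers with HTTP 429."""

async def download_agency(code, fetch, sem):
    """Fetch one agency's employees, honoring the semaphore and the timeout.

    `fetch` is a coroutine function taking an agency code and returning a
    (status, data) pair; it stands in for the real HTTP call.
    """
    async with sem:                     # at most N agencies in flight at once
        try:
            status, data = await asyncio.wait_for(fetch(code), REQUEST_TIMEOUT)
        except asyncio.TimeoutError:
            return None                 # leave this agency for a later pass
        if status == 429:
            raise RateLimited(code)     # caller should sleep and retry later
        return data

async def main(codes, fetch):
    sem = asyncio.Semaphore(MAX_CONCURRENT_AGENCIES)
    results = await asyncio.gather(
        *(download_agency(c, fetch, sem) for c in codes),
        return_exceptions=True,         # one 429 must not kill the whole batch
    )
    # keep only the agencies that actually came back with data
    return {c: r for c, r in zip(codes, results)
            if r is not None and not isinstance(r, Exception)}
```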

This ended up being a rough-looking solution, but it gets the job done. Some agencies have so many employees that a 429 can arrive before their data finishes downloading, though results vary with the API's response speed and the time of day. I tried to control the call rate by counting requests and using time.sleep to wait 60 seconds before reaching the maximum number of calls, but this interfered with the execution of the event loop, so I gave this solution up for now, until I have a better understanding of how asyncio manages the coroutines. asyncio.sleep would avoid blocking the loop, but it only pauses the current coroutine, which creates unwanted behavior when many concurrent coroutines share a global connection counter.
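A toy example (not the script itself) of the problem with time.sleep: it blocks the entire event loop, so no other coroutine makes any progress during the wait:

```python
import asyncio
import time

async def worker(log):
    log.append("worker start")
    await asyncio.sleep(0.1)   # yields: other coroutines may run meanwhile
    log.append("worker end")

async def blocking_pause(log):
    log.append("pause start")
    time.sleep(0.3)            # blocks the WHOLE event loop for 0.3 s
    log.append("pause end")

async def main():
    log = []
    await asyncio.gather(blocking_pause(log), worker(log))
    return log

# The worker cannot even start until the blocking sleep is over:
# ['pause start', 'pause end', 'worker start', 'worker end']
```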

How to use

Set up your API key by creating a Python script with a key function that returns your API key as a string. Then run async_api_download.py. You can adjust the global variables ASYNC_AGENCIES and ASYNC_PAGES as you like: a higher number of pages speeds up the download of agencies with many employees, at the cost of "wasting" calls on empty pages (the script only knows the last page has been reached after it awaits all the page tasks). A higher number of concurrent agencies speeds up the download of agencies with few employees, at the risk of hitting the maximum number of connections before all the employees are dumped to a file. You can leave the code running in the background.
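The page-batching behavior described above can be sketched like this. `download_all_pages` and `fetch_page` are hypothetical names, and the real script's internals may differ; the point is that a batch of pages is awaited together, and only afterwards does the code learn whether it ran past the last page:

```python
import asyncio

ASYNC_PAGES = 5   # pages requested per batch (mirrors the script's global)

async def download_all_pages(agency, fetch_page):
    """Fetch an agency's employee pages in batches until one comes back empty.

    `fetch_page(agency, n)` is a stand-in coroutine returning a (possibly
    empty) list of employees for page n.
    """
    employees, page = [], 1
    while True:
        batch = await asyncio.gather(
            *(fetch_page(agency, p) for p in range(page, page + ASYNC_PAGES))
        )
        for result in batch:
            employees.extend(result)
        # only after awaiting the whole batch do we know we passed the end,
        # so any pages requested beyond the last one were "wasted" calls
        if any(len(result) == 0 for result in batch):
            return employees
        page += ASYNC_PAGES
```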

After you finish downloading all the data, the employee_parser.py script can be used to parse all the employees into a single CSV file. The result is a CSV of roughly 1 million rows. The desired columns can be selected by editing the script.
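The merging step could look something like the sketch below. The repository pickles with joblib; this sketch uses the standard-library pickle and csv modules for illustration, and the column names are made up, not the API's real field names:

```python
import csv
import pickle
from pathlib import Path

# Hypothetical column subset; the real script lets you choose the columns.
COLUMNS = ["nome", "orgao", "remuneracao"]

def pickles_to_csv(pickle_dir, out_csv):
    """Merge every pickled employee list in `pickle_dir` into one CSV file."""
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(Path(pickle_dir).glob("*.pickle")):
            with open(path, "rb") as pf:
                employees = pickle.load(pf)   # one agency's employee list
            for emp in employees:
                # missing fields become empty cells rather than crashing
                writer.writerow({c: emp.get(c, "") for c in COLUMNS})
```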
