
okfn-brasil / serenata-de-amor


🕵 Artificial Intelligence for social control of public administration | **This repository does not receive frequent updates. Check out the README**

Home Page: https://serenata.ai/en

License: MIT License

Python 76.58% HTML 1.21% JavaScript 0.24% CSS 0.18% Elm 21.38% Dockerfile 0.30% Shell 0.11%
machine-learning data-science artificial-intelligence politics civic-tech open-data

serenata-de-amor's Introduction


Operação Serenata de Amor

  1. Non-tech crash course into Operação Serenata de Amor
  2. Tech crash course into Operação Serenata de Amor
  3. Contributing with code and tech skills
  4. Supporting
  5. Update
  6. Acknowledgments

Non-tech crash course into Operação Serenata de Amor

What

Serenata de Amor is an open project using artificial intelligence for social control of public administration.

Who

We are a group of people who believe in the motto "power to the people". We are also part of the Data Science for Civic Innovation Programme from Open Knowledge Brasil.

Among founders and long-term members, we can list a group of eight people – plus numerous contributors from the open source and open knowledge communities: Tatiana Balachova, Felipe Cabral, Eduardo Cuducos, Irio Musskopf, Bruno Pazzim, Ana Schwendler, Jessica Temporal, Yasodara Córdova and Pedro Vilanova.

How

Similar to organizations like Google, Facebook, and Netflix, we use technology to track government spending and make open data accessible to everyone. We started by looking into data from the Chamber of Deputies (Brazilian lower house), and later expanded to the Federal Senate (Brazilian upper house) and to municipalities.

When

Irio had the main ideas for the project in early 2016. For a few months, he experimented and gathered people around the project. September 2016 marks the launch of our first crowdfunding campaign. Since then, we have been creating open source technological products and tools, as well as high quality content on civic tech on our Facebook and Medium.

Where

We have no physical headquarters; we work remotely every day. Most of our ideas are crafted to work in any country that offers open data, but our main implementations focus on Brazil.

Why

Empowering citizens with data is important: people talk about smart cities, surveillance and privacy. We prefer to focus on smart citizens, accountability and open knowledge.

Tech crash course into Operação Serenata de Amor

What

Serenata de Amor develops open source tools to make it easy for people to use open data. The focus is to gather relevant insights and share them in an accessible interface. Through this interface, we invite citizens to engage in dialogue with politicians, state and government about public spending.

Who

Serenata's main role is played by Rosie: she is an artificial intelligence who analyzes the expenses of Brazilian congresspeople while they are in office. Rosie can find suspicious spending and engage citizens in the discussion about these findings. She's on Twitter.

To allow people to visualize and make sense of the data Rosie generates, we have created Jarbas. On this website, users can browse congresspeople's expenses and get details about each of the suspicions. It is the starting point to validate a suspicion.

How

We have two main repositories on GitHub. This is the main repo, which hosts Rosie and Jarbas. In addition, we have the toolbox, a pip-installable package. There are also experimental notebooks maintained by the community, and our static webpage.

When

Despite all these players acting together, the core part of the job is run manually from time to time. The only part that is always online is Jarbas – freely serving a wide range of information about public expenditure 24/7.

Roughly once a month, we manually run Rosie and update Jarbas. A few times per year, we upload versioned datasets accessible via the toolbox – but we encourage you to use the toolbox to generate fresh datasets whenever you need.

Where

Jarbas runs on Digital Ocean droplets and is deployed using the Docker Cloud architecture.

Why

The answer to most technical "why" questions is that this is what we had at hand in the past, and it enabled us to deliver fast. We acknowledge that this is not the best stack ever, but it has brought us here.

Contributing with code and tech skills

Make sure you have read the Tech crash course on this page. Next, check out our contributing guide.

Supporting

Update

Operation Serenata de Amor expanded into new projects. Because of this, Rosie, Jarbas and the infrastructure in general are receiving updates less frequently. If you have experience and want to help us resolve bugs faster or propose improvements, join our Discord and let's talk about the project! On the other hand, if you are looking for an active community to collaborate with, we invite you to discover the Querido Diário project.

Finally, if you want to know more details about Serenata's current situation, you can consult this text (in Portuguese) available on Apoia.se.

Acknowledgments

Open Knowledge Brasil, Digital Ocean

serenata-de-amor's People

Contributors

anapaulagomes, anaschwendler, andrepinho, antonioj-mattos, cabral, caduvieira, caiocarrara, cuducos, dsakuma, fabiocorreacordeiro, fgrehm, filipelinhares, giovanisleite, irio, jtemporal, lacerdamarcelo, leportella, lipemorais, lpillmann, luizcavalcanti, marcusrehm, matheushf, oleggator, pyup-bot, rafonseca, ricardochaves, rogeriochaves, sergiomario, viniciusartur, wisner23

serenata-de-amor's Issues

How can I contribute?

Hi, I'd like to contribute to this project, but I don't know Python very well. How can I start contributing? Is there an introductory guide to the project?

PS: Sorry, my English isn't good...

Chat, mailing list or another real-time communication channel

I would be happy to learn a bit more about how you are organizing yourselves and where you are headed. I suggest creating a mailing list, a Slack organization or equivalent, an IRC channel or something similar.

It should be free (or very cheap) and simple to set up.

Does such a channel already exist?

Why not in Portuguese?

  • The project was born because of corruption in Brazil
  • I believe most participants in the Telegram group are Brazilian
  • People regularly write a Portuguese word in quotes, either because they don't know how to translate it or because it has no translation
  • Incorrect English sentences often come out, since many of us lack command or daily use of the language

So why don't we simply talk in Portuguese?
Or, instead, create a Telegram group for discussions in Portuguese?
Or actually move to another chat platform, as proposed in #81 and #46?
I really think it is important to make communication among us easier, and I think that is best achieved by using our native language.

Captcha for the CNPJ

Is it illegal to break the captcha on Receita's website? If it is not, maybe I can share a project that I built using scikit-learn.

How can I contribute

How can I contribute?
I know Python, Data Science and some Big Data concepts.

I would like to know the specific area where I can contribute, with what, and so on! I would also like to propose another project to the Data Science Brigade: what share of the online population is for or against the impeachment.

Backup pictures from all receipts

It is vital for the project to have a way of accessing all receipts, from every reimbursement since the first available, without depending on the Chamber of Deputies.

Besides providing evidence for legal reports, it's useful for offline analyses. #32 is one I have in mind; doing OCR to generate new structured data is another.

Here's a function that, based on a record from quota datasets, returns the picture URL from the Chamber of Deputies' website:

def document_url(record):
    """Return the receipt URL for a record from the quota datasets."""
    base = 'http://www.camara.gov.br/cota-parlamentar/documentos/publ'
    return '%s/%s/%s/%s.pdf' % (
        base, record['applicant_id'], record['year'], record['document_id'])

List of beneficiaries of Bolsa Família

Today a friend tweeted that he had a document from the Rio de Janeiro Public Prosecutor's Office with an attachment listing businesspeople who have been receiving Bolsa Família (a Brazilian social benefit aimed at people in extreme poverty, i.e. certainly not businesspeople).

We'll probably have the names of partners in companies where congresspeople have spent CEAP money, and we can scrape (e.g. from here) the names of people receiving the Bolsa Família benefit.

With both, we could spot more businesspeople receiving the benefit (although, according to @cabral, this is a task already being done by the Rio de Janeiro Public Prosecutor's Office), and also gain a meaningful variable for our model.

Exploratory Analysis on companies with the same name

When collecting information about companies and working on other analyses, I found a large number of distinct companies with the same name (but different CNPJs). I want to learn more about them through an exploratory analysis, and possibly get insights into new irregularities.

Here's a short list of company names with multiple CNPJs in the quota:

AUTO POSTO CASCAO LTDA
IRMAOS BRETAS , FILHOS E CIA LTDA
FLECHA S.A. TURISMO COMERCIO E INDUSTRIA
POSTO AUTOMAN LTDA
ALLPARK EMPREENDIMENTOS, PARTICIPACOES E SERVICOS S.A.
RGB RESTAURANTES LTDA.
CANAA COMBUSTIVEIS PARA VEICULOS LTDA
MARIETTA COMERCIO DE ALIMENTOS LTDA
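
As a starting point for that exploratory analysis, here is a minimal sketch of how such names could be listed. It assumes each record is a dict with a `cnpj_cpf` field (the identifier used elsewhere in the datasets) and a company name field, called `supplier` here purely as an assumption; the sample records are made up.

```python
from collections import defaultdict


def names_with_multiple_cnpjs(records):
    """Return {company name: set of CNPJs} for names tied to 2+ CNPJs."""
    cnpjs_by_name = defaultdict(set)
    for record in records:
        # Normalize case and whitespace so trivial variations group together
        name = ' '.join(record['supplier'].upper().split())
        cnpjs_by_name[name].add(record['cnpj_cpf'])
    return {name: cnpjs for name, cnpjs in cnpjs_by_name.items()
            if len(cnpjs) > 1}


# Made-up records, for illustration only
sample = [
    {'supplier': 'POSTO EXEMPLO LTDA', 'cnpj_cpf': '00000000000191'},
    {'supplier': 'Posto  Exemplo Ltda', 'cnpj_cpf': '00000000000272'},
    {'supplier': 'RESTAURANTE EXEMPLO LTDA', 'cnpj_cpf': '00000000000353'},
]
duplicates = names_with_multiple_cnpjs(sample)
```

Many of these cases are probably legitimate (branches of the same chain share a name), so the output is better read as a list of candidates than as a list of irregularities.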

Which documents do not belong to congressperson

Problem: if you divide 213 million (the amount of the quota used per year) by 266 thousand (the average amount used per congressperson), the result is about 800 people. That is 287 more than the 513 congresspeople in activity (list of congresspeople).
The 800 include alternates, leaderships of the PSDB and PT parties, and some open questions, such as: who is SDD?

  • we need a detection solution to tell which documents (receipts) do NOT belong directly to a congressperson (alternate or not: alternate list)
  • we need to create a new dataset with the columns: congressperson ID | congressperson? (boolean) | congressperson is active? (boolean)
  • why is it possible for non-congresspeople to use the quota?
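
The arithmetic behind the problem statement, plus a rough sketch of the proposed dataset, could look like this. The figures are the ones quoted above; the function, its arguments and the sample IDs are hypothetical, and the two ID sets would have to come from the congressperson and alternate lists.

```python
quota_per_year = 213_000_000   # total quota used per year, as quoted above
average_per_person = 266_000   # average used per congressperson, as quoted above
seats = 513                    # congresspeople in activity

implied_people = quota_per_year // average_per_person  # whole people only
excess = implied_people - seats


def classify_applicants(applicant_ids, congressperson_ids, active_ids):
    """Build rows with the three columns proposed in the issue."""
    return [
        {
            'congressperson_id': applicant_id,
            'is_congressperson': applicant_id in congressperson_ids,
            'is_active': applicant_id in active_ids,
        }
        for applicant_id in applicant_ids
    ]


# Made-up IDs: 3 is an applicant that is not a congressperson (e.g. a leadership)
rows = classify_applicants([1, 2, 3], congressperson_ids={1, 2}, active_ids={1})
```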

Find expenses made with companies of relatives

According to "ATO DA MESA Nº 43, DE 21/5/2009 Art. 4º § 13", no expense can be reimbursed if made with companies where the applicant (politician) is owner, partner or relative (maximum of third degree).

Why Telegram instead of Gitter or irc?

Telegram is a proprietary service that requires people to install an app on their phones. Who knows what sort of tracking it does and what it shares with governments? This project should use something more open, IMO.

Simple web service to return everything we know about a given reimbursement

Especially for internal reports (e.g. something strange found during analyses, needing further investigation before public reports), a web service to return everything we know about a specific record from the Quota for Exercising Parliamentary Activity would come in handy.

By everything we know, I mean the following:

  • Record as reported by the Chamber of Deputies (e.g. dataset.iloc[0], with dataset being pd.read_csv('data/2016-08-08-current-year.xz')).
  • Image of the receipt (check related work in #33).
  • Detailed information about the company receiving the payment (#19 is creating such dataset).

This short notebook could be replaced by a URL to this web service, with 5621548 (document_id) or 20574089000107 (cnpj_cpf) as parameter.

Find clusters of politicians spending with companies owned by each other's relatives

According to "ATO DA MESA Nº 43, DE 21/5/2009 Art. 4º § 13", no expense can be reimbursed if made with companies where the applicant (politician) is owner, partner or relative (maximum of third degree).

Politicians may know about this and form groups to "exchange expenses" to get around the rule. For example, Politician A pays companies of Politician B's relatives; Politician B pays companies of Politician A's relatives.
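
The simplest version of this pattern, a reciprocal pair, can be sketched as below. Both inputs are assumptions: `expenses` as (politician, company) pairs, and `related_politicians` as a mapping from a company to the politicians whose relatives own it, which is data the project would still need to build.

```python
def reciprocal_pairs(expenses, related_politicians):
    """Find pairs of politicians who each spend at the other's relatives' companies.

    expenses: iterable of (politician, company) tuples.
    related_politicians: dict mapping company -> set of politicians
        related to its owners (hypothetical input).
    """
    edges = set()
    for politician, company in expenses:
        for relative_of in related_politicians.get(company, ()):
            if relative_of != politician:
                # politician paid a company owned by a relative of relative_of
                edges.add((politician, relative_of))
    return sorted({tuple(sorted(edge)) for edge in edges
                   if (edge[1], edge[0]) in edges})


# Made-up example: A and B pay each other's relatives, C only pays B's
expenses = [('A', 'X'), ('B', 'Y'), ('C', 'X')]
related_politicians = {'X': {'B'}, 'Y': {'A'}}
pairs = reciprocal_pairs(expenses, related_politicians)
```

This only catches exchanges between two politicians. Larger rings (A pays B's relatives, B pays C's, C pays A's) would need cycle detection on the same directed graph.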

Calculate and find anomalies on traveled speeds from last meal

The Quota for Exercise of Parliamentary Activity says that meal expenses can be reimbursed only for the politician, excluding guests and assistants. A feature holding the speed traveled since the last meal can help us detect anomalies compared to other expenses.

For doing so, we need the location of each expense. One possible way of getting that is fetching the address of each CNPJ receiving money from expenses with "Congressperson meal" as subquota_description.
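
Once each meal has a location, the feature itself could be computed roughly as follows. This is a sketch under assumptions: every meal of one politician is a dict with `when` (a datetime), `lat` and `lon`, none of which are existing dataset columns, and distance is measured in a straight line with the haversine formula.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def speeds_since_last_meal(meals):
    """Return km/h between consecutive meals of a single politician."""
    ordered = sorted(meals, key=lambda meal: meal['when'])
    speeds = []
    for previous, current in zip(ordered, ordered[1:]):
        hours = (current['when'] - previous['when']).total_seconds() / 3600
        km = haversine_km(previous['lat'], previous['lon'],
                          current['lat'], current['lon'])
        if hours > 0:
            speeds.append(km / hours)
        else:
            # Same timestamp: infinite speed if the places differ
            speeds.append(float('inf') if km > 0 else 0.0)
    return speeds


# Made-up meals: Brasília, then São Paulo two hours later
meals = [
    {'when': datetime(2016, 8, 8, 12, 0), 'lat': -15.79, 'lon': -47.88},
    {'when': datetime(2016, 8, 8, 14, 0), 'lat': -23.55, 'lon': -46.63},
]
speeds = speeds_since_last_meal(meals)
```

If receipts only carry a date and no time, the speed is at best a lower bound, and two meals on the same day in distant cities would show up as infinite speed.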

Find expenses in brothels

Brothels will usually print receipts using companies with more "morally acceptable" names and main activities, e.g. a restaurant. If we can gather a list of CNPJs from the major brothels in Brazilian capitals, this could generate a very interesting report and news coverage.

Original idea by Pedro Vilanova.

Create new social accounts

For public relations, the project needs some social accounts to spread the word and sustain itself through crowdfunding, Patreon or another platform.

Twitter

Linkedin (?)
Medium (using Data Science BR?)
Others?

Discord channel

In conversations in the main Telegram group, I noticed that some people think a Discord channel is a good way for us to communicate. So I've created a channel (https://discord.gg/vhCbB3C) on Discord for Serenata de Amor.

If you agree, I'd ask you to put the link in the project description to let people know about the channel.

By the way, I will hand over the administration rights of the channel to the founders at any time.

Best regards and good work guys

Data from candidates in 2016 elections (Dados de candidatos nas eleições 2016)

In case it is of interest, there is now a way to view data on the candidates in the 2016 elections for every city in the country:
http://www.tse.jus.br/eleicoes/eleicoes-2016/divulgacao-de-candidaturas-e-contas-eleitorais

It includes information on the assets declared by the candidates.

Apparently scraping is easy. Example for one candidate: http://divulgacandcontas.tse.jus.br/divulga/rest/v1/candidatura/buscar/2016/86517/2/candidato/210000011570

Congratulations on the project.

Pack common tools in a specific module

While coding, planning to code or reviewing code, I'm starting to see a lot of repetition. My intention is to create a new module (tools/ maybe) with tools that anyone can use in other scripts (setting up scripts from src/, exploring and analyzing data in develop/ etc.).

Anyone is free to join me in this; we can schedule and pair, as this has no urgency at all.

Anyone can also contribute by listing functions that could go inside this module. I'll get started:

  • list paths to the CSVs (.xz), i.e. the existing datasets
  • load translations
  • list all records (with or without filters) from the datasets
  • … 
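
To make the first and third items concrete, here is a rough sketch using only the standard library. The function names, the `data` default directory and the equality-only filters are assumptions for illustration, not an agreed design for the module.

```python
import csv
import lzma
from pathlib import Path


def list_datasets(data_dir='data'):
    """Return the paths of the compressed CSV datasets (.xz), sorted by name."""
    return sorted(Path(data_dir).glob('*.xz'))


def iter_records(path, **equals):
    """Yield rows of a dataset as dicts, keeping rows matching every filter.

    Filters compare raw strings, e.g. iter_records(path, year='2016').
    """
    with lzma.open(path, mode='rt', encoding='utf-8', newline='') as handle:
        for row in csv.DictReader(handle):
            if all(row.get(column) == value for column, value in equals.items()):
                yield row
```

A call such as `for record in iter_records(list_datasets()[0], year='2016'): ...` streams rows one by one, so large datasets never need to fit in memory at once.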

Translate dataset to English

The main language of the project is English: it mixes well with programming languages like Python and provides a low barrier for non-Brazilian contributors. Today, the dataset we make available by default is a set of XMLs from the Chamber of Deputies, in Portuguese. We need attribute names and categorical values to be translated to English.

Telegram group link

The link is no longer available in the new README.md, and it is not available on our website either.

Benford analysis on net values as features.

Since we are auditing data, Benford analysis came as a first thought.

For those unfamiliar with the topic, we compare the distribution of leading digits in the data with the naturally expected distribution.

Intro - https://en.wikipedia.org/wiki/Benford%27s_law
More info - https://www.agacgfm.org/AGA/FraudToolkit/documents/BenfordsLaw.pdf

It should NOT be applied directly to human-generated values (such as service prices). However, we can use deviation measures (absolute difference, squared difference, etc.) and sample statistics as input for other classification algorithms in our framework.
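
As an illustration of those deviation measures, here is a minimal Python sketch (not the R code linked below). It compares the observed share of each leading digit with the Benford expectation log10(1 + 1/d); the function names and the sample values are made up.

```python
from collections import Counter
from math import log10

# Expected share of each leading digit 1..9 under Benford's law
BENFORD = {digit: log10(1 + 1 / digit) for digit in range(1, 10)}


def leading_digit(value):
    """Return the first non-zero digit of a monetary value, or None for zero."""
    digits = f'{abs(value):.2f}'.replace('.', '').lstrip('0')
    return int(digits[0]) if digits else None


def benford_features(values):
    """Return deviation measures between observed and expected digit shares."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError('no non-zero values to analyze')
    counts = Counter(digits)
    observed = {d: counts[d] / len(digits) for d in BENFORD}
    return {
        'abs_diff': sum(abs(observed[d] - BENFORD[d]) for d in BENFORD),
        'squared_diff': sum((observed[d] - BENFORD[d]) ** 2 for d in BENFORD),
    }


# Made-up net values for two hypothetical groups of receipts
mixed = benford_features([10, 11, 19, 23, 35, 47, 52, 68, 71, 94])
all_nines = benford_features([9.5, 90.0, 99.9, 950.0])
```

Computed per subquota or per congressperson, the two numbers become ordinary columns that a downstream classifier can use, which is the indirect use suggested above.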

I started a notebook using the R "benford.analysis" package to perform basic exploration of the data. I also started to develop the "benford.subquota" function to extract features from the serenata_de_amor dataset.

Notebook: https://github.com/pingfreud/serenata-de-amor/blob/R-analysis/develop/2016-09-20-im-benford-analysis.ipynb
Function: https://github.com/pingfreud/serenata-de-amor/blob/R-analysis/develop/benford.subquota.R

I don't have formal training in programming, so code-related feedback is extremely welcome.

If you are willing to collaborate, check the pending goals in the files listed above. I would be happy to discuss other related topics.

Negative expenses on the dataset

@Irio's descriptive analysis shows that we have documents with negative net_value in the 2016-08-08-last-year.xz dataset. All of them are flight tickets issued but not used by the congressperson.

My hypothesis is: for each of those documents there must be a corresponding document with a positive net_value, where the flight ticket was first issued.

This can easily be checked, since document_number should be the same for both documents.
One thing to keep in mind is that airlines usually don't offer full refunds. That means the negative values will often be smaller in absolute terms than their positive counterparts.

I think it's important to validate this because negative values without a corresponding positive one can hurt the accuracy of calculations regarding net_value.
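
A minimal sketch of that check, assuming records are dicts with the `document_number` and `net_value` fields mentioned above and that a refund never exceeds the amount originally paid (the sample values are made up):

```python
from collections import defaultdict


def unmatched_negatives(records):
    """Return negative records with no plausible positive counterpart."""
    positives = defaultdict(list)
    for record in records:
        if record['net_value'] > 0:
            positives[record['document_number']].append(record['net_value'])

    unmatched = []
    for record in records:
        if record['net_value'] < 0:
            originals = positives.get(record['document_number'], [])
            # A refund should not be larger than the original ticket price
            if not any(abs(record['net_value']) <= value for value in originals):
                unmatched.append(record)
    return unmatched


# Made-up records: one refunded ticket and one orphan negative value
sample = [
    {'document_number': 'ABC123', 'net_value': 500.0},
    {'document_number': 'ABC123', 'net_value': -450.0},
    {'document_number': 'XYZ999', 'net_value': -120.0},
]
orphans = unmatched_negatives(sample)
```

An empty result on the real dataset would support the hypothesis; anything returned is a record worth inspecting by hand.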

Invitation to Telegram group expired

The invitation links for the Telegram group in the README files (pt and en) expired, returning the following error in the web interface:

Method: messages.checkChatInvite
Url: N/A
Result: {"_":"rpc_error","error_code":400,"error_message":"INVITE_HASH_EXPIRED"}
Stack: Error
    at Object.h [as invokeApi] (https://web.telegram.org/js/app.js:23:25735)
    at v (https://web.telegram.org/js/app.js:25:31282)
    at g (https://web.telegram.org/js/app.js:25:28732)
    at m (https://web.telegram.org/js/app.js:25:28121)
    at Object.f [as start] (https://web.telegram.org/js/app.js:25:31073)
    at new <anonymous> (https://web.telegram.org/js/app.js:26:13653)
    at Object.r [as instantiate] (https://web.telegram.org/js/app.js:11:21370)
    at https://web.telegram.org/js/app.js:12:15279
    at Object.link (https://web.telegram.org/js/app.js:16:12334)
    at https://web.telegram.org/js/app.js:11:6148

The Android app returns:

Sorry, this chat does not seem to exist.

Compare expenses made with lodging against official prices of rooms

Filtering the quota dataset by records with the value 'Lodging, except for congressperson from Distrito Federal' in the column subquota_description returns many expenses made at hotels. We could match the value on the receipt against publicly available price ranges (through Booking.com, for instance).
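
The filter and comparison could be sketched as follows. The `price_ranges` mapping is hypothetical (it would have to be collected separately, per hotel), and using `net_value` as the receipt value and `cnpj_cpf` as the hotel key are assumptions.

```python
LODGING = 'Lodging, except for congressperson from Distrito Federal'


def lodging_outside_range(records, price_ranges):
    """Return lodging records whose value falls outside the known price range.

    price_ranges: dict mapping cnpj_cpf -> (lowest, highest) price
        (hypothetical input, gathered from public listings).
    """
    flagged = []
    for record in records:
        if record['subquota_description'] != LODGING:
            continue
        bounds = price_ranges.get(record['cnpj_cpf'])
        if bounds is None:
            continue  # no reference prices for this hotel
        lowest, highest = bounds
        if not lowest <= record['net_value'] <= highest:
            flagged.append(record)
    return flagged


# Made-up records and prices, for illustration only
records = [
    {'subquota_description': LODGING, 'cnpj_cpf': '00000000000191', 'net_value': 900.0},
    {'subquota_description': LODGING, 'cnpj_cpf': '00000000000191', 'net_value': 250.0},
    {'subquota_description': 'Congressperson meal', 'cnpj_cpf': '00000000000272', 'net_value': 900.0},
]
price_ranges = {'00000000000191': (150.0, 400.0)}
flagged = lodging_outside_range(records, price_ranges)
```

A single receipt may cover several nights, so a value above the range is a reason to look at the receipt rather than proof of overpricing.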

Brazilian Air Force and congresspeople traveling for free

This is not an issue. But I thought that I could document the suggestion here:

Brazilian Air Force aircraft can be used by Brazilian officials (congresspeople included). Luckily, they keep a very transparent record of who flew and when, along with departure and destination information: http://www.fab.mil.br/voos

Unfortunately, it looks like this is not the congresspeople's preferred way to travel (there are just a few records of them), and the Brazilian Air Force records the title of the official (president of whatever), not her/his name, which makes things difficult.

Anyway, people working with data on transportation (cc @Irio @andrepinho) might want to take a look.

Share final data on Kaggle community

Kaggle's community is very large, the users are willing to help noble causes and the technical skills of the top tiers are amazing.

I shared some data I scraped from the web describing terrorist attacks and got contributions from several users.

https://www.kaggle.com/argolof/predicting-terrorism

I think the project can make use of Kaggle to increase predictive performance and maybe get some different perspectives from people around the globe.

Looking for corruption on the Federal Budget

The Brazilian Constitution allows each congressperson to allocate a portion of the federal budget to a specific purpose. The problem is that the law also allows the congressperson to name the institution (NGO, association, foundation, public agency) that will receive the money. This creates a major risk of embezzlement if the money goes to entities controlled by the congressperson themselves.

The federal government publishes the list of entities that received funds in this way. The list indicates which entity received the money, what it was supposed to do with it, and which congressperson authored the amendment. Address: http://portal.convenios.gov.br/images/docs/CGSIS/csv/siconv_emenda.csv.zip

We could build a tool to check the reputation of such entities. This information would flag amendments with a high risk of corruption.

We can get information about reputation from various sources: protests over civil debts (searching by CNPJ on sites such as http://www.ieptb.com.br/), lawsuits against the entity (court websites, JusBrasil), civil and criminal actions against its directors (court websites), donations from its directors to the campaign of the congressperson or their party (TSE), and others.
