
querido-diario's Introduction

Português (BR) | English (US)

Querido Diário


Within the Querido Diário ecosystem, this repository is responsible for scraping the websites that publish official gazettes.

Learn more about the technologies and the history of the project on the Querido Diário website.

Table of contents

How to contribute


Thank you for considering contributing to Querido Diário! 🎉

You can find out how to do so in CONTRIBUTING.md!

Also, check the Querido Diário documentation for help.

Development environment

You need Python (3.0+) and the Scrapy framework installed.

The commands below set up the environment on Linux. They create a Python virtual environment, install the requirements listed in requirements-dev, and install the pre-commit code-standardization tool.

python3 -m venv .venv
source .venv/bin/activate
pip install -r data_collection/requirements-dev.txt
pre-commit install

Setup instructions for other operating systems are available in "how to set up the development environment", including more details for those who want to contribute to the repository's development.

Scraper template

Instead of starting a scraper file from scratch, you can initialize a scraper file that already follows the Querido Diário standard from a template. To do so:

  1. Go to the data_collection directory:
cd data_collection
  2. Run the template:
scrapy genspider -t qdtemplate <uf_nome_do_municipio> <https://sitedomunicipio...>

A uf_nome_do_municipio.py file will be created in the spiders directory with some fields already filled in. The directory is organized by state (UF), so remember to move the file into the appropriate subdirectory.
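For orientation, the sketch below approximates the shape such a spider takes, using plain Scrapy plus the item fields that appear in the crawl logs further down this page (date, file_urls, is_extra_edition, power, territory_id, scraped_at). All names, URLs, and selectors are placeholders; the real template pre-fills project-specific fields, so treat this only as an illustration.

from datetime import datetime

import scrapy


class UfNomeDoMunicipioSpider(scrapy.Spider):
    # All values below are placeholders for illustration only.
    name = "uf_nome_do_municipio"
    allowed_domains = ["sitedomunicipio.example"]
    start_urls = ["https://sitedomunicipio.example/diarios"]
    territory_id = "0000000"  # IBGE code of the municipality

    def parse(self, response):
        # Placeholder selectors: adapt to the real gazette listing page.
        for edition in response.css(".edition"):
            yield {
                "territory_id": self.territory_id,
                "date": datetime.strptime(edition.attrib["data-date"], "%Y-%m-%d").date(),
                "file_urls": [response.urljoin(edition.css("a::attr(href)").get())],
                "is_extra_edition": False,
                "power": "executive",
                "scraped_at": datetime.utcnow(),
            }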

How to run

To try running a scraper already integrated into the project, or to test the one you are developing, follow these steps:

  1. If you haven't already, activate the virtual environment in the /querido-diario directory:
source .venv/bin/activate
  2. Go to the data_collection directory:
cd data_collection
  3. Check the list of available scrapers:
scrapy list
  4. Run a scraper from the list:
scrapy crawl <nome_do_raspador>       # example: scrapy crawl ba_acajutiba
  5. The gazettes collected by the crawl will be saved in the data_collection/data directory

Execution tips

In addition to the commands above, Scrapy offers other options to configure the crawl command. The options below can be used on their own or combined.

  • Date limits
    When running step 4, the scraper will collect every official gazette from that municipality's publishing site. For shorter runs, use the attribute flag -a followed by:

start_date=YYYY-MM-DD: sets the start date for gazette collection.

scrapy crawl <nome_do_raspador> -a start_date=<YYYY-MM-DD>

end_date=YYYY-MM-DD: sets the end date for gazette collection. If omitted, it defaults to the day the command is run.

scrapy crawl <nome_do_raspador> -a end_date=<YYYY-MM-DD>
  • Log file
    You can send the crawl log to a file instead of leaving it in the terminal. This is particularly useful when you are developing a scraper that has problems and want to attach the log file to your PR to ask for help. To do so, use the settings flag -s followed by:

LOG_FILE=log_<nome_do_municipio>.txt: sets the file where log messages are stored.

scrapy crawl <nome_do_raspador> -s LOG_FILE=log_<nome_do_municipio>.txt
  • Crawl results table
    You can also build a table listing every gazette and all the metadata collected by the crawl, making it easier to see how the scraper is behaving. To do so, use the output flag -o followed by a file name.
scrapy crawl <nome_do_raspador> -o <nome_do_municipio>.csv

Troubleshooting

Check the troubleshooting file to solve the most common issues with the project's environment setup.

Support

Discord Invite

Join our community channel to discuss the projects, ask questions, get help with contributing, and talk about civic innovation in general.

Acknowledgements

This project is maintained by Open Knowledge Brasil and made possible thanks to the technical communities, the Civic Innovation Ambassadors, volunteers and financial donors, as well as partner universities, supporting companies, and funders.

Meet those who support Querido Diário.

Open Knowledge Brasil


Open Knowledge Brasil is a non-profit civil society organization whose mission is to use and develop civic tools, projects, public policy analysis, and data journalism to promote free knowledge across the various fields of society.

All work produced by OKBR is freely available.

License

Code licensed under the MIT License.

querido-diario's People

Contributors

alfakini, alvarolqueiroz, anapaulagomes, antoniovendramin, ayharano, brunolellis, cuducos, danielbom, dannnylo, ddevdan, feliperuhland, gbonesso, giovanisleite, he7d3r, irio, jaswdr, jvanz, luzfcb, ogecece, pgarcias01, rennerocha, rodbv, rodolfolottin, sergiomario, tatilattanzi, trevineju, vicitel, victor-torres, vitorbaptista, winzen


querido-diario's Issues

Cities

Hey people,

Adding @cuducos's work-in-progress tracking here to help us keep track of the work to be done.

✅ Done
🔜 In progress

# City Crawler Parser Issue PR
1 São Paulo #7
2 Rio de Janeiro 🔜 #15 #29
3 Brasília
4 Salvador #47
5 Fortaleza 🔜 #52
6 Belo Horizonte #33
7 Manaus 🔜 #51
8 Curitiba #42
9 Recife 🔜
10 Porto Alegre
11 Goiânia 🔜 #6
12 Belém
13 Guarulhos #4
14 Campinas #2
15 São Luís 🔜 #22
16 São Gonçalo
17 Maceió 🔜 #32
18 Duque de Caxias
19 Natal
20 Campo Grande #35
21 Teresina 🔜 #53
22 São Bernardo do Campo
23 João Pessoa
24 Nova Iguaçu
25 Santo André
26 São José dos Campos
27 Osasco
28 Jaboatão dos Guararapes
29 Ribeirão Preto #31
30 Uberlândia #37
31 Sorocaba
32 Contagem
33 Aracaju
34 Feira de Santana 🔜 #25
35 Cuiabá
36 Joinville 🔜 #30
37 Juiz de Fora 🔜 #12 #13
38 Londrina
39 Aparecida de Goiânia
40 Porto Velho
41 Ananindeua
42 Serra
43 Niterói
44 Belford Roxo
45 Campos dos Goytacazes
46 Vila Velha
47 Florianópolis #17
48 Caxias do Sul
49 Macapá
50 Mauá
51 São João de Meriti
52 São José do Rio Preto
53 Santos 🔜 #14
54 Mogi das Cruzes
55 Betim
56 Diadema
57 Campina Grande
58 Jundiaí
59 Maringá
60 Montes Claros 🔜 #26
61 Piracicaba
62 Carapicuíba
63 Olinda
64 Cariacica
65 Rio Branco
66 Anápolis
67 Bauru
68 Vitória
69 Caucaia
70 Itaquaquecetuba
71 São Vicente
72 Caruaru
73 Vitória da Conquista
74 Blumenau
75 Franca #5
76 Pelotas
77 Ponta Grossa #45
78 Canoas #10
79 Petrolina
80 Boa Vista
81 Ribeirão das Neves
82 Paulista
83 Uberaba
84 Cascavel
85 Guarujá
86 Praia Grande
87 Taubaté
88 São José dos Pinhais
89 Limeira
90 Petrópolis
91 Camaçari
92 Santarém
93 Mossoró
94 Suzano
95 Palmas #1
96 Governador Valadares 🔜 #19
97 Taboão da Serra
98 Santa Maria
99 Gravataí
100 Várzea Grande
XXX Foz do Iguaçu #34 #27
XXX Araguaina #3

Make seed error

I ran docker-compose down and make setup

make seed
make[1]: Entering directory '/home/giovani/workspace/diario-oficial'
docker-compose up --detach postgres
Builds, (re)creates, starts, and attaches to containers for a service.

Unless they are already running, this command also starts any linked services.

The `docker-compose up` command aggregates the output of each container. When
the command exits, all containers are stopped. Running `docker-compose up -d`
starts the containers in the background and leaves them running.

If there are existing containers for a service, and the service's configuration
or image was changed after the container's creation, `docker-compose up` picks
up the changes by stopping and recreating the containers (preserving mounted
volumes). To prevent Compose from picking up changes, use the `--no-recreate`
flag.

If you want to force Compose to stop and recreate all containers, use the
`--force-recreate` flag.

Usage: up [options] [--scale SERVICE=NUM...] [SERVICE...]

Options:
    -d                         Detached mode: Run containers in the background,
                               print new container names.
                               Incompatible with --abort-on-container-exit.
    --no-color                 Produce monochrome output.
    --no-deps                  Don't start linked services.
    --force-recreate           Recreate containers even if their configuration
                               and image haven't changed.
                               Incompatible with --no-recreate.
    --no-recreate              If containers already exist, don't recreate them.
                               Incompatible with --force-recreate.
    --no-build                 Don't build an image, even if it's missing.
    --build                    Build images before starting containers.
    --abort-on-container-exit  Stops all containers if any container was stopped.
                               Incompatible with -d.
    -t, --timeout TIMEOUT      Use this timeout in seconds for container shutdown
                               when attached or when containers are already
                               running. (default: 10)
    --remove-orphans           Remove containers for services not
                               defined in the Compose file
    --exit-code-from SERVICE   Return the exit code of the selected service container.
                               Implies --abort-on-container-exit.
    --scale SERVICE=NUM        Scale SERVICE to NUM instances. Overrides the `scale`
                               setting in the Compose file if present.
Makefile:16: recipe for target 'seed' failed
make[1]: *** [seed] Error 1
make[1]: Leaving directory '/home/giovani/workspace/diario-oficial'
Makefile:10: recipe for target 'setup' failed
make: *** [setup] Error 2

Error in the web container: Request failed with status code 400

Hi, I tried to run the project (as described in the README.md: make setup && docker-compose up) – all went well except for the webserver:


The output in the docker-compose logs repeatedly shows:

web_1         |  warning  in ./node_modules/bulma/bulma.sass
web_1         | 
web_1         | (Emitted value instead of an instance of Error) postcss-custom-properties: /mnt/code/node_modules/bulma/bulma.sass:5915:5: Custom property ignored: not scoped to the top-level :root element (.columns.is-variable.is-8 { ... --columnGap: ... })
web_1         | 
web_1         |  @ ./node_modules/bulma/bulma.sass 4:14-152 13:3-17:5 14:22-160
web_1         |  @ ./.nuxt/App.js
web_1         |  @ ./.nuxt/index.js
web_1         |  @ ./.nuxt/client.js
web_1         |  @ multi webpack-hot-middleware/client?name=client&reload=true&timeout=30000&path=/__webpack_hmr ./.nuxt/client.js

And when I hit localhost:8080 there is a different (rather longer) error message in the logs.

web_1         | { Error: Request failed with status code 400
web_1         |     at createError (/mnt/code/node_modules/axios/lib/core/createError.js:16:15)
web_1         |     at settle (/mnt/code/node_modules/axios/lib/core/settle.js:18:12)
web_1         |     at IncomingMessage.handleStreamEnd (/mnt/code/node_modules/axios/lib/adapters/http.js:201:11)
web_1         |     at IncomingMessage.emit (events.js:185:15)
web_1         |     at endReadableNT (_stream_readable.js:1106:12)
web_1         |     at process._tickCallback (internal/process/next_tick.js:178:19)
web_1         |   config: 
web_1         |    { adapter: [Function: httpAdapter],
web_1         |      transformRequest: { '0': [Function: transformRequest] },
web_1         |      transformResponse: { '0': [Function: transformResponse] },
web_1         |      timeout: 0,
web_1         |      xsrfCookieName: 'XSRF-TOKEN',
web_1         |      xsrfHeaderName: 'X-XSRF-TOKEN',
web_1         |      maxContentLength: -1,
web_1         |      validateStatus: [Function: validateStatus],
web_1         |      headers: 
web_1         |       { Accept: 'application/json, text/plain, */*',
web_1         |         'User-Agent': 'axios/0.18.0' },
web_1         |      method: 'get',
web_1         |      url: 'http://api:3000/bidding_exemptions?select=*,gazette{file_url,is_extra_edition,power}&order=date.desc',
web_1         |      data: undefined },
web_1         |   request: 
web_1         |    ClientRequest {
web_1         |      _events: 
web_1         |       { socket: [Function],
web_1         |         abort: [Function],
web_1         |         aborted: [Function],
web_1         |         error: [Function],
web_1         |         timeout: [Function],
web_1         |         prefinish: [Function: requestOnPrefinish] },
web_1         |      _eventsCount: 6,
web_1         |      _maxListeners: undefined,
web_1         |      output: [],
web_1         |      outputEncodings: [],
web_1         |      outputCallbacks: [],
web_1         |      outputSize: 0,
web_1         |      writable: true,
web_1         |      _last: true,
web_1         |      upgrading: false,
web_1         |      chunkedEncoding: false,
web_1         |      shouldKeepAlive: false,
web_1         |      useChunkedEncodingByDefault: false,
web_1         |      sendDate: false,
web_1         |      _removedConnection: false,
web_1         |      _removedContLen: false,
web_1         |      _removedTE: false,
web_1         |      _contentLength: 0,
web_1         |      _hasBody: true,
web_1         |      _trailer: '',
web_1         |      finished: true,
web_1         |      _headerSent: true,
web_1         |      socket: 
web_1         |       Socket {
web_1         |         connecting: false,
web_1         |         _hadError: false,
web_1         |         _handle: null,
web_1         |         _parent: null,
web_1         |         _host: 'api',
web_1         |         _readableState: [ReadableState],
web_1         |         readable: false,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 7,
web_1         |         _maxListeners: undefined,
web_1         |         _writableState: [WritableState],
web_1         |         writable: false,
web_1         |         _bytesDispatched: 210,
web_1         |         _sockname: null,
web_1         |         _pendingData: null,
web_1         |         _pendingEncoding: '',
web_1         |         allowHalfOpen: false,
web_1         |         server: null,
web_1         |         _server: null,
web_1         |         parser: null,
web_1         |         _httpMessage: [Circular],
web_1         |         _idleNext: null,
web_1         |         _idlePrev: null,
web_1         |         _idleTimeout: -1,
web_1         |         [Symbol(asyncId)]: 27540,
web_1         |         [Symbol(lastWriteQueueSize)]: 0,
web_1         |         [Symbol(bytesRead)]: 312 },
web_1         |      connection: 
web_1         |       Socket {
web_1         |         connecting: false,
web_1         |         _hadError: false,
web_1         |         _handle: null,
web_1         |         _parent: null,
web_1         |         _host: 'api',
web_1         |         _readableState: [ReadableState],
web_1         |         readable: false,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 7,
web_1         |         _maxListeners: undefined,
web_1         |         _writableState: [WritableState],
web_1         |         writable: false,
web_1         |         _bytesDispatched: 210,
web_1         |         _sockname: null,
web_1         |         _pendingData: null,
web_1         |         _pendingEncoding: '',
web_1         |         allowHalfOpen: false,
web_1         |         server: null,
web_1         |         _server: null,
web_1         |         parser: null,
web_1         |         _httpMessage: [Circular],
web_1         |         _idleNext: null,
web_1         |         _idlePrev: null,
web_1         |         _idleTimeout: -1,
web_1         |         [Symbol(asyncId)]: 27540,
web_1         |         [Symbol(lastWriteQueueSize)]: 0,
web_1         |         [Symbol(bytesRead)]: 312 },
web_1         |      _header: 'GET /bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc HTTP/1.1\r\nAccept: application/json, text/plain, */*\r\nUser-Agent: axios/0.18.0\r\nHost: api:3000\r\nConnection: close\r\n\r\n',
web_1         |      _onPendingData: [Function: noopPendingOutput],
web_1         |      agent: 
web_1         |       Agent {
web_1         |         _events: [Object],
web_1         |         _eventsCount: 1,
web_1         |         _maxListeners: undefined,
web_1         |         defaultPort: 80,
web_1         |         protocol: 'http:',
web_1         |         options: [Object],
web_1         |         requests: {},
web_1         |         sockets: [Object],
web_1         |         freeSockets: {},
web_1         |         keepAliveMsecs: 1000,
web_1         |         keepAlive: false,
web_1         |         maxSockets: Infinity,
web_1         |         maxFreeSockets: 256 },
web_1         |      socketPath: undefined,
web_1         |      timeout: undefined,
web_1         |      method: 'GET',
web_1         |      path: '/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc',
web_1         |      _ended: true,
web_1         |      res: 
web_1         |       IncomingMessage {
web_1         |         _readableState: [ReadableState],
web_1         |         readable: false,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 3,
web_1         |         _maxListeners: undefined,
web_1         |         socket: [Socket],
web_1         |         connection: [Socket],
web_1         |         httpVersionMajor: 1,
web_1         |         httpVersionMinor: 1,
web_1         |         httpVersion: '1.1',
web_1         |         complete: true,
web_1         |         headers: [Object],
web_1         |         rawHeaders: [Array],
web_1         |         trailers: {},
web_1         |         rawTrailers: [],
web_1         |         upgrade: false,
web_1         |         url: '',
web_1         |         method: null,
web_1         |         statusCode: 400,
web_1         |         statusMessage: 'Bad Request',
web_1         |         client: [Socket],
web_1         |         _consuming: true,
web_1         |         _dumped: false,
web_1         |         req: [Circular],
web_1         |         responseUrl: 'http://api:3000/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc',
web_1         |         read: [Function] },
web_1         |      aborted: undefined,
web_1         |      timeoutCb: null,
web_1         |      upgradeOrConnect: false,
web_1         |      parser: null,
web_1         |      maxHeadersCount: null,
web_1         |      _redirectable: 
web_1         |       Writable {
web_1         |         _writableState: [WritableState],
web_1         |         writable: true,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 2,
web_1         |         _maxListeners: undefined,
web_1         |         _options: [Object],
web_1         |         _redirectCount: 0,
web_1         |         _requestBodyLength: 0,
web_1         |         _requestBodyBuffers: [],
web_1         |         _onNativeResponse: [Function],
web_1         |         _currentRequest: [Circular],
web_1         |         _currentUrl: 'http://api:3000/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc' },
web_1         |      [Symbol(isCorked)]: false,
web_1         |      [Symbol(outHeadersKey)]: { accept: [Array], 'user-agent': [Array], host: [Array] } },
web_1         |   response: 
web_1         |    { status: 400,
web_1         |      statusText: 'Bad Request',
web_1         |      headers: 
web_1         |       { 'transfer-encoding': 'chunked',
web_1         |         date: 'Wed, 25 Apr 2018 14:05:40 GMT',
web_1         |         server: 'postgrest/0.4.4.0 (f9e770b)',
web_1         |         'content-type': 'application/json; charset=utf-8' },
web_1         |      config: 
web_1         |       { adapter: [Function: httpAdapter],
web_1         |         transformRequest: [Object],
web_1         |         transformResponse: [Object],
web_1         |         timeout: 0,
web_1         |         xsrfCookieName: 'XSRF-TOKEN',
web_1         |         xsrfHeaderName: 'X-XSRF-TOKEN',
web_1         |         maxContentLength: -1,
web_1         |         validateStatus: [Function: validateStatus],
web_1         |         headers: [Object],
web_1         |         method: 'get',
web_1         |         url: 'http://api:3000/bidding_exemptions?select=*,gazette{file_url,is_extra_edition,power}&order=date.desc',
web_1         |         data: undefined },
web_1         |      request: 
web_1         |       ClientRequest {
web_1         |         _events: [Object],
web_1         |         _eventsCount: 6,
web_1         |         _maxListeners: undefined,
web_1         |         output: [],
web_1         |         outputEncodings: [],
web_1         |         outputCallbacks: [],
web_1         |         outputSize: 0,
web_1         |         writable: true,
web_1         |         _last: true,
web_1         |         upgrading: false,
web_1         |         chunkedEncoding: false,
web_1         |         shouldKeepAlive: false,
web_1         |         useChunkedEncodingByDefault: false,
web_1         |         sendDate: false,
web_1         |         _removedConnection: false,
web_1         |         _removedContLen: false,
web_1         |         _removedTE: false,
web_1         |         _contentLength: 0,
web_1         |         _hasBody: true,
web_1         |         _trailer: '',
web_1         |         finished: true,
web_1         |         _headerSent: true,
web_1         |         socket: [Socket],
web_1         |         connection: [Socket],
web_1         |         _header: 'GET /bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc HTTP/1.1\r\nAccept: application/json, text/plain, */*\r\nUser-Agent: axios/0.18.0\r\nHost: api:3000\r\nConnection: close\r\n\r\n',
web_1         |         _onPendingData: [Function: noopPendingOutput],
web_1         |         agent: [Agent],
web_1         |         socketPath: undefined,
web_1         |         timeout: undefined,
web_1         |         method: 'GET',
web_1         |         path: '/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc',
web_1         |         _ended: true,
web_1         |         res: [IncomingMessage],
web_1         |         aborted: undefined,
web_1         |         timeoutCb: null,
web_1         |         upgradeOrConnect: false,
web_1         |         parser: null,
web_1         |         maxHeadersCount: null,
web_1         |         _redirectable: [Writable],
web_1         |         [Symbol(isCorked)]: false,
web_1         |         [Symbol(outHeadersKey)]: [Object] },
web_1         |      data: 
web_1         |       { message: 'Could not find foreign keys between these entities, No relation found between bidding_exemptions and gazette' } },
web_1         |   statusCode: 500,
web_1         |   name: 'NuxtServerError' }

As a newbie to Vue.js and Nuxt, may I ask if anyone has any clue about what's going on? Many thanks : )

Deployable package.

We need a script to create a package/container to make deploying the spiders in a production environment easier. I think the easiest way to go is to use the same Dockerfile we use to run the spiders in docker-compose to build the container used in production. But we need some automation to build the container and publish it to a container registry somewhere, like Docker Hub or a registry from the infrastructure provider (e.g. Digital Ocean).

Another option is to build an rpm/deb package with something similar to the Open Build Service and install it on the production server from a deb/rpm repository.

This issue is related to issue #157

Aracaju/SE Spider

Hi everybody, I'm stuck trying to build the Aracaju city spider. This is the main site for the gazettes: http://sga.aracaju.se.gov.br:5011/legislacao/faces/diario_form_pesq.jsp. It's a JSP page and the requests must contain some session data. I can retrieve the gazettes through a direct link by the gazette's official number (http://sga.aracaju.se.gov.br:5011/diarios/3970.pdf), but in that case I don't have the extra information, like the publication date, for example. Is there a good approach to deal with this?
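Not an official answer, but one possible approach, sketched under assumptions (the form and field names on the JSP page are placeholders here): submit the search form with FormRequest.from_response so Scrapy keeps the session cookie the server expects, read the publication date from the results page, and only then follow the PDF link.

import scrapy


class AracajuSketchSpider(scrapy.Spider):
    # Illustrative only; selectors and form fields below are placeholders.
    name = "se_aracaju_sketch"
    start_urls = ["http://sga.aracaju.se.gov.br:5011/legislacao/faces/diario_form_pesq.jsp"]

    def parse(self, response):
        # Submitting through the page's own form keeps the hidden JSF fields
        # and the session cookie that later requests depend on.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"dataInicial": "01/01/2020", "dataFinal": "31/01/2020"},  # placeholder fields
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # Each result row is assumed to expose the gazette date and a link.
        for row in response.css("table tr"):
            href = row.css("a::attr(href)").get()
            gazette_date = row.css("td.data::text").get()
            if href and gazette_date:
                yield {"date": gazette_date, "file_urls": [response.urljoin(href)]}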

WebUI to find and visualize the gazettes

In order to make the gazettes available to the public, we need a WebUI that allows users to search the files by city and date. The initial idea is to use Django for that.

Related to issue #157

Make the gazettes available to the public

Over the past few days I've been discussing with @sergiomario how to run the spiders in production and make the scraped files available on a central web page. The first version does not need to be too fancy. The idea is to run the spiders on a server/cluster, store the files, and build a simple web page allowing the user to search and read the scraped files.

As Serenata de Amor already runs on Digital Ocean, I think we can continue with the same provider. All we need for the first version is a server/k8s cluster, PostgreSQL, and file storage. We can address all of these needs with the DO products available.

To achieve this goal, we see the following issues that need to be addressed:

  1. Where to run the spiders, the API and the web page: a simple server with a cron job, or something more sophisticated, like a Kubernetes cluster, to run the workloads.
  2. Avoid unnecessary requests: if we have already collected the gazettes up to 02/19/2020, start the spider from 02/20/2020 (by the way, really cool date xD).
  3. Automation: we are few people, so we should automate as much as possible.
  4. UX: find a UX wizard to build a nice web page.

@sergiomario, am I forgetting something?

Apache Tika

Try using Apache Tika to extract data from DOC and PDF files, replacing the current pipeline steps that extract text from the downloaded files.
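A minimal sketch of what that replacement step could look like, assuming the tika Python client (which talks to a local Tika server and needs a Java runtime); this is not necessarily how the project ended up doing it:

from tika import parser  # pip install tika


def extract_text(file_path):
    # Tika handles PDF, DOC and many other formats behind a single call.
    parsed = parser.from_file(file_path)
    return parsed.get("content") or ""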

Rio de Janeiro/RJ Crawl

Hello guys.

I'm having trouble understanding the crawler results for Rio de Janeiro...

If I test the Rio de Janeiro crawler (following the instructions in CONTRIBUTING.md):

sudo docker-compose run --rm processing bash -c "cd data_collection && scrapy crawl rj_rio_de_janeiro"

The result seems to be wrong:

[...]

2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864> referred in <None>
2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864> referred in <None>
2018-08-27 00:32:53 [scrapy.core.scraper] ERROR: Error processing {'date': datetime.date(2018, 8, 20),
 'file_urls': ['http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864'],
 'files': [{'checksum': '49228de889bf8edd753fad4b184adaa3',
            'path': 'full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok&edi_id=3864',
            'url': 'http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864'}],
 'is_extra_edition': True,
 'power': 'executive',
 'scraped_at': datetime.datetime(2018, 8, 27, 0, 32, 53, 7640),
 'territory_id': '3304557'}
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 14, in process_item
    item["source_text"] = self.pdf_source_text(item)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 29, in pdf_source_text
    with open(text_path) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok&edi_id=3864.txt'
I/O Error: Couldn't open file '/mnt/data/full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok': No such file or directory.
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=21/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=24/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=26/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3869> referred in <None>
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=25/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=28/07/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=30/07/2018> (referer: http://doweb.rio.rj.gov.br)

When I run the crawler of Porto Alegre (for comparison) I get an intelligible result:

sudo docker-compose run --rm processing bash -c "cd data_collection && scrapy crawl rs_porto_alegre"

result is:
[...]

                'EXTRATO DE TERMO ADITIVO\n'
                '     PROCESSO: 009.003517.14.4\n'
                '     CONTRATANTE: Departamento Municipal de Previdência dos '
                'Servidores Públicos do Município de Porto Alegre.\n'
                '     CONTRATADA: Agência Estado S/A.\n'
                '     OBJETO: prorrogação do contrato n. 02/2015 de licença de '
                'uso do software AE Broadcast Profissional, 04 pontos de '
                'acesso, por 12 meses, a contar de 01.04.2018.\n'
                '     Valor Mensal: R$ 9.625,04.\n'
                '     BASE LEGAL: Artigo 57, inciso II, da Lei 8.666/93 e suas '
                'alterações.\n'
                '\n'
                '                                                                                  '
                'Porto Alegre, 24 de abril de 2018.\n'
                '\n'
                '\n'
                '                                                                             '
                'RENAN DA SILVA AGUIAR, Diretor-Geral.\n'
                '\n'
                '\n'
                '\n'
                '\n'
                '      EXPEDIENTE\n'
                '\n'
                '\n'
                '      PREFEITURA MUNICIPAL DE PORTO ALEGRE\n'
                '      Diário Oficial Eletrônico de Porto Alegre\n'
                '      Órgão de Divulgação Oficial do Município\n'
                '      Instituído pela Lei nº 11.029 de 3 de janeiro de 2011\n'

Error when starting service, database keys missing?

When I try to run the service using docker-compose up and access http://localhost:8080/ I get this error:

Request failed with status code 400

The error shown in the terminal is related to the database:

Could not find foreign keys between these entities, 
No relation found between bidding_exemptions and gazette

[WIP] Spider for Ribeirão Preto/SP

@kerollaine and I are working together on this.

We've managed to create a simple Postman Collection showing the sequence of requests needed to download a single day Gazette from https://www.ribeiraopreto.sp.gov.br/J015/pesquisaData.xhtml. It is available here: https://documenter.getpostman.com/view/2394724/diariooficial/RW1boKvQ.

Because the application is JSF-based and maintains the view state on the server, we need to make all three requests to get a PDF file.

Cities that depend on js

What should our approach be for cities that depend on JS? I'm more used to Selenium and a browser, but we should probably define a single approach.
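For comparison, a headless-browser fetch is short but adds a browser dependency to the pipeline; a minimal Selenium sketch under that assumption (the URL and selector are placeholders, and render services such as Splash would be an alternative worth weighing):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    # Placeholder URL: a gazette page that only renders its links via JavaScript.
    driver.get("https://sitedomunicipio.example/diarios")
    # Once the browser has executed the page's scripts, the links are in the DOM.
    pdf_links = [
        a.get_attribute("href")
        for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")
    ]
finally:
    driver.quit()

print(pdf_links)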

CI pipeline

Take advantage of GitHub Actions to build a container image containing all the spiders, making it possible to pull it elsewhere and run it.

Code formatter?

I see a lot of feedback here focused on code formatting. What about automating this?

We could recommend Black and add a single line ($ black . --check) to our CI. This way, PEP8 and other common code-style linters would automatically raise a red flag, helping contributors keep the code compatible with best practices in the Python community.

Also, with $ black . we can reformat the whole code base at once to fit the style guide.

Surely I'm a candidate to implement this, but I'd like to hear any counter-arguments from you ; )

Fix Apache Tika URL

Currently, the URL we use to download Apache Tika into our container image (to run the spiders) changes annoyingly often. Every time a new version is released, the old one is deleted and our URL is no longer valid. We need to find a way to make this more stable. In the current version, the URL is maintained by Unicamp.

Off the top of my head, I think we can look for a URL that keeps more than one version at a time. That way, the binary would be available for longer.

List of the 100 largest municipalities in Brazil

Hi everyone, I want to start contributing to the project by picking a city to scrape, but the website has no information about which municipalities are the targets, only that they are the 100 largest in Brazil. Largest based on what? Population, GDP, etc.?

Change spiders to allow selecting the date/period of the items

Currently, when executing the spiders, all available gazettes are retrieved. This can be a waste of resources, since the extracted data (the gazettes) doesn't change.

I had a quick discussion with @Irio, and he suggested the following options:

  1. Include a parameter indicating the start date from which the spider should collect gazettes (see the sketch below);
  2. Check the database to identify when the last gazette was collected, and then collect everything after that.
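A sketch of what option 1 could look like (the spider name, selectors, and date format are placeholders): Scrapy passes -a command-line arguments to the spider's constructor, so the spider can receive the start date and skip older editions instead of downloading them again.

from datetime import datetime

import scrapy


class DateFilteredSketchSpider(scrapy.Spider):
    name = "date_filtered_sketch"
    start_urls = ["https://sitedomunicipio.example/diarios"]  # placeholder

    def __init__(self, start_date=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a start_date=2020-02-20 on the command line arrives here as a string.
        self.start_date = (
            datetime.strptime(start_date, "%Y-%m-%d").date() if start_date else None
        )

    def parse(self, response):
        for edition in response.css(".edition"):  # placeholder selector
            edition_date = datetime.strptime(
                edition.attrib["data-date"], "%Y-%m-%d"
            ).date()
            # Skip gazettes older than the requested window instead of fetching them.
            if self.start_date and edition_date < self.start_date:
                continue
            yield {
                "date": edition_date,
                "file_urls": [response.urljoin(edition.css("a::attr(href)").get())],
            }

It would then be invoked as scrapy crawl date_filtered_sketch -a start_date=2020-02-20.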

Suggestion: Use pytest and tox

I'd like to suggest using pytest instead of unittest. It's a very small change, the current tests don't need to change at all, and in return we'd get a few nice features from pytest. The one I miss the most is more useful assert error messages. For example, if I do assert my_dict == other_dict with unittest and it fails, I just get an AssertionError without any message. With the same code under pytest, I get a helpful message saying which elements of the dicts differ. The same goes for comparing arrays, dicts, or any other types.
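A tiny illustration of the difference, using a hypothetical test:

def test_gazette_metadata():
    parsed = {"date": "2018-08-20", "is_extra_edition": True}
    expected = {"date": "2018-08-20", "is_extra_edition": False}
    # Under pytest this failing assert prints exactly which key differs;
    # the same bare assert under unittest only raises a blank AssertionError.
    assert parsed == expected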

And tox is useful for running the tests with a single command, instead of having to remember python -m unittest discover when I want to run the tests outside of Docker. It keeps the virtualenvs configured, sets whatever environment variables are necessary, and can also run lint.

These two libraries have become the standard for most Python projects I see, and for good reason: they're great!

I'd be happy to send a PR with these changes (or just pytest) if you agree.

cc @Irio @cuducos

Command Execution

Hello. I need help with executing a command.
After executing the command:

docker-compose run --rm processing scrapy shell http://www2.portoalegre.rs.gov.br/dopa/

The system runs a few lines and then stops, waiting for further input.
Below is the output of the command:

docker-compose run --rm processing scrapy shell http://www2.portoalegre.rs.gov.br/dopa/
Starting diario-oficial_redis_1 ...
Starting diario-oficial_postgres_1 ... done
Starting diario-oficial_redis_1 ... done
2018-08-06 20:42:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-08-06 20:42:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.6 (default, Jul 17 2018, 11:12:33) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-4.9.93-linuxkit-aufs-x86_64-with-debian-9.5
2018-08-06 20:42:24 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-06 20:42:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-06 20:42:24 [scrapy.core.engine] INFO: Spider opened
2018-08-06 20:42:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www2.portoalegre.rs.gov.br/dopa/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f96c3f2b630>
[s] item {}
[s] request <GET http://www2.portoalegre.rs.gov.br/dopa/>
[s] response <200 http://www2.portoalegre.rs.gov.br/dopa/>
[s] settings <scrapy.settings.Settings object at 0x7f96c3f2b6a0>
[s] spider <DefaultSpider 'default' at 0x7f96c3b77470>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

Can you identify any errors?
Is it necessary to use a parameter?

Executing a crawler error - permission denied

I am getting this error when I try to run the crawlers...

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.guarulhos.sp.gov.br/uploads/pdf/2128651035.pdf>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 401, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 434, in file_downloaded
    self.store.persist_file(path, buf, info)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 54, in persist_file
    with open(absolute_path, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/mnt/data/full/a4bcac7d55a3f3e11b1dd8e0152d86037098dc5c.pdf'

I tried to change the permissions (chmod -R 0777 .) and rerun, but nothing changed...

Period of interest

Some of the spiders are limited to getting documents from 2014 up to the current year.
Should this rule be applied to all spiders?

Create date filter

Hello guys, as we discussed a few days ago, we found a potential problem in integrating OKFN's projects into Jusbrasil's stack for Brazilian data extraction.

We discussed some points about it, and we suggest some changes to the project:

  • Create a way to pass a date to filter the results.
  • Based on these filters, make as few requests as possible.
  • [Minor] Return the PDF links and more information if possible, like the publication date, etc.

An important point to remember is that we are available to help code these changes, if necessary.

Permission data/territories.csv

Hi,
I'm getting a permission error with data/territories.csv.
drwxr-xr-x 2 root root 452K jun 4 20:09 full
-rw-r--r-- 1 root root 193K jun 4 20:58 territories.csv

Should this file be owned by root?

How can I collaborate with the project?

Hi everyone!

In 2018 we held a sprint at Fab Lab Joinville to create spiders for several cities. At a glance, there are about 15 open PRs from that time.

I would like to bring the project back to our weekly Computing lab, where we bring in projects for the participants to contribute to. However, it is not very clear to me how we can contribute to it.

Should we review the spiders with open PRs?

Or build a portal to visualize the official gazettes?

Or perhaps extract data from the gazettes already collected?

I would like to be able to contribute to the project, and we have time and people willing to collaborate.

Looking forward to hearing back :)

Cheers!

Canoas/RS

I started working on the Canoas/RS crawler, but ran into some problems. The Canoas gazette website does not provide a direct link to the PDF; instead it uses cookies and a session for access, which complicates things a bit.

How it works: when you access the site and choose the gazette you want, a few requests are made to the server with the chosen day and pages, and then the PDF file is returned. If you access the file link directly, the server checks the session and returns either the PDF chosen in the previous requests or a 404. There is no unique path per file: the link is the same for every file, and what changes is only the date and pages set in the session.

So it is necessary to keep a separate session for each gazette day, which, from what I've seen, is possible with Scrapy. However, in the project today the crawler collects all the gazettes and file URLs and then hands them over to the next pipeline, which downloads the files. I believe cookies and the session are not kept across that pipeline hand-off, and since in this case the session is needed to download the file, I don't think it would work. What do you think is the best way to solve this?

Another problem is that we won't have a direct link to the gazette file as we do for Porto Alegre.
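One Scrapy feature that may help with the one-session-per-day part is the cookiejar meta key, which keeps separate cookie sessions within a single spider. The sketch below is only an illustration (URLs, dates, and the select/download steps are placeholders), and it does not by itself answer the question about the file-download pipeline:

import scrapy


class CanoasSessionSketchSpider(scrapy.Spider):
    name = "rs_canoas_sketch"

    def start_requests(self):
        days = ["2018-08-01", "2018-08-02"]  # placeholder list of gazette dates
        for i, day in enumerate(days):
            # A distinct cookiejar per day keeps each day's session cookies isolated.
            yield scrapy.Request(
                "https://diario.canoas.example/selecionar",  # placeholder URL
                meta={"cookiejar": i, "day": day},
                callback=self.select_day,
                dont_filter=True,
            )

    def select_day(self, response):
        # Requests issued from here reuse the same cookiejar, so the server-side
        # session set up for this day is still valid when fetching the PDF.
        yield scrapy.Request(
            "https://diario.canoas.example/arquivo.pdf",  # same URL for every day on the real site
            meta={"cookiejar": response.meta["cookiejar"], "day": response.meta["day"]},
            callback=self.save_pdf,
            dont_filter=True,
        )

    def save_pdf(self, response):
        yield {"date": response.meta["day"], "pdf_bytes": len(response.body)}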

Production infrastructure configuration

We need scripts to set up and configure a production environment for running the spiders.

My suggestion is to use Ansible for that job. We can use it to configure, install, and update the servers running the workloads. It is easy to use and does not require anything to be installed on the servers; all configuration is done via SSH.

Related to issue #157

Poor site of Duque de Caxias - RJ

The site has two formats of links and file names for the gazettes.

  1. http://duquedecaxias.rj.gov.br/portal/boletim-oficial/2015/01-Janeiro/
  2. http://duquedecaxias.rj.gov.br/portal/arquivos/2018/fevereiro/

The first format is for gazettes before 2017 and is very incomplete; it is missing a lot of gazettes.

The second format is for gazettes from 2017 onwards. In this format there is no way to get the day of the gazette.

It is necessary to get in contact with the city hall. What would be better: a project representative contacting the city hall, or me?

Guarujá's sp_guaruja.py is not working

Good afternoon,

The scraping code for the Guarujá city hall is not working. I believe this is because the city hall website was changed a short time ago and the structure of the official gazette site changed with it.

Thank you very much.

Replacement for Scrapy Contracts

@rennerocha recently raised the issue of Scrapy Contracts often making the test suite fail due to false positives.

The initial idea of using Contracts was to have a quick (for experienced developers) and beginner-friendly way of writing Scrapy tests, sparing developers from having to write test cases, create fixtures, and keep them up to date.

We could also keep using Contracts, knowing they will eventually fail in online tests, and use them only as a way of monitoring the state of all crawlers (today we have 13 different spiders) until we have a better and more robust solution.
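For context, a Contract lives in the callback's docstring and is checked with scrapy check, which is also why it goes online and can fail when the target site is slow or down. A minimal illustration (the URL and fields are placeholders):

import scrapy


class ContractedSketchSpider(scrapy.Spider):
    name = "contracted_sketch"

    def parse(self, response):
        """Parse the gazette listing page.

        @url https://sitedomunicipio.example/diarios
        @returns items 1
        @scrapes date file_urls
        """
        yield {
            "date": response.css(".date::text").get(),
            "file_urls": [response.css("a::attr(href)").get()],
        }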

Enable PDF upload to Google Cloud Storage

This issue is intended for people with little to no experience with open source contributions.

Nowadays, every time a Scrapy spider is run, the gazette PDF is saved and stored in the local filesystem, more specifically in the /mnt/data/ folder.

https://github.com/okfn-brasil/diario-oficial/blob/490270da0201471b9968268cd2eb0e9bde272310/processing/data_collection/gazette/settings.py#L10

When in a development environment, we still want to have files stored in the local filesystem, in this same folder. When in a production environment, the file should be sent to Google Cloud Storage.

Ideally, this behavior should be defined by an environment variable such as FILES_STORAGE.

Scrapy's documentation on the topic:
https://doc.scrapy.org/en/latest/topics/media-pipeline.html#id1
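A sketch of how the settings could branch on such a variable. The FILES_STORAGE name follows the suggestion above rather than any existing implementation; recent Scrapy versions accept a gs:// URI in FILES_STORE together with GCS_PROJECT_ID, as described in the documentation linked above.

# settings.py (sketch)
import os

# Defaults to the local folder used today; in production, point FILES_STORAGE
# at a bucket, e.g. FILES_STORAGE=gs://my-gazette-bucket/data
FILES_STORE = os.environ.get("FILES_STORAGE", "/mnt/data/")

# Needed by Scrapy's media pipelines when FILES_STORE is a gs:// URI
# (also requires the google-cloud-storage package).
GCS_PROJECT_ID = os.environ.get("GCS_PROJECT_ID", "")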

https://diario.serenata.ai is currently offline

Apparently, the diario-api service is offline. This is the result from curl https://diario.serenata.ai:

{
  "status": 500,
  "message": "getaddrinfo ENOTFOUND diario-api diario-api:3000",
  "name": "NuxtServerError"
}

Make unit_test failing due to new version of pytest

Make unit_test is failing since the release of pytest 3.10.0.

$ make unit_test
docker-compose run --rm processing pytest -p no:cacheprovider
Starting diario-oficial_rabbitmq_1 ... done
Starting diario-oficial_postgres_1 ... done
Starting diario-oficial_redis_1    ... done
==================================================================== test session starts =====================================================================
platform linux -- Python 3.6.7, pytest-3.10.0, py-1.7.0, pluggy-0.8.0
rootdir: /mnt/code, inifile:
plugins: celery-4.1.1
collected 42 items                                                                                                                                           

tests/test_tasks.py .......                                                                                                                            [ 16%]
tests/gazette/data/test_bidding_exemption_parsing.py .....................                                                                             [ 66%]
tests/gazette/data/test_row_update.py ....                                                                                                             [ 76%]
tests/gazette/data/test_section_parsing.py .....                                                                                                       [ 88%]
tests/gazette/locations/test_go_goiania.py ...                                                                                                         [ 95%]
tests/gazette/locations/test_rs_porto_alegre.py ..                                                                                                     [100%]Traceback (most recent call last):
  File "/usr/local/bin/pytest", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/_pytest/config/__init__.py", line 76, in main
    return config.hook.pytest_cmdline_main(config=config)
  File "/usr/local/lib/python3.6/site-packages/pluggy/hooks.py", line 284, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 67, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 61, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.6/site-packages/_pytest/main.py", line 218, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/usr/local/lib/python3.6/site-packages/_pytest/main.py", line 211, in wrap_session
    session=session, exitstatus=session.exitstatus
  File "/usr/local/lib/python3.6/site-packages/pluggy/hooks.py", line 284, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 67, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 61, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 203, in _multicall
    gen.send(outcome)
  File "/usr/local/lib/python3.6/site-packages/_pytest/terminal.py", line 627, in pytest_sessionfinish
    outcome.get_result()
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.6/site-packages/_pytest/stepwise.py", line 102, in pytest_sessionfinish
    self.config.cache.set("cache/stepwise", [])
AttributeError: 'Config' object has no attribute 'cache'
Makefile:4: recipe for target 'unit_test' failed
make: *** [unit_test] Error 1

postgres save error: duplicate key value violates unique constraint

When running a crawler for the second time, a duplicate key exception is raised by Postgres.

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 28, in process_item
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 493, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2458, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2596, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2556, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1120, in _emit_insert_statements
    statement, params
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 550, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "gazettes_territory_id_date_file_checksum_key"
DETAIL:  Key (territory_id, date, file_checksum)=(3509502, 2016-09-01, 6509975a6cba06faa73eb0b463acea30) already exists.

[SQL: INSERT INTO gazettes (source_text, date, is_extra_edition, is_parsed, power, file_checksum, file_path, file_url, scraped_at, created_at, territory_id) VALUES (%(source_text)s, %(date)s, %(is_extra_edition)s, %(is_parsed)s, %(power)s, %(file_checksum)s, %(file_path)s, %(file_url)s, %(scraped_at)s, %(created_at)s, %(territory_id)s) RETURNING gazettes.id]
[parameters: {'source_text': 'Nº 11.431 - Ano XLV\n                           Diário Oficial                     Quinta-feira, 01 de setembro de 2016                               ... (1320312 characters truncated) ... . - Alteração da forma de cobrança\n                                                                                       da taxa de material.\n\x0c', 'date': datetime.date(2016, 9, 1), 'is_extra_edition': False, 'is_parsed': False, 'power': 'executive_legislature', 'file_checksum': '6509975a6cba06faa73eb0b463acea30', 'file_path': 'full/4abcf9e6df73b424f8e3b38140a3062af85b4800.pdf', 'file_url': 'http://www.campinas.sp.gov.br/uploads/pdf/2118544071.pdf', 'scraped_at': datetime.datetime(2019, 12, 6, 17, 40, 55, 19918), 'created_at': datetime.datetime(2019, 12, 6, 17, 41, 1, 688711), 'territory_id': '3509502'}]
(Background on this error at: http://sqlalche.me/e/gkpj)

Dispensa de licitação and inexigibilidade

Do you think it would be worthwhile to locate not only the bidding waivers (dispensas de licitação), but the cases of inexigibilidade (non-enforceability of bidding) as well?

"The basic difference between the two situations is that, in the case of dispensa, competition that would justify a bidding process is possible, so the law allows the waiver, which falls within the Administration's discretionary competence. In the cases of inexigibilidade, competition is not possible, because there is only one object or one person able to meet the Administration's needs; the bidding process is therefore unviable."
DI PIETRO, Maria Sylvia Zanella. Direito administrativo. 14. ed. São Paulo: Atlas, 2002, p. 310, 320-321.
Source

spider automation test

I believe we should find a way to run some kind of automated test on the spiders. My first idea is to use GitHub Actions to run the spiders for a while and check whether they download some files. If they do, great, the test passes; otherwise, it fails.

Furthermore, if the test fails, we can use the GitHub Action to open an issue automatically.
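A sketch of what such a check could look like in Python before wiring it into GitHub Actions (the spider name, limits, and exit-code convention are placeholders):

import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_smoke_test(spider_name, max_minutes=5):
    settings = get_project_settings()
    # Stop the crawl early: this is a smoke test, not a full collection run.
    settings.set("CLOSESPIDER_TIMEOUT", max_minutes * 60)
    settings.set("CLOSESPIDER_ITEMCOUNT", 5)

    process = CrawlerProcess(settings)
    crawler = process.create_crawler(spider_name)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    scraped = crawler.stats.get_value("item_scraped_count", 0)
    return scraped > 0


if __name__ == "__main__":
    # Exit non-zero when nothing was scraped so CI can flag the spider.
    sys.exit(0 if run_smoke_test(sys.argv[1]) else 1)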

Black pre-commit unstable

The Black pre-commit hook is unstable.

When I run docker-compose run --rm processing black . --check, I get

(...)
would reformat /mnt/code/tests/gazette/data/test_bidding_exemption_parsing.py
All done! ✨ 🍰 ✨
1 file would be reformatted, 56 files would be left unchanged.

So I run docker-compose run --rm processing black . --safe and this file is formatted.

But when I try to commit, pre-commit blocks the commit and formats the file back to the format that the Black from Docker complained about (test_bidding_exemption_parsing).

I suspect that the Black inside Docker and the Black used by pre-commit are different versions, but I don't know how to make them match.


I already tried to:

  • Install black globally at the same version as in requirements.txt.
  • Use the entry argument in .pre-commit-config.yaml with the Docker command for black.

Common platform used by many municipalities

While looking for gazettes for Paranavaí the official site redirects to a portal that seems to be used by lots of smaller municipalities.

http://www.diariomunicipal.com.br

This portal even allows searching for terms like "Dispensa"

http://www.diariomunicipal.com.br/amp/pesquisar?entidadeUsuaria=&titulo=Dispensa&nome_orgao=&dataInicio=01%2F05%2F2018&dataFim=31%2F05%2F2018&Enviar=&_token=aAfXBvwcSMnp4lRq1qoriAGxW9rav7iawfmpUeHGLMk

The only problem is that the result is not a single PDF, but a page for each result. I believe another pipeline would be needed to scrape this website.
Although I haven't found any city in the top 100 using this portal, maybe it would be worth scraping it given the number of cities that use it.

What do you think?
