
querido-diario's Introduction

Português (BR) | English (US)

Querido Diário


Within the Querido Diário ecosystem, this repository is responsible for scraping the websites that publish official gazettes.

Learn more about the technologies and the history of the project on the Querido Diário website.

Table of contents

How to contribute


Thank you for considering contributing to Querido Diário! 🎉

You can find out how to do so in CONTRIBUTING.md!

Also, check the Querido Diário documentation for help.

Development environment

You need Python (3.0+) and the Scrapy framework installed.

The commands below set up the environment on Linux. They create a Python virtual environment, install the requirements listed in requirements-dev, and install the pre-commit code-standardization tool.

python3 -m venv .venv
source .venv/bin/activate
pip install -r data_collection/requirements-dev.txt
pre-commit install

Setup instructions for other operating systems are available in "how to set up the development environment", including more details for those who want to contribute to the repository's development.

Scraper template

Instead of starting a scraper file from scratch, you can initialize a scraper file that already follows the Querido Diário standard from a template. To do so:

  1. Go to the data_collection directory:
cd data_collection
  2. Run the template:
scrapy genspider -t qdtemplate <uf_nome_do_municipio> <https://sitedomunicipio...>

A uf_nome_do_municipio.py file will be created in the spiders directory with some fields already filled in. The directory is organized by state (UF), so remember to move the file into the appropriate subdirectory.
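For orientation, the sketch below approximates the shape such a spider takes, using plain Scrapy plus the item fields that appear in the crawl logs further down this page (date, file_urls, is_extra_edition, power, territory_id, scraped_at). All names, URLs, and selectors are placeholders; the real template pre-fills project-specific fields, so treat this only as an illustration.

from datetime import datetime

import scrapy


class UfNomeDoMunicipioSpider(scrapy.Spider):
    # All values below are placeholders for illustration only.
    name = "uf_nome_do_municipio"
    allowed_domains = ["sitedomunicipio.example"]
    start_urls = ["https://sitedomunicipio.example/diarios"]
    territory_id = "0000000"  # IBGE code of the municipality

    def parse(self, response):
        # Placeholder selectors: adapt to the real gazette listing page.
        for edition in response.css(".edition"):
            yield {
                "territory_id": self.territory_id,
                "date": datetime.strptime(edition.attrib["data-date"], "%Y-%m-%d").date(),
                "file_urls": [response.urljoin(edition.css("a::attr(href)").get())],
                "is_extra_edition": False,
                "power": "executive",
                "scraped_at": datetime.utcnow(),
            }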

How to run

To try running a scraper already integrated into the project, or to test the one you are developing, follow these steps:

  1. If you haven't already, activate the virtual environment in the /querido-diario directory:
source .venv/bin/activate
  2. Go to the data_collection directory:
cd data_collection
  3. Check the list of available scrapers:
scrapy list
  4. Run a scraper from the list:
scrapy crawl <nome_do_raspador>       # example: scrapy crawl ba_acajutiba
  5. The gazettes collected by the crawl will be saved in the data_collection/data directory

Execution tips

In addition to the commands above, Scrapy offers other options to configure the crawl command. The options below can be used on their own or combined.

  • Date limits
    When running step 4, the scraper will collect every official gazette from that municipality's publishing site. For shorter runs, use the attribute flag -a followed by:

start_date=YYYY-MM-DD: sets the start date for gazette collection.

scrapy crawl <nome_do_raspador> -a start_date=<YYYY-MM-DD>

end_date=YYYY-MM-DD: sets the end date for gazette collection. If omitted, it defaults to the day the command is run.

scrapy crawl <nome_do_raspador> -a end_date=<YYYY-MM-DD>
  • Log file
    You can send the crawl log to a file instead of leaving it in the terminal. This is particularly useful when you are developing a scraper that has problems and want to attach the log file to your PR to ask for help. To do so, use the settings flag -s followed by:

LOG_FILE=log_<nome_do_municipio>.txt: sets the file where log messages are stored.

scrapy crawl <nome_do_raspador> -s LOG_FILE=log_<nome_do_municipio>.txt
  • Crawl results table
    You can also build a table listing every gazette and all the metadata collected by the crawl, making it easier to see how the scraper is behaving. To do so, use the output flag -o followed by a file name.
scrapy crawl <nome_do_raspador> -o <nome_do_municipio>.csv

Troubleshooting

Check the troubleshooting file to solve the most common issues with the project's environment setup.

Support

Discord Invite

Join our community channel to discuss the projects, ask questions, get help with contributing, and talk about civic innovation in general.

Acknowledgements

This project is maintained by Open Knowledge Brasil and made possible thanks to the technical communities, the Civic Innovation Ambassadors, volunteers and financial donors, as well as partner universities, supporting companies, and funders.

Meet those who support Querido Diário.

Open Knowledge Brasil


Open Knowledge Brasil is a non-profit civil society organization whose mission is to use and develop civic tools, projects, public policy analysis, and data journalism to promote free knowledge across the various fields of society.

All work produced by OKBR is freely available.

License

Code licensed under the MIT License.

querido-diario's People

Contributors

alfakini, alvarolqueiroz, anapaulagomes, antoniovendramin, ayharano, brunolellis, cuducos, danielbom, dannnylo, ddevdan, feliperuhland, gbonesso, giovanisleite, he7d3r, irio, jaswdr, jvanz, luzfcb, ogecece, pgarcias01, rennerocha, rodbv, rodolfolottin, sergiomario, tatilattanzi, trevineju, vicitel, victor-torres, vitorbaptista, winzen


querido-diario's Issues

Cities

Hey people,

Adding @cuducos's work-in-progress tracking here to help us keep track of the work to be done.

✅ Done
🔜 In progress

# City Crawler Parser Issue PR
1 São Paulo #7
2 Rio de Janeiro 🔜 #15 #29
3 Brasília
4 Salvador #47
5 Fortaleza 🔜 #52
6 Belo Horizonte #33
7 Manaus 🔜 #51
8 Curitiba #42
9 Recife 🔜
10 Porto Alegre
11 Goiânia 🔜 #6
12 Belém
13 Guarulhos #4
14 Campinas #2
15 São Luís 🔜 #22
16 São Gonçalo
17 Maceió 🔜 #32
18 Duque de Caxias
19 Natal
20 Campo Grande #35
21 Teresina 🔜 #53
22 São Bernardo do Campo
23 João Pessoa
24 Nova Iguaçu
25 Santo André
26 São José dos Campos
27 Osasco
28 Jaboatão dos Guararapes
29 Ribeirão Preto #31
30 Uberlândia #37
31 Sorocaba
32 Contagem
33 Aracaju
34 Feira de Santana 🔜 #25
35 Cuiabá
36 Joinville 🔜 #30
37 Juiz de Fora 🔜 #12 #13
38 Londrina
39 Aparecida de Goiânia
40 Porto Velho
41 Ananindeua
42 Serra
43 Niterói
44 Belford Roxo
45 Campos dos Goytacazes
46 Vila Velha
47 Florianópolis #17
48 Caxias do Sul
49 Macapá
50 Mauá
51 São João de Meriti
52 São José do Rio Preto
53 Santos 🔜 #14
54 Mogi das Cruzes
55 Betim
56 Diadema
57 Campina Grande
58 Jundiaí
59 Maringá
60 Montes Claros 🔜 #26
61 Piracicaba
62 Carapicuíba
63 Olinda
64 Cariacica
65 Rio Branco
66 Anápolis
67 Bauru
68 Vitória
69 Caucaia
70 Itaquaquecetuba
71 São Vicente
72 Caruaru
73 Vitória da Conquista
74 Blumenau
75 Franca #5
76 Pelotas
77 Ponta Grossa #45
78 Canoas #10
79 Petrolina
80 Boa Vista
81 Ribeirão das Neves
82 Paulista
83 Uberaba
84 Cascavel
85 Guarujá
86 Praia Grande
87 Taubaté
88 São José dos Pinhais
89 Limeira
90 Petrópolis
91 Camaçari
92 Santarém
93 Mossoró
94 Suzano
95 Palmas #1
96 Governador Valadares 🔜 #19
97 Taboão da Serra
98 Santa Maria
99 Gravataí
100 Várzea Grande
XXX Foz do Iguaçu #34 #27
XXX Araguaina #3

Make seed error

I ran docker-compose down and make setup

make seed
make[1]: Entering directory '/home/giovani/workspace/diario-oficial'
docker-compose up --detach postgres
Builds, (re)creates, starts, and attaches to containers for a service.

Unless they are already running, this command also starts any linked services.

The `docker-compose up` command aggregates the output of each container. When
the command exits, all containers are stopped. Running `docker-compose up -d`
starts the containers in the background and leaves them running.

If there are existing containers for a service, and the service's configuration
or image was changed after the container's creation, `docker-compose up` picks
up the changes by stopping and recreating the containers (preserving mounted
volumes). To prevent Compose from picking up changes, use the `--no-recreate`
flag.

If you want to force Compose to stop and recreate all containers, use the
`--force-recreate` flag.

Usage: up [options] [--scale SERVICE=NUM...] [SERVICE...]

Options:
    -d                         Detached mode: Run containers in the background,
                               print new container names.
                               Incompatible with --abort-on-container-exit.
    --no-color                 Produce monochrome output.
    --no-deps                  Don't start linked services.
    --force-recreate           Recreate containers even if their configuration
                               and image haven't changed.
                               Incompatible with --no-recreate.
    --no-recreate              If containers already exist, don't recreate them.
                               Incompatible with --force-recreate.
    --no-build                 Don't build an image, even if it's missing.
    --build                    Build images before starting containers.
    --abort-on-container-exit  Stops all containers if any container was stopped.
                               Incompatible with -d.
    -t, --timeout TIMEOUT      Use this timeout in seconds for container shutdown
                               when attached or when containers are already
                               running. (default: 10)
    --remove-orphans           Remove containers for services not
                               defined in the Compose file
    --exit-code-from SERVICE   Return the exit code of the selected service container.
                               Implies --abort-on-container-exit.
    --scale SERVICE=NUM        Scale SERVICE to NUM instances. Overrides the `scale`
                               setting in the Compose file if present.
Makefile:16: recipe for target 'seed' failed
make[1]: *** [seed] Error 1
make[1]: Leaving directory '/home/giovani/workspace/diario-oficial'
Makefile:10: recipe for target 'setup' failed
make: *** [setup] Error 2

Error in the web container: Request failed with status code 400

Hi, I tried to run the project (as described in the README.md: make setup && docker-compose up) – all went well except for the webserver:


The output in the docker-compose logs repeatedly shows:

web_1         |  warning  in ./node_modules/bulma/bulma.sass
web_1         | 
web_1         | (Emitted value instead of an instance of Error) postcss-custom-properties: /mnt/code/node_modules/bulma/bulma.sass:5915:5: Custom property ignored: not scoped to the top-level :root element (.columns.is-variable.is-8 { ... --columnGap: ... })
web_1         | 
web_1         |  @ ./node_modules/bulma/bulma.sass 4:14-152 13:3-17:5 14:22-160
web_1         |  @ ./.nuxt/App.js
web_1         |  @ ./.nuxt/index.js
web_1         |  @ ./.nuxt/client.js
web_1         |  @ multi webpack-hot-middleware/client?name=client&reload=true&timeout=30000&path=/__webpack_hmr ./.nuxt/client.js

And when I hit localhost:8080 there is a different (rather longer) error message in the logs.

web_1         | { Error: Request failed with status code 400
web_1         |     at createError (/mnt/code/node_modules/axios/lib/core/createError.js:16:15)
web_1         |     at settle (/mnt/code/node_modules/axios/lib/core/settle.js:18:12)
web_1         |     at IncomingMessage.handleStreamEnd (/mnt/code/node_modules/axios/lib/adapters/http.js:201:11)
web_1         |     at IncomingMessage.emit (events.js:185:15)
web_1         |     at endReadableNT (_stream_readable.js:1106:12)
web_1         |     at process._tickCallback (internal/process/next_tick.js:178:19)
web_1         |   config: 
web_1         |    { adapter: [Function: httpAdapter],
web_1         |      transformRequest: { '0': [Function: transformRequest] },
web_1         |      transformResponse: { '0': [Function: transformResponse] },
web_1         |      timeout: 0,
web_1         |      xsrfCookieName: 'XSRF-TOKEN',
web_1         |      xsrfHeaderName: 'X-XSRF-TOKEN',
web_1         |      maxContentLength: -1,
web_1         |      validateStatus: [Function: validateStatus],
web_1         |      headers: 
web_1         |       { Accept: 'application/json, text/plain, */*',
web_1         |         'User-Agent': 'axios/0.18.0' },
web_1         |      method: 'get',
web_1         |      url: 'http://api:3000/bidding_exemptions?select=*,gazette{file_url,is_extra_edition,power}&order=date.desc',
web_1         |      data: undefined },
web_1         |   request: 
web_1         |    ClientRequest {
web_1         |      _events: 
web_1         |       { socket: [Function],
web_1         |         abort: [Function],
web_1         |         aborted: [Function],
web_1         |         error: [Function],
web_1         |         timeout: [Function],
web_1         |         prefinish: [Function: requestOnPrefinish] },
web_1         |      _eventsCount: 6,
web_1         |      _maxListeners: undefined,
web_1         |      output: [],
web_1         |      outputEncodings: [],
web_1         |      outputCallbacks: [],
web_1         |      outputSize: 0,
web_1         |      writable: true,
web_1         |      _last: true,
web_1         |      upgrading: false,
web_1         |      chunkedEncoding: false,
web_1         |      shouldKeepAlive: false,
web_1         |      useChunkedEncodingByDefault: false,
web_1         |      sendDate: false,
web_1         |      _removedConnection: false,
web_1         |      _removedContLen: false,
web_1         |      _removedTE: false,
web_1         |      _contentLength: 0,
web_1         |      _hasBody: true,
web_1         |      _trailer: '',
web_1         |      finished: true,
web_1         |      _headerSent: true,
web_1         |      socket: 
web_1         |       Socket {
web_1         |         connecting: false,
web_1         |         _hadError: false,
web_1         |         _handle: null,
web_1         |         _parent: null,
web_1         |         _host: 'api',
web_1         |         _readableState: [ReadableState],
web_1         |         readable: false,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 7,
web_1         |         _maxListeners: undefined,
web_1         |         _writableState: [WritableState],
web_1         |         writable: false,
web_1         |         _bytesDispatched: 210,
web_1         |         _sockname: null,
web_1         |         _pendingData: null,
web_1         |         _pendingEncoding: '',
web_1         |         allowHalfOpen: false,
web_1         |         server: null,
web_1         |         _server: null,
web_1         |         parser: null,
web_1         |         _httpMessage: [Circular],
web_1         |         _idleNext: null,
web_1         |         _idlePrev: null,
web_1         |         _idleTimeout: -1,
web_1         |         [Symbol(asyncId)]: 27540,
web_1         |         [Symbol(lastWriteQueueSize)]: 0,
web_1         |         [Symbol(bytesRead)]: 312 },
web_1         |      connection: 
web_1         |       Socket {
web_1         |         connecting: false,
web_1         |         _hadError: false,
web_1         |         _handle: null,
web_1         |         _parent: null,
web_1         |         _host: 'api',
web_1         |         _readableState: [ReadableState],
web_1         |         readable: false,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 7,
web_1         |         _maxListeners: undefined,
web_1         |         _writableState: [WritableState],
web_1         |         writable: false,
web_1         |         _bytesDispatched: 210,
web_1         |         _sockname: null,
web_1         |         _pendingData: null,
web_1         |         _pendingEncoding: '',
web_1         |         allowHalfOpen: false,
web_1         |         server: null,
web_1         |         _server: null,
web_1         |         parser: null,
web_1         |         _httpMessage: [Circular],
web_1         |         _idleNext: null,
web_1         |         _idlePrev: null,
web_1         |         _idleTimeout: -1,
web_1         |         [Symbol(asyncId)]: 27540,
web_1         |         [Symbol(lastWriteQueueSize)]: 0,
web_1         |         [Symbol(bytesRead)]: 312 },
web_1         |      _header: 'GET /bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc HTTP/1.1\r\nAccept: application/json, text/plain, */*\r\nUser-Agent: axios/0.18.0\r\nHost: api:3000\r\nConnection: close\r\n\r\n',
web_1         |      _onPendingData: [Function: noopPendingOutput],
web_1         |      agent: 
web_1         |       Agent {
web_1         |         _events: [Object],
web_1         |         _eventsCount: 1,
web_1         |         _maxListeners: undefined,
web_1         |         defaultPort: 80,
web_1         |         protocol: 'http:',
web_1         |         options: [Object],
web_1         |         requests: {},
web_1         |         sockets: [Object],
web_1         |         freeSockets: {},
web_1         |         keepAliveMsecs: 1000,
web_1         |         keepAlive: false,
web_1         |         maxSockets: Infinity,
web_1         |         maxFreeSockets: 256 },
web_1         |      socketPath: undefined,
web_1         |      timeout: undefined,
web_1         |      method: 'GET',
web_1         |      path: '/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc',
web_1         |      _ended: true,
web_1         |      res: 
web_1         |       IncomingMessage {
web_1         |         _readableState: [ReadableState],
web_1         |         readable: false,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 3,
web_1         |         _maxListeners: undefined,
web_1         |         socket: [Socket],
web_1         |         connection: [Socket],
web_1         |         httpVersionMajor: 1,
web_1         |         httpVersionMinor: 1,
web_1         |         httpVersion: '1.1',
web_1         |         complete: true,
web_1         |         headers: [Object],
web_1         |         rawHeaders: [Array],
web_1         |         trailers: {},
web_1         |         rawTrailers: [],
web_1         |         upgrade: false,
web_1         |         url: '',
web_1         |         method: null,
web_1         |         statusCode: 400,
web_1         |         statusMessage: 'Bad Request',
web_1         |         client: [Socket],
web_1         |         _consuming: true,
web_1         |         _dumped: false,
web_1         |         req: [Circular],
web_1         |         responseUrl: 'http://api:3000/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc',
web_1         |         read: [Function] },
web_1         |      aborted: undefined,
web_1         |      timeoutCb: null,
web_1         |      upgradeOrConnect: false,
web_1         |      parser: null,
web_1         |      maxHeadersCount: null,
web_1         |      _redirectable: 
web_1         |       Writable {
web_1         |         _writableState: [WritableState],
web_1         |         writable: true,
web_1         |         _events: [Object],
web_1         |         _eventsCount: 2,
web_1         |         _maxListeners: undefined,
web_1         |         _options: [Object],
web_1         |         _redirectCount: 0,
web_1         |         _requestBodyLength: 0,
web_1         |         _requestBodyBuffers: [],
web_1         |         _onNativeResponse: [Function],
web_1         |         _currentRequest: [Circular],
web_1         |         _currentUrl: 'http://api:3000/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc' },
web_1         |      [Symbol(isCorked)]: false,
web_1         |      [Symbol(outHeadersKey)]: { accept: [Array], 'user-agent': [Array], host: [Array] } },
web_1         |   response: 
web_1         |    { status: 400,
web_1         |      statusText: 'Bad Request',
web_1         |      headers: 
web_1         |       { 'transfer-encoding': 'chunked',
web_1         |         date: 'Wed, 25 Apr 2018 14:05:40 GMT',
web_1         |         server: 'postgrest/0.4.4.0 (f9e770b)',
web_1         |         'content-type': 'application/json; charset=utf-8' },
web_1         |      config: 
web_1         |       { adapter: [Function: httpAdapter],
web_1         |         transformRequest: [Object],
web_1         |         transformResponse: [Object],
web_1         |         timeout: 0,
web_1         |         xsrfCookieName: 'XSRF-TOKEN',
web_1         |         xsrfHeaderName: 'X-XSRF-TOKEN',
web_1         |         maxContentLength: -1,
web_1         |         validateStatus: [Function: validateStatus],
web_1         |         headers: [Object],
web_1         |         method: 'get',
web_1         |         url: 'http://api:3000/bidding_exemptions?select=*,gazette{file_url,is_extra_edition,power}&order=date.desc',
web_1         |         data: undefined },
web_1         |      request: 
web_1         |       ClientRequest {
web_1         |         _events: [Object],
web_1         |         _eventsCount: 6,
web_1         |         _maxListeners: undefined,
web_1         |         output: [],
web_1         |         outputEncodings: [],
web_1         |         outputCallbacks: [],
web_1         |         outputSize: 0,
web_1         |         writable: true,
web_1         |         _last: true,
web_1         |         upgrading: false,
web_1         |         chunkedEncoding: false,
web_1         |         shouldKeepAlive: false,
web_1         |         useChunkedEncodingByDefault: false,
web_1         |         sendDate: false,
web_1         |         _removedConnection: false,
web_1         |         _removedContLen: false,
web_1         |         _removedTE: false,
web_1         |         _contentLength: 0,
web_1         |         _hasBody: true,
web_1         |         _trailer: '',
web_1         |         finished: true,
web_1         |         _headerSent: true,
web_1         |         socket: [Socket],
web_1         |         connection: [Socket],
web_1         |         _header: 'GET /bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc HTTP/1.1\r\nAccept: application/json, text/plain, */*\r\nUser-Agent: axios/0.18.0\r\nHost: api:3000\r\nConnection: close\r\n\r\n',
web_1         |         _onPendingData: [Function: noopPendingOutput],
web_1         |         agent: [Agent],
web_1         |         socketPath: undefined,
web_1         |         timeout: undefined,
web_1         |         method: 'GET',
web_1         |         path: '/bidding_exemptions?select=*,gazette%7Bfile_url,is_extra_edition,power%7D&order=date.desc',
web_1         |         _ended: true,
web_1         |         res: [IncomingMessage],
web_1         |         aborted: undefined,
web_1         |         timeoutCb: null,
web_1         |         upgradeOrConnect: false,
web_1         |         parser: null,
web_1         |         maxHeadersCount: null,
web_1         |         _redirectable: [Writable],
web_1         |         [Symbol(isCorked)]: false,
web_1         |         [Symbol(outHeadersKey)]: [Object] },
web_1         |      data: 
web_1         |       { message: 'Could not find foreign keys between these entities, No relation found between bidding_exemptions and gazette' } },
web_1         |   statusCode: 500,
web_1         |   name: 'NuxtServerError' }

As a newbie to Vue.js and Nuxt, may I ask if anyone has any clue about what's going on? Many thanks : )

Deployable package.

We need a script to create a package/container to make deploying the spiders in a production environment easier. I think the easiest way to go is to use the same Dockerfile we use to run the spiders in docker-compose to build the container used in production. But we need some automation to build the container and publish it to a container registry somewhere, like Docker Hub or a registry from the infrastructure provider (e.g. Digital Ocean).

Another option is to build an rpm/deb package with something similar to the Open Build Service and install it on the production server from a deb/rpm repository.

This issue is related to issue #157

Aracaju/SE Spider

Hi everybody, I'm stuck trying to build the Aracaju city spider. This is the main site for the gazettes: http://sga.aracaju.se.gov.br:5011/legislacao/faces/diario_form_pesq.jsp. It's a JSP page and the requests must contain some session data. I can retrieve the gazettes through a direct link by the gazette's official number (http://sga.aracaju.se.gov.br:5011/diarios/3970.pdf), but in that case I don't have the extra information, like the publication date, for example. Is there a good approach to deal with this?
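Not an official answer, but one possible approach, sketched under assumptions (the form and field names on the JSP page are placeholders here): submit the search form with FormRequest.from_response so Scrapy keeps the session cookie the server expects, read the publication date from the results page, and only then follow the PDF link.

import scrapy


class AracajuSketchSpider(scrapy.Spider):
    # Illustrative only; selectors and form fields below are placeholders.
    name = "se_aracaju_sketch"
    start_urls = ["http://sga.aracaju.se.gov.br:5011/legislacao/faces/diario_form_pesq.jsp"]

    def parse(self, response):
        # Submitting through the page's own form keeps the hidden JSF fields
        # and the session cookie that later requests depend on.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"dataInicial": "01/01/2020", "dataFinal": "31/01/2020"},  # placeholder fields
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # Each result row is assumed to expose the gazette date and a link.
        for row in response.css("table tr"):
            href = row.css("a::attr(href)").get()
            gazette_date = row.css("td.data::text").get()
            if href and gazette_date:
                yield {"date": gazette_date, "file_urls": [response.urljoin(href)]}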

WebUI to find and visualize the gazettes

In order to make the gazettes available to the public, we need a WebUI that allows users to search the files by city and date. The initial idea is to use Django for that.

Related to issue #157

Make the gazettes available to the public

Over the past few days I've been discussing with @sergiomario how to run the spiders in production and make the scraped files available on a central web page. The first version does not need to be too fancy. The idea is to run the spiders on a server/cluster, store the files, and build a simple web page allowing the user to search and read the scraped files.

As Serenata de Amor already runs on Digital Ocean, I think we can continue with the same provider. All we need for the first version is a server/k8s cluster, PostgreSQL, and file storage. We can address all of these needs with the DO products available.

To achieve this goal, we see the following issues that need to be addressed:

  1. Where to run the spiders, the API and the web page: a simple server with a cron job, or something more sophisticated, like a Kubernetes cluster, to run the workloads.
  2. Avoid unnecessary requests: if we have already collected the gazettes up to 02/19/2020, start the spider from 02/20/2020 (by the way, really cool date xD).
  3. Automation: we are few people, so we should automate as much as possible.
  4. UX: find a UX wizard to build a nice web page.

@sergiomario, am I forgetting something?

Apache Tika

Try using Apache Tika to extract data from DOC and PDF files, replacing the current pipeline steps that extract text from the downloaded files.
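A minimal sketch of what that replacement step could look like, assuming the tika Python client (which talks to a local Tika server and needs a Java runtime); this is not necessarily how the project ended up doing it:

from tika import parser  # pip install tika


def extract_text(file_path):
    # Tika handles PDF, DOC and many other formats behind a single call.
    parsed = parser.from_file(file_path)
    return parsed.get("content") or ""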

Rio de Janeiro/RJ Crawl

Hello guys.

I'm having trouble understanding the crawler results for Rio de Janeiro...

If I test the Rio de Janeiro crawler (following the instructions in CONTRIBUTING.md):

sudo docker-compose run --rm processing bash -c "cd data_collection && scrapy crawl rj_rio_de_janeiro"

The result seems to be wrong:

[...]

2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864> referred in <None>
2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864> referred in <None>
2018-08-27 00:32:53 [scrapy.core.scraper] ERROR: Error processing {'date': datetime.date(2018, 8, 20),
 'file_urls': ['http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864'],
 'files': [{'checksum': '49228de889bf8edd753fad4b184adaa3',
            'path': 'full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok&edi_id=3864',
            'url': 'http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3864'}],
 'is_extra_edition': True,
 'power': 'executive',
 'scraped_at': datetime.datetime(2018, 8, 27, 0, 32, 53, 7640),
 'territory_id': '3304557'}
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 14, in process_item
    item["source_text"] = self.pdf_source_text(item)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 29, in pdf_source_text
    with open(text_path) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok&edi_id=3864.txt'
I/O Error: Couldn't open file '/mnt/data/full/c73158205ba52ccb878c48d4353d298db7586850.php?download=ok': No such file or directory.
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=21/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=24/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=26/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET http://doweb.rio.rj.gov.br/ler_pdf.php?download=ok&edi_id=3869> referred in <None>
2018-08-27 00:32:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=25/08/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=28/07/2018> (referer: http://doweb.rio.rj.gov.br)
2018-08-27 00:32:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://doweb.rio.rj.gov.br/?buscar_diario=ok&tipo=1&data_busca=30/07/2018> (referer: http://doweb.rio.rj.gov.br)

When I run the crawler of Porto Alegre (for comparison) I get an intelligible result:

sudo docker-compose run --rm processing bash -c "cd data_collection && scrapy crawl rs_porto_alegre"

result is:
[...]

                'EXTRATO DE TERMO ADITIVO\n'
                '     PROCESSO: 009.003517.14.4\n'
                '     CONTRATANTE: Departamento Municipal de Previdência dos '
                'Servidores Públicos do Município de Porto Alegre.\n'
                '     CONTRATADA: Agência Estado S/A.\n'
                '     OBJETO: prorrogação do contrato n. 02/2015 de licença de '
                'uso do software AE Broadcast Profissional, 04 pontos de '
                'acesso, por 12 meses, a contar de 01.04.2018.\n'
                '     Valor Mensal: R$ 9.625,04.\n'
                '     BASE LEGAL: Artigo 57, inciso II, da Lei 8.666/93 e suas '
                'alterações.\n'
                '\n'
                '                                                                                  '
                'Porto Alegre, 24 de abril de 2018.\n'
                '\n'
                '\n'
                '                                                                             '
                'RENAN DA SILVA AGUIAR, Diretor-Geral.\n'
                '\n'
                '\n'
                '\n'
                '\n'
                '      EXPEDIENTE\n'
                '\n'
                '\n'
                '      PREFEITURA MUNICIPAL DE PORTO ALEGRE\n'
                '      Diário Oficial Eletrônico de Porto Alegre\n'
                '      Órgão de Divulgação Oficial do Município\n'
                '      Instituído pela Lei nº 11.029 de 3 de janeiro de 2011\n'

Error when starting service, database keys missing?

When I try to run the service using docker-compose up and access http://localhost:8080/ I get this error:

Request failed with status code 400

The error shown in the terminal is related to the database:

Could not find foreign keys between these entities, 
No relation found between bidding_exemptions and gazette

[WIP] Spider for Ribeirão Preto/SP

@kerollaine and I are working together on this.

We've managed to create a simple Postman Collection showing the sequence of requests needed to download a single day Gazette from https://www.ribeiraopreto.sp.gov.br/J015/pesquisaData.xhtml. It is available here: https://documenter.getpostman.com/view/2394724/diariooficial/RW1boKvQ.

Because the application is JSF-based and maintains the view state on the server, we need to make all three requests to get a PDF file.

Cities that depend on js

What should our approach be for cities that depend on JS? I'm more used to Selenium and a browser, but we should probably define a single approach.
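For comparison, a headless-browser fetch is short but adds a browser dependency to the pipeline; a minimal Selenium sketch under that assumption (the URL and selector are placeholders, and render services such as Splash would be an alternative worth weighing):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    # Placeholder URL: a gazette page that only renders its links via JavaScript.
    driver.get("https://sitedomunicipio.example/diarios")
    # Once the browser has executed the page's scripts, the links are in the DOM.
    pdf_links = [
        a.get_attribute("href")
        for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")
    ]
finally:
    driver.quit()

print(pdf_links)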

CI pipeline

Take advantage of GitHub Actions to build a container image containing all the spiders, making it possible to pull it elsewhere and run it.

Code formatter?

I see a lot of feedback here focused on code formatting. What about automating this?

We could recommend Black and add a single line ($ black . --check) to our CI. This way, PEP8 and other common code-style linters would automatically raise a red flag, helping contributors keep the code compatible with best practices in the Python community.

Also, with $ black . we can reformat the whole code base at once to fit the style guide.

Surely I'm a candidate to implement this, but I'd like to hear any counter-arguments from you ; )

Fix Apache Tika URL

Currently, the URL we use to download Apache Tika into our container image (to run the spiders) changes annoyingly often. Every time a new version is released, the old one is deleted and our URL is no longer valid. We need to find a way to make this more stable. In the current version, the URL is maintained by Unicamp.

Off the top of my head, I think we can look for a URL that keeps more than one version at a time. That way, the binary would be available for longer.

List of the 100 largest municipalities in Brazil

Hi everyone, I want to start contributing to the project by picking a city to scrape, but the website has no information about which municipalities are the targets, only that they are the 100 largest in Brazil. Largest based on what? Population, GDP, etc.?

Change spiders to allow selecting the date/period of the items

Currently, when executing the spiders, all available gazettes are retrieved. This can be a waste of resources, since the extracted data (the gazettes) doesn't change.

I had a quick discussion with @Irio, and he suggested the following options:

  1. Include a parameter indicating the start date from which the spider should collect gazettes (see the sketch below);
  2. Check the database to identify when the last gazette was collected, and then collect everything after that.
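A sketch of what option 1 could look like (the spider name, selectors, and date format are placeholders): Scrapy passes -a command-line arguments to the spider's constructor, so the spider can receive the start date and skip older editions instead of downloading them again.

from datetime import datetime

import scrapy


class DateFilteredSketchSpider(scrapy.Spider):
    name = "date_filtered_sketch"
    start_urls = ["https://sitedomunicipio.example/diarios"]  # placeholder

    def __init__(self, start_date=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a start_date=2020-02-20 on the command line arrives here as a string.
        self.start_date = (
            datetime.strptime(start_date, "%Y-%m-%d").date() if start_date else None
        )

    def parse(self, response):
        for edition in response.css(".edition"):  # placeholder selector
            edition_date = datetime.strptime(
                edition.attrib["data-date"], "%Y-%m-%d"
            ).date()
            # Skip gazettes older than the requested window instead of fetching them.
            if self.start_date and edition_date < self.start_date:
                continue
            yield {
                "date": edition_date,
                "file_urls": [response.urljoin(edition.css("a::attr(href)").get())],
            }

It would then be invoked as scrapy crawl date_filtered_sketch -a start_date=2020-02-20.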

Suggestion: Use pytest and tox

I'd like to suggest using pytest instead of unittest. It's a very small change, the current tests don't need to change at all, and in return we'd get a few nice features from pytest. The one I miss the most is more useful assert error messages. For example, if I do assert my_dict == other_dict with unittest and it fails, I just get an AssertionError without any message. With the same code under pytest, I get a helpful message saying which elements of the dicts differ. The same goes for comparing arrays, dicts, or any other types.
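A tiny illustration of the difference, using a hypothetical test:

def test_gazette_metadata():
    parsed = {"date": "2018-08-20", "is_extra_edition": True}
    expected = {"date": "2018-08-20", "is_extra_edition": False}
    # Under pytest this failing assert prints exactly which key differs;
    # the same bare assert under unittest only raises a blank AssertionError.
    assert parsed == expected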

And tox is useful for running the tests with a single command, instead of having to remember python -m unittest discover when I want to run the tests outside of Docker. It keeps the virtualenvs configured, sets whatever environment variables are necessary, and can also run lint.

These two libraries have become the standard for most Python projects I see, and for good reason: they're great!

I'd be happy to send a PR with these changes (or just pytest) if you agree.

cc @Irio @cuducos

Command Execution

Hello. I need help with executing a command.
After executing the command:

docker-compose run --rm processing scrapy shell http://www2.portoalegre.rs.gov.br/dopa/

The system runs a few lines and then stops, waiting for further input.
Below is the output of the command:

docker-compose run --rm processing scrapy shell http://www2.portoalegre.rs.gov.br/dopa/
Starting diario-oficial_redis_1 ...
Starting diario-oficial_postgres_1 ... done
Starting diario-oficial_redis_1 ... done
2018-08-06 20:42:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-08-06 20:42:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.6 (default, Jul 17 2018, 11:12:33) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-4.9.93-linuxkit-aufs-x86_64-with-debian-9.5
2018-08-06 20:42:24 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-06 20:42:24 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-06 20:42:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-06 20:42:24 [scrapy.core.engine] INFO: Spider opened
2018-08-06 20:42:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www2.portoalegre.rs.gov.br/dopa/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f96c3f2b630>
[s] item {}
[s] request <GET http://www2.portoalegre.rs.gov.br/dopa/>
[s] response <200 http://www2.portoalegre.rs.gov.br/dopa/>
[s] settings <scrapy.settings.Settings object at 0x7f96c3f2b6a0>
[s] spider <DefaultSpider 'default' at 0x7f96c3b77470>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

Can you identify any errors?
Is it necessary to use a parameter?

Executing a crawler error - permission denied

I am getting this error when I try to run the crawlers...

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.guarulhos.sp.gov.br/uploads/pdf/2128651035.pdf>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 401, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 434, in file_downloaded
    self.store.persist_file(path, buf, info)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 54, in persist_file
    with open(absolute_path, 'wb') as f:
PermissionError: [Errno 13] Permission denied: '/mnt/data/full/a4bcac7d55a3f3e11b1dd8e0152d86037098dc5c.pdf'

I tried to change the permissions (chmod -R 0777 .) and rerun, but nothing changed...

Period of interest

Some of the spiders are limited to getting documents from 2014 up to the current year.
Should this rule be applied to all spiders?

Create date filter

Hello guys, as we discussed a few days ago, we found a potential problem in integrating OKFN's projects into Jusbrasil's stack for Brazilian data extraction.

We discussed some points about it, and we suggest some changes to the project:

  • Create a way to pass a date to filter the results.
  • Based on these filters, make as few requests as possible.
  • [Minor] Return the PDF links and more information if possible, like the publication date, etc.

An important point to remember is that we are available to help code these changes, if necessary.

Permission data/territories.csv

Hi,
I'm getting a permission error with data/territories.csv.
drwxr-xr-x 2 root root 452K jun 4 20:09 full
-rw-r--r-- 1 root root 193K jun 4 20:58 territories.csv

Should this file be owned by root?

How can I collaborate with the project?

Hi everyone!

In 2018 we held a sprint at Fab Lab Joinville to create spiders for several cities. At a glance, there are about 15 open PRs from that time.

I would like to bring the project back to our weekly Computing lab, where we bring in projects for the participants to contribute to. However, it is not very clear to me how we can contribute to it.

Should we review the spiders with open PRs?

Or build a portal to visualize the official gazettes?

Or perhaps extract data from the gazettes already collected?

I would like to be able to contribute to the project, and we have time and people willing to collaborate.

Looking forward to hearing back :)

Cheers!

Canoas/RS

I started working on the Canoas/RS crawler, but ran into some problems. The Canoas gazette website does not provide a direct link to the PDF; instead it uses cookies and a session for access, which complicates things a bit.

How it works: when you access the site and choose the gazette you want, a few requests are made to the server with the chosen day and pages, and then the PDF file is returned. If you access the file link directly, the server checks the session and returns either the PDF chosen in the previous requests or a 404. There is no unique path per file: the link is the same for every file, and what changes is only the date and pages set in the session.

So it is necessary to keep a separate session for each gazette day, which, from what I've seen, is possible with Scrapy. However, in the project today the crawler collects all the gazettes and file URLs and then hands them over to the next pipeline, which downloads the files. I believe cookies and the session are not kept across that pipeline hand-off, and since in this case the session is needed to download the file, I don't think it would work. What do you think is the best way to solve this?

Another problem is that we won't have a direct link to the gazette file as we do for Porto Alegre.
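One Scrapy feature that may help with the one-session-per-day part is the cookiejar meta key, which keeps separate cookie sessions within a single spider. The sketch below is only an illustration (URLs, dates, and the select/download steps are placeholders), and it does not by itself answer the question about the file-download pipeline:

import scrapy


class CanoasSessionSketchSpider(scrapy.Spider):
    name = "rs_canoas_sketch"

    def start_requests(self):
        days = ["2018-08-01", "2018-08-02"]  # placeholder list of gazette dates
        for i, day in enumerate(days):
            # A distinct cookiejar per day keeps each day's session cookies isolated.
            yield scrapy.Request(
                "https://diario.canoas.example/selecionar",  # placeholder URL
                meta={"cookiejar": i, "day": day},
                callback=self.select_day,
                dont_filter=True,
            )

    def select_day(self, response):
        # Requests issued from here reuse the same cookiejar, so the server-side
        # session set up for this day is still valid when fetching the PDF.
        yield scrapy.Request(
            "https://diario.canoas.example/arquivo.pdf",  # same URL for every day on the real site
            meta={"cookiejar": response.meta["cookiejar"], "day": response.meta["day"]},
            callback=self.save_pdf,
            dont_filter=True,
        )

    def save_pdf(self, response):
        yield {"date": response.meta["day"], "pdf_bytes": len(response.body)}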

Production infrastructure configuration

We need scripts to set up and configure a production environment for running the spiders.

My suggestion is to use Ansible for that job. We can use it to configure, install, and update the servers running the workloads. It is easy to use and does not require anything to be installed on the servers; all configuration is done via SSH.

Related to issue #157

Poor site of Duque de Caxias - RJ

The site has two formats of links and file names for the gazettes.

  1. http://duquedecaxias.rj.gov.br/portal/boletim-oficial/2015/01-Janeiro/
  2. http://duquedecaxias.rj.gov.br/portal/arquivos/2018/fevereiro/

The first format is for gazettes before 2017 and is very incomplete; it is missing a lot of gazettes.

The second format is for gazettes from 2017 onwards. In this format there is no way to get the day of the gazette.

It is necessary to get in contact with the city hall. What would be better: a project representative contacting the city hall, or me?

Guarujá's sp_guaruja.py is not working

Good afternoon,

The scraping code for the Guarujá city hall is not working. I believe this is because the city hall website was changed a short time ago and the structure of the official gazette site changed with it.

Thank you very much.

Replacement for Scrapy Contracts

@rennerocha recently raised the issue of Scrapy Contracts often making the test suite fail due to false positives.

The initial idea of using Contracts was to have a quick (for experienced developers) and beginner-friendly way of writing Scrapy tests, sparing developers from having to write test cases, create fixtures, and keep them up to date.

We could also keep using Contracts, knowing they will eventually fail in online tests, and use them only as a way of monitoring the state of all crawlers (today we have 13 different spiders) until we have a better and more robust solution.
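For context, a Contract lives in the callback's docstring and is checked with scrapy check, which is also why it goes online and can fail when the target site is slow or down. A minimal illustration (the URL and fields are placeholders):

import scrapy


class ContractedSketchSpider(scrapy.Spider):
    name = "contracted_sketch"

    def parse(self, response):
        """Parse the gazette listing page.

        @url https://sitedomunicipio.example/diarios
        @returns items 1
        @scrapes date file_urls
        """
        yield {
            "date": response.css(".date::text").get(),
            "file_urls": [response.css("a::attr(href)").get()],
        }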

Enable PDF upload to Google Cloud Storage

This issue is intended for people with little to no experience with open source contributions.

Nowadays, every time a Scrapy spider is run, the gazette PDF is saved and stored in the local filesystem, more specifically in the /mnt/data/ folder.

https://github.com/okfn-brasil/diario-oficial/blob/490270da0201471b9968268cd2eb0e9bde272310/processing/data_collection/gazette/settings.py#L10

When in a development environment, we still want to have files stored in the local filesystem, in this same folder. When in a production environment, the file should be sent to Google Cloud Storage.

Ideally, this behavior should be defined by an environment variable such as FILES_STORAGE.

Scrapy's documentation on the topic:
https://doc.scrapy.org/en/latest/topics/media-pipeline.html#id1
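A sketch of how the settings could branch on such a variable. The FILES_STORAGE name follows the suggestion above rather than any existing implementation; recent Scrapy versions accept a gs:// URI in FILES_STORE together with GCS_PROJECT_ID, as described in the documentation linked above.

# settings.py (sketch)
import os

# Defaults to the local folder used today; in production, point FILES_STORAGE
# at a bucket, e.g. FILES_STORAGE=gs://my-gazette-bucket/data
FILES_STORE = os.environ.get("FILES_STORAGE", "/mnt/data/")

# Needed by Scrapy's media pipelines when FILES_STORE is a gs:// URI
# (also requires the google-cloud-storage package).
GCS_PROJECT_ID = os.environ.get("GCS_PROJECT_ID", "")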

https://diario.serenata.ai is currently offline

Apparently, the diario-api service is offline. This is the result from curl https://diario.serenata.ai:

{
  "status": 500,
  "message": "getaddrinfo ENOTFOUND diario-api diario-api:3000",
  "name": "NuxtServerError"
}

Make unit_test failing due to new version of pytest

Make unit_test is failing since the release of pytest 3.10.0.

$ make unit_test
docker-compose run --rm processing pytest -p no:cacheprovider
Starting diario-oficial_rabbitmq_1 ... done
Starting diario-oficial_postgres_1 ... done
Starting diario-oficial_redis_1    ... done
==================================================================== test session starts =====================================================================
platform linux -- Python 3.6.7, pytest-3.10.0, py-1.7.0, pluggy-0.8.0
rootdir: /mnt/code, inifile:
plugins: celery-4.1.1
collected 42 items                                                                                                                                           

tests/test_tasks.py .......                                                                                                                            [ 16%]
tests/gazette/data/test_bidding_exemption_parsing.py .....................                                                                             [ 66%]
tests/gazette/data/test_row_update.py ....                                                                                                             [ 76%]
tests/gazette/data/test_section_parsing.py .....                                                                                                       [ 88%]
tests/gazette/locations/test_go_goiania.py ...                                                                                                         [ 95%]
tests/gazette/locations/test_rs_porto_alegre.py ..                                                                                                     [100%]Traceback (most recent call last):
  File "/usr/local/bin/pytest", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/_pytest/config/__init__.py", line 76, in main
    return config.hook.pytest_cmdline_main(config=config)
  File "/usr/local/lib/python3.6/site-packages/pluggy/hooks.py", line 284, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 67, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 61, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.6/site-packages/_pytest/main.py", line 218, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/usr/local/lib/python3.6/site-packages/_pytest/main.py", line 211, in wrap_session
    session=session, exitstatus=session.exitstatus
  File "/usr/local/lib/python3.6/site-packages/pluggy/hooks.py", line 284, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 67, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python3.6/site-packages/pluggy/manager.py", line 61, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 203, in _multicall
    gen.send(outcome)
  File "/usr/local/lib/python3.6/site-packages/_pytest/terminal.py", line 627, in pytest_sessionfinish
    outcome.get_result()
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.6/site-packages/_pytest/stepwise.py", line 102, in pytest_sessionfinish
    self.config.cache.set("cache/stepwise", [])
AttributeError: 'Config' object has no attribute 'cache'
Makefile:4: recipe for target 'unit_test' failed
make: *** [unit_test] Error 1

postgres save error: duplicate key value violates unique constraint

When running a crawler for the second time, a duplicate key exception is raised by Postgres.

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/mnt/code/data_collection/gazette/pipelines.py", line 28, in process_item
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 493, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2458, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2596, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2556, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1120, in _emit_insert_statements
    statement, params
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 550, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "gazettes_territory_id_date_file_checksum_key"
DETAIL:  Key (territory_id, date, file_checksum)=(3509502, 2016-09-01, 6509975a6cba06faa73eb0b463acea30) already exists.

[SQL: INSERT INTO gazettes (source_text, date, is_extra_edition, is_parsed, power, file_checksum, file_path, file_url, scraped_at, created_at, territory_id) VALUES (%(source_text)s, %(date)s, %(is_extra_edition)s, %(is_parsed)s, %(power)s, %(file_checksum)s, %(file_path)s, %(file_url)s, %(scraped_at)s, %(created_at)s, %(territory_id)s) RETURNING gazettes.id]
[parameters: {'source_text': 'Nº 11.431 - Ano XLV\n                           Diário Oficial                     Quinta-feira, 01 de setembro de 2016                               ... (1320312 characters truncated) ... . - Alteração da forma de cobrança\n                                                                                       da taxa de material.\n\x0c', 'date': datetime.date(2016, 9, 1), 'is_extra_edition': False, 'is_parsed': False, 'power': 'executive_legislature', 'file_checksum': '6509975a6cba06faa73eb0b463acea30', 'file_path': 'full/4abcf9e6df73b424f8e3b38140a3062af85b4800.pdf', 'file_url': 'http://www.campinas.sp.gov.br/uploads/pdf/2118544071.pdf', 'scraped_at': datetime.datetime(2019, 12, 6, 17, 40, 55, 19918), 'created_at': datetime.datetime(2019, 12, 6, 17, 41, 1, 688711), 'territory_id': '3509502'}]
(Background on this error at: http://sqlalche.me/e/gkpj)

Dispensa de licitação and inexigibilidade

Do you think it would be worthwhile to locate not only the bidding waivers (dispensas de licitação), but the cases of inexigibilidade (non-enforceability of bidding) as well?

"The basic difference between the two situations is that, in the case of dispensa, competition that would justify a bidding process is possible, so the law allows the waiver, which falls within the Administration's discretionary competence. In the cases of inexigibilidade, competition is not possible, because there is only one object or one person able to meet the Administration's needs; the bidding process is therefore unviable."
DI PIETRO, Maria Sylvia Zanella. Direito administrativo. 14. ed. São Paulo: Atlas, 2002, p. 310, 320-321.
Source

spider automation test

I believe we should find a way to run some kind of automated test on the spiders. My first idea is to use GitHub Actions to run the spiders for a while and check whether they download some files. If they do, great, the test passes; otherwise, it fails.

Furthermore, if the test fails, we can use the GitHub Action to open an issue automatically.
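A sketch of what such a check could look like in Python before wiring it into GitHub Actions (the spider name, limits, and exit-code convention are placeholders):

import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_smoke_test(spider_name, max_minutes=5):
    settings = get_project_settings()
    # Stop the crawl early: this is a smoke test, not a full collection run.
    settings.set("CLOSESPIDER_TIMEOUT", max_minutes * 60)
    settings.set("CLOSESPIDER_ITEMCOUNT", 5)

    process = CrawlerProcess(settings)
    crawler = process.create_crawler(spider_name)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    scraped = crawler.stats.get_value("item_scraped_count", 0)
    return scraped > 0


if __name__ == "__main__":
    # Exit non-zero when nothing was scraped so CI can flag the spider.
    sys.exit(0 if run_smoke_test(sys.argv[1]) else 1)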

Black pre-commit unstable

The Black pre-commit hook is unstable.

When I run docker-compose run --rm processing black . --check, I get

(...)
would reformat /mnt/code/tests/gazette/data/test_bidding_exemption_parsing.py
All done! ✨ 🍰 ✨
1 file would be reformatted, 56 files would be left unchanged.

So I run docker-compose run --rm processing black . --safe and this file is formatted.

But when I try to commit, pre-commit blocks the commit and formats the file back to the format that the Black from Docker complained about (test_bidding_exemption_parsing).

I suspect that the Black inside Docker and the Black used by pre-commit are different versions, but I don't know how to make them match.


I already tried to:

  • Install black globally at the same version as in requirements.txt.
  • Use the entry argument in .pre-commit-config.yaml with the Docker command for black.

Common platform used by many municipalities

While looking for gazettes for Paranavaí the official site redirects to a portal that seems to be used by lots of smaller municipalities.

http://www.diariomunicipal.com.br

This portal even allows searching for terms like "Dispensa"

http://www.diariomunicipal.com.br/amp/pesquisar?entidadeUsuaria=&titulo=Dispensa&nome_orgao=&dataInicio=01%2F05%2F2018&dataFim=31%2F05%2F2018&Enviar=&_token=aAfXBvwcSMnp4lRq1qoriAGxW9rav7iawfmpUeHGLMk

The only problem is that the result is not a single PDF, but a page for each result. I believe another pipeline would be needed to scrape this website.
Although I haven't found any city in the top 100 using this portal, maybe it would be worth scraping it given the number of cities that use it.

What do you think?
