demos's People

Contributors

benjaminpochat, dependabot[bot]

demos's Issues

Analyze French intercommunalités web sites

Extend the scraping process to analyze the web sites of French intercommunalités as well.
They are less numerous than communes, so a full pass over them can be completed in a short time, which helps with detecting new council reports and sending alerts about them.

Too many java.io.IOException errors in the PDF converting process

The logs of the PDF converting process contain many errors like the one below, even though the URL is correct and the file downloads successfully in a regular web browser.

2020-01-01 21:11:52 ERROR PdfConverter:41 - An error occurs while converting http://mairie-larbresle.fr/public/files/0/mairie/conseil_municipal/comptes_rendus/2013/crcm09-07-2013_57fb82f263f07.pdf into text.
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1123)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2595)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2566)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:219)
at org.demos.pdfconverter.process.PdfConverter.getPDDocument(PdfConverter.java:50)
at org.demos.pdfconverter.process.PdfConverter.convert(PdfConverter.java:36)
at org.apache.kafka.streams.kstream.internals.AbstractStream.lambda$withKey$1(AbstractStream.java:103)
at org.apache.kafka.streams.kstream.internals.KStreamMapValues$KStreamMapProcessor.process(KStreamMapValues.java:40)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:117)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:133)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:366)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:199)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:420)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:890)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:805)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:774)
2020-01-01 21:11:52 INFO WebDocumentFilterByContent:18 - Document found at url http://mairie-larbresle.fr/public/files/0/mairie/conseil_municipal/comptes_rendus/2013/crcm09-07-2013_57fb82f263f07.pdf is dropped from the processing stream and won't be sent to the classification process, because its text content is null.

A possible solution is discussed here:
https://stackoverflow.com/questions/50953924/pdfbox-ioexception-end-of-file-expected-line
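
One cause worth ruling out is that the converter receives an empty or non-PDF response body (for example an HTML error page) even though a browser gets the file. This is an assumption, not something the logs above confirm. A minimal Python sketch of a pre-check on the downloaded bytes follows; `looks_like_pdf`, `PDF_MAGIC` and the 1024-byte search window are hypothetical names and values chosen for illustration, not part of Demos or PDFBox.

```python
# Hypothetical pre-check, not taken from the Demos code base.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of data.

    The marker is searched in the first `search_window` bytes rather than
    only at offset 0, so files with a few leading junk bytes still pass.
    """
    return PDF_MAGIC in data[:search_window]
```

Logging the first bytes of any body that fails this check would show whether the server answers the scraper differently than it answers a browser.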

Enable OCR for reports scanned as images in PDF

The classification process only works for PDFs that contain text.
Some official reports consist of text scanned as images and embedded in the PDF. Demos cannot reference these reports. An OCR module that converts such images into actual text would improve the coverage of the analysed documents.

https://tika.apache.org/ seems to offer this feature. There may be other open source candidates as well.
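
Before any OCR engine is chosen, the pipeline needs a way to decide which documents to send to it. A minimal sketch of such a check, assuming the text extraction step also reports a page count; `needs_ocr` and the threshold of 50 characters per page are hypothetical and would need tuning on real reports.

```python
def needs_ocr(extracted_text, page_count, min_chars_per_page=50):
    """Guess whether a PDF is a scan, based on how little text was extracted.

    Hypothetical heuristic: a document whose pages yield fewer than
    `min_chars_per_page` non-whitespace characters on average is treated
    as image-only and therefore as a candidate for OCR.
    """
    if page_count <= 0:
        # Nothing to analyse, so there is nothing to OCR either.
        return False
    text = extracted_text or ""
    visible_chars = sum(1 for ch in text if not ch.isspace())
    return visible_chars / page_count < min_chars_per_page
```

With a check like this, documents whose text content is null could be routed to OCR instead of being dropped from the processing stream.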

Optimizing the scraping/classification process: scraping rotation

The scraping process needs to be optimized so that the local government to analyze is no longer selected at random. One simple rule: give priority to the local government web site that has gone the longest without being analyzed. This requires persisting the date of the last scan for each web site in the database.
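
The rotation rule can be sketched with an in-memory SQLite database. The table name `web_site`, its columns and the helper functions are assumptions made for illustration; the actual Demos schema and database engine are not described here. Dates are stored as ISO 8601 strings so that they sort chronologically, and sites that were never scanned (NULL date) come first because SQLite orders NULL before any other value in ascending order.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table: one row per local government web site.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS web_site ("
        "url TEXT PRIMARY KEY, last_scan_date TEXT)"
    )


def pick_next_site(conn):
    """Return the URL scanned least recently, never-scanned sites first."""
    row = conn.execute(
        "SELECT url FROM web_site "
        "ORDER BY last_scan_date ASC, url ASC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_scanned(conn, url, scan_date):
    """Persist the date of the scan that was just completed for this site."""
    conn.execute(
        "UPDATE web_site SET last_scan_date = ? WHERE url = ?",
        (scan_date, url),
    )
```

Calling `pick_next_site`, scraping the result, then calling `mark_scanned` moves that site to the back of the queue, so every site is eventually visited.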

Docker service stops if the timeout for converting PDF to text is reached

The error is given below:
demos-scraper | 2019-06-12 00:23:39,771 - demos - INFO - gc.garbage =
demos-scraper | Unhandled Error
demos-scraper | Traceback (most recent call last):
demos-scraper | File "/app/src/main/python/process/archiving/archiver.py", line 24, in _crawl_local_governments_web_sites
demos-scraper | crawling_process.crawl()
demos-scraper | File "/app/src/main/python/process/crawling/crawling_process.py", line 22, in crawl
demos-scraper | self.crawler_process.start()
demos-scraper | File "/usr/local/lib/python3.5/site-packages/scrapy/crawler.py", line 291, in start
demos-scraper | reactor.run(installSignalHandlers=False) # blocking call
demos-scraper | File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1261, in run
demos-scraper | self.mainLoop()
demos-scraper | --- <exception caught here> ---
demos-scraper | File "/usr/local/lib/python3.5/site-packages/twisted/internet/base.py", line 1273, in mainLoop
demos-scraper | self.doIteration(t)
demos-scraper | File "/usr/local/lib/python3.5/site-packages/twisted/internet/epollreactor.py", line 218, in doPoll
demos-scraper | l = self._poller.poll(timeout, len(self._selectables))
demos-scraper | File "/app/src/main/python/process/pdf_converter/pdf_converter.py", line 29, in timeout_handler
demos-scraper | raise Exception("timeout reached (" + str(self._timeout) + 's)')
demos-scraper | builtins.Exception: timeout reached (300s)
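
In the traceback above, `timeout_handler` raises while the Twisted reactor is polling, which suggests the alarm fires outside the conversion call and that nothing catches the exception. One way to contain it, sketched under those assumptions, is to scope the alarm to the conversion with a context manager, always cancel it afterwards, and raise a dedicated exception that the caller handles. `ConversionTimeout`, `time_limit` and `convert_with_timeout` are hypothetical names, and the approach relies on `SIGALRM`, so it only works on Unix and in the main thread.

```python
import signal
from contextlib import contextmanager


class ConversionTimeout(Exception):
    """Raised when a PDF conversion exceeds its time budget."""


@contextmanager
def time_limit(seconds):
    def _on_alarm(signum, frame):
        raise ConversionTimeout("timeout reached (%ss)" % seconds)

    previous_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer so it cannot fire later, e.g. in the reactor.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous_handler)


def convert_with_timeout(convert, url, seconds=300):
    """Run convert(url); return None instead of crashing on timeout."""
    try:
        with time_limit(seconds):
            return convert(url)
    except ConversionTimeout:
        return None
```

Returning `None` on timeout lets the crawler skip the slow document and carry on instead of stopping the whole service.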

Structure the technical documentation

The technical documentation (readme.md) is messy. It should mirror the structure of the code:

  • training
  • scraping
  • gui
  • core
  • package
