alephdata / aleph
Search and browse documents and data; find the people and companies you look for.
Home Page: http://docs.aleph.occrp.org
License: MIT License
When a user's search matches documents that are not visible to them, return the name of the person they need to contact to get access to such documents.
(talking with @danohuiginn)
This exists on OpenOil's aleph and should be ported.
Facet for tags on documents.
This should be another analyser, perhaps based on a stand-alone library like: https://github.com/pudo/tidbits
For incremental scrapes on very large data sources, this would allow the system to cancel the scrape once a certain number of requests haven't found fresh documents.
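A minimal sketch of that cut-off, with hypothetical callables standing in for the real crawler hooks:

    STALE_LIMIT = 50  # abort after this many consecutive stale responses

    def crawl_incremental(urls, fetch, is_fresh, ingest, stale_limit=STALE_LIMIT):
        """Stop crawling once `stale_limit` requests in a row find nothing fresh."""
        stale = 0
        for url in urls:
            document = fetch(url)
            if is_fresh(document):
                stale = 0
                ingest(document)
            else:
                stale += 1
                if stale >= stale_limit:
                    break  # cancel the scrape; the source has no new documents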
Right now, we're always rendering the PDF version, but people might want to see our glorious OCR results, too -- and perhaps apply some Google Translate to them.
And make sure all of the metadata properties are automatically introduced into the relevant mapping.
This requires:
So that we can easily edit sources, watchlists etc.
At the moment, the code used to receive an OAuth provider response and to turn it into a set of roles is a hard-coded set of supported services. This should come in via a plugin architecture instead.
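A sketch of what such a plugin registry could look like; the decorator and the Google payload shape are assumptions, not aleph's actual API:

    OAUTH_HANDLERS = {}

    def oauth_handler(provider):
        """Register a function that turns a provider response into a set of roles."""
        def decorator(func):
            OAUTH_HANDLERS[provider] = func
            return func
        return decorator

    @oauth_handler('google')
    def google_roles(response):
        # Assumed payload shape; real providers differ.
        return set(['user:%s' % response['email']])

    def roles_for(provider, response):
        handler = OAUTH_HANDLERS.get(provider)
        return handler(response) if handler else set()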
Perhaps also for file size and similar attributes.
Use Case
As a journalist / data importer, I want to be alerted to mentions of an entity I am interested in, so that I can sift through the imported documents in case the entity of interest is missing in the current batch of documents.
An instance of Aleph should have a whitelist of "plausible" languages which can be detected and which documents can be tagged with. All others are removed both at ingest and results presentation time.
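A minimal sketch of the filter, assuming a hypothetical PLAUSIBLE_LANGUAGES setting:

    # Hypothetical per-instance setting; not an existing aleph config key.
    PLAUSIBLE_LANGUAGES = set(['en', 'de', 'fr', 'ru', 'es'])

    def filter_languages(detected):
        """Drop implausible codes at ingest and at results presentation time."""
        return [lang for lang in detected if lang in PLAUSIBLE_LANGUAGES]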
This makes authz much simpler, and makes more sense in the context of allowing users to upload their own documents.
Hi,
Thanks once again for all your help so far!
I have finally managed to get the Aleph UI up and running.
Here is the URL: http://54.191.176.203:13376/
Now I want to perform the following tasks.
Our architecture consists of the following items.
a. AWS EC2 c4x4 large shared Instance
b. OS
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
API errors: when we click on Login, it throws an error. How can we fix it?
Regards.
I have installed aleph via Docker and docker-compose. Upon running "aleph upgrade", it throws the following error.
[root@localhost aleph]# docker-compose run worker /bin/bash
Starting aleph_elasticsearch_1
Starting aleph_postgres_1
root@58ced8c:/aleph# aleph upgrade
INFO:aleph.model:Beginning database migration...
INFO:alembic.runtime.migration:Context impl PostgresqlImpl.
INFO:alembic.runtime.migration:Will assume transactional DDL.
INFO:aleph.model:Creating system roles...
WARNING:elasticsearch:PUT /aleph/mapping/document [status:404 request:0.835s]
Traceback (most recent call last):
File "/usr/local/bin/aleph", line 9, in
load_entry_point('aleph', 'console_scripts', 'aleph')()
File "/aleph/aleph/manage.py", line 167, in main
manager.run()
File "/usr/local/lib/python2.7/site-packages/flask_script/_init.py", line 412, in run
result = self.handle(sys.argv[0], sys.argv[1:])
File "/usr/local/lib/python2.7/site-packages/flask_script/init.py", line 383, in handle
res = handle(args, *config)
File "/usr/local/lib/python2.7/site-packages/flask_script/commands.py", line 216, in call
return self.run(args, *kwargs)
File "/aleph/aleph/manage.py", line 148, in upgrade
upgrade_search()
File "/aleph/aleph/index/admin.py", line 26, in upgrade_search
doc_type=TYPE_DOCUMENT)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 291, in put_mapping
'_mapping', doc_type), params=params, body=body)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, u'index_not_found_exception')
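For anyone hitting this: the 404 on PUT /aleph/mapping/document suggests the aleph index does not yet exist when the mapping is applied. A hedged workaround sketch using the elasticsearch client directly (the index name is taken from the request path in the warning above):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://elasticsearch:9200/'])
    if not es.indices.exists(index='aleph'):
        es.indices.create(index='aleph')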
First off, great work on this project!
I've gotten a bit stuck with getting a version of this running completely locally using docker. I've noticed I can specify the archive type as file and run a RabbitMQ queue instead of using SQS, which is really nice. I've tried to do that using the docker set-up but seem to be getting a connection error when I come to use aleph crawldir.
Error:
INFO:aleph.ingest.ingestor:Traceback (most recent call last):
File "/aleph/aleph/ingest/__init__.py", line 58, in ingest_file
ingest.delay(collection_id, meta.to_attr_dict())
File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 453, in delay
return self.apply_async(args, kwargs)
File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 565, in apply_async
**dict(self._get_exec_options(), **options)
File "/usr/local/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
reply_to=reply_to or self.oid, **options
File "/usr/local/lib/python2.7/site-packages/celery/app/amqp.py", line 310, in publish_task
**kwargs
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
routing_key, mandatory, immediate, exchange, declare)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 457, in _ensured
interval_max)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 369, in ensure_connection
interval_start, interval_step, interval_max, callback)
File "/usr/local/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 237, in connect
return self.connection
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 742, in connection
self._connection = self._establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 697, in _establish_connection
conn = self.transport.establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
conn = self.Connection(**opts)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
self.transport = self.Transport(host, connect_timeout, ssl)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
return create_transport(host, connect_timeout, ssl)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
return TCPTransport(host, connect_timeout)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
raise socket.error(last_err)
error: [Errno 111] Connection refused
I've added RabbitMQ as a separate container and linked it to the other containers. Here is my docker-compose file:
postgres:
  image: postgres:9.4
  volumes:
    - "/opt/aleph/data/postgres:/var/lib/postgresql/data"
    - "/opt/aleph/logs/postgres:/var/log"
  environment:
    POSTGRES_USER: aleph
    POSTGRES_PASSWORD: aleph
    POSTGRES_DATABASE: aleph
  ports:
    - "127.0.0.1:5439:5432"

elasticsearch:
  image: elasticsearch:2.2.0
  volumes:
    - "/opt/aleph/data/elasticsearch:/usr/share/elasticsearch/data"
    - "/opt/aleph/logs/elasticsearch:/var/log"
  ports:
    - "127.0.0.1:9201:9209"
  # environment:
  #   ES_HEAP_SIZE: 4g

worker:
  build: .
  command: celery -A aleph.queue worker -c 10 -l INFO --logfile=/var/log/celery.log
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/:/host"
    - "/opt/aleph/data:/opt/aleph/data"
    - "/opt/aleph/logs/worker:/var/log"
  environment:
    C_FORCE_ROOT: 'true'
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
    POLYGLOT_DATA_PATH: /opt/aleph/data
    TESSDATA_PREFIX: /usr/share/tesseract-ocr
  env_file:
    - aleph.env

beat:
  build: .
  command: celery -A aleph.queue beat -s /var/run/celerybeat-schedule
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/opt/aleph/logs/beat:/var/log"
    - "/opt/aleph/run/beat:/var/run"
  environment:
    C_FORCE_ROOT: 'true'
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
  env_file:
    - aleph.env

web:
  build: .
  command: gunicorn -w 5 -b 0.0.0.0:8000 --log-level info --log-file /var/log/gunicorn.log aleph.manage:app
  ports:
    - "13376:8000"
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/opt/aleph/logs/web:/var/log"
  environment:
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
  env_file:
    - aleph.env

rabbitmq:
  image: rabbitmq
  ports:
    - 5672
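One observation on the compose file above: none of the aleph containers are told where the broker lives, so Celery will try its default of localhost and get connection refused. A hedged sketch of the missing setting, assuming the broker URI is read from an ALEPH_BROKER_URI environment variable (check aleph's settings module for the exact name):

worker:
  environment:
    # Assumed variable name; point Celery at the linked rabbitmq container.
    ALEPH_BROKER_URI: amqp://guest:guest@rabbitmq:5672//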
At the moment, user-triggered background processing (such as entity updates) is handled by the same queue as bulk document imports, which means it can be delayed by hours or days. These tasks should go into different queues and be processed either by different worker daemons, or at a different priority.
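A sketch of the split using Celery's standard routing configuration; the entity-update task name is a placeholder, while aleph.ingest.ingest is the ingest task visible in the worker logs quoted further down:

    from kombu import Queue

    CELERY_QUEUES = (
        Queue('user', routing_key='user'),  # interactive, user-triggered work
        Queue('bulk', routing_key='bulk'),  # long-running document imports
    )
    CELERY_ROUTES = {
        'aleph.entities.update_entity': {'queue': 'user'},  # placeholder name
        'aleph.ingest.ingest': {'queue': 'bulk'},
    }

    # Dedicated workers per queue, e.g.:
    #   celery -A aleph.queue worker -Q user -c 4
    #   celery -A aleph.queue worker -Q bulk -c 10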
The logout follow-up page points to /crawlers, which is only accessible by an admin who is logged in.
There should be a way to run all documents from a particular source through an external entity extractor, such as Stanford NER or Reuters PermID.
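A sketch of the hook, posting document text to an external service; the endpoint and response format here are purely hypothetical:

    import requests

    def extract_entities(text, url='http://localhost:9191/ner'):
        """Send document text to an external extractor; hypothetical endpoint."""
        resp = requests.post(url, data=text.encode('utf-8'))
        resp.raise_for_status()
        return resp.json()  # assumed to be a list of entity annotations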
Currently, you can set up an alert on a particular query, but the query "Show documents matching entities on this watchlist" is not accessible via the UI. Need to expose it, and assign good names to these alerts.
For use in the new occrp.org home page.
Create a complete log of queries performed and documents viewed by each user in the database as a foundation of future personalisation / ranking functions.
Holy shit, why do we need to have multi-page images?
This results from the PDF load failing; I suspect it might be related to the CORS configuration of the S3 bucket.
Consolidate the Docker setup into one (or more) base images, to streamline deployment.
Starting a discussion here, ping @danohuiginn.
Watchlists are right now meant to be collections of entities that get cross-referenced with the documents in aleph. I'm planning to extend this with additional functionality, such as making the Watchlist/Entity relationships many-to-many and allowing for the de-duplication of entities (currently, our aleph has 5 Bashar Al-Assads).
The next step could be to make Watchlists capable of holding documents as well as entities. This might make sense to allow users to group together documents they're interested in for a particular purpose. However, this is where the name stops making sense. I'd therefore like to propose renaming Watchlists now (before there are too many API dependencies).
What do you think?
Document search fails with error below:
Excerpts from ElasticSearch logs:
[2016-06-12 18:36:09,389][DEBUG][action.search ] [Robert Bruce Banner] All shards failed for phase: [query]
RemoteTransportException[[Robert Bruce Banner][172.31.5.13:9300][indices:data/read/search[phase/query]]]; nested: SearchParseException[failed to parse search source [{"sort": [{"doc_count": "desc"}, "_score"], "query": {"filtered": {"filter": {"terms": {"collection_id": []}}, "query": {"bool": {"must": [{"match_phrase_prefix": {"terms": "ghana"}}, {"range": {"doc_count": {"gte": 0}}}]}}}}, "_source": ["name", "$schema", "terms", "doc_count"], "size": 5}]]; nested: SearchParseException[No mapping found for [doc_count] in order to sort on];
Caused by: SearchParseException[failed to parse search source [{"sort": [{"doc_count": "desc"}, "_score"], "query": {"filtered": {"filter": {"terms": {"collection_id": []}}, "query": {"bool": {"must": [{"match_phrase_prefix": {"terms": "ghana"}}, {"range": {"doc_count": {"gte": 0}}}]}}}}, "_source": ["name", "$schema", "terms", "doc_count"], "size": 5}]]; nested: SearchParseException[No mapping found for [doc_count] in order to sort on];
Running version: c5fb231
ES version: 2.3.1
@pudo any ideas?
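One hedged pointer: in Elasticsearch 2.x, sorting on a field that has no mapping in one of the queried indices fails exactly like this unless the sort clause declares an unmapped_type. A sketch of the adjusted clause:

    # Tolerate indices where `doc_count` is not mapped (ES 2.x feature).
    sort = [
        {'doc_count': {'order': 'desc', 'unmapped_type': 'long'}},
        '_score',
    ]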
This is a tracking ticket to explain the why & how of integrating a graph engine into aleph.
@danohuiginn to submit PR :)
Yet another round of events stuff: make a database table with all important user interactions, i.e. login, logout, search and document view/fetch. Both for statistics and, in the long run, to show users their own search history.
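A sketch of such a table in the Flask-SQLAlchemy style aleph already uses; the column names are assumptions:

    from datetime import datetime
    from aleph.core import db  # aleph's shared SQLAlchemy handle

    class Event(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        role_id = db.Column(db.Integer, db.ForeignKey('role.id'), index=True)
        action = db.Column(db.Unicode)   # login, logout, search, view, fetch
        data = db.Column(db.Unicode)     # e.g. the query string or document id
        created_at = db.Column(db.DateTime, default=datetime.utcnow)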
Collections should have their own UI in which users can browse, edit, add and remove documents and entities associated with that collection.
Some file formats, such as ESRI Shapefiles or Cronos databases, are based on the contents of a directory, rather than a file. Since Aleph handles archiving on a per-file basis, these data types cannot be ingested properly. The proposed solution to this issue is to introduce a new mechanism, bundles. A bundle is a generated ZIP file, e.g. folder-name.shapefile or database.cro, that is created when the folder is available and then parsed as a file by a format-specific ingestor. The bundling is done upon crawling from a directory or package.
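A sketch of the bundling step; the suffix convention follows the examples above:

    import os
    import zipfile

    def make_bundle(directory, suffix='.shapefile'):
        """Zip a directory into `<directory><suffix>` so it ingests as one file."""
        bundle_path = directory.rstrip(os.sep) + suffix
        with zipfile.ZipFile(bundle_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            for root, _, files in os.walk(directory):
                for name in files:
                    path = os.path.join(root, name)
                    zf.write(path, os.path.relpath(path, directory))
        return bundle_path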
[2016-07-26 12:44:49,927: ERROR/MainProcess] Task aleph.ingest.ingest[86d1301b-1d82-40f4-9208-dd16e568fc07] raised unexpected: AttributeError("'s3.ObjectSummary' object has no attribute 'download_file'",)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/newrelic-2.46.0.37/newrelic/hooks/application_celery.py", line 66, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/aleph/aleph/ingest/__init__.py", line 97, in ingest
Ingestor.dispatch(collection_id, meta)
File "/aleph/aleph/ingest/ingestor.py", line 102, in dispatch
local_path = get_archive().load_file(meta)
File "/aleph/aleph/archive/s3.py", line 82, in load_file
obj.download_file(path)
AttributeError: 's3.ObjectSummary' object has no attribute 'download_file'
after uploading a file via web interface
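For what it's worth, in boto3 an s3.ObjectSummary is a lightweight listing object and genuinely has no download_file; resolving it to a full Object first (or going through the low-level client) avoids the AttributeError. A sketch of the likely fix in aleph/archive/s3.py:

    # ObjectSummary -> Object carries the same bucket/key but has download_file.
    obj.Object().download_file(path)

    # Equivalent route via the low-level client:
    # obj.meta.client.download_file(obj.bucket_name, obj.key, path)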
Collection should support a natural boost, on a scale of 1-6, relative to other collections. This can be used to rank exclusive in-house materials over a scrape of a government database, and that in turn over press clippings.
/cc @danohuiginn
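A sketch of how the boost could be applied at query time with a function_score query; the collection_boost field name is hypothetical:

    query = {
        'query': {
            'function_score': {
                'query': {'match': {'text': 'ghana'}},
                'field_value_factor': {
                    'field': 'collection_boost',  # hypothetical 1-6 field
                    'missing': 1,
                },
                'boost_mode': 'multiply',
            }
        }
    }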
Hi,
I have successfully installed Aleph; however, I'm unable to figure out how to run its web UI for search.
Server Installation Details:
Finally:
root@7748cd87f225:/aleph# aleph runserver
INFO:werkzeug: * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Kindly guide me: what are the further dependencies, and how can I get to its web UI?
Regards.
The following come to mind:
This can be done via LibreOffice, but the file types need to be added and PDF conversion needs to be verified.
Currently, if a file cannot be ingested, no record is left in the database (i.e. no Document is created). Instead, a stub record should be created and filled with any exceptions that occur during processing (e.g. unsupported file format, parsing errors, etc.).
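A sketch of the flow; Document.by_meta and the status fields are hypothetical, while Ingestor.dispatch appears in the tracebacks above:

    from aleph.core import db
    from aleph.model import Document             # import paths assumed
    from aleph.ingest.ingestor import Ingestor   # path seen in the traceback above

    def ingest_safely(collection_id, meta):
        document = Document.by_meta(collection_id, meta)  # hypothetical helper
        try:
            Ingestor.dispatch(collection_id, meta)
            document.status = 'success'
        except Exception as exc:
            document.status = 'failed'             # hypothetical columns
            document.error_message = unicode(exc)  # Python 2, as in the logs
        db.session.commit()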
aleph.env should contain a list of directories to look for template files, allowing any file to be overridden
custom style sheets can be specified from config
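A sketch of the override mechanism using Jinja2's ChoiceLoader; the ALEPH_TEMPLATE_PATHS setting name is made up:

    from flask import Flask
    from jinja2 import ChoiceLoader, FileSystemLoader

    app = Flask(__name__)
    # Hypothetical setting: directories searched before aleph's own templates.
    extra_paths = app.config.get('ALEPH_TEMPLATE_PATHS', [])
    app.jinja_loader = ChoiceLoader(
        [FileSystemLoader(path) for path in extra_paths] + [app.jinja_loader]
    )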
Outlook exports, rfc822 files (.mbox, .msg) and Maildirs :) What should this look like in the UI, a PDF or something more structured?
Sift through large sets of structured and unstructured data, and find the people and comapnies you look for.
Should read: comapnies->companies
For the API: how to search, etc.
In particular, I've seen this for documents in Afrikaans, English and Xhosa, e.g. https://www.westerncape.gov.za/other/2011/9/extragaz_6908.pdf
When a user signs in, they are assigned a set of roles via OAuth. This includes their user role, but also other roles, such as user groups. These links aren't currently stored in the database, which means offline subsystems (like alerting) can't know which roles a user is permitted to access.