alephdata / aleph
Search and browse documents and data; find the people and companies you look for.
Home Page: http://docs.aleph.occrp.org
License: MIT License
When a user's search matches documents that are not visible to them, return the name of the person they need to contact to get access to such documents.
(talking with @danohuiginn)
This exists on OpenOil's aleph and should be ported.
Facet for tags on documents.
This should be another analyser, perhaps based on a stand-alone library like: https://github.com/pudo/tidbits
For incremental scrapes on very large data sources, this would allow the system to cancel the scrape once a certain number of requests haven't found fresh documents.
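A minimal sketch of that cut-off, with hypothetical callables standing in for the real crawler hooks:

    STALE_LIMIT = 50  # abort after this many consecutive stale responses

    def crawl_incremental(urls, fetch, is_fresh, ingest, stale_limit=STALE_LIMIT):
        """Stop crawling once `stale_limit` requests in a row find nothing fresh."""
        stale = 0
        for url in urls:
            document = fetch(url)
            if is_fresh(document):
                stale = 0
                ingest(document)
            else:
                stale += 1
                if stale >= stale_limit:
                    break  # cancel the scrape; the source has no new documents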
Right now, we're always rendering the PDF version, but people might want to see our glorious OCR results, too -- and perhaps apply some Google Translate to them.
And make sure all of the metadata properties are automatically introduced into the relevant mapping.
This requires:
So that we can easily edit sources, watchlists etc.
At the moment, the code used to receive an OAuth provider response and to turn it into a set of roles is a hard-coded set of supported services. This should come in via a plugin architecture instead.
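A sketch of what such a plugin registry could look like; the decorator and the Google payload shape are assumptions, not aleph's actual API:

    OAUTH_HANDLERS = {}

    def oauth_handler(provider):
        """Register a function that turns a provider response into a set of roles."""
        def decorator(func):
            OAUTH_HANDLERS[provider] = func
            return func
        return decorator

    @oauth_handler('google')
    def google_roles(response):
        # Assumed payload shape; real providers differ.
        return set(['user:%s' % response['email']])

    def roles_for(provider, response):
        handler = OAUTH_HANDLERS.get(provider)
        return handler(response) if handler else set()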
Perhaps also for file size and similar attributes.
Use Case
As a journalist / data importer, I want to be alerted to mentions of an entity I am interested in, so that I can sift through the imported documents in case the entity of interest is missing in the current batch of documents.
An instance of Aleph should have a whitelist of "plausible" languages which can be detected and which documents can be tagged with. All others are removed both at ingest and results presentation time.
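A minimal sketch of the filter, assuming a hypothetical PLAUSIBLE_LANGUAGES setting:

    # Hypothetical per-instance setting; not an existing aleph config key.
    PLAUSIBLE_LANGUAGES = set(['en', 'de', 'fr', 'ru', 'es'])

    def filter_languages(detected):
        """Drop implausible codes at ingest and at results presentation time."""
        return [lang for lang in detected if lang in PLAUSIBLE_LANGUAGES]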
This makes authz much simpler, and makes more sense in the context of allowing users to upload their own documents.
Hi,
Thanks once again for all your help so far!
I have finally managed to get the Aleph UI up and running.
Here is the URL: http://54.191.176.203:13376/
Now I want to perform the following tasks.
Our architecture consists of the following items.
a. AWS EC2 c4x4 large shared Instance
b. OS
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
API errors: when we click on Login, it throws an error. How can we fix it?
Regards.
I have installed aleph via Docker and docker-compose. Upon running "aleph upgrade", it throws the following error.
[root@localhost aleph]# docker-compose run worker /bin/bash
Starting aleph_elasticsearch_1
Starting aleph_postgres_1
root@58ced8c:/aleph# aleph upgrade
INFO:aleph.model:Beginning database migration...
INFO:alembic.runtime.migration:Context impl PostgresqlImpl.
INFO:alembic.runtime.migration:Will assume transactional DDL.
INFO:aleph.model:Creating system roles...
WARNING:elasticsearch:PUT /aleph/mapping/document [status:404 request:0.835s]
Traceback (most recent call last):
File "/usr/local/bin/aleph", line 9, in
load_entry_point('aleph', 'console_scripts', 'aleph')()
File "/aleph/aleph/manage.py", line 167, in main
manager.run()
File "/usr/local/lib/python2.7/site-packages/flask_script/_init.py", line 412, in run
result = self.handle(sys.argv[0], sys.argv[1:])
File "/usr/local/lib/python2.7/site-packages/flask_script/init.py", line 383, in handle
res = handle(args, *config)
File "/usr/local/lib/python2.7/site-packages/flask_script/commands.py", line 216, in call
return self.run(args, *kwargs)
File "/aleph/aleph/manage.py", line 148, in upgrade
upgrade_search()
File "/aleph/aleph/index/admin.py", line 26, in upgrade_search
doc_type=TYPE_DOCUMENT)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/client/indices.py", line 291, in put_mapping
'_mapping', doc_type), params=params, body=body)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/local/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, u'index_not_found_exception')
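For anyone hitting this: the 404 on PUT /aleph/mapping/document suggests the aleph index does not yet exist when the mapping is applied. A hedged workaround sketch using the elasticsearch client directly (the index name is taken from the request path in the warning above):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://elasticsearch:9200/'])
    if not es.indices.exists(index='aleph'):
        es.indices.create(index='aleph')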
First off, great work on this project!
I've gotten a bit stuck with getting a version of this running completely locally using docker. I've noticed I can specify the archive type as file and run a RabbitMQ queue instead of using SQS, which is really nice. I've tried to do that using the docker set-up but seem to be getting a connection error when I come to use aleph crawldir.
Error:
INFO:aleph.ingest.ingestor:Traceback (most recent call last):
File "/aleph/aleph/ingest/__init__.py", line 58, in ingest_file
ingest.delay(collection_id, meta.to_attr_dict())
File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 453, in delay
return self.apply_async(args, kwargs)
File "/usr/local/lib/python2.7/site-packages/celery/app/task.py", line 565, in apply_async
**dict(self._get_exec_options(), **options)
File "/usr/local/lib/python2.7/site-packages/celery/app/base.py", line 354, in send_task
reply_to=reply_to or self.oid, **options
File "/usr/local/lib/python2.7/site-packages/celery/app/amqp.py", line 310, in publish_task
**kwargs
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
routing_key, mandatory, immediate, exchange, declare)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 457, in _ensured
interval_max)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 369, in ensure_connection
interval_start, interval_step, interval_max, callback)
File "/usr/local/lib/python2.7/site-packages/kombu/utils/__init__.py", line 246, in retry_over_time
return fun(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 237, in connect
return self.connection
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 742, in connection
self._connection = self._establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 697, in _establish_connection
conn = self.transport.establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
conn = self.Connection(**opts)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
self.transport = self.Transport(host, connect_timeout, ssl)
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
return create_transport(host, connect_timeout, ssl)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
return TCPTransport(host, connect_timeout)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
raise socket.error(last_err)
error: [Errno 111] Connection refused
I've added RabbitMQ as a separate container and linked it to the other containers. Here is my docker-compose file:
postgres:
  image: postgres:9.4
  volumes:
    - "/opt/aleph/data/postgres:/var/lib/postgresql/data"
    - "/opt/aleph/logs/postgres:/var/log"
  environment:
    POSTGRES_USER: aleph
    POSTGRES_PASSWORD: aleph
    POSTGRES_DATABASE: aleph
  ports:
    - "127.0.0.1:5439:5432"

elasticsearch:
  image: elasticsearch:2.2.0
  volumes:
    - "/opt/aleph/data/elasticsearch:/usr/share/elasticsearch/data"
    - "/opt/aleph/logs/elasticsearch:/var/log"
  ports:
    - "127.0.0.1:9201:9209"
  # environment:
  #   ES_HEAP_SIZE: 4g

worker:
  build: .
  command: celery -A aleph.queue worker -c 10 -l INFO --logfile=/var/log/celery.log
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/:/host"
    - "/opt/aleph/data:/opt/aleph/data"
    - "/opt/aleph/logs/worker:/var/log"
  environment:
    C_FORCE_ROOT: 'true'
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
    POLYGLOT_DATA_PATH: /opt/aleph/data
    TESSDATA_PREFIX: /usr/share/tesseract-ocr
  env_file:
    - aleph.env

beat:
  build: .
  command: celery -A aleph.queue beat -s /var/run/celerybeat-schedule
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/opt/aleph/logs/beat:/var/log"
    - "/opt/aleph/run/beat:/var/run"
  environment:
    C_FORCE_ROOT: 'true'
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
  env_file:
    - aleph.env

web:
  build: .
  command: gunicorn -w 5 -b 0.0.0.0:8000 --log-level info --log-file /var/log/gunicorn.log aleph.manage:app
  ports:
    - "13376:8000"
  links:
    - postgres
    - elasticsearch
    - rabbitmq
  volumes:
    - "/opt/aleph/logs/web:/var/log"
  environment:
    ALEPH_ELASTICSEARCH_URI: http://elasticsearch:9200/
    ALEPH_DATABASE_URI: postgresql://aleph:aleph@postgres/aleph
  env_file:
    - aleph.env

rabbitmq:
  image: rabbitmq
  ports:
    - 5672
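One observation on the compose file above: none of the aleph containers are told where the broker lives, so Celery will try its default of localhost and get connection refused. A hedged sketch of the missing setting, assuming the broker URI is read from an ALEPH_BROKER_URI environment variable (check aleph's settings module for the exact name):

worker:
  environment:
    # Assumed variable name; point Celery at the linked rabbitmq container.
    ALEPH_BROKER_URI: amqp://guest:guest@rabbitmq:5672//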
At the moment, user-triggered background processing (such as entity updates) is handled by the same queue as bulk document imports, which means it can be delayed by hours or days. These tasks should go into different queues and be processed either by different worker daemons, or at a different priority.
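A sketch of the split using Celery's standard routing configuration; the entity-update task name is a placeholder, while aleph.ingest.ingest is the ingest task visible in the worker logs quoted further down:

    from kombu import Queue

    CELERY_QUEUES = (
        Queue('user', routing_key='user'),  # interactive, user-triggered work
        Queue('bulk', routing_key='bulk'),  # long-running document imports
    )
    CELERY_ROUTES = {
        'aleph.entities.update_entity': {'queue': 'user'},  # placeholder name
        'aleph.ingest.ingest': {'queue': 'bulk'},
    }

    # Dedicated workers per queue, e.g.:
    #   celery -A aleph.queue worker -Q user -c 4
    #   celery -A aleph.queue worker -Q bulk -c 10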
The logout follow-up page points to /crawlers, which is only accessible by an admin who is logged in.
There should be a way to run all documents from a particular source through an external entity extractor, such as Stanford NER or Reuters PermID.
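A sketch of the hook, posting document text to an external service; the endpoint and response format here are purely hypothetical:

    import requests

    def extract_entities(text, url='http://localhost:9191/ner'):
        """Send document text to an external extractor; hypothetical endpoint."""
        resp = requests.post(url, data=text.encode('utf-8'))
        resp.raise_for_status()
        return resp.json()  # assumed to be a list of entity annotations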
Currently, you can set up an alert on a particular query, but the query "Show documents matching entities on this watchlist" is not accessible via the UI. Need to expose it, and assign good names to these alerts.
For use in the new occrp.org home page.
Create a complete log of queries performed and documents viewed by each user in the database as a foundation of future personalisation / ranking functions.
Holy shit, why do we need to have multi-page images?
This results from the PDF load failing; I suspect it might be related to the CORS configuration of the S3 bucket.
Consolidate the Docker setup into one (or more) base images, to streamline deployment.
Starting a discussion here, ping @danohuiginn.
Watchlists are right now meant to be collections of entities that get cross-referenced with the documents in aleph. I'm planning to extend this with additional functionality, such as making the Watchlist/Entity relationships many-to-many and allowing for the de-duplication of entities (currently, our aleph has 5 Bashar Al-Assads).
The next step could be to make Watchlists capable of holding documents as well as entities. This might make sense to allow users to group together documents they're interested in for a particular purpose. However, this is where the name stops making sense. I'd therefore like to propose renaming Watchlists now (before there are too many API dependencies).
What do you think?
Document search fails with error below:
Excerpts from ElasticSearch logs:
[2016-06-12 18:36:09,389][DEBUG][action.search ] [Robert Bruce Banner] All shards failed for phase: [query]
RemoteTransportException[[Robert Bruce Banner][172.31.5.13:9300][indices:data/read/search[phase/query]]]; nested: SearchParseException[failed to parse search source [{"sort": [{"doc_count": "desc"}, "_score"], "query": {"filtered": {"filter": {"terms": {"collection_id": []}}, "query": {"bool": {"must": [{"match_phrase_prefix": {"terms": "ghana"}}, {"range": {"doc_count": {"gte": 0}}}]}}}}, "_source": ["name", "$schema", "terms", "doc_count"], "size": 5}]]; nested: SearchParseException[No mapping found for [doc_count] in order to sort on];
Caused by: SearchParseException[failed to parse search source [{"sort": [{"doc_count": "desc"}, "_score"], "query": {"filtered": {"filter": {"terms": {"collection_id": []}}, "query": {"bool": {"must": [{"match_phrase_prefix": {"terms": "ghana"}}, {"range": {"doc_count": {"gte": 0}}}]}}}}, "_source": ["name", "$schema", "terms", "doc_count"], "size": 5}]]; nested: SearchParseException[No mapping found for [doc_count] in order to sort on];
Running version: c5fb231
ES version: 2.3.1
@pudo any ideas?
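One hedged pointer: in Elasticsearch 2.x, sorting on a field that has no mapping in one of the queried indices fails exactly like this unless the sort clause declares an unmapped_type. A sketch of the adjusted clause:

    # Tolerate indices where `doc_count` is not mapped (ES 2.x feature).
    sort = [
        {'doc_count': {'order': 'desc', 'unmapped_type': 'long'}},
        '_score',
    ]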
This is a tracking ticket to explain the why & how of integrating a graph engine into aleph.
@danohuiginn to submit PR :)
Yet another round of events stuff: make a database table with all important user interactions, i.e. login, logout, search and document view/fetch. Both for statistics and, in the long run, to show users their own search history.
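A sketch of such a table in the Flask-SQLAlchemy style aleph already uses; the column names are assumptions:

    from datetime import datetime
    from aleph.core import db  # aleph's shared SQLAlchemy handle

    class Event(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        role_id = db.Column(db.Integer, db.ForeignKey('role.id'), index=True)
        action = db.Column(db.Unicode)   # login, logout, search, view, fetch
        data = db.Column(db.Unicode)     # e.g. the query string or document id
        created_at = db.Column(db.DateTime, default=datetime.utcnow)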
Collections should have their own UI in which users can browse, edit, add and remove documents and entities associated with that collection.
Some file formats, such as ESRI Shapefiles or Cronos databases, are based on the contents of a directory, rather than a file. Since Aleph handles archiving on a per-file basis, these data types cannot be ingested properly. The proposed solution to this issue is to introduce a new mechanism, bundles. A bundle is a generated ZIP file, e.g. folder-name.shapefile or database.cro, that is created when the folder is available and then parsed as a file by a format-specific ingestor. The bundling is done upon crawling from a directory or package.
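A sketch of the bundling step; the suffix convention follows the examples above:

    import os
    import zipfile

    def make_bundle(directory, suffix='.shapefile'):
        """Zip a directory into `<directory><suffix>` so it ingests as one file."""
        bundle_path = directory.rstrip(os.sep) + suffix
        with zipfile.ZipFile(bundle_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            for root, _, files in os.walk(directory):
                for name in files:
                    path = os.path.join(root, name)
                    zf.write(path, os.path.relpath(path, directory))
        return bundle_path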
[2016-07-26 12:44:49,927: ERROR/MainProcess] Task aleph.ingest.ingest[86d1301b-1d82-40f4-9208-dd16e568fc07] raised unexpected: AttributeError("'s3.ObjectSummary' object has no attribute 'download_file'",)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/newrelic-2.46.0.37/newrelic/hooks/application_celery.py", line 66, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/aleph/aleph/ingest/__init__.py", line 97, in ingest
Ingestor.dispatch(collection_id, meta)
File "/aleph/aleph/ingest/ingestor.py", line 102, in dispatch
local_path = get_archive().load_file(meta)
File "/aleph/aleph/archive/s3.py", line 82, in load_file
obj.download_file(path)
AttributeError: 's3.ObjectSummary' object has no attribute 'download_file'
after uploading a file via web interface
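For what it's worth, in boto3 an s3.ObjectSummary is a lightweight listing object and genuinely has no download_file; resolving it to a full Object first (or going through the low-level client) avoids the AttributeError. A sketch of the likely fix in aleph/archive/s3.py:

    # ObjectSummary -> Object carries the same bucket/key but has download_file.
    obj.Object().download_file(path)

    # Equivalent route via the low-level client:
    # obj.meta.client.download_file(obj.bucket_name, obj.key, path)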
Collection should support a natural boost, on a scale of 1-6, relative to other collections. This can be used to rank exclusive in-house materials over a scrape of a government database, and that in turn over press clippings.
/cc @danohuiginn
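A sketch of how the boost could be applied at query time with a function_score query; the collection_boost field name is hypothetical:

    query = {
        'query': {
            'function_score': {
                'query': {'match': {'text': 'ghana'}},
                'field_value_factor': {
                    'field': 'collection_boost',  # hypothetical 1-6 field
                    'missing': 1,
                },
                'boost_mode': 'multiply',
            }
        }
    }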
Hi,
I have successfully installed Aleph; however, I'm unable to figure out how to run its web UI for search.
Server Installation Details:
Finally:
root@7748cd87f225:/aleph# aleph runserver
INFO:werkzeug: * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Kindly guide me: what are the further dependencies, and how can I get to its web UI?
Regards.
The following come to mind:
This can be done via LibreOffice, but the file types need to be added and PDF conversion needs to be verified.
Currently, if a file cannot be ingested, no record is left in the database (i.e. no Document is created). Instead, a stub record should be created and filled with any exceptions that occur during processing (e.g. unsupported file format, parsing errors, etc.).
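A sketch of the flow; Document.by_meta and the status fields are hypothetical, while Ingestor.dispatch appears in the tracebacks above:

    from aleph.core import db
    from aleph.model import Document             # import paths assumed
    from aleph.ingest.ingestor import Ingestor   # path seen in the traceback above

    def ingest_safely(collection_id, meta):
        document = Document.by_meta(collection_id, meta)  # hypothetical helper
        try:
            Ingestor.dispatch(collection_id, meta)
            document.status = 'success'
        except Exception as exc:
            document.status = 'failed'             # hypothetical columns
            document.error_message = unicode(exc)  # Python 2, as in the logs
        db.session.commit()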
aleph.env should contain a list of directories to look for template files, allowing any file to be overridden
custom style sheets can be specified from config
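A sketch of the override mechanism using Jinja2's ChoiceLoader; the ALEPH_TEMPLATE_PATHS setting name is made up:

    from flask import Flask
    from jinja2 import ChoiceLoader, FileSystemLoader

    app = Flask(__name__)
    # Hypothetical setting: directories searched before aleph's own templates.
    extra_paths = app.config.get('ALEPH_TEMPLATE_PATHS', [])
    app.jinja_loader = ChoiceLoader(
        [FileSystemLoader(path) for path in extra_paths] + [app.jinja_loader]
    )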
Outlook exports, rfc822 files (.mbox, .msg) and Maildirs :) What should this look like in the UI, a PDF or something more structured?
Sift through large sets of structured and unstructured data, and find the people and comapnies you look for.
Should read: comapnies->companies
For the API: how to search, etc.
In particular, I've seen this for documents in Afrikaans, English and Xhosa, e.g. https://www.westerncape.gov.za/other/2011/9/extragaz_6908.pdf
When a user signs in, they are assigned a set of roles via OAuth. This includes their user role, but also other roles, such as user groups. These links aren't currently stored in the database, which means offline subsystems (like alerting) can't know which roles a user is permitted to access.