opensanctions / yente

API for OpenSanctions with support for entity search and bulk matching of data collections. Supports Reconciliation API spec.

Home Page: https://www.opensanctions.org/docs/yente/

License: MIT License

Dockerfile 0.68% Makefile 0.26% Python 98.96% Shell 0.10%

opensanctions sanctions kyc aml ofac sanction-lists

yente's Introduction

OpenSanctions

OpenSanctions aggregates and provides a comprehensive open-source database of sanctions data, politically exposed persons, and related entities. Key functionalities in this codebase include:

Parsing of raw source data.
Cleaning and standardization of data structures.
Deduplication to maintain data integrity.
Exporting the data into a variety of output formats.

We build on top of the Follow the Money (FtM) framework, a JSON-focused anti-corruption data model, as the schema for all our crawlers. FtM data can then optionally be exported to simplified formats like CSV.

Quick Links

Collaborate with us in Development

Introduction

At the heart of the project is a crawler framework called zavod. To run the project, you can use either docker-compose.yml or the Makefile.

We recommend the Makefile, which is the more thoroughly documented option. More details can be found in the zavod documentation.

Environment Setup

Database Initialization:

zavod can use a database in order to cache information from the data sources. Launch a terminal and set up your database with:

docker compose up -d db

Project Building:

Next, build the project with:

make build
# Alternatively, for direct execution:
docker-compose build --pull

Deploying the Crawler

Start a crawl with:

# This crawls a single dataset defined in the datasets directory
docker compose run --rm app zavod crawl datasets/de/abgeordnetenwatch/de_abgeordnetenwatch.yml

Associated Repositories

opensanctions/nomenklatura: building on top of FollowTheMoney, nomenklatura provides a framework for storing data statements with full data lineage, and for integrating entity data from multiple sources. It also handles the data enrichment function that links OpenSanctions to external databases like OpenCorporates.
opensanctions/yente: API for entity matching and searching.

Licensing

The code within this repository is licensed under the MIT License. For content and data, we adhere to CC 4.0 Attribution-NonCommercial.

yente's Issues

Is there a way to use /data/datasets/index.json instead of https://data.opensanctions.org/datasets/latest/index.json?

In manifest.yml am I on the right track to use the local /data/datasets/ generated by opensanctions/opensanctions/ instead of the index available at https://data.opensanctions.org/datasets/latest/index.json?

Something like this: /app/manifests/manifest.yml (??)

schedule: "*/30 * * * *"
catalogs:
  - path: /data/datasets/index.json
    scope: all

When I try this, nothing seems to happen when running yente. After looking at manifest.py, it seems that url: is required here. The default configuration works and populates Elasticsearch, but the custom one above does not: with the manifest.yml above, yente just starts and sits there without fetching or indexing anything.

TLDR; I guess what I'm asking is how does one use the local datasets/ created by a locally running https://github.com/opensanctions/opensanctions instead of fetching all the data from OpenSanctions.org?

I'm running it like this (docker swarm):

  yente:
    image: ghcr.io/opensanctions/yente:latest
    environment:
      YENTE_ENDPOINT_URL: https://<url>
      YENTE_MANIFEST: /app/manifests/manifest.yml
      YENTE_ELASTICSEARCH_URL: http://elasticsearch:9200
      YENTE_STATEMENT_API: "false"
      YENTE_UPDATE_TOKEN: <randomstuff>
    volumes:
      - /mnt/gfs/OpenSanctions/data:/data
      - /mnt/gfs/OpenSanctions/manifest.yml:/app/manifests/manifest.yml
    networks:
      - traefik_public
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
      restart_policy:
        condition: on-failure
      labels:
        - ...
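
Since manifest.py appears to require a url: key, one thing to try is pointing that key at the local index as a file:// URL. The sketch below only shows how to derive such a URL from the mounted path. The helper name is hypothetical, and whether yente accepts a file:// URL in a catalogs entry is an assumption that should be checked against manifest.py.

```python
from urllib.parse import quote


def local_index_url(path: str) -> str:
    """Turn an absolute POSIX path into a file:// URL (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return "file://" + quote(path)


# Candidate value for the `url:` key in the manifest's catalogs entry.
print(local_index_url("/data/datasets/index.json"))
```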

Implement token-authenticated /updatez endpoint

Implement similar entities endpoint

Based on https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html perhaps, in order to show below entity profiles.
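
A rough sketch of what the request body for such an endpoint could look like, using the Elasticsearch more_like_this query linked above. The function name, the field names (names, text) and the index alias are assumptions for illustration, not yente's actual implementation.

```python
from typing import Any, Dict, List


def similar_entities_query(
    index: str, entity_id: str, fields: List[str], limit: int = 5
) -> Dict[str, Any]:
    """Build a search body that asks for documents similar to one indexed entity."""
    return {
        "size": limit,
        "query": {
            "more_like_this": {
                # Fields whose terms are compared (assumed field names).
                "fields": fields,
                # Reference the already-indexed entity instead of passing raw text.
                "like": [{"_index": index, "_id": entity_id}],
                # Entity documents are short, so lower the default term thresholds.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


body = similar_entities_query("yente-entities", "Q4416090", ["names", "text"])
print(body)
```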

Support matching on partial dates

We should probably index lesser precisions for each date.
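
One possible reading of "lesser precisions" is to index every prefix of an ISO date (year, year-month, full date), so that a query for a year or month alone can still hit. A minimal sketch of that expansion, with a hypothetical function name and a made-up example date:

```python
from typing import List


def date_precisions(value: str) -> List[str]:
    """Expand an ISO date (YYYY, YYYY-MM or YYYY-MM-DD) into all coarser forms."""
    # Ignore any time component and split into year / month / day parts.
    parts = value[:10].split("-")
    return ["-".join(parts[:end]) for end in range(1, len(parts) + 1)]


print(date_precisions("1980-05-17"))
```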

Set up more container security scanning

e.g. using https://github.com/marketplace/actions/anchore-container-scan

compressed responses for large responses

If I make a request with Accept-Encoding: gzip, compress, br it would be nice to receive a compressed response, e.g. for a large nested entity like https://api.opensanctions.org/entities/Q4416090

There seems to be fastapi middleware for gzip compression https://fastapi.tiangolo.com/advanced/middleware/#gzipmiddleware
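
The FastAPI middleware linked above would be the ready-made option. To illustrate what such middleware does, here is a standard-library-only sketch of the negotiation step. The function name and the 1000-byte threshold are assumptions, and q-values in Accept-Encoding are ignored for brevity.

```python
import gzip
from typing import Dict, Tuple


def maybe_compress(
    body: bytes, accept_encoding: str, minimum_size: int = 1000
) -> Tuple[bytes, Dict[str, str]]:
    """Gzip the body if the client offers gzip and the body is large enough."""
    offered = {
        token.split(";")[0].strip().lower() for token in accept_encoding.split(",")
    }
    if "gzip" in offered and len(body) >= minimum_size:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), headers
    # Small bodies or clients without gzip support get the payload unchanged.
    return body, {}


payload = b'{"id": "Q4416090", "schema": "Person"}' * 200
compressed, headers = maybe_compress(payload, "gzip, compress, br")
print(len(payload), len(compressed), headers)
```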

Return a 400 if /match receives an entity type that is not "matchable"

This just produces no results at the moment, but it would be better to actually give an error to avoid having people implement this.
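
A sketch of the validation step, independent of any web framework. The set of matchable schemata and the BadRequest exception are placeholders for illustration; a real implementation would presumably take the matchable flag from the FtM model and raise the framework's own HTTP 400 error.

```python
# Hypothetical allow-list; the real one would come from the FtM schema model.
MATCHABLE_SCHEMATA = frozenset({"Person", "Company", "Organization", "Vessel"})


class BadRequest(Exception):
    """Stand-in for the framework's HTTP 400 exception."""

    status_code = 400


def check_matchable(schema: str) -> str:
    """Return the schema name if it can be matched, otherwise raise a 400."""
    if schema not in MATCHABLE_SCHEMATA:
        raise BadRequest(f"Schema {schema!r} is not matchable")
    return schema


try:
    check_matchable("Ownership")
except BadRequest as exc:
    print(exc.status_code, exc)
```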

Allow httpx to use proxies

Dear team,

Related to #84, since the switch to httpx for data retrieval we cannot use our proxy anymore and therefore cannot update Yente. Could you please add the option to specify a proxy, as in: https://www.python-httpx.org/advanced/proxies/ ?

for example
with httpx.Client(proxy=YENTE_PROXY) as client: ...
where YENTE_PROXY is an env var.

Thank you very much!

s3 link?

Hi there team, you can add files to yente with a file link or a URL. Would an S3 link also work?

Adding `include_dataset` as the opposite of existing `exclude_dataset`

Hi there, just running this by you before making a PR for it.

We are only interested in a few of the datasets. Right now, the "correct" way is to query the catalog to see what is there and exclude everything that we are not interested in. However, it would be ideal if we could avoid the call to the catalog API altogether.

As a cherry on top, it would be nice to add a new env var that changes the behavior of the indexer as well. This way, we could import only the documents that belong to the datasets we are interested in. It should not make a huge performance difference, but it wouldn't hurt either.

I am up for making a PR for the include_dataset parameter (unless you have a better name for it). The indexer stuff: please let me know what you think.
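
To make the proposal concrete, here is a sketch of how include_dataset could combine with the existing exclude_dataset semantics. The function name and the rule that excludes win over includes are assumptions, not a description of yente's code.

```python
from typing import Iterable, List, Optional


def select_datasets(
    available: Iterable[str],
    include: Optional[Iterable[str]] = None,
    exclude: Iterable[str] = (),
) -> List[str]:
    """Filter dataset names: keep included ones (if given), then drop excluded ones."""
    included = None if include is None else set(include)
    excluded = set(exclude)
    return [
        name
        for name in available
        if name not in excluded and (included is None or name in included)
    ]


catalog = ["eu_fsf", "us_ofac_sdn", "ua_nabc_sanctions", "ua_nsdc_sanctions"]
print(select_datasets(catalog, include=["eu_fsf", "us_ofac_sdn"]))
```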

Implement incremental scans

cc @everplays - because you mentioned it.

There's two tiers to this:

Implement query parameters for the /search and /match APIs that allow a since= search. They would probably just filter on entity.last_changed > since. This gives you the ability to do incremental queries with the state managed externally.
Implement some sort of state in the API itself. In this scenario, a client would submit a /match-style entity and a callback mechanism (e.g. a webhook), and the API would then store this query, execute it at a set interval and notify the client whenever a new result is available. This could also be implemented as a REST primitive, a la POST /alerts and GET /alerts?since=xxx.
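
For the first tier, the since= parameter could translate into an Elasticsearch range filter that keeps only entities changed after the given timestamp. A minimal sketch, assuming the indexed field is called last_changed and that queries are wrapped in a bool query. Both are assumptions about the index layout.

```python
from typing import Any, Dict, Optional


def with_since(
    query: Dict[str, Any], since: Optional[str], field: str = "last_changed"
) -> Dict[str, Any]:
    """Wrap a query so that only entities changed after `since` are returned."""
    if since is None:
        return query
    return {
        "bool": {
            "must": [query],
            # Filters do not affect scoring, they only narrow the result set.
            "filter": [{"range": {field: {"gt": since}}}],
        }
    }


print(with_since({"match": {"text": "Obama"}}, "2023-06-01T00:00:00"))
```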

I'm curious: which of these did you have in mind?

The default index refresh interval is "every minute past every 2nd hour" instead of the desired "every 2nd hour."

The current default value for CRONTAB = env_str("YENTE_CRONTAB", "* */2 * * *") in yente/settings.py sets the index refresh interval to "every minute past every 2nd hour" instead of the desired "every 2nd hour:" https://github.com/opensanctions/yente/blame/4dc0dfb7ac61c91b132a7f3f3519d3eb3538490c/yente/settings.py#L100.

Fix: Replace the default value with 0 */2 * * * or similar.
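
The difference between the two expressions can be checked by counting how often each would fire in a day. The sketch below is a deliberately tiny matcher that understands only *, */n and plain numbers in the minute and hour fields and ignores the remaining three fields. It is an illustration, not the scheduler yente uses.

```python
def field_matches(expr: str, value: int) -> bool:
    """Match one crontab field against a value (supports *, */n and integers)."""
    if expr == "*":
        return True
    if expr.startswith("*/"):
        return value % int(expr[2:]) == 0
    return value == int(expr)


def runs_per_day(crontab: str) -> int:
    """Count the minutes of a day on which a five-field crontab fires."""
    minute, hour, _dom, _month, _dow = crontab.split()
    return sum(
        1
        for h in range(24)
        for m in range(60)
        if field_matches(hour, h) and field_matches(minute, m)
    )


for expr in ("* */2 * * *", "0 */2 * * *", "*/30 * * * *"):
    print(expr, "->", runs_per_day(expr), "runs per day")
```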

On a related note, the yente settings documentation also still states that the default refresh interval is 30 minutes.

Fuzzy matching not working

Hi,

The fuzzy matching parameter has no effect:

I tried to return results for https://api.opensanctions.org/search/default?q=Barrrack%20Obama&fuzzy=true and it should return a result since there's only 1 letter changing

I checked in the code https://github.com/opensanctions/yente/blob/main/yente/search/queries.py#L85 and in Elastic Search documentation, it should work but as a matter of fact, it does not.

Searching on Google returns results linked to a wrong mapping but I could not find any problem in the ES mapping either. I ended up updating the text_query function to this:

def text_query(
    dataset: Dataset,
    schema: Schema,
    query: str,
    filters: FilterDict = {},
    fuzzy: bool = False,
):

    if not len(query.strip()):
        should = {"match_all": {}}
    elif fuzzy and query.find('~') == -1:
        should = {
            "match": {
                "text": {
                    "query": query,
                    "fuzziness": "AUTO",
                    "lenient": True,
                    "operator":"AND"
                }
            }
        }
    else:
        should = {
            "query_string": {
                "query": query,
                "fields": ["names^3", "text"],
                "default_operator": "and",
            }
        }
    return filter_query([should], dataset=dataset, schema=schema, filters=filters)

The condition fuzzy and query.find('~') == -1 is there to avoid mixing the fuzzy parameter with the ~ operator: if the query already contains ~, the fuzzy parameter is simply ignored.

@pudo any comment on this ?

I can open a pull request if needed

Added Support for Yente with Elasticsearch on CapRover: A Template for Easy Deployment

We are unsure where this should be directed, but we aim to integrate Yente and the indexing service with Elasticsearch within CapRover for experimentation. We have developed a template.

How to install this on CapRover?

Navigate to Apps
Click on "One Click Apps/Databases"
Navigate to the very bottom of the list, and click on the last item, called >> TEMPLATE <<
Copy the following section to the box:

captainVersion: 4
caproverOneClickApp:
    instructions:
        start: Starting of Yente.
        end: Yente is deployed.
    variables:
        - id: $$cap_elasticsearch_version
          label: 'Elasticsearch Version Tag'
          description: 'Check out the releases overview: https://hub.docker.com/_/elasticsearch'
          defaultValue: 8.4.1
          validRegex: /^([^\s^\/])+$/
        - id: $$cap_elasticsearch_cluster_name
          label: Cluster Name
          description: Only nodes within the same cluster name can be combined
          defaultValue: elasticsearch-cluster
          validRegex: /^([^\s^\/])+$/
        - id: $$cap_elasticsearch_discovery_type
          label: Discovery Type
          description: Discovery type, for a single node cluster use `single-node`, otherwise `multi-node`
          defaultValue: single-node
          validRegex: /^([^\s^\/])+$/
        - id: $$cap_elasticsearch_security_enabled
          label: Security Enabled
          defaultValue: 'false'
          description: 'When you enable this option, Elasticsearch will create a random password (see startup logs) for the `elastic` user and create SSL certificates required for authentication. It is recommended to leave this off for a quick setup. Warning: make sure to enable HTTP Basic Auth in CapRover!'
          validRegex: /^([^\s^\/])+$/
        - id: $$cap_container_index_port
          label: Container TCP Port
          defaultValue: '9200'
          description: Internal port for Elasticsearch the container listens to.
          validRegex: /^([0-9])+$/
        - id: $$cap_container_app_port
          label: Container TCP Port
          defaultValue: '8000'
          description: Internal port for Yente the container listens to.
          validRegex: /^([0-9])+$/
    displayName: Yente
    isOfficial: true
    description: Yente is an open source data match-making API. The service provides several HTTP endpoints to search, retrieve or match FollowTheMoney entities, including people, companies or vessels that are subject to international sanctions.
    documentation: Taken from https://github.com/opensanctions/yente

services:
  $$cap_appname-index:
    image: docker.elastic.co/elasticsearch/elasticsearch:$$cap_elasticsearch_version
    caproverExtra:
      notExposeAsWebApp: 'true'
      containerHttpPort: $$cap_container_index_port
    volumes:
      - $$cap_appname-index-elasticsearch-data:/usr/share/elasticsearch/data
    restart: always
    environment:
      CLI_JAVA_OPTS: -Xms512m -Xmx512m
      cluster.name: $$cap_elasticsearch_cluster_name
      discovery.type: $$cap_elasticsearch_discovery_type
      http.port: $$cap_container_index_port
      node.name: $$cap_appname-index
      xpack.security.enabled: $$cap_elasticsearch_security_enabled
  $$cap_appname-app:
    image: ghcr.io/opensanctions/yente:latest
    depends_on:
      - $$cap_appname-index
    environment:
      YENTE_ELASTICSEARCH_URL: http://srv-captain--$$cap_appname-index:9200
      YENTE_STATEMENT_API: "false"
      YENTE_UPDATE_TOKEN: ""
      YENTE_ELASTICSEARCH_INDEX: "$$cap_appname-index"
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8000/healthz" ]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 3s
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
        window: 120s
    caproverExtra:
      containerHttpPort: $$cap_container_app_port
volumes:
  index-os-data: null

Additionally, we plan to submit a GitHub MR on repository https://github.com/caprover/one-click-apps/ to enable the CapRover community to explore this setup as well.

Unable to connect to elastic search

I have followed the Readme and tried to run docker-compose up, but it fails with this error:

Connection error caused by: ClientConnectorError(Cannot connect to host index:9200 ssl:default [Connection refused])

[warning ] Node <AiohttpHttpNode(http://index:9200)> has failed for 11 times in a row, putting on 30 second timeout [elastic_transport.node_pool]
app_1 | 2023-01-27T15:40:30.004014Z [error] Cannot connect to ElasticSearch: ConnectionError('Cannot connect to host index:9200 ssl:default [Connection refused]', errors=(ConnectionError('Cannot connect to host index:9200 ssl:default [Connection refused]', errors=(ClientConnectorError(ConnectionKey(host='index', port=9200, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=3119452235181044255)

Re-instate deep nesting tests

After migrating OpenSanctions to use externals, the wd_curated dataset ended up being empty. That's what all tests for yente were written against. So I've now switched it over to eu_fsf, but eu_fsf doesn't have family ties. Need to find that somewhere to test deep nesting of entities again.

Improve matching API results

Some users of yente have reported issues with the matching API. These described problems are:

Entities that are true matches score much too low, especially when the names are very short.
Entities that have the same DOB are ranked very highly, irrespective of other features (like names).
The way in which match is set to false when there are two matches is un-intuitive.
Regulators want phonetic matching.

Here's some of the steps we're going to explore:

Remove some sparse entities from the matcher training data
Index soundex forms for all names
Make name length less important in name match quality
Implement a specific "OFAC style matching mode"
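
As an illustration of the "index soundex forms" step, here is a sketch of the classic American Soundex code in plain Python. It handles ASCII letters only, and whether yente would use this variant, another phonetic algorithm, or an Elasticsearch phonetic plugin is left open here.

```python
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate two letters with the same code.
            continue
        code = _CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        # Vowels reset `previous`, so a repeated code after a vowel is kept.
        previous = code
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike map to the same code, so indexing the code alongside the name would let a phonetic query match both spellings.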

Make a helm chart

For kubernetes

https://helm.sh/docs/topics/charts/

Explore migrating to opensearch-py

They're having a little shit-flinging battle on the backs of every open source project using their products:

spring-projects/spring-data-elasticsearch#1880

Looks like OpenSearch-py will continue to work with ElasticSearch, but probably trail ES by a few versions. Need to explore how up-to-date its async support is.

Implement a good test harness

We should probably test each endpoint with a few different inputs. Need to come up with a good fixture dataset that has some linked entities and interesting alphabets in names. Swiss sanctions list?

Support for `dataset` allow-listing in /search and /match APIs

We want to introduce a new query argument that will add support for /match/default?dataset=eu_fsf&dataset=us_ofac_sdn to let users pick the lists they want to screen against, rather than having to define custom manifests or use exclude_dataset.
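
From the client side, the repeated dataset parameter could be built as in the following sketch. The dataset argument is the one proposed in this issue, so this shows the intended call shape rather than an existing API, and the helper name is made up.

```python
from typing import Iterable
from urllib.parse import urlencode


def match_url(base: str, scope: str, datasets: Iterable[str] = ()) -> str:
    """Build a /match URL with one `dataset` parameter per allow-listed dataset."""
    url = f"{base.rstrip('/')}/match/{scope}"
    query = urlencode([("dataset", name) for name in datasets])
    return f"{url}?{query}" if query else url


print(match_url("https://api.opensanctions.org", "default", ["eu_fsf", "us_ofac_sdn"]))
```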

Allow Aiohttp to use environmental variables for proxies

We have to use a proxy within our organisation to connect to data.opensanctions.org. For most Linux applications it suffices to set the environment variables HTTP_PROXY and HTTPS_PROXY, but Aiohttp requires the flag trust_env=True to read them from the environment; see https://docs.aiohttp.org/en/stable/client_advanced.html (section "Proxy support").

For example:
async with aiohttp.ClientSession(trust_env=True) as session:
    async with session.get("http://python.org") as resp:
        print(resp.status)

Could this, or another way of specifying a proxy, be added to Yente?

index ready time

Hello

Running on an 8-core, 8 GB RAM system for the last 30 minutes:

curl http://localhost:8000/readyz
{"detail":"Index not ready."}

Does anyone know approximately how long it takes to finish?
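
While waiting for an answer, the check can be automated instead of re-running curl by hand. This is a small polling sketch using only the standard library. It assumes that /readyz answers with HTTP 200 once the index is ready and with an error status before that, which should be verified against the running service. The timeout and interval values are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def readyz_ok(url: str = "http://localhost:8000/readyz") -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and non-2xx answers all count as not ready.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 3600.0,
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it succeeds or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_ready(readyz_ok) blocks until the service reports ready or an hour has passed.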

Thank you

Expose latest available/loaded timestamps in /catalog

In order to be able to see when the latest data is from, and whether it has actually been loaded successfully.
Possibly linked to #335

Figure out match vs. match_phrase queries

not sure what the effective benefits are.

Index freshness endpoint

Make an endpoint that returns information about the active index and its freshness vis-à-vis the update schedule.

elasticsearch.helpers.BulkIndexError: 37 document(s) failed to index.

Seen in logs after following the docker-compose instructions at https://www.opensanctions.org/docs/yente/deploy/

yente-tut-app-1  | 2023-06-27T10:16:35.283539Z [error    ] Indexing error: BulkIndexError('37 document(s) failed to index.') [yente.search.indexer] dataset=default entities_url=https://data.opensanctions.org/datasets/20230627/default/entities.ftm.json index=yente-entities-default-00320230627063008
yente-tut-app-1  | Traceback (most recent call last):
yente-tut-app-1  |   File "/app/yente/search/indexer.py", line 115, in index_entities
yente-tut-app-1  |     await async_bulk(es, docs, yield_ok=False, stats_only=True, chunk_size=1000)
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/_async/helpers.py", line 337, in async_bulk
yente-tut-app-1  |     async for ok, item in async_streaming_bulk(
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/_async/helpers.py", line 252, in async_streaming_bulk
yente-tut-app-1  |     async for data, (ok, info) in azip(  # type: ignore
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/_async/helpers.py", line 156, in azip
yente-tut-app-1  |     yield tuple([await x.__anext__() for x in aiters])
yente-tut-app-1  |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/_async/helpers.py", line 156, in <listcomp>
yente-tut-app-1  |     yield tuple([await x.__anext__() for x in aiters])
yente-tut-app-1  |                  ^^^^^^^^^^^^^^^^^^^
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/_async/helpers.py", line 127, in _process_bulk_chunk
yente-tut-app-1  |     for item in gen:
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
yente-tut-app-1  |     raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)
yente-tut-app-1  | elasticsearch.helpers.BulkIndexError: 37 document(s) failed to index.

matching still works fine

two indexes were loading at the same time, then I think the first completed and the second failed?

more logs from just before

yente-tut-app-1  | 2023-06-27T10:16:28.123512Z [info     ] Index: 335000 entities...      [yente.search.indexer] index=yente-entities-default-00320230627063008
yente-tut-app-1  | 2023-06-27T10:16:28.315509Z [info     ] Index: 2157000 entities...     [yente.search.indexer] index=yente-entities-default-00320230626183008
yente-tut-app-1  | 2023-06-27T10:16:29.612188Z [info     ] Index: 2158000 entities...     [yente.search.indexer] index=yente-entities-default-00320230626183008
yente-tut-app-1  | 2023-06-27T10:16:30.304867Z [info     ] Index: 336000 entities...      [yente.search.indexer] index=yente-entities-default-00320230627063008
yente-tut-app-1  | 2023-06-27T10:16:31.307114Z [info     ] Index: 2159000 entities...     [yente.search.indexer] index=yente-entities-default-00320230626183008
yente-tut-app-1  | 2023-06-27T10:16:32.216707Z [info     ] Index: 337000 entities...      [yente.search.indexer] index=yente-entities-default-00320230627063008
yente-tut-app-1  | 2023-06-27T10:16:32.329842Z [info     ] Index is now aliased to: yente-entities [yente.search.indexer] index=yente-entities-default-00320230626183008
yente-tut-app-1  | 2023-06-27T10:16:32.339328Z [info     ] Delete other index             [yente.search.indexer] index=yente-entities-default-00320230627063008
index            | {"@timestamp":"2023-06-27T10:16:32.341Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/S79Og3JaRmqF7Eh0Rzu-AQ] deleting index", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataDeleteIndexService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
yente-tut-app-1  | 2023-06-27T10:16:32.476391Z [info     ] Index update complete.         [yente.search.indexer] changed=True
yente-tut-app-1  | 2023-06-27T10:16:32.476725Z [info     ] Closing elasticsearch client   [yente.search.base] 
yente-tut-app-1  | 2023-06-27T10:16:34.133746Z [info     ] Index: 338000 entities...      [yente.search.indexer] index=yente-entities-default-00320230627063008
index            | {"@timestamp":"2023-06-27T10:16:34.184Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.528Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] create_mapping", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.579Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.626Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.688Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.730Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.778Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.818Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.863Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:34.903Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:35.069Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:35.140Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
index            | {"@timestamp":"2023-06-27T10:16:35.184Z", "log.level": "INFO", "message":"[yente-entities-default-00320230627063008/PhUtJZRXS_W4tl_kC81NxA] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[index][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"hOqHptlAT-ezXFwQK84TfQ","elasticsearch.node.id":"ZrtQG-U2S1miZBgej2ZZYQ","elasticsearch.node.name":"index","elasticsearch.cluster.name":"opensanctions-index"}
yente-tut-app-1  | 2023-06-27T10:16:35.283539Z [error    ] Indexing error: BulkIndexError('37 document(s) failed to index.') [yente.search.indexer] dataset=default entities_url=https://data.opensanctions.org/datasets/20230627/default/entities.ftm.json index=yente-entities-default-00320230627063008
yente-tut-app-1  | Traceback (most recent call last):
yente-tut-app-1  |   File "/app/yente/search/indexer.py", line 115, in index_entities
yente-tut-app-1  |     await async_bulk(es, docs, yield_ok=False, stats_only=True, chunk_size=1000)
yente-tut-app-1  |   File "/venv/lib/python3.11/site-packages/elasticsearch/_async/helpers.py", line 337, in async_bulk
yente-tut-app-1  |     async for ok, item in async_streaming_bulk(

Implement nomenklatura loader

Index statements into search index and re-base API

Work on matching API precision/recall

We want this to be reasonably precise. Also, maybe opensanctions/opensanctions#139 wants to move to nomenklatura and then be used here, too.

Readme.md wget url and app in docker-compose

Expose detailed dataset metadata in /catalog API

Right now this returns a very minimalistic representation of the datasets. We should perhaps just return the full dataset metadata as we have seen it in the source catalog, with yente's interpretation layered on top.

Array query params handling

Hello,

I would like to discuss the handling of GET query parameters in the OpenSanctions API.

After a thorough debugging, I have identified the reason why my query is not functioning as expected. It is a simple matching request with excluded datasets, as shown below:

Original Query:

https://api.opensanctions.org/match/sanctions?exclude_dataset[]=ua_nabc_sanctions&exclude_dataset[]=ua_nsdc_sanctions&exclude_dataset[]=ua_sfms_blacklist

However, the API expects the following syntax for array parameters:

Expected Query Syntax:

?exclude_dataset=ua_nabc_sanctions&exclude_dataset=ua_nsdc_sanctions&exclude_dataset=ua_sfms_blacklist

In my opinion, this approach seems a bit unconventional. Typically, arrays in GET queries are passed in one of the following common ways:

Using square brackets for each parameter value:

param[]=value1&param[]=value2...

Comma-separated values for a single parameter:

param=value1,value2...

Is this behavior the intended and expected behavior for the OpenSanctions API?
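
For reference, this is how Python's standard query-string parser treats the conventions discussed above: repeated keys are collected into a list, while the bracketed form ends up under a different key name. The normalising helper is a hypothetical sketch of how a server could accept all three styles. It is not part of the API.

```python
from typing import Dict, List
from urllib.parse import parse_qs


def normalize_array_params(query_string: str) -> Dict[str, List[str]]:
    """Accept key=a&key=b, key[]=a&key[]=b and key=a,b as the same list."""
    result: Dict[str, List[str]] = {}
    for key, values in parse_qs(query_string).items():
        name = key[:-2] if key.endswith("[]") else key
        for value in values:
            result.setdefault(name, []).extend(v for v in value.split(",") if v)
    return result


repeated = "exclude_dataset=ua_nabc_sanctions&exclude_dataset=ua_nsdc_sanctions"
bracketed = "exclude_dataset[]=ua_nabc_sanctions&exclude_dataset[]=ua_nsdc_sanctions"

print(parse_qs(repeated))   # one key holding both values
print(parse_qs(bracketed))  # the key is literally "exclude_dataset[]"
print(normalize_array_params(bracketed))
```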

Thank you for your clarification.

Allow indexing/including unrelated datasets

We want to be able to load offshoreleaks and other stuff like this into the default collection at run-time. Probably means detaching our definition of the datasets from the index.json spec a bit at some point.

In order to do this, I want to introduce a manifest.yml to describe all the datasets in the system. This would a) reference the OpenSanctions index and how often to fetch that, b) be able to add more sources that are not part of OpenSanctions.

Here's a proposed format:

opensanctions:
  index: https://data.opensanctions.org/datasets/latest/index.json
  scope: default
  schedule: "*/30 * * * *"
sources:
  icij_offshoreleaks:
    title: ICIJ OffshoreLeaks
    entities_url: https://data.opensanctions.org/contrib/icij-offshoreleaks/full-oldb.json
    schedule: null
    collections:
      - all
      - offshore
  local_dataset1:
    title: My local fraudsters
    schedule: "30 1 * * *"
    # Apply an FtM namespace:
    namespace: true
    collections:
      - all
      - fraud
    queries:
      csv_url: file:///home/pudo/data/fraudsters.csv
      entities: (see https://docs.alephdata.org/developers/mappings)

This would have the following effects:

a) Load all OpenSanctions data inside the default dataset, checking for updates every 30 minutes
b) Load the ICIJ OffshoreLeaks database once and include those entities in search results for the collections all and offshore.
c) Generate FtM objects from a local CSV file and load those entities into a new dataset once per night.

Mark some stopwords for search queries

Querying the /search API for "Mr. Saddam Hussein" should still work, which probably requires making the "Mr." a stopword.
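
One way to get this behaviour is to drop honorifics from the query before it is sent to the index. An analyzer-level stop filter in the search index would be an alternative. The sketch below shows the query-side variant with a made-up, deliberately short stopword list.

```python
# Hypothetical stopword list; a real one would need review per language.
HONORIFICS = frozenset({"mr", "mrs", "ms", "dr"})


def strip_honorifics(query: str) -> str:
    """Remove honorific tokens such as "Mr." from a search query."""
    kept = [
        token
        for token in query.split()
        if token.rstrip(".").lower() not in HONORIFICS
    ]
    return " ".join(kept)


print(strip_honorifics("Mr. Saddam Hussein"))
```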

Self-signed certificate support

Hello,

I would like to submit a feature request related to our use of certificates for internal services. We use certificates issued by our internal, self-signed CA. Specifically, our Elasticsearch API uses one of these certificates for the nodes within the cluster.

My request is to enable the capability of passing a certificate chain to the Yente client when it connects to Elasticsearch. I believe this enhancement should not be overly complex to implement. I am willing to contribute by preparing a patch myself. However, if someone could guide me on where to start, it would greatly expedite the process by reducing the time spent analyzing the client's workflow.

Additionally, some logs related to a failed connection

Cannot connect to ElasticSearch: 
TlsError("Cannot connect to host node1:9200 ssl:True 
[SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')]", 
errors=(TlsError("Cannot connect to host node1:9200 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')]"
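
As a starting point for such a patch, the CA chain can be loaded into a standard ssl.SSLContext, which would then need to be handed to the Elasticsearch client when it is constructed. The sketch below covers only the context creation. The function name is hypothetical, and where yente would read the CA path from (for example a new environment variable) is an open design question.

```python
import ssl
from typing import Optional


def build_ssl_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a verifying TLS context, optionally trusting an extra CA bundle.

    `ca_file` is the path to a PEM file containing the internal CA chain.
    With None, only the system trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # Verification stays enabled; the internal CA is trusted, nothing is skipped.
    assert context.verify_mode == ssl.CERT_REQUIRED
    return context


context = build_ssl_context()
print(context.verify_mode, context.check_hostname)
```

Trusting the internal CA this way keeps certificate verification on, which is preferable to disabling it for the Elasticsearch connection.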

Consider switching to OpenSearch

Seems like the AWS/Elastic war has also left Elastic doing more and more silly things, we may need to switch to the newly open source version of things sooner rather than later :/

Allow Aiohttp to use proxy