iai-group / nordlys

Nordlys: Toolkit for entity-oriented and semantic search

Home Page: http://nordlys.cc/

License: Other

Python 87.78% CSS 2.50% JavaScript 2.57% HTML 5.99% Shell 1.16%

nordlys's Issues

[core] Update RetrievalResults class

Update the class with meta parameters (total hits)
Update the retrieval package to return RetrievalResults objects

[doc] add description about config files

how to change the settings of mongo, elastic, api, etc.

Document mongoDB install/setup settings

Add to installation instructions that a log folder needs to be created

mongo_dbpedia-2015-10.tar.bz2 corrupted?

Hi,
I am following the Nordlys installation instructions.
The command below gives some strange output:
./scripts/load_mongo_dumps.sh mongo_dbpedia-2015-10.tar.bz2
Among other things, it says that the compressed file ends unexpectedly and may be corrupted.
Can you help fix this, or is this supposed to happen?

I've attached the full output
script_output.txt
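
One way to tell a truncated download apart from a problem in the script is to read the archive through before restoring it. The following is a minimal sketch using only the Python standard library; the function name archive_is_intact is hypothetical and not part of Nordlys.

```python
import tarfile


def archive_is_intact(path):
    """Return True if the .tar.bz2 file at `path` can be read through its last member."""
    try:
        with tarfile.open(path, "r:bz2") as tar:
            # Walking every member makes tarfile decompress the stream as it
            # goes, so a truncated file fails here rather than during restore.
            for _ in tar:
                pass
    except (tarfile.TarError, EOFError, OSError):
        # Covers a missing file, a non-bzip2 file, and a stream that ends early.
        return False
    return True
```

If this returns False for the downloaded file, downloading it again is likely a better first step than debugging the load script.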

403 forbidden when loading data

I get a 403 error when I use ./scripts/load_mongo_dumps.sh to download the data.
Full log:

~/nordlys ❯❯❯ ./scripts/load_mongo_dumps.sh mongo_dbpedia-2015-10.tar.bz2
MongoDB shell version v3.6.3
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.6.3
{
	"db" : "test",
	"collections" : 0,
	"views" : 0,
	"objects" : 0,
	"avgObjSize" : 0,
	"dataSize" : 0,
	"storageSize" : 0,
	"numExtents" : 0,
	"indexes" : 0,
	"indexSize" : 0,
	"fileSize" : 0,
	"fsUsedSize" : 0,
	"fsTotalSize" : 0,
	"ok" : 1
}
mongodb running!
############ Loading Mongo collection ...
--2018-05-28 22:11:59--  http://iai.group/downloads/nordlys-v02/mongo_dbpedia-2015-10.tar.bz2
Resolving iai.group (iai.group)... 162.241.224.152
Connecting to iai.group (iai.group)|162.241.224.152|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://gustav1.ux.uis.no/downloads/nordlys-v02/mongo_dbpedia-2015-10.tar.bz2 [following]
--2018-05-28 22:11:59--  http://gustav1.ux.uis.no/downloads/nordlys-v02/mongo_dbpedia-2015-10.tar.bz2
Resolving gustav1.ux.uis.no (gustav1.ux.uis.no)... 152.94.1.85
Connecting to gustav1.ux.uis.no (gustav1.ux.uis.no)|152.94.1.85|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2018-05-28 22:11:59 ERROR 403: Forbidden.

tar (child): /home/fedor/nordlys/tmp/mongo_dbpedia-2015-10.tar.bz2: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
2018-05-28T22:11:59.879-0400	the --db and --collection args should only be used when restoring from a BSON file. Other uses are deprecated and will not exist in the future; use --nsInclude instead
2018-05-28T22:11:59.880-0400	building a list of collections to restore from /home/fedor/nordlys/tmp dir
2018-05-28T22:11:59.880-0400	done

Cannot import name 'NTriplesParser' when building the DBpedia index

When I tried to build the index from the DBpedia dump using the following command,

VERSION=2015-10
python -m nordlys.core.data.dbpedia.indexer_dbpedia_types data/config/dbpedia-$VERSION/index_types.config.json

I get the error ImportError: cannot import name 'NTriplesParser' from 'rdflib.plugins.parsers.ntriples'
The traceback is:

  File "/home/xxxx/miniconda3/envs/matchmaker/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xxxx/miniconda3/envs/matchmaker/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/nfs/scratch1/xxxx/nordlys/nordlys/core/data/dbpedia/indexer_dbpedia_types.py", line 34, in <module>
    from rdflib.plugins.parsers.ntriples import NTriplesParser
ImportError: cannot import name 'NTriplesParser' from 'rdflib.plugins.parsers.ntriples' (/home/xxxx/miniconda3/envs/matchmaker/lib/python3.8/site-packages/rdflib/plugins/parsers/ntriples.py)

My package and python versions:
python==3.8.12
rdflib==6.0.2

[GUI] Generate entity cards on-the-fly using AJAX

Refine API logging

Have only a single line on server errors with a different status code/text.
Also include the name of the component invoked as an extra column

Check if auxiliary entity-type file is still needed

New indexer_dbpedia_types should be able to build type index directly from DBpedia dumps

Nordlys seems to contain a memory leak

When regenerating the DBpedia v2 runs using the configs from data/dbpedia-entity-v2/config, the memory usage of Nordlys rises continuously until it exceeds 16 GB. (I do not remember the exact number, as it has been several weeks. I only remember that it caused several gigabytes of swap to be used on my machine with 16 GB of RAM.)

I have been able to work around this issue by splitting queries_stopped.json into several files (splitting the queries file in half also cuts the final memory usage roughly in half). Without a modification like this, the run is at best very slow on "commodity" hardware and at worst (with swap on a slow storage device) impossible, because it can cause Elasticsearch requests to exceed the default timeout.
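
The splitting workaround described above can be sketched as follows. This assumes that queries_stopped.json is a single JSON object mapping query IDs to query strings; the function name split_queries and the ".partN" output naming are hypothetical.

```python
import json
import os


def split_queries(path, n_parts):
    """Split a JSON object of queries into `n_parts` files next to `path`."""
    with open(path, encoding="utf-8") as f:
        queries = json.load(f)  # assumed shape: {query_id: query_text}

    items = sorted(queries.items())
    base, ext = os.path.splitext(path)
    out_paths = []
    for i in range(n_parts):
        chunk = dict(items[i::n_parts])  # every n-th query goes to part i
        out_path = "%s.part%d%s" % (base, i, ext)
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(chunk, f, indent=2, ensure_ascii=False)
        out_paths.append(out_path)
    return out_paths
```

Each part would then need its own retrieval run, with the resulting run files concatenated afterwards.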

[core] Check new elastic version

If efficiency is improved
More API methods to access term statistics
Consistency with the current code

[GUI] change GUI entry point from ER to all tabs

Add support for multiple knowledge bases

Including DBpedia 2016-10 and Wikidata

el.py FileNotFoundError in load_kb_snapshot

I set up Nordlys on my system. When running the entity linking Python script, I get a FileNotFoundError for the file 'data/el/snapshot_2015_10.txt' (see below).

The data/el directory contains only these files:

data/el
data/el/yerd
data/el/yerd/qrels_YERD_er.txt
data/el/yerd/qrels_YERD_elq.txt
data/el/yerd/queries_YERD.json
data/el/yerd/YERD_2015_10.tsv
data/el/erd
data/el/erd/qrels_ERD_elq.txt
data/el/erd/queries_ERD.json
data/el/erd/qrels_ERD_er.txt
data/el/model.txt
data/el/config_ltr.json

Any suggestions?

$ python -m nordlys.services.el 
2019-06-25 17:27:03,639 - nordlys - INFO - Loading KB snapshot of proper named entities ...
Traceback (most recent call last):
  File "/home/ben/nordlys/anaconda-bin/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ben/nordlys/anaconda-bin/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ben/nordlys/nordlys/nordlys/services/el.py", line 198, in <module>
    main(arg_parser())
  File "/home/ben/nordlys/nordlys/nordlys/services/el.py", line 187, in main
    el = EL(conf, Entity(), ElasticCache(DBPEDIA_INDEX), FeatureCache())
  File "/home/ben/nordlys/nordlys/nordlys/services/el.py", line 90, in __init__
    load_kb_snapshot(self.__config["kb_snapshot"])
  File "/home/ben/nordlys/nordlys/nordlys/logic/el/el_utils.py", line 18, in load_kb_snapshot
    with open(kb_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/el/snapshot_2015_10.txt'
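
The traceback shows the path coming from the "kb_snapshot" entry of the EL config, and the directory listing confirms the file is absent. A small pre-flight check can report this before the service starts loading anything. This is only a sketch; the helper missing_files is hypothetical and not part of Nordlys.

```python
import os


def missing_files(config, keys):
    """Return (key, path) pairs for configured files that do not exist."""
    missing = []
    for key in keys:
        path = config.get(key)
        # Relative paths are resolved against the current working directory,
        # the same way open() resolves them in load_kb_snapshot.
        if path is not None and not os.path.isfile(path):
            missing.append((key, path))
    return missing
```

For example, missing_files({"kb_snapshot": "data/el/snapshot_2015_10.txt"}, ["kb_snapshot"]) returns that pair whenever the snapshot has not been downloaded or the command is run from a directory other than the repository root.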

[TTI] Refactor lexical and w2v features

Refactor w2v code: flatten structure

Fix rendering of EL results on GUI

python package elasticsearch-6.0.0 gives unexpected keyword argument 'analyzer'

Hello,

I followed all the installation instructions successfully and decided to check whether the program can replicate the entity retrieval results.

This gave the following error:

root@IR:~/nordlys# python -m nordlys.core.retrieval.retrieval data/dbpedia-entity-v2-replication-test/config/retrieval_bm25.config.json
2017-12-13 12:16:26,963 - nordlys - INFO - scoring [INEX_LD-2009022] Szechwan dish food cuisine
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/nordlys/nordlys/core/retrieval/retrieval.py", line 263, in <module>
    main(arg_parser())
  File "/root/nordlys/nordlys/core/retrieval/retrieval.py", line 256, in main
    r.batch_retrieval()
  File "/root/nordlys/nordlys/core/retrieval/retrieval.py", line 209, in batch_retrieval
    results = self.retrieve(queries[query_id])
  File "/root/nordlys/nordlys/core/retrieval/retrieval.py", line 186, in retrieve
    query = self.__elastic.analyze_query(query)
  File "/root/nordlys/nordlys/core/retrieval/elastic.py", line 116, in analyze_query
    tokens = self.__es.indices.analyze(index=self.__index_name, body=query, analyzer=analyzer)["tokens"]
  File "/root/anaconda3/lib/python3.5/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
TypeError: analyze() got an unexpected keyword argument 'analyzer'

Uninstalling version 6.0.0 of the package and installing 2.3.0 solved the problem.
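
The TypeError arises because version 6.0.0 of the client no longer accepts analyzer as a keyword argument of indices.analyze. An alternative to downgrading would be to pass the analyzer inside the request body. The following is a sketch under the assumption that both the client and the Elasticsearch server accept a body of the form {"analyzer": ..., "text": ...}; it is a standalone function, not the actual code in nordlys/core/retrieval/elastic.py.

```python
def analyze_query(es, index_name, query, analyzer):
    """Tokenize `query` with `analyzer`, sending both in the request body."""
    body = {"analyzer": analyzer, "text": query}
    response = es.indices.analyze(index=index_name, body=body)
    # The analyze response lists one entry per token under "tokens".
    return [token["token"] for token in response["tokens"]]
```

Other parts of Nordlys may rely on 2.x client behavior as well, so pinning the package to 2.3.0, as done above, remains the only fix confirmed in this report.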

[TTI] Implement LTR method

Clarify FACC extraction file and fix markup

[GUI] make interface compatible with mobile devices

[doc] installation from raw data files

[core] Check redis for entity surface form dictionary

https://redis.io/

Can we store only surface form => list of entities, or also an associated value?
I.e., can we have multiple values for a key? Can we have a tuple as a value for a key, e.g.,
string => (entity, score)?
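
Redis sorted sets store, for one key, a set of unique members that each carry a floating-point score, which maps directly onto surface form => (entity, score). The sketch below uses the redis-py style calls zadd(name, mapping) and zrevrange(name, start, end, withscores=True). The InMemorySortedSets class is a hypothetical stand-in so the example runs without a server, and the "sf:" key prefix and entity names are made up for illustration.

```python
class InMemorySortedSets:
    """Hypothetical stand-in that mimics two redis-py sorted-set calls."""

    def __init__(self):
        self._data = {}

    def zadd(self, name, mapping):
        self._data.setdefault(name, {}).update(mapping)

    def zrevrange(self, name, start, end, withscores=False):
        ranked = sorted(self._data.get(name, {}).items(),
                        key=lambda pair: pair[1], reverse=True)
        stop = None if end == -1 else end + 1  # only -1 is handled as "to the end"
        ranked = ranked[start:stop]
        return ranked if withscores else [member for member, _ in ranked]


def add_candidates(client, surface_form, scored_entities):
    """Store (entity, score) pairs under one key per surface form."""
    client.zadd("sf:" + surface_form, dict(scored_entities))


def get_candidates(client, surface_form):
    """Return (entity, score) pairs, highest score first."""
    return client.zrevrange("sf:" + surface_form, 0, -1, withscores=True)


store = InMemorySortedSets()
add_candidates(store, "new york", [("entity_b", 0.3), ("entity_a", 0.7)])
candidates = get_candidates(store, "new york")
```

With a real redis-py client, members come back as bytes unless the client is created with decode_responses=True.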

Mention download part in installation.rst file

[API] Add support for API key

Usage of API key should be optional
Valid API keys in config file
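
The two requirements above could be combined roughly as follows. This is a sketch only; the config entry name "api_keys" and the function name is_request_allowed are hypothetical.

```python
def is_request_allowed(config, api_key):
    """Allow every request unless the config lists valid API keys."""
    valid_keys = config.get("api_keys")  # hypothetical config entry
    if not valid_keys:
        return True  # no keys configured: API key usage stays optional
    return api_key in valid_keys
```

Leaving the entry out of the config (or leaving it empty) keeps the API open, so existing deployments would not need to change.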

Rename dbpedia-2015-10 to dbpedia-2015-10_sample

Rename data/raw-data/dbpedia-2015-10 => data/raw-data/dbpedia-2015-10_sample. (The two would have exactly the same structure, but the sample is part of the repo and is small.)
Update documentation/scripts to work with sample by default.
download_all.sh should create data/raw-data/dbpedia-2015-10 and download the full files under that.

Startup commands need to be on install doc page

Incorrect filename in build_indices.sh for DBPedia types index

There is a mismatch between the file in the repository and the file expected in build_indices.sh when building the DBPedia types index.

The file in the repository is called index_dbpedia-2015-10_types.config.json while the script looks for index_dbpedia_2015_10_types.config.json.

My suggestion would be renaming the file to the latter to keep the format uniform.

Create config files for DBpedia sample

Create config files in data/config_sample based on data/config (same filenames with the paths changed).
Also create config files for ER, EL and TTI using the sample indices.