
Comments (5)

josefkarlkraus commented on June 2, 2024

Editing the file /etc/opensemanticsearch/etl probably solves your problem, especially changing the lines regarding regex and uncommenting the lines with:
enhance_extract_email
enhance_extract_phone
enhance_extract_law
enhance_extract_money

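For reference, a minimal sketch of what those lines look like once uncommented in /etc/opensemanticsearch/etl (the file is a plain Python config; these exact plugin names appear commented out in the config posted further down in this thread):

    config['plugins'].append('enhance_extract_email')
    config['plugins'].append('enhance_extract_phone')
    config['plugins'].append('enhance_extract_law')
    config['plugins'].append('enhance_extract_money')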

feathered-arch commented on June 2, 2024

Y'know, I always wondered if anyone got those to work for them. I also disabled these because while it's a great concept, particularly when indexing things like the Panama Papers, it doesn't seem intelligent enough to properly parse things out without resorting to a lot of regex testing.


Aculo0815 commented on June 2, 2024

Great, it works. Thanks a lot. Now I'm ready to install it on a production VMware machine for my dev team.


Aculo0815 commented on June 2, 2024
  • It worked once, but when I tried it on the new server, the tagging of 'Currency' is still there.
  • The 'phone numbers', 'Money' and 'Law clause' tags (and the 'iban' one I added) are gone, so that part worked.
    (screenshot)
  • I've done the following steps:
    • changed /etc/opensemanticsearch/etl
    • maybe restarting the 'opensemanticetl' service is enough, but I rebooted the whole Ubuntu machine (see the restart note after the config listing below)
  • Here are my changes to the etl file; did I miss a config?
# -*- coding: utf-8 -*-

#
# ETL config for connector(s)
#

# print debug messages
#config['verbose'] = True


#
# Languages for language specific index
#
# Each document is analyzed without grammar rules in index fields like content; additionally it can be added/copied to language specific index fields/analyzers
# Document language is autodetected by the default plugin enhance_detect_language_tika_server

# If the index supports enhanced analytics for specific languages, we can add/copy data to language specific fields/analyzers
# Set which languages are configured and shall be used in the index for language specific analysis/stemming/synonyms
# Default / if not set: all supported languages will additionally get language specific analysis
#config['languages'] = ['en','de','fr','hu','it','pt','nl','cz','ro','ru','ar','fa']

# Force additional language specific analysis (grammar & synonyms) in these language(s), even if language autodetection detects another language
#config['languages_force'] = ['en','de']


# only use languages for language specific analysis which are added / uncommented below
#config['languages'] = []

# add English
#config['languages'].append('en')

# add German / Deutsch
#config['languages'].append('de')

# add French / Francais
#config['languages'].append('fr')

# add Hungarian
#config['languages'].append('hu')

# add Spanish
#config['languages'].append('es')

# add Portuguese
#config['languages'].append('pt')

# add Italian
#config['languages'].append('it')

# add Czech
#config['languages'].append('cz')

# add Dutch
#config['languages'].append('nl')

# add Romanian
#config['languages'].append('ro')

# add Russian
#config['languages'].append('ru')



#
# Index/storage
#

#
# Solr URL and port
#

config['export'] = 'export_solr'

# Solr server
config['solr'] = 'http://localhost:8983/solr/'

# Solr core
config['index'] = 'opensemanticsearch'


#
# Elastic Search
#

#config['export'] = 'export_elasticsearch'

# Index
#config['index'] = 'opensemanticsearch'


#
# Tika for text and metadata extraction
#

# Tika server (with tesseract-ocr-cache)
# default: http://localhost:9998

#config['tika_server'] = 'http://localhost:9998'

# Tika server with the fake OCR cache of tesseract-ocr-cache, used if OCR is done in later ETL tasks
# default: http://localhost:9999

#config['tika_server_fake_ocr'] = 'http://localhost:9999'


#
# Annotations
#

# add plugin for annotation/tagging/enrichment of documents
config['plugins'].append('enhance_annotations')

# set alternate URL of annotation server
#config['metadata_server'] = 'http://localhost/search-apps/annotate/json'


#
# RDF Knowledge Graph
#

# add RDF Metadata Plugin for granular import of RDF file statements to entities of knowledge graphs
config['plugins'].append('enhance_rdf')


#
# Config for OCR (automatic text recognition of text in images)
#

# Disable OCR for image files (e.g. for more performance, and/or because you don't need the text within images or have only photos without photographed text)
#config['ocr'] = False

# Option to disable OCR of embedded images in PDF by Tika,
# so (if the alternate plugin is enabled) OCR will be done only by the alternate
# plugin enhance_pdf_ocr (which otherwise works only as a fallback if Tika throws exceptions)
#config['ocr_pdf_tika'] = False

# Use OCR cache
config['ocr_cache'] = '/var/cache/tesseract'

# Option to disable OCR cache
#config['ocr_cache'] = None

# Do OCR for images embedded in PDF documents (i.e. designed images or scanned or photographed documents)
config['plugins'].append('enhance_pdf_ocr')

# OCR language

# If other than English, you have to install the package tesseract-XXX (tesseract language support) for your language
# and set ocr_lang to this value (be careful: the tesseract package for English is "eng", not "en"; German is named "deu", not "de"!)

# set OCR language to English/default
#config['ocr_lang'] = 'eng'

# set OCR language to German/Deutsch
#config['ocr_lang'] = 'deu'

# set multiple OCR languages
config['ocr_lang'] = 'eng+deu'


#
# Regex pattern for extraction
#

# Enable Regex plugin
config['plugins'].append('enhance_regex')

# Regex config for IBAN extraction
#config['regex_lists'].append('/etc/opensemanticsearch/regex/iban.tsv')


#
# Email address and email domain extraction
#
#config['plugins'].append('enhance_extract_email')


#
# Phone number extraction
#
#config['plugins'].append('enhance_extract_phone')


#
# Config for Named Entities Recognition (NER) and Named Entity Linking (NEL)
#

# Enable Entity Linking / Normalization and dictionary based Named Entities Extraction from thesaurus and ontologies
config['plugins'].append('enhance_entity_linking')

# Enable SpaCy NER plugin
config['plugins'].append('enhance_ner_spacy')

# Spacy NER Machine learning classifier (for which language and with which/how many classes)

# Default classifier if no classifier for specific language

# disable NER for languages where no classifier defined in config['spacy_ner_classifiers']
config['spacy_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure, that all documents you index are english)
# config['spacy_ner_classifier_default'] = 'en_core_web_sm'

# Set default classifier to German (only if you are sure, that all documents you index are german)
# config['spacy_ner_classifier_default'] = 'de_core_news_sm'

# Language specific classifiers (mapping to autodetected document language to Spacy classifier / language)
#
# You have to download additional language classifiers for example english (en) or german (de) by
# python3 -m spacy download en
# python3 -m spacy download de
# ...

config['spacy_ner_classifiers'] = {
    'da': 'da_core_news_sm',
    'de': 'de_core_news_sm',
    'en': 'en_core_web_sm',
    'es': 'es_core_news_sm',
    'fr': 'fr_core_news_sm',
    'it': 'it_core_news_sm',
    'lt': 'lt_core_news_sm',
    'nb': 'nb_core_news_sm',
    'nl': 'nl_core_news_sm',
    'pl': 'pl_core_news_sm',
    'pt': 'pt_core_news_sm',
    'ro': 'ro_core_news_sm',
}


# Enable Stanford NER plugin
#config['plugins'].append('enhance_ner_stanford')

# Stanford NER machine learning classifier (for which language and with how many classes; more classes need more computing time)

# Default classifier if no classifier for specific language

# disable NER for languages where no classifier defined in config['stanford_ner_classifiers']
config['stanford_ner_classifier_default'] = None

# Set default classifier to English (only if you are sure, that all documents you index are english)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz'

# Set default classifier to German (only if you are sure, that all documents you index are german)
#config['stanford_ner_classifier_default'] = '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz'

# Language specific classifiers (mapping to autodetected document language)
# Beforehand you have to download additional language classifiers to the configured path
config['stanford_ner_classifiers'] = {
    'en': '/usr/share/java/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    'es': '/usr/share/java/stanford-ner/classifiers/spanish.ancora.distsim.s512.crf.ser.gz',
    'de': '/usr/share/java/stanford-ner/classifiers/german.conll.germeval2014.hgc_175m_600.crf.ser.gz',
}

# If Stanford NER JAR not in standard path
config['stanford_ner_path_to_jar'] = "/usr/share/java/stanford-ner/stanford-ner.jar"

# Stanford NER Java options like RAM settings
config['stanford_ner_java_options'] = '-mx1000m'


#
# Law clauses extraction
#

#config['plugins'].append('enhance_extract_law')


#
# Money extraction
#

#config['plugins'].append('enhance_extract_money')


#
# Neo4j graph database
#

# exports named entities and relations to Neo4j graph database

# Enable plugin to export entities and connections to Neo4j graph database
#config['plugins'].append('export_neo4j')

# Neo4j server
#config['neo4j_host'] = 'localhost'

# Username & password
#config['neo4j_user'] = 'xxx'
#config['neo4j_password'] = 'xxx'
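On the restart question above: a full reboot shouldn't be needed. A minimal sketch, assuming the ETL pipeline runs as the systemd service named in this thread:

    sudo systemctl restart opensemanticetl

Note that documents indexed before the config change keep their old tags until they are re-indexed, which may explain a facet that seems to stick around.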


josefkarlkraus commented on June 2, 2024

I just realized that I had to go a bit further to deactivate those as well: you can simply deactivate the Django facets for e.g. "phone", "currency" and so on (the procedure is maybe a bit hacky, but it works):

  1. Create Django account
    cd /var/lib/opensemanticsearch
    python3 manage.py createsuperuser

  2. Access Django web interface
    http://xxx.xxx.xxx.xxx/search-apps/admin/ >> Thesaurus >> Facets

  3. Deactivate facets in web interface

Deactivate all facets that you don't need by clicking on them and setting:
Enabled: "No"
Snippets enabled: "No"
Graph enabled: "No"
then SAVE.
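
If you prefer scripting this over clicking through the admin, the same change can probably be made from the Django shell. A hedged sketch: the app label thesaurus, the model name Facet and the field names below are assumptions read off the admin UI above, not confirmed against the source.

    cd /var/lib/opensemanticsearch
    python3 manage.py shell

    # inside the shell (model and field names are assumptions):
    from thesaurus.models import Facet
    for facet in Facet.objects.all():
        if str(facet) in ('phone', 'currency'):  # facets you want to hide
            facet.enabled = False           # admin: Enabled = "No"
            facet.snippets_enabled = False  # admin: Snippets enabled = "No"
            facet.graph_enabled = False     # admin: Graph enabled = "No"
            facet.save()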

