
opensemanticsearch / open-semantic-search


Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

Home Page: https://opensemanticsearch.org

License: GNU General Public License v3.0

Shell 71.43% Dockerfile 24.59% JavaScript 3.98%
search search-engine search-interface ocr ui python semantic skos thesaurus ontologies

open-semantic-search's People

Contributors

agdinten, davidshq, feathered-arch, hpiedcoq, mandalka, opensemanticsearch, s-vx, thecocce, wsldankers


open-semantic-search's Issues

desktop search without VM?

I love Open Semantic Search, but using the VM is a bit inelegant. Could we have a simple GUI for a desktop app (or just an indicator) where we can select which folders to index and start Solr, Tika and the other components? The web frontend is great for searching!

UI: Normalized entities for navigation / faceted search / interactive filters

For navigation, interactive filtering and aggregated overviews, mainly use the facet with normalized entities, which has only one preferred label per entity. The different entries for each occurring alias, language or synonym are sometimes interesting too, so they will remain available in a separate facet with the suffix match_ss.

'Connector_Web' object has no attribute 'map_id'

Can you please let me know where I'm making a mistake?

I'm trying to index a web page by passing its URI, but I get the error 'Connector_Web' object has no attribute 'map_id'. Is this caused by a missing package or by a dependency version issue?


Dependency on PHP package

After the new Debian release, change the dependency from the former php5 package to the new php package, so that it no longer depends on the version and uses the same package name as in Ubuntu.

Upgrade to Debian 9 Stretch

Upgrade package with newest versions of ETL, Search UI and Search Apps / changed dependencies (now Python 3 based)

Permission Error after installation

Hi support,

I have just installed the Ubuntu server package, but at the final step, I got the following error.

N: Can't drop privileges for downloading as file '/home/solr/Downloads/open-semantic-search-server-ubuntu_xenial_16.10.10.deb' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)

After that I tried to run this command:

opensemanticsearch-index-dir /usr/share/doc

to index some files, but I got the following error:

"Exception while data enrichment of file:///usr/share/doc/unzip/ToDo with plugin filter_file_not_modified: [Errno socket error] [Errno 111] Connection refused
Error while exporting to index or database: file:///usr/share/doc/unzip/ToDo
Exception while processing file /usr/share/doc/unzip/ToDo : <urlopen error [Errno 111] Connection refused>"

Also, when going to "localhost/search", I see this error on the home page:

"Error:

Apache_Solr_HttpTransportException: '0' Status: Communication Error in /usr/share/solr-php-ui/Apache/Solr/Service.php:338 Stack trace: #0 /usr/share/solr-php-ui/Apache/Solr/Service.php(1170): Apache_Solr_Service->sendRawGet('http://localhos...') #1 /usr/share/solr-php-ui/index.php(1185): Apache_Solr_Service->search(':_', 0, 10, Array) #232 {main}"

I'm not sure what to do. This is a newly created VM image, so I don't think a firewall is already installed. Any ideas?

Thanks

Ubuntu Packages for Ubuntu version (LTS)

The new Ubuntu 16.04 LTS version needs its own separate deb packages, because its PHP package names and versions differ from Debian stable and so the dependencies differ.

Thanks for the bug reports and donation!

Planned for the end of the first week of October.

Embed full documentation in releases

Since there is now not only fulltext search but a growing number of features, we have to export the documentation from the (Drupal-powered) website to local HTML files.

In future, clicking "Help" will then open the full documentation (at the moment only a very short help text) offline or within your intranet, without accessing a public website.

Can the content types DOCX, XLSX and PPTX be shown as Word, Excel and PowerPoint?

In the content type tab, the old Microsoft formats are shown as Word, Excel and PowerPoint, but the new formats introduced in 2007 are not: as you can see in picture 1, they are listed as application/vnd.openformats etc.
That is confusing for users who want to find their Word, Excel and PowerPoint documents.
I found config.mimetypes.php, but I cannot work out how to change it so that DOC and DOCX files are shown as Word, XLS and XLSX as Excel, and PPT and PPTX as PowerPoint.
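The idea behind this request can be sketched as a lookup from MIME type to a display label, applied before the content type facet is rendered. This is only an illustration in Python, not the contents of config.mimetypes.php: the names `MIME_LABELS` and `content_type_label` are hypothetical, while the MIME type strings are the registered types for the legacy and Office Open XML formats.

```python
# Hypothetical label map: old and new Office formats share one label each.
MIME_LABELS = {
    "application/msword": "Word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Word",
    "application/vnd.ms-excel": "Excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Excel",
    "application/vnd.ms-powerpoint": "PowerPoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "PowerPoint",
}


def content_type_label(mime_type: str) -> str:
    """Return a friendly label, falling back to the raw MIME type."""
    # Drop parameters such as "; charset=..." before the lookup.
    base = mime_type.split(";", 1)[0].strip().lower()
    return MIME_LABELS.get(base, mime_type)
```

Unknown types fall through unchanged, so formats without an entry would still be displayed as they are today.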

Crawl website

If there is enough time or donations, extend the ETL web page indexing UI into a lightweight crawler that indexes a whole website/intranet instead of only the (initial) web page, so that no additional installation and configuration of a web crawler is necessary anymore.

  • Depth option: how deep to follow links
  • Bandwidth / maximum frequency options
  • Respect the robots configuration of the website
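How the three options could interact is sketched below as a breadth-first crawl in Python. It is an assumption-laden illustration, not the project's crawler: `crawl` and the `fetch_links` callback (which would download a page and return its hrefs) are hypothetical, and the robots rules are passed in as already downloaded lines. Only `urllib.robotparser` and `urllib.parse` from the standard library are used.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawl(start_url, fetch_links, robots_lines, max_depth=2, delay=0.0, agent="*"):
    """Breadth-first crawl of one host; returns the URLs that were visited."""
    robots = RobotFileParser()
    robots.parse(robots_lines)          # lines of the site's robots.txt
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue                    # excluded by the robots rules
        visited.append(url)
        if depth >= max_depth:
            continue                    # depth option: do not follow further
        for href in fetch_links(url):
            link = urljoin(url, href)
            # Stay on the start host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
        time.sleep(delay)               # frequency option: pause between fetches
    return visited
```

Passing the link fetcher as a callback keeps the traversal logic testable without network access; a real implementation would also have to honour crawl delays declared by the site itself.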

Thesaurus (SKOS)

Enhance Named Entities Manager towards Thesaurus Manager:

  • Consolidate models from tagging app to use names or concepts from thesaurus manager.
  • Since concepts or entities can now have relations like broader or narrower or a tree structure: support these taxonomies/hierarchies in the search UI for faceted search
  • Use the alternate labels, aliases, synonyms and hidden labels / misspellings not only for faceted search but for full text search, too, e.g. by exporting them to synonyms.txt for Solr or by enriching the search query
  • Add UI element for choice of language for multilingual thesaurus (to set/change field "lang" in models)
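The export to synonyms.txt mentioned in the list could look roughly like the following Python sketch. It assumes a hypothetical in-memory shape for concepts (dicts with `prefLabel`, `altLabels` and `hiddenLabels` keys) rather than the actual models of the thesaurus manager, and it writes the comma-separated form of Solr's synonyms file, in which all terms on one line are treated as equivalent. Labels that themselves contain commas are not handled here.

```python
def synonyms_lines(concepts):
    """Build one synonyms.txt line per concept that has at least one variant."""
    lines = []
    for concept in concepts:
        pref = concept["prefLabel"].strip()
        variants = []
        # Alternate and hidden labels (e.g. misspellings) both become synonyms.
        for label in concept.get("altLabels", []) + concept.get("hiddenLabels", []):
            label = label.strip()
            if label and label != pref and label not in variants:
                variants.append(label)
        if variants:
            lines.append(", ".join([pref] + variants))
    return lines
```

Concepts without any alternate or hidden label produce no line, since a single term has nothing to be expanded to.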

Add filemonitoring option to ETL filecrawler web UI

The ETL datasources UI for managing files and directories for (re)crawling should have an option (checkbox) to monitor the file or directory, so you don't have to manually configure filemonitoring for immediate indexing on file changes in /etc/opensemanticsearch/filemonitoring/files ...

wrong dependency to mod_wsgi for ubuntu 16.04 package

I have downloaded this deb file
https://www.opensemanticsearch.org/download/open-semantic-search_17.07.03.deb
I'm running ubuntu server 16.04.2

The Django apps are not working because the package depends on libapache2-mod-wsgi, but since you are upgrading to Python 3 the dependency should be libapache2-mod-wsgi-py3.

I also had to add a line to the settings file, because otherwise some modules are not found:
sys.path.append("/usr/lib/python3/dist-packages")

Keep up the good work!

Language specific packages

Since features like stemming, OCR and entity recognition are language specific, provide fully preconfigured packages for German, so that no further configuration of the Solr schema and ETL config is needed.

Can we store and retrieve audio media via the menu as well?

Articles, images and video are covered in the menu, but audio is not.
It would be nice if (solution 1) there were an audio button next to videos (instead of the table button, for example) or if (solution 2) audio formats were stored together with video formats and the "video" button were renamed to "media".
The formats I am thinking of are MP3, WAV and FLAC.

xenial installs incompatible django 1.8.7

open-semantic-search-server-ubuntu_xenial_16.10.10.deb

There is a dependency in the .deb package which requires python-django to be installed. In Xenial the version to be installed is 1.8.7, which does not work with several django-views. However removing the dependency and installing django 1.7.7 via pip seems to resolve the problem.

Upgrade Solr

Upgrade to new Solr release so we can search for multi term synonyms, too.

Comment on highlighted search words instructions.

Hi support team,

I believe there is something wrong at this page https://www.opensemanticsearch.org/doc/admin/config/stemming

under "Change the language / grammar", it reads:

and to be able to highlight searched words in the snippets even if written in other form change

...
<field name="content" type="text_en" indexed="false" stored="true" multiValued="true"/> ...

to

...
<field name="content" type="text_de" indexed="false" stored="true" multiValued="true"/>

I don't see any difference here except for the type ("en" vs. "de"). How is this supposed to highlight words that are stemmed from the searched word even if they are not identical to the search term?

Please feel free to delete this post if I am wrong.

Thanks

Query segmenter to a microservice or Python

Migrate the query segmenter/query enrichment, which allows combining stemming and the * operator within a single query and enables options like synonyms, from solr-php-ui to Python or a microservice, so it can be used in the Python / Django / open-semantic-search-apps environment, too.

How to enable the user interface for a bidi language?

Hi admin,

I see that you can switch between English and German. What if I want to have the interface in Arabic and make the whole layout RTL? What should I do and which files should be modified?

Thanks,

Filesystem monitoring *.deb install fails on Ubuntu Xenial

I'm trying to install the filesystem monitoring *.debs on Ubuntu Xenial, but I receive the following error message:

opensemanticsearch-trigger-filemonitoring : Depends: opensemanticsearch-connector-files (>= 0)
but it is not installable

What am I doing wrong?

FWIW here are all setup steps done on this server:

sudo apt install ./open-semantic-search-server-ubuntu_xenial_16.10.10.deb
sudo apt-get install unoconv
sudo apt install ./opensemanticsearch-trigger-filemonitoring_15.06.26_all.deb
...
opensemanticsearch-trigger-filemonitoring : Depends: opensemanticsearch-connector-files (>= 0)
but it is not installable

PDF files are being processed by Tesseract instead of extracting text

Hi,

My PDF files have already been processed by FineReader and contain a text layer. Why does the system use Tesseract to OCR them? How can I configure OSS to just extract the text and to perform OCR only when that is not possible? How can I monitor which files are being processed by the indexing process?

Thanks,
Ryszard
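The behaviour asked for here, text extraction first and OCR only as a fallback, can be sketched in Python as follows. This is not how the ETL is known to work; `extract_pdf_text`, its two callbacks and the `min_chars` threshold are hypothetical names chosen for illustration, with the actual text extraction and OCR left to whatever tools are plugged in.

```python
def extract_pdf_text(path, extract_text, run_ocr, min_chars=20):
    """Return (text, method): use the embedded text layer if it looks usable."""
    text = extract_text(path) or ""
    # A text layer with enough non-whitespace content makes OCR unnecessary.
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    # Otherwise fall back to the (much slower) OCR step.
    return run_ocr(path), "ocr"
```

Returning the method alongside the text also gives a simple hook for logging which files went through OCR.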

It would be very nice if the facets could be closed

When I store 1000 documents, about 150 authors are shown, which makes it tedious to scroll all the way down to reach the tags and content types.
If I could close a facet, everything could be handled from the screen I am looking at, and scrolling down would not be necessary in most cases.
If (option 1) closing facets is not easy to implement, then (option 2) I would like to remove the author facet, but I could not find the right place to make the change.

Non terminating tesseract command with opensemanticsearch-index-dir

I am using the Open Semantic Search Appliance VM to index a shared Samba drive (/mnt/g) with the command:
# opensemanticsearch-index-dir -q /mnt/g
After a night, the command is blocked on:
# ps auxww | grep tess
root 4784 0.0 0.0 12728 2100 pts/6 R+ 09:00 0:00 grep tess
root 6503 98.1 1.4 369904 59360 pts/4 R+ May19 1053:07 tesseract -l ita /tmp/opensemanticetl_pdf_ocr_pcdsY3/image-002-019.pbm /tmp/opensemanticetl_ocr_35c53e7cd57715bc9b76c4143f2418d5

I terminated the 'opensemanticsearch-index-dir -q /mnt/g' command by killing tesseract:
# kill 6503

The opensemanticsearch-index-dir command should check whether the tesseract command terminates within a reasonable amount of time and kill it if it does not.

Regards,
Maurizio
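A time limit of the kind requested in this issue can be sketched with Python's standard `subprocess` module, whose `run` function kills the child process when the timeout expires. The helper name `run_with_timeout` and the limit in the usage comment are hypothetical; this is an illustration, not the code used by opensemanticsearch-index-dir.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return True on success, False on failure or timeout."""
    try:
        # On timeout, subprocess.run kills the child and raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Usage sketch (file names and the 300 second limit are placeholders):
# ok = run_with_timeout(["tesseract", "-l", "ita", "page.pbm", "out"], 300)
```

A caller could then skip the page or the whole file when the function returns False, instead of blocking the indexing run.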

Create django superuser for login to admin interface

Until this is automated in Open Semantic Desktop Search, you can create the superuser admin for logging in to the Django admin interface with the following commands on the command line:

cd /var/lib/opensemanticsearch
python manage.py createsuperuser

To get a shell when using the VMs:

  • Open Semantic Desktop Search: open a terminal and use su to get root
  • Search appliance VM: just log in as root without a password

Where can I find Queue manager and Scheduler?

Hi admin,

I have added 2 RSS feeds and I want to see what the results look like in the search, but I don't know how to commit them to Solr. How and where can I force the update of newly added feeds?

According to this page: https://opensemanticsearch.org/doc/modules

there should be 2 modules, "Queue manager" and "Scheduler", but I cannot seem to find them. Where can I find them?

What command should I use in crontab if I want to use cron?

It would be good to be able to trigger this functionality through the user interface.

By the way, the link to ManifoldCF end user documentation in the following page under "Learn more" is broken. Can you please fix it?
Here is the broken URL used there.

https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html

One last thing. I don't want to create a new issue for it, but please let me know if that is what I am supposed to do.

I cannot seem to find the link to the "Delete" command that deletes the Solr index.

According to this page https://opensemanticsearch.org/doc/admin/rest-api , it should be at this path:

http://127.0.0.1:/search-apps/api/delete, but there is nothing there.

Can you also please let me know how to enable remote access to the Solr admin page?

thanks,
Mohamed

Removing deleted or renamed files from search index

Is there a way to notify the Solr search index about removed or renamed files? They seem to stay in the index forever. Manually forcing a reindex of a directory does not seem to help. I am mapping the filenames from the file directory to an HTTP URL.
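One way to approach this is to compare the document IDs in the index with the filesystem and collect the ones whose file no longer exists. The Python sketch below covers only that comparison and assumes IDs are either plain paths or file:// URIs; `find_stale_ids` is a hypothetical helper, fetching the IDs from Solr and deleting them there is left out, and IDs that have been mapped to HTTP URLs would first have to be mapped back to paths.

```python
import os


def find_stale_ids(indexed_ids, exists=os.path.exists):
    """Return the indexed IDs whose underlying file is gone."""
    stale = []
    for doc_id in indexed_ids:
        # Strip the URI scheme so the remainder can be checked as a path.
        path = doc_id[len("file://"):] if doc_id.startswith("file://") else doc_id
        if not exists(path):
            stale.append(doc_id)
    return stale
```

A renamed file shows up as one stale ID (the old name) plus one new document, so removing the stale IDs after a reindex would cover renames as well.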

UI: Use labels instead of URIs for facets/fields from RDF knowledge graph

To enable full text search, linked references (URIs) in RDF knowledge graphs are transformed to, indexed by and shown by their labels/text instead of the reference URI/ID.

The same should happen when printing facet/field/column/property names in the preview or table view, to get prettier tables and previews and better readability (and, because RDF labels are language sensitive, in the user's language).

Add own preview view for type knowledge graph / RDF

new package template issue

Hello,

I've installed the new package and it fixes the Python 3 dependency, but I think some elements are not linked. I get an error message saying that Django/Python can't find the templates for each of the apps (search-list, thesaurus, datasources).

I've tried specifying the template path directly to test but it doesn't work.

Any idea?

Thanks

Here is the error output:

Request Method:
Request URL: http://xxx/search-apps/search-list/
Django Version: 1.10.7
Exception Type: TemplateDoesNotExist
Exception Value: index.html
Exception Location: /usr/lib/python3/dist-packages/django/template/loader.py in get_template, line 25
Python Executable: /usr/bin/python3
Python Version: 3.5.3
Python Path:
['/var/lib/opensemanticsearch',
'/usr/lib/python35.zip',
'/usr/lib/python3.5',
'/usr/lib/python3.5/plat-x86_64-linux-gnu',
'/usr/lib/python3.5/lib-dynload',
'/usr/local/lib/python3.5/dist-packages',
'/usr/lib/python3/dist-packages',
'/usr/lib/python3/dist-packages',
'/usr/lib/python3/dist-packages/opensemanticetl',
'/']
Server time: Mon, 10 Jul 2017 13:26:16 +0000

Can't download

Stupid question but the files for open-semantic-search desktop are nowhere to be found (error 404 on the official site)
Thanks !

SPARQL query/filter in Ontologies manager

Allow configuring a SPARQL query to fetch ontologies or ontology parts from a triplestore, or to filter the ontology by (complex) SPARQL queries.

So it will be possible to use only filtered parts of ontologies or ontologies from triplestores without managing the SPARQL queries/filters and downloads by external tools.

use external indices

Is it possible for Open Semantic Search to use data from external indices like xapian, nepomuk, or zeitgeist? Many users already use apps like recoll, gnome tracker, or others that build their own fulltext database. It would be great if Open Semantic Search could include this data in its search results.

Especially with apps like zeitgeist that use temporal semantic aspects to group files and documents.
