jaeksoft / opensearchserver



Open-source Enterprise Grade Search Engine Software

Home Page: http://www.opensearchserver.com

License: Apache License 2.0

HTML 32.85% Java 66.67% Dockerfile 0.48%

search search-engine crawler webcrawler webcrawling custom-search indexing lucene opensearchserver java

opensearchserver's Introduction

OpenSearchServer

OpenSearchServer is a powerful, enterprise-class search engine based on Lucene. With the web user interface, the crawlers (web, file, database, ...) and the JSON web service, you can quickly and easily integrate advanced full-text search capabilities into your application. OpenSearchServer runs on Linux/Unix/BSD/Windows.

Quickstart

Docker image

Not available yet; coming soon.

Go with the interface and/or the API

http://localhost:9090

Useful links

Download binaries: https://www.opensearchserver.com/#download
The documentation: https://www.opensearchserver.com/documentation
Issues (bugs, enhancements): https://github.com/jaeksoft/opensearchserver/issues

Features

Search functions

Advanced full-text search features
Phonetic search
Advanced boolean search with query language
Clustered results with faceting and collapsing
Filter search using sub-requests (including negative filters)
Geolocation
Spell-checking
Relevance customization
Search suggestion facility (auto-completion)

Indexation

Supports 18 languages
Fields schema with analyzers in each language
Several filters: n-gram, lemmatization, shingle, diacritic stripping, …
Automatic language recognition
Named entity recognition
Word synonyms and expression synonyms
Export indexed terms with frequencies
Automatic classification

Supported document formats

HTML / XHTML
MS Office documents (Word, Excel, Powerpoint, Visio, Publisher)
OpenOffice documents
Adobe PDF (with OCR)
RTF, Plaintext
Audio files metadata (wav, mp3, AIFF, Ogg)
Torrent files
OCR over images

Crawlers

The web crawler for internet, extranet and intranet
The file systems crawler for local and remote files (NFS, SMB/CIFS, FTP, FTPS, SWIFT)
The database crawler for all JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server, …)
Filter inclusion or exclusion with wildcards
Session parameters removal
SQL join and linked files support
Screenshot capture

General

JSON web service
Index replication and sharding
Federated search

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


opensearchserver's Issues

Analyzer : improve RegularExpressionFilter with a "search and replace" feature

A "preg_replace" option could be useful. We should be able to work with captured groups ($1, $2, ...) to define what the output of the filter is.
See for example: http://php.net/manual/fr/function.preg-replace.php
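
To illustrate the request, here is a minimal Python sketch of a search-and-replace step that reuses captured groups. It is not the RegularExpressionFilter itself; the function name and the sample pattern are assumptions. Python writes group references as \1, \2 where PHP and Java use $1, $2.

```python
import re

def regex_replace(token: str, pattern: str, replacement: str) -> str:
    # Replace every match of `pattern` in `token`.
    # Backslash-digit references in `replacement` point at captured groups.
    return re.sub(pattern, replacement, token)

# Reorder a DD/MM/YYYY token into YYYY-MM-DD using the three captured groups.
print(regex_replace("22/10/2013", r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1"))
```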

Thank you

API to index CSV document

This new API lets you put documents in the index from a text document (such as a CSV file). Each line is a document. The fields are extracted using a regular expression.
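
The line-per-document idea can be sketched as follows. This is only an illustration in Python, not the API itself: the three-column pattern and the field names (id, title, price) are hypothetical.

```python
import re

# Hypothetical pattern: three comma-separated columns mapped to named fields.
LINE_PATTERN = re.compile(r"(?P<id>[^,]*),(?P<title>[^,]*),(?P<price>[^,]*)")

def lines_to_documents(text: str) -> list[dict[str, str]]:
    """Turn each matching line into one document (a field -> value mapping)."""
    documents = []
    for line in text.splitlines():
        match = LINE_PATTERN.fullmatch(line.strip())
        if match:  # lines that do not match the pattern are skipped
            documents.append(match.groupdict())
    return documents

print(lines_to_documents("1,Blue dress,19.50\n2,Red hat,5"))
```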

API Document DELETE: use payload + implement dry-run

Payload

It would be nice to be able to pass values to the Document DELETE API in an attached payload instead of having to give every parameter in the URL.
For instance, to delete several documents based on the product_id field, instead of calling .../services/rest/index/{index_name}/document/product_id/1/2/3/4/5/6/7/8/9/10 it would be possible to call .../services/rest/index/{index_name}/document/product_id and pass the list of values as a payload.

Dry-run

It would also be nice to add a dry-run feature to the DELETE Document API, that would return the number of documents this query would delete if actually run.

This could be an extra parameter to the query: ?dryrun

Thank you,
Alexandre

URL crawl has problem if URL contains white space(%20) in between

Hi,
URL crawl works for all URLs that do not contain spaces (%20), but it is unable to crawl a URL like:
http://localhost:9090/services/rest/index/my_index/crawler/web/crawl?url=http://www.my%20example.org/

my%20example => my example
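
One thing worth checking on the client side is whether the target URL is itself percent-encoded before being passed as the url query parameter. The Python sketch below shows that encoding step only; whether it resolves this issue on the server is an assumption that has not been verified.

```python
from urllib.parse import quote, unquote

BASE = "http://localhost:9090/services/rest/index/my_index/crawler/web/crawl"

def build_crawl_url(target: str) -> str:
    # safe="" encodes every reserved character, including "%", ":" and "/",
    # so the target survives one round of decoding unchanged.
    return BASE + "?url=" + quote(target, safe="")

encoded = build_crawl_url("http://www.my%20example.org/")
print(encoded)
# Decoding the parameter once gives back the original target, %20 included.
print(unquote(encoded.split("?url=", 1)[1]))
```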

Regards,
Mayur

Massive update for classifiers

It would be nice to provide a massive update feature for the classifiers. For example, classifiers could be updated in bulk by uploading a well-formatted CSV file.

Thank you,
Alexandre

New Term filter and Phrase filter

The Term filter lets you search for a term in a field.
The Phrase filter lets you search for several terms in a field (with a phrase slop).

Problem in fields API

I am implementing a Node driver for the API, and I am currently working on fields.

https://github.com/jaeksoft/opensearchserver/wiki/Field-create-update

First problem

The "PUT" method doesn't work; I get a 500.

Second problem

I tried with "POST". The call reports success, but the field is not created.

$ http post http://localhost:9090/services/rest/index/my_index/field name=my_field
HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 22 Oct 2013 08:56:13 GMT
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked

{
    "info": "",
    "successful": true
}


$ http get http://localhost:9090/services/rest/index/my_index/field
HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 22 Oct 2013 08:56:14 GMT
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked

{
    "fields": [],
    "info": "0 field(s)",
    "successful": true
}

I use httpie to send the requests.

Feature to deactivate canonical url handling in web crawler

It would be nice to develop a feature allowing canonical handling to be switched off for web crawler.

Thank you,
Alexandre

Unicode search is not working in version 1.5

Hi,
Unicode search, which worked fine in version 1.4, has stopped working in version 1.5.
In the version 1.5 admin application it works in the query editor, but with an XML/HTTP API call the advscore query receives garbage values.

Regards,
Mayur

Implement Polish language

Hello,

Could you add Polish to the list of languages supported by OSS?

Thank you,
Alexandre

Precise rights management

It would be nice to be able to create users in OSS that have a very limited scope of what they are allowed to do.

Eg. create a user that is only able to edit a synonym list for an index.

Ability to duplicate a database crawl process

It could be nice to be able to duplicate a database crawl, with all the configuration and fields mapping.

Thank you,
Alexandre

Crawler/parser transform letter with accent into html entities

When the HTML parser extracts values from web pages, especially from the <title> element, it seems that accented letters (é, è, ...) are transformed into HTML entities, and those HTML entities are then indexed.

This leads to problems, for instance when using the default "web crawler" template, which puts the value from the <title> node into the autocomplete field of the schema: suggestions from the autocompletion then contain HTML entities instead of proper letters.
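
The expected behavior can be illustrated with a short Python sketch that decodes entities before a value is indexed. It only shows the transformation; where such a step would sit in the OSS parser is not specified here, and the function name is hypothetical.

```python
import html

def decode_entities(value: str) -> str:
    # Convert named and numeric HTML entities back to the characters they stand for.
    return html.unescape(value)

print(decode_entities("Caf&eacute; cr&egrave;me"))
```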

Thank you,
Alexandre

Improve spell check query: return existing words

Spell check queries currently have a small "bug": when the query contains a word that already exists in the index, that word is not returned as is; the spell checker tries to find replacements for it instead.

For instance:

search spell check for "blu dress"
what could be expected: blue for blu, and dress, as this word already exists in index
what is returned: blue for blu, but dresser for dress
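
The expected behavior can be sketched in Python as below. This is an illustration only: it uses difflib from the standard library as a stand-in for the real spell checker, and the vocabulary is a hypothetical set of indexed terms.

```python
from difflib import get_close_matches

def spell_check(query: str, vocabulary: set[str]) -> list[str]:
    """Keep words found in the vocabulary; suggest a replacement for the others."""
    candidates = sorted(vocabulary)  # stable order for the matcher
    result = []
    for word in query.split():
        if word in vocabulary:
            result.append(word)  # existing word: return it unchanged
        else:
            matches = get_close_matches(word, candidates, n=1)
            result.append(matches[0] if matches else word)
    return result

print(spell_check("blu dress", {"blue", "dress", "dresser"}))
```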

Thank you,
Alexandre

Return distance when using GeoFilter

When the GeoFilter is used in a query, the distance should be returned with each result.
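
As a point of reference, a great-circle distance can be computed with the haversine formula. The Python sketch below is an assumption about what such a distance could look like; how the GeoFilter computes or would expose a distance is not stated here. The sample coordinates include a negative longitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Paris to London; London has a negative longitude.
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278), 1))
```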

Script REST API

Method POST
URL: http://localhost:9090/services/rest/script

Script example:

[
  { "command": "ON_ERROR", "parameters": [ "RESUME" ] },
  { "command": "WEBDRIVER_OPEN", "parameters": [ "FIREFOX" ] },
  { "command": "WEBDRIVER_SET_TIMEOUTS", "parameters": [ 60, 60 ] },
  { "command": "WEBDRIVER_RESIZE", "parameters": [ 1024, 768 ] }, 
  { "command": "WEBDRIVER_GET", "parameters": [ "http://www.open-search-server.com" ] },
  { "command": "SLEEP", "parameters": [ 2 ] },
  { "command": "WEBDRIVER_CLOSE" } 
]
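
A client could prepare such a call as in the Python sketch below. The URL and method come from the description above; the Content-Type header is an assumption, and the request is only built here, not sent.

```python
import json
import urllib.request

SCRIPT_URL = "http://localhost:9090/services/rest/script"

def build_script_request(commands: list[dict]) -> urllib.request.Request:
    """Build (but do not send) the POST request carrying the script as JSON."""
    body = json.dumps(commands).encode("utf-8")
    return urllib.request.Request(
        SCRIPT_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )

script = [
    {"command": "WEBDRIVER_OPEN", "parameters": ["FIREFOX"]},
    {"command": "WEBDRIVER_CLOSE"},
]
request = build_script_request(script)
# urllib.request.urlopen(request) would send it to a running instance.
print(request.get_method(), request.full_url)
```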

Web crawler, sitemap : follow links to other sitemaps

It could be great if OSS was able to follow the sitemap links inside a sitemap, as described here: http://www.sitemaps.org/protocol.html#index

Some sitemaps linked in the main sitemap could be gzipped.
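
A minimal Python sketch of the requested behavior: read a sitemap index, decompress it first if it is gzipped, and list the child sitemaps to follow. The function name is hypothetical; the namespace is the one defined by the sitemaps.org protocol.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(data: bytes) -> list[str]:
    """Return the <loc> of every <sitemap> entry in a sitemap index."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text
    ]

index = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap2.xml.gz</loc></sitemap>
</sitemapindex>"""
print(child_sitemaps(index))
```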

Thank you,
Alexandre

Query check scheduler task

This new scheduler task should be used to check whether the content of an index is valid. The task will execute a search request. A JSONPath or XPath query can then be applied to the result. If the result is bad, an email can be sent or a callback URL can be called.

Issue with Geolocation filter with negative coordinates

The Geolocation filter may return wrong results on negative coordinates

Synonyms: create an API to manage them

It would be nice to have an API that would allow full management of synonyms lists.

Thank you,
Alexandre

CSS ignore class: ability to configure our own

It could be useful to be able to configure which CSS classes must be ignored during the crawl process. This would override the default "opensearchserver.ignore" CSS class.

We should be able to provide several CSS classes.

Thank you,
Alexandre

Highlighted attribute is not set in Search API (XML/HTTP)

The highlighted attribute

<snippet name="title" highlighted="yes">Press <strong>release</strong></snippet>

Binary indexation failed when using XML Load and URL tag

Connection manager has been shutdown com.jaeksoft.searchlib.crawler.web.spider.HttpAbstract.execute(HttpAbstract.java:151)

REST Crawler callback

The REST crawler should be able to execute a callback for each document indexed.

Choose a method (POST, PATCH, GET)
Define the URL
One call per document or one call for all documents
Pass id(s) using a query parameter (providing the name of the parameter)
Pass id(s) using payload (JSON Array)

Multiple proxies

For a crawler, possibility to set up several proxies. For a crawl run, the crawler will use all available proxies.

New XmlXPathParser

This new parser will be able to extract data from any XML document using XPATH queries.
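
The idea can be sketched in Python with the standard library, which supports a limited XPath subset. The field names and paths below are hypothetical; the real parser configuration is not described here.

```python
import xml.etree.ElementTree as ET

def extract_fields(xml_text: str, field_paths: dict[str, str]) -> dict[str, list[str]]:
    """Map each field name to the text of every element matched by its XPath."""
    root = ET.fromstring(xml_text)
    return {
        field: [element.text for element in root.findall(path)]
        for field, path in field_paths.items()
    }

catalog = (
    "<catalog>"
    "<product><name>Blue dress</name><price>19</price></product>"
    "<product><name>Red hat</name><price>5</price></product>"
    "</catalog>"
)
print(extract_fields(catalog, {"title": "./product/name", "price": "./product/price"}))
```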

Add a timestamp before error in schedulers list

In the Scheduler tab, on the "Jobs list" screen, it would be nice to display in the "Last error" column the timestamp at which the error occurred, just before the error label.

This would allow users to know if the error is an old one or just happened.

Thank you,
Alexandre

Outlook message parser

A parser able to extract full text data from .msg files (MAPI MSG Outlook)

Query of type "Search (field)" should allow search with phrase query only

When checking the checkbox "Phrase" in the "Search (field)" type of query the final query is a term query + phrase query. There is no way to run only a phrase query.

For instance in this example:

==> We should be able to search only for title:"my tailor is rich"^10.0 content:"my tailor is rich"^10.0. You could add a Terms checkbox next to the Phrase checkbox, checked by default for example.
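
The two checkboxes can be pictured with the Python sketch below, which builds a Lucene-style query string. The function and its term-clause syntax are assumptions made for illustration; only the phrase-only output shown in the request above comes from the issue.

```python
def build_field_query(text: str, boosts: dict[str, float],
                      terms: bool = True, phrase: bool = True) -> str:
    """Build term and/or phrase clauses for each field, each with its boost."""
    clauses = []
    for field, boost in boosts.items():
        if terms:  # hypothetical "Terms" checkbox
            clauses += [f"{field}:{word}^{boost}" for word in text.split()]
        if phrase:  # existing "Phrase" checkbox
            clauses.append(f'{field}:"{text}"^{boost}')
    return " ".join(clauses)

# Phrase-only search: the Terms option is switched off.
print(build_field_query("my tailor is rich", {"title": 10.0, "content": 10.0}, terms=False))
```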

Thank you,
Alexandre

Crawler Field mapping: use same interface as HTML parser field mapping

It would be great to have the same interface in Crawler Field Mapping as the one displayed in HTML Parser Field Mapping: this would allow the use of analyzers and regular expressions on those mappings.

Thank you

Missing libraries

Some external libraries that were once embedded into the full package are now missing:

jCIFS
Mysql Connector

Thank you,
Alexandre

Renderer: problem with facets and navigation

When using a renderer, if one clicks a facet and then navigates through the pages, the facet filter is lost.

Thank you,
Alexandre

Error reading 'driverClassList'

org.zkoss.zel.ELException: Error reading 'driverClassList' on type com.jaeksoft.searchlib.web.controller.crawler.database.DatabaseCrawlListController
java.lang.UnsupportedClassVersionError: net/sourceforge/jtds/jdbc/Driver : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)

Problem in documents API

When I make a DELETE request to the documents API, it removes the document but reports that 0 documents were deleted:

{ successful: true, info: '0 document(s) deleted by id' }

Named Entity Recognition module

Boost fresh results

Hi,

The recency of results is very important for the relevancy of web search results. This is very useful for news, business documents and local search.

The improvement needed in OpenSearchServer is the ability to boost documents by age. This does not mean simply sorting documents by age, because sorting bypasses the score; it means boosting more recent documents according to the gap between the query date and the document update date.
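
One common way to express such a boost is an exponential decay applied to the relevance score. The Python sketch below is an assumption chosen for illustration, not a description of how OpenSearchServer scores documents; the half-life parameter is hypothetical.

```python
def recency_boost(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Halve the score contribution every `half_life_days` of document age."""
    return score * 0.5 ** (max(age_days, 0.0) / half_life_days)

# A slightly less relevant but fresh document can outrank an older one,
# while the original relevance score still matters.
print(recency_boost(1.5, age_days=1), recency_boost(2.0, age_days=60))
```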

Some articles about this feature :

http://fr.slideshare.net/LucidImagination/boosting-documents-in-solr-by-recency-popularity-and-user-preferences

http://www.solrtutorial.com/boost-documents-by-age.html

http://jontai.me/blog/2013/01/advanced-scoring-in-elasticsearch/

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html

http://stackoverflow.com/questions/4724451/boost-fresh-documents-with-lucene

Thanks.
Youri.

Web crawler: crawl url "new" in parallel to the "old" ones

When the crawler finds "old" URLs (URLs to re-fetch), it crawls them first and no longer crawls the "new" ones, even if the parameters of the crawl process allow several hosts to be crawled in parallel.

Crawler should not stop crawling "new" url when it starts re-crawling "old" ones.

Thank you,
Alexandre

Crawler: pattern URL: allow use of wildcard in hostnames

In the definition of URL patterns it should be possible to use a wildcard (*) also for the hostname part of the URL.

For instance: http://*.mydomain.com/*
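
A minimal Python sketch of the matching rule, under the assumption that a wildcard in the hostname should not match across a "/" while a wildcard in the path can match anything. The function name is hypothetical, and patterns are assumed to always include a scheme.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a URL pattern with * wildcards into a compiled regex."""
    scheme, _, rest = pattern.partition("://")
    host, _, path = rest.partition("/")
    # In the hostname, * stops at the first "/"; in the path, * matches anything.
    host_re = "[^/]*".join(re.escape(part) for part in host.split("*"))
    path_re = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile(re.escape(scheme) + "://" + host_re + "/" + path_re)

def url_matches(pattern: str, url: str) -> bool:
    return pattern_to_regex(pattern).fullmatch(url) is not None

print(url_matches("http://*.mydomain.com/*", "http://www.mydomain.com/page.html"))
print(url_matches("http://*.mydomain.com/*", "http://other.com/a.mydomain.com/b"))
```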

Thank you,
Alexandre

Field query: add option to choose different weight for phrase queries

When creating a field query and checking the "Phrase" checkbox, the same weight is used for both the term query and the phrase query.
We should be able to choose a different weight for the "phrase" query.

Thank you,
Alexandre

Copy Of: ability to copy from several fields

The new "Copy Of" version should allow copy from several fields, not only one.

HTML parser - capture with XPath

It would be nice to be able to capture data using XPath in the HTML parser.
Thanks.

Be able to abort database and scheduler process

Analyzer : GroupAllTokensFilter : should not add a separator after last token

The GroupAllTokensFilter filter should not add its "separator" after the last token. For example, if the input tokens are 19 and 50 and the chosen separator is a comma (,), the result should be 19,50 and not 19,50,.
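
The expected output corresponds to a plain join, where the separator is placed only between tokens. A small Python illustration (the function name is hypothetical and this is not the filter's actual code):

```python
def group_all_tokens(tokens: list[str], separator: str) -> str:
    # join() inserts the separator between tokens only, never after the last one.
    return separator.join(tokens)

print(group_all_tokens(["19", "50"], ","))
```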

Thank you

URL browser: "delete selected URL" action deletes every URL

In the URL Browser the action "Delete Selected URL" deletes every URL from the URL database, even when a selection has been made beforehand by running a specific search on the URLs.

Thank you,
Alexandre

External process for Parsers

The parsers are responsible for text extraction. Because a parser may use many libraries, it can consume a lot of memory and trigger an out-of-memory error, crashing the OpenSearchServer instance.

When the external feature is enabled, each time a parser is called, a new process is created.
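
The isolation principle can be sketched in Python: the parsing work runs in a child process, so a crash or memory exhaustion there does not take down the caller. The "parser" below is a stand-in that upper-cases its input; the names and the timeout are assumptions, and this is not how OpenSearchServer launches its parsers.

```python
import subprocess
import sys

# Stand-in "parser": a child Python process that upper-cases its input.
CHILD_CODE = "import sys; sys.stdout.write(sys.stdin.read().upper())"

def parse_in_child_process(content: str, timeout_seconds: float = 30.0) -> str:
    """Run the parser in a separate process and return what it wrote to stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=content,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,  # a stuck parser is killed instead of blocking the caller
        check=True,               # a failed child raises instead of returning bad output
    )
    return completed.stdout

print(parse_in_child_process("text to extract"))
```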

Web crawl returns "Error - Not Allowed" when using a proxy

07:29:19,880 ERROR: root -
java.lang.UnsupportedOperationException
at org.apache.http.impl.client.InternalHttpClient.getParams(InternalHttpClient.java:206)
at com.jaeksoft.searchlib.crawler.web.spider.ProxyHandler.check(ProxyHandler.java:83)

Search field API: phraseBoost should be optional

When using the API to search with a query of type "field query", it seems we need to pass both a boost parameter and a phraseBoost parameter. The second one could be optional: when there is no phraseBoost parameter, boost could be used for both term and phrase queries.

Thank you,
Alexandre

Add an embedded regexp tester

It would be nice to embed in OSS an internal regexp tester that could be used each time a regexp is needed (HTML parser, Regexp filter in analyzers, ...).

Crawler Field mapping: provide more "urlWhen..." source fields

Crawler field mapping should provide several urlWhen... fields, for example:

urlWhenDay = {url}YYYYMMDD
urlWhenMonth = {url}YYYYMM
urlWhenYear = {url}YYYY
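
Reading {url}YYYYMMDD as the URL followed by the formatted crawl date, the three fields could be derived as in this Python sketch. The function name and the choice of date are assumptions.

```python
from datetime import date

def url_when_fields(url: str, when: date) -> dict[str, str]:
    """Build the proposed urlWhen... values by appending a formatted date to the URL."""
    return {
        "urlWhenDay": url + when.strftime("%Y%m%d"),
        "urlWhenMonth": url + when.strftime("%Y%m"),
        "urlWhenYear": url + when.strftime("%Y"),
    }

print(url_when_fields("http://www.example.org/", date(2013, 10, 22)))
```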

Thank you

Crawler: Authentication : allow web form authentication

The crawler should be able to crawl web pages protected by form authentication.
