opensearchserver's Introduction

Join the chat at https://gitter.im/jaeksoft/opensearchserver

OpenSearchServer is a powerful, enterprise-class search engine based on Lucene. Using the web user interface, the crawlers (web, file, database, ...) and the JSON web service, you can quickly and easily integrate advanced full-text search capabilities into your application. OpenSearchServer runs on Linux/Unix/BSD/Windows.

Quickstart

Docker image

Not yet available; coming soon.

Go with the interface and/or the API

http://localhost:9090

Features

Search functions

  • Advanced full-text search features
  • Phonetic search
  • Advanced boolean search with query language
  • Clustered results with faceting and collapsing
  • Filter search using sub-requests (including negative filters)
  • Geolocation
  • Spell-checking
  • Relevance customization
  • Search suggestion facility (auto-completion)

Indexation

  • Supports 18 languages
  • Fields schema with analyzers in each language
  • Several filters: n-gram, lemmatization, shingle, diacritic stripping, …
  • Automatic language recognition
  • Named entity recognition
  • Word synonyms and expression synonyms
  • Export indexed terms with frequencies
  • Automatic classification

Supported documents

  • HTML / XHTML
  • MS Office documents (Word, Excel, PowerPoint, Visio, Publisher)
  • OpenOffice documents
  • Adobe PDF (with OCR)
  • RTF, Plaintext
  • Audio files metadata (wav, mp3, AIFF, Ogg)
  • Torrent files
  • OCR over images

Crawlers

  • The web crawler for internet, extranet and intranet
  • The file systems crawler for local and remote files (NFS, SMB/CIFS, FTP, FTPS, SWIFT)
  • The database crawler for all JDBC databases (MySQL, PostgreSQL, Oracle, SQL Server, …)
  • Filter inclusion or exclusion with wildcards
  • Session parameters removal
  • SQL join and linked files support
  • Screenshot capture

General

  • JSON web service
  • Index replication and sharding
  • Federated search

License

Copyright Emmanuel Keller / Jaeksoft (2008-2020)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

opensearchserver's People

Contributors

adamlonsdale, alexandretoyer, andersevenrud, emmanuel-keller, gitter-badger, guillaumelecerf, hiranchaudhuri1, naveenann, nicohaase, sebastien-andrivet, tetraa, tobias-husmann

opensearchserver's Issues

API to index CSV document

This new API lets you put documents into the index from a text document (such as a CSV file). Each line is a document. The fields are extracted using a regular expression.
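As a rough illustration of the idea (one line per document, fields extracted by a regular expression), here is a minimal Python sketch. It does not call OpenSearchServer; the pattern, the field names (`id`, `title`, `price`) and the semicolon separator are assumptions made for the example.

```python
import re

# Hypothetical pattern: three semicolon-separated columns, one named group per field.
LINE_PATTERN = re.compile(r"^(?P<id>[^;]+);(?P<title>[^;]*);(?P<price>[^;]*)$")


def lines_to_documents(text, pattern=LINE_PATTERN):
    """Turn each non-empty line into a dict mapping field name to value."""
    documents = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = pattern.match(line)
        if match is None:
            continue  # lines that do not match the pattern are skipped
        documents.append(match.groupdict())
    return documents


sample = "1;Blue dress;49.90\n2;Red scarf;12.00\n"
docs = lines_to_documents(sample)
```

Each resulting dictionary corresponds to one document that could then be sent to the index.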

API Document DELETE: use payload + implement dry-run

Payload

It would be nice to be able to pass values to the Document DELETE API as an attached payload, instead of having to give every parameter in the URL.
For instance, to delete several documents based on the product_id field, instead of calling .../services/rest/index/{index_name}/document/product_id/1/2/3/4/5/6/7/8/9/10, it would be possible to call .../services/rest/index/{index_name}/document/product_id and give this as a payload:

1
2
3
4
5
6
7
8
9
10

Dry-run

It would also be nice to add a dry-run feature to the DELETE Document API, that would return the number of documents this query would delete if actually run.

This could be an extra parameter to the query: ?dryrun
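To make the proposal concrete, here is a Python sketch that only builds such a request and does not send it. The payload and the `?dryrun` parameter are the features requested here, not an existing API; the helper name, the `text/plain` content type and the one-value-per-line body encoding are assumptions.

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_request(base_url, index_name, field, values, dry_run=False):
    """Build (but do not send) a DELETE request carrying the values as payload."""
    url = f"{base_url}/services/rest/index/{quote(index_name)}/document/{quote(field)}"
    if dry_run:
        url += "?dryrun"  # proposed parameter: only count what would be deleted
    # Assumption: one value per line, as in the payload shown above.
    body = "\n".join(str(value) for value in values).encode("utf-8")
    return Request(url, data=body, method="DELETE",
                   headers={"Content-Type": "text/plain"})


req = build_delete_request("http://localhost:9090", "my_index",
                           "product_id", range(1, 11), dry_run=True)
```

The request object can be inspected with `req.full_url`, `req.get_method()` and `req.data` before deciding whether to send it.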

Thank you,
Alexandre

Massive update for classifiers

It would be nice to provide a bulk update feature for the classifiers. For example, classifiers could be updated in bulk by uploading a well-formatted CSV file.

Thank you,
Alexandre

Problem in fields API

I am implementing a Node driver for the API and am currently working on fields.

https://github.com/jaeksoft/opensearchserver/wiki/Field-create-update

First problem

The "PUT" method doesn't work; I get a 500.

Second problem

I tried with "POST". The request reports success, but the field is not created.

$ http post http://localhost:9090/services/rest/index/my_index/field name=my_field
HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 22 Oct 2013 08:56:13 GMT
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked

{
    "info": "",
    "successful": true
}


$ http get http://localhost:9090/services/rest/index/my_index/field
HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 22 Oct 2013 08:56:14 GMT
Server: Apache-Coyote/1.1
Transfer-Encoding: chunked

{
    "fields": [],
    "info": "0 field(s)",
    "successful": true
}

I use HTTPie to send the requests.

Unicode search is not working in version 1.5

Hi,
Unicode search, which worked fine in version 1.4, has stopped working in version 1.5.
In the version 1.5 admin application it works in the query editor, but an XML/HTTP API call takes a garbage value in the advscore query.

Regards,
Mayur

Precise rights management

It would be nice to be able to create users in OSS that have a very limited scope of what they are allowed to do.

Eg. create a user that is only able to edit a synonym list for an index.

Crawler/parser transform letter with accent into html entities

When the HTML parser extracts values from web pages, especially from the <title> element, it seems that letters with accents (é, è, ...) are transformed into HTML entities, and those HTML entities are then indexed.

This leads to problems, for instance when using the default "web crawler" template, which puts the value from the <title> node into the autocomplete field of the schema: suggestions from the autocompletion will then contain HTML entities instead of proper letters.
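The expected behaviour can be illustrated outside OpenSearchServer with Python's standard `html.unescape`, which decodes entities back into characters before a value is indexed. The helper name `decode_title` is hypothetical, and this is only a sketch of the desired result, not the parser's actual code.

```python
import html


def decode_title(raw_title):
    """Decode HTML entities so that accented letters are indexed as characters."""
    return html.unescape(raw_title).strip()


title = decode_title("Caf&eacute; cr&egrave;me &#233;t&#233;")
```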

Thank you,
Alexandre

Improve spell check query: return existing words

Spell check queries currently have a small "bug": if the query contains a word that already exists in the index, the word is not returned as is. Instead, the query tries to find replacements for it.

For instance:

  • search spell check for "blu dress"
  • what could be expected: blue for blu, and dress, as this word already exists in index
  • what is returned: blue for blu, but dresser for dress
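The expected behaviour can be sketched in Python as follows. This uses `difflib.get_close_matches` from the standard library as a stand-in for a real spell checker, and the three-word vocabulary is an assumption; it is not how OpenSearchServer computes suggestions.

```python
import difflib


def suggest(words, vocabulary):
    """Keep words already in the vocabulary; suggest a close match otherwise."""
    suggestions = {}
    for word in words:
        if word in vocabulary:
            suggestions[word] = word  # existing word: return it unchanged
            continue
        close = difflib.get_close_matches(word, vocabulary, n=1)
        suggestions[word] = close[0] if close else word
    return suggestions


vocabulary = ["blue", "dress", "dresser"]
result = suggest("blu dress".split(), vocabulary)
```

The key point is the early `continue`: a word found in the index is never replaced by a longer neighbour such as "dresser".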

Thank you,
Alexandre

Script REST API

Method POST
URL: http://localhost:9090/services/rest/script

Script example:

[
  { "command": "ON_ERROR", "parameters": [ "RESUME" ] },
  { "command": "WEBDRIVER_OPEN", "parameters": [ "FIREFOX" ] },
  { "command": "WEBDRIVER_SET_TIMEOUTS", "parameters": [ 60, 60 ] },
  { "command": "WEBDRIVER_RESIZE", "parameters": [ 1024, 768 ] }, 
  { "command": "WEBDRIVER_GET", "parameters": [ "http://www.open-search-server.com" ] },
  { "command": "SLEEP", "parameters": [ 2 ] },
  { "command": "WEBDRIVER_CLOSE" } 
]

Query check scheduler task

This new scheduler task should be used to check whether the content of an index is valid. The task executes a search request. A JSONPath or XPath query can then be applied to the result. If the result is bad, an email can be sent or a callback URL can be called.
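The check itself could look roughly like the Python sketch below. A simple dotted-path lookup stands in for JSONPath, and the response shape (a `numFound` key) is an assumption made for the example; sending the email or calling the callback URL is left to the caller.

```python
import json


def lookup(data, dotted_path):
    """Follow a dotted path such as 'documents.0.score' through dicts and lists."""
    current = data
    for key in dotted_path.split("."):
        current = current[int(key)] if isinstance(current, list) else current[key]
    return current


def check_index(response_text, dotted_path, is_valid):
    """Return 'ok' when the extracted value passes the check, else 'alert'."""
    value = lookup(json.loads(response_text), dotted_path)
    return "ok" if is_valid(value) else "alert"


# Hypothetical search response: an empty index should raise an alert.
response = '{"successful": true, "numFound": 0}'
status = check_index(response, "numFound", lambda count: count > 0)
```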

CSS ignore class: ability to configure our own

It would be useful to be able to configure which CSS classes must be ignored during the crawl process. This would override the default "opensearchserver.ignore" CSS class.

We should be able to provide several CSS classes.

Thank you,
Alexandre

REST Crawler callback

The REST crawler should be able to execute a callback for each document indexed.

  • Choose a method (POST, PATCH, GET)
  • Define the URL
  • One call per document or one call for all documents
  • Pass id(s) using a query parameter (providing the name of the parameter)
  • Pass id(s) using payload (JSON Array)
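The options listed above can be sketched in Python as a function that only describes the calls to make, without sending anything. The function name, its parameters and the dictionary layout are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_callbacks(method, url, ids, per_document=False, query_param=None):
    """Describe the callback requests for a list of indexed document ids."""
    # One call per document, or a single call for all documents.
    batches = [[doc_id] for doc_id in ids] if per_document else [list(ids)]
    calls = []
    for batch in batches:
        if query_param is not None:
            # Pass the id(s) as a repeated query parameter.
            query = urlencode([(query_param, doc_id) for doc_id in batch])
            calls.append({"method": method, "url": f"{url}?{query}", "body": None})
        else:
            # Pass the id(s) as a JSON array payload.
            calls.append({"method": method, "url": url, "body": json.dumps(batch)})
    return calls


calls = build_callbacks("POST", "http://example.com/callback", ["a", "b"])
```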

Multiple proxies

It should be possible to set up several proxies for a crawler. During a crawl run, the crawler will use all available proxies.
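One possible way to use all available proxies is a simple round-robin rotation, sketched below in Python. The rotation strategy is an assumption; the request does not say how proxies should be chosen.

```python
from itertools import cycle


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, rotating through the proxies in order."""
    if not proxies:
        return [(url, None) for url in urls]  # no proxy configured
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


plan = assign_proxies(["u1", "u2", "u3"], ["proxy-a:3128", "proxy-b:3128"])
```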

New XmlXPathParser

This new parser will be able to extract data from any XML document using XPath queries.
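As an illustration of the principle, the Python sketch below maps field names to XPath expressions and collects the matching text. It relies on `xml.etree.ElementTree`, which supports only a subset of XPath; the sample document and field names are made up for the example.

```python
import xml.etree.ElementTree as ET


def extract_fields(xml_text, field_xpaths):
    """Return, for each field, the text of every element matched by its XPath."""
    root = ET.fromstring(xml_text)
    return {
        field: [element.text for element in root.findall(xpath)]
        for field, xpath in field_xpaths.items()
    }


xml_doc = (
    "<catalog>"
    "<product><name>Dress</name><price>49.90</price></product>"
    "<product><name>Scarf</name><price>12.00</price></product>"
    "</catalog>"
)
fields = extract_fields(xml_doc, {"name": "./product/name", "price": ".//price"})
```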

Add a timestamp before error in schedulers list

In the Scheduler tab, on the "Jobs list" screen, it would be nice to display in the "Last error" column the timestamp at which the error occurred, just before the error label.

This would allow users to know if the error is an old one or just happened.

Thank you,
Alexandre

Query of type "Search (field)" should allow search with phrase query only

When the "Phrase" checkbox is checked in a query of type "Search (field)", the final query combines a term query and a phrase query. There is no way to run only a phrase query.

For instance in this example:
fieldquery

==> We should be able to search only for title:"my tailor is rich"^10.0 content:"my tailor is rich"^10.0. A "Terms" checkbox could be added next to the "Phrase" checkbox, for example checked by default.

Thank you,
Alexandre

Missing libraries

Some external libraries that were once embedded into the full package are now missing:

  • jCIFS
  • MySQL Connector

Thank you,
Alexandre

Error reading 'driverClassList'

org.zkoss.zel.ELException: Error reading 'driverClassList' on type com.jaeksoft.searchlib.web.controller.crawler.database.DatabaseCrawlListController
java.lang.UnsupportedClassVersionError: net/sourceforge/jtds/jdbc/Driver : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)

Problem in documents API

When I make a DELETE request to the documents API, it removes the document but reports that 0 documents were deleted:

{ successful: true, info: '0 document(s) deleted by id' }

Boost fresh results

Hi,

Recency is very important for improving the relevancy of web search results. This is especially useful for news, business documents and local search.

The improvement needed in OpenSearchServer is the ability to boost documents by age. This does not mean simply sorting documents by age, because that bypasses the score, but boosting more recent documents based on the date of the query relative to the document update date.
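One common way to express such a boost is to multiply the relevance score by a factor that decays with document age. The Python sketch below uses an exponential half-life decay; the formula, the 30-day half-life and the `max_boost` parameter are illustrative assumptions, not an OpenSearchServer feature.

```python
from datetime import datetime, timedelta


def recency_boost(updated_at, query_time, half_life_days=30.0, max_boost=1.0):
    """Multiplier between 1 and 1 + max_boost, halving every half_life_days."""
    age_days = max((query_time - updated_at).total_seconds() / 86400.0, 0.0)
    return 1.0 + max_boost * 0.5 ** (age_days / half_life_days)


def boosted_score(score, updated_at, query_time, **options):
    """Combine the relevance score with the recency multiplier."""
    return score * recency_boost(updated_at, query_time, **options)


now = datetime(2014, 1, 31)
fresh = boosted_score(2.0, now, now)                       # updated today
older = boosted_score(2.0, now - timedelta(days=30), now)  # one half-life old
```

Because the score is multiplied rather than replaced, a highly relevant old document can still outrank a barely relevant new one.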

Some articles about this feature :

http://fr.slideshare.net/LucidImagination/boosting-documents-in-solr-by-recency-popularity-and-user-preferences

http://www.solrtutorial.com/boost-documents-by-age.html

http://jontai.me/blog/2013/01/advanced-scoring-in-elasticsearch/

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html

http://stackoverflow.com/questions/4724451/boost-fresh-documents-with-lucene

Thanks.
Youri.

Web crawler: crawl "new" URLs in parallel with the "old" ones

When the crawler finds "old" URLs (= URLs to re-fetch), it crawls them first and no longer crawls the "new" ones, even if the parameters of the crawl process allow several hosts to be crawled in parallel.

The crawler should not stop crawling "new" URLs when it starts re-crawling "old" ones.

Thank you,
Alexandre

External process for Parsers

The parsers are responsible for text extraction. Because it uses many libraries, a parser can consume a lot of memory and may trigger an out-of-memory error, crashing the OpenSearchServer instance.

When the external feature is enabled, a new process is created each time a parser is called.
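The isolation principle can be shown with a short Python sketch: the parsing work runs in a child process, so a crash or memory exhaustion there does not take down the parent. The child "parser" here is a trivial stand-in that upper-cases its input, and the function name and timeout are assumptions; OpenSearchServer itself is a Java application and this is not its implementation.

```python
import subprocess
import sys

# Stand-in parser executed in the child process: reads stdin, writes stdout.
PARSER_SNIPPET = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def parse_in_subprocess(text, timeout=30):
    """Run the parser in a fresh process and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", PARSER_SNIPPET],
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if completed.returncode != 0:
        # The child failed, but the calling process is still alive.
        raise RuntimeError(f"parser failed: {completed.stderr.strip()}")
    return completed.stdout


extracted = parse_in_subprocess("hello")
```

The trade-off is the cost of starting a process for every parsed document.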

Web crawl returns "Error - Not Allowed" when using a proxy

07:29:19,880 ERROR: root -
java.lang.UnsupportedOperationException
at org.apache.http.impl.client.InternalHttpClient.getParams(InternalHttpClient.java:206)
at com.jaeksoft.searchlib.crawler.web.spider.ProxyHandler.check(ProxyHandler.java:83)

Search field API: phraseBoost should be optional

When using the API to search with a query of type "field query", it seems we need to pass both a boost parameter and a phraseBoost parameter. The second one could be optional, and the first one could be used for both term and phrase queries when there is no phraseBoost parameter.
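The requested fallback amounts to the following logic, shown as a small Python sketch. The function name is hypothetical; only the `boost` and `phraseBoost` parameter names come from the API.

```python
def resolve_boosts(boost, phrase_boost=None):
    """Return (term boost, phrase boost), reusing boost when phraseBoost is absent."""
    return boost, (boost if phrase_boost is None else phrase_boost)


term_boost, phrase_boost = resolve_boosts(10.0)
```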

Thank you,
Alexandre

Add an embedded regexp tester

It would be nice to embed in OSS an internal regexp tester that could be used each time a regexp is needed (HTML parser, Regexp filter in analyzers, ...).
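The core of such a tester is small: compile the pattern, report syntax errors, and list the matches found in a sample text. The Python sketch below illustrates this with the standard `re` module; the function name and result layout are hypothetical, and Python's regexp syntax differs in places from the Java syntax that OpenSearchServer would use.

```python
import re


def try_regexp(pattern, sample):
    """Compile a pattern and report its matches in the sample, or the error."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return {"valid": False, "error": str(exc), "matches": []}
    matches = [match.group(0) for match in compiled.finditer(sample)]
    return {"valid": True, "error": None, "matches": matches}


report = try_regexp(r"\d+", "order 12, item 345")
```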
