
querqy's Introduction


⚠️ IMPORTANT: Querqy 5.5 for Solr introduces some breaking changes that will affect you if you are upgrading from an older version and

  • you are using Info Logging, or
  • you rely on the debug output format, or
  • you are using a custom rewriter implementation.

See here for the release notes: https://querqy.org/docs/querqy/release-notes.html#major-changes-in-querqy-for-solr-5-5-1

Querqy

Querqy is a framework for query preprocessing in Java-based search engines.

This is the repository for querqy-core, querqy-lucene and querqy-solr. Repositories for further Querqy integrations can be found on Querqy.org.

Documentation and 'Getting started'

Visit docs.querqy.org/querqy/ for detailed documentation.

Please make sure you read the release notes!

Check out Querqy.org for related projects that help you speed up search software development.

Developer channel: Join #querqy-dev on the E-commerce search Slack space

License

Querqy is licensed under the Apache License, Version 2.0.

Contributing

Please read our developer guidelines before contributing.

querqy's People

Contributors

dependabot[bot], johannesdaniel, magro, martin-g, mkr, pfries, renekrie, tboeghk, tkaessmann, tomglk, wcurrie, worleydl


querqy's Issues

Rewrite the rewriter deployment

Rewriter deployment is currently handled as a mix of configuration in solrconfig.xml and additional artefacts (like the 'rules.txt' files for the Common Rules rewriter).

This approach has a number of issues:

  • Applying changes in the configuration requires a collection/core reload. Depending on how long it takes to load the rewriters and on the 'useColdSearcher' setting, Solr can be left without a core to answer queries for a few moments.
  • The deployment is just not handy: updating an additional artefact like 'rules.txt' requires pushing data to ZooKeeper in the first step and then reloading the collection in a separate HTTP request. Some Querqy users also had to work around the file size limit in ZooKeeper.

I'd like to suggest rewriting the rewriter configuration and deployment from scratch. The goals should be as follows:

  • A rewriter should be defined using a single HTTP call (REST API). This call should define the name under which the rewriter can be used, the rewriter class to be loaded, and all the specific settings for this rewriter. This implies that, in the case of the CommonRules rewriter, there will be no separate rules.txt file. (Compare 'Defining a rewriter' in Querqy for Elasticsearch - https://github.com/querqy/querqy-elasticsearch#defining-a-rewriter )
  • The API request handler should take care of validating the rewriter definition, storing it and replacing a previously loaded rewriter instance of the same name in a background process so that search traffic will not be disrupted. The request handler should also make the configuration size handling transparent to the client.
  • The API should define means to unload and delete rewriters.
  • Rewriter deployment should work in both SolrCloud and stand-alone mode.
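For illustration, a rewriter definition via such a REST API could look like this. The request shape is modeled on the Querqy-for-Elasticsearch API referenced above; the endpoint path and the factory class name used here are hypothetical:

```
PUT /solr/mycollection/querqy/rewriter/common_rules
{
  "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
  "config": {
    "rules": "notebook =>\n  SYNONYM: laptop"
  }
}
```

The request handler would validate this definition, persist it, and swap the running rewriter instance in the background.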

A few thoughts on the implementation:

I think we should get rid of the pre-configured rewriter chain when implementing this, rather than creating a second API endpoint to configure the chain. We currently have three elements that are defined at chain level:

  1. The list of rewriters. This could be specified per request (see Querqy for Elasticsearch implementation). As a positive side-effect, this would make the chain more flexible.
  2. The term query cache. Much as it helped us in older Solr versions (around 4.x), it doesn't seem to provide any performance gain for current Solr/Lucene versions. We might just want to deprecate and remove it.
  3. Sinks and mappings for rewriter logging. I think we can just provide standard sinks that are always available and then configure the mapping at rewriter level.

There seem to be two ideas where to save the rewriter configuration: in ZooKeeper or in a separate collection (maybe Solr's blob store). I leave this as a matter for discussion. If we did it in ZK we would need to make sure that we work around ZK file size limits (splits, compression) and that we are not mis-using ZK. If we used a separate collection we would probably have to think about the order in which collections are loaded. In both cases, we would have to think about how we propagate rewriter update events across nodes.

I've talked with a number of people about this issue, pinging you to join the discussion here: @JohannesDaniel @mkr @MighTguY @tboeghk

WordBreakCompoundRewriter should check for collations

The WordBreakCompoundRewriter should check for collations within the spellcheck field. If it only checks for the general existence of each part of the compound in the entire index, we could get funny combinations of words. We should get more meaningful de-compounds by checking for the co-occurrence of the splits within a document (via the spellcheck field).

Rewriter for (de)compounding

Implement a rewriter for the (de)compounding of query terms using Lucene's WordBreakSpellChecker. The (de)compound should be added as a query-time synonym.

Create a WordBreakCompoundRewriter variant for German

The WordBreakCompoundRewriter uses a dictionary of words (= a Lucene index field) to test if a word can be split into parts. The split will be allowed if all parts exist in the dictionary as words.

While this strategy works in many cases for German, there also exist compounding strategies that involve a morphological change of the modifier word so that a simple test 'split into parts' + 'lookup' will miss these cases. For example, the word 'baumwolljacke' (cotton jacket) is composed of 'baumwolle' + 'jacke' so that a simple split 'baumwolljacke => baumwoll + jacke' would not find 'baumwolle' in the dictionary because of the missing 'e'.

S. Langer. 1998. Zur Morphologie und Semantik von Nominalkomposita. (Tagungsband der 4. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS)) lists 19 patterns of morphological change that can be found in German compound creation. The idea is to apply at least the most productive patterns before looking up the word splits in the dictionary. (see - https://www.cis.uni-muenchen.de/~stef/veroeffentlichungen/konvens1998.pdf, p.6)
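A minimal sketch of this idea (illustrative only, not Querqy's actual implementation: the Set stands in for the dictionary index field, and the suffix list covers only a few example linking patterns):

```java
import java.util.Optional;
import java.util.Set;

// Illustrative sketch: try restoring an elided modifier suffix
// ('Fugenelement' patterns) before looking up the compound parts.
public class CompoundSplitter {

    // A few example linking patterns; the paper lists 19 of them.
    private static final String[] MODIFIER_SUFFIXES = {"", "e", "en", "n"};
    private static final int MIN_PART_LENGTH = 3;

    private final Set<String> dictionary;  // stands in for the Lucene index field

    public CompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /** Tries to split 'word' into modifier + head, restoring a possibly
     *  elided suffix of the modifier before the dictionary lookup. */
    public Optional<String[]> split(String word) {
        for (int i = MIN_PART_LENGTH; i <= word.length() - MIN_PART_LENGTH; i++) {
            String modifier = word.substring(0, i);
            String head = word.substring(i);
            if (!dictionary.contains(head)) {
                continue;
            }
            for (String suffix : MODIFIER_SUFFIXES) {
                if (dictionary.contains(modifier + suffix)) {
                    return Optional.of(new String[]{modifier + suffix, head});
                }
            }
        }
        return Optional.empty();
    }
}
```

With the dictionary {baumwolle, jacke}, splitting 'baumwolljacke' succeeds because the elided 'e' is restored before the lookup.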

WordBreakCompoundRewriter should support case-insensitive dictionaries

Currently, the WordBreakCompoundRewriter validates candidates for compounds and term constituents against a dictionary field. There is no pre-processing/analysis before validation, so an upper-case search term will never validate against a lowercased dictionary field, i.e. "Hunde Futter" will never be compounded if there is only "hundefutter" indexed in the dictionary field.

Two possible fixes:

  • Get the field's query analyzer from the Solr schema and perform analysis before term validation.
  • Provide an option to just lowercase the term before validation.

Boost on a purely negative query is not applied

Boosting documents that don't contain a given term doesn't work:

a =>
 UP: -b

should boost docs that don't contain b to the top of the search result. This doesn't happen and the debug output shows that querqy.lucene.rewrite.AdditiveBoostFunction produces 0.0 as the score for b. This probably happens as b is produced in a MUST_NOT clause of a BooleanQuery, which doesn't produce a score.

Criteria on Rule Selection

Currently, for a single search term, multiple rules can get selected and applied. The user should be able to choose which rules are applied by adding filtering, sorting and selection criteria.
For example, for the search term "black shoes" there can be multiple matching rules. In that case, the user could provide the criteria in the request URL to Solr:

  • Sorting criteria: "rules.criteria.sort": "priority desc"
  • Filter criteria (multiple allowed): "rules.criteria.filter": "active:true"
  • Selection criteria: after applying the filter and sorting there can still be multiple rules, so if the user wants to apply only the top n rules, they can specify "rules.criteria.size": "1"

Rewriters can supply tracking information

At the moment, there is no systematic way for Rewriters to add tracking information to the response. The available workarounds would be using debug mode or adding a 'decoration', which would either have performance issues (debug) or mix concerns ('decorate').

The idea is to provide an interface and implementation that

  • allows Rewriters to pass tracking information to the search engine. The search engine can then decide how to deal with the information (eg. return it as part of the response or send it to a logging framework etc)
  • allows tracking to be turned on/off (globally and per request?). Rewriters should 'know' whether tracking is turned on, so that they can decide not to create the tracking data at all in case it is turned off.
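A minimal sketch of what such an interface could look like (all names here are hypothetical, not Querqy's actual API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the tracking interface described above.
public class TrackingSketch {

    public interface TrackingSink {
        boolean isEnabled();  // rewriters check this before building tracking data
        void accept(String rewriterId, Map<String, Object> info);
    }

    // A simple sink that collects tracking info in memory; a search engine
    // integration could instead attach it to the response or a log framework.
    public static class CollectingSink implements TrackingSink {
        public final Map<String, Map<String, Object>> collected = new LinkedHashMap<>();
        private final boolean enabled;

        public CollectingSink(boolean enabled) {
            this.enabled = enabled;
        }

        public boolean isEnabled() {
            return enabled;
        }

        public void accept(String rewriterId, Map<String, Object> info) {
            if (enabled) {
                collected.put(rewriterId, info);
            }
        }
    }
}
```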

java.lang.NoSuchMethodError: org.apache.lucene.index.Terms.iterator(Lorg/apache/lucene/index/TermsEnum;)Lorg/apache/lucene/index/TermsEnum

On Solr 5.2.1 & Querqy 2.7.0 (querqy-solr-2.7.0-jar-with-dependencies.jar), I'm getting the error referenced in the issue title: java.lang.NoSuchMethodError: org.apache.lucene.index.Terms.iterator(Lorg/apache/lucene/index/TermsEnum;)Lorg/apache/lucene/index/TermsEnum

In Readme.md, it lists the versions of Querqy that match versions of Solr as:

Querqy 2.6.x to 2.7.x - Solr 5.1
Querqy 2.8.x - Solr 5.3.x

If I try with 2.8 I get a different error message.

This issue on another project may be relevant: OpenSextant/SolrTextTagger#40

It states there: 'So annoyingly in Solr 5.2, this method referred to in your stack trace was changed without regard to backward-compatibility. . . . It's a trivial change to the iterator() method; simply drop the argument to take no argument.'

Separate rewriting controller from Solr QParser plugin

querqy.solr.QuerqyDismaxQParser contains rather complex code for handling request parameters and to control the Querqy query rewriting. As the query rewriting is Solr-independent, the code that controls the query rewriting should be separated and be placed in the querqy-lucene module. The goal is to make it reusable for Elasticsearch (or Lucene in general) and to reduce the complexity of the QuerqyDismaxQParser.

Assure JDK 12 compatibility

Under JDK12 we get at least this runtime error:

java.lang.IncompatibleClassChangeError: Method 'org.apache.lucene.index.Term querqy.lucene.LuceneQueryUtil.toLuceneTerm(java.lang.String, java.lang.CharSequence, boolean)' must be Methodref constant

when using the ClassicWordBreakCompoundRewriter (which should be removed in Querqy 5 in favour of the WordBreakCompoundRewriter).

Support custom language normalization and suffix handling for TrieMap-Lookups

Problem
Given that the product type for a dishwasher is maintained as "Geschirrspüler" in the product data, a Search Manager might want to configure a rule like
spülmaschine => SYNONYM: geschirrspüler

However, there are many ways to type "spülmaschine" and not all of them are correct spellings, e. g. spulmaschine, spuelmaschine, spülmaschinen, spulmaschinen, spuelmaschinen, ... Currently, all of these variants must be configured separately in order to match the synonym rule and to find "Geschirrspüler".

Idea
Enabling developers to define char replacements and suffix handling for TrieMap lookups. This could be implemented by wrapping the current TrieMap implementation. The rules could look like this:

nen =>
    SUFFIX: ne

ü =>
    REPLACE: u

To avoid abusive use leading to ambiguities, the rules could be limited (e. g. 3 chars as a maximum for a suffix rule and 2 chars as a maximum for a replace rule).
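A minimal sketch of applying the two example rules as a normalization step before the TrieMap lookup (illustrative only; the actual implementation would wrap the TrieMap itself):

```java
// Illustrative sketch: normalize a term with the two example rules above
// (ü => REPLACE: u, nen => SUFFIX: ne) before the TrieMap lookup.
public class LookupNormalizer {

    public static String normalize(String term) {
        String normalized = term.replace("ü", "u");        // REPLACE rule
        if (normalized.endsWith("nen")) {                  // SUFFIX rule
            normalized = normalized.substring(0, normalized.length() - "nen".length()) + "ne";
        }
        return normalized;
    }
}
```

Both the rule input and the query term would be normalized the same way, so "spülmaschine", "spulmaschine" and "spülmaschinen" all map to the lookup key "spulmaschine"; handling the "ue" variants would need an additional REPLACE rule.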

Routing with querqy

Hi Rene,

I haven't tested this yet, but do you see any issue with using the route or shard parameters together with Querqy?

Thnx

The AND operator in query treated as keyword

Hello Rene,

I was looking further into querqy for my use case and realized that when I execute a query like below

query

http://localhost:8983/solr/test/select?indent=on&q=text:Rich AND Active&defType=querqy&wt=json&debugQuery=true

debug output

"querystring": "text:Rich AND Active",
"parsedquery": "text:text:rich text:and text:active",

the keyword AND is treated as the term "and" and thus returns unexpected results, while the edismax / default query parser returns the expected result:

debugQuery output

querystring": "text:Rich AND Active",
"parsedquery": "(+(+text:rich +DisjunctionMaxQuery((text:active))))/no_coord",

OR (with dismax)
"querystring": "text:Rich AND Active",
"parsedquery": "+text:rich +text:active",

Please suggest.

Solr Plugin: Current approach of applying reranking leads to performance degradation

We currently wrap the generated main query in a RankQuery

  • when boosts are applied and parameter qboost.method = rerank
  • when an external rerank query is provided using parameter querqy.rq

Wrapping the main query incurs performance penalties when Solr needs to evaluate the main query multiple times. Faceting with facet exclusions seems to be one of these cases.

We should not replace the main query but set the main and rank query individually on the org.apache.solr.handler.component.ResponseBuilder (using #setQuery and #setRankQuery, respectively).

querqy-solr tests sometimes fail due to random locale

Using the Solr testing framework, tests sometimes fail. It seems that this is triggered by some random locale chosen by the framework. One such locale is tr-CY, so

mvn test -Dtests.locale=tr-CY

always fails. Example error:

[ERROR] testThatOptCanBePassedAsBoostMethodParam(querqy.solr.BoostMethodTest)  Time elapsed: 0.002 s  <<< ERROR!
java.lang.RuntimeException: org.apache.solr.core.SolrCoreInitializationException: SolrCore 'collection1' is not available due to init failure: querqy.rewrite.commonrules.RuleParseException: Line 2: @ expected at beginning of line: FILTER: * f2:c
	at querqy.solr.BoostMethodTest.setUp(BoostMethodTest.java:37)
Caused by: org.apache.solr.core.SolrCoreInitializationException: SolrCore 'collection1' is not available due to init failure: querqy.rewrite.commonrules.RuleParseException: Line 2: @ expected at beginning of line: FILTER: * f2:c
	at querqy.solr.BoostMethodTest.setUp(BoostMethodTest.java:37)

Remove LuceneSynonymsRewriter

Remove LuceneSynonymsRewriter and its dependencies.

The original intent of this rewriter was to support the Lucene synonyms syntax to configure query time synonyms. As we want to focus on synonyms via Common Rules and as the synonyms.txt file can easily be converted in a rules.txt file, we should remove this rewriter (and its dependency on 'Sequence' and lucene.FST). The corresponding Solr rewriter factory has been deprecated for a while.

Allow Querqy to take priority over rank queries in Solr

In Solr, a query parameter rq can be used to apply a rank query to re-rank search results (https://lucene.apache.org/solr/guide/8_6/query-re-ranking.html), for example, using a rerank or ltr query. This would override the ranking produced by Querqy query rewriting, which could be problematic, for example, if a search manager creates a boosting rule using the UP/DOWN rules in the common rules rewriter and then the desired effect gets overridden by LTR.

The suggested solution is to give priority to Querqy in cases where it produces boost queries. Rank queries (ltr, rerank,...) should be specified in a parameter querqy.rq instead of rq so that the Querqy query parser can decide to either ignore this parameter - in the case that a Querqy boost query was produced - or apply the rank query - in the case that Querqy didn't produce a boost query.

NPE in TermQueryCacheValue

If the Lucene analysis chain removes the only token in a query, a NullPointerException is thrown in TermQueryCacheValue (branch solr5_1)

org.apache.solr.common.SolrException; null:java.lang.NullPointerException
at querqy.lucene.rewrite.cache.TermQueryCacheValue.<init>(TermQueryCacheValue.java:18)
at querqy.lucene.rewrite.TermSubQueryBuilder.termToFactory(TermSubQueryBuilder.java:94)
at querqy.lucene.rewrite.LuceneQueryBuilder.addTerm(LuceneQueryBuilder.java:293)
at querqy.lucene.rewrite.LuceneQueryBuilder.visit(LuceneQueryBuilder.java:253)
at querqy.lucene.rewrite.LuceneQueryBuilder.visit(LuceneQueryBuilder.java:25)
at querqy.model.Term.accept(Term.java:57)
at querqy.model.AbstractNodeVisitor.visit(AbstractNodeVisitor.java:23)
at querqy.lucene.rewrite.LuceneQueryBuilder.visit(LuceneQueryBuilder.java:190)
at querqy.lucene.rewrite.LuceneQueryBuilder.visit(LuceneQueryBuilder.java:25)
at querqy.model.DisjunctionMaxQuery.accept(DisjunctionMaxQuery.java:25)
at querqy.model.AbstractNodeVisitor.visit(AbstractNodeVisitor.java:31)
at querqy.lucene.rewrite.LuceneQueryBuilder.visit(LuceneQueryBuilder.java:119)
at querqy.lucene.rewrite.LuceneQueryBuilder.visit(LuceneQueryBuilder.java:105)
at querqy.lucene.rewrite.LuceneQueryBuilder.createQuery(LuceneQueryBuilder.java:99)
at querqy.solr.QuerqyDismaxQParser.makeMainQuery(QuerqyDismaxQParser.java:664)
at querqy.solr.QuerqyDismaxQParser.parse(QuerqyDismaxQParser.java:340)
at org.apache.solr.search.QParser.getQuery(QParser.java:141)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:157)
at querqy.solr.QuerqyQueryComponent.prepare(QuerqyQueryComponent.java:30)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:196)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

Allow rewriters to produce phrase queries

Currently Querqy produces only term-centric queries, for example:


<some input> =>
SYNONYM: iphone x

produces:

(field1:iphone | field2:iphone) AND (field1:x | field2:x)

In some cases we want the option to produce:

field1:"iphone x" | field2:"iphone x"

To implement this, we need to add a phrase query object to Querqy's object model. For now, we focus only on producing such phrase queries in the rewriting process and deal with parsing them from the user query later.

The common rules rewriter could use double quotes around the rhs of the rules to mark phrases, like in:

<some input> =>
SYNONYM: "iphone x"

Create Replace Rewriter

There should be a possibility to replace query terms. This will facilitate maintaining rules for different variants of terms, e. g.

  • i phone => iphone
  • iphones => iphone

The replaced terms should not be marked as generated in order to allow applying subsequent common rules on them.

Furthermore, users should be allowed to define multiple tab-delimited inputs for an output, e. g.
i phone iphones ihphone => iphone
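A minimal sketch of the replacement semantics (illustrative only; the actual rewriter would work on the parsed query, not on the raw query string):

```java
import java.util.Map;

// Illustrative sketch: several inputs map to the same output term.
public class ReplaceSketch {

    private static final Map<String, String> RULES = Map.of(
            "i phone", "iphone",
            "iphones", "iphone",
            "ihphone", "iphone");

    public static String rewrite(String query) {
        String rewritten = query;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            rewritten = rewritten.replace(rule.getKey(), rule.getValue());
        }
        return rewritten;
    }
}
```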

deprecated warning for querqy.parser.WhiteSpaceQuerqyParserFactory

Hello Team,

Is there any way to remove the deprecation warning? I am seeing this with Solr 6.6.2:

SolrResourceLoader | Solr loaded a deprecated plugin/analysis class [querqy.parser.WhiteSpaceQuerqyParserFactory]. Please consult documentation how to replace it accordingly.

Thanks,
Susheel

Update to lucene/solr 4.10.2

Just bumping the version doesn't do the trick...

We're having issues with DOWN instructions, not sure if this is related, we need to investigate here. Hopefully the update to 4.10.2 provides the fix sooner than we can investigate :-)

Support suffixes for ReplaceRewriter

The ReplaceRewriter should support suffixes for replacements. Suffixes can be indicated by a wildcard at the beginning of a term, e. g.
*phones => phone

Use cases:

  • Compensate for limitations of stemmers due to language dependencies, e. g. *kameras => kamera
  • Facilitate the definition of subsequently applied synonyms, e. g. *phones => phone

The ReplaceRewriter will replace the matched suffix (sub)string by the sequence on the right side. In order to avoid abusive usage leading to ambiguous behavior, we could consider some limitations on the usage of the suffixes.

As I currently see no use case for using those suffix definitions within sequences of terms (e. g. apple *phones => apple $1phone or the like), wildcards will only be allowed for single-term input definitions.
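The suffix mechanics can be sketched as follows (illustrative only; a rule *kameras => kamera replaces the matched suffix and keeps the rest of the term):

```java
// Illustrative sketch of suffix replacement for a wildcard rule.
public class SuffixReplaceSketch {

    /** Applies a single '*suffix => replacement' rule to a term. */
    public static String apply(String term, String suffix, String replacement) {
        return term.endsWith(suffix)
                ? term.substring(0, term.length() - suffix.length()) + replacement
                : term;
    }
}
```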

Addition of Facet Support

There are cases where, based on the search terms, we can set rules for which facets should be returned. This could also include multiple features such as:

  1. Facet Sort Type
  2. Facet Size
  3. Facet Name

Enable handling of the non matching part for prefix and suffix matches

A prefix/suffix rule like spühl* => spül should not only allow handling the matching part, but also the non-matching part. The non-matching part will be expressed with $1 (analogous to common rules). This will help to

  • remove non-matching parts günstig* => günstig or
  • to split terms abc* => abc $1

Existing configurations will need to be slightly modified: an existing rule like spühl* => spül will need to be refactored to spühl* => spül$1. However, as this feature has not been publicly presented or documented so far, this should not be a big issue.

Add tie property to DismaxQuery in Querqy query object model

querqy.model.DisjunctionMaxQuery should have an optional property 'tie' (in the sense of Lucene's dismax query) so that query rewriters can set specific tie values. In the current implementation a global value is applied (taken from request params). The global tie value shall still be used if the DisjunctionMaxQuery.tie is not set by a query rewriter.

[BUG] Querying term in field that was preloaded for different field(s) can cause Exception

Rules:

a =>
    SYNONYM: b
    UP(100): c

The index has:

doc1: {field1:a}
doc2: {field1:c, field2:x}

Cache preload is configured for field1 only. Running a query with q=a&gqf=field1 field2 throws an Exception:

Caused by: java.lang.IllegalArgumentException: term=[63] does not exist
    at org.apache.lucene.index.TermsEnum.seekExact(TermsEnum.java:114)
    at querqy.lucene.rewrite.DependentTermQuery$TermWeight.getTermsEnum(DependentTermQuery.java:176)

The cache seems to provide some TermContext with documentFrequency>0 when searching for the non-existent field2:c.

Workaround: preload all fields.

querqy-solr-3.0.0-jar-with-dependencies.jar throws java.lang.ClassNotFoundException: org.apache.lucene.analysis.util.ResourceLoaderAware

Hello Rene,

I tested 3.0.0 and 3.2.0 versions and 3.0.0 with Solr-6.0.0 fails with below stack trace.

Can you please look into this? I was planning to put it into our Solr 6 cluster.

Thanks,
Susheel

1729 INFO (coreLoadExecutor-6-thread-3) [ x:techproducts] o.a.s.c.CachingDirectoryFactory Closing directory: /Users/kumars5/opt/solr-6.0.0/server/solr/techproducts/data/index
1729 ERROR (coreLoadExecutor-6-thread-3) [ x:techproducts] o.a.s.c.CoreContainer Error creating core [techproducts]: org/apache/lucene/analysis/util/ResourceLoaderAware
org.apache.solr.common.SolrException: org/apache/lucene/analysis/util/ResourceLoaderAware
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:771)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:642)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:817)
at org.apache.solr.core.CoreContainer.access$000(CoreContainer.java:88)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:468)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:459)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/util/ResourceLoaderAware
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:487)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:520)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:467)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:565)
at org.apache.solr.core.PluginBag.createPlugin(PluginBag.java:121)
at org.apache.solr.core.PluginBag.init(PluginBag.java:221)
at org.apache.solr.core.PluginBag.init(PluginBag.java:210)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:718)
... 10 more
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.analysis.util.ResourceLoaderAware
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 46 more

Common Rules ordering/limiting should support 'levels'

In the Common Rules rewriter, rules can be sorted by a property and then the top n rules can be selected using the limit parameter. Though this is already useful when we want to promote a specific rule over a generic rule (such as 'notebook bag' vs 'notebook' - prio 10 vs prio 1 and sort=prio desc & limit=1), there might be a varying number of high priority rules that we want to apply before filtering out lower priority rules.

The suggested solution uses a new parameter 'limit.level=true|false'. If it is 'true' rules will be sorted by the sort parameter as before (sort=prop desc) but if two rules have the same value for the sort criterion, they will only count as one unit towards the limit.

Let there be a third rule for 'bag', also having prio = 10. Then sort=prio desc&limit=1&limit.level=true will allow rules for 'notebook bag' and 'bag' - both at the same prio 'level' (10) - to be included, but 'notebook' (prio=1, next level) to be excluded.
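The level-aware limiting described above could be sketched like this (rule names and types are illustrative; rules are assumed to be pre-sorted by priority, descending):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of 'limit.level=true': rules sharing the same sort
// value count as one unit towards the limit.
public class LevelLimitSketch {

    /** sortedRules: (rule name, priority) entries, pre-sorted by priority desc. */
    public static List<String> limitByLevel(List<Map.Entry<String, Integer>> sortedRules, int limit) {
        List<String> selected = new ArrayList<>();
        Integer lastPrio = null;
        int levels = 0;
        for (Map.Entry<String, Integer> rule : sortedRules) {
            if (!rule.getValue().equals(lastPrio)) {  // a new priority 'level' starts
                levels++;
                lastPrio = rule.getValue();
            }
            if (levels > limit) {
                break;
            }
            selected.add(rule.getKey());
        }
        return selected;
    }
}
```

With the rules of the example ('notebook bag' and 'bag' at prio 10, 'notebook' at prio 1) and limit=1, both prio-10 rules are selected while 'notebook' is excluded.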

Enable deleting the last non-generated term in common rules rewriter

Currently, the common rules rewriter will not delete a term if the query would no longer contain at least one non-generated term after the deletion.

This behavior leads to the following issues:

  • The behavior is quite counterintuitive.
  • It is not possible to redefine a query by replacing its terms by a combination of boost and filter queries. This would be very powerful as filter queries allow the definition of raw queries, which allows access to the big universe of Solr/Elasticsearch search capabilities.

When deleting the last non-generated term, the rewriter should show the following behavior:

  • If the query still contains generated terms, the query will be processed in the existing way using these terms.
  • If the query does not contain any term anymore, but contains boost and/or filter queries, the query will be defined as a MatchAllQuery.
  • If the query neither contains terms nor boost or filter queries anymore, the query will be defined as NoMatchQuery returning 0 results.

Support compressed rewrite rule files

When using Querqy's rule-based rewriter, one can quickly hit the 1 MB size limit that SolrCloud (or ZooKeeper) currently imposes on config files. While there are workarounds (such as increasing ZK's jute.maxbuffer) and more involved solution approaches (such as #14), supporting GZ compression for rule files might already be a simple first step that eases the pain for most users.
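The compression itself is straightforward with the JDK's GZIP streams. A sketch of the round trip (the deployment code would compress before pushing to ZooKeeper and decompress on loading):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Minimal sketch of GZ round-tripping a rules file.
public class RulesCompression {

    public static byte[] gzip(byte[] rules) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(rules);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Rules files compress well because they are highly repetitive, so even large rule sets should fit well under the 1 MB limit.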

Create simple QuerqyParser that handles field names

This querqy.parser.FieldQuerqyParser implementation must be able to handle field names in the query input string so that querqy.antlr.ANTLRQueryParser will no longer be needed and the querqy-antlr module can be dropped. This will be needed for #30

Create version for Solr 8.1.0

Solr 8.1.0 now comes with com.jayway.jsonpath:json-path bundled. Querqy must not include this library in its 'jar-with-dependencies'

Allow to split rules file for Common Rules

If Querqy runs in SolrCloud, the rules file for the Common Rules rewriter might hit the default ZooKeeper max. file size of 1 MB. Allow configuring a (comma-separated) list of rules files.

Implement a number-unit-rewriter

I am currently thinking about how a number-unit-rewriter could be implemented in Querqy.

The goal: a query like "tv 50 inch" is rewritten to a query like this:

eq(
    boolq(
        dmq(
            term(
                *:"tv"
            ),
            term(
                *:"television"
            ),  
        ),  
        dmq(
            term(
                screen_size:50^20
            ),
            range(
                screen_size:[45 TO 50)^10
            ),
            range(
                screen_size:(50 TO 55]^10
            ),
            range(
                screen_size:[40 TO 45)^5
            ),
            range(
                screen_size:(55 TO 60]^5
            )
        )
    )
)

The rewriter configuration could roughly look like this:

<lst name="rewriter">
    <str name="class">querqy.solr.contrib.NumberUnitRangeRewriterFactory</str>
    <lst name="numberUnitDefinition">
        <!-- units can be defined with a factor the number is multiplied with, 
        e. g. for something like mm, cm, m -->
        <lst name="units">
            <str name="inch">1.0</str> 
        </lst>
        <!-- multiple fields should be configurable -->
        <str name="fields">screen_size</str>
        <lst name="boundaries">
            <lst name="upperBound">
                <str name="weight">10</str>
                <str name="bound">10%</str>
            </lst>
            <lst name="upperBound">
                <str name="weight">5</str>
                <str name="bound">20%</str>
            </lst>
            <lst name="lowerBound">
                <str name="weight">10</str>
                <str name="bound">10%</str>
            </lst>
            <lst name="lowerBound">
                <str name="weight">5</str>
                <str name="bound">20%</str>
            </lst>
        </lst>
    </lst>
</lst>
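The boundary arithmetic implied by the configuration above can be sketched as follows (class and method names are made up): each configured bound defines a percentage band around the parsed value, and adjacent bands are disjoint, e.g. the 20% band starts where the 10% band ends.

```java
import java.util.Locale;

public class NumberUnitBoundaries {

    // Band below the value: [value - outerPct% TO value - innerPct%)
    static String lowerBand(final double value, final double innerPct, final double outerPct) {
        return String.format(Locale.ROOT, "[%s TO %s)",
                fmt(value * (100 - outerPct) / 100), fmt(value * (100 - innerPct) / 100));
    }

    // Band above the value: (value + innerPct% TO value + outerPct%]
    static String upperBand(final double value, final double innerPct, final double outerPct) {
        return String.format(Locale.ROOT, "(%s TO %s]",
                fmt(value * (100 + innerPct) / 100), fmt(value * (100 + outerPct) / 100));
    }

    // Print whole numbers without a trailing ".0".
    private static String fmt(final double d) {
        return d == Math.rint(d) ? String.valueOf((long) d) : String.valueOf(d);
    }

    public static void main(String[] args) {
        System.out.println("screen_size:" + lowerBand(50, 0, 10) + "^10"); // screen_size:[45 TO 50)^10
        System.out.println("screen_size:" + upperBand(50, 0, 10) + "^10"); // screen_size:(50 TO 55]^10
        System.out.println("screen_size:" + lowerBand(50, 10, 20) + "^5"); // screen_size:[40 TO 45)^5
        System.out.println("screen_size:" + upperBand(50, 10, 20) + "^5"); // screen_size:(55 TO 60]^5
    }
}
```

For the example query "tv 50 inch", this reproduces exactly the four range clauses shown in the target query tree above.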

An option could probably be added to keep the original token, or to apply transformations to it such as concat / split.

The more complex part is how to implement this in Querqy. So far, I have identified three options:

  1. Not touching the query model and the LuceneQueryBuilder, but creating a combination of filter and boost clauses instead (probably the worst option)

    • Enforces some kind of MUST for the range part
    • Requires a lot of clauses to be created
    • Feels like a hack
  2. Using RawQuery for the ranges

    • Requires RawQuery to be a DisjunctionMaxClause
    • Requires the LuceneQueryBuilder to visit RawQuery and a RawQueryFactory that creates a query similar to what the method "Query parseRawQuery(final RawQuery rawQuery)" in class DismaxSearchEngineRequestAdapter produces
  3. Enhancing the model by something like a range query

    • Requires the most comprehensive changes, as the entire visitor logic has to be extended
    • Should be the cleanest solution

Personally, I am in favor of option 2, even though it is not the cleanest one: it does not require deep changes to the plugin. Furthermore, making RawQuery nestable in a DisjunctionMaxQuery could also make the plugin more flexible for future rewriters.

What do you think about this?

Add logging

Add logging to Querqy, using a common logging framework

Support Solr Config API

As discussed today in Relevance Slack: I've found that attempting to add the Querqy query parser using the Solr Config API (in Solr 8.2) fails with the exception java.lang.ClassCastException: class java.util.LinkedHashMap cannot be cast to class org.apache.solr.common.util.NamedList. The command below is only accepted (and even then has no effect) if the rewriter property is renamed:

{
  "create-queryparser": {
    "class": "querqy.solr.DefaultQuerqyDismaxQParserPlugin",
    "name": "querqy",
    "parser": {
      "class": "querqy.parser.WhiteSpaceQuerqyParser",
      "factory": "querqy.solr.SimpleQuerqyQParserFactory"
    },
    "rewriteChain": {
      "rewriter": {
        "class": "querqy.solr.SimpleCommonRulesRewriterFactory",
        "ignoreCase": true,
        "querqyParser": "querqy.rewrite.commonrules.WhiteSpaceQuerqyParserFactory",
        "rules": "rules.txt"
      }
    }
  }
}
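The exception suggests the plugin receives its config as nested LinkedHashMaps from the Config API, while it expects Solr's NamedList (essentially an ordered list of name/value pairs). A fix would presumably have to convert the structure recursively before the cast; a rough, Solr-free sketch of that conversion (plain Map.Entry pairs stand in for NamedList here):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapToPairs {

    // Recursively turn nested LinkedHashMaps into ordered name/value pairs,
    // preserving insertion order, as a NamedList-style structure would need.
    @SuppressWarnings("unchecked")
    static List<Map.Entry<String, Object>> toPairs(final Map<String, Object> map) {
        final List<Map.Entry<String, Object>> pairs = new ArrayList<>();
        for (final Map.Entry<String, Object> e : map.entrySet()) {
            final Object value = e.getValue() instanceof Map
                    ? toPairs((Map<String, Object>) e.getValue()) // recurse into nested config
                    : e.getValue();
            pairs.add(new AbstractMap.SimpleEntry<>(e.getKey(), value));
        }
        return pairs;
    }
}
```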
