bripkens / lucene

Node.js lib to transform: lucene query → syntax tree → lucene query

License: MIT License

JavaScript 100.00%

lucene querystring parser stringify formatter grammar peg

lucene

Parse, modify and stringify lucene queries.

Installation | Try It | Usage | Grammar | History

Installation

npm install --save lucene
-or-
yarn add lucene

Usage

const lucene = require('lucene');

const ast = lucene.parse('name:frank OR job:engineer');
console.log(ast);
// {
//   left: {
//     field: 'name',
//     term: 'frank'
//   },
//   operator: 'OR',
//   right: {
//     field: 'job',
//     term: 'engineer'
//   }
// }

console.log(lucene.toString(ast));
// name:frank OR job:engineer
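
Because the AST is a plain object, the "modify" step between parse and stringify is ordinary JavaScript. The following is a minimal sketch, assuming only the node shape shown in the output above (`left`, `operator`, `right`, `field`, `term`). `renameField` is a hypothetical helper written for this illustration and is not part of the library; the AST literal is copied from the example so the snippet runs on its own.

```javascript
// Hypothetical helper (not part of the library): returns a copy of the AST in
// which every node whose `field` equals `from` uses `to` instead.
function renameField(node, from, to) {
  if (node === null || typeof node !== 'object') {
    return node;
  }
  const copy = { ...node };
  if (copy.field === from) {
    copy.field = to;
  }
  if (copy.left) {
    copy.left = renameField(copy.left, from, to);
  }
  if (copy.right) {
    copy.right = renameField(copy.right, from, to);
  }
  return copy;
}

// AST literal copied from the example output above.
const exampleAst = {
  left: { field: 'name', term: 'frank' },
  operator: 'OR',
  right: { field: 'job', term: 'engineer' }
};

const renamed = renameField(exampleAst, 'job', 'occupation');
console.log(renamed.right.field); // occupation
```

The returned object should be usable as input to `lucene.toString` in the same way as the parsed AST, provided the real nodes carry no extra structure that this sketch ignores.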

Grammar

The parser is auto-generated from a PEG grammar by PEG.js, a PEG implementation in JavaScript.

To test the grammar without using the generated parser, or to modify it, try PEG.js online. It is a handy way to test arbitrary queries, inspect the results, or debug a parser problem with a given input.

History

This project is based on thoward/lucene-query-parser.js and its forks (most notably xomyaq/lucene-queryparser). It was forked to allow broader changes to the API surface and project structure, and to add capabilities.

lucene's Issues

Support for lower case operators

I'm using the library to support free-text search. If I parse:
a OR b

the tree is built with "operator": "OR", like this:

{
   "left": {
      "field": "<implicit>",
      "fieldLocation": null,
      "term": "a",
      "quoted": false,
      "regex": false,
      "termLocation": {
         "start": {
            "offset": 0,
            "line": 1,
            "column": 1
         },
         "end": {
            "offset": 2,
            "line": 1,
            "column": 3
         }
      },
      "similarity": null,
      "boost": null,
      "prefix": null
   },
   "operator": "OR",
   "right": {
      "field": "<implicit>",
      "fieldLocation": null,
      "term": "b",
      "quoted": false,
      "regex": false,
      "termLocation": {
         "start": {
            "offset": 5,
            "line": 1,
            "column": 6
         },
         "end": {
            "offset": 6,
            "line": 1,
            "column": 7
         }
      },
      "similarity": null,
      "boost": null,
      "prefix": null
   }
}

However, if I parse a or b, the operator is implicit ("operator": "<implicit>") and "or" is treated as a term, as can be seen in this tree:

{
   "left": {
      "field": "<implicit>",
      "fieldLocation": null,
      "term": "a",
      "quoted": false,
      "regex": false,
      "termLocation": {
         "start": {
            "offset": 0,
            "line": 1,
            "column": 1
         },
         "end": {
            "offset": 2,
            "line": 1,
            "column": 3
         }
      },
      "similarity": null,
      "boost": null,
      "prefix": null
   },
   "operator": "<implicit>",
   "right": {
      "left": {
         "field": "<implicit>",
         "fieldLocation": null,
         "term": "or",
         "quoted": false,
         "regex": false,
         "termLocation": {
            "start": {
               "offset": 2,
               "line": 1,
               "column": 3
            },
            "end": {
               "offset": 5,
               "line": 1,
               "column": 6
            }
         },
         "similarity": null,
         "boost": null,
         "prefix": null
      },
      "operator": "<implicit>",
      "right": {
         "field": "<implicit>",
         "fieldLocation": null,
         "term": "b",
         "quoted": false,
         "regex": false,
         "termLocation": {
            "start": {

It would be nice to have a way of extending the grammar, and therefore the parser, to accept the lower-case equivalents of the operators.
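
Until the grammar accepts lower-case operators, one possible workaround is to normalize the query string before parsing. The sketch below is an assumption-laden illustration, not library behavior: `upperCaseOperators` is a hypothetical helper that upper-cases the bare words `and`, `or` and `not` outside double-quoted phrases. It also rewrites a user who genuinely wants to search for the word "or", so it only suits applications where that trade-off is acceptable.

```javascript
// Hypothetical pre-processing step (not part of the library): upper-case the
// bare words and/or/not so the parser can recognize them as operators.
function upperCaseOperators(query) {
  // Splitting on a capture group keeps the quoted phrases in the result:
  // even indexes hold unquoted text, odd indexes hold quoted phrases.
  const parts = query.split(/("(?:[^"\\]|\\.)*")/);
  return parts
    .map((part, index) => {
      if (index % 2 === 1) {
        return part; // quoted phrase, left untouched
      }
      // Only match whole words delimited by whitespace or the segment edges,
      // so values such as field:or are not affected.
      return part.replace(
        /(^|\s)(and|or|not)(?=\s|$)/gi,
        (match, lead, word) => lead + word.toUpperCase()
      );
    })
    .join('');
}

console.log(upperCaseOperators('a or b'));
// a OR b
console.log(upperCaseOperators('title:"war and peace" and author:tolstoy'));
// title:"war and peace" AND author:tolstoy
```

The output of `upperCaseOperators` would then be passed to `lucene.parse` in place of the raw user input.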

Greater than, less than etc?

Does this parser support queries like age:>21, age:(>=18 OR <=21) or dob:>=1970-01-01?

Can't parse range query string with colon symbol

Try to parse this Lucene query string:

"creation_date:[2017-06-09T10:18:33Z TO 2017-06-09T10:18:33Z]"

and you'll get this error:

{
  "message": "Expected \".\", \"TO\", [^: \\t\\r\\n\\f{}()\"\\/\\^~[\\]] or whitespace but \":\" found.",
  "expected": [
    {
      "type": "literal",
      "value": ".",
      "description": "\".\""
    },
    {
      "type": "literal",
      "value": "TO",
      "description": "\"TO\""
    },
    {
      "type": "class",
      "value": "[^: \\t\\r\\n\\f{}()\"\\/\\^~[\\]]",
      "description": "[^: \\t\\r\\n\\f{}()\"\\/\\^~[\\]]"
    },
    {
      "type": "other",
      "description": "whitespace"
    }
  ],
  "found": ":",
  "offset": 28,
  "line": 1,
  "column": 29,
  "name": "SyntaxError"
}

Regular expressions in queries cannot be parsed

support regex options

For example, with /r/i there is currently no way to get the i option.

escaping tilde

https://runkit.com/embed/kx7k2fbprecw

> lucene.parse('foo~bar:"hello"')
> {
  "left": {
    "boost": null,
    "field": "<implicit>",
    "prefix": null,
    "quoted": false,
    "similarity": 0.5,
    "term": "foo"
  },
  "operator": "<implicit>",
  "right": {
    "boost": null,
    "field": "bar",
    "prefix": null,
    "proximity": null,
    "quoted": true,
    "term": "hello"
  }
}

I'm having issues escaping the tilde in the field name. Escaping seems to work for some other special characters. Any suggestions here?

Possible to parse escaped quotes?

The original lucene lib had an issue and a known limitation related to parsing escaped quotes:
thoward/lucene-query-parser.js#1

When I try the same in this library, I get an unexpected result (though not an error). Does this parser have the same limitation?

CodeSandbox example:
https://codesandbox.io/s/busy-browser-kzykt?file=/src/index.js

Named field fuzzy search?

I'm getting a syntax error when trying to do age:~30

Is there any way to do a fuzzy search on a named field? If not, could it be added?

"AND NOT" is mishandled

An extra space between "AND" and "NOT" results in an incorrect AST:

Correct:

'datacenter:"dca1" AND NOT @reserved.collector.filename:"executor"'

Mangled:

'datacenter:"dca1" AND  NOT @reserved.collector.filename:"executor"'
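
As a stopgap, the extra space can be removed before the query reaches the parser. This is a minimal sketch under stated assumptions: `collapseWhitespace` is a hypothetical helper, not part of the library, and it only protects double-quoted phrases. It does not know about regex terms or backslash-escaped spaces, so queries relying on those would need more care.

```javascript
// Hypothetical pre-processing step (not part of the library): collapse runs of
// whitespace outside double-quoted phrases into a single space.
function collapseWhitespace(query) {
  return query
    .split(/("(?:[^"\\]|\\.)*")/) // odd indexes hold the quoted phrases
    .map((part, index) => (index % 2 === 1 ? part : part.replace(/\s+/g, ' ')))
    .join('')
    .trim();
}

const mangled = 'datacenter:"dca1" AND  NOT @reserved.collector.filename:"executor"';
console.log(collapseWhitespace(mangled));
// datacenter:"dca1" AND NOT @reserved.collector.filename:"executor"
```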

Anti-slashes are not properly handled

Hello.

I'm trying to use a field name with a space in it. It seems that I can make this work with the Java Lucene SyntaxParser but not with your library (please overlook the fact that using a space in a field name is probably a very bad idea ;) ).

Here is what I do in Java:

And here are a couple of tests that give inconsistent results, as far as I can tell:

Here is the code to reproduce:

var lucene = require("lucene")
var ast = lucene.parse('name:"hello there" AND (tags.tag one:(a OR c) AND tags.tag2:b)');
console.log(lucene.toString(ast), "as expected 1");

ast = lucene.parse("name:\"hello there\" AND (tags.tag one:(a OR c) AND tags.tag2:b)");
console.log(lucene.toString(ast), "as expected 2");

ast = lucene.parse("name:\"hello there\" AND (tags.tag\ one:(a OR c) AND tags.tag2:b)");
console.log(lucene.toString(ast), "not sure what was to expect here but feels weird to have lost the antislash");

ast = lucene.parse("name:\"hello there\" AND (tags.tag\\ one:(a OR c) AND tags.tag2:b)");
console.log(lucene.toString(ast), "if we lost the antislash previously we should have had one antislach here i suppose ?");

ast = lucene.parse('name:"hello there" AND (tags.tag\ one:(a OR c) AND tags.tag2:b)');
console.log(lucene.toString(ast), "I was definitly expecting the anti slash to remain here");

ast = lucene.parse('name:"hello there" AND (tags.tag\\ one:(a OR c) AND tags.tag2:b)');
console.log(lucene.toString(ast), "And here suddenly I have two anti slashes");
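
Part of the difference between these cases may come from JavaScript string-literal escaping rather than from the parser. In an ordinary string literal, a backslash followed by a space is an unrecognized escape and yields just the space, so the backslash never reaches `lucene.parse`; a doubled backslash yields exactly one. The snippet below only inspects the strings themselves and makes no claim about what the parser does with them.

```javascript
// "\ " is not a recognized escape, so the backslash is dropped by JavaScript.
const singleBackslash = "tags.tag\ one";
// "\\" is the escape for one literal backslash.
const doubleBackslash = "tags.tag\\ one";

console.log(singleBackslash.includes('\\')); // false
console.log(doubleBackslash.includes('\\')); // true
console.log(doubleBackslash.length - singleBackslash.length); // 1

// String.raw keeps the backslash exactly as typed.
console.log(String.raw`tags.tag\ one` === doubleBackslash); // true
```

Under this reading, the cases written with a single backslash are identical to the cases with none, and only the doubled-backslash cases actually test the parser's escape handling.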

Grouping order for query like "𝑎 AND 𝑏 AND 𝑐"

It seems the query: "𝑎 AND 𝑏 AND 𝑐" is by default grouped as "𝑎 AND (𝑏 AND 𝑐)".

Would it be unreasonable to expect it be grouped instead as "(𝑎 AND 𝑏) AND 𝑐"?

I'm creating a filter based on this library, and I stumbled upon a particular query that makes me think that the latter might be more natural.

E.g.: For this data:

const data = [
  { /* 0 */ name: 'C-3PO', species: 'Droid', height: 1.7526, misc: {} },
  { /* 1 */ name: 'R2-D2', species: 'Droid', height: 1.1, misc: {} },
  { /* 2 */ name: 'Anakin Skywalker', species: 'Human', height: 1.9 },
  { /* 3 */ name: 'Obi-Wan Kenobi', species: 'Human', height: 1.8, misc: {} },
  { /* 4 */ name: 'Han Solo', species: 'Human', height: 1.8, misc: {} },
  { /* 5 */ name: 'Princess Leia', species: 'Human', height: 1.5, misc: {} },
];

If I query:

an AND NOT wan AND NOT han

I expect the result to be

{ /* 2 */ name: 'Anakin Skywalker', ... }

right?

But that happens only when the query is specifically formatted as:

(an AND NOT wan) AND NOT han

To elaborate step-by-step:

Case 1: 'an AND NOT wan AND NOT han'

Query split as

{
left: 'an', 
operator: 'AND NOT', 
right: 'wan AND NOT han'
}

Parse left side 'an' = 3 results: [Anakin, Obi-Wan, Han Solo]
Parse right side: 'wan AND NOT han'

Query split as:
```
{
  left: 'wan', 
  operator: 'AND NOT', 
  right: 'han'
}
```
1. Parse left side 'wan' = 1 result: [Obi-Wan]
2. Parse right side 'han' = 1 result: [Han Solo]
3. Apply operator AND NOT
```
[Obi-Wan] AND NOT [Han Solo] 
```
  = 1 result: [Obi-Wan]
Apply operator AND NOT
```
[Anakin, Obi-Wan, Han Solo] AND NOT [Obi-Wan]
```
= 2 results: [Anakin, Han Solo]

End Result: [Anakin, Han Solo]

Case 2: '(an AND NOT wan) AND NOT (han)'

Query split as:

{
  left: 'an AND NOT wan', 
  operator: 'AND NOT', 
  right: 'han'
}

Parse left side 'an AND NOT wan'

Query split as:
```
{
  left: 'an', 
  operator: 'AND NOT', 
  right: 'wan'
}
```
1. Parse left side 'an' => 3 results: [Anakin, Obi-Wan, Han Solo]
2. Parse right side 'wan' => 1 result: [Obi-Wan]
3. Apply operator AND NOT
```
[Anakin, Obi-Wan, Han Solo] AND NOT [Obi-Wan] 
```
  = 2 results: [Anakin, Han Solo]
Parse right side 'han' = 1 result: [Han Solo]
Apply operator AND NOT
```
[Anakin, Han Solo] AND NOT [Han Solo]
```
= 1 result: [Anakin]

End Result: [Anakin]

So, as you can see only Case 2 gives the expected result.

Unless my expectations or my algorithm are flawed, in which case I'd appreciate the correction.
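
The two traces above can be reproduced with a simplified tree. This is a sketch under several assumptions, not the library's AST or evaluation logic: leaves are plain term strings (following the split notation used above), a term matches when it is a case-insensitive substring of the name, and `leftAssociate` is a hypothetical helper that regroups a right-nested chain while leaving explicitly parenthesized groups alone.

```javascript
const names = ['C-3PO', 'R2-D2', 'Anakin Skywalker', 'Obi-Wan Kenobi', 'Han Solo', 'Princess Leia'];

// Assumption: a bare term matches when it is a substring of the name.
function matchTerm(term) {
  return names.filter((name) => name.toLowerCase().includes(term.toLowerCase()));
}

// Evaluate a simplified tree: leaves are term strings, inner nodes are
// { left, operator, right }. Only 'AND' and 'AND NOT' are handled.
function evaluate(node) {
  if (typeof node === 'string') {
    return matchTerm(node);
  }
  const left = evaluate(node.left);
  const right = new Set(evaluate(node.right));
  if (node.operator === 'AND NOT') {
    return left.filter((name) => !right.has(name));
  }
  if (node.operator === 'AND') {
    return left.filter((name) => right.has(name));
  }
  throw new Error('Unsupported operator: ' + node.operator);
}

// Regroup a op1 (b op2 c) as (a op1 b) op2 c, repeating along the chain.
// Groups flagged as parenthesized are kept as written.
function leftAssociate(node) {
  if (typeof node === 'string') {
    return node;
  }
  let current = node;
  while (typeof current.right !== 'string' && !current.right.parenthesized) {
    const inner = current.right;
    current = {
      left: { left: current.left, operator: current.operator, right: inner.left },
      operator: inner.operator,
      right: inner.right
    };
  }
  return current;
}

// Case 1: the default right-nested grouping.
const rightNested = {
  left: 'an',
  operator: 'AND NOT',
  right: { left: 'wan', operator: 'AND NOT', right: 'han' }
};

console.log(evaluate(rightNested)); // [ 'Anakin Skywalker', 'Han Solo' ]
console.log(evaluate(leftAssociate(rightNested))); // [ 'Anakin Skywalker' ]
```

Regrouping changes the meaning only for non-associative operators such as AND NOT; a chain of plain AND gives the same result either way.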

`/` in the query results in parse error

https://runkit.com/embed/j4erp5p1jqly

var lucene = require("lucene")
lucene.parse('field:test/')

results in

peg$SyntaxError: Expected "!", "&&", "(", "+", "-", ".", "AND NOT", "AND", "NOT", "OR NOT", "OR", "[", "\"", "\\", "^", "{", "||", "~", [^: \t\r\n\x0C{}()"/\^~[\]], end of input, or whitespace but "/" found.

but according to https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping%20Special%20Characters, / does not need to be escaped.

I'm not sure whether this is a bug in the parser or invalid Lucene syntax.

Whitespace after opening parenthesis is breaking parser

Consider this query:

lucene.parse('foo AND ( bar OR baz)');

// Output: SyntaxError 
Line 1, column 11: Expected "!", "&&", "+", "-", "AND NOT", "AND", "NOT", "OR NOT", "OR", "||", or whitespace but "b" found.

A syntax error is thrown because of the whitespace just after the opening parenthesis.

Grammar potentially incorrectly parses fields with whitespaces before terms

Firstly, many thanks for the great library!

I have been testing it a lot and found what I believe to be a small issue. Parsing a field with one or more spaces after the colon followed by another term incorrectly groups the term with the field.

Example: color: red parses as

{
   "left": {
      "field": "color",
      "term": "red",
      ...
   }
}

I would expect it to parse as a syntax error or two separate terms. I am by no means an expert so I could be mistaken.

ES6 modules

Would you consider using ES6 modules?

Thanks :)

Problem stringifying the AST with a parenthesized negated expression

If a parenthesized expression has a start (no left-hand expression), the parenthesis is not placed correctly when stringifying the AST.

Example:

const { parse, toString } = require('lucene')

toString(parse('my.prop:value1 AND (NOT _exists_:other.prop OR other.prop:value2)'))
// Result is -> "my.prop:value1 AND NOT (_exists_:other.prop OR other.prop:value2)"

At a glance, the fix should be simple. Check if parenthesized is set when concatenating start and make sure start is not set when adding an opening parenthesis for a parenthesized left-hand.

AST explorer

Would be awesome if this would be added as a parser to AST explorer

https://github.com/fkling/astexplorer#how-to-add-a-new-parser

Then it could be used here: https://astexplorer.net

Date rounding is reported as an error

Hello,

First of all, thanks for providing this great library (very handy in many cases)!
I have noticed that date rounding is reported as an error.

Consider the following valid Lucene query:

dateModified_date:[NOW/YEAR TO NOW]

The slash after "NOW" is reported as unexpected:

Line 1, column 23: Expected ".", "TO", "\\", [^ \t\r\n\x0C{}()"/\^~[\]], or whitespace but "/" found.

We get the very same result when using the PEG grammar defined in this repository with PEG.js online.

Thanks for your attention

Improve escaping support

Malformed return queries from the toString operation

When submitting a query like name:(-frank), the toString operation returns name:(-frank, without the closing parenthesis.
Example:

const lucene = require('lucene');

const ast = lucene.parse('name:(-frank)');
console.log(ast);

// {
//   left: {
//     left: {
//       field: '',
//       term: 'frank',
//       quoted: false,
//       similarity: null,
//       boost: null,
//       prefix: '-'
//     },
//     parenthesized: true,
//     field: 'name'
//   }
// }

console.log(lucene.toString(ast));
// name:(-frank

I found that if I modify the code in the toString.js file as below, it works. What do you think?
.....
if (ast.left) {
  if (ast.parenthesized) {
    result += '(';
  }
  result += toString(ast.left);

  if (ast.parenthesized && !ast.right) {
    result += ')';
  }
}

......

foo:-bar is not possible according to lucene QueryParser Java impl

Falsy numbers as term values lead to invalid queries in toString()

When a Node has term: 0, the query returned by toString() will lack a value:

> const lucene = require("lucene");
> lucene.toString({
... "left": {
..... "field": "field",
..... "fieldLocation": {
....... "start": {
......... "offset": 0,
......... "line": 1,
......... "column": 1
......... },
....... "end": {
......... "offset": 5,
......... "line": 1,
......... "column": 6
......... }
....... },
..... "term": 0,  // <-----
..... "quoted": false,
..... "regex": false,
..... "termLocation": {
....... "start": {
......... "offset": 6,
......... "line": 1,
......... "column": 7
......... },
....... "end": {
......... "offset": 7,
......... "line": 1,
......... "column": 8
......... }
....... },
..... "similarity": null,
..... "boost": null,
..... "prefix": null
..... }
... });
'field:'

Changing "term": 0 to "term": "0" fixes the problem, returning 'field:0'.

I think this is due to checking falsy values in these places:

lucene/lib/toString.js

Line 52 in 961ecf2

if (ast.term || (ast.term === '' && ast.quoted)) {

lucene/lib/toString.js

Line 77 in 961ecf2

if (ast.term_min) {

Of course, a workaround is to always use strings as term values, but allowing numbers and returning an invalid query like this is very confusing.
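
The difference between a truthiness check and an explicit presence check can be shown in isolation. In this sketch, `hasTermTruthy` mirrors the condition quoted above from line 52, while `hasTermStrict` is a hypothetical alternative written for illustration; neither is the library's actual code path, and how a fix should treat other falsy values is left open.

```javascript
// Mirrors the condition quoted above: a numeric 0 is falsy and is skipped.
function hasTermTruthy(node) {
  return Boolean(node.term || (node.term === '' && node.quoted));
}

// Hypothetical alternative: only null/undefined count as "no term"; an empty
// string still requires `quoted`, as in the original condition.
function hasTermStrict(node) {
  if (node.term === undefined || node.term === null) {
    return false;
  }
  if (node.term === '') {
    return Boolean(node.quoted);
  }
  return true;
}

console.log(hasTermTruthy({ field: 'field', term: 0 })); // false
console.log(hasTermStrict({ field: 'field', term: 0 })); // true
console.log(hasTermStrict({ field: 'field', term: '', quoted: true })); // true
console.log(hasTermStrict({ field: 'field' })); // false
```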