
noise's People

Contributors

atd-schubert, damienkatz, daschl, erg, jdanford, vmx


noise's Issues

`default` return with single digit doesn't work

If the default value for a return clause is a single-digit value (including negative ones), the query fails. It works with other numeric values. Example test case:

exports['test default=0 bug'] = function(assert, done) {
    var index = noise.open("default0bug", true);
    var doc = {_id: "a", foo: "bar"};
    index.add([doc]).then(resp => {
        assert.deepEqual(resp, ["a"], "doc created");
        return index.query('find {foo: == "bar"} return .hammer default=0');
    }).then(resp => {
        assert.deepEqual(resp, [0], "return .hammer default=0 is correct");
        done();
    }).catch(error => {
        console.log(error);
    });
};

Fails with:

  test default=0 bug
    ✓ doc created
Error: Expected json value for default
    at Error (native)
    at Socket.localCb (/home/vmx/src/rust/noise/node-noise/lib/noise.js:38:34)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at Socket.Readable.push (_stream_readable.js:134:10)
    at Pipe.onread (net.js:551:20)

Remove "only logical nots" limitation

When you negate all your operators like:

find {webChannel: != null}

you get this error:

{"error": "Error: query cannot be made up of only logical not. Must have at least one match clause not negated."}

But now that we have `find {}`, it makes sense to also support such queries.

Add continuous integration

This repository really needs proper continuous integration. The tests should be run on every commit (possibly also the node binding test suite).

Changing the not equal operator

I've already had two people tell me that they would prefer the not-equal operator to be != rather than the JavaScript-like !==.

To explain where it comes from: equal is == and then you just put a ! in front to have it negated. Hence !==. Though I agree that != is way more common.

One proposal went even as far as proposing = for equal. I don't agree with that. I always found that confusing when I saw that in SQL.

Although I'm a huge fan of "there's only one way", I propose having != and !== mean the same thing. That way you have the symmetry between == -> !== and ~= -> !~= (do we support that already?), but also the more common !=.

What do others think?

`limit` doesn't work as expected

The limit clause often returns the full result set, and you can even provoke a spurious syntax error. Instead of explaining the cases I've found, I've created test cases (note that not all of them fail, but quite a few do).

# limit clause tests

drop target/tests/querytestlimit;
create target/tests/querytestlimit;


add {"_id":"1", "A": 6};
"1"
add {"_id":"2", "A": 6};
"2"
add {"_id":"3", "A": 4};
"3"
add {"_id":"4", "A": 4};
"4"
add {"_id":"5", "A": 1};
"5"

# "limit" tests with find clause only

find {A: >= 1};
[
"1",
"2",
"3",
"4",
"5"
]

find {A: >= 1}
limit 1;
[
"1"
]

find {A: >= 1}
limit 3;
[
"1",
"2",
"3"
]

find {A: < 5};
[
"3",
"4",
"5"
]

find {A: < 5}
limit 2;
[
"3",
"4"
]

# "limit" tests with ordering

find {A: > 3}
order .A;
[
"4",
"3",
"2",
"1"
]

find {A: > 3}
order .A
limit 1;
[
"4"
]

# "limit" tests with return

find {A: >= 1}
return .;
[
{"A":6,"_id":"1"},
{"A":6,"_id":"2"},
{"A":4,"_id":"3"},
{"A":4,"_id":"4"},
{"A":1,"_id":"5"}
]

find {A: >= 1}
return .
limit 1;
[
{"A":6,"_id":"1"}
]

find {A: >= 1}
return .A;
[
6,
6,
4,
4,
1
]

find {A: >= 1}
return .A
limit 1;
[
6
]

# "limit" tests with return and ordering

find {A: >= 1}
order .A
return .A;
[
1,
4,
4,
6,
6
]

find {A: >= 1}
order .A
return .A
limit 3;
[
1,
4,
4
]

Losslessly store shredded documents by seq and keypath

Add a new keyspace that can be used to store complete documents and retrieve portions of them when gathering results for projections and ordering.

When shredding documents, each value (a terminal value in a object field or array slot) should be inserted with it's own unique key path, like this:

key:
d!
value:
normalizedvlaue

where is s for string, n for number, t for true, f for false and n for null.

so this document after shredding:
{"_id": "1", "foo": [{"bar": "baz"}, {"bar": "baz"}]}

should be inserted like this:

key:
1.foo$.bar!0
value:
sbaz

key:
1.foo$.bar!1
value:
sbaz

This will allow us to return just the exact fields we are interested in during queries without loading unnecessary data, and to update/delete documents (and inverted index values) quickly without scanning the whole index.
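The proposed encoding can be sketched with stdlib-only Rust. The helper names (`value_key`, `tagged_value`) and the exact key layout are illustrative guesses based on the example keys above, not the actual Noise code:

```rust
// Hypothetical sketch of the proposed keyspace: one key per terminal value,
// built from the document seq, the keypath and the array index.

fn value_key(seq: &str, keypath: &str, array_index: usize) -> String {
    // e.g. seq "1", keypath ".foo$.bar", index 0 -> "1.foo$.bar!0"
    format!("{}{}!{}", seq, keypath, array_index)
}

fn tagged_value(value: &str) -> String {
    // terminal values carry a one-character type tag, e.g. "s" for string
    format!("s{}", value)
}

fn main() {
    // shredding {"_id": "1", "foo": [{"bar": "baz"}, {"bar": "baz"}]}
    let entries = [
        (value_key("1", ".foo$.bar", 0), tagged_value("baz")),
        (value_key("1", ".foo$.bar", 1), tagged_value("baz")),
    ];
    for (key, value) in &entries {
        println!("{} => {}", key, value);
    }
    assert_eq!(entries[0], ("1.foo$.bar!0".to_string(), "sbaz".to_string()));
}
```

Because every terminal value gets its own key, a projection like `return .foo` can be answered with a prefix scan over `1.foo` instead of loading the whole document.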

Wildcard search didn't work as expected

Disclaimer: I didn't read the documentation :-)

I searched for:

find
    {name: ~= "geo*"}
return
    .

and got results:

{
  "_id": "14153",
  "cast": [],
  "episodes": [
    {
      "airdate": "2016-03-05",
      "airtime": "07:00",
      "name": "Funky Feathers",
      "number": 1,
      "runtime": 120,
      "season": 1,
      "summary": "<p>Brainteasers, wacky animal facts, hip-hopping birds, animated adventures, and a cup of Joe with your favorite vet.</p>",
      "url": "http://www.tvmaze.com/episodes/650336/nat-geo-wild-kids-1x01-funky-feathers"
    },
    {
      "airdate": "2016-03-12",
      "airtime": "07:00",
      "name": "Jungle Jamboree",
      "number": 2,
      "runtime": 120,
      "season": 1,
      "summary": "<p>Explore bizarre creatures in Wonderfully Weird; get inspired by the wildlife rescue team on Bandit Patrol; special guests in Dr. Pol Coffee Breaks; cute and cuddly animal buddies in Unlikely Animal Friends.</p>",
      "url": "http://www.tvmaze.com/episodes/650337/nat-geo-wild-kids-1x02-jungle-jamboree"
    },

I was expecting only results with "geo*" in the name, like "George".

Refactor out the Query language from RocksDB

I have a use case to match JSON responses from a variety of RESTful APIs and other sources. I'd like to be able to use the query language to parse out a single JSON document or even a JSON array.

Is there a way to do this currently? From the source it looks like it has to be in RocksDB to query it.

Noise panics if return field is a prefix of another one

If you try to return a field whose name is a prefix of an existing one, it will panic with an error like:

thread 'test_repl' panicked at 'somehow couldn't parse key segment V1#.pre V1#.prefix', 

Here's a repl test case:

# prefix bug

drop target/tests/querytestprefix;
create target/tests/querytestprefix;


add {"_id":"1", "prefix": true};
"1"

# Works as expected
find {prefix: == true}
return .prefix;
[
true
]

# Works as expected
find {prefix: == true}
return .foo;
[
null
]

# Panics with:
# thread 'test_repl' panicked at 'somehow couldn't parse key segment V1#.pre V1#.prefix', src/snapshot.rs:389
find {prefix: == true}
return .pre;
[
null
]

Split up word index keys to be one key per nested repeated field

Currently the word index groups all occurrences of the same word within repeated nested array objects into one key.

For example, this document:
{"_id": "1", "foo": [{"bar": "baz"}, {"bar": "baz"}]}

When shredded, it will be indexed like this (semantically; the actual structure in the value is capnp):
Key:
w.foo$.bar!baz#1
Value:
[{array path: [0], [{offset: 0, suffix: "", suffixoffset: 3}]}, {array path: [1], [{offset: 0, suffix: "", suffixoffset: 3}]}]

Instead split into 2 key/values and put the array path in the key, like this:

Key:
w.foo$.bar!baz#1@0
Value:
[{offset: 0, suffix: "", suffixoffset: 3}]

Key:
w.foo$.bar!baz#1@1
Value:
[{offset: 0, suffix: "", suffixoffset: 3}]

This will allow us to stream the index to disk for very large JSON documents without holding key/values in memory. Currently we must hold the shredded documents in memory in case we have duplicate keys. Also, when querying we won't have to load very large values (because the same words can be duplicated over and over in many nested objects, the value can be huge). Additionally, when we allow partial updates we will only have to update or remove the exact fields that have changed; we won't have to touch unaffected nested objects.
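The split described above can be sketched as follows; `word_key` and `WordInfo` are illustrative names, not the actual Noise code, and the key layout is inferred from the examples above:

```rust
// Hypothetical sketch of the proposed split: the array path moves into the
// key (after '@'), so each repeated nested field gets its own key/value pair.

#[derive(Debug)]
struct WordInfo {
    offset: usize,
    suffix: String,
    suffix_offset: usize,
}

fn word_key(keypath: &str, word: &str, seq: u64, array_path: &[usize]) -> String {
    let path: Vec<String> = array_path.iter().map(|i| i.to_string()).collect();
    // e.g. keypath ".foo$.bar", word "baz", seq 1, path [0] -> "w.foo$.bar!baz#1@0"
    format!("w{}!{}#{}@{}", keypath, word, seq, path.join(","))
}

fn main() {
    // instead of one combined key, emit one key/value per array slot
    let info = WordInfo { offset: 0, suffix: String::new(), suffix_offset: 3 };
    for array_path in [[0usize], [1usize]] {
        let key = word_key(".foo$.bar", "baz", 1, &array_path);
        println!("{} => {:?}", key, [&info]);
    }
    assert_eq!(word_key(".foo$.bar", "baz", 1, &[0]), "w.foo$.bar!baz#1@0");
}
```

Since keys sharing the prefix `w.foo$.bar!baz#1@` sort adjacently, a query can still iterate all array slots with one range scan while only decoding the small per-slot values it needs.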

Support stemmer-languages other than english

Following discussion from #30

The wrapper around the snowball stemmer is hard-coded to English. Quite some flexibility could be gained by offering the ability to specify a different language natively supported by the snowball project. Obviously English should be the default.

Off the top of my head, something along the lines of:

let index = noise.open("myindex", true, { "lang": "german" });

Most use cases should operate on a single language, so multi-language support
shouldn't be an issue.
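Inside the wrapper this could reduce to mapping the option onto a language enum, with English as the fallback. A stdlib-only sketch; `Algorithm` and `algorithm_from_option` are hypothetical names standing in for the snowball wrapper's language selection:

```rust
// Sketch: resolve an optional "lang" index option to a stemmer algorithm,
// defaulting to English. Illustrative names, not the actual Noise API.

#[derive(Debug, PartialEq)]
enum Algorithm {
    English,
    German,
    French,
}

fn algorithm_from_option(lang: Option<&str>) -> Result<Algorithm, String> {
    match lang {
        None | Some("english") => Ok(Algorithm::English),
        Some("german") => Ok(Algorithm::German),
        Some("french") => Ok(Algorithm::French),
        Some(other) => Err(format!("unsupported stemmer language: {}", other)),
    }
}

fn main() {
    // noise.open("myindex", true, {"lang": "german"}) would resolve to:
    assert_eq!(algorithm_from_option(Some("german")), Ok(Algorithm::German));
    // omitting the option keeps the current behavior:
    assert_eq!(algorithm_from_option(None), Ok(Algorithm::English));
    println!("ok");
}
```

Rejecting unknown languages at open time (rather than silently falling back) would keep indexes from being built with the wrong stemmer.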

Soundness issue in MvccRwLock

Hi there, we (Rust group @sslab-gatech) are scanning crates on crates.io for potential soundness bugs. We noticed that MvccRwLock implements Send and Sync for all types:

noise/src/index.rs

Lines 418 to 419 in 5a582a7

unsafe impl<T> Send for MvccRwLock<T> {}
unsafe impl<T> Sync for MvccRwLock<T> {}

However, this should probably have tighter bounds on its Send and Sync traits; otherwise it's possible to create data races from safe Rust code by using non-Sync types like Cell across threads, or by sending non-Send types like Rc across threads. Here's a little proof of concept using Rc.

#![forbid(unsafe_code)]

use noise_search::index::MvccRwLock;

use std::rc::Rc;

fn main() {
    let rc = Rc::new(42);

    let lock = MvccRwLock::new(rc.clone());
    std::thread::spawn(move || {
        let smuggled_rc = lock.read();

        println!("Thread: {:p}", *smuggled_rc);
        for _ in 0..100_000_000 {
            (*smuggled_rc).clone();
        }
    });

    println!("Main:   {:p}", rc);
    for _ in 0..100_000_000 {
        rc.clone();
    }
}

This outputs:

Main:   0x561539bdcd40
Thread: 0x561539bdcd40

Terminated with signal 4 (SIGILL)

It seems like this type also potentially allows for aliasing violations; in that case maybe it would be better to mark the methods as unsafe, and perhaps not expose the type outside the crate?
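A minimal sketch of the suggested tighter bounds, with a plain `RwLock` standing in for the real MVCC internals so that only the bounds differ; with `T: Send`/`T: Send + Sync` required, the `Rc` proof of concept above no longer compiles:

```rust
// Stand-in for MvccRwLock showing conditional (rather than blanket)
// Send/Sync impls. The inner RwLock is illustrative, not the real internals.

use std::sync::RwLock;

struct MvccRwLock<T> {
    inner: RwLock<T>, // stand-in for the real MVCC state
}

// Only thread-safe contents make the lock thread-safe:
unsafe impl<T: Send> Send for MvccRwLock<T> {}
unsafe impl<T: Send + Sync> Sync for MvccRwLock<T> {}

impl<T> MvccRwLock<T> {
    fn new(value: T) -> Self {
        MvccRwLock { inner: RwLock::new(value) }
    }
    fn read(&self) -> std::sync::RwLockReadGuard<'_, T> {
        self.inner.read().unwrap()
    }
}

fn main() {
    // Sharing a Send + Sync payload across threads still works:
    let lock = MvccRwLock::new(42u32);
    std::thread::scope(|s| {
        s.spawn(|| {
            assert_eq!(*lock.read(), 42);
        });
    });
    // MvccRwLock::new(Rc::new(42)) could no longer cross threads:
    // the Sync impl requires T: Send + Sync, which Rc<T> is not.
    println!("ok");
}
```

With these bounds the compiler rejects smuggling `Rc` (or `Cell` references) across threads, while legitimate multi-threaded use is unaffected.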

Allow using Bind Variables for or'ed conditions

I'm not sure if it's a good idea or not, or if there are perhaps better ways of solving this.

When I prepared a blog post, I came across data with this shape:

{
    "fy2012": {
        "amount": 0.5,
        "netOrGross": "Gross"
    },
    "fy2013": {
        "amount": 0.2,
        "netOrGross": "Gross"
    },
    "fy2014": {
        "amount": 0.3,
        "netOrGross": "Net"
    }
}

I'd now like to sum up the amount of all years where netOrGross is Gross.

I don't see a way to do this in a single query currently.

What if I could use a Bind Variable:

find {isGross::(fy2012: {netOrGross: == "Gross"} ||
                fy2013: {netOrGross: == "Gross"} ||
                fy2014: {netOrGross: == "Gross"})}
return sum(isGross.amount)
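Until something like this exists, one workaround is to fetch the per-year objects (e.g. via `return .`) and do the filtering and summing client-side. A sketch, with hard-coded tuples standing in for a fetched query result:

```rust
// Client-side workaround sketch: filter on netOrGross and sum the amounts
// outside the query engine.

fn sum_gross(years: &[(&str, f64, &str)]) -> f64 {
    years.iter()
        .filter(|(_, _, net_or_gross)| *net_or_gross == "Gross")
        .map(|(_, amount, _)| amount)
        .sum()
}

fn main() {
    // (year, amount, netOrGross), mirroring the document shape above
    let years = [
        ("fy2012", 0.5, "Gross"),
        ("fy2013", 0.2, "Gross"),
        ("fy2014", 0.3, "Net"),
    ];
    let total = sum_gross(&years);
    assert!((total - 0.7).abs() < 1e-9);
    println!("gross total: {}", total);
}
```

This loses the main benefit of doing the aggregation in the index, which is exactly what the bind-variable proposal would recover.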

Using field names as key in return value

Currently it's not possible to use a field name as key in the output. Example:

find {name: ~= "seinfeld"}
return {.name: score()}

Expected output:

[{"Seinfeld": 1}]

Currently you get:

{"error": "Error: Expected '}' at character 43, found .."}

Current version doesn't build

The current version of Noise doesn't build on my machine. It fails with:

--- stderr
thread 'main' panicked at '

Internal error occurred: Command "c++" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-m64" "-I" "rocksdb/include/" "-I" "rocksdb/" "-I" "rocksdb/third-party/gtest-1.7.0/fused-src/" "-I" "snappy/" "-I" "." "-std=c++11" "-DNDEBUG=1" "-DSNAPPY=1" "-DOS_LINUX=1" "-DROCKSDB_PLATFORM_POSIX=1" "-DROCKSDB_LIB_IO_POSIX=1" "-Wall" "-Wextra" "-o" "/home/vmx/src/rust/noise/noise/target/debug/build/librocksdb-sys-ccd420355f6e49a0/out/rocksdb/db/auto_roll_logger.o" "-c" "rocksdb/db/auto_roll_logger.cc" with args "c++" did not execute successfully (status code exit code: 1).

', /home/vmx/.cargo/registry/src/github.com-1ecc6299db9ec823/gcc-0.3.54/src/lib.rs:1670
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I suspect that my upgrade to GCC 7 is the problem. But I didn't investigate any further, as an upgrade of RocksDB fixes the problem.

Ordering with default value doesn't work

According to the query language reference, it's possible to set a default value for the ordering, used when the field being ordered on isn't defined.

Here's an example:

drop target/tests/querytestorder;
create target/tests/querytestorder;

add {"_id":"9", "bar": true};
"9"
add {"_id":"10", "bar": false};
"10"
add {"_id":"15", "foo":"coll", "bar": "string"};
"15"
add {"_id":"18"};
"18"

# This one is OK
find {}
order .bar asc
return .bar;
[
null,
false,
true,
"string"
]

# This one is also OK
find {}
order .bar default=1 asc
return .bar default=1;
[
false,
true,
1,
"string"
]

# This one fails. It returns `null` as first element.
# If it would take the `default=1` into account,
# it would have the order shown below.
find {}
order .bar default=1 asc
return .bar;
[
false,
true,
null,
"string"
]

Add documentation for KeyBuilder

Add some documentation to the KeyBuilder which explains what a keypath is, how it is structured and what possible values it can have. This will make the code way easier to read and follow.

Would you like to take part in the search benchmark game?

Hi,

We built a search benchmark for tantivy. It proved to be very helpful in getting a breakdown of our performance compared to that of Lucene. It helped me identify the need for skip info, for instance.
https://github.com/tantivy-search/search-benchmark-game

Would you guys want to add your engine to the bench?

(I tried to integrate noise with it. The noise API is hands down the easiest I have seen so far. I might want to copy it at one point :) . Kudos for that. However, I failed indexing as little as 10K docs...)

Document ordering of the keywords

find, return, order and limit have to appear in a specific order. Document that in the language reference, perhaps as a general overview section which shows the structure of a query. Railroad diagrams like the ones for JSON or SQLite come to mind.

Support matching on JSON keys only

I came across the following use case. I have a JSON structure where the keys of an object are identifiers and the values are some additional information. The object might contain any of the identifiers. I now want to query for all documents that have a specific identifier. That's currently not possible.

Here's an example to make it more concrete:

{
  "outage": {
    "030": "Berlin",
    "089": "Munich"
  }
}

I would now want to query for all outages that contain "089", but I don't care about the actual value.

Review all transmutes

Thanks to the Clippy warnings, lots of transmutes were replaced with safe code. Look at all outstanding transmutes and see if they can be replaced with safe code as well.

Stemmed words that aren't prefixes of the original word won't work with comparison operators

The search operators >= <= < > can miss words where stemming has changed some letters. We instead need to use only the unchanged portion of the stemmed word as the index key. For full-text lookups, we'll re-stem each potential match at query time (the original word is still in the value of the index), perhaps examining more keys than necessary but not missing anything.
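The idea can be sketched as follows. The `stem` function here is a toy stand-in for the snowball stemmer (it only knows the y -> i rewrite), and `index_key` is a hypothetical helper, not the actual Noise code:

```rust
// Sketch: index only the longest common prefix of the original word and its
// stem, so range operators stay consistent with how words actually sort.

fn stem(word: &str) -> String {
    // toy stemmer: "happy" -> "happi"
    if let Some(base) = word.strip_suffix('y') {
        format!("{}i", base)
    } else {
        word.to_string()
    }
}

fn index_key(word: &str) -> String {
    let stemmed = stem(word);
    // longest common prefix of the original word and its stemmed form
    word.chars()
        .zip(stemmed.chars())
        .take_while(|(a, b)| a == b)
        .map(|(a, _)| a)
        .collect()
}

fn main() {
    // "happi" is not a prefix of "happy", so indexing the stem alone would
    // make >= <= < > comparisons skip it; the unchanged portion would not.
    assert_eq!(stem("happy"), "happi");
    assert_eq!(index_key("happy"), "happ");
    println!("ok");
}
```

Keying on the unchanged portion over-matches by design; the query-time re-stemming pass then discards the false candidates.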

Changing order and group

I'd like to propose changing the language a bit to make it a bit more regular/consistent.

Order clause

Currently it is:

order .baz asc default=1

I'd like to change it to:

order .baz default=1 asc

Reason: In return values it's:

return {baz: .baz default=0, hammer: .hammer default=1}

So it's the path followed by its default (kind of like a tuple). Use this pattern for the order clause as well.

Grouping

Currently group() is:

group(.baz order=asc) default="a"

I'd like to change it to:

group(.baz default="a" asc)

Reason: It now has the same shape as the order clause.

What do others think about making this change?
