toshi-search / toshi Goto Github PK

View Code? Open in Web Editor NEW

4.1K 55.0 129.0 1.65 MB

A full-text search engine in rust

License: MIT License

Rust 99.32% Shell 0.68%

search search-engine indexing rust rust-lang elasticsearch

toshi's Introduction

Toshi

A Full-Text Search Engine in Rust

Please note that this is far from production ready, also Toshi is still under active development, I'm just slow.

Description

Toshi is meant to be a full-text search engine similar to Elasticsearch. Toshi strives to be to Elasticsearch what Tantivy is to Lucene.

Motivations

Toshi will always target stable Rust and will try our best to never make any use of unsafe Rust. While underlying libraries may make some use of unsafe, Toshi will make a concerted effort to vet these libraries in an effort to be completely free of unsafe Rust usage. The reason I chose this was because I felt that for this to actually become an attractive option for people to consider it would have to have be safe, stable and consistent. This was why stable Rust was chosen because of the guarantees and safety it provides. I did not want to go down the rabbit hole of using nightly features to then have issues with their stability later on. Since Toshi is not meant to be a library, I'm perfectly fine with having this requirement because people who would want to use this more than likely will take it off the shelf and not modify it. My motivation was to cater to that use case when building Toshi.

Build Requirements

At this current time Toshi should build and work fine on Windows, Mac OS X, and Linux. From dependency requirements you are going to need 1.39.0 and Cargo installed in order to build. You can get rust easily from rustup.

Configuration

There is a default configuration file in config/config.toml:

host = "127.0.0.1"
port = 8080
path = "data2/"
writer_memory = 200000000
log_level = "info"
json_parsing_threads = 4
bulk_buffer_size = 10000
auto_commit_duration = 10
experimental = false

[experimental_features]
master = true
nodes = [
    "127.0.0.1:8081"
]

[merge_policy]
kind = "log"
min_merge_size = 8
min_layer_size = 10_000
level_log_size = 0.75

Host

host = "localhost"

The hostname Toshi will bind to upon start.

Port

port = 8080

The port Toshi will bind to upon start.

Path

path = "data/"

The data path where Toshi will store its data and indices.

Writer Memory

writer_memory = 200000000

The amount of memory (in bytes) Toshi should allocate to commits for new documents.

Log Level

log_level = "info"

The detail level to use for Toshi's logging.

Json Parsing

json_parsing_threads = 4

When Toshi does a bulk ingest of documents it will spin up a number of threads to parse the document's json as it's received. This controls the number of threads spawned to handle this job.

Bulk Buffer

bulk_buffer_size = 10000

This will control the buffer size for parsing documents into an index. It will control the amount of memory a bulk ingest will take up by blocking when the message buffer is filled. If you want to go totally off the rails you can set this to 0 in order to make the buffer unbounded.

Auto Commit Duration

auto_commit_duration = 10

This controls how often an index will automatically commit documents if there are docs to be committed. Set this to 0 to disable this feature, but you will have to do commits yourself when you submit documents.

Merge Policy

[merge_policy]
kind = "log"

Tantivy will merge index segments according to the configuration outlined here. There are 2 options for this. "log" which is the default segment merge behavior. Log has 3 additional values to it as well. Any of these 3 values can be omitted to use Tantivy's default value. The default values are listed below.

min_merge_size = 8
min_layer_size = 10_000
level_log_size = 0.75

In addition there is the "nomerge" option, in which Tantivy will do no merging of segments.

Experimental Settings

experimental = false

[experimental_features]
master = true
nodes = [
    "127.0.0.1:8081"
]

In general these settings aren't ready for usage yet as they are very unstable or flat out broken. Right now the distribution of Toshi is behind this flag, so if experimental is set to false then all these settings are ignored.

Building and Running

Toshi can be built using cargo build --release. Once Toshi is built you can run ./target/release/toshi from the top level directory to start Toshi according to the configuration in config/config.toml

You should get a startup message like this.

  ______         __   _   ____                 __
 /_  __/__  ___ / /  (_) / __/__ ___ _________/ /
  / / / _ \(_-</ _ \/ / _\ \/ -_) _ `/ __/ __/ _ \
 /_/  \___/___/_//_/_/ /___/\__/\_,_/_/  \__/_//_/
 Such Relevance, Much Index, Many Search, Wow
 
 INFO  toshi::index > Indexes: []

You can verify Toshi is running with:

curl -X GET http://localhost:8080/

which should return:

{
  "name": "Toshi Search",
  "version": "0.1.1"
}

Once toshi is running it's best to check the requests.http file in the root of this project to see some more examples of usage.

Example Queries

Term Query

{ "query": {"term": {"test_text": "document" } }, "limit": 10 }

Fuzzy Term Query

{ "query": {"fuzzy": {"test_text": {"value": "document", "distance": 0, "transposition": false } } }, "limit": 10 }

Phrase Query

{ "query": {"phrase": {"test_text": {"terms": ["test","document"] } } }, "limit": 10 }

Range Query

{ "query": {"range": { "test_i64": { "gte": 2012, "lte": 2015 } } }, "limit": 10 }

Regex Query

{ "query": {"regex": { "test_text": "d[ou]{1}c[k]?ument" } }, "limit": 10 }

Boolean Query

{ "query": {"bool": {"must": [ { "term": { "test_text": "document" } } ], "must_not": [ {"range": {"test_i64": { "gt": 2017 } } } ] } }, "limit": 10 }

Usage

To try any of the above queries you can use the above example

curl -X POST http://localhost:8080/test_index -H 'Content-Type: application/json' -d '{ "query": {"term": {"test_text": "document" } }, "limit": 10 }'

Also, to note, limit is optional, 10 is the default value. It's only included here for completeness.

Running Tests

cargo test

What is a Toshi?

Toshi is a three year old Shiba Inu. He is a very good boy and is the official mascot of this project. Toshi personally reviews all code before it is committed to this repository and is dedicated to only accepting the highest quality contributions from his human. He will, though, accept treats for easier code reviews.

toshi's People

Contributors

Stargazers

Watchers

Forkers

danbruder gitter-badger luciofranco jacobmischka igxactly tempbottle roki1988 taodi hunglethanh9 laeeth rleungx toshisam jithinraj sadiqmmm cyuyanzuoye linecode hhy5277 lyrl 0xflotus tchen0123 langyabangnunu suhuaguo jankeromnes teach-me-how rahulsoibam awesome-archive shaunstanislauslau wolf4ood emuhedo leeleekaen fengweijp mauri870 tinycedar jlerche hoangpq weiboyiyou krizz thomasrockhu fhalim marks-yag gyb997 chenwdong zorrock bugsbunny1101 wthdt freshy969 automancursor threatintel-c ye-man tigerneil ghanima s-you gloscombe-amp baitcenter cvnb prog8 arso-project jonarod digitalsanity jason-cooke chandanpasunoori tehroot hagi013 rowhit xiaming9880 mikalv flipsimon tarsbase hoang17 mbaneshi shuo93 dalevross rustyforks ra2003 gopherj sigmaris onlyone0001 kod-kristoff jacobjohansen julianwgs yanakad liuwqiang holg hubitor-forks ajrpayne zero-github dirpyth chenyang8094 prabhatsharma fridgeseal c0ns0le xiaoyaoyouzizai emg110 search-tech zhaopufeng urtho doytsujin ajunlonglive galactus009 121601571

toshi's Issues

Implement Fuzzy Query

Related to #39

Call to / produces 404 not found rather than version number

Steps taken:

Installed capnproto
Built toshi cargo build --release
Ran toshi target/release/toshi

 INFO  toshi > Base data path data/ does not exist, creating it...
 INFO  toshi > Clustering disabled...
 INFO  toshi::index > Indexes: []

  ______         __   _   ____                 __
 /_  __/__  ___ / /  (_) / __/__ ___ _________/ /
  / / / _ \(_-</ _ \/ / _\ \/ -_) _ `/ __/ __/ _ \
 /_/  \___/___/_//_/_/ /___/\__/\_,_/_/  \__/_//_/
 Such Relevance, Much Index, Many Search, Wow
 
 INFO  gotham::start >  Gotham listening on http://[::1]:8080

Ran curl verbosly curl -X GET http://localhost:8080/ -v

Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> GET / HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.62.0
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< x-request-id: be615a31-a504-430a-a06a-d2017a1a3b81
< content-length: 0
< date: Fri, 23 Nov 2018 19:56:47 GMT
< 
* Connection #0 to host localhost left intact

I am running toshi with the default config file

Data directory is not created if it does not exist

This causes Toshi to fail to start. @hntd187 should we create it if it doesn't exist?

RangeResult is unused

pub struct RangeResult {
    key: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    to: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    from: Option<String>,
    num_docs: u64,
}

in RangeResult is throwing an un-used warning during cargo test.

Split out consul client

The consul client should be split into a consul::Builder and consul::Consul. Currently, the building of a consul client and the client itself are in the same impl block on the same struct. Ideally, there should be some builder phase that collects the items needed to create the client. This also would allow the spawning phase that is required to build a tower_buffer since it needs to spawn the actual service to drive it in the background. This is why we need to wrap Consul::default() in a future::lazy because it needs to get the DefaultExecutor::current() that is stored in a thead local variable.

RPC library for Toshi

RPC Library Options

tower-grpc: A gRPC library implemented on top of the tower-service trait and built from the same people who built tokio and linkerd2.
capnp-rpc: A RPC library written by the same people behind capnproto-rust. It has been around for a very long time.
tarpc: A very rusty RPC library and probably the most futuristic since it already supports async/await out of the box.
grpc-rs: A gRPC library that wraps around the grpc c bindings generated from the grpc-go library. This is probably the most feature complete rpc library.
grpc-rust: A rust implementation of gRPC, this was the original library in rust for gRPC.

Comparison

Generally speaking, capnp and gRPC based libraries accomplish the same thing in the sense that they provide a language-agnostic way of expressing the API and Service. Unlike those, tarpc uses Rust structs and macros to define the components of the RPC. Because of this, I do not see its advantage over something that is language agnostic.

Capnproto seems like a decent choice because it is all pointer based and doesn't actually do any parsing but because of this, the actual rust implementation has a lot of unsafe code that in my opinion removes the reason for why we use rust.

This leaves the three implementations of gRPC. I find gRPC to be a pretty good fit due to its http2 backing being very fast and low overhead. It has support for many languages and is a specification supported by Google. Currently, pingcap/tikv and linkerd2-proxy use a gRPC implementation.

grpc-rust was the first gRPC implementation using the rust protobuf wrapper around the protoc rust plugin. This means that any project using this must depend on the protoc binary being in the path. While this is not a horrible route it's not perfect. The biggest problem with grpc-rust is mostly that in the readme there is a TODO item that says Fix performance which is not very promising.

This leaves us with grpc-rs and tower-grpc. Lets start with grpc-rs, it is the library crated by pingcap for use within tikv. The basis for this library comes from the c bindings generated by compiling the grpc-go implementation. That being said this library is by far the most feature complete since it gets all of its features from a previous implementation. Though, since the library is pretty much ffi calls its api is not very ergonomic and is quite rough around the edges.

This brings us to tower-grpc, this is the library created by the people behind tokio. This library builds off of the tower-service trait. This library is by far the newest out of all five. That being said it actually has more repositories on GitHub using it than capnp-rpc. tower-grpc uses prost under the hood to build the rust code. This means, that there are zero external dependencies that are not in rust. This to me is quite powerful since we are already a pure rust project. The ergonomics of the API are also very nice since there are a lot of similarities with tokio, meaning people who are used to working with tokio should have an easy time with tower-grpc. The other powerful benefit to using tower and its accompanying crates is that there is an ecosystem around different middleware that can be used. Everything, from load balancing to timeouts and more. The drawback of this library is that it is currently not released on crates.io and still has a somewhat unstable API. That being said, I got in contact with the tower people and they said that the actual public interface for tower-grpc is not going to change and that a 0.1 of it will be released soon on crates.io.

All this said, I personally find thattower-grpc is the right choice, mostly as it seems that this method is the way the community is heading towards. It is by far the most flexible library and already is pretty stable (from my usage).

I would like to hear more thoughts on this, cc @fhaynes @hntd187

roadmap to production

Toshi looks awesome. We'd love to evaluate using Toshi in our product, so I'm wondering if there is a roadmap detailing major milestones toward changing the project tagline from "Note: This is far from production ready" to something like "Here are the benchmarks comparing Toshi, Elasticsearch, and Solr."

Are there any existing benchmarks? Or more importantly, a guide to how we might be able to contribute?

Thanks for guidance, and sorry if I missed some obvious documentation about roadmap somewhere?

Also, is there a collaboration channel like IRC where Toshi developers hangout?

Handle Delete operation

Issue received in tantivy-cli.

Use the new tokio-sync crate

Replace the crossbean channel used by the new tokio-sync crate

Implement node metadata updates

Nodes should update their metadata in consul on a regular basis

Remove star imports

We should no longer be using use some_crate::module::* and use explicit imports.

Dependabot can't resolve your Rust dependency files

Dependabot can't resolve your Rust dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

    Updating git repository `https://github.com/tower-rs/tower-grpc`
    Updating crates.io index
    Updating git repository `https://github.com/carllerche/tokio-connect`
    Updating git repository `https://github.com/tower-rs/tower`
    Updating git repository `https://github.com/tower-rs/tower-h2`
    Updating git repository `https://github.com/tower-rs/tower-http`
error: no matching package named `tower-direct-service` found
location searched: https://github.com/tower-rs/tower
required by package `toshi v0.1.1 (/home/dependabot/dependabot-updater/dependabot_tmp_dir

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

You can mention @dependabot in the comments below to contact the Dependabot team.

Failed registering node: connection refused

After successfull build of release target (Ubuntu), while running

$ RUST_BACKTRACE=1 ./target/release/toshi

I keep on getting:

 INFO  toshi > Settings { host: "127.0.0.1", port: 8080, path: "data/", place_addr: "0.0.0.0:8082", log_level: "info", writer_memory: 200000000, json_parsing_threads: 4, auto_commit_duration: 10, bulk_buffer_size: 10000, merge_policy: ConfigMergePolicy { kind: "log", min_merge_size: Some(8), min_layer_size: Some(10000), level_log_size: Some(0.75) }, consul_addr: "127.0.0.1:8500", cluster_name: "kitsune", enable_clustering: true, master: true, nodes: ["127.0.0.1:8081", "127.0.0.1:8082"] }

  ______         __   _   ____                 __
 /_  __/__  ___ / /  (_) / __/__ ___ _________/ /
  / / / _ \(_-</ _ \/ / _\ \/ -_) _ `/ __/ __/ _ \
 /_/  \___/___/_//_/_/ /___/\__/\_,_/_/  \__/_//_/
 Such Relevance, Much Index, Many Search, Wow
 
 ERROR toshi > Error: Failed registering Node: Inner(Inner(Error { kind: Connect, cause: Os { code: 111, kind: ConnectionRefused, message: "Connection refused" } }))
thread 'main' panicked at 'internal error: entered unreachable code: Shutdown signal channel should not error, This is a bug.', src/bin/toshi.rs:68:22
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:71
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:211
   2: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:227
             at src/libstd/panicking.rs:491
   3: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:398
   4: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:353
   5: toshi::main::{{closure}}
   6: <futures::task_impl::Spawn<T>>::enter::{{closure}}
   7: toshi::main
   8: std::rt::lang_start::{{closure}}
   9: main
  10: __libc_start_main
  11: _start

netstat shows me that 8080 isn't in use by another process, and running command with sudo doesn't change anything. Message clearly state that this is a bug. So... is there a solution or not?

Add a modified tantivy-cli

I wanted to add the tantivy-cli into the project binaries as a tool for admins to work with indexes on disk in an easier and more direct way than just through the rest interface. Since tantivy indexes are units by themselves this should be fine.

https://github.com/tantivy-search/tantivy-cli

The CLI doesn't have a license in it @fulmicoton safe to assume it's MIT like tantivy?

Query Implementation Tracking Issue

This issue tracks the implementation of the various query types. The ones Tantivvy natively supports are highest priority for implementation, followed by the ones it does not directly support.

Primary Query Types

Cluster(ing) ?

Hey,

Assuming you want to build a better/faster elasticsearch alternative. What do you think about building this on top of tikv?

This way you get replication, sharding, commitlog, transactions,raft,grpc,backups,multi-datacenter,native-local-functions etc for ~free.

At a minimum, you can replace the sstable of rocksdb to store a tantivy segment.
Doing crud you check on commitlog then on the segment, while doing search you just check on sstables. You can replace the memtable with another tantity segment to enable real-time querying without refresh. (refreshing would translate+persist the memtable memory-segment to sstable disk-segment.

Using grpc maybe even be better than http/json.

Makes sense ? Compared to do-it-yourself ?

ps: best/extreme scenario would be to build it on top of seastar-framework but it's probably too much work compared to above.

Implement Regex Query

Related to #39

Investigate aggregates

Request: configurable support for Kubernetes vs Consul

I totally get that refactoring to be agnostic to discovery mechanisms would be a significant time investment. On that front, I'd be happy to contribute the kubernetes part if you decide to go that route.

With that said, it's fairly straightforward to use the kubernetes API. An HTTP request is made to https://kubernetes.default.svc.cluster.local/api/v1/namespaces/<namespace>/endpoints?labelSelector=<name-defined-in-k8s-config>. The response is something like this, assuming serde for serialization

#[derive(Serialize, Deserialize, Debug)]
struct Addresses {
    ip: String,
    #[serde(rename = "nodeName")]
    node_name: String,
    #[serde(rename = "targetRef")]
    target_ref: TargetRef,
}

#[derive(Serialize, Deserialize, Debug)]
struct Items {
    metadata: Metadata1,
    subsets: Vec<Subsets>,
}

#[derive(Serialize, Deserialize, Debug)]
struct Labels {
    app: String,
}

#[derive(Serialize, Deserialize, Debug)]
struct Metadata {
    #[serde(rename = "selfLink")]
    self_link: String,
    #[serde(rename = "resourceVersion")]
    resource_version: String,
}

#[derive(Serialize, Deserialize, Debug)]
struct Metadata1 {
    name: String,
    namespace: String,
    #[serde(rename = "selfLink")]
    self_link: String,
    uid: String,
    #[serde(rename = "resourceVersion")]
    resource_version: String,
    #[serde(rename = "creationTimestamp")]
    creation_timestamp: String,
    labels: Labels,
}

#[derive(Serialize, Deserialize, Debug)]
struct Ports {
    name: String,
    port: i64,
    protocol: String,
}

#[derive(Serialize, Deserialize, Debug)]
struct K8sEndpoint {
    kind: String,
    #[serde(rename = "apiVersion")]
    api_version: String,
    metadata: Metadata,
    items: Vec<Items>,
}

#[derive(Serialize, Deserialize, Debug)]
struct Subsets {
    addresses: Vec<Addresses>,
    ports: Vec<Ports>,
}

#[derive(Serialize, Deserialize, Debug)]
struct TargetRef {
    kind: String,
    namespace: String,
    name: String,
    uid: String,
    #[serde(rename = "resourceVersion")]
    resource_version: String,
}

Retrieving the ip addresses is as simple as

let mut list_of_nodes = Vec::new();
for item in endpoints.items {
    for subset in item.subsets {
        for address in subset.addresses {
            list_of_nodes.push(address.ip);
        }
    }
}

Per #19 if leader election wanted to be done, kubernetes has a unique number tied to each API object called resourceVersion. Here, each Address has a TargetRef field which will have resource_version field. The leader can be chosen via min/max of the resource version associated with it. Kubernetes can also expose the pod name to the container via environment variable so any toshi node can know its kubernetes identifier.

Querl DSL Coverage

The current queries "work" but to get some better parity as @LucioFranco rightly pointed out it's probably good to break them out and start thinking about how to make something more extendable.

Some annoying examples of ES's Query language are the bool query (and this comes up in lots of places) where it can accept one or more T types of queries, and rather than it making an array with 1 element it flattens the array out.

I think we can create a lot of partity it's just this is one kind of query, they'll be a lot of boilerplate to cover it all. Unless there are some better ideas?

{  
   "query":{  
      "bool":{  
         "must":[  
            {  
               "term":{  
                  "user":"kimchy"
               }
            },
            {  
               "range":{  
                  "age":{  
                     "gte":-10,
                     "lte":99999999999999
                  }
               }
            }
         ],
         "filter":[  
            {  
               "term":{  
                  "user":"kimchy"
               }
            },
            {  
               "range":{  
                  "age":{  
                     "gte":10.5,
                     "lte":-20.3333333
                  }
               }
            }
         ],
         "must_not":[  
            {  
               "term":{  
                  "user":"kimchy"
               }
            },
            {  
               "range":{  
                  "age":{  
                     "gte":10,
                     "lte":20
                  }
               }
            }
         ],
         "should":[  
            {  
               "term":{  
                  "user":"kimchy"
               }
            },
            {  
               "range":{  
                  "age":{  
                     "gte":10,
                     "lte":20
                  }
               }
            }
         ],
         "minimum_should_match":1,
         "boost":1.0
      }
   }
}

Request {
    aggs: None,
    query: Some(
        Boolean {
            bool: Bool {
                must: [
                    Exact(
                        ExactTerm {
                            term: {
                                "user": "kimchy"
                            }
                        }
                    ),
                    Range {
                        range: {
                            "age": I64Range {
                                gte: Some(
                                    -10
                                ),
                                lte: Some(
                                    99999999999999
                                ),
                                lt: None,
                                gt: None
                            }
                        }
                    }
                ],
                filter: [
                    Exact(
                        ExactTerm {
                            term: {
                                "user": "kimchy"
                            }
                        }
                    ),
                    Range {
                        range: {
                            "age": F32Range {
                                gte: Some(
                                    10.5
                                ),
                                lte: Some(
                                    -20.333334
                                ),
                                lt: None,
                                gt: None
                            }
                        }
                    }
                ],
                must_not: [
                    Exact(
                        ExactTerm {
                            term: {
                                "user": "kimchy"
                            }
                        }
                    ),
                    Range {
                        range: {
                            "age": U64Range {
                                gte: Some(
                                    10
                                ),
                                lte: Some(
                                    20
                                ),
                                lt: None,
                                gt: None
                            }
                        }
                    }
                ],
                should: [
                    Exact(
                        ExactTerm {
                            term: {
                                "user": "kimchy"
                            }
                        }
                    ),
                    Range {
                        range: {
                            "age": U64Range {
                                gte: Some(
                                    10
                                ),
                                lte: Some(
                                    20
                                ),
                                lt: None,
                                gt: None
                            }
                        }
                    }
                ],
                minimum_should_match: 1,
                boost: 1.0
            }
        }
    )
}

#[test]
fn test_enum() {
    use std::collections::HashMap;

    macro_rules! type_range {
        ($($n:ident $t:ty),*) => {
            #[derive(Deserialize, Debug, PartialEq)]
            #[serde(untagged)]
            pub enum Ranges {
                $($n {
                    gte: Option<$t>,
                    lte: Option<$t>,
                    lt: Option<$t>,
                    gt: Option<$t>
                },)*
            }
        };
    }

    type_range!(U64Range u64, I64Range i64, U8Range u8, F32Range f32);

    #[derive(Deserialize, Debug, PartialEq)]
    struct Bool {
        #[serde(default = "Vec::new")]
        must: Vec<TermQueries>,
        #[serde(default = "Vec::new")]
        filter: Vec<TermQueries>,
        #[serde(default = "Vec::new")]
        must_not: Vec<TermQueries>,
        #[serde(default = "Vec::new")]
        should: Vec<TermQueries>,
        minimum_should_match: u64,
        boost: f64,
    }

    #[derive(Deserialize, Debug, PartialEq)]
    struct ExactTerm {
        term: HashMap<String, String>,
    }

    #[derive(Deserialize, Debug, PartialEq)]
    struct FuzzyTerm {
        value: String,
        #[serde(default)]
        distance: u8,
        #[serde(default)]
        transposition: bool,
    }

    #[derive(Deserialize, Debug, PartialEq)]
    #[serde(untagged)]
    enum TermQueries {
        Fuzzy { fuzzy: HashMap<String, FuzzyTerm> },
        Exact(ExactTerm),
        Range { range: HashMap<String, Ranges> },
    }

    #[derive(Deserialize, Debug, PartialEq)]
    #[serde(untagged)]
    enum Query {
        Boolean { bool: Bool },
    }

    #[derive(Deserialize, Debug)]
    pub struct Request {
        query: Option<Query>,
    }

    let j3 = r#"{"query":{"bool":{"must":[{"term":{"user":"kimchy"}}],"filter":[{"fuzzy":{"user":{"value":"kimchy"}}},{"range":{"age":{"gte":10.5,"lte":-20.3333333}}}],"must_not":[{"term":{"user":"kimchy"}},{"range":{"age":{"gte":10,"lte":20}}}],"should":[{"term":{"user":"kimchy"}},{"range":{"age":{"gte":10,"lte":20}}}],"minimum_should_match":1,"boost":1.0}}}"#;
    let result: Request = serde_json::from_str(j3).unwrap();
    println!("{:#?}", result);
}

Add option for consul server address

Toshi should accept a Consul address as a CLI flag. Related to #15

Index creation failed

Index creation (schema from README) fails, the cli output is:

INFO toshi::index > Indexes: []

/_ / ___ / / () / / ___ / /
/ / / _ (-</ _ / / \ / -) _ `/ __/ __/ _
// _////// //_/_,// _///_/
Such Relevance, Much Index, Many Search, Wow

INFO toshi > "GET / HTTP/1.1" 200 15.365µs
INFO toshi::router > Error { kind: "ErrorKind::Internal" }
INFO toshi > "PUT /test_index HTTP/1.1" 500 56.23µs

No files are created in the data directory.

Cluster Tracking Issue

This issue tracks implementation of clustering into Toshi.

Decide on clustering method (master/client, masterless). Relates to #15
Add CLI flag for Consul cluster address. Relates to #18
Add CLI flag for cluster name. Relates to #19
Decide on shard replication scheme. Relates to #16.
Enumerate failure modes
Write tests for failure modes (network splits, etc)
Write module to register with Consul. Relates to #24.

* High jacking Fletcher's issue for my own tracking * - Steve

I've laid out some of the groundwork for this with tower-grpc, so I should be able to get rolling with this today.

Things I'd like to do by end of the year

Unexpected character in root json response

After issuing the command curl -X GET http://localhost:8080 -output - the following response is returned by toshi:

){"name":"Toshi Search","version":"0.1.1"}

As you can see, this is invalid json, there's an aditional character at the start of the response and it vary between: -, +, ,, ), etc. I don't know why it's returning this, seems like garbage to me. Removing the Deflate middleware somehow solves the problem. If you try to change the name in ToshiInfo the initial byte changes too, pretty strange.

Decide on basic clustering design

This is related to #14

ElasticSearch Clustering (Simplified)

ES uses a protocol called zen to discover cluster members. Further, ES has multiple node types, one of which is master. Clusters and masters have the following characteristics:a

The master nodes elect a leader.
A minimum number of master nodes must be present for the cluster to form.
If the minimum number of master nodes is not present, the cluster can be set to respond only to read requests, or to no requests. Write requests will not be accepted.

Consul

As a quicker path to clustering, we can use an external tool such as Consul to handle node registration and leader election. This is similar to how many Apache projects use ZooKeeper. The flow would look something like this:

A node starts up, and is given the address of a Consul server
It registers itself with Consul as a member of the cluster and it role(s)
The leader consults Consul when it needs to make decisions about shard placement or rebalancing
If the leader node goes down, a new one is chosen. There's several ways to do this, detailed here: https://www.consul.io/docs/guides/leader-election.html

Leaderless

A different approach is to not have leaders or central coordination. We could use consistent hashing to decide on which node data is placed. This would require virtual rings to deal with replicas, as otherwise the algorithm would place the replicas on the same node as the primary.

This is the approach Cassandra and Riak use. http://docs.basho.com/riak/kv/2.2.3/learn/concepts/vnodes/ is a good introduction to the concept.

Register node metadata with Consul

To start with:

IP
Port
CPU Load
Data directories
- Available space per data directory
System iowait
RAM available

These should be reported every N minutes, and could also serve as a heartbeat

Consul organization and keys

So using consul's kv store we can do more in terms of orchestration than just leader election (if we even want to go that route)

but basically
/service/toshi/leader true

whichever node gets the session lock on this is the leader then nodes can join the cluster and announce their address perhaps via something like

/service/toshi/cluster/tn1 hostname:port
/service/toshi/cluster/tn2 hostname:port

So on so forth, I don't have any particular attachment to this method, but just something to get started with.

Integrate bors

This is more to open a dicussion on the possiblity to integrate bors.

https://bors.tech/

I use this on my other projects mainly, amethyst/amethyst and amethyst/laminar but it may be useful for this project as well.

Dependabot can't resolve your Rust dependency files

Dependabot can't resolve your Rust dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

    Updating git repository `https://github.com/tower-rs/tower`
    Updating crates.io index
    Updating git repository `https://github.com/carllerche/better-future`
    Updating git repository `https://github.com/toshi-search/systemstat`
    Updating git repository `https://github.com/carllerche/tokio-connect`
    Updating git repository `https://github.com/LucioFranco/tower-consul`
    Updating git repository `https://github.com/tower-rs/tower-grpc`
    Updating git repository `https://github.com/tower-rs/tower-h2`
    Updating git repository `https://github.com/tower-rs/tower-http`
error: no matching package named `tower-direct-service` found
location searched: https://github.com/tower-rs/tower
required by package `toshi v0.1.1 (/home/dependabot/dependabot-updater/dependabot_tmp_dir

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

You can mention @dependabot in the comments below to contact the Dependabot team.

Add a CLI flag for cluster name

This is the name all nodes will register themselves under Consul in. Related to #14

Additions to .gitignore

A cargo build seems to leave these directories lying around:

logs/
new_index/

Should we clean them up? Or add them to .gitignore?

Switch from Gotham to Actix

Since actix-web has removed all unsound use of unsafe in its codebase, when do you plan to switch to acitx-web as mentioned on reddit?

Re-factor node_id generation

@LucioFranco is there a more rustique way to write this?

    let node_id: String;
    if let Ok(nid) = cluster::read_node_id(&settings.path) {
        info!("Node ID is: {}", nid);
        node_id = nid;
    } else {
        // If no file exists containing the node ID, generate a new one and write it
        let random_id = uuid::Uuid::new_v4().to_hyphenated().to_string();
        info!("No Node ID found. Creating new one: {}", random_id);
        node_id = random_id.clone();
        if let Err(err) = cluster::write_node_id(random_id, &settings.path) {
            error!("{:?}", err);
            std::process::exit(1);
        }
    }

Dependabot can't resolve your Rust dependency files

Dependabot can't resolve your Rust dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

    Updating git repository `https://github.com/tower-rs/tower`
    Updating crates.io index
error: no matching package named `tower-direct-service` found
location searched: https://github.com/tower-rs/tower
required by package `toshi v0.1.1 (/home/dependabot/dependabot-updater/dependabot_tmp_dir

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

You can mention @dependabot in the comments below to contact the Dependabot team.

Publish to crates.io

It would be great to be able to

cargo install toshi

Re-factor public node_id

Right now, the node_id in ConsulInterface is public so that it can be assigned in main.rs. This whole section could be refactored to load or generate both consul and node id, then create the ConsulInterface once we have all the needed info.

empty body causes ErrorKind::Internal

What happened

Accidentally omitting document content returns 500 Internal Server Error with a body of {"message":"Internal error","uri":"/new_index"}

What was expected

Emitting any kind of helpful message would be helpful. Also, in my experience, when the client receives a 500 response, there is usually something informative on the server-side. But in this case, the server emits the same message that the client receives, which isn't helpful.

This bug is actually just the worst offender of a whole class of bugs where if something doesn't go Toshi's way, it just gives back a raspberry, but I'd say getting a 500 for an empty document is pretty far up the list for me

How to reproduce

Assuming you create an index based on the cargo test schema, then send in an indexing request of the form

$ echo '{}' | curl ... -X PUT -d @- 127.0.0.1:9200/new_index

Implement Phrase Query

Related to #39.

Implement Range Query

Related to #39

Additional config items

This is just a tracking issue for additional config items that need to be added:

Consul client buffer size

Decide on Shard Replication Scheme

This relates to issue #14.

ElasticSearch Shards

This describes the basics of ES sharding.

Primary Shards

In ES, an index is created with a certain number of Primary shards (5 is common). These Primary shards are writable, and when data is written to an index, a Primary shard is chosen to receive it.

The number of primary shards for an index cannot be changed after creation.

Replica Shards

Each Primary shard has 0 or more Replica shards. After a Primary shard writes data, it is replicated to the Replica shards belonging to that Primary shard. Replica shards can be used to handle read requests, but not write. Ideally, all Replica shards are located on a different node/server.

Rebalancing

This means redistributing data amongst the cluster members. It is usually done in two situations.

Node Addition or Removal

If more nodes are added to a cluster, shards should be rebalanced to evenly distribute the load. This can be done in many ways. With consistent hashing, the algorithm itself will tell you what needs to be moved where.

With a leader architecture, a process will need to choose which shards to move where. This can be based on CPU, user-defined tags, memory usage or any other metadata we have about the state of the system.

Hot Shard Problem

It is common for a situation to arise where one node is overloaded because it holds shards for popular indices. This may be a temporary situation, or a longer-term issue. In this scenario, it is desireable to redistribute the hot shards to machines with less load.

Toshi

I suggest using Consul to track shard assignments along with leader election.

Dependabot can't resolve your Rust dependency files

Dependabot can't resolve your Rust dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

    Updating git repository `https://github.com/tower-rs/tower`
    Updating crates.io index
    Updating git repository `https://github.com/carllerche/better-future`
    Updating git repository `https://github.com/toshi-search/systemstat`
    Updating git repository `https://github.com/carllerche/tokio-connect`
    Updating git repository `https://github.com/LucioFranco/tower-consul`
    Updating git repository `https://github.com/tower-rs/tower-grpc`
    Updating git repository `https://github.com/tower-rs/tower-h2`
    Updating git repository `https://github.com/tower-rs/tower-http`
error: no matching package named `tower-direct-service` found
location searched: https://github.com/tower-rs/tower
required by package `toshi v0.1.1 (/home/dependabot/dependabot-updater/dependabot_tmp_dir

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

You can mention @dependabot in the comments below to contact the Dependabot team.

Create default data directory when none are supplied

router does not permit charset declaration on content-type

What happened

A mime type containing the ;charset= qualifier caused Toshi to respond with 400 Bad Request to an index creation request

What was expected

The index would have been created. Many, many http libraries will send along the charset if they know it, because it's polite

How to reproduce

Assuming you have run cargo test, there will be new_index containing the tantivy index, and its schema, in the current directory; thus:

$ jq .schema new_index/meta.json | \
    curl --compress -vH 'content-type: application/json;charset=utf-8' -X PUT -d @-  127.0.0.1:9200/foo/_create

produces:

> PUT /foo/_create HTTP/1.1
> Host: 127.0.0.1:9200
> User-Agent: curl/7.54.0
> Accept: */*
> Accept-Encoding: deflate, gzip
> content-type: application/json;charset=utf-8
>
< HTTP/1.1 400 Bad Request
< content-type: application/json
< content-encoding: deflate
< transfer-encoding: chunked
< date: Sun, 20 Jan 2019 20:58:04 GMT
<
{"message":"Bad request","uri":"/foo/_create"}

but the unqualified content-type header creates the index as expected

$ jq .schema new_index/meta.json | \
    curl --compress -vH 'content-type: application/json' -X PUT -d @-  127.0.0.1:9200/foo/_create

toshi-search / toshi Goto Github PK

toshi's Introduction

Toshi

A Full-Text Search Engine in Rust

Description

Motivations

Build Requirements

Configuration

Host

Port

Path

Writer Memory

Log Level

Json Parsing

Bulk Buffer

Auto Commit Duration

Merge Policy

Experimental Settings

Building and Running

Example Queries

Term Query

Fuzzy Term Query

Phrase Query

Range Query

Regex Query

Boolean Query

Usage

Running Tests

What is a Toshi?

toshi's People

Contributors

Stargazers

Watchers

Forkers

toshi's Issues

RPC Library Options

Comparison

Primary Query Types

* High jacking Fletcher's issue for my own tracking * - Steve

ElasticSearch Clustering (Simplified)

Consul

Leaderless

What happened

What was expected

How to reproduce

ElasticSearch Shards

Primary Shards

Replica Shards

Rebalancing

Node Addition or Removal

Hot Shard Problem

Toshi

What happened

What was expected

How to reproduce

Recommend Projects

Recommend Topics

Recommend Org