reductstore's Introduction

ReductStore

A high-performance time series database for blob data


ReductStore is a time series database that is specifically designed for storing and managing large amounts of blob data. It boasts high performance for both writing and real-time querying, with the added benefit of batching data. This makes it an ideal solution for edge computing, computer vision, and IoT applications where network latency is a concern. For more information, please visit https://www.reduct.store/.

Why Does It Exist?

There are numerous time-series databases available on the market that provide remarkable functionality and scalability. However, they all concentrate on numeric data and have limited support for unstructured data, which at best can be stored as strings.

On the other hand, S3-like object storage solutions could be the best place to keep blob objects, but they don't provide an API to work with data in the time domain.

There are many kinds of applications where we need to collect unstructured data such as images, high-frequency sensor data, binary packages, or huge text documents, and provide access to its history. Many companies build an in-house storage solution for these applications based on a combination of a TSDB and blob storage. This can work; however, keeping data integrity across both databases, implementing retention policies, and providing performant data access is a challenging development task.

The ReductStore project aims to solve the problem of providing a complete solution for applications that require unstructured data to be stored and accessed at specific time intervals. It guarantees that your data will not overflow your hard disk and batches records to reduce the number of critical HTTP requests for networks with high latency.

All of these features make the database the right choice for edge computing and IoT applications if you want to avoid development costs for your in-house solution.

Features

  • Storing and accessing unstructured data as time series
  • No limit on the maximum blob size
  • Real-time FIFO bucket quota based on size to avoid disk space shortage
  • HTTP(S) API
  • Append-only replication
  • Optimized for small objects (less than 1 MB)
  • Labeling data for annotation and filtering
  • Iterative data querying (see the Python sketch after this list)
  • Batching records into a single HTTP request or response for write and read operations
  • Embedded Web Console
  • Token authorization for managing data access
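
To give a feel for labeling, querying, and batching, here is a minimal sketch using the Python client SDK. It assumes a local instance on http://localhost:8383 and placeholder bucket/entry names, and that the installed reduct package supports the labels and query APIs shown:

import asyncio
from reduct import Client, Bucket

async def main():
    async with Client("http://localhost:8383") as client:
        # "my-bucket" is a placeholder bucket name
        bucket: Bucket = await client.create_bucket("my-bucket", exist_ok=True)

        # Attach labels when writing, e.g. for later annotation and filtering
        await bucket.write("entry-1", b"<blob bytes>", labels={"camera": "cam-1"})

        # Iterate over the records of the entry; the SDK fetches them
        # from the server in batches under the hood
        async for record in bucket.query("entry-1"):
            print(record.timestamp, record.labels)
            data = await record.read_all()

asyncio.run(main())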

Get Started

The quickest way to get up and running is with our Docker image:

docker run -p 8383:8383 -v ${PWD}/data:/data reduct/store:latest

Alternatively, you can opt for Cargo:

apt install protobuf-compiler
cargo install reductstore
RS_DATA_PATH=./data reductstore

For a more in-depth guide, visit the Getting Started and Download sections.

After initializing the instance, dive in with one of our Client SDKs to write or retrieve data. To illustrate, here's a Python sample:

import time
import asyncio
from reduct import Client, Bucket

async def main():
    # Create a client for interacting with a ReductStore service
    async with Client("http://localhost:8383") as client:
        # Create a bucket and store a reference to it in the `bucket` variable
        bucket: Bucket = await client.create_bucket("my-bucket", exist_ok=True)

        # Write data to the bucket
        ts = time.time_ns() // 1000
        await bucket.write("entry-1", b"Hey!!", ts)

        # Read data from the bucket
        async with bucket.read("entry-1", ts) as record:
            data = await record.read_all()
            print(data)

# Run the main function
asyncio.run(main())

Client SDKs

ReductStore is built with adaptability in mind. While it comes with a straightforward HTTP API that can be integrated into virtually any environment, we understand that not everyone wants to interact with the API directly. To streamline your development process and make integrations smoother, we've developed a series of client SDKs tailored for different programming languages and environments. These SDKs wrap around the core API, offering a more intuitive and language-native way to interact with ReductStore, thus accelerating your development cycle. The following client SDKs are available:

  • Python SDK
  • JavaScript SDK
  • C++ SDK
  • Rust SDK

Tools

ReductStore is not just about data storage; it's about simplifying and enhancing your data management experience. Along with its robust core features, ReductStore offers a suite of tools to streamline administration, monitoring, and optimization. Here are the key tools you can leverage:

  • CLI Client - a command-line interface for direct interactions with ReductStore
  • Web Console - a web interface to administrate a ReductStore instance

Feedback & Contribution

Your input is invaluable to us! 🌟 If you've found a bug, have suggestions for improvements, or want to contribute directly to the codebase, here's how you can help:

  • Discord: Join our Discord community to discuss, share ideas, and collaborate with fellow ReductStore users.
  • Feedback & Bug Reports: Open an issue on our GitHub repository. Please provide as much detail as possible so we can address it effectively.
  • Contribute: ReductStore is an open-source project. We encourage and welcome contributions.

Get Involved

We believe in the power of community and collaboration. If you've built something amazing with ReductStore, we'd love to hear about it! Share your projects, experiences, and insights on our Discord community.

If you find ReductStore beneficial, give us a ⭐ on our GitHub repository.

Your support fuels our passion and drives us to keep improving.

Together, let's redefine the future of blob data storage! 🚀

Frequently Asked Questions (FAQ)

Q1: What sets ReductStore apart from other time-series databases?

A1: ReductStore is specially designed for storing and managing large amounts of blob data, optimized for both high performance and real-time querying. Unlike other databases that focus primarily on numeric data, ReductStore excels in handling unstructured data, making it ideal for various applications like edge computing and IoT.

Q2: How do I get started with ReductStore?

A2: You can easily set up ReductStore using our Docker image or with Cargo. Detailed instructions are provided in the Getting Started section.

Q3: Is there any size limitation for the blob data?

A3: While ReductStore is optimized for small objects (less than 1 MB), there's no hard limit for the maximum size of a blob.

Q4: Can I integrate ReductStore with my current infrastructure?

A4: Absolutely! With our variety of client SDKs and its adaptable HTTP API, ReductStore can be integrated into almost any environment.

Q5: I'm facing issues with the installation. Where can I get help?

A5: We recommend checking out our documentation. If you still face issues, feel free to join our Discord community or raise an issue on our GitHub repository.

reductstore's People

Contributors

anthonycvn, aschenbecherwespe, atimin, dependabot[bot], mambaz, renghen, rtadepalli


reductstore's Issues

Refactor block structure in entry

The current implementation has one big descriptor that holds information about all the stored blocks and records. It is fast but has downsides:

  1. If it gets corrupted, you lose all the data
  2. It limits the number of records, because its indexes are kept in RAM

The better approach is to store everything in a block and make each block autonomous.

List buckets in storage

The Server API should provide the list of buckets with the following information:

  • name of the bucket
  • size of the bucket
  • number of entries in the bucket
  • timestamp of the newest record in the bucket
  • timestamp of the oldest record in the bucket

PUT /b/:bucket doesn't work correctly

If I send partial settings:

{
  "quota_size": 100
}

and then read them back with the GET /b/:bucket method, I get not the full settings but only the last written ones:

{
  "quota_size": 100
}

P.S. The documentation is also wrong. It says all the settings must be present in the request, but they are optional.

Timestamp in microseconds doesn't work

If you try to use a timestamp in microseconds you get an error:

 curl -d "some_data" -X POST http://127.0.0.1:8383/b/my_data/entry_1?ts=1610387457862000

2022-01-19 18:42:46.569 (47040) [ERROR] -- common.h:71 POST /b/my_data/entry_1: [422] Failed to parse 'ts' parameter: 1610387457862000 should unix times in microseconds 

Server URL in logs doesn't contain base path

storage-1_1  | 2022-05-21 05:25:36.462 (51136)  [INFO] -- main.cc:41 Reduct Storage 0.6.0 
storage-1_1  | 2022-05-21 05:25:36.463 (51136)  [INFO] -- main.cc:55 Configuration: 
storage-1_1  |          RS_LOG_LEVEL = INFO (default)
storage-1_1  |  RS_HOST = 0.0.0.0 (default)
storage-1_1  |  RS_PORT = 8383 (default)
storage-1_1  |  RS_API_BASE_PATH = storage-1/ 
storage-1_1  |  RS_DATA_PATH = /data (default)
storage-1_1  |  
storage-1_1  | 2022-05-21 05:25:36.483 (51136)  [INFO] -- storage.cc:38 Load 0 buckets 
storage-1_1  | 2022-05-21 05:25:36.483 (51136) [WARNING] -- token_auth.cc:125 API token is empty. No authentication. 
storage-1_1  | 2022-05-21 05:25:36.483 (51136)  [INFO] -- api_server.cc:135 Run HTTP server on http://0.0.0.0:8383 

RS_API_BASE_PATH is storage-1/, but the URL is http://0.0.0.0:8383. It should be http://0.0.0.0:8383/storage-1.

HTTP API to store and get a blob

As a user, I can store blobs by name with a timestamp and get them back from the storage (a usage sketch follows the list):

  • POST /<bucket_name>/<entry_name>?timestamp=
  • GET /<bucket_name>/<entry_name>?timestamp=
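
A minimal sketch of how a client could use this proposed API from Python. It assumes the requests package, a local instance, and placeholder bucket/entry names, with the timestamp query parameter exactly as proposed above:

import time
import requests

ts = time.time_ns() // 1000  # current time in microseconds
url = f"http://localhost:8383/my-bucket/entry-1?timestamp={ts}"  # placeholder names

# Store a blob under the given timestamp
requests.post(url, data=b"some blob bytes")

# Get it back by the same timestamp
response = requests.get(url)
print(response.content)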

Docker container crashes without volume

The container crashes with an error if you don't give it a volume:

docker run -p 8383:8383 --rm ghcr.io/reduct-storage/reduct-storage:latest

2022-01-31 21:32:05.783 (42944)  [INFO] -- main.cc:32 Reduct Storage 0.1.0 
2022-01-31 21:32:05.788 (42944)  [INFO] -- main.cc:43 Configuration: 
 	RS_LOG_LEVEL = INFO (default)
	RS_HOST = 0.0.0.0 (default)
	RS_PORT = 8383 (default)
	RS_API_BASE_PATH = / (default)
	RS_DATA_PATH = /data (default)
 
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: directory iterator cannot open directory: No such file or directory [/data]

Browse a bucket

We should extend the GET /b/:bucket_name method with:

  • List of entries
  • Statistical information:
    • Size in bytes
    • Oldest record
    • Latest record

Usage in GET /info always 0

This test in the JavaScript SDK shows that, after writing data, the storage reports 0 bytes of usage.

 it("should get information about the server", async () => {
        await client.createBucket("bucket_1");
        const bucket = await client.createBucket("bucket_2");
        await bucket.write("entry", "somedata", new Date(1000));
        await bucket.write("entry", "somedata", new Date(2000));

        const info: ServerInfo = await client.getInfo();
        expect(info.version).toMatch(/0\.[0-9]+\.[0-9]+/);

        expect(info.bucketCount).toEqual(2n);
        expect(info.usage).toEqual(0n); //TODO: a bug in storage
        expect(info.uptime).toBeGreaterThanOrEqual(0);
        expect(info.oldestRecord).toEqual(1000_000n);
        expect(info.latestRecord).toEqual(2000_000n);
    });

This is wrong, and it comes from the server side. It looks like the bug happens after an old bucket is removed.

System degrades if an empty entry is created

If a user tries to get data from a non-existing entry, the storage creates it. After that, the quota stops working, the storage produces many log messages, and it eventually stops working altogether.

 [ERROR] -- storage.cc:199 Didn't mange to keep quota: [500] Tries to remove a block in empty entry 

The solution:

  1. Fix the quota so it does not try to clean blocks from already empty entries.
  2. Do not create an entry for GET requests.

Wrong size calculation

In the file system, the data takes almost 356 GB:

du -h /mnt/data/reduct/data/
21G	/mnt/data/reduct/data/acc-3
195G	/mnt/data/reduct/data/camera
21G	/mnt/data/reduct/data/acc-4
21G	/mnt/data/reduct/data/acc-0
21G	/mnt/data/reduct/data/acc-5
21G	/mnt/data/reduct/data/acc-1
21G	/mnt/data/reduct/data/acc-6
21G	/mnt/data/reduct/data/acc-7
21G	/mnt/data/reduct/data/acc-2
356G	/mnt/data/reduct/data/

But the storage reports a much smaller number:

{"info":{"name":"data","size":"136706208280","entry_count":"9","oldest_record":"1653655367073000","latest_record":"1653700867366000"},"settings":{"max_block_size":"67108864","quota_type":"FIFO","quota_size":"322122547200"},"entries":[{"name":"acc-0","size":"7682225325","record_count":"45485","block_count":"115","oldest_record":"1653655377075000","latest_record":"1653700867366000"},{"name":"acc-1","size":"7696745730","record_count":"45485","block_count":"115","oldest_record":"1653655377075000","latest_record":"1653700867366000"},{"name":"acc-2","size":"7704795276","record_count":"45485","block_count":"115","oldest_record":"1653655377075000","latest_record":"1653700867366000"},{"name":"acc-3","size":"7686308722","record_count":"45486","block_count":"115","oldest_record":"1653655376075000","latest_record":"1653700867366000"},{"name":"acc-4","size":"7706995989","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"acc-5","size":"7704742470","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"acc-6","size":"7698664263","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"acc-7","size":"7711576776","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"camera","size":"75114153729","record_count":"45486","block_count":"1107","oldest_record":"1653655376075000","latest_record":"1653700867366000"}]}

Authentication with JWT token

We should introduce an authentication and authorization model for the storage. The simplest way to start is a JWT token and an embedded admin user with full rights.

  • The storage runs with the RS_API_TOKEN environment variable
  • The client uses RS_API_TOKEN as a refresh token to obtain a temporary access token
  • The token must be sent in an HTTP header (a sketch follows this list)
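
To illustrate the last point, here is a minimal sketch of a client passing the access token. The Bearer scheme and the Authorization header are assumptions, as the issue does not specify the header format:

import requests

ACCESS_TOKEN = "my-access-token"  # hypothetical token obtained with RS_API_TOKEN

# Send the token with every request in the Authorization header
response = requests.get(
    "http://localhost:8383/info",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(response.json())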

HTTP API to list records for a time interval

The storage should provide a list of stored objects for a time interval. Example API:

GET http://hostname/bucket/entry?start=X&stop=Y

Output:

[
  { "ts": 11002021, "size": 32102 },
  ...
]
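
A minimal Python sketch of consuming this proposed endpoint; the host, bucket, entry, and interval values are placeholders:

import requests

# List records stored in the given time interval (placeholders throughout)
response = requests.get(
    "http://localhost:8383/bucket/entry",
    params={"start": 0, "stop": 1650000000000000},
)
for record in response.json():
    print(record["ts"], record["size"])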

Bucket API

A bucket is a common space for all records and holds the storage settings. Currently it has only a name. (A usage sketch follows the list.)

  • Request information about bucket: GET /<bucket_name>
  • Create a new bucket: POST /<bucket_name>
  • Remove a bucket and all its data: DELETE /<bucket_name>
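
A minimal sketch of this bucket lifecycle from Python; it assumes the requests package, a local instance, and a placeholder bucket name:

import requests

base = "http://localhost:8383/my-bucket"  # placeholder bucket name

requests.post(base)            # create a new bucket
response = requests.get(base)  # request information about the bucket
print(response.json())
requests.delete(base)          # remove the bucket and all its data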

Method GET /info should provide defaults for a new bucket

Because a bucket can be created with default parameters, a user should be able to find out what those defaults are:

{
    "version":"0.5.0",
    "bucket_count":"1",
    "usage":"32167820595",
    "uptime":"16029",
    "oldest_record":"1652033077885000",
    "latest_record":"1652041795230000",
    "defaults": {
       "bucket": {
          "block_size": "xxxxx",
          "quota_type": "NONE",
          "quota_size": "0",
       }
    }
}

GET /b/:bucket/:entry/list returns error

Request to the test server:

GET "http://test.reduct-storage.dev:8383/b/data/entry/list?start=0&stop=1647765845000000"

Returns:

{"detail":"Failed to load a block descriptor: /data/data/entry/67904000000.meta"}

But there is no such file, and it doesn't look like a timestamp.

GET /b/:bucket/:entry request sometimes fails

Under intensive reading, the storage aborts requests and crashes:

reduct-storage_1  | 2022-03-27 21:50:07.550 ( 5056) [ERROR] -- api_server.cc:361 Failed to send data 
reduct-storage_1  | 2022-03-27 21:50:07.695 ( 5056) [ERROR] -- api_server.cc:361 Failed to send data 
reduct-storage_1  | 2022-03-27 21:50:12.900 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 
reduct-storage_1  | 2022-03-27 21:50:12.931 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 
reduct-storage_1  | 2022-03-27 21:50:16.901 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 
reduct-storage_1  | 2022-03-27 21:50:16.912 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 

Looks like I have to check whether the HTTP engine is ready for writing.

Long loading

After the block refactoring, we have to calculate the size and record count by parsing all the meta files of each block. It takes too long: about 1 second per 50 GB. The alternative solution:

  1. Don't calculate the record count; it is not part of the HTTP API
  2. To calculate the stored size, use the size of the files on disk

Provisioning bucket from environment variables

For edge devices, the most common use case is to have only one bucket with fixed settings. It is more convenient to configure it with environment variables:

  • RS_BUCKET_NAME
  • RS_BUCKET_MAX_BLOCK_SIZE (humanized size, e.g. 10M, 2G)
  • RS_BUCKET_QUOTA_TYPE (values: NONE (default), FIFO)
  • RS_BUCKET_QUOTA_SIZE (humanized size, e.g. 10M, 2G)

A provisioned bucket cannot be deleted or changed.

Typo in log

There is a log message for a bad API token:

 No bearer token in response header 

It should say request header.

The engine removes the wrong block in a bucket with quota

I have 2 entries in a bucket:

{
  "info":{
    "name":"test-bucket",
    "size":"32144630132",
    "entry_count":"2",
    "oldest_record":"1651412479999000",
    "latest_record":"1651514933403000"
  },
  "settings":{
    "max_block_size":"67108864",
    "quota_type":"FIFO",
    "quota_size":"32212254720"
  },
  "entries":[
    {
      "name":"blobs",
      "size":"32144630100",
      "record_count":"61213",
      "block_count":"477",
      "oldest_record":"1651412479999000",
      "latest_record":"1651514933403000"
    },
    {
      "name":"md-sums",
      "size":"32",
      "record_count":"1",
      "block_count":"1",
      "oldest_record":"1651514933403000",
      "latest_record":"1651514933403000"
    }
  ]
}

The engine removes blocks from md-sums even though the entry blobs holds the older records.

Graceful stop

We should handle OS signals to finish all tasks and stop the storage safely.

Extend GET /b/:bucket/ method with stats of each entry

The current GET /b/:bucket method returns a list of entry names:

{
  "info": {
    "name": "my_data",
    "size": "27",
    "entry_count": "3",
    "oldest_record": "0",
    "latest_record": "0"
  },
  "settings": {
    "max_block_size": "67108864",
    "quota_type": "FIFO",
    "quota_size": "10000"
  },
  "entries": [
    "entry_1",
    "entry_2",
    "entry_3"
  ]
}

but it'd be more useful to provide the same info as for the bucket 👍

{
  "name": "entry_1",
  "size": "27",
  "record_count": "3",
  "oldest_record": "0",
  "latest_record": "0"
}

Bad timestamp in GET /info

It looks like the oldest_record field is in seconds:

{
"version":"0.5.0",
"bucket_count":"1",
"usage": "42928697586",
"uptime": "442",
"oldest_record": "1648890020", // !!!!
"latest_record":"1648890020590279"
}

Bucket Quota

We have to limit the size of a bucket so that it does not run out of disk space.

We should add a FIFO quota in bytes to the bucket settings. (A sketch of the resulting API follows.)
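
As a sketch of what such a quota looks like from today's Python SDK; BucketSettings and QuotaType are assumed to come from the reduct package, and the quota size is a placeholder:

import asyncio
from reduct import Client, BucketSettings, QuotaType

async def main():
    async with Client("http://localhost:8383") as client:
        # FIFO quota: once the bucket exceeds quota_size bytes,
        # the oldest blocks are removed first
        await client.create_bucket(
            "my-bucket",  # placeholder bucket name
            BucketSettings(quota_type=QuotaType.FIFO, quota_size=1_000_000_000),
            exist_ok=True,
        )

asyncio.run(main())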

Publish a Docker image

We should have a ready-to-use Docker image. We can publish it to the public GitHub registry from the main branch.

[500] Failed to find the needed block in descriptor

Periodically, the EntryList request fails with a 500 error:

reduct-storage_1  | 2022-01-21 22:06:22.201 (22464) [ERROR] -- entry.cc:260 No block in entry 'entry' for ts=2022-01-21T22:03:43.996662Z 
reduct-storage_1  | 2022-01-21 22:06:22.201 (22464) [ERROR] -- common.h:71 GET /b/data/entry/list: [500] Failed to find the needed block in descriptor 

Get rid of nlohmann/json

Currently we use both Protobuf and nlohmann/json for serialization. We need only Protobuf, because it is used for binary serialization anyway.

Support HTTPS

A user can provide paths to the certificate and private key by using environment variables:

  • RS_CERT_PATH
  • RS_CERT_KEY_PATH

Protobuf warning

reduct-storage_1  | [libprotobuf ERROR /root/.conan/data/protobuf/3.19.1/_/_/build/64504d4b5743a18b5bb012ba0145fd09ce3bd5f2/source_subfolder/src/google/protobuf/wire_format_lite.cc:581] String field 'reduct.proto.EntryRecord.blob' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. 

Extend information about server in GET method

Currently, GET /info returns only the version and the number of buckets. We should add:

  • size of storage in bytes
  • uptime in seconds
  • timestamp of the oldest record
  • timestamp of the newest record
