reductstore's Introduction

ReductStore

A high-performance time series database for blob data


ReductStore is a time series database that is specifically designed for storing and managing large amounts of blob data. It boasts high performance for both writing and real-time querying, with the added benefit of batching data. This makes it an ideal solution for edge computing, computer vision, and IoT applications where network latency is a concern. For more information, please visit https://www.reduct.store/.

Why Does It Exist?

There are numerous time-series databases available on the market that provide remarkable functionality and scalability. However, they all concentrate on numeric data and have limited support for unstructured data, which at best can be stored as strings.

On the other hand, S3-like object storage solutions could be the best place to keep blob objects, but they don't provide an API to work with data in the time domain.

There are many kinds of applications where we need to collect unstructured data such as images, high-frequency sensor data, binary packages, or huge text documents, and provide access to its history. Many companies build an in-house storage solution for these applications based on a combination of a TSDB and blob storage. This can work; however, keeping data integrity across both databases, implementing retention policies, and providing performant data access is a challenging development task.

The ReductStore project aims to solve the problem of providing a complete solution for applications that require unstructured data to be stored and accessed at specific time intervals. It guarantees that your data will not overflow your hard disk and batches records to reduce the number of critical HTTP requests for networks with high latency.

All of these features make the database the right choice for edge computing and IoT applications if you want to avoid development costs for your in-house solution.

Features

  • Storing and accessing unstructured data as time series
  • No limit on the maximum blob size
  • Real-time FIFO bucket quota based on size to avoid disk space shortage
  • HTTP(S) API
  • Append-only replication
  • Optimized for small objects (less than 1 MB)
  • Labeling data for annotation and filtering
  • Iterative data querying (see the Python sketch after this list)
  • Batching records into a single HTTP request or response for write and read operations
  • Embedded Web Console
  • Token authorization for managing data access
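
To give a feel for labeling, querying, and batching, here is a minimal sketch using the Python client SDK. It assumes a local instance on http://localhost:8383 and placeholder bucket/entry names, and that the installed reduct package supports the labels and query APIs shown:

import asyncio
from reduct import Client, Bucket

async def main():
    async with Client("http://localhost:8383") as client:
        # "my-bucket" is a placeholder bucket name
        bucket: Bucket = await client.create_bucket("my-bucket", exist_ok=True)

        # Attach labels when writing, e.g. for later annotation and filtering
        await bucket.write("entry-1", b"<blob bytes>", labels={"camera": "cam-1"})

        # Iterate over the records of the entry; the SDK fetches them
        # from the server in batches under the hood
        async for record in bucket.query("entry-1"):
            print(record.timestamp, record.labels)
            data = await record.read_all()

asyncio.run(main())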

Get Started

The quickest way to get up and running is with our Docker image:

docker run -p 8383:8383 -v ${PWD}/data:/data reduct/store:latest

Alternatively, you can opt for Cargo:

apt install protobuf-compiler
cargo install reductstore
RS_DATA_PATH=./data reductstore

For a more in-depth guide, visit the Getting Started and Download sections.

After initializing the instance, dive in with one of our Client SDKs to write or retrieve data. To illustrate, here's a Python sample:

import time
import asyncio
from reduct import Client, Bucket

async def main():
    # Create a client for interacting with a ReductStore service
    async with Client("http://localhost:8383") as client:
        # Create a bucket and store a reference to it in the `bucket` variable
        bucket: Bucket = await client.create_bucket("my-bucket", exist_ok=True)

        # Write data to the bucket
        ts = time.time_ns() // 1000
        await bucket.write("entry-1", b"Hey!!", ts)

        # Read data from the bucket
        async with bucket.read("entry-1", ts) as record:
            data = await record.read_all()
            print(data)

# Run the main function
asyncio.run(main())

Client SDKs

ReductStore is built with adaptability in mind. While it comes with a straightforward HTTP API that can be integrated into virtually any environment, we understand that not everyone wants to interact with the API directly. To streamline your development process and make integrations smoother, we've developed a series of client SDKs tailored for different programming languages and environments. These SDKs wrap around the core API, offering a more intuitive and language-native way to interact with ReductStore, thus accelerating your development cycle. The following client SDKs are available:

  • Python SDK
  • JavaScript SDK
  • C++ SDK
  • Rust SDK

Tools

ReductStore is not just about data storage; it's about simplifying and enhancing your data management experience. Along with its robust core features, ReductStore offers a suite of tools to streamline administration, monitoring, and optimization. Here are the key tools you can leverage:

  • CLI Client - a command-line interface for direct interactions with ReductStore
  • Web Console - a web interface to administrate a ReductStore instance

Feedback & Contribution

Your input is invaluable to us! 🌟 If you've found a bug, have suggestions for improvements, or want to contribute directly to the codebase, here's how you can help:

  • Discord: Join our Discord community to discuss, share ideas, and collaborate with fellow ReductStore users.
  • Feedback & Bug Reports: Open an issue on our GitHub repository. Please provide as much detail as possible so we can address it effectively.
  • Contribute: ReductStore is an open-source project. We encourage and welcome contributions.

Get Involved

We believe in the power of community and collaboration. If you've built something amazing with ReductStore, we'd love to hear about it! Share your projects, experiences, and insights on our Discord community.

If you find ReductStore beneficial, give us a ⭐ on our GitHub repository.

Your support fuels our passion and drives us to keep improving.

Together, let's redefine the future of blob data storage! 🚀

Frequently Asked Questions (FAQ)

Q1: What sets ReductStore apart from other time-series databases?

A1: ReductStore is specially designed for storing and managing large amounts of blob data, optimized for both high performance and real-time querying. Unlike other databases that focus primarily on numeric data, ReductStore excels in handling unstructured data, making it ideal for various applications like edge computing and IoT.

Q2: How do I get started with ReductStore?

A2: You can easily set up ReductStore using our Docker image or with Cargo. Detailed instructions are provided in the Getting Started section.

Q3: Is there any size limitation for the blob data?

A3: While ReductStore is optimized for small objects (less than 1 MB), there's no hard limit for the maximum size of a blob.

Q4: Can I integrate ReductStore with my current infrastructure?

A4: Absolutely! With our variety of client SDKs and its adaptable HTTP API, ReductStore can be integrated into almost any environment.

Q5: I'm facing issues with the installation. Where can I get help?

A5: We recommend checking out our documentation. If you still face issues, feel free to join our Discord community or raise an issue on our GitHub repository.

reductstore's People

Contributors

anthonycvn, aschenbecherwespe, atimin, dependabot[bot], mambaz, renghen, rtadepalli


reductstore's Issues

Refactor block structure in entry

The current implementation has one big descriptor that holds information about all the stored blocks and records. It is fast but has downsides:

  1. If it gets corrupted, you lose all the data
  2. It limits the number of records, because its indexes are kept in RAM

The better approach is to store everything in a block and make each block autonomous.

List buckets in storage

The Server API should provide the list of buckets with the following information:

  • name of the bucket
  • size of the bucket
  • number of entries in the bucket
  • timestamp of the newest record in the bucket
  • timestamp of the oldest record in the bucket

PUT /b/:bucket doesn't work correctly

If I send partial settings:

{
  "quota_size": 100
}

and then read them back with the GET /b/:bucket method, I get not the full settings but only the last written ones:

{
  "quota_size": 100
}

P.S. The documentation is also wrong. It says all the settings must be present in the request, but they are optional.

Timestamp in microseconds doesn't work

If you try to use a timestamp in microseconds you get an error:

 curl -d "some_data" -X POST http://127.0.0.1:8383/b/my_data/entry_1?ts=1610387457862000

2022-01-19 18:42:46.569 (47040) [ERROR] -- common.h:71 POST /b/my_data/entry_1: [422] Failed to parse 'ts' parameter: 1610387457862000 should unix times in microseconds 

Server URL in logs doesn't contain base path

storage-1_1  | 2022-05-21 05:25:36.462 (51136)  [INFO] -- main.cc:41 Reduct Storage 0.6.0 
storage-1_1  | 2022-05-21 05:25:36.463 (51136)  [INFO] -- main.cc:55 Configuration: 
storage-1_1  |          RS_LOG_LEVEL = INFO (default)
storage-1_1  |  RS_HOST = 0.0.0.0 (default)
storage-1_1  |  RS_PORT = 8383 (default)
storage-1_1  |  RS_API_BASE_PATH = storage-1/ 
storage-1_1  |  RS_DATA_PATH = /data (default)
storage-1_1  |  
storage-1_1  | 2022-05-21 05:25:36.483 (51136)  [INFO] -- storage.cc:38 Load 0 buckets 
storage-1_1  | 2022-05-21 05:25:36.483 (51136) [WARNING] -- token_auth.cc:125 API token is empty. No authentication. 
storage-1_1  | 2022-05-21 05:25:36.483 (51136)  [INFO] -- api_server.cc:135 Run HTTP server on http://0.0.0.0:8383 

RS_API_BASE_PATH is storage-1/, but the URL is http://0.0.0.0:8383. It should be http://0.0.0.0:8383/storage-1.

HTTP API to store and get a blob

As a user, I can store blobs by name with a timestamp and get them back from the storage (a usage sketch follows the list):

  • POST /<bucket_name>/<entry_name>?timestamp=
  • GET /<bucket_name>/<entry_name>?timestamp=
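
A minimal sketch of how a client could use this proposed API from Python. It assumes the requests package, a local instance, and placeholder bucket/entry names, with the timestamp query parameter exactly as proposed above:

import time
import requests

ts = time.time_ns() // 1000  # current time in microseconds
url = f"http://localhost:8383/my-bucket/entry-1?timestamp={ts}"  # placeholder names

# Store a blob under the given timestamp
requests.post(url, data=b"some blob bytes")

# Get it back by the same timestamp
response = requests.get(url)
print(response.content)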

Docker container crashes without volume

The container crashes with an error if you don't give it a volume:

docker run -p 8383:8383 --rm ghcr.io/reduct-storage/reduct-storage:latest

2022-01-31 21:32:05.783 (42944)  [INFO] -- main.cc:32 Reduct Storage 0.1.0 
2022-01-31 21:32:05.788 (42944)  [INFO] -- main.cc:43 Configuration: 
 	RS_LOG_LEVEL = INFO (default)
	RS_HOST = 0.0.0.0 (default)
	RS_PORT = 8383 (default)
	RS_API_BASE_PATH = / (default)
	RS_DATA_PATH = /data (default)
 
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: directory iterator cannot open directory: No such file or directory [/data]

Browse a bucket

We should extend the GET /b/:bucket_name method with:

  • List of entries
  • Statistical information:
    • Size in bytes
    • Oldest record
    • Latest record

Usage in GET /info always 0

This test in the JavaScript SDK shows that, after writing data, the storage reports 0 bytes of usage.

 it("should get information about the server", async () => {
        await client.createBucket("bucket_1");
        const bucket = await client.createBucket("bucket_2");
        await bucket.write("entry", "somedata", new Date(1000));
        await bucket.write("entry", "somedata", new Date(2000));

        const info: ServerInfo = await client.getInfo();
        expect(info.version).toMatch(/0\.[0-9]+\.[0-9]+/);

        expect(info.bucketCount).toEqual(2n);
        expect(info.usage).toEqual(0n); //TODO: a bug in storage
        expect(info.uptime).toBeGreaterThanOrEqual(0);
        expect(info.oldestRecord).toEqual(1000_000n);
        expect(info.latestRecord).toEqual(2000_000n);
    });

This is wrong, and it comes from the server side. It looks like the bug happens after an old bucket is removed.

System degrades if an empty entry is created

If a user tries to get data from a non-existing entry, the storage creates it. After that, the quota stops working, the storage produces many log messages, and it eventually stops working altogether.

 [ERROR] -- storage.cc:199 Didn't mange to keep quota: [500] Tries to remove a block in empty entry 

The solution:

  1. Fix the quota so it does not try to clean blocks from already empty entries.
  2. Do not create an entry for GET requests.

Wrong size calculation

In the file system, the data takes almost 356 GB:

du -h /mnt/data/reduct/data/
21G	/mnt/data/reduct/data/acc-3
195G	/mnt/data/reduct/data/camera
21G	/mnt/data/reduct/data/acc-4
21G	/mnt/data/reduct/data/acc-0
21G	/mnt/data/reduct/data/acc-5
21G	/mnt/data/reduct/data/acc-1
21G	/mnt/data/reduct/data/acc-6
21G	/mnt/data/reduct/data/acc-7
21G	/mnt/data/reduct/data/acc-2
356G	/mnt/data/reduct/data/

But the storage reports a much smaller number:

{"info":{"name":"data","size":"136706208280","entry_count":"9","oldest_record":"1653655367073000","latest_record":"1653700867366000"},"settings":{"max_block_size":"67108864","quota_type":"FIFO","quota_size":"322122547200"},"entries":[{"name":"acc-0","size":"7682225325","record_count":"45485","block_count":"115","oldest_record":"1653655377075000","latest_record":"1653700867366000"},{"name":"acc-1","size":"7696745730","record_count":"45485","block_count":"115","oldest_record":"1653655377075000","latest_record":"1653700867366000"},{"name":"acc-2","size":"7704795276","record_count":"45485","block_count":"115","oldest_record":"1653655377075000","latest_record":"1653700867366000"},{"name":"acc-3","size":"7686308722","record_count":"45486","block_count":"115","oldest_record":"1653655376075000","latest_record":"1653700867366000"},{"name":"acc-4","size":"7706995989","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"acc-5","size":"7704742470","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"acc-6","size":"7698664263","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"acc-7","size":"7711576776","record_count":"45485","block_count":"115","oldest_record":"1653655367073000","latest_record":"1653700857365000"},{"name":"camera","size":"75114153729","record_count":"45486","block_count":"1107","oldest_record":"1653655376075000","latest_record":"1653700867366000"}]}

Authentication with JWT token

We should introduce an authentication and authorization model for the storage. The simplest way to start is a JWT token and an embedded admin user with full rights.

  • The storage runs with the RS_API_TOKEN environment variable
  • The client uses RS_API_TOKEN as a refresh token to obtain a temporary access token
  • The token must be sent in an HTTP header (a sketch follows this list)
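
To illustrate the last point, here is a minimal sketch of a client passing the access token. The Bearer scheme and the Authorization header are assumptions, as the issue does not specify the header format:

import requests

ACCESS_TOKEN = "my-access-token"  # hypothetical token obtained with RS_API_TOKEN

# Send the token with every request in the Authorization header
response = requests.get(
    "http://localhost:8383/info",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(response.json())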

HTTP API to list records for a time interval

The storage should provide a list of stored objects for a time interval. Example API:

GET http://hostname/bucket/entry?start=X&stop=Y

Output:

[
  { "ts": 11002021, "size": 32102 },
  ...
]
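
A minimal Python sketch of consuming this proposed endpoint; the host, bucket, entry, and interval values are placeholders:

import requests

# List records stored in the given time interval (placeholders throughout)
response = requests.get(
    "http://localhost:8383/bucket/entry",
    params={"start": 0, "stop": 1650000000000000},
)
for record in response.json():
    print(record["ts"], record["size"])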

Bucket API

A bucket is a common space for all records and holds the storage settings. Currently it has only a name. (A usage sketch follows the list.)

  • Request information about bucket: GET /<bucket_name>
  • Create a new bucket: POST /<bucket_name>
  • Remove a bucket and all its data: DELETE /<bucket_name>
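
A minimal sketch of this bucket lifecycle from Python; it assumes the requests package, a local instance, and a placeholder bucket name:

import requests

base = "http://localhost:8383/my-bucket"  # placeholder bucket name

requests.post(base)            # create a new bucket
response = requests.get(base)  # request information about the bucket
print(response.json())
requests.delete(base)          # remove the bucket and all its data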

Method GET /info should provide defaults for a new bucket

Because a bucket can be created with default parameters, a user should be able to find out what those defaults are:

{
    "version":"0.5.0",
    "bucket_count":"1",
    "usage":"32167820595",
    "uptime":"16029",
    "oldest_record":"1652033077885000",
    "latest_record":"1652041795230000",
    "defaults": {
       "bucket": {
          "block_size": "xxxxx",
          "quota_type": "NONE",
          "quota_size": "0",
       }
    }
}

GET /b/:bucket/:entry/list returns error

Request to the test server:

GET "http://test.reduct-storage.dev:8383/b/data/entry/list?start=0&stop=1647765845000000"

Returns:

{"detail":"Failed to load a block descriptor: /data/data/entry/67904000000.meta"}

But there is no such file, and it doesn't look like a timestamp.

GET /b/:bucket/:entry request sometimes fails

Under intensive reading, the storage aborts requests and crashes:

reduct-storage_1  | 2022-03-27 21:50:07.550 ( 5056) [ERROR] -- api_server.cc:361 Failed to send data 
reduct-storage_1  | 2022-03-27 21:50:07.695 ( 5056) [ERROR] -- api_server.cc:361 Failed to send data 
reduct-storage_1  | 2022-03-27 21:50:12.900 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 
reduct-storage_1  | 2022-03-27 21:50:12.931 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 
reduct-storage_1  | 2022-03-27 21:50:16.901 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 
reduct-storage_1  | 2022-03-27 21:50:16.912 ( 5056) [ERROR] -- common.h:95 GET /b/data/entry: aborted 

Looks like I have to check whether the HTTP engine is ready for writing.

Long loading

After the block refactoring, we have to calculate the size and record count by parsing all the meta files of each block. It takes too long: about 1 second per 50 GB. The alternative solution:

  1. Don't calculate the record count; it is not part of the HTTP API
  2. To calculate the stored size, use the size of the files on disk

Provisioning bucket from environment variables

For edge devices, the most common use case is to have only one bucket with fixed settings. It is more convenient to configure it with environment variables:

  • RS_BUCKET_NAME
  • RS_BUCKET_MAX_BLOCK_SIZE (humanized size, e.g. 10M, 2G)
  • RS_BUCKET_QUOTA_TYPE (values: NONE (default), FIFO)
  • RS_BUCKET_QUOTA_SIZE (humanized size, e.g. 10M, 2G)

A provisioned bucket cannot be deleted or changed.

Typo in log

There is a log message for a bad API token:

 No bearer token in response header 

It should say request header.

The engine removes the wrong block in a bucket with quota

I have 2 entries in a bucket:

{
  "info":{
    "name":"test-bucket",
    "size":"32144630132",
    "entry_count":"2",
    "oldest_record":"1651412479999000",
    "latest_record":"1651514933403000"
  },
  "settings":{
    "max_block_size":"67108864",
    "quota_type":"FIFO",
    "quota_size":"32212254720"
  },
  "entries":[
    {
      "name":"blobs",
      "size":"32144630100",
      "record_count":"61213",
      "block_count":"477",
      "oldest_record":"1651412479999000",
      "latest_record":"1651514933403000"
    },
    {
      "name":"md-sums",
      "size":"32",
      "record_count":"1",
      "block_count":"1",
      "oldest_record":"1651514933403000",
      "latest_record":"1651514933403000"
    }
  ]
}

The engine removes blocks from md-sums even though the entry blobs holds the older records.

Graceful stop

We should handle OS signals to finish all tasks and stop the storage safely.

Extend GET /b/:bucket/ method with stats of each entry

The current GET /b/:bucket method returns a list of entry names:

{
  "info": {
    "name": "my_data",
    "size": "27",
    "entry_count": "3",
    "oldest_record": "0",
    "latest_record": "0"
  },
  "settings": {
    "max_block_size": "67108864",
    "quota_type": "FIFO",
    "quota_size": "10000"
  },
  "entries": [
    "entry_1",
    "entry_2",
    "entry_3"
  ]
}

but it'd be more useful to provide the same info as for the bucket 👍

{
  "name": "entry_1",
  "size": "27",
  "record_count": "3",
  "oldest_record": "0",
  "latest_record": "0"
}

Bad timestamp in GET /info

It looks like the oldest_record field is in seconds:

{
"version":"0.5.0",
"bucket_count":"1",
"usage": "42928697586",
"uptime": "442",
"oldest_record": "1648890020", // !!!!
"latest_record":"1648890020590279"
}

Bucket Quota

We have to limit the size of a bucket so that it does not run out of disk space.

We should add a FIFO quota in bytes to the bucket settings. (A sketch of the resulting API follows.)
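
As a sketch of what such a quota looks like from today's Python SDK; BucketSettings and QuotaType are assumed to come from the reduct package, and the quota size is a placeholder:

import asyncio
from reduct import Client, BucketSettings, QuotaType

async def main():
    async with Client("http://localhost:8383") as client:
        # FIFO quota: once the bucket exceeds quota_size bytes,
        # the oldest blocks are removed first
        await client.create_bucket(
            "my-bucket",  # placeholder bucket name
            BucketSettings(quota_type=QuotaType.FIFO, quota_size=1_000_000_000),
            exist_ok=True,
        )

asyncio.run(main())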

Publish a Docker image

We should have a ready-to-use Docker image. We can publish it to the public GitHub registry from the main branch.

[500] Failed to find the needed block in descriptor

Periodically, the EntryList request fails with a 500 error:

reduct-storage_1  | 2022-01-21 22:06:22.201 (22464) [ERROR] -- entry.cc:260 No block in entry 'entry' for ts=2022-01-21T22:03:43.996662Z 
reduct-storage_1  | 2022-01-21 22:06:22.201 (22464) [ERROR] -- common.h:71 GET /b/data/entry/list: [500] Failed to find the needed block in descriptor 

Get rid of nlohmann/json

Currently we use both Protobuf and nlohmann/json for serialization. We need only Protobuf, because it is used for binary serialization anyway.

Support HTTPS

A user can provide paths to the certificate and private key by using environment variables:

  • RS_CERT_PATH
  • RS_CERT_KEY_PATH

Protobuf warning

reduct-storage_1  | [libprotobuf ERROR /root/.conan/data/protobuf/3.19.1/_/_/build/64504d4b5743a18b5bb012ba0145fd09ce3bd5f2/source_subfolder/src/google/protobuf/wire_format_lite.cc:581] String field 'reduct.proto.EntryRecord.blob' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. 

Extend information about server in GET method

Currently, GET /info returns only the version and the number of buckets. We should add:

  • size of storage in bytes
  • uptime in seconds
  • timestamp of the oldest record
  • timestamp of the newest record
