
jbindex's Introduction


jbindex

The goal is to provide an easy-to-use key-value map for billions of records using just one directory and some disk space.

It's a simple, fast index. Work with the index is split into the following phases:

  • Writing data to the index. All data that should be stored in the index is sent to it.
  • Building the index. In this phase the data is organized for fast access.
  • Searching the index. In this phase it's not possible to alter data in the index.
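The three phases above can be illustrated with a minimal self-contained sketch. This is not the jbindex API; the class and method names are invented for illustration, and the "build" step is just a sort with binary search for lookups:

```java
import java.util.ArrayList;
import java.util.List;

class PhasedIndex {
    enum State { WRITING, BUILT }

    private final List<long[]> pairs = new ArrayList<>();
    private State state = State.WRITING;

    // Phase 1: writing — all data must be sent before building.
    void put(long key, long value) {
        if (state != State.WRITING) throw new IllegalStateException("index already built");
        pairs.add(new long[] { key, value });
    }

    // Phase 2: building — organize the data for fast access (here: sort by key).
    void build() {
        pairs.sort((a, b) -> Long.compare(a[0], b[0]));
        state = State.BUILT;
    }

    // Phase 3: searching — data can no longer be altered.
    Long get(long key) {
        if (state != State.BUILT) throw new IllegalStateException("index not built yet");
        int lo = 0, hi = pairs.size() - 1;
        while (lo <= hi) {                       // binary search over sorted pairs
            int mid = (lo + hi) >>> 1;
            long k = pairs.get(mid)[0];
            if (k == key) return pairs.get(mid)[1];
            if (k < key) lo = mid + 1; else hi = mid - 1;
        }
        return null;                             // key not present
    }
}
```

The point of the phase split is that the search structure never has to support mutation, which keeps it simple and compact.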

The index is not thread safe.

Useful links

Basic work with index

The index can be in the following states:

Index methods

The index should be created with the builder, which makes the index instance. For example:

final Index<Integer, String> index = Index.<Integer, String>builder()
        .withDirectory(directory)
        .withKeyClass(Integer.class)
        .withValueClass(String.class)
        .build();

Index states

Interrupting the process of writing data to the index can corrupt the entire index.

Development

Mockito requires reflective access to non-public parts of a Java module. It can be opened manually by passing the following JVM parameter:

--add-opens=java.base/java.lang=ALL-UNNAMED
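If the project builds with Maven and runs its tests through the Surefire plugin (an assumption about the build setup), the flag can be configured once in `pom.xml` instead of being passed by hand:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Open java.lang to Mockito's reflective access during tests. -->
    <argLine>--add-opens=java.base/java.lang=ALL-UNNAMED</argLine>
  </configuration>
</plugin>
```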

How to get segment disk size

On macOS try:

diskutil info /Volumes/LaCie
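On Linux (and also on macOS), the `du` utility can report how much disk space the index directory occupies; the path below is a placeholder:

```shell
# Report the total on-disk size of a directory tree in human-readable form.
du -sh /path/to/index/directory
```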

jbindex's People

Contributors

jajir

jbindex's Issues

Make the builder easier to use

Add some default values; allow inserting just a data type, without specifying a descriptor. Make it work even without configuring the bloom filter. Add an option to define a type without a descriptor, just with a class. Add default values per data type.

Stream is not closed

In class SstIndexImpl the method getStream() returns a stream that doesn't close the underlying resources.

Clearing the segment cache requires reading it into memory

The segment cache can be cleared in the following way:

final SegmentCache<K, V> sc = new SegmentCache<>(
        segmentFiles.getKeyTypeDescriptor(), segmentFiles,
        segmentPropertiesManager);
sc.clear();

This is not efficient: the cache should not have to be loaded into memory. It happens in SegmentFullWriter.close().

Add recovery

Use a WAL (write-ahead log) for cache puts and deletes. After a successful flush, delete the WAL and start a new one. When the index starts after a Ctrl+C, recover from the WAL.
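The recovery idea can be sketched as follows. This is illustrative only, not jbindex code; the file format and class name are invented:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

class Wal {
    private final Path file;

    Wal(Path file) { this.file = file; }

    // Every put is appended to the log before it touches the in-memory cache.
    void logPut(String key, String value) throws IOException {
        Files.writeString(file, "PUT " + key + " " + value + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Deletes are logged the same way, as tombstone records.
    void logDelete(String key) throws IOException {
        Files.writeString(file, "DEL " + key + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // After a successful flush the log is no longer needed.
    void discard() throws IOException { Files.deleteIfExists(file); }

    // On startup, replay the log in order to rebuild the in-memory cache.
    Map<String, String> recover() throws IOException {
        Map<String, String> cache = new LinkedHashMap<>();
        if (!Files.exists(file)) return cache;
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.split(" ", 3);
            if (parts[0].equals("PUT")) cache.put(parts[1], parts[2]);
            else cache.remove(parts[1]);
        }
        return cache;
    }
}
```

Because every mutation is durable before it is acknowledged, an interrupted process loses at most the operation that was being written, and replaying the log restores the cache to its last consistent state.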

Improve segment search object caching

A segment contains a few objects that help with the search process: the cache, the bloom filter, and the scarce index. All of them should be cached in one place and still be accessible from the segment.

Rewrite segment iteration to use tryAdvance

The hasNext and next methods are not suitable: they require preloading an element, which can fail when the next element is unloaded from the delta cache. Use the tryAdvance method instead.
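The tryAdvance style is part of the standard java.util.Spliterator interface: the element is produced and consumed in a single call, so nothing has to be preloaded ahead of time. A minimal sketch (not jbindex code, backed here by a plain list):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;

class PairSpliterator implements Spliterator<String> {
    private final Iterator<String> source;

    PairSpliterator(List<String> pairs) { this.source = pairs.iterator(); }

    // Returns false when exhausted; otherwise hands exactly one element to
    // the consumer — no separate hasNext/next pair, no lookahead element.
    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!source.hasNext()) return false;
        action.accept(source.next());
        return true;
    }

    @Override public Spliterator<String> trySplit() { return null; }
    @Override public long estimateSize() { return Long.MAX_VALUE; }
    @Override public int characteristics() { return ORDERED; }
}
```

The caller drives the loop with `while (spliterator.tryAdvance(consumer)) { }`, so a segment-backed implementation can load, decode, and release each element inside one call.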

Add flush() to index

Add flush() to the index. It allows "committing" changes in order to work with them further. Document it.

Index iterator is inconsistent

The segment iterator should work after flushing. Because of that it can't iterate through data in the cache; the iterator should use only data from the segment. Even the read part should check its value in the cache. There will be a problem in the case of a tombstone in the cache.

Segment splitting

When there are multiple duplicated write operations, the properties hold an invalid value for the number of keys in a segment. Because of that, split segments are not equally sized.

Segment iterator is not consistent

When the iterator has started reading values and one value is then changed, the old value is returned. For example, with the pairs [a,1],[b,1],[c,1] in a segment, the following commands are executed:

iterator.next() --> [a,1]
put(b,2)
iterator.next() --> [b,1]

The last call should return [b,2].

Flushing to a segment should compact just once

When flushing pairs to a segment, two parameters for the delta cache size should be introduced:

  • maximum size for searching and reading
  • maximum size for flushing, which temporarily allows exceeding the previous value to prevent repeated compaction during flushing

Improve logging format

In the logs, numbers are formatted simply as 3740512. To improve readability this should be 3 740 512.
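One way to produce such space-grouped numbers with the standard library (illustrative; jbindex may format differently):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class LogFormat {
    static String withSpaces(long n) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
        symbols.setGroupingSeparator(' ');           // group digits with a space
        DecimalFormat fmt = new DecimalFormat("#,###", symbols);
        return fmt.format(n);                        // e.g. 3740512 -> "3 740 512"
    }
}
```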

Splitting with a delta cache full of tombstones fails

When the delta cache has a size similar to the main index and is full of delete commands for already stored keys, the splitting function can't correctly estimate where half of the keys lies and can fail.
