vladrodionov / bigbase
BigBase - read-optimized, fully HBase-compatible NoSQL Data Store
License: GNU Affero General Public License v3.0
When 50% of the cache population is IMMORTAL (pinned), the default LRU eviction (OffHeapCache.processEvictionFast) does not work reliably; 25% IMMORTAL seems to be OK. Consider switching to a more sophisticated eviction scheme.
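One possible shape for pin-aware eviction, sketched with a plain deque. The Entry type, the evict signature, and the rotate-to-MRU policy are all hypothetical, not OffHeapCache internals; the point is only that the scan must skip pinned entries with a bounded scan length instead of stalling at the LRU tail:

```java
import java.util.Deque;

// Sketch of eviction that tolerates pinned (IMMORTAL) entries: the scan
// skips pinned objects and rotates them to the MRU end instead of
// stopping at the LRU tail. With ~50% of entries pinned, a naive tail
// eviction keeps landing on unevictable objects; maxScan bounds the work.
public class PinAwareEviction {

    public static class Entry {
        final String key;
        final boolean pinned;
        public Entry(String key, boolean pinned) { this.key = key; this.pinned = pinned; }
    }

    // Evicts up to `count` unpinned entries from the LRU end (head of deque).
    // Returns the number actually evicted.
    public static int evict(Deque<Entry> lru, int count, int maxScan) {
        int evicted = 0, scanned = 0;
        while (evicted < count && scanned < maxScan && !lru.isEmpty()) {
            Entry e = lru.pollFirst();
            scanned++;
            if (e.pinned) {
                lru.addLast(e); // pinned: rotate to MRU end, keep scanning
            } else {
                evicted++;      // unpinned: drop it
            }
        }
        return evicted;
    }
}
```

A smarter variant could keep pinned and unpinned entries on separate lists so the scan never touches pinned objects at all.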
How to pre-cache data efficiently? The current BlockCache implementation does not allow specifying which generation space a block should be placed into (young or tenured).
We need to investigate a new, more advanced compression scheme, ideally with decompression speed comparable to LZ4. Compression may be two-stage:
first, when a block is inserted, use fast/regular compression; a background thread then re-compresses blocks
A large block cache (10s-100s of GB) can have hot/warm/cold zones, and each zone can have a different compression scheme.
Another additional option is to try columnar representation of blocks in memory.
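A minimal sketch of the two-stage idea. TwoStageCompression is illustrative, not BigBase code, and java.util.zip compression levels stand in for LZ4/LZ4-HC since no LZ4 codec is assumed here:

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Two-stage compression sketch: compress fast on the write path,
// re-compress harder in a background pass. BEST_SPEED / BEST_COMPRESSION
// play the roles of LZ4 / LZ4-HC.
public class TwoStageCompression {

    // Stage 1: fastest level when the block is inserted.
    public static byte[] compressFast(byte[] block) {
        return compress(block, Deflater.BEST_SPEED);
    }

    // Stage 2: a background thread decompresses and re-compresses
    // the block at the highest level.
    public static byte[] recompress(byte[] fastCompressed) {
        return compress(decompress(fastCompressed), Deflater.BEST_COMPRESSION);
    }

    static byte[] compress(byte[] in, int level) {
        Deflater d = new Deflater(level);
        d.setInput(in);
        d.finish();
        byte[] buf = new byte[in.length + 64]; // small headroom for incompressible input
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    public static byte[] decompress(byte[] in) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(in);
            byte[] buf = new byte[1 << 20]; // sketch assumption: blocks < 1MB
            int n = inf.inflate(buf);
            inf.end();
            return Arrays.copyOf(buf, n);
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The background pass would iterate over warm blocks (e.g. the tenured generation) and swap each fast-compressed block for its re-compressed version under the block lock.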
This feature is becoming more and more important: See this JIRA: https://issues.apache.org/jira/browse/HBASE-10191
StoreConfiguration has a store name, which can be used.
TODO: Write a test where 2 different caches are saved to/loaded from the same directory.
Make the Storage Recycler pluggable. Check the code - it's probably done already.
All cache levels (L1/L2/L3)
It does not take into account the current data file size.
The interim solution: fall back to the non-compacting allocator, if we still have one. Needs verification.
It decreases over time. The suspect is the in-memory L3-RAM reference cache.
We need an additional check-sum to verify data integrity.
When Linux has a hardware failure, garbage can be written to the data file; that is why we need check-sums to verify data integrity on reads. When a check-sum fails, the file must be truncated at the end of the last valid block.
For check-sum generation we have the com.koda.util.Utils class (hash functions). Some additional code will probably be needed.
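A minimal in-memory sketch of the block framing and recovery scan. BlockChecksum and the [length][crc32][payload] layout are illustrative, and java.util.zip.CRC32 is used here instead of the com.koda.util.Utils hash functions:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch of block-level check-summing for the L3 data file.
// Each block is stored as [int length][long crc32][payload]; on startup
// the file is scanned and truncated at the end of the last valid block.
public class BlockChecksum {

    public static void writeBlock(ByteBuffer file, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        file.putInt(payload.length);
        file.putLong(crc.getValue());
        file.put(payload);
    }

    // Returns the offset just past the last valid block; the file
    // should be truncated to this offset on recovery.
    public static int lastValidOffset(ByteBuffer file, int fileSize) {
        int pos = 0;
        while (pos + 12 <= fileSize) {             // 12 = 4 (len) + 8 (crc)
            int len = file.getInt(pos);
            long stored = file.getLong(pos + 4);
            if (len < 0 || pos + 12 + len > fileSize) break; // truncated header/body
            byte[] payload = new byte[len];
            file.position(pos + 12);
            file.get(payload);
            CRC32 crc = new CRC32();
            crc.update(payload, 0, len);
            if (crc.getValue() != stored) break;   // corruption: truncate here
            pos += 12 + len;
        }
        return pos;
    }
}
```

The same scan works against a FileChannel; the ByteBuffer is used here only to keep the sketch self-contained.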
BLOCKCACHE-L3 = true means that blocks will be cached only in L3, bypassing L2. This avoids L2 cache thrashing when L2 is too small to hold the majority of hot blocks but L3 is large enough for the purpose.
We need to carefully calculate the required amount for non-DATA blocks. The current recommendation - 4% of the off-heap size - seems too large.
Google "accrual failure detection". Check Cassandra failure detection.
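A back-of-envelope sketch of the accrual ("phi") idea: instead of a boolean alive/dead verdict, compute a suspicion level from the time since the last heartbeat. PhiAccrual is illustrative and assumes exponentially distributed heartbeat intervals with a fixed mean; Cassandra's detector estimates the interval distribution from a sliding window instead:

```java
// Minimal phi accrual failure detection sketch.
// phi = -log10(P(no heartbeat for `elapsed` ms | node still alive));
// under an exponential inter-arrival assumption, that probability is
// exp(-elapsed / meanInterval). phi >= 1 means ~90% confidence the node
// is down, phi >= 2 means ~99%, and so on.
public class PhiAccrual {

    private final double meanIntervalMs;

    public PhiAccrual(double meanIntervalMs) {
        this.meanIntervalMs = meanIntervalMs;
    }

    public double phi(double elapsedMs) {
        double pLater = Math.exp(-elapsedMs / meanIntervalMs);
        return -Math.log10(pLater);
    }
}
```

The caller picks a threshold (Cassandra's default is phi = 8) rather than a fixed timeout, so suspicion adapts to observed heartbeat jitter.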
Under HBase JIRA (to be opened)
Or increase the default minimum to 98%. We need every % of RAM.
They run too long.
The idea is to use the fastest compression when an object is placed into the cache and later re-compress it with a more advanced codec (LZ4 => LZ4-HC).
Some optimizations:
https://groups.google.com/forum/#!topic/lz4c/fdlAfZzSAr4
Major feature. During compaction (minor and major) the block cache invalidates old blocks, but new ones are not added to the cache. Compaction needs to be block-hotness-aware: put a new block into the cache if its 'hotness' index is high. Placement may be into L2/L3 or into L3 only.
The work will be done under an Apache HBase JIRA:
https://issues.apache.org/jira/browse/HBASE-5263
We need special treatment for this error:
DISK STORAGE =137686247146
ITEMS IN STORE=15840571
14/05/25 12:31:08 FATAL storage.FileExtStorage: java.io.IOException: No space left on device
java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
at com.koda.integ.hbase.storage.FileExtStorage$FileFlusher.run(FileExtStorage.java:303)
14/05/25 12:31:08 FATAL storage.FileExtStorage: file-flusher thread died.
A couple of notes:
We do not need the OS page cache for SSD at all. Some other perf tips:
SSD optimization tips
For testing the L3 cache we need relatively fast storage. IO-provisioned EBS can deliver several thousand IOPS on a budget, and a local SSD can serve as the L3 cache.
See instruction here: http://hortonworks.com/blog/deploying-hadoop-cluster-amazon-ec2-hortonworks/
Yep, as an option.
Only a single memory allocator is allowed, and no global memory limit, expansion factor, etc. can be specified. We need a memory allocator per OffHeap instance, a way to set a global limit (to enable compaction), and the ability to specify the expansion factor, minimum slab size, and total number of slabs.
For example, the minimum slab size can be 4K for the block cache and 64 bytes for the row cache (not 16). We can set the expansion factor to 1.1 instead of the default 1.2, saving ~10% of space.
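The expansion-factor arithmetic can be sketched as follows (SlabSizing is an illustrative helper, not allocator code): with slab size classes growing geometrically by `factor`, every allocation is rounded up to the next class, so worst-case internal fragmentation is about (factor - 1) - roughly 20% at 1.2 versus 10% at 1.1.

```java
// Geometric slab size classes: each class is `factor` times the
// previous one, starting from minSlab. An allocation of `size` bytes
// lands in the smallest class >= size.
public class SlabSizing {

    public static long roundUp(long size, long minSlab, double factor) {
        double s = minSlab;
        while (s < size) {
            s *= factor;
        }
        return (long) Math.ceil(s);
    }
}
```

For a 100-byte allocation with a 64-byte minimum slab, factor 1.2 rounds up to a 111-byte class while factor 1.1 rounds up to 104 bytes, illustrating where the ~10% savings comes from.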
The code is there but no tests yet have been done.
When the cache size is small (say, < 20-40MB) nothing actually works. See CacheDirectCompressionTest. We probably need to impose a minimum-size restriction on the cache.
The row cache caches row:col entirely in memory and serves as the L1 cache in a multilevel caching architecture. The current limitation: the cached object gets purged from the cache every time 'row:col' is updated. We could instead merge new values with the existing cached one on update.
http://rocksdb.org/ - a flash/SSD-optimized K-V store.
I think something similar has been implemented by FB.
An NN/DN/HDFS enhancement is needed to support file 'registration' (we now have create, open, delete - we need 'register' as well). An agent creates all replicas locally on K data nodes (K is the replication factor), then registers the new file with the DNs and the NN. How? This new API would reduce or even eliminate network traffic during HBase compaction.
Check HBase JIRA.
When the storage recycler deletes a file, all references to that file must be removed from the L3-RAM reference cache.
We need an efficient way to retrieve data from the row cache from inside coprocessors.
FileChannel.read(ByteBuffer, long pos) can be used for parallel reading.
See the FileExtStorage implementation: we keep a list of open RandomAccessFile handles per actual file (default size is 5) to avoid threading issues. The assumption was that accessing a RandomAccessFile from multiple threads is not thread-safe. It seems that a single FileChannel per file would suffice.
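A sketch of the single-channel approach (readAt and demo are illustrative helpers, not FileExtStorage code). FileChannel's positional read neither uses nor updates the channel's own position, which is why one shared channel can serve concurrent readers without a handle pool:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// One shared FileChannel serving readers via positional reads,
// replacing the per-file pool of RandomAccessFile handles.
public class PositionalRead {

    public static byte[] readAt(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            // Absolute-position read: safe to call from multiple threads.
            int n = ch.read(buf, pos + buf.position());
            if (n < 0) break; // hit EOF
        }
        return buf.array();
    }

    // Self-contained demo: write a temp file, read a slice back.
    public static String demo() {
        try {
            Path p = Files.createTempFile("pread", ".dat");
            Files.write(p, "hello-world".getBytes());
            String out;
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                out = new String(readAt(ch, 6, 5));
            }
            Files.deleteIfExists(p);
            return out;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that positional reads on one channel may still serialize at the OS level on some platforms, so this should be benchmarked against the handle pool before replacing it.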
The standard ByteArraySerializer takes 4 bytes for array length. We can save 3 bytes for small arrays.
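A sketch of one way to save those bytes, using protobuf-style varint length prefixes; VarIntLength is a hypothetical name, not an existing BigBase class. Lengths under 128 take 1 byte instead of 4:

```java
import java.nio.ByteBuffer;

// Variable-length (varint) length prefixes: 7 bits of the value per
// byte, high bit set on all bytes except the last. Small array
// lengths (< 128) cost 1 byte instead of a fixed 4-byte int.
public class VarIntLength {

    public static void writeLength(ByteBuffer buf, int len) {
        while ((len & ~0x7F) != 0) {
            buf.put((byte) ((len & 0x7F) | 0x80)); // more bytes follow
            len >>>= 7;
        }
        buf.put((byte) len);                       // final byte, high bit clear
    }

    public static int readLength(ByteBuffer buf) {
        int len = 0, shift = 0;
        byte b;
        do {
            b = buf.get();
            len |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return len;
    }

    // Number of bytes the prefix takes for a given length.
    public static int encodedSize(int len) {
        int n = 1;
        while ((len & ~0x7F) != 0) {
            len >>>= 7;
            n++;
        }
        return n;
    }
}
```

The trade-off is that the length is no longer fixed-width, so any code that seeks past a record must decode the prefix instead of skipping 4 bytes.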
This is complex feature: L1 and L2 RAM caches share the same RAM and size of both caches are dynamically adjusted to workload.
Currently, cacheOnWrite puts data into the young-gen space, which is 50% of the overall cache. We cannot bypass this yet.
If codec == null, do not write the 4-byte length of the compressed stream (0) - this saves 4 bytes.
When the optimizer works on non-compressed data: check PerfTest with compression disabled and the optimizer enabled.
File Storage (L3) is already in the 1.0 (1.1) release. This is an experimental feature (3.0).
Read here: RocksDB Wiki. The Prefix API is similar to the Amazon DynamoDB NoSQL API, where the row key is split into two parts: a hash prefix and a scan suffix. This will require custom split support as well.
Currently, it's ~140 bytes per block because we do not have an efficient serializer for StorageHandle.
For testing the L3 cache we need relatively fast storage. IO-provisioned EBS can deliver several thousand IOPS on a budget.
See instruction here: http://hortonworks.com/blog/deploying-hadoop-cluster-amazon-ec2-hortonworks/
Actually, OffHeapCache.clear() has serious issues and MUST be rewritten or fixed. This may be a sign of a MORE serious problem in OffHeapCache: a memory alloc/free/alloc sequence can fail.
Rack awareness is good to have, to avoid an inter-rack uplink bottleneck.
If it were configurable per table/CF, that would be another story. One could then have smaller, hot tables still use the block cache and larger, not-so-hot tables use the off-heap cache; thus we would be able to make use of RAM sizes of 128GB or more.
Added test. Observation:
L3 (SSD) needs to be tested standalone and in cluster.