
Comments (27)

jackhftang commented on July 26, 2024

For method 1, the key point is: for each user in each group, keep an array-like data structure (an in-memory array, a key-value store, or a table with a rank key) that stores timestamps in sorted order.

For example, user X in group Y has the following array of timestamps: [1462032000, 1462032011, 1462032012, 1462032015, 1462032020, 1462032030].

When a new message comes in, appending it keeps the array sorted (timestamps only increase). To query, say we want to know how many messages that user sent between 1462032010 and 1462032025; we binary search twice:

  1. Search for the smallest index whose value is larger than 1462032010; in the above case it is index 1, corresponding to 1462032011.
  2. Search for the largest index whose value is smaller than 1462032025; in the above case it is index 4, corresponding to 1462032020.
    The number of messages the user sent between 1462032010 and 1462032025 is then the result of step 2 minus the result of step 1, plus 1: 4 - 1 + 1 = 4.

If we keep the whole history for each user in each group, we can answer ANY time period in O(ln(N)) by binary search.
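
A minimal sketch of this counting (an equivalent two-lower-bound formulation; helper names are mine, not necessarily what #21 uses):

function lowerBound(ts, target) { // first index with ts[i] >= target
  var lo = 0, hi = ts.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (ts[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Inclusive count of timestamps in [from, to].
function countBetween(ts, from, to) {
  return lowerBound(ts, to + 1) - lowerBound(ts, from);
}

// countBetween([1462032000, 1462032011, 1462032012, 1462032015,
//               1462032020, 1462032030], 1462032010, 1462032025); // => 4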

The cache I submitted in #21 is implemented exactly this way, using an in-memory array. It does NOT use the slot approach or insertion sort.

siutsin commented on July 26, 2024

#18

siutsin commented on July 26, 2024

Agreed on switching DB; it took up way more resources than I expected.
Currently the app is hosted on a small gear in OpenShift with MongoDB 2.4.

Can you elaborate on how the caching works? Can it reflect the real-time message count? For example:

  1. Group has 100 messages at the moment
  2. /topten -> shows 100 messages
  3. someone says something
  4. /topten -> shows 101 messages

I will study this issue more over the weekend. For now, I can share the DB stats for your reference.

[Screenshots: MongoDB stats, 2016-04-28]

jackhftang commented on July 26, 2024

There are a number of approaches:

  1. Use a key-name hack on a key-value store, e.g. lmdb/rocksdb. For each user in each group, keep a counter; whenever a new message arrives, insert a new key message:group:username:counter with createdDate as the stored value, then count += 1 (or the reverse: message:group:username:createdDate as key, accumulated count as value). On query, we can binary search on the database (just like binary searching createdDate on an array, but without constant-time access). Performing this query for every user in a group should be O(number of users in that group * ln(number of messages in db) * ln(messages of that user in that group)). This should be fast (though a bit less predictable) and flexible: you can query the number of messages in ANY time range.
  2. Use slots: for each user in each group, keep an array of slots of some fixed duration; a slot is just an integer counter. Say 1-hour slots over 7 days: there are at most 7 * 24 = 168 slots, and for a user with no message in an hour there is no need to create a slot. A query scans all users in a group and sums all slots, taking O(number of users * number of slots); since users per group and slot count are small, that is at most 1500 x 168 = 252,000 additions, and scanning an array is pretty fast. Even at 1000 groups x 1500 users, 1000 x 1500 x 168 x 4 bytes ~= 1 GB is still not much memory. The drawback is accuracy only up to the precision of a slot (which should be okay in our use case).
  3. Directly use a segment tree: for each user in each group, keep a segment tree. This is similar to method 1 but independent of database size; however, it requires keeping every dateCreated in memory. At the current message count (< 1,000,000) that is not a problem, but at > 1,000,000,000 it is not suitable.
  4. Buffer, an optimistic approach: instead of keeping the whole createdDate array, keep only the last N elements plus the total length; adding a new element is just length += 1. A query pops from the buffer until it reaches the queried range. If the buffer drains, fall back on the database index, recalculate the length, and take another N elements from the database. (A sketch follows below.)

OTHER: store information about each group and each user-in-group only once, and update it as each new message comes in. A Map can do this in O(ln(G) + ln(U)), where G is the number of groups and U is the number of users. For each user in a group, storing the last update time is O(1).
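
A rough sketch of approach 4, the buffer; all names are illustrative, and db.countBetween(from, to) is a hypothetical async helper backed by the database's dateCreated index:

var N = 1000; // buffer size per user per group

function UserBuffer() {
  this.recent = []; // newest createdDate first, at most N entries
  this.total = 0;   // lifetime message count (the "length" above)
}

UserBuffer.prototype.add = function (createdDate) {
  this.recent.unshift(createdDate);
  if (this.recent.length > N) this.recent.pop();
  this.total += 1;
};

// Count messages with createdDate >= since.
UserBuffer.prototype.countSince = function (since, db) {
  for (var i = 0; i < this.recent.length; i++) {
    if (this.recent[i] < since) return Promise.resolve(i); // buffer suffices
  }
  if (this.recent.length < N) return Promise.resolve(this.recent.length);
  // Buffer drained: still-older messages may qualify, so fall back to the DB.
  var inBuffer = this.recent.length;
  var oldest = this.recent[this.recent.length - 1];
  return db.countBetween(since, oldest - 1).then(function (older) {
    return inBuffer + older;
  });
};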

May the jung be with you!

jackhftang commented on July 26, 2024

I saw OpenShift also has a button for MySQL. I think MySQL is also a good choice: compact storage, low memory, and fast enough for our purpose.

jackhftang commented on July 26, 2024

Two more approaches:

  1. For each user in each group, maintain a queue that stores dateCreated. On a new message, append; on query, count only entries with dateCreated > now - 7 days. Similar to method 2, but the memory footprint depends on real usage. (A sketch follows below.)
  2. Similar to method 1, but with a table-based database. For each user in each group maintain a message count, and in the Message table add a sequence-number field and an index on dateCreated. On query, find the minimum dateCreated larger than the lower bound of the query time; count - sequence number gives the result. (PS: a relational DB index should support count queries anyway.)
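
A minimal sketch of approach 1 above (the queue); timestamps are assumed to be in seconds, and names are mine:

var WINDOW = 7 * 24 * 60 * 60; // 7 days

function UserQueue() {
  this.queue = []; // dateCreated values, oldest first
}

UserQueue.prototype.onMessage = function (dateCreated) {
  this.queue.push(dateCreated); // arrivals come in order, so it stays sorted
};

// Messages in the last 7 days; evicts anything older as a side effect.
UserQueue.prototype.count = function (now) {
  while (this.queue.length > 0 && this.queue[0] <= now - WINDOW) {
    this.queue.shift(); // fell out of the 7-day window
  }
  return this.queue.length;
};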

siutsin commented on July 26, 2024

I saw OpenShift also has a button for MySQL. I think MySQL is also a good choice: compact storage, low memory, and fast enough for our purpose.

For out-of-the-box options, PostgreSQL can also be considered.

[Screenshot: OpenShift database options, 2016-04-30]

siutsin commented on July 26, 2024

For each user in each group, maintain a queue that stores dateCreated. On a new message, append; on query, count only entries with dateCreated > now - 7 days. Similar to method 2, but the memory footprint depends on real usage.

I believe this is similar to #21. It is obviously the fast and simple way to implement the cache. However, there needs to be a mechanism to rebuild the cache every time the server restarts (e.g. deploying a new version, maintenance, etc.). We should also find a way to handle new incoming messages while the cache is being rebuilt (a MessageQueue...? sounds a bit overkill).

siutsin commented on July 26, 2024

Use a key-name hack on a key-value store, e.g. lmdb/rocksdb. For each user in each group, keep a counter; whenever a new message arrives, insert a new key message:group:username:counter with createdDate as the stored value, then count += 1 (or the reverse: message:group:username:createdDate as key, accumulated count as value). On query, we can binary search on the database (just like binary searching createdDate on an array, but without constant-time access). Performing this query for every user in a group should be O(number of users in that group * ln(number of messages in db) * ln(messages of that user in that group)). This should be fast (though a bit less predictable) and flexible: you can query the number of messages in ANY time range.

For privacy and disk-usage concerns, I would prefer not to store the message text in the database. Assuming we go for the group:username:counter approach, the table should look something like this:

key                    timestamp
groupId1:userId1:120   1462178759
groupId1:userId2:3     1462178780
groupId1:userId1:121   1462178788
groupId1:userId1:122   1462178793

I'm not sure how we would get the latest counter for a given user. Is something like getKey("groupId1:userId1:*") allowed? Sorry, I am not familiar with lmdb/rocksdb...
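
For illustration: sorted key-value stores answer this with a range seek to the end of the key prefix rather than a wildcard. A toy sketch of the idea over a plain sorted array (not a real lmdb/rocksdb API; note the counters must be zero-padded so string order matches numeric order):

var keys = [
  'groupId1:userId1:000120',
  'groupId1:userId1:000121',
  'groupId1:userId1:000122',
  'groupId1:userId2:000003',
]; // kept sorted, as a key-value store would keep its key space

function latestKeyWithPrefix(keys, prefix) {
  var upper = prefix + '\xff'; // sorts after every key sharing the prefix
  var lo = 0, hi = keys.length;
  while (lo < hi) { // binary search: first index >= upper
    var mid = (lo + hi) >> 1;
    if (keys[mid] < upper) lo = mid + 1;
    else hi = mid;
  }
  var candidate = keys[lo - 1];
  return lo > 0 && candidate.indexOf(prefix) === 0 ? candidate : null;
}

// latestKeyWithPrefix(keys, 'groupId1:userId1:') => 'groupId1:userId1:000122'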

Use slots: for each user in each group, keep an array of slots of some fixed duration; a slot is just an integer counter. Say 1-hour slots over 7 days: there are at most 7 * 24 = 168 slots, and for a user with no message in an hour there is no need to create a slot. A query scans all users in a group and sums all slots, taking O(number of users * number of slots); since users per group and slot count are small, that is at most 1500 x 168 = 252,000 additions, and scanning an array is pretty fast. Even at 1000 groups x 1500 users, 1000 x 1500 x 168 x 4 bytes ~= 1 GB is still not much memory. The drawback is accuracy only up to the precision of a slot (which should be okay in our use case).

  • Assume user.slots.length = 168 (one 1-hour slot for each hour in 7 days)
  • Assume user.lastSlotCreatedTimestamp is within the last hour
  • user.slots = [0, 0, 90, 50, ..., 30, 20, 10];

var HOUR = 60 * 60 * 1000; // slot width in ms

// Shift the window: for every full hour elapsed since the newest slot was
// created, drop the oldest slot (index 0) and append a fresh zero slot.
var rotateSlots = function (user, now) {
  while (now - user.lastSlotCreatedTimestamp >= HOUR) {
    user.slots.shift();
    user.slots.push(0);
    user.lastSlotCreatedTimestamp += HOUR;
  }
};

// get total number of msg for a user over the whole 7-day window
var getNumberOfMessage = function (user, now) {
  rotateSlots(user, now);
  var total = 0;
  for (var i = 0, l = user.slots.length; i < l; i++) {
    total += user.slots[i];
  }
  return total;
};

// add msg count to the newest slot
var addMsgCount = function (user, now) {
  rotateSlots(user, now);
  user.slots[user.slots.length - 1]++;
};
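
Hypothetical usage:

var user = {
  slots: new Array(168).fill(0),       // one counter per hour over 7 days
  lastSlotCreatedTimestamp: Date.now()
};
addMsgCount(user, Date.now());
getNumberOfMessage(user, Date.now()); // => 1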

I think this is a good approach, and the tradeoff is acceptable.

Directly use a segment tree: for each user in each group, keep a segment tree. This is similar to method 1 but independent of database size; however, it requires keeping every dateCreated in memory. At the current message count (< 1,000,000) that is not a problem, but at > 1,000,000,000 it is not suitable.

As this approach is not very future-proof (it will cause the same problem again eventually), I guess we should avoid it.

Buffer, an optimistic approach: instead of keeping the whole createdDate array, keep only the last N elements plus the total length; adding a new element is just length += 1. A query pops from the buffer until it reaches the queried range. If the buffer drains, fall back on the database index, recalculate the length, and take another N elements from the database.

This is similar to the slot approach...?

Thanks so much for sharing these techniques; it really refreshed my algorithms memory.
May the jung be with you!

jackhftang commented on July 26, 2024

BTW, there are actually two ways to implement the slot approach: one is faster but uses more memory (fixed memory per user per group); the other is slower but uses less (dynamically allocated slots).

Also, if a very low memory footprint is required, we can just dump the data as JSON to the hard disk periodically, saving the memory, CPU time, and data-exchange time of a database. Using a save-unlink-rename approach (write the new file, then swap it in) should be low-risk enough for this project.
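
A minimal sketch of that save path (file names are arbitrary):

// Write the snapshot to a temp file, then rename over the old one;
// rename is atomic on POSIX, so a crash never leaves a half-written file.
var fs = require('fs');

function saveSnapshot(data, path) {
  var tmp = path + '.tmp';
  fs.writeFileSync(tmp, JSON.stringify(data));
  fs.renameSync(tmp, path);
}

// saveSnapshot(cache, 'jung-cache.json');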

jackhftang commented on July 26, 2024

The question is whether there will ever be a feature like an annual jung count, in which case method 1 (sorted array/database + binary search) would be better, since it supports queries over any time period. If we are sure we only want to support up to 1 week, then method 2 (slot-based) would be better; it can be implemented with a very low memory footprint and high speed.

siutsin commented on July 26, 2024

For method 1, the key point is: for each user in each group, keep an array-like data structure (an in-memory array, a key-value store, or a table with a rank key) that stores timestamps in sorted order.

For example, user X in group Y has the following array of timestamps: [1462032000, 1462032011, 1462032012, 1462032015, 1462032020, 1462032030].

When a new message comes in, appending it keeps the array sorted (timestamps only increase). To query, say we want to know how many messages that user sent between 1462032010 and 1462032025; we binary search twice:

  1. Search for the smallest index whose value is larger than 1462032010; in the above case it is index 1, corresponding to 1462032011.
  2. Search for the largest index whose value is smaller than 1462032025; in the above case it is index 4, corresponding to 1462032020.
    The number of messages the user sent between 1462032010 and 1462032025 is then the result of step 2 minus the result of step 1, plus 1: 4 - 1 + 1 = 4.

If we keep the whole history for each user in each group, we can answer ANY time period in O(ln(N)) by binary search.

The cache I submitted in #21 is implemented exactly this way, using an in-memory array. It does NOT use the slot approach or insertion sort.

Totally understand now; a smart way of counting using the indices.

siutsin commented on July 26, 2024

The question is whether there will ever be a feature like an annual jung count, in which case method 1 (sorted array/database + binary search) would be better, since it supports queries over any time period. If we are sure we only want to support up to 1 week, then method 2 (slot-based) would be better; it can be implemented with a very low memory footprint and high speed.

It is a hard decision for me, as the annual Jung count is part of the development roadmap.
If we don't need the annual Jung count, there is no point in storing any data older than 7 days.

But it also doesn't make sense to accept a huge trade-off just to keep that data, as I can see the growth rate has been climbing steeply over the recent days.

[Screenshot: message growth chart, 2016-05-02]

siutsin commented on July 26, 2024

318 active groups in the past 7 days. I guess the time-slot-based approach is more reasonable for now. What do you think?

[Screenshot: active groups in the past 7 days, 2016-05-03]

jackhftang commented on July 26, 2024

I think the most critical question to ask is: how many timestamps are we going to store, and will they exceed memory?

If you just want to keep messages from the past 7 days then, setting aside the worst case, in practice I think even 512 MB should be enough for either the slot-based or the binary-search-based approach. And we can use JSON as the store (not running a database leaves more memory for us).

If you definitely want to store more messages than memory can handle, you would need to decide which database/store to use and how (another topic to discuss). Also, implementing the binary-search approach would subsume the slot-based one; I see no sound reason to use both at the same time.

Let's do a bit of calculation: assume 1000 groups, each with 2000 messages a day, for 7 days, keeping just a 4-byte timestamp per message. 1000 * 2000 * 7 * 4 bytes = 56 MB. There may be some overhead, but assuming < 100 MB, I think that is reasonable in practice. Running just linux + node + data, I guess 512 MB can handle it (okay, there is swap, right?).

The slot approach, on the other hand, takes at most 24 x 7 x 4 bytes per user per group, i.e. for 1000 groups of 1500 users,
1000 * 1500 * 24 * 7 * 4 bytes ~= 1 GB. That is the worst case; in practice, most users do not say a word in 7 days. The dynamically-allocated-slot variant would use far less memory, but less predictably. If we really want to prepare for the worst case, we would need a larger slot (8 hours or even 1 day).

siutsin commented on July 26, 2024

Growth rate update

[Screenshot: message growth chart, 2016-05-03]

nothize commented on July 26, 2024

It is a hard decision for me, as the annual Jung count is part of the development roadmap.
If we don't need the annual Jung count, there is no point in storing any data older than 7 days.

How about:

  1. Focus on supporting data within 7 days.
  2. For the annual Jung count, store an aggregated sum... although the number can be less accurate after any incident, as a yearly figure the possible error may be acceptable?
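
A minimal sketch of option 2 (names are mine): keep a plain running counter per user per group alongside the 7-day structure, and persist it with the regular snapshot.

var annual = {}; // 'groupId:userId' -> message count for the year

function onMessage(groupId, userId) {
  var key = groupId + ':' + userId;
  annual[key] = (annual[key] || 0) + 1;
}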

jackhftang commented on July 26, 2024

Method 5: count-based slot.

Instead of counting the number of messages in a fixed time period, a slot counts a fixed number of messages and records the first and last date created (only two fields: startTimestamp and endTimestamp).

For example, when a new message comes and currently counter = 0, record startTimestamp and do counter += 1; when the counter reaches, say, 5, record endTimestamp, push the slot to the level structure (described below), and reset the counter and startTimestamp.

Then create levels of slots; each level can store only N slots. When a level is full, record the startTimestamp of its first slot and the endTimestamp of its last slot and push that merged slot to the next level. A slot in level 0 represents 5 messages; in level 1, one slot represents 5*N messages. In general, one slot in level L represents M*N^L messages (in our example M = 5) and has two fields (start, end).

This works like a positional number system: when you count past 9, you carry to the next level, and when that level is full, you carry to the level after it. In the amortised sense, adding a new message takes O(ln(N)).

Storage per user per group is O(ln(N)), where N is the number of messages of that user in that group. A query requires a linear walk over all slots, which is O(ln(N)). The accuracy is a bit hard to state precisely: each slot can only tell you that there are X messages between its startTimestamp and endTimestamp (so a query boundary falling inside a slot has to be estimated), and endTimestamp - startTimestamp depends on how frequently the user speaks. Roughly: the closer the query range is to now, the more accurate; the more frequently the user speaks, the more accurate.

In general, the count-based slot is an all-round data structure: O(ln(N)) storage and O(U * ln(N)) query, trading away accuracy. In comparison, the binary-search-based approach is O(N) storage and O(U * ln(N)) query with exact accuracy; the time-based slot is O(U) storage (where U is the number of users) and O(U) query, but only over a bounded time period.
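
A rough sketch of the idea (M = 5 messages per level-0 slot, N = 4 slots per level; all names are mine):

var M = 5; // messages represented by one level-0 slot
var N = 4; // slots a level holds before carrying

function CountSlots() {
  this.counter = 0;  // messages in the slot currently being filled
  this.start = null; // startTimestamp of that open slot
  this.levels = [];  // levels[L] = array of { start, end, count }
}

CountSlots.prototype.add = function (ts) {
  if (this.counter === 0) this.start = ts;
  this.counter += 1;
  if (this.counter === M) { // open slot is full: push it to level 0
    this.pushSlot(0, { start: this.start, end: ts, count: M });
    this.counter = 0;
  }
};

// When a level fills up, merge its N slots into one (first start, last end,
// summed count) and carry it to the next level, like carrying in a number system.
CountSlots.prototype.pushSlot = function (L, slot) {
  if (!this.levels[L]) this.levels[L] = [];
  this.levels[L].push(slot);
  if (this.levels[L].length === N) {
    var full = this.levels[L];
    this.levels[L] = [];
    this.pushSlot(L + 1, {
      start: full[0].start,
      end: full[N - 1].end,
      count: full.reduce(function (s, x) { return s + x.count; }, 0),
    });
  }
};

// Estimated number of messages since `since`: slots lying entirely inside the
// range count fully; a slot straddling the boundary is the source of error
// (here it is simply skipped, giving a lower estimate).
CountSlots.prototype.countSince = function (since) {
  var total = this.counter; // open slot: the newest messages
  this.levels.forEach(function (level) {
    level.forEach(function (slot) {
      if (slot.start >= since) total += slot.count;
    });
  });
  return total;
};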

jackhftang commented on July 26, 2024

The current growth rate is 751915 - 687012 = 64903 msg / 18 hours < 4000 msg/hr = 96,000 msg/day ~= 35,040,000 msg/year.

Since more groups may start using the jung bot, and assuming the number of messages grows proportionally, and that jung power will last FOREVER umm... for safety I guess we need to multiply this number by 300; say we need to design a system that can quickly query ~10,000,000,000 messages.

May the jung be with you~

nothize commented on July 26, 2024

@jackhftang have you implemented any benchmark or volume testing code?

That way, any of the proposed methods can be tried out and compared against the reference implementation (the current one used by jung2) when needed.

jackhftang commented on July 26, 2024

@jackhftang have you implemented any benchmark or volume testing code?

See #21: it is a binary-search-based, synchronous, in-memory array implementation. The memory exception at 10^7 is probably due to keeping the objects around for testing; in practice it just needs to call addMessage() and can then throw the message away. I expect it can go above 10^8 messages.

If there is enough memory to store the timestamps of all messages, you can just wrap it behind the web interface and use it as-is. To go well beyond what memory can handle, it needs to be modified to be async and to use rocksdb as a backend.

The other methods are not implemented.

jackhftang commented on July 26, 2024

Then create levels of slots; each level can store only N slots. When a level is full, record the startTimestamp of its first slot and the endTimestamp of its last slot and push that merged slot to the next level. A slot in level 0 represents 5 messages; in level 1, one slot represents 5*N messages. In general, one slot in level L represents M*N^L messages (in our example M = 5) and has two fields (start, end).

Just found that it is not necessary to merge all N slots at once. For example, each level can have at most 24 slots as head and 20 slots as tail; whenever the 24-slot head is full, merge only 4 of them. This guarantees that if the query counts messages from now back to some time point, the error is at most +/-5%.

However, I think this is still not the ultimate data structure for this problem. We need a way to combine all users into one data structure, not a separate one per user.

siutsin commented on July 26, 2024

If you definitely want to store more messages than memory can handle, you would need to decide which database/store to use and how (another topic to discuss). Also, implementing the binary-search approach would subsume the slot-based one; I see no sound reason to use both at the same time.

#26 is related to the need to decide which database/store to use and how.

siutsin commented on July 26, 2024

Usage update

[Screenshots: usage stats, 2016-05-07]

siutsin commented on July 26, 2024

The current growth rate is 751915 - 687012 = 64903 msg / 18 hours < 4000 msg/hr = 96,000 msg/day ~= 35,040,000 msg/year.

Since more groups may start using the jung bot, and assuming the number of messages grows proportionally, and that jung power will last FOREVER umm... for safety I guess we need to multiply this number by 300; say we need to design a system that can quickly query ~10,000,000,000 messages.

As jungPremierLeague has been shut down, I believe the growth rate should be lower than that now.

jackhftang commented on July 26, 2024

However, I think this is still not the ultimate data structure for this problem. We need a way to combine all users into one data structure, not a separate one per user.

Finally, I think the ultimate solution is simple: add a map (id-to-messageCount) to each count-based slot, and add a map (again, id-to-messageCount) to each level as a cache reflecting that level's totals, so a query only needs to walk the last, partially-filled level. With this we can feed the message timestamps of all users in all groups, plus per-group messages, into one data structure.

The count-based-slot approach is flexible in terms of the trade-off between space, speed, and accuracy. Generally, there are three kinds of implementation.

Some notation:
let nSlot(L) be the number of slots in level L;
let nMsg(L) be the number of messages that one slot represents in level L;
let query(t) be the pair of counts over [t1, now] and [t2, now], where t1 <= t <= t2 are the slot boundaries bracketing t.

Accuracy of query(t) is measured by the difference between the two counts divided by the true total.

Type 1: nSlot(L+1) = 2 * nSlot(L) and nMsg(L+1) = nMsg(L) * 2^L.
For example, the first level has at most 6 (4+2) slots and merges the last 2; the second level has 12 (8+4) slots and merges the last 4; the third has 24 (16+8), merging the last 8; the fourth (32+16), merging the last 16; ...

The growth rate is very fast: a slot in level 8 represents 536,870,912 messages, a slot in level 9 represents 137,438,953,472, and a slot in level 10 represents 70,368,744,177,664... so it is pretty certain it NEVER reaches level 10. Surprisingly, despite the huge growth rate, the error of query(t) is always bounded by a fixed percentage.

This is the most space-efficient, and the query is very fast (thanks to the small number of levels): O(level + 2^level).

Type 2: nSlot(L+1) = 2 * nSlot(L) and nMsg(L+1) = nMsg(L) * 2.
For example, the first level has at most 6 (4+2) slots and merges the last 2; the second level has 10 (8+2) slots and merges the last 2; the third has 18 (16+2), merging the last 2...

The growth rate is plain exponential, and the accuracy increases exponentially, i.e. the more messages there are, the more accurate it gets!!

This is still fairly space-efficient, but the query may not be fast: O(level + 2^level).

Type 3: nSlot(L) = a constant X and nMsg(L+1) = nMsg(L) * a fixed factor Y.
This is the original idea. It is still reasonably space-efficient (the number of levels is still log(N)), and it is faster than type 2: O(level + X). The accuracy also increases, but more slowly than type 2.
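
For reference, these recurrences have simple closed forms; assuming nMsg(0) = 2, which is what the level-8 and level-9 figures for type 1 imply:

Type 1: nMsg(L) = nMsg(0) * 2^(L(L-1)/2), so nMsg(8) = 2^29 = 536,870,912, nMsg(9) = 2^37 = 137,438,953,472, and nMsg(10) = 2^46 = 70,368,744,177,664.
Type 2: nMsg(L) = nMsg(0) * 2^L.
Type 3: nMsg(L) = nMsg(0) * Y^L.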

siutsin commented on July 26, 2024

In practice, it takes more than half an hour to query the top ten for all groups and build the cache data in memory from OpenShift's bundled MongoDB. After moving the cache DB to an external Mongo instance, memory usage is still constantly > 300 MB out of 512 MB.

Solution:

  1. Query cache DB; clean up entries after seven days.
  2. Persistence DB.

Observed performance at 6 pm:

  1. Query all groups (6:00 pm)
  2. Send off-duty message (6:00 pm)
  3. Send top ten to each group (6:00 pm)
