Streaming Data Algorithm

This repo includes three algorithms: Bloom filtering, Flajolet-Martin algorithm, and reservoir sampling

1. Bloom Filter

Bloom Filtering algorithm is to estimate whether the city of a business in business_second.json has shown before in business_first.json. I used the following hash function to hash the city name:

where a, b is randomly picked large number, m is the length of bit array, which is just length of the bloom filter. Note: the representation of bloom filter is a not a list 1 or 0, but instead a set of indices of all 1's. This way we can save some spaces.

2. Flajolet-Martin algorithm

Uses generate_stream.jar to simulate the data stream to work with. The implementation listens to port number 9999 locally. FM algorithm estimate the number of unique cities within a window in the data stream by the following steps:

Multiple hash functions are used to improve the estimation accuracy. The standard way to do this is shown below

But for simplicity, instead of using kmeans to group multiple estimation result, I manually set a number of group to simulate the grouping process, turns out this solution works.

Results:

Time                  Ground Truth  Estimation
2021-04-23 14:45:40   24            32
2021-04-23 14:46:33   42            50
2021-04-23 14:48:46   62            59

3. Reservoir Sampling/Fixed Size Sampling

This implementation uses Twitter API of streaming to implement the fixed size sampling method (Reservoir Sampling Algorithm) and find popular tags with the top 3 frequencies on tweets based on the samples. Assuming that the memory can only save 100 tweets, we need to use the fixed size sampling method to only keep part of the tweets as a sample in the streaming.

The idea of Reservoir Sampling is to keep a fixed size sample to work with, and either keep/replace existing element in the sample list with newly arrived element or discard the new element:

I only take tweets with hashtags into account when doing Reservoir Sampling, that is, the hashtag will be counted. Newly arrived hashtag will be kept with the probability of 100/n, where n is the number of the newly arrived hashtag's index.

For example, the 101st element will be kept with a probability of 100/101, which is pretty high.

Sample data is obtained by Twitter's API. The program is set to listen to the tweets with following keywords:

TOPIC_LIST = ['COVID19', 'SocialDistancing', 'StayAtHome', 'Trump', 'Quarantine', 'CoronaVirus', 'China', 'Wuhan', 'Pandemic']

Results:

...

The number of tweets with tags from the beginning: 59
COVID19 : 16
Covid19 : 6
China : 4

The number of tweets with tags from the beginning: 60
COVID19 : 17
Covid19 : 6
China : 3
India : 3
SOSIYC : 3
Unite2FightCorona : 3

...

artisan1218 / streaming-data-algo Goto Github PK

streaming-data-algo's Introduction

Streaming Data Algorithm

1. Bloom Filter

2. Flajolet-Martin algorithm

3. Reservoir Sampling/Fixed Size Sampling

streaming-data-algo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent