Code Monkey home page Code Monkey logo

streaming-data-algo's Introduction

Streaming Data Algorithm

This repo includes three algorithms: Bloom filtering, Flajolet-Martin algorithm, and reservoir sampling


1. Bloom Filter

Bloom Filtering algorithm is to estimate whether the city of a business in business_second.json has shown before in business_first.json. I used the following hash function to hash the city name:

where a, b is randomly picked large number, m is the length of bit array, which is just length of the bloom filter. Note: the representation of bloom filter is a not a list 1 or 0, but instead a set of indices of all 1's. This way we can save some spaces.


2. Flajolet-Martin algorithm

Uses generate_stream.jar to simulate the data stream to work with. The implementation listens to port number 9999 locally. FM algorithm estimate the number of unique cities within a window in the data stream by the following steps:

Multiple hash functions are used to improve the estimation accuracy. The standard way to do this is shown below

But for simplicity, instead of using kmeans to group multiple estimation result, I manually set a number of group to simulate the grouping process, turns out this solution works.

Results:

Time                  Ground Truth  Estimation
2021-04-23 14:45:40   24            32
2021-04-23 14:46:33   42            50
2021-04-23 14:48:46   62            59

3. Reservoir Sampling/Fixed Size Sampling

This implementation uses Twitter API of streaming to implement the fixed size sampling method (Reservoir Sampling Algorithm) and find popular tags with the top 3 frequencies on tweets based on the samples. Assuming that the memory can only save 100 tweets, we need to use the fixed size sampling method to only keep part of the tweets as a sample in the streaming.

The idea of Reservoir Sampling is to keep a fixed size sample to work with, and either keep/replace existing element in the sample list with newly arrived element or discard the new element:

I only take tweets with hashtags into account when doing Reservoir Sampling, that is, the hashtag will be counted. Newly arrived hashtag will be kept with the probability of 100/n, where n is the number of the newly arrived hashtag's index.

For example, the 101st element will be kept with a probability of 100/101, which is pretty high.

Sample data is obtained by Twitter's API. The program is set to listen to the tweets with following keywords:

TOPIC_LIST = ['COVID19', 'SocialDistancing', 'StayAtHome', 'Trump', 'Quarantine', 'CoronaVirus', 'China', 'Wuhan', 'Pandemic']

Results:

...

The number of tweets with tags from the beginning: 59
COVID19 : 16
Covid19 : 6
China : 4

The number of tweets with tags from the beginning: 60
COVID19 : 17
Covid19 : 6
China : 3
India : 3
SOSIYC : 3
Unite2FightCorona : 3

...

streaming-data-algo's People

Contributors

artisan1218 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.