Code Monkey home page Code Monkey logo

nserc_project's Introduction

NoSQL Systems Design for Real-Time Data Analytics

This research project is generously supported by the NSERC USRA, and supervised by Dr. Oana Balmau. It is built upon existing work on RocksDB developed by Facebook.

Motivation

In our modern era of technology, the goal of most (if not all) online platforms is to provide a customized experience to their users, by suggesting relevant items to adjust to their preferences. These recommendation systems ingest enormous amounts of information at a very high rate, and require a datastore that can simultaneously provide:

  1. High write throughput to ingest the incoming user events.
  2. High read throughput and low read latency to efficiently merge new information with the old data.
  3. Persistent storage, since the amount of incoming data is large and is growing continuously.

It is achieving all three objectives simultaneously that is a challenge.

Project Description

To address the challenge of designing an optimal recommendation system that satisfies the three requirements (high write throughput, high read throughput, persistent storage), this project will build upon Facebook's existing RocksDB, which uses log-structured merge trees. LSM trees are a popular data structure for write-optimized, persistent NoSQL datastores, but are not optimized for reads and provide little support for complex data structures like those in recommendation systems.

This project aims to augment the LSM tree that RocksDB uses, to support real-time recommendations on fresh data, and meet the three requirements of an effective modern recommendation system.

Phase 1 - Implementing and Running a Representative Benchmark

To simulate a recommendation system for a social media platform, there are four "maps" (distinguished using different column families in RocksDB), which represent the different key-value pairs that store the data generated during a benchmark:

Map Key Value*
1) Picture Annotations ImageID Set[Annotation]
2) Picture User Time Series UserID Map{ImageID → Map{ActionID → (List[TimeStamp], Counter)}}
3) User Annotation Time Series UserID Map{AnnotationID → Map{ActionID → (List[TimeStamp], Counter)}}
4) User Annotation Scores (UserID, AnnotationID) Map{ActionID → Score}

*In Phase 1, these maps are represented using RocksDB's Slice structure.

The data of interest collected from the benchmark is:

  • Average throughput of using all maps
  • Average read/write latency of all maps
  • 50th and 99th percentile read/write latency of all maps

The following is an example figure generated using the benchmark:

Map2 99th Percentile Figure

Phase 2 - Modifying the LSM Tree used by RocksDB

To augment the underlying LSM tree, RocksDB needs to be modified to support data structures other than just the Slice. The benchmark will then be re-run using these new data structures to compare the read/write performance against that of using a Slice.

Currently in progress!

Relevant Files

The following table highlights the files that have been modified/created for this project and their purpose.

File Status Description
tools/db_bench_tool.cc Modified Implemented custom benchmark to simulate social media workload.
util/json_serializer.cc Created* Created functions to stringify all four custom maps.
util/zipf.cc** Created* Created functions to generate random number between specified range using Zipfian distribution.
CMakeList.txt + src.mk Modified Added created files to enable Makefile to target.

* Corresponding header files have also been created for these files.
** Original author: Oana Balmau (link to code).

nserc_project's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.