Code Monkey home page Code Monkey logo

mids-mumbler's Introduction

mids-mumbler

Akka-based implementation of a Markov chain generator from Google n-gram data sets. This is an implementation of the Mumbler assignment in the MIDS Scaling Up! Data Science Curriculum.

System Building and Launch

Preconditions

  • Obtain a recent version of SBT (as of this writing, v.1.0.2 on Linux was used, it was fetched from http://www.scala-sbt.org/download.html)
  • Configure network connectivity between each box that will host the system and ensure that the configured hostnames are resolvable (see example invocation below for use of ports and hostnames given at invocation)

Steps

  • Build with SBT: sbt assembly

  • Upload ./agent/target/scala-2.12/mids_mumbler_agent-assembly-0.1.0.jar to remote nodes (perhaps gpfs1, gpfs2, and gpfs3). Upload ./mumbler/target/scala-2.12/mids_mumbler-assembly-0.1.0.jar to a system from which you can make websocket requests and that has network access to the remote nodes (this hosts the websocket API). Upload the ui directory to this system also.

  • Create a word censorship list at location DATADIR/badwords.txt (see next step for the use of this variable). If you prefer not to censor output, an empty badwords file is acceptable.

  • Start mumbler agent on each node:

    DATADIR=/vol/ngram/gpfs1 java -Dakka.remote.netty.tcp.hostname="gpfs1" -Dakka.remote.netty.tcp.port="5441" -jar ./agent/target/scala-2.12/mids_mumbler_agent-assembly-0.1.0.jar
    DATADIR=/vol/ngram/gpfs2 java -Dakka.remote.netty.tcp.hostname="gpfs2" -Dakka.remote.netty.tcp.port="5442" -jar ./agent/target/scala-2.12/mids_mumbler_agent-assembly-0.1.0.jar
    DATADIR=/vol/ngram/gpfs3 java -Dakka.remote.netty.tcp.hostname="gpfs3" -Dakka.remote.netty.tcp.port="5443" -jar ./agent/target/scala-2.12/mids_mumbler_agent-assembly-0.1.0.jar
    
  • Execute API launcher, providing configuration for the number of n-gram source files to process (100), an address and port to which to bind a websocket API (0.0.0.0:8080), and the hostname and address of each remote agent (gpfs1:5442...). The envvar points to the ui directory in the project root:

    MARKOV_UI=$PWD/ui; java -Dakka.remote.netty.tcp.hostname="api" -Dakka.remote.netty.tcp.port="2552" -jar ./mumbler/target/scala-2.12/mids_mumbler-assembly-0.1.0.jar 100 0.0.0.0:8080 gpfs1:5442 gpfs2:5442 gpfs3:5442
    

On first execution of the API launcher, the remote agents will download and preprocess input files as they are streamed (the full corpus is distributed among the agents). This means that the first invocation will take approximately 90 minutes to be ready to serve requests (if you'd like to test the system with fewer source files, replace the quantity "100" in the above invocation with a smaller value). On each subsequent invocation, the agents will report that the files have already been processed.

Only after the full data set is fetched and processed will the API be available to queries.

Note: The organization of data by this program can really eat inodes on an FS. If writing files to GPFS w/ 3x25GB clustered disks, you need to create the filesystem with a lot of inodes, e.g.: mmcrfs gpfsfpo -F /root/diskfile.fpo -A yes -Q no -r 1 -R 1 --inode-limit 5M.

System Use

This system serves a websocket API. Given a chain length limit integer and a chain seed word the system will build a Markov chain using remote agents. Each word added to the chain will be published to the websocket. When the chain is fully-formed the socket will be closed.

Example invocation where "20" is the chain limit and "fruit" is the seed word:

mdye@heidegger:~[100043]# wscat --connect "ws://localhost:8092/chain/20/seed/fruit"
connected (press CTRL+C to quit)
< fritters
< served
< much
< departed
< leaves
< disappear
< easily
< perceived
< unless
< storage
< receptacles
< could
< recreate
< only
< purveyors
< was
< transgressed
< my
< neck
disconnected

A query UI is available if you browse to http://localhost:8092/ui/. The UI accepts parameters to automatically perform searches and limit the chain length, for example: http://localhost:8092/ui/?searchinterval=6000&chainmax=60.

Logging output from the nodes and API runtimes varies. The API logs will print chain words as they are selected and the complete chain upon termination:

...
13:31:06.469 [main] INFO mumbler.Launch$ - API listening on 0.0.0.0:8092
13:31:06.482 [Mumbler-akka.actor.default-dispatcher-4] INFO mumbler.API - API server binding complete
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] fritters
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] served
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] much
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] departed
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] leaves
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] disappear
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] easily
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] perceived
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] unless
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] storage
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] receptacles
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] could
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] recreate
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] only
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] purveyors
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] was
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] transgressed
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] my
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] neck
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] Exiting b/c reached requested max chain length, 20
[INFO] ... [akka.tcp://Mumbler@api:2552/user/StreamSupervisor-0/flow-2-1-actorPublisherSource] Chain: fruit fritters served much departed leaves disappear easily perceived unless storage receptacles could recreate only purveyors was transgressed my neck

mids-mumbler's People

Contributors

michaeldye avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Forkers

kutsumigit dyejon

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.