Code Monkey home page Code Monkey logo

wikiracer's Introduction

Wikiracer

This is my attempt at a CLI wikiracer in python.

Overview

In order to quickly find an intersection between the source and destination, I'm using a bidirectional BFS, which is significantly more efficient than a standard BFS for this. The core component of this solution is the BFSBranch class, which represents a single direction of the BFS.

In the main file, I create an event loop to allow me to send off many concurrent API requests. I alternate between requesting pages from the source and destination branches. Instead of requesting links from all pages in a branch's queue, I instead request the first n pages, where n is the number of worker threads I have. This gives us a huge speedup by giving us a basic round robin scheduler between the two branches, instead of completing one branch's queue before the other. When I initially tried this, I was finishing each branch's queue before moving on to the other, but that caused issues where one branch could be much larger than the other, causing the algorithm to get stuck trying to finish that side of the search before it could continue searching from the other side. After each time I request link data, I compare the caches of the two branches to see if an intersection exists. If it does, I know that the path from source to destination can be completed. The cache data in each branch allows me to reconstruct the path from the initial page to the intersection page, so I can generate the entire path trivially.

Instructions

Python >= 3.4

pip install -r requirements.txt to install dependencies.

./wikiracer.py "Source Page" "Target Page" to race. You can modify the number of concurrent requests with the --workers flag.

./wikiracer.py --help to see usage.

Strategies I tried

I initially wrote a standard BFS to traverse the tree. While this was able to eventually find a path, it was not very efficient (Segment->The game in 4:30). I was able to significantly speed up my code by batching API requests to get links for 50 pages at a time and making the BFS bidirectional (Segment->The game in 0:03). However, since I was batching the API requests, I wasn't able to track parent/child relationships for encountered links, since the links could be from any of the provided parents. I had to revert to requesting a single page's information at a time, which came with a pretty decent slowdown (Segment->The game in 0:13). I had never really worked with concurrency in Python before, so I had to wrap my head around the new (since 3.4) asyncio library, as well as the requests_futures module. After a lot of effort, I was able to make HTTP requests in parallel, while doing some parsing as a callback provided to requests_futures.session.get(). Once the entire batch of requests had completed, I processed the new page data and added the necessary pages to the queue and started the whole process again. This brought me to my fastest time, and was in a state where I was fairly confident I couldn't speed it up much more (Segment->The game in 0:05).

How long I spent (roughly)

  • Basic BFS and API wrapper - 1 hour
  • Batching requests - 1 hour
  • Caching pages to calculate path - 2 hours
  • Initial hacky single-file async solution - 4 hours
  • Design of BFSBranch class with new async architecture - 1 hour
  • Refactoring to create BFSBranch class - 1 hour
  • Cleanup - 1 hour
  • Documentation/tests/README - 2 hours

wikiracer's People

Contributors

cbrenton avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.