MIDA: A Tool for Measuring the Web

MIDA is a general-purpose tool for web measurement projects. It is built in Go on top of Chrome/Chromium and the DevTools protocol, which gives it a realistic vantage point for studying the web and fine-grained access to the information exposed by Chrome Developer Tools.


Getting Started

Getting started with MIDA is easy! First, install:

$ wget files.mida.sprai.org/setup.py
$ sudo python3 setup.py 

Now we are ready to visit a site and collect some data:

$ mida go example.org

You can find the results of your crawl in the results/ directory.

Easy At-Scale Crawling

A major benefit of MIDA is that it can run large-scale, highly configurable crawls without requiring you to write your own crawler code. Here is an example of a single MIDA command that crawls the Alexa Top 100K and gathers a few specific types of data:

$ mida go -f https://files.mida.sprai.org/toplists/alexa.lst -n100000 -c8 --all-resources --screenshot --dom

Breaking this down by argument:

-f https://files.mida.sprai.org/toplists/alexa.lst: The list of sites to crawl, here a list of the Alexa Top Websites. The list can be a local file or one hosted on the web.

-n100000: Read the top 100,000 entries from the list

-c8: Run with 8 parallel crawlers (browser instances)

--all-resources: Gather all of the actual files/resources required to render the web page. Beware, this takes a lot of space!

--screenshot: Capture a screenshot of each website once its load event fires (no screenshot is taken if the event never fires).

--dom: Capture a JSON representation of the DOM for each website visited.
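The -n and -c options together describe a familiar pattern: take the first N entries of a list and spread them across C parallel workers. The Go sketch below illustrates that pattern only. It is not MIDA's implementation; the names firstN and crawlAll are hypothetical, and the visit function is a stand-in for launching a browser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"sync"
)

// firstN returns up to n non-empty lines from a newline-separated list
// (the role played by -n in the command above).
func firstN(list string, n int) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(list))
	for len(out) < n && sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			out = append(out, line)
		}
	}
	return out
}

// crawlAll runs visit on every site using c concurrent workers
// (the role played by -c) and returns the number of completed visits.
func crawlAll(sites []string, c int, visit func(string)) int {
	if c < 1 {
		c = 1 // always keep at least one worker so the job queue drains
	}
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for i := 0; i < c; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range jobs {
				visit(s)
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}
	for _, s := range sites {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
	return done
}

func main() {
	list := "example.org\nexample.com\n\nexample.net\n"
	sites := firstN(list, 2)
	n := crawlAll(sites, 8, func(s string) { fmt.Println("visiting", s) })
	fmt.Println("visited", n, "sites")
}
```

With more workers than sites, as in main above, the extra workers simply exit when the job channel is closed.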

Contributors

nikitaborisov, pmurley

mida's Issues

Auto-Dismiss JavaScript Popups

MIDA needs code that handles JavaScript dialog events and dismisses the dialogs automatically. This could be made configurable later, but for now it seems logical to dismiss everything.

Add mobile emulation

MIDA should offer emulation of mobile devices for crawling. I believe DevTools has this functionality out of the box, so it only needs to be applied and exposed as a task option.

Browser Closes on MIDA SIGTERM

MIDA gracefully handles SIGTERM, trying to complete running tasks and then exiting. However, chromedp passes the SIGTERM signal through to the browsers, causing browsers to exit instantly when the signal is sent.

Throttle crawl starts within a given period

MIDA needs throttling logic to prevent a burst of crawls from starting simultaneously when many new tasks become available at the same moment. Starting many crawls at once puts unnecessary load on the system and provides no real benefit.
