site-sonar's People

Contributors

francescostl, purukaushik


site-sonar's Issues

Fix XHR Memory Leaks

Currently our XMLHttpRequest is defined globally. It should be scoped to each request to prevent memory leaks across multiple requests.
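A minimal sketch of the fix, with a hypothetical `timeAsset` function standing in for our actual benchmark code: construct the `XMLHttpRequest` inside the function, so each instance becomes garbage-collectable once its handlers fire.

```js
// Scope the XHR to each call instead of reusing one global instance,
// so old handlers and response bodies don't stay reachable.
// `timeAsset` and `onDone` are hypothetical names.
function timeAsset(url, onDone) {
  var xhr = new XMLHttpRequest();
  var start = Date.now();
  xhr.onload = xhr.onerror = function () {
    onDone(url, Date.now() - start, xhr.status);
  };
  xhr.open('GET', url);
  xhr.send();
}
```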

Anonymize HTTP Header Info

We'd like to anonymize certain data sent in the HTTP headers which could be used to identify users in the event of a security issue on our server. Below is an example of the data sent between the client and our server. We should likely anonymize the user agent and accept language by setting one standard value for each.

[Screenshot: example HTTP headers sent between the client and our server]
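One way this could be done, assuming we intercept our own uploads with the WebExtensions `webRequest` API; the endpoint URL and the generic header values below are placeholders, not decided-on strings:

```js
// Rewrite identifying headers on requests to our collection server.
// Requires the "webRequest" and "webRequestBlocking" permissions.
const GENERIC_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0';
const GENERIC_LANG = 'en-US,en;q=0.5';

browser.webRequest.onBeforeSendHeaders.addListener(
  function (details) {
    for (const header of details.requestHeaders) {
      const name = header.name.toLowerCase();
      if (name === 'user-agent') header.value = GENERIC_UA;
      if (name === 'accept-language') header.value = GENERIC_LANG;
    }
    return { requestHeaders: details.requestHeaders };
  },
  { urls: ['https://our-server.example/*'] },  // placeholder endpoint
  ['blocking', 'requestHeaders']
);
```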

Develop as Firefox Add-on to Track All HTTP Requests

The current method for collecting potential HTTP requests and testing their speed seems flawed: because we are not loading the JavaScript on-page when we make the requests, we are likely missing requests. The original reasoning behind the current method was that browser automation is expensive, especially when crawling hundreds (if not thousands) of pages.

Implement Automated Navigation

We'd like to automatically navigate between pages so the user can install this add-on and let it do its thing while benchmarking.

In order to complete this, I'll need to identify when a page is truly "loaded", i.e. when all our potential ad scripts have been requested and received. This is not as simple as calling onLoad.

My current approach: at the point where onLoad would be triggered (using webRequest.onCompleted), note the number of blocked requests we've sent out, then wait for them to be received back. The issue is that even this triggers far too soon. Websites like cnn.com, which normally have hundreds of requests, will only actually log two; sites like ksdk.com, which can often have upwards of 1,000, reduce to a mere 10 blocked requests that we can track.
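For reference, the counting mechanism looks roughly like this; it tracks outstanding ad requests per tab but does not yet solve the too-early trigger described above. `isAdRequest` and `onAllAdsSettled` are hypothetical names.

```js
const inFlight = new Map();  // tabId -> number of pending ad requests

browser.webRequest.onBeforeRequest.addListener(function (details) {
  if (isAdRequest(details.url)) {  // our blocklist check (hypothetical)
    inFlight.set(details.tabId, (inFlight.get(details.tabId) || 0) + 1);
  }
}, { urls: ['<all_urls>'] });

function settle(details) {
  const n = inFlight.get(details.tabId);
  if (n && isAdRequest(details.url)) {
    inFlight.set(details.tabId, n - 1);
    if (n === 1) onAllAdsSettled(details.tabId);  // hypothetical hook
  }
}
browser.webRequest.onCompleted.addListener(settle, { urls: ['<all_urls>'] });
browser.webRequest.onErrorOccurred.addListener(settle, { urls: ['<all_urls>'] });
```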

More to come

Log Results to DB

Now that we've got a functioning web extension, we need to log our results somewhere other than the console. Ideally, we will hook this up to a database so every user can send their results to one place.
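The client side could be as simple as the sketch below; the endpoint URL and payload shape are placeholders until we settle on a schema:

```js
// POST a batch of benchmark results to the collection server.
function sendBatch(results) {
  return fetch('https://our-server.example/api/benchmarks', {  // placeholder URL
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ results: results }),
  });
}
```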

Record Files of Size 0 as Null

Right now, we're getting 204 No Content responses which resolve with a content-length (file size) of 0, which throws off our results and muddles the db. We want to be able to ignore all of these files in our Mongo queries on the dashboard side. In order to do this, we'd like to set them explicitly to null before storage so we don't have to worry about adding extra server-side logic.
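A sketch of the normalization step, assuming a `fileSize` field on our records (the field name is hypothetical):

```js
// Treat a 0-byte body (e.g. a 204 No Content response) as "size
// unknown" rather than a real measurement, so Mongo queries can
// filter on null without extra server-side logic.
function normalizeSize(record) {
  if (record.fileSize === 0) {
    record.fileSize = null;
  }
  return record;
}
```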

Data not sending after X minutes in browser

It seems as though the timers we've set up don't persist between sessions: when I leave my computer and it locks, the add-on stops posting data to the server and I have to reinstall the add-on. This still needs to be verified.
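If the cause is a bare `setTimeout`/`setInterval`, one possible fix is the WebExtensions alarms API, which is managed by the browser rather than by our script (`postDataToServer` is a stand-in for our existing upload routine):

```js
// Requires the "alarms" permission in manifest.json.
browser.alarms.create('post-data', { periodInMinutes: 2 });

browser.alarms.onAlarm.addListener(function (alarm) {
  if (alarm.name === 'post-data') {
    postDataToServer();  // hypothetical: our existing upload routine
  }
});
```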

Add Privacy Policy

To comply with AMO and any other platform we'd like to release this on eventually (Chrome?), we need to develop a privacy policy outlining what we're collecting, how it is being transferred, and how it is being used.

Batch Ad Benchmarks per Page

It would be interesting to batch data per-page so that we can determine how bad each page-load is on average. This would give us the best shot at ranking sites by perceived performance due to ads on-page.

Grab only N% of Asset Benchmarks

To protect against too much PII being shared and to mitigate a potential storage space issue on our server, we'd like to grab ~10% of all asset benchmarks completed.
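Uniform random sampling gets us this in a couple of lines; `SAMPLE_RATE` is an assumed name, and 0.1 matches the ~10% above:

```js
const SAMPLE_RATE = 0.1;

// Decide per-benchmark whether to keep it.
function shouldKeep() {
  return Math.random() < SAMPLE_RATE;
}
```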

Remove Website Subdomains

To mitigate a PII leak, we'd like to remove subdomains which may contain personally identifiable info in some cases.
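A naive sketch that keeps only the last two labels of the hostname; note this mishandles multi-part suffixes like .co.uk, so a proper implementation would consult the Public Suffix List:

```js
function stripSubdomains(hostname) {
  const labels = hostname.split('.');
  return labels.slice(-2).join('.');
}

stripSubdomains('user123.mail.example.com');  // -> 'example.com'
```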

Add Site Profiler Tool

I think it would be nice to add a website profiler where users can record performance for a particular amount of time.

Get Tab Url

Currently we are treating originUrl as our tab's URL (the website the request came from at the lowest level), but an ad can actually be the origin of another ad. For this reason, we should get the top-level host URL of the tab the request came from.
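Something like the following should work, assuming we have the `tabs` permission; note that `details.tabId` is -1 for requests not associated with any tab:

```js
// Resolve the top-level host from the tab that made the request,
// instead of trusting originUrl.
function getTopLevelHost(details) {
  if (details.tabId === -1) {
    return Promise.resolve(null);
  }
  return browser.tabs.get(details.tabId)
    .then(tab => new URL(tab.url).hostname);
}
```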

Implement Data Encryption

After determining what info we can collect while still remaining privacy-respecting, we need to encrypt that data. In this pursuit, we will salt+hash info before sending it to the db.

(Lower priority) We may also need to send individual salts from the db to the client each time the first "write" is requested, in order to verify that no one is mucking up our db results. The likelihood of someone caring enough to send us bad data is low, but this is something to consider at a minimum.
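A minimal salt-and-hash sketch using the Web Crypto API; `SALT` is a placeholder, and per the note above it may eventually be issued per-client by the server:

```js
const SALT = 'placeholder-salt';

// Hash a value with the salt prepended; resolves to a hex string.
function saltAndHash(value) {
  const bytes = new TextEncoder().encode(SALT + value);
  return crypto.subtle.digest('SHA-256', bytes).then(buf =>
    Array.from(new Uint8Array(buf))
      .map(b => b.toString(16).padStart(2, '0'))
      .join('')
  );
}
```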

The "benchmarks" question

My friend asked me a question about load times -

If you are collecting data over different connections (ranging from dial-up to high-speed Wi-Fi), how can it be standardized? How can it be called a "benchmark"? Load times may vary from one geographical region to another for the same site with the same ad network.

I was thinking maybe we should collect a page-load time vs. ad-load time metric?
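For illustration, the relative metric would look like this (field names are assumed); unlike raw load times, the ratio is comparable across connection speeds:

```js
// Fraction of total page-load time attributable to ads.
function adLoadShare(pageLoadMs, adLoadMs) {
  return pageLoadMs > 0 ? adLoadMs / pageLoadMs : null;
}

adLoadShare(4000, 1500);  // -> 0.375: ads account for ~37.5% of the load
```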

Explore Building Local User Reports

We are considering adding a local page which users can access to see a report on the data they've contributed. In the interest of not simply rebuilding Lightbeam with group reporting capabilities, we should likely keep this dashboard fairly light and easy to understand. Although it would certainly be cool to have all the features Lightbeam has, that seems outside the scope of this project.

Grab More Useful Data

We're aware that, currently, asset load time, ad host URL, and origin URL don't give us a whole lot to report on. The issue is gathering information that is both useful and privacy-respecting: we don't want to hit a situation where logs allow someone to be uniquely identified.

Some thoughts for useful metrics we can gather (a sketch of the first two follows this list):

  1. Unique page-visit ID for each group of assets in one page visit
    This will allow us to determine page performance by host to some degree of accuracy. One variable which may throw off our data here is the amount of time spent on a page: if one user lets only 1 asset of a potential 300 load, while 2 other users had all 300 assets load, that throws off our average by quite a lot, which is why it would be useful to grab the next data point.
  2. Time spent on page
    If we collect time spent on a page when we are grouping requests by page visit, we will be able to filter out page visits which were too short to capture a majority of the requests on said page.
  3. (DONE) Ad network
    This determination can also happen server-side, so I'm unsure if we should be doing it in the extension. That said, we've already got the list handy in the extension.

More TBD
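Here is a rough sketch of metrics 1 and 2, with hypothetical names throughout: tag every asset from one page visit with a shared random ID and record how long the visit has lasted.

```js
const visits = new Map();  // tabId -> { id, startedAt }

// Call on top-level navigation to start a new visit for the tab.
function beginVisit(tabId) {
  visits.set(tabId, {
    id: crypto.getRandomValues(new Uint32Array(2)).join('-'),
    startedAt: Date.now(),
  });
}

// Stamp an asset record with its visit ID and elapsed time on page.
function tagAsset(tabId, asset) {
  const visit = visits.get(tabId);
  if (visit) {
    asset.pageVisitId = visit.id;
    asset.timeOnPageMs = Date.now() - visit.startedAt;
  }
  return asset;
}
```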

.gitignore js/web-crawler.bundle.js

Ideally this file is built by `npm install`, so it should be .gitignored so we don't have to worry about committing it when opening pull requests.
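The entry would be something along these lines:

```gitignore
# built by `npm install`; don't commit
js/web-crawler.bundle.js
```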

Export Data Tab Takes Long to Load

When you have the add-on installed for a while, the Export Data tab takes a long time to load, as there is a ton of JSON data on the page. We should consider removing the export data feature.

Redirect Dashboard Overview

We should show a message on the overview tab when there is no ad data, so people know they might need to disable their ad blocker or that there simply isn't ad data yet. We should also redirect to the profiler tool while profiling is in progress.

Determine whether Ad is loaded on Active Tab

Determine whether the ad asset is being loaded in the active tab or side-loaded in an inactive tab. This will allow us to aggregate assets based on the above and see which ads are invasive and which are not.
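A sketch of the check, assuming the `tabs` permission; a request with a `tabId` of -1 isn't tied to any tab, so we treat it as not active:

```js
function wasLoadedInActiveTab(details) {
  if (details.tabId === -1) return Promise.resolve(false);
  return browser.tabs.get(details.tabId).then(tab => tab.active);
}
```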

Explore URL Mapping/ Grouping

After #8 was merged, a new situation was introduced where ads request other ads. See lax1-ib.adnxs.com below, where the content is Website Origin, Requested Ad Host, and Response Time (in that order, from left to right):
[Screenshot: request log showing Website Origin, Requested Ad Host, and Response Time columns]

Above, you can see that cnn.com is requesting an ad from lax1-ib.adnxs.com. That ad presumably proceeds to call js.moatads.com (note that the above picture is in reverse chronological order, so the last item was requested first and vice versa).

In order to mitigate this, we'd need to group all requests from a page under the original host. We can do this by grabbing the URL of the requested ad's tab (using details.tabId). This also allows us to map data the same way Lightbeam does, which could be interesting.

Mitigate when data is too large to send (>15mb)

The dashboard is now updated so that it accepts data up to 15mb. We should probably be checking client-side to make sure the string isn't larger than that, because if it is, users will have a growing Map which could increase in size indefinitely. While the likelihood of a user getting 15mb worth of asset logs in 2 minutes is astronomically low (that would require loading >36,000 ad-asset records), we should probably make sure it isn't happening.

Reference to finding: #43
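A client-side guard could look like this; `MAX_PAYLOAD_BYTES` is assumed to match the server limit above:

```js
const MAX_PAYLOAD_BYTES = 15 * 1024 * 1024;

// Blob.size counts bytes, not UTF-16 code units, so it matches what
// actually goes over the wire.
function payloadTooLarge(json) {
  return new Blob([json]).size > MAX_PAYLOAD_BYTES;
}
```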

Collect Ad Asset Type

We'd like to collect the type of ad asset for each request. Examples include analytics, advertising, content, and social.

Upgrade Local Dashboard

Looking to upgrade the local dashboard with metrics we can get for free (without expensive processing or space) like: total memory used on ads, total network time taken to load ads, total number of ads recorded, total number of batches sent, etc.

Remove "www." from all URLs

Right now, www.facebook.com and facebook.com are being treated as two different URLs.

Example with chris.com and www.chris.com
[Screenshot: chris.com and www.chris.com counted as separate entries]
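The normalization itself is a one-liner applied before we use the hostname as a key:

```js
function stripWww(hostname) {
  return hostname.replace(/^www\./i, '');
}

stripWww('www.chris.com');  // -> 'chris.com'
```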

Add Multi-tab Top-Level-Host tracking for Disconnect Checker

Currently it seems as though our add-on isn't checking whether incoming requests are an allowed resource or property of the top-level host in each tab. We are currently getting spammed (ironically) by Facebook Messenger requests which aren't truly ads. I'm thinking this is likely because the request is coming out of a tab which isn't active. Not certain.

Fix ParseURI() to Accept HTTP

Currently, our parseURI function (mindlessly copied from Stack Overflow) only parses HTTPS URLs. We probably don't need all of the extra filtering junk in there either.
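One option is to drop the copied regex entirely and lean on the built-in `URL` parser, which handles both http and https and throws on invalid input:

```js
function parseHost(url) {
  try {
    return new URL(url).hostname;
  } catch (e) {
    return null;  // not a valid absolute URL
  }
}

parseHost('http://example.com/ad.js');   // -> 'example.com'
parseHost('https://example.com/ad.js');  // -> 'example.com'
```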

Compounding Requests = Memory Errors & Unmanageable Load Times

Currently we are making the HTTP requests asynchronously, which, across the tens of thousands of requests triggered in a short amount of time, causes the process to run out of memory, fail, or stall forever.

This will be invalidated if we use Selenium as noted in #1
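In the meantime, a bounded worker pool would keep the number of simultaneous requests manageable; `fetchAsset` and the limit are hypothetical names:

```js
// Run fetchAsset over all URLs with at most `limit` in flight at once.
function runLimited(urls, limit, fetchAsset) {
  let next = 0;
  function worker() {
    if (next >= urls.length) return Promise.resolve();
    const url = urls[next++];
    return fetchAsset(url).catch(() => {}).then(worker);
  }
  const workers = [];
  for (let i = 0; i < Math.min(limit, urls.length); i++) {
    workers.push(worker());
  }
  return Promise.all(workers);
}
```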

JSON within JSON in Export Data

When the JSON string builds in export.js, it adds the assets object twice (same data) nested within itself. Not sure why this is happening, but we've got a suspicion it has to do with async.
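One pattern worth checking for in export.js, since it reproduces this exact symptom (names here are illustrative, not our actual code):

```js
const assets = { 'example.com': { loadTimeMs: 120 } };

// Bug: the inner stringify bakes a serialized copy of `assets` into
// the object before the outer stringify runs, nesting JSON in JSON.
const buggy = JSON.stringify({ assets: assets, raw: JSON.stringify(assets) });

// Fix: build the complete object first, then stringify exactly once.
const fixed = JSON.stringify({ assets: assets });
```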
