site-sonar's People

Contributors

francescostl, purukaushik


site-sonar's Issues

Fix XHR Memory Leaks

Currently our XMLHttpRequest is defined globally. It should be scoped to each request to prevent memory leaks across multiple requests.
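A minimal sketch of the fix, with a hypothetical `timeAsset` function standing in for our actual benchmark code: construct the `XMLHttpRequest` inside the function, so each instance becomes garbage-collectable once its handlers fire.

```js
// Scope the XHR to each call instead of reusing one global instance,
// so old handlers and response bodies don't stay reachable.
// `timeAsset` and `onDone` are hypothetical names.
function timeAsset(url, onDone) {
  var xhr = new XMLHttpRequest();
  var start = Date.now();
  xhr.onload = xhr.onerror = function () {
    onDone(url, Date.now() - start, xhr.status);
  };
  xhr.open('GET', url);
  xhr.send();
}
```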

Anonymize HTTP Header Info

We'd like to anonymize certain data sent in the HTTP headers which could be used to identify users in the event of a security issue on our server. Below is an example of the data sent between the client and our server. We should likely anonymize the user agent and accept language by setting one standard value for each.

[Screenshot: example HTTP headers sent between the client and our server]
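One way this could be done, assuming we intercept our own uploads with the WebExtensions `webRequest` API; the endpoint URL and the generic header values below are placeholders, not decided-on strings:

```js
// Rewrite identifying headers on requests to our collection server.
// Requires the "webRequest" and "webRequestBlocking" permissions.
const GENERIC_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0';
const GENERIC_LANG = 'en-US,en;q=0.5';

browser.webRequest.onBeforeSendHeaders.addListener(
  function (details) {
    for (const header of details.requestHeaders) {
      const name = header.name.toLowerCase();
      if (name === 'user-agent') header.value = GENERIC_UA;
      if (name === 'accept-language') header.value = GENERIC_LANG;
    }
    return { requestHeaders: details.requestHeaders };
  },
  { urls: ['https://our-server.example/*'] },  // placeholder endpoint
  ['blocking', 'requestHeaders']
);
```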

Develop as Firefox Add-on to Track All HTTP Requests

The current method for collecting potential HTTP requests and testing their speed seems flawed: because we are not loading the JavaScript on-page when we make the requests, we are likely missing requests. The original reasoning behind the current method was that browser automation is expensive, especially when crawling hundreds (if not thousands) of pages.

Implement Automated Navigation

We'd like to automatically navigate between pages so the user can install this add-on and let it do its thing while benchmarking.

In order to complete this, I'll need to identify when a page is truly "loaded", i.e. when all our potential ad scripts have been requested and received. This is not as simple as calling onLoad.

My current approach: at the point where onLoad would be triggered (using webRequest.onCompleted), note the number of blocked requests we've sent out, then wait for them to be received back. The issue is that even this triggers far too soon. Websites like cnn.com, which normally have hundreds of requests, will only actually log two; sites like ksdk.com, which can often have upwards of 1,000, reduce to a mere 10 blocked requests that we can track.
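For reference, the counting mechanism looks roughly like this; it tracks outstanding ad requests per tab but does not yet solve the too-early trigger described above. `isAdRequest` and `onAllAdsSettled` are hypothetical names.

```js
const inFlight = new Map();  // tabId -> number of pending ad requests

browser.webRequest.onBeforeRequest.addListener(function (details) {
  if (isAdRequest(details.url)) {  // our blocklist check (hypothetical)
    inFlight.set(details.tabId, (inFlight.get(details.tabId) || 0) + 1);
  }
}, { urls: ['<all_urls>'] });

function settle(details) {
  const n = inFlight.get(details.tabId);
  if (n && isAdRequest(details.url)) {
    inFlight.set(details.tabId, n - 1);
    if (n === 1) onAllAdsSettled(details.tabId);  // hypothetical hook
  }
}
browser.webRequest.onCompleted.addListener(settle, { urls: ['<all_urls>'] });
browser.webRequest.onErrorOccurred.addListener(settle, { urls: ['<all_urls>'] });
```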

More to come

Log Results to DB

Now that we've got a functioning web extension, we need to log our results somewhere other than the console. Ideally, we will hook this up to a database so every user can send their results to one place.
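The client side could be as simple as the sketch below; the endpoint URL and payload shape are placeholders until we settle on a schema:

```js
// POST a batch of benchmark results to the collection server.
function sendBatch(results) {
  return fetch('https://our-server.example/api/benchmarks', {  // placeholder URL
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ results: results }),
  });
}
```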

Record Files of Size 0 as Null

Right now, we're getting 204 No Content responses which resolve with a content-length (file size) of 0, which throws off our results and muddles the db. We want to be able to ignore all of these files in our Mongo queries on the dashboard side. In order to do this, we'd like to set them explicitly to null before storage so we don't have to worry about adding extra server-side logic.
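A sketch of the normalization step, assuming a `fileSize` field on our records (the field name is hypothetical):

```js
// Treat a 0-byte body (e.g. a 204 No Content response) as "size
// unknown" rather than a real measurement, so Mongo queries can
// filter on null without extra server-side logic.
function normalizeSize(record) {
  if (record.fileSize === 0) {
    record.fileSize = null;
  }
  return record;
}
```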

Data not sending after X minutes in browser

It seems as though the timers we've set up don't persist between sessions: when I leave my computer and it locks, the add-on stops posting data to the server and I have to reinstall the add-on. This still needs to be verified.
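If the cause is a bare `setTimeout`/`setInterval`, one possible fix is the WebExtensions alarms API, which is managed by the browser rather than by our script (`postDataToServer` is a stand-in for our existing upload routine):

```js
// Requires the "alarms" permission in manifest.json.
browser.alarms.create('post-data', { periodInMinutes: 2 });

browser.alarms.onAlarm.addListener(function (alarm) {
  if (alarm.name === 'post-data') {
    postDataToServer();  // hypothetical: our existing upload routine
  }
});
```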

Add Privacy Policy

To comply with AMO and any other platform we'd like to release this on eventually (Chrome?), we need to develop a privacy policy outlining what we're collecting, how it is being transferred, and how it is being used.

Batch Ad Benchmarks per Page

It would be interesting to batch data per-page so that we can determine how bad each page-load is on average. This would give us the best shot at ranking sites by perceived performance due to ads on-page.

Grab only N% of Asset Benchmarks

To protect against too much PII being shared and to mitigate a potential storage space issue on our server, we'd like to grab ~10% of all asset benchmarks completed.
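Uniform random sampling gets us this in a couple of lines; `SAMPLE_RATE` is an assumed name, and 0.1 matches the ~10% above:

```js
const SAMPLE_RATE = 0.1;

// Decide per-benchmark whether to keep it.
function shouldKeep() {
  return Math.random() < SAMPLE_RATE;
}
```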

Remove Website Subdomains

To mitigate a PII leak, we'd like to remove subdomains which may contain personally identifiable info in some cases.
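A naive sketch that keeps only the last two labels of the hostname; note this mishandles multi-part suffixes like .co.uk, so a proper implementation would consult the Public Suffix List:

```js
function stripSubdomains(hostname) {
  const labels = hostname.split('.');
  return labels.slice(-2).join('.');
}

stripSubdomains('user123.mail.example.com');  // -> 'example.com'
```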

Add Site Profiler Tool

I think it would be nice to add a website profiler where users can record performance for a particular amount of time.

Get Tab Url

Currently we are treating originUrl as our tab's URL (the website the request came from at the lowest level), but an ad can actually be the origin of another ad. For this reason, we should get the top-level host URL of the tab the request came from.
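Something like the following should work, assuming we have the `tabs` permission; note that `details.tabId` is -1 for requests not associated with any tab:

```js
// Resolve the top-level host from the tab that made the request,
// instead of trusting originUrl.
function getTopLevelHost(details) {
  if (details.tabId === -1) {
    return Promise.resolve(null);
  }
  return browser.tabs.get(details.tabId)
    .then(tab => new URL(tab.url).hostname);
}
```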

Implement Data Encryption

After determining what info we can collect while still remaining privacy-respecting, we need to encrypt that data. In this pursuit, we will salt+hash info before sending it to the db.

(Lower priority) We may also need to send individual salts from the db to the client each time the first "write" is requested, in order to verify that no one is mucking up our db results. The likelihood of someone caring enough to send us bad data is low, but this is something to consider at a minimum.
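A minimal salt-and-hash sketch using the Web Crypto API; `SALT` is a placeholder, and per the note above it may eventually be issued per-client by the server:

```js
const SALT = 'placeholder-salt';

// Hash a value with the salt prepended; resolves to a hex string.
function saltAndHash(value) {
  const bytes = new TextEncoder().encode(SALT + value);
  return crypto.subtle.digest('SHA-256', bytes).then(buf =>
    Array.from(new Uint8Array(buf))
      .map(b => b.toString(16).padStart(2, '0'))
      .join('')
  );
}
```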

The "benchmarks" question

My friend asked me a question about load times -

If you are collecting data over different connections (ranging from dial-up to high-speed Wi-Fi), how can it be standardized? How can it be called a "benchmark"? Load times may vary from one geographical region to another for the same site with the same ad network.

I was thinking maybe we should collect a page-load time vs. ad-load time metric?
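For illustration, the relative metric would look like this (field names are assumed); unlike raw load times, the ratio is comparable across connection speeds:

```js
// Fraction of total page-load time attributable to ads.
function adLoadShare(pageLoadMs, adLoadMs) {
  return pageLoadMs > 0 ? adLoadMs / pageLoadMs : null;
}

adLoadShare(4000, 1500);  // -> 0.375: ads account for ~37.5% of the load
```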

Explore Building Local User Reports

We are considering adding a local page which users can access to see a report on the data they've contributed. In the interest of not simply rebuilding Lightbeam with group reporting capabilities, we should likely keep this dashboard fairly light and easy to understand. Although it would certainly be cool to have all the features Lightbeam has, that seems outside the scope of this project.

Grab More Useful Data

We're aware that, currently, asset load time, ad host URL, and origin URL don't give us a whole lot to report on. The issue is gathering information that is both useful and privacy-respecting: we don't want to hit a situation where logs allow someone to be uniquely identified.

Some thoughts for useful metrics we can gather (a sketch of the first two follows this list):

  1. Unique page-visit ID for each group of assets in one page visit
    This will allow us to determine page performance by host to some degree of accuracy. One variable which may throw off our data here is the amount of time spent on a page: if one user lets only 1 asset of a potential 300 load, while 2 other users had all 300 assets load, that throws off our average by quite a lot, which is why it would be useful to grab the next data point.
  2. Time spent on page
    If we collect time spent on a page when we are grouping requests by page visit, we will be able to filter out page visits which were too short to capture a majority of the requests on said page.
  3. (DONE) Ad network
    This determination can also happen server-side, so I'm unsure if we should be doing it in the extension. That said, we've already got the list handy in the extension.

More TBD
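Here is a rough sketch of metrics 1 and 2, with hypothetical names throughout: tag every asset from one page visit with a shared random ID and record how long the visit has lasted.

```js
const visits = new Map();  // tabId -> { id, startedAt }

// Call on top-level navigation to start a new visit for the tab.
function beginVisit(tabId) {
  visits.set(tabId, {
    id: crypto.getRandomValues(new Uint32Array(2)).join('-'),
    startedAt: Date.now(),
  });
}

// Stamp an asset record with its visit ID and elapsed time on page.
function tagAsset(tabId, asset) {
  const visit = visits.get(tabId);
  if (visit) {
    asset.pageVisitId = visit.id;
    asset.timeOnPageMs = Date.now() - visit.startedAt;
  }
  return asset;
}
```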

.gitignore js/web-crawler.bundle.js

Ideally this file is built by `npm install`, so it should be .gitignored so we don't have to worry about committing it when opening pull requests.
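The entry would be something along these lines:

```gitignore
# built by `npm install`; don't commit
js/web-crawler.bundle.js
```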

Export Data Tab Takes Long to Load

When you have the add-on installed for a while, the Export Data tab takes a long time to load, as there is a ton of JSON data on the page. We should consider removing the export data feature.

Redirect Dashboard Overview

We should show a message on the overview tab when there is no ad data, so people know they might need to disable their ad blocker or that there simply isn't ad data yet. We should also redirect to the profiler tool while profiling is in progress.

Determine whether Ad is loaded on Active Tab

Determine whether the ad asset is being loaded in the active tab or side-loaded in an inactive tab. This will allow us to aggregate assets based on the above and see which ads are invasive and which are not.
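A sketch of the check, assuming the `tabs` permission; a request with a `tabId` of -1 isn't tied to any tab, so we treat it as not active:

```js
function wasLoadedInActiveTab(details) {
  if (details.tabId === -1) return Promise.resolve(false);
  return browser.tabs.get(details.tabId).then(tab => tab.active);
}
```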

Explore URL Mapping/ Grouping

After #8 was merged, a new situation was introduced where ads request other ads. See lax1-ib.adnxs.com below, where the content is Website Origin, Requested Ad Host, and Response Time (in that order, from left to right):
[Screenshot: request log showing Website Origin, Requested Ad Host, and Response Time columns]

Above, you can see that cnn.com is requesting an ad from lax1-ib.adnxs.com. That ad presumably proceeds to call js.moatads.com (note that the above picture is in reverse chronological order, so the last item was requested first and vice versa).

In order to mitigate this, we'd need to group all requests from a page under the original host. We can do this by grabbing the URL of the requested ad's tab (using details.tabId). This also allows us to map data the same way Lightbeam does, which could be interesting.

Mitigate when data is too large to send (>15mb)

The dashboard is now updated so that it accepts data up to 15mb. We should probably be checking client-side to make sure the string isn't larger than that, because if it is, users will have a growing Map which could increase in size indefinitely. While the likelihood of a user getting 15mb worth of asset logs in 2 minutes is astronomically low (that would require loading >36,000 ad-asset records), we should probably make sure it isn't happening.

Reference to finding: #43
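A client-side guard could look like this; `MAX_PAYLOAD_BYTES` is assumed to match the server limit above:

```js
const MAX_PAYLOAD_BYTES = 15 * 1024 * 1024;

// Blob.size counts bytes, not UTF-16 code units, so it matches what
// actually goes over the wire.
function payloadTooLarge(json) {
  return new Blob([json]).size > MAX_PAYLOAD_BYTES;
}
```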

Collect Ad Asset Type

We'd like to collect the type of ad asset for each request. Examples include analytics, advertising, content, and social.

Upgrade Local Dashboard

Looking to upgrade the local dashboard with metrics we can get for free (without expensive processing or space) like: total memory used on ads, total network time taken to load ads, total number of ads recorded, total number of batches sent, etc.

Remove "www." from all URLs

Right now, www.facebook.com and facebook.com are being treated as two different URLs.

Example with chris.com and www.chris.com
[Screenshot: chris.com and www.chris.com counted as separate entries]
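The normalization itself is a one-liner applied before we use the hostname as a key:

```js
function stripWww(hostname) {
  return hostname.replace(/^www\./i, '');
}

stripWww('www.chris.com');  // -> 'chris.com'
```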

Add Multi-tab Top-Level-Host tracking for Disconnect Checker

Currently it seems as though our add-on isn't checking whether incoming requests are an allowed resource or property of the top-level host in each tab. We are currently getting spammed (ironically) by Facebook Messenger requests which aren't truly ads. I'm thinking this is likely because the request is coming out of a tab which isn't active. Not certain.

Fix ParseURI() to Accept HTTP

Currently, our parseURI function (mindlessly copied from Stack Overflow) only parses HTTPS URLs. We probably don't need all of the extra filtering junk in there either.
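One option is to drop the copied regex entirely and lean on the built-in `URL` parser, which handles both http and https and throws on invalid input:

```js
function parseHost(url) {
  try {
    return new URL(url).hostname;
  } catch (e) {
    return null;  // not a valid absolute URL
  }
}

parseHost('http://example.com/ad.js');   // -> 'example.com'
parseHost('https://example.com/ad.js');  // -> 'example.com'
```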

Compounding Requests = Memory Errors & Unmanageable Load Times

Currently we are making the HTTP requests asynchronously, which, across the tens of thousands of requests triggered in a short amount of time, causes the process to run out of memory, fail, or stall forever.

This will be invalidated if we use Selenium as noted in #1
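In the meantime, a bounded worker pool would keep the number of simultaneous requests manageable; `fetchAsset` and the limit are hypothetical names:

```js
// Run fetchAsset over all URLs with at most `limit` in flight at once.
function runLimited(urls, limit, fetchAsset) {
  let next = 0;
  function worker() {
    if (next >= urls.length) return Promise.resolve();
    const url = urls[next++];
    return fetchAsset(url).catch(() => {}).then(worker);
  }
  const workers = [];
  for (let i = 0; i < Math.min(limit, urls.length); i++) {
    workers.push(worker());
  }
  return Promise.all(workers);
}
```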

JSON within JSON in Export Data

When the JSON string builds in export.js, it adds the assets object twice (same data) nested within itself. Not sure why this is happening, but we've got a suspicion it has to do with async.
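One pattern worth checking for in export.js, since it reproduces this exact symptom (names here are illustrative, not our actual code):

```js
const assets = { 'example.com': { loadTimeMs: 120 } };

// Bug: the inner stringify bakes a serialized copy of `assets` into
// the object before the outer stringify runs, nesting JSON in JSON.
const buggy = JSON.stringify({ assets: assets, raw: JSON.stringify(assets) });

// Fix: build the complete object first, then stringify exactly once.
const fixed = JSON.stringify({ assets: assets });
```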
