crux-top-lists's Introduction

Cached Chrome Top Million Websites

Recent research showed that the list of the top million most popular websites published by Google Chrome via its Chrome UX Report (CrUX) is significantly more accurate than other top lists such as the Alexa Top Million and Tranco Top Million.

This repository caches a CSV version of the Chrome top sites, queried from the CrUX data in Google BigQuery. You can browse all of the cached lists here. The most up-to-date top million global websites can be downloaded directly at: https://raw.githubusercontent.com/zakird/crux-top-lists/main/data/global/current.csv.gz.
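As a minimal sketch, the current list could be fetched and parsed with only the Python standard library. The names `parse_crux_csv_gz` and `fetch_current` are illustrative and not part of this repository, and the sketch assumes the file is a gzip-compressed CSV with an `origin,rank` header row.

```python
import csv
import gzip
import io
import urllib.request

CURRENT_URL = (
    "https://raw.githubusercontent.com/zakird/crux-top-lists/"
    "main/data/global/current.csv.gz"
)


def parse_crux_csv_gz(raw: bytes) -> list[tuple[str, int]]:
    """Decompress gzip bytes and return (origin, rank) pairs in file order."""
    text = gzip.decompress(raw).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return [(row["origin"], int(row["rank"])) for row in reader]


def fetch_current(url: str = CURRENT_URL) -> list[tuple[str, int]]:
    """Download the list (network access required) and parse it."""
    with urllib.request.urlopen(url) as response:
        return parse_crux_csv_gz(response.read())
```

Keeping the parsing step separate from the download makes it possible to reuse `parse_crux_csv_gz` on any of the cached monthly files that are already on disk.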

Data Structure

The CrUX dataset has several important differences from other top lists:

  1. Websites are bucketed by the order of magnitude of their rank, not assigned a specific rank. In the provided files, rank is 1000, 10000, 100000, or 1000000 (i.e., 1K, 10K, 100K, or 1M). The data is ordered by rank magnitude. Within each order of magnitude, websites are listed in random order.

  2. Websites are identified by origin (e.g., https://www.google.com) not by domain or FQDN.

  3. Data is released monthly, typically on the second Tuesday of the month.

This is an example of what the data looks like:

origin,rank
https://www.ptwxz.com,1000
https://ameblo.jp,1000
https://danbooru.donmai.us,1000
https://game8.jp,1000
https://www.google.com.au,1000
https://www.repubblica.it,1000
https://www.w3schools.com,1000
https://animekimi.com,1000
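Because entries are origins and ranks are buckets, a consumer will usually want to extract the hostname and group by bucket instead of sorting by rank. The following is a rough sketch using a few of the sample rows above; `bucket_hosts` is a hypothetical helper name, not something this repository provides.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit

SAMPLE = """origin,rank
https://www.ptwxz.com,1000
https://ameblo.jp,1000
https://www.google.com.au,1000
"""


def bucket_hosts(csv_text: str) -> dict[int, list[tuple[str, str]]]:
    """Map each rank bucket to the (scheme, hostname) pairs it contains."""
    buckets: dict[int, list[tuple[str, str]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        parts = urlsplit(row["origin"])
        # An origin is scheme + host (+ optional port); it has no path.
        buckets[int(row["rank"])].append((parts.scheme, parts.hostname))
    return dict(buckets)


buckets = bucket_hosts(SAMPLE)
# Order inside a bucket carries no ranking information, so sort freely.
print(sorted(host for _, host in buckets[1000]))
```

Note that `https://www.google.com.au` and a hypothetical `http://www.google.com.au` would be distinct origins, so deduplicating by hostname is a choice the consumer has to make explicitly.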

Websites are ranked by completed pageloads (measured by First Contentful Paint) and aggregated by web origin. The dataset adheres as closely as possible to user-initiated pageloads (e.g., it excludes traffic from iframes). More information about CrUX and its data collection methodology can be found on its official website: https://developer.chrome.com/docs/crux/about/.

Why 1 Million Sites?

This repository does not contain all of the website ranking data published by Chrome. Chrome's global list of popular websites contains approximately 15M websites. The top million websites capture over 95% of user traffic in Chrome by both Page Loads and Time on Page (Ruth et al.) and are a reasonable approximation:

[Figure: CDF of User Traffic]

If you want to use more or fewer websites, this is the approximate breakdown of coverage:

Websites    Page Loads
1000        50%
10K         70%
100K        87%
1M          95%
5M          99%
The following SQL can be used to generate a similar list of all globally popular websites:

SELECT DISTINCT origin, experimental.popularity.rank
    FROM `chrome-ux-report.experimental.global`
    WHERE yyyymm = ? -- e.g., integer 202210
    GROUP BY origin, experimental.popularity.rank
    ORDER BY experimental.popularity.rank;
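One way to run this query from Python is sketched below. It assumes the third-party `google-cloud-bigquery` package is installed and that Google Cloud credentials are configured; `yyyymm_for` and `run_global_query` are hypothetical helper names. The `?` placeholder is bound as a positional query parameter.

```python
from datetime import date

GLOBAL_SQL = """
SELECT DISTINCT origin, experimental.popularity.rank
    FROM `chrome-ux-report.experimental.global`
    WHERE yyyymm = ?
    GROUP BY origin, experimental.popularity.rank
    ORDER BY experimental.popularity.rank;
"""


def yyyymm_for(day: date) -> int:
    """Encode a date's year and month as the integer form used by yyyymm."""
    return day.year * 100 + day.month


def run_global_query(yyyymm: int):
    """Run the query; needs google-cloud-bigquery and GCP credentials."""
    # Imported lazily because it is a third-party dependency.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        # A name of None marks a positional parameter, matching the `?` above.
        query_parameters=[bigquery.ScalarQueryParameter(None, "INT64", yyyymm)]
    )
    result = client.query(GLOBAL_SQL, job_config=job_config).result()
    return [(row["origin"], row["rank"]) for row in result]
```

For example, `run_global_query(yyyymm_for(date(2022, 10, 1)))` would request the October 2022 list. Querying BigQuery may incur charges on the billing project.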

Country-Specific Websites

Ruth et al. also showed that browsing behavior is localized and a global top list skews towards global sites (e.g., technology and gaming) and away from local sites (e.g., education, government, and finance). As such, researchers may also want to investigate whether trends hold across individual countries.

[Figure: Skew in Websites]

Chrome publishes country-specific top lists in BigQuery and the following SQL can be used to dump out country-specific top websites:

SELECT DISTINCT country_code, origin, experimental.popularity.rank
    FROM `chrome-ux-report.experimental.country`
    WHERE yyyymm = ? -- e.g., integer 202210
        AND experimental.popularity.rank <= 1000000
    GROUP BY country_code, origin, experimental.popularity.rank
    ORDER BY country_code, experimental.popularity.rank;
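Once country-specific rows have been exported, one simple way to look for localized sites is to compare each country's top bucket against the global top bucket. The sketch below assumes rows shaped like the query output, `(country_code, origin, rank)`; `local_only` is a hypothetical helper, and all of the rows shown are invented for illustration and are not real CrUX data.

```python
from collections import defaultdict


def local_only(country_rows, global_rows, max_rank=1_000):
    """Per country, origins within max_rank locally but not globally."""
    global_top = {origin for origin, rank in global_rows if rank <= max_rank}
    result = defaultdict(set)
    for country_code, origin, rank in country_rows:
        if rank <= max_rank and origin not in global_top:
            result[country_code].add(origin)
    return dict(result)


# Hypothetical rows for illustration only; not real CrUX data.
global_rows = [("https://a.example", 1_000), ("https://b.example", 10_000)]
country_rows = [
    ("it", "https://a.example", 1_000),
    ("it", "https://b.example", 1_000),
    ("jp", "https://c.example", 1_000),
    ("jp", "https://a.example", 10_000),
]
print(local_only(country_rows, global_rows))
```

Here `https://b.example` is reported for `it` because it is in that country's top bucket but only in the 10,000 bucket globally, and `https://c.example` is reported for `jp` because it does not appear in the global rows at all.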

The CrUX dataset is based on data collected from Google Chrome and is thus biased away from countries with limited Chrome usage (e.g., China). If you're specifically interested in domain popularity in China, consider the list described in "Building an Open, Robust, and Stable Voting-Based Domain Top List", which is based on data collected from 114DNS, a large DNS provider in China.

Supporting Research

The data in this repo is all publicly posted by Google to their CrUX dataset in Google BigQuery. This is simply a cache of that public data. Many of the arguments in this README are based on two recent research papers. The first describes how we evaluated the accuracy of lists of top websites. The second is a study on web browsing more broadly.

Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists
Kimberly Ruth, Deepak Kumar, Brandon Wang, Luke Valenta, and Zakir Durumeric
ACM Internet Measurement Conference (IMC), October 2022

A World Wide View of Browsing the World Wide Web
Kimberly Ruth, Aurore Fass, Jonathan Azose, Mark Pearson, Emma Thomas, Caitlin Sadowski, and Zakir Durumeric
ACM Internet Measurement Conference (IMC), October 2022


crux-top-lists's Issues

`202205.csv` contains only 859188 records instead of 1M

Hello,

Thank you for maintaining this repository and cached versions of crux-top-list.

202205.csv contains only 859188 records instead of the usual 1M. Can the corresponding list be regenerated and updated here or is the data also missing from Google's BigQuery database?

>>> import pandas as pd
>>> df = pd.read_csv("202205.csv")
>>> df
                                       origin     rank
0                          http://iporntv.net     1000
1       https://eldenring.wiki.fextralife.com     1000
2                 https://m.lightinthebox.com     1000
3                          https://ssc.nic.in     1000
4                  https://ja.m.wikipedia.org     1000
...                                       ...      ...
859183    https://www.vulcaodaborracha.com.br  1000000
859184                     https://www.vub.be  1000000
859185     https://www.virginianaturalgas.com  1000000
859186         https://www.virtualregatta.com  1000000
859187                https://zamosc.lento.pl  1000000

[859188 rows x 2 columns]
>>> df.groupby("rank").nunique()
         origin
rank           
1000        904
10000      7806
100000    76566
1000000  773912

Thanks!
