
cc-crawl-statistics's Introduction

Common Crawl Support Library

Overview

This library provides support code for consuming the Common Crawl corpus raw crawl data (ARC files) stored on S3. More information about how to access the corpus can be found at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set.

You can take two primary routes to consuming the ARC File content:

(1) You can run a Hadoop cluster on EC2, or use EMR, to run a Hadoop job. In this case, you can use the ARCFileInputFormat to drive data to your mappers/reducers. There are two versions of the InputFormat: one written to conform to the deprecated mapred package, located at org.commoncrawl.hadoop.io.mapred, and one written for the mapreduce package, correspondingly located at org.commoncrawl.hadoop.io.mapreduce. (A job-setup sketch follows after route 2.)

(2) You can decode data directly by feeding an InputStream to the ARCFileReader class located in the org.commoncrawl.util.shared package. (A reader sketch also follows below.)
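For route (1), here is a minimal job-setup sketch, assuming a recent Hadoop mapreduce API. The only Common Crawl class used is the ARCFileInputFormat named above, and its key/value types (Text URL, BytesWritable raw content) are assumed from the tuple description below; everything else is standard Hadoop. Treat this as a starting point, not the library's canonical usage:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import org.commoncrawl.hadoop.io.mapreduce.ARCFileInputFormat;

    public class ArcDocSizes {

      // Example mapper: emit each document's URL with its raw content length.
      public static class SizeMapper
          extends Mapper<Text, BytesWritable, Text, LongWritable> {
        @Override
        protected void map(Text url, BytesWritable rawContent, Context ctx)
            throws java.io.IOException, InterruptedException {
          ctx.write(url, new LongWritable(rawContent.getLength()));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "arc-doc-sizes");
        job.setJarByClass(ArcDocSizes.class);
        job.setInputFormatClass(ARCFileInputFormat.class); // assumes <Text, BytesWritable>
        job.setMapperClass(SizeMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an s3n:// segment path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }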
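For route (2), a sketch of the direct-decode path. Only the class name, its package, and the fact that it consumes an InputStream come from this README; the constructor and iteration calls below (hasMoreItems, getNextItem) are hypothetical placeholders, so consult the actual class for its real API:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.commoncrawl.util.shared.ARCFileReader;

    public class DirectDecode {
      public static void main(String[] args) throws Exception {
        // Open a local copy of a .arc.gz file; any InputStream will do.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
          ARCFileReader reader = new ARCFileReader(in); // hypothetical constructor
          while (reader.hasMoreItems()) {               // hypothetical method
            // Each item corresponds to one (URL, raw content) tuple.
            System.out.println(reader.getNextItem());   // hypothetical method
          }
        }
      }
    }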

Both routes (InputFormat or direct ARCFileReader) produce a tuple consisting of a UTF-8 encoded URL (Text) and the raw content (BytesWritable), including the HTTP headers, that was downloaded by the crawler. The HTTP headers are UTF-8 encoded, and the headers and content are delimited by consecutive CRLF tokens (a blank line). The content itself, when it is of a text MIME type, is left in the source document's original encoding.
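Since each record's headers and body arrive in a single byte buffer, a consumer typically splits them at the first blank line (CRLFCRLF). Below is a minimal sketch in plain Java that relies only on the delimiting rule stated above; note that BytesWritable.getBytes() may return a padded array, so copy getLength() bytes before splitting:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public final class RecordSplitter {

      // Index of the first CRLFCRLF, or -1 if no blank line is found.
      static int boundary(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
          if (raw[i] == '\r' && raw[i + 1] == '\n'
              && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
            return i;
          }
        }
        return -1;
      }

      // The HTTP headers, which are UTF-8 encoded as stated above.
      public static String headers(byte[] raw) {
        int b = boundary(raw);
        return new String(raw, 0, b < 0 ? raw.length : b, StandardCharsets.UTF_8);
      }

      // The body, left as raw bytes: its charset depends on the source document.
      public static byte[] body(byte[] raw) {
        int b = boundary(raw);
        return b < 0 ? new byte[0] : Arrays.copyOfRange(raw, b + 4, raw.length);
      }
    }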

Build Notes:

  1. Define JAVA_HOME, and make sure you have Ant and Maven installed.
  2. Set hadoop.path (in build.properties) to point to your Hadoop distribution, for example:
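A build.properties sketch; the path below is only a placeholder for wherever your Hadoop distribution actually lives:

    # build.properties
    # Placeholder path: point this at your local Hadoop distribution.
    hadoop.path=/usr/lib/hadoop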

Sample Usage:

Once commoncrawl.jar has been built, you can validate that the ARCFileReader works for you by executing the sample command line from the root of the commoncrawl source directory:

./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey <ACCESS KEY> --awsSecret <SECRET> --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz

cc-crawl-statistics's People

Contributors

jnioche, pjox, sebastian-nagel, thunderpoot


cc-crawl-statistics's Issues

Hostname counts

Maybe I'm just missing it (very likely), but it doesn't look like the current site provides visibility into the top hostnames, even though the README.md here seems to indicate hostnames are among the things being counted. I was going to run some Athena queries to get them and build a view of my own, but was wondering whether it might make sense to add a view like that here. I understand that there are LOTS of them, and perhaps that is why they are missing...

How does Common Crawl curate the list of domains for crawling?

Sorry if I might have missed it, but could someone please point me to a post or Q&A explaining how Common Crawl curates the domains crawled every month?
Are these domains from zone files provided by several sources (e.g., ICANN)? Or are they extracted from previously crawled data, and if so, how was the very first list of domains curated?
Thank you!

If you export the statistical data to me in a simple form, I can write viewers in JavaScript

Sebastian,

I can make these statistics more understandable, but I work in JavaScript. If you use a tab-separated format, I can read that and convert it to JSON, which can be shared with others who might want to process and analyze the data. The raw data as well.

Python is NOT a universal Internet standard, and not likely to become one until a compiler and basic sharing are cleaned up. That is another of the Internet Foundation projects, but a low priority, since so few people (relative to the whole Internet) are using it.

If you share your data in a global format (I can help you), then anyone can bolt onto your output. A community using it can then generate usage data to guide your development. You should have about 10,000 people already working in this area globally. I can help you find them. I have found many groups doing statistics on the web. They are not working together effectively.

From better statistics, I can help you find and connect to the groups who can use Common Crawl. I want to replace Google search for certain types of topics where a for-profit company is not allowed, or is always suspect.

I have been working full time to redesign the Internet's internals and policies for the last 22 years. I think we have talked before. I want to write some proposals for new organizations, and for that, basic statistics on the characteristics of the Internet are needed.

Later I want to profile domains like EDU, GOV and others to show exactly what is happening with them, then ask that they be re-written or mapped to a more useful, visible and auditable form. I can be quite specific. But first I want to get your statistics into a form I can show others. I am not going to screen-scrape your results, or invest my scarce time in a language I think is totally inadequate right now. You have done a good job; I just want to help use the data you are producing to change the Internet. That includes helping Common Crawl tackle some global problems, to demonstrate effective ways to index and understand all of what is available and how it is structured and functions. "Covid-19" is a bit too large, but individual nodes in that problem can be tackled with the resources available. If I can demonstrate a few, then there are donors and sponsors and organizations to help.

I am negotiating to have my own InternetFoundation.Org site rebuilt. If you have specific questions, you have my email. I think this conversation is not private, so I am only explaining basic things that anyone can do.

Thank you for what you are doing. Can you export your results in a computer-readable form? If you have thought about it, I would rather adapt to your export format and give feedback, or rewrite. Adding JavaScript should greatly expand the community of people who can connect to CommonCrawl and its derivative products and tools. I will look at all the others in the coming days, but my time is pretty well committed. I will take time to give you some ideas for how to better use these statistics.

In my working career, I was a senior mathematical statistician. I have about 50 years in statistics, mainly on global economic, social, technological and scientific modeling and simulation. Since those all require massive collection and curation of information from many different places and forms, I have also become adept at finding and gathering data. Connecting all the data on the Internet with a few basic human- and computer-readable forms is much faster and more efficient than the current proprietary, binary, compressed and obscure formats, many of which require a substantial investment of time to find the required tools and dependencies.

GitHub.com itself is one of the Internet Foundation projects. I am working to index and remap it completely. It has massive duplication and too many undocumented pieces. There are many groups working individually, but not together.

site:GitHub.com has 75.1 million entry points.
site:GitHub.com "covid-19" has 111 thousand entry points.
site:GitHub.com "Common Crawl" OR "CommonCrawl" has 4,200 entry points.

As a sample, if you can give me that list of 111,000 URLs, I can see about profiling who is doing what. I, and many other people, could use the data from CommonCrawl, starting with basic statistics. But it needs to be easy to try basic things, like looking at all the pages that contain "covid-19", and to get the result back in a form that is easy for JavaScript to use. Batch processing in Python will NOT help get the information to the billions of browsers and Internet users who only have JavaScript - content scripts, scripts in pages, background HTML and scripts.

I can help with many things, but don't ask me to learn things you can do in minutes. I can reasonably tackle "Covid-19" on the Internet, but not if I have to do it alone.

Pardon me if I am writing a bit formally and at length. I am sharing this note with some people I hope will help. There are about 5,000 immediate problems on the Internet that could be helped with data and statistics from CommonCrawl. I can only do them one by one, and I need data from you to do that. I hope you can spend a bit of time to help me get started. If there are things that need doing, and I profile the 4,200 entry points for "Common Crawl" OR "CommonCrawl" on GitHub, then you can more easily get that whole community working together, not in ones and twos, but as a complete, open, visible and clear group. Yes, I know CC is working to get them organized, but I think I can speed up the process, if they will help on "Covid-19" and "Remapping the Internet" and related topics.

Richard Collins, Director, The Internet Foundation

Charsets, languages, MIME types stats broken

These pages have vestigial tables which look like they are meant to be includes of the related HTML files in the same directory, but they do not render properly, showing instead:

    layout table_include table_sortlist
    table languages-top-200.html {sortList: [[1,1]]}
