commoncrawl / cc-webgraph
Tools to construct and process webgraphs from Common Crawl data
License: Apache License 2.0
Some public suffixes with two or more components are also valid domain names, i.e., they can be resolved to an IP address by DNS, at least once `www.` is added: `www.gov.br` points to the site of the Brazilian government. We should either generally allow valid public suffixes with two or more components, or additionally try to validate them (check whether they are resolvable by DNS). Note: for the Aug/Sep/Oct 2018 webgraphs, about 8,000-9,000 host names failed to be mapped to domain names via the public suffix list.
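The DNS-validation idea could be sketched as follows; this is only an illustration of the proposal, not the actual HostToDomainGraph logic, and the `www.` fallback is an assumption based on the example above:

```python
# Sketch: accept a multi-component public suffix as a domain name if it
# (or its www. variant) resolves to an IP address via DNS.
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname can be resolved to an IP address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def is_valid_domain_candidate(suffix: str) -> bool:
    """Accept a public suffix as a domain name if it, or www. + it, resolves."""
    return resolves(suffix) or resolves("www." + suffix)
```

For example, `is_valid_domain_candidate("gov.br")` would accept `gov.br` because `www.gov.br` resolves. A DNS lookup per candidate is slow, so this would only be practical for the few thousand unmapped host names, not the whole graph.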
My job needs to check whether given links are included in a specific crawl.
I found this page explaining how to access the webgraph data to go through all links in Common Crawl.
And I found this page for searching webgraph releases on commoncrawl.org.
But I wonder: is there a place to access the latest webgraph data programmatically, instead of manually searching on commoncrawl.org?
Or are the links in Common Crawl roughly the same across different crawls, so that it's safe to use webgraph data that isn't completely fresh?
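For the narrower question of checking whether a given URL is included in a specific crawl, the CDX index API can be queried directly instead of the webgraph; a minimal sketch, where the crawl id is just an example (current ids are listed at https://index.commoncrawl.org/collinfo.json):

```python
# Sketch: ask the Common Crawl CDX index whether a URL was captured in a crawl.
import urllib.error
import urllib.parse
import urllib.request

def cdx_query_url(crawl_id: str, url: str) -> str:
    """Build the CDX index API query URL for one crawl and one page URL."""
    params = urllib.parse.urlencode({"url": url, "output": "json", "limit": "1"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

def in_crawl(crawl_id: str, url: str) -> bool:
    """Return True if the CDX index has at least one capture of the URL."""
    try:
        with urllib.request.urlopen(cdx_query_url(crawl_id, url)) as resp:
            return bool(resp.readline().strip())
    except urllib.error.HTTPError:
        return False  # the index returns 404 when there is no capture
```

Example: `in_crawl("CC-MAIN-2023-40", "example.com/")`. Note the CDX index covers captured pages, while the webgraph also contains link targets that were never fetched.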
Hi, I am trying to download the webgraph data from:
https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2023-mar-may-oct/index.html
but I am unable to do so. Please let me know if there is a different link I should be downloading from.
The tools HostToDomainGraph and JoinSortRanks must always read and write data as UTF-8. This is important if the input hostnames include Unicode internationalized domain names (IDNs). For Common Crawl webgraphs the hostnames should be represented in ASCII as IDNA. However, due to a bug this wasn't always the case (cf. commoncrawl/cc-pyspark#35). Because the output streams weren't properly configured, the cc-webgraph Java tools replaced all non-ASCII characters by `?`, e.g. `app.ıo` -> `app.?o`.
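For reference, the expected ASCII (IDNA) form of an internationalized hostname can be produced with Python's standard library; the hostname below is just an example:

```python
# Sketch: convert an internationalized hostname to its ASCII (IDNA) form,
# which is how hostnames should be represented in the Common Crawl webgraphs.
def to_idna(hostname: str) -> str:
    """Encode a Unicode hostname into its ASCII-compatible (IDNA) form."""
    return hostname.encode("idna").decode("ascii")

print(to_idna("münchen.example"))  # xn--mnchen-3ya.example
```

Note that Python's built-in `idna` codec implements IDNA 2003; the third-party `idna` package implements the newer IDNA 2008 rules, which differ for a few characters.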
We should consider using a more efficient storage format, such as Parquet, for the host and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages.
Requirements (at least nice to have): … (`.graph`) format.

When running `mvn package` in cc-webgraph, I get a build failure alert.
mvn -version returns:
Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)
Maven home: /usr/local/apache-maven-3.9.5
Java version: 21, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home
Default locale: en_GB, platform encoding: UTF-8
OS name: "mac os x", version: "13.1", arch: "aarch64", family: "mac"
And here is the error log:
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.commoncrawl.webgraph.TestJoinSortRanks
[ERROR] Java heap space
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.780 s
[INFO] Finished at: 2023-10-05T08:57:17-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M6:test (default-test) on project cc-webgraph:
[ERROR]
[ERROR] Please refer to /Users/abra/src/cc-webgraph/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] There was an error in the forked process
[ERROR] Java heap space
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Java heap space
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:699)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:311)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:268)
[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1332)
[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1165)
[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:929)
[ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:126)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:328)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:316)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:174)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.access$000(MojoExecutor.java:75)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor$1.run(MojoExecutor.java:162)
[ERROR] at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute(DefaultMojosExecutionStrategy.java:39)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:159)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:105)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:73)
[ERROR] at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:53)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:118)
[ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:261)
[ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:173)
[ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:101)
[ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:906)
[ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:283)
[ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:206)
[ERROR] at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
[ERROR] at java.base/java.lang.reflect.Method.invoke(Method.java:580)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:283)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:226)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:407)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:348)
I've already tried increasing the memory allocated to the Surefire plugin by editing pom.xml with:

```xml
<configuration>
  <argLine>-Xmx1024m -Xms512m</argLine>
</configuration>
```
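If raising the limit this way doesn't take effect, it's worth checking that the configuration sits on the maven-surefire-plugin entry itself: Surefire runs the tests in a forked JVM whose heap is controlled by `argLine`, not by `MAVEN_OPTS`. A sketch of where the snippet above would go (the 2g value is just an example; the test may need more):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- heap of the forked test JVM; raise if tests still fail with "Java heap space" -->
        <argLine>-Xmx2g</argLine>
      </configuration>
    </plugin>
  </plugins>
</build>
```

As a workaround, `mvn package -DskipTests` builds the jar without running the tests at all.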
[main] INFO org.commoncrawl.webgraph.JoinSortRanks - Loading page rank values from host/cc-main-2021-22-oct-nov-jan-host-pagerank.ranks
Exception in thread "main" java.lang.IllegalArgumentException: newLimit < 0: (-1216172000 < 0)
at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
at java.base/java.nio.Buffer.limit(Buffer.java:346)
at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6431)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6452)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6520)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7006)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7018)
at org.commoncrawl.webgraph.JoinSortRanks.loadPageRank(JoinSortRanks.java:50)
at org.commoncrawl.webgraph.JoinSortRanks.main(JoinSortRanks.java:319)
Hi,
I am interested in gathering data from common crawl using the webgraph as a metric for webpage similarity. So far, I have recreated parts of the webgraph and am using the following code to download WET files from the common crawl dataset:
```python
from warcio import ArchiveIterator
import cdx_toolkit
import requests

def get_page_url(webURL):
    # Look up the most recent capture of the URL in the Common Crawl CDX index.
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    for obj in cdx.iter(webURL, limit=1):
        return wet_builder(obj['filename'])

def wet_builder(page_url):
    # Map the WARC file path to the corresponding WET (extracted text) file.
    warc_url = f'https://commoncrawl.s3.amazonaws.com/{page_url}'
    wet_url = warc_url.replace('/warc/', '/wet/').replace('warc.gz', 'warc.wet.gz')
    r = requests.get(wet_url, stream=True)
    records = ArchiveIterator(r.raw)
    record = next(records)  # skip the leading warcinfo record
    record = next(records)
    a = record.content_stream().read()
    retText = a.decode('utf-8')[:1000]
    r.close()
    return retText
```
However, this code runs slowly and requires each website to be passed in manually. Is there a faster (more native) way to crawl the webgraph and download each webpage's text content? I don't need to go through the entire webgraph; I am just looking to crawl a subset of it.
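One common speed-up, offered as a suggestion rather than an official recipe: the CDX records that cdx_toolkit returns also carry `offset` and `length` fields, so a single record can be fetched with an HTTP Range request instead of streaming a whole WET/WARC file. A minimal sketch:

```python
# Sketch: fetch exactly one (gzip-compressed) record from a Common Crawl file
# via an HTTP Range request, using the offset/length from a CDX index record.
import gzip
import urllib.request

def range_header(offset: int, length: int) -> dict:
    """Build the HTTP Range header covering one record; byte ranges are inclusive."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def fetch_record_text(filename: str, offset: int, length: int) -> str:
    """Download and decompress a single record; it is a standalone gzip member."""
    url = f"https://data.commoncrawl.org/{filename}"
    req = urllib.request.Request(url, headers=range_header(offset, length))
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", "replace")
```

This downloads a few kilobytes per page instead of a gigabyte-scale file, at the cost of one request per record.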
Hi there,
I am trying to use the commoncrawl repositories to build a weighted webgraph, where weights represent the number of links between the source and target domain. The goal is to use the weighted version of the webgraph to research various link spam methods and where they are being used.
It should be feasible given the infrastructure here. The most straightforward method (which I hope is plausible) relies on the link extraction code doing no deduplication. If that is the case, it should be as simple as counting duplicates instead of dropping them in hostlinks_to_graph. Follow-up changes would be needed in HostToDomainGraph.convertEdge to parse the counts and aggregate them at the domain level (similar to the countHosts option), and finally some changes in the webgraph framework to reason over the weights.
Alternatively, if the extraction job doesn't list every link and deduplicates up front, we would need to count the links as we go: e.g. in cc-pyspark/wat_extract_links.py, propagating counts wherever unweighted links are currently yielded (as in process_records and hostlinks_extract_fastwarc), and then combining the link-count dictionaries from each crawl and each split.
Before going ahead with this I wanted to ask whether the first approach is feasible (the latter would be grim). Also, if anything like this has been attempted before, or weighted webgraphs have already been generated and I missed them, please let me know!
Thanks for any and all help!
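The counting approach described above can be sketched in a few lines; note the host-to-domain mapping below is a toy placeholder, not what HostToDomainGraph actually does (which uses the full public suffix list):

```python
# Sketch: turn a duplicate-preserving edge list into a weighted edge list,
# then aggregate host-level weights to domain level.
from collections import Counter

def weight_edges(edges):
    """Count how often each (source, target) pair occurs."""
    return Counter(edges)

def to_domain(host, suffixes=frozenset({"co.uk"})):
    """Toy registered-domain extraction; the real code uses the public suffix list."""
    parts = host.split(".")
    n = 3 if ".".join(parts[-2:]) in suffixes else 2
    return ".".join(parts[-n:])

def domain_weights(host_edges):
    """Aggregate weighted host edges into weighted domain edges."""
    agg = Counter()
    for (src, dst), w in weight_edges(host_edges).items():
        agg[(to_domain(src), to_domain(dst))] += w
    return agg
```

For example, three host-level links from `a.example.com`/`www.example.com` to `b.co.uk` collapse into one domain edge `("example.com", "b.co.uk")` with weight 3.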
Hi,
I noticed duplicate rows in the Common Crawl Oct/Nov/Jan 2021-2022 domain-level webgraph. You can find it by unzipping the file and running the command:
grep -e $'\tno\.hordaland\t' cc-main-2021-22-oct-nov-jan-domain-ranks.txt
This leads to the following output:
720700 1.4451489E7 273019 1.4683511019933418E-7 no.hordaland 1
46550260 1.0575547E7 79735653 4.063660522732166E-9 no.hordaland 1
This is the only duplicate I found in the entire file.
You probably want to do a sanity check for this in whatever your preferred language is. That is how I found the issue in the first place! I used Python with pandas:
```python
import pandas as pd

df = pd.read_csv(
    "./dataset/cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz",
    compression="gzip",
    sep="\t")

# crashes: is_unique is a property, not a method, so calling it raises TypeError
# print(df["#host_rev"].is_unique())

# comparing row count against the number of unique values works
print(len(df) == len(df["#host_rev"].unique()))

# to see non-unique values
print(df["#host_rev"].value_counts())
```
Pandas takes several minutes for some of these operations because it loads the full file into memory. You may want to try vaex or something similar if using Python; I'm sure there are comparable tools in Java and other languages. You could of course do the check in other ways, but I thought I'd provide the code in case it's useful.
Note that I haven't yet checked whether these duplicate nodes are present in the other webgraph files, but I suspect they probably are, so similar sanity checks on those would be worthwhile. This should improve the quality of what is an excellent dataset. Thanks very much for making this available!