commoncrawl / cc-webgraph
Tools to construct and process webgraphs from Common Crawl data
License: Apache License 2.0
Some public suffixes with two or more components are also valid domain names, i.e., they can be resolved to an IP address by DNS, at least once `www.` is added: `www.gov.br` points to the site of the Brazilian government. We should either generally allow valid public suffixes with two or more components, or additionally try to validate them (check whether they are resolvable by DNS). Note: for the Aug/Sep/Oct 2018 webgraphs, about 8,000-9,000 host names failed to be mapped to domain names via the public suffix list.
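The DNS-validation idea could be sketched as follows; this is only an illustration of the proposal, not the actual HostToDomainGraph logic, and the `www.` fallback is an assumption based on the example above:

```python
# Sketch: accept a multi-component public suffix as a domain name if it
# (or its www. variant) resolves to an IP address via DNS.
import socket

def resolves(hostname: str) -> bool:
    """Return True if the hostname can be resolved to an IP address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def is_valid_domain_candidate(suffix: str) -> bool:
    """Accept a public suffix as a domain name if it, or www. + it, resolves."""
    return resolves(suffix) or resolves("www." + suffix)
```

For example, `is_valid_domain_candidate("gov.br")` would accept `gov.br` because `www.gov.br` resolves. A DNS lookup per candidate is slow, so this would only be practical for the few thousand unmapped host names, not the whole graph.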
My job needs to check whether given links are included in a specific crawl.
I found this page explaining how to access the webgraph data to go through all links in Common Crawl.
And I found this page for searching webgraph releases on commoncrawl.org.
But I wonder: is there a place to access the latest webgraph data programmatically, instead of manually searching on commoncrawl.org?
Or are the links in Common Crawl roughly the same across different crawls, so that it's safe to use webgraph data that isn't completely fresh?
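For the narrower question of checking whether a given URL is included in a specific crawl, the CDX index API can be queried directly instead of the webgraph; a minimal sketch, where the crawl id is just an example (current ids are listed at https://index.commoncrawl.org/collinfo.json):

```python
# Sketch: ask the Common Crawl CDX index whether a URL was captured in a crawl.
import urllib.error
import urllib.parse
import urllib.request

def cdx_query_url(crawl_id: str, url: str) -> str:
    """Build the CDX index API query URL for one crawl and one page URL."""
    params = urllib.parse.urlencode({"url": url, "output": "json", "limit": "1"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

def in_crawl(crawl_id: str, url: str) -> bool:
    """Return True if the CDX index has at least one capture of the URL."""
    try:
        with urllib.request.urlopen(cdx_query_url(crawl_id, url)) as resp:
            return bool(resp.readline().strip())
    except urllib.error.HTTPError:
        return False  # the index returns 404 when there is no capture
```

Example: `in_crawl("CC-MAIN-2023-40", "example.com/")`. Note the CDX index covers captured pages, while the webgraph also contains link targets that were never fetched.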
Hi, I am trying to download the webgraph data from:
https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2023-mar-may-oct/index.html
but I am unable to do so. Please let me know if there is a different link I should be downloading from.
The tools HostToDomainGraph and JoinSortRanks must always read and write data as UTF-8. This is important if the input hostnames include Unicode internationalized domain names (IDNs). For Common Crawl webgraphs the hostnames should be represented in ASCII as IDNA. However, due to a bug this wasn't always the case (cf. commoncrawl/cc-pyspark#35). Because the output streams weren't properly configured, the cc-webgraph Java tools replaced all non-ASCII characters by `?`, e.g. `app.ıo` -> `app.?o`.
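For reference, the expected ASCII (IDNA) form of an internationalized hostname can be produced with Python's standard library; the hostname below is just an example:

```python
# Sketch: convert an internationalized hostname to its ASCII (IDNA) form,
# which is how hostnames should be represented in the Common Crawl webgraphs.
def to_idna(hostname: str) -> str:
    """Encode a Unicode hostname into its ASCII-compatible (IDNA) form."""
    return hostname.encode("idna").decode("ascii")

print(to_idna("münchen.example"))  # xn--mnchen-3ya.example
```

Note that Python's built-in `idna` codec implements IDNA 2003; the third-party `idna` package implements the newer IDNA 2008 rules, which differ for a few characters.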
We should consider using a more efficient storage format, such as Parquet, for the host and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages.
Requirements (at least nice to have): … (`.graph`) format.

When running `mvn package` in cc-webgraph, I get a build failure alert.
mvn -version returns:
Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)
Maven home: /usr/local/apache-maven-3.9.5
Java version: 21, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home
Default locale: en_GB, platform encoding: UTF-8
OS name: "mac os x", version: "13.1", arch: "aarch64", family: "mac"
And here is the error log:
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.commoncrawl.webgraph.TestJoinSortRanks
[ERROR] Java heap space
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.780 s
[INFO] Finished at: 2023-10-05T08:57:17-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M6:test (default-test) on project cc-webgraph:
[ERROR]
[ERROR] Please refer to /Users/abra/src/cc-webgraph/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] There was an error in the forked process
[ERROR] Java heap space
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Java heap space
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:699)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:311)
[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:268)
[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1332)
[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1165)
[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:929)
[ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:126)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:328)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:316)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:174)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.access$000(MojoExecutor.java:75)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor$1.run(MojoExecutor.java:162)
[ERROR] at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute(DefaultMojosExecutionStrategy.java:39)
[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:159)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:105)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:73)
[ERROR] at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:53)
[ERROR] at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:118)
[ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:261)
[ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:173)
[ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:101)
[ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:906)
[ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:283)
[ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:206)
[ERROR] at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
[ERROR] at java.base/java.lang.reflect.Method.invoke(Method.java:580)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:283)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:226)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:407)
[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:348)
I've already tried increasing the memory allocated to the Surefire plugin by editing pom.xml with:

```xml
<configuration>
  <argLine>-Xmx1024m -Xms512m</argLine>
</configuration>
```
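If raising the limit this way doesn't take effect, it's worth checking that the configuration sits on the maven-surefire-plugin entry itself: Surefire runs the tests in a forked JVM whose heap is controlled by `argLine`, not by `MAVEN_OPTS`. A sketch of where the snippet above would go (the 2g value is just an example; the test may need more):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- heap of the forked test JVM; raise if tests still fail with "Java heap space" -->
        <argLine>-Xmx2g</argLine>
      </configuration>
    </plugin>
  </plugins>
</build>
```

As a workaround, `mvn package -DskipTests` builds the jar without running the tests at all.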
[main] INFO org.commoncrawl.webgraph.JoinSortRanks - Loading page rank values from host/cc-main-2021-22-oct-nov-jan-host-pagerank.ranks
Exception in thread "main" java.lang.IllegalArgumentException: newLimit < 0: (-1216172000 < 0)
at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
at java.base/java.nio.Buffer.limit(Buffer.java:346)
at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6431)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6452)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6520)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7006)
at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7018)
at org.commoncrawl.webgraph.JoinSortRanks.loadPageRank(JoinSortRanks.java:50)
at org.commoncrawl.webgraph.JoinSortRanks.main(JoinSortRanks.java:319)
Hi,
I am interested in gathering data from common crawl using the webgraph as a metric for webpage similarity. So far, I have recreated parts of the webgraph and am using the following code to download WET files from the common crawl dataset:
```python
from warcio import ArchiveIterator
import cdx_toolkit
import requests

def get_page_url(webURL):
    # Look up the most recent capture of the URL in the Common Crawl CDX index.
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    for obj in cdx.iter(webURL, limit=1):
        return wet_builder(obj['filename'])

def wet_builder(page_url):
    # Map the WARC file path to the corresponding WET (extracted text) file.
    warc_url = f'https://commoncrawl.s3.amazonaws.com/{page_url}'
    wet_url = warc_url.replace('/warc/', '/wet/').replace('warc.gz', 'warc.wet.gz')
    r = requests.get(wet_url, stream=True)
    records = ArchiveIterator(r.raw)
    record = next(records)  # skip the leading warcinfo record
    record = next(records)
    a = record.content_stream().read()
    retText = a.decode('utf-8')[:1000]
    r.close()
    return retText
```
However, this code runs slowly and requires each website to be passed in manually. Is there a faster (more native) way to crawl the webgraph and download each webpage's text content? I don't need to go through the entire webgraph; I am just looking to crawl a subset of it.
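One common speed-up, offered as a suggestion rather than an official recipe: the CDX records that cdx_toolkit returns also carry `offset` and `length` fields, so a single record can be fetched with an HTTP Range request instead of streaming a whole WET/WARC file. A minimal sketch:

```python
# Sketch: fetch exactly one (gzip-compressed) record from a Common Crawl file
# via an HTTP Range request, using the offset/length from a CDX index record.
import gzip
import urllib.request

def range_header(offset: int, length: int) -> dict:
    """Build the HTTP Range header covering one record; byte ranges are inclusive."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

def fetch_record_text(filename: str, offset: int, length: int) -> str:
    """Download and decompress a single record; it is a standalone gzip member."""
    url = f"https://data.commoncrawl.org/{filename}"
    req = urllib.request.Request(url, headers=range_header(offset, length))
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", "replace")
```

This downloads a few kilobytes per page instead of a gigabyte-scale file, at the cost of one request per record.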
Hi there,
I am trying to use the commoncrawl repositories to build a weighted webgraph, where weights represent the number of links between the source and target domain. The goal is to use the weighted version of the webgraph to research various link spam methods and where they are being used.
It should be feasible given the infrastructure here. The most straightforward method (which I hope is plausible) relies on the link extraction code doing no deduplication. If that is the case, it should be as simple as counting duplicates instead of dropping them in hostlinks_to_graph. Follow-up changes would be needed in HostToDomainGraph.convertEdge to parse the counts and aggregate them at the domain level (similar to the countHosts option), and finally some changes in the webgraph framework to reason over the weights.
Alternatively, if the extraction job doesn't list every link and deduplicates up front, we would need to count the links as we go: e.g. in cc-pyspark/wat_extract_links.py, propagating counts wherever unweighted links are currently yielded (as in process_records and hostlinks_extract_fastwarc), and then combining the link-count dictionaries from each crawl and each split.
Before going ahead with this I wanted to ask whether the first approach is feasible (the latter would be grim). Also, if anything like this has been attempted before, or weighted webgraphs have already been generated and I missed them, please let me know!
Thanks for any and all help!
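The counting approach described above can be sketched in a few lines; note the host-to-domain mapping below is a toy placeholder, not what HostToDomainGraph actually does (which uses the full public suffix list):

```python
# Sketch: turn a duplicate-preserving edge list into a weighted edge list,
# then aggregate host-level weights to domain level.
from collections import Counter

def weight_edges(edges):
    """Count how often each (source, target) pair occurs."""
    return Counter(edges)

def to_domain(host, suffixes=frozenset({"co.uk"})):
    """Toy registered-domain extraction; the real code uses the public suffix list."""
    parts = host.split(".")
    n = 3 if ".".join(parts[-2:]) in suffixes else 2
    return ".".join(parts[-n:])

def domain_weights(host_edges):
    """Aggregate weighted host edges into weighted domain edges."""
    agg = Counter()
    for (src, dst), w in weight_edges(host_edges).items():
        agg[(to_domain(src), to_domain(dst))] += w
    return agg
```

For example, three host-level links from `a.example.com`/`www.example.com` to `b.co.uk` collapse into one domain edge `("example.com", "b.co.uk")` with weight 3.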
Hi,
I noticed duplicate rows in the Common Crawl Oct/Nov/Jan 2021-2022 domain-level webgraph. You can find it by unzipping the file and running the command:
grep -e $'\tno\.hordaland\t' cc-main-2021-22-oct-nov-jan-domain-ranks.txt
This leads to the following output:
720700 1.4451489E7 273019 1.4683511019933418E-7 no.hordaland 1
46550260 1.0575547E7 79735653 4.063660522732166E-9 no.hordaland 1
This is the only duplicate I found in the entire file.
You probably want to do a sanity check for this in whatever your preferred language is. That is how I found the issue in the first place! I used Python with pandas:
```python
import pandas as pd

df = pd.read_csv(
    "./dataset/cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz",
    compression="gzip",
    sep="\t")

# crashes: is_unique is a property, not a method, so calling it raises TypeError
# print(df["#host_rev"].is_unique())

# comparing row count against the number of unique values works
print(len(df) == len(df["#host_rev"].unique()))

# to see non-unique values
print(df["#host_rev"].value_counts())
```
Pandas takes several minutes for some of these operations because it loads the full file into memory. You may want to try vaex or something similar if using Python; I'm sure there are comparable tools in Java and other languages. You could of course do the check in other ways, but I thought I'd provide the code in case it's useful.
Note that I haven't yet checked whether these duplicate nodes are present in the other webgraph files, but I suspect they probably are, so similar sanity checks on those would be worthwhile. This should improve the quality of what is an excellent dataset. Thanks very much for making this available!