cc-webgraph's People

Contributors

dependabot[bot], jnioche, sebastian-nagel, thunderpoot

cc-webgraph's Issues

HostToDomainGraph: allow public suffixes as domain names

Some public suffixes with two or more components are also valid domain names, i.e., they can be resolved to an IP address by DNS:

  • ma.gov.br which points to the homepage of the government of the Brazilian State Maranhão
  • gov.br resolves if the stripped www. prefix is added back: www.gov.br points to the site of the Brazilian government
  • although most public suffixes do not resolve by DNS: nsw.edu.au, broadway.vista, whois.nic.vista, edu.zm

Either generally allow valid public suffixes with two or more components, or additionally try to validate them (check whether they are resolvable by DNS). Note: for the Aug/Sep/Oct 2018 webgraphs, about 8000-9000 host names failed to be mapped to domain names via the public suffix list.
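
A rough sketch of the intended mapping (illustrative only, not the actual HostToDomainGraph code; public_suffixes stands for the parsed public suffix list, and the simplified matching ignores wildcard and exception rules):

import socket

def resolves(host):
    """Return True if the host name resolves to an IP address via DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def host_to_domain(host, public_suffixes, validate_by_dns=False):
    """Map a host name to its registered domain; optionally keep host names
    that are themselves multi-component public suffixes (e.g. ma.gov.br)."""
    labels = host.split('.')
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in public_suffixes:
            if i > 0:
                # normal case: registered domain = public suffix plus one more label
                return '.'.join(labels[i - 1:])
            # the host itself is a public suffix: accept it as a domain name
            # if it has two or more components (and, optionally, resolves)
            if len(labels) >= 2 and (not validate_by_dns or resolves(host)):
                return host
            return None
    return None  # host name not covered by the public suffix list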

The latest `Webgraph` data?

My job needs to check whether given links are included in a specific crawl.

I found this, which explains how to access the webgraph data and go through all links in Common Crawl.

And I found this for searching for webgraph releases on commoncrawl.org.

But is there a place to access the latest webgraph data, instead of manually searching on commoncrawl.org?

Or are the links in Common Crawl roughly the same across different crawls, so that it is safe to use webgraph data that is not the freshest?
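
For reference, a minimal sketch that lists the available webgraph releases programmatically, assuming the releases still live under the public prefix s3://commoncrawl/projects/hyperlinkgraph/:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous access is sufficient for the public commoncrawl bucket
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket='commoncrawl',
                          Prefix='projects/hyperlinkgraph/',
                          Delimiter='/')
releases = [p['Prefix'] for p in resp.get('CommonPrefixes', [])]
# release names contain month names, so a plain lexical sort is only
# approximate; inspect the listing to pick the most recent release
for release in sorted(releases):
    print(release)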

HostToDomainGraph / JoinSortRanks: input/output is always UTF-8

The tools HostToDomainGraph and JoinSortRanks must always read and write data as UTF-8. This is important if the input host names include Unicode internationalized domain names (IDNs). For Common Crawl webgraphs the host names should be represented in ASCII as IDNA. However, due to a bug this wasn't always the case (cf. commoncrawl/cc-pyspark#35). Because the output streams weren't properly configured, the cc-webgraph Java tools replaced all non-ASCII characters with ?, e.g. app.ıo -> app.?o.
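
As an illustration (a sketch, not part of the tools themselves): host names can be normalized to their ASCII/IDNA form before they reach the Java tools, which sidesteps the charset issue entirely. The assumed input is one host name per line on stdin:

import sys

for line in sys.stdin:
    host = line.rstrip('\n')
    try:
        # e.g. 'bücher.example' -> 'xn--bcher-kva.example'
        host = host.encode('idna').decode('ascii')
    except UnicodeError:
        pass  # leave names the codec cannot handle unchanged
    print(host)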

Explore columnar storage format for webgraph rankings and node labels

We should consider using a more efficient storage format, such as Parquet, for the host- and domain-level rankings. The tooling to read Parquet files has improved in recent years, and readers for this format are now available for almost all programming languages. (A small conversion sketch follows the requirements list below.)

Requirements (at least nice to have):

  • smaller storage footprint
  • easy analysis and quick lookups by domain name using big data tools (e.g. Amazon Athena - a wish expressed on the CC group)
    • note: this will probably require sorting the data by reverse domain name
  • still fast to get the top-n ranking domains
  • well-defined table schema including column descriptions
  • example code showing how to use the new data format
  • (optionally) also store the column holding the node IDs
    • this would make the vertex file(s) obsolete
    • could also drop the textual files holding the edges, because the edges (the unlabeled graph) are already stored, more efficiently, in the webgraph (.graph) format
  • would allow adding more columns, e.g. indegrees and outdegrees, with little overhead
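
A minimal conversion sketch using pyarrow; the column names are guesses based on the current six-column ranks layout and should be adjusted to the real header:

import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# assumed column layout of the current ranks TSV (adjust to the real header)
schema = pa.schema([('harmonicc_pos', pa.int64()),
                    ('harmonicc_val', pa.float64()),
                    ('pr_pos', pa.int64()),
                    ('pr_val', pa.float64()),
                    ('host_rev', pa.string()),
                    ('n_hosts', pa.int32())])

table = pacsv.read_csv(
    'cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz',
    read_options=pacsv.ReadOptions(column_names=schema.names, skip_rows=1),
    parse_options=pacsv.ParseOptions(delimiter='\t'),
    convert_options=pacsv.ConvertOptions(column_types=schema))

# sort by reversed domain name so lookups by domain (e.g. via Athena) stay
# cheap, then write Parquet with the schema and column types embedded
pq.write_table(table.sort_by('host_rev'), 'domain-ranks.parquet',
               compression='zstd')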

Maven Build Failure

When running mvn package in cc-webgraph, I get a build failure.

mvn -version returns:

Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)
Maven home: /usr/local/apache-maven-3.9.5
Java version: 21, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home
Default locale: en_GB, platform encoding: UTF-8
OS name: "mac os x", version: "13.1", arch: "aarch64", family: "mac"

And here is the error log:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.commoncrawl.webgraph.TestJoinSortRanks
[ERROR] Java heap space
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.780 s
[INFO] Finished at: 2023-10-05T08:57:17-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M6:test (default-test) on project cc-webgraph: 
[ERROR] 
[ERROR] Please refer to /Users/abra/src/cc-webgraph/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] There was an error in the forked process
[ERROR] Java heap space
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: There was an error in the forked process
[ERROR] Java heap space
[ERROR] 	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:699)
[ERROR] 	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:311)
[ERROR] 	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:268)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1332)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1165)
[ERROR] 	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:929)
[ERROR] 	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:126)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2(MojoExecutor.java:328)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute(MojoExecutor.java:316)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:174)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.access$000(MojoExecutor.java:75)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor$1.run(MojoExecutor.java:162)
[ERROR] 	at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute(DefaultMojosExecutionStrategy.java:39)
[ERROR] 	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:159)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:105)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:73)
[ERROR] 	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:53)
[ERROR] 	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:118)
[ERROR] 	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:261)
[ERROR] 	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:173)
[ERROR] 	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:101)
[ERROR] 	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:906)
[ERROR] 	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:283)
[ERROR] 	at org.apache.maven.cli.MavenCli.main(MavenCli.java:206)
[ERROR] 	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
[ERROR] 	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:283)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:226)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:407)
[ERROR] 	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:348)

I've already tried increasing the memory allocated to the plugin by editing pom.xml with

<configuration>
  <argLine>-Xmx1024m -Xms512m</argLine>
</configuration>
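
For context, a possible workaround (a guess, not a confirmed fix): TestJoinSortRanks allocates large arrays, so the forked test JVM may need considerably more heap than 1 GiB. The limit could be raised in the surefire plugin configuration, for example:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <argLine>-Xmx4g</argLine>
  </configuration>
</plugin>

Alternatively, the build can be run without tests via mvn package -DskipTests.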

Joining and sorting ranks fails to load serialized double array holding page rank scores

[main] INFO org.commoncrawl.webgraph.JoinSortRanks - Loading page rank values from host/cc-main-2021-22-oct-nov-jan-host-pagerank.ranks
Exception in thread "main" java.lang.IllegalArgumentException: newLimit < 0: (-1216172000 < 0)
        at java.base/java.nio.Buffer.createLimitException(Buffer.java:372)
        at java.base/java.nio.Buffer.limit(Buffer.java:346)
        at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107)
        at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:235)
        at java.base/java.nio.MappedByteBuffer.limit(MappedByteBuffer.java:67)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6431)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6452)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:6520)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7006)
        at it.unimi.dsi.fastutil.io.BinIO.loadDoubles(BinIO.java:7018)
        at org.commoncrawl.webgraph.JoinSortRanks.loadPageRank(JoinSortRanks.java:50)
        at org.commoncrawl.webgraph.JoinSortRanks.main(JoinSortRanks.java:319)
  • worked around (no time at this point for a deeper analysis) by downgrading, i.e. reverting the recent commits back to f33704b, which was used to build the preceding graph

Crawling CC using the Webgraph

Hi,

I am interested in gathering data from Common Crawl using the webgraph as a metric for webpage similarity. So far, I have recreated parts of the webgraph and am using the following code to download WET files from the Common Crawl dataset:

from warcio import ArchiveIterator
import cdx_toolkit
import requests


def get_page_url(web_url):
    # look up a capture of the given URL in the Common Crawl index
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    for obj in cdx.iter(web_url, limit=1):
        return wet_builder(obj['filename'])


def wet_builder(page_url):
    # derive the WET file location from the WARC filename of the capture
    warc_url = f'https://commoncrawl.s3.amazonaws.com/{page_url}'
    wet_url = warc_url.replace('/warc/', '/wet/').replace('warc.gz', 'warc.wet.gz')
    r = requests.get(wet_url, stream=True)
    records = ArchiveIterator(r.raw)
    record = next(records)  # skip the leading warcinfo record
    record = next(records)  # first text conversion record
    ret_text = record.content_stream().read().decode('utf-8')[:1000]
    r.close()
    return ret_text

However, this code runs slowly and requires each website to be passed in manually. Is there a faster (more native) way to crawl the webgraph and download each webpage's text content? I don't need to go through the entire webgraph; I am just looking to crawl a subset of it.

Constructing a Weighted Webgraph

Hi there,

I am trying to use the commoncrawl repositories to build a weighted webgraph, where weights represent the number of links between the source and target domain. The goal is to use the weighted version of the webgraph to research various link spam methods and where they are being used.

It should be feasible given the infrastructure here. The most straightforward method (which I hope is plausible) relies on the link extraction code doing no deduplication. If that is the case, it should be as simple as counting duplicates instead of dropping them in hostlinks_to_graph. Follow-up code changes would be needed in HostToDomainGraph.convertEdge to parse the counts and aggregate them at the domain level (similar to the countHosts option), and finally some changes in the webgraph framework to reason over the weights.
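
A rough sketch of that counting step (illustrative only; links stands for the raw host-level (source, target) pairs before any deduplication, and the column names and paths are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('weighted-hostgraph').getOrCreate()
links = spark.read.parquet('hostlinks/')  # assumed raw (source, target) pairs

# current behaviour: duplicates dropped, unweighted edges
edges = links.distinct()

# weighted variant: keep the duplicates and count them as the edge weight
weighted = (links.groupBy('source_host', 'target_host')
                 .count()
                 .withColumnRenamed('count', 'weight'))
weighted.write.option('sep', '\t').csv('hostgraph-weighted/')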

Alternatively, if the extraction job doesn't list every link and deduplicates beforehand, we would need to count the links as we go, e.g. in cc-pyspark/wat_extract_links.py by propagating counts wherever unweighted links are currently yielded (as in process_records and hostlinks_extract_fastwarc), and then combining the link-count dictionaries from each crawl and each split.

Before going ahead with this I wanted to ask if the first approach is feasible (the latter would be grim). Also, if anything like this has been attempted before, or weighted webgraphs have already been generated and I missed it, please let me know!

Thanks for any and all help!

Duplicate nodes in domain-level webgraph

Hi,

I noticed duplicate rows in the Common Crawl Oct/Nov/Jan 2021-2022 domain-level webgraph. You can see them by decompressing the file and running the command:

grep -e $'\tno\.hordaland\t' cc-main-2021-22-oct-nov-jan-domain-ranks.txt

This leads to the following output:

grep -e $'\tno\.hordaland\t' cc-main-2021-22-oct-nov-jan-domain-ranks.txt
720700  1.4451489E7     273019  1.4683511019933418E-7   no.hordaland    1
46550260        1.0575547E7     79735653        4.063660522732166E-9    no.hordaland    1

This is the only duplicate I found in the entire file.

You probably want to do a sanity check for this in whatever your preferred language is. This is how I found the issue in the first place! I used Python with pandas:

import pandas as pd

df = pd.read_csv(
    "./dataset/cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz",
    compression="gzip",
    sep="\t")

# is_unique is a property, not a method, so calling it raises a TypeError;
# print(df["#host_rev"].is_unique) works (without the parentheses)

# using unique instead works
print(len(df) == len(df["#host_rev"].unique()))

# to see non-unique values
print(df["#host_rev"].value_counts())

Pandas takes several minutes for some of these operations because it loads the full file into memory. You may want to try vaex or something similar if using Python. I'm sure there are similar tools in Java and other languages. You could of course do the check in other ways, but I thought I'd provide the code in case it's useful.
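
For example, a plain standard-library variant that streams the gzipped file instead of loading it all into memory (assuming the reversed domain name is the fifth tab-separated column, as above):

import gzip

seen, dupes = set(), set()
with gzip.open('cc-main-2021-22-oct-nov-jan-domain-ranks.txt.gz', 'rt') as f:
    next(f)  # skip the header line
    for line in f:
        name = line.split('\t')[4]
        if name in seen:
            dupes.add(name)
        seen.add(name)
# the set of names still needs a few GB of RAM, but far less than a DataFrame
print(dupes or 'no duplicates found')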

Note that I haven't checked yet whether these duplicate nodes are present in the other webgraph files, but I suspect they probably are, so you may want to run similar sanity checks on those as well. This should improve the quality of what is an excellent dataset. Thanks very much for making this available!
