trivio / common_crawl_index
Index URLs in Common Crawl
Not sure if this is the right place to create a ticket, but I'm wondering if it's possible to update the URL index to the March 2014 crawl?
I think I found incorrectly sorted URLs in the index. For example, in block 4, net.about-plumbing.www/... comes after net.absolutely.www/...:
'my.com.mahsuri.www/blog/rawatan/ti\x00',
'name.armando.francesco.www/gallery/andrea_e_vera/slides/DSCN4325\x00',
'net.123tools.www/audio_multimedia/audio_file_players/index-n-123tools-3\x00',
+ 'net.about-plumbing.www/new-hampshire/colebrook-m\x00',
"net.absolutely.www/event/Premiere_of_'Land_of_the_Lost'/land_of_the_lost_02_wenn24377\x00",
- 'net.about-plumbing.www/new-hampshire/colebrook-m\x00',
'net.adiochiropractic.www/templates20/article/1296\x00',
'net.agilpage.www/index~m~1~w~11276\x00',
'net.alblasserdam.www/nieuws/2009-07-02-6051-internetprovider-proserve-bouwt-regionaal-datacenter-in-alblasserdam.html:http\x00',
(The above is a diff between the actual list of prefixes in the block and that same list sorted.)
This could cause the bisection lookup to fail.
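A minimal Python sketch (using shortened, illustrative prefixes rather than the real block entries) of why an out-of-order entry breaks bisection:

```python
import bisect

# bisect assumes sorted input; an out-of-order entry can be missed
# even though it is present in the list.
unsorted_prefixes = [
    'net.123tools.www',
    'net.absolutely.www',
    'net.about-plumbing.www',  # out of order: should precede net.absolutely.www
]
target = 'net.about-plumbing.www'
i = bisect.bisect_left(unsorted_prefixes, target)
found = i < len(unsorted_prefixes) and unsorted_prefixes[i] == target
# found is False, even though the key is present at index 2
```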
Using the remote_read script, I think there is a typo in the item_keys: shouldn't arcFileParition be arcFilePartition?
The documentation recommends a specific EC2 instance type, m1.xlarge, "for a fast connection to S3":
https://github.com/trivio/common_crawl_index#using-the-remote_copy-utility-script
Since then, that instance type has been reclassified as a "previous generation" Amazon instance (they are on m3 now).
Amazon also offers a number of instance types that may not have been available when the documentation and script were first written, some optimized for particular workloads, as opposed to the M series, which provides a balance.
http://aws.amazon.com/ec2/instance-types/
It's also suggested that the CC data be hosted on S3, in which case network performance might be less critical for the machine?
Do you have any recommendations about which machine might be optimal, with and without CC hosted on S3?
Input:
./bin/remote_copy check "org.domain.www"
Output:
Traceback (most recent call last):
  File "./bin/remote_copy", line 152, in <module>
    mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')
  File "./bin/remote_copy", line 35, in __init__
    self.key = bucket.get_key(key_name)
AttributeError: 'NoneType' object has no attribute 'get_key'
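The AttributeError indicates that the bucket object was None before get_key was ever called, i.e. the bucket lookup silently failed. A minimal sketch of a guard for that case (open_index_key is a hypothetical helper, not part of the script):

```python
def open_index_key(conn, bucket_name, key_name):
    # A lookup that fails returns None; calling get_key on that None
    # is what produces the AttributeError shown above.
    bucket = conn.lookup(bucket_name)
    if bucket is None:
        raise RuntimeError('could not open bucket %r; check the bucket '
                           'name and your S3 access' % bucket_name)
    return bucket.get_key(key_name)
```

With a guard like this, a failed lookup surfaces as a descriptive error instead of an AttributeError deep inside BotoMap.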
Hello,
I am experimenting with your Python script which seems very promising for a separate project requiring lookups of Common Crawl URLs. I have installed your current version from Git on my Ubuntu machine and ran:
$ bin/index_lookup_remote france.fr
Traceback (most recent call last):
  File "bin/index_lookup_remote", line 26, in <module>
    import boto
ImportError: No module named boto
Can you please tell me what boto is supposed to be?
Thanks,
Dan
Hello,
Could this script retrieve URLs belonging to a country TLD, e.g. *.fi?
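Since the index keys begin with the reversed hostname, every URL under a given country TLD shares a common key prefix, so a prefix query for 'fi.' would cover *.fi. A small sketch with made-up keys in the index's reversed-host format:

```python
# Made-up example keys in the index's reversed-host format.
keys = [
    'com.example.www/page:http',
    'fi.helsinki.www/etusivu:http',
    'fi.yle.www/uutiset:http',
]

# All *.fi URLs share the 'fi.' key prefix.
finnish = [k for k in keys if k.startswith('fi.')]
```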
The function reversehost() in some cases makes recovery of the original URL ambiguous,
i.e. ua.com.book-hunter.www/book/view/231/page:16:http represents both
reversehost('http://www.book-hunter.com.ua/book/view/231/page:16')
== reversehost('http://www.book-hunter.com.ua:16/book/view/231/page')
== 'ua.com.book-hunter.www/book/view/231/page:16:http'
which makes such reversed-hostname keys ambiguous, and the reversehost() procedure not always invertible to the actual URL.
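A hypothetical reconstruction of a reversehost-style transform (not the project's actual code) that reproduces the collision, using the standard library's urlsplit:

```python
from urllib.parse import urlsplit

def reversehost_sketch(url):
    # Hypothetical reconstruction: reverse the host labels, append the
    # path, then the port (if any) and the scheme, colon-separated.
    parts = urlsplit(url)
    key = '.'.join(reversed(parts.hostname.split('.'))) + parts.path
    if parts.port is not None:
        key += ':%d' % parts.port
    return key + ':' + parts.scheme

a = reversehost_sketch('http://www.book-hunter.com.ua/book/view/231/page:16')
b = reversehost_sketch('http://www.book-hunter.com.ua:16/book/view/231/page')
# a == b: a ':16' port and a ':16' path suffix collapse to the same key
```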
Right now, CC contains some URLs with username:password in them.
On index creation they were transformed by the function reversehost(), and as a result they are not searchable:
reversehost('http://123456:[email protected]/')
== '123456/:[email protected]:http'
reversehost('http://Dennis:[email protected]/members/index.shtml')
== 'Dennis/members/index.shtml:[email protected]:http'
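A hypothetical reconstruction (again, not the project's actual code) that reproduces the behaviour above: splitting the netloc on the first ':' as host:port sends everything after the username, including the password and real hostname, into the "port" slot:

```python
def buggy_reversehost(url):
    # Hypothetical reconstruction of the observed behaviour: the netloc
    # is split on the first ':' as host:port, so any userinfo
    # (user:password@host) ends up appended to the key like a port.
    scheme, rest = url.split('://', 1)
    netloc, _, path = rest.partition('/')
    host, _, port = netloc.partition(':')
    key = '.'.join(reversed(host.split('.'))) + '/' + path
    if port:
        key += ':' + port
    return key + ':' + scheme
```

Under this reconstruction, the username alone becomes the "hostname" of the key, which is why these URLs can't be found by a hostname-prefix search.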
I am trying to copy a file in text mode, but it is not working. The URL is com.wordpress.alinebessa/2011/06/11/documenting-accerciser-first-impressions/:http
which exists in Common Crawl. When I check it out here: http://urlsearch.commoncrawl.org/page/1346876860454/1346973204444/3513/41986721/13163
it loads correctly, but this does not happen when I try to fetch it in remote_copy (method copy_arc_files) with:
if src_key:
    print src_key.get_contents_as_string(headers=headers, encoding="iso-8859-1")
It comes back to me as bytes. Can you folks please help me retrieve the actual text? Thanks!
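The entries in the URL index point into gzip-compressed ARC files, so the bytes fetched from S3 are compressed data rather than text. A minimal sketch of decoding them, assuming the fetched range is a complete gzip member (decode_arc_bytes is a hypothetical helper):

```python
import gzip
import io

def decode_arc_bytes(raw):
    # Decompress the gzip member first, then decode; iso-8859-1 is just
    # a safe default, since ARC records declare their own content types.
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
        return f.read().decode('iso-8859-1')
```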
The reason I believe it should return something is:
curl -r 1048584-1048744 http://s3.amazonaws.com/aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792
which fetches the second index block, which contains prefixes for this domain.
I expected all of these to give the same first result:
$ bin/remote_read org.wikipedia.en | head -1
org.wikipedia.en/wiki/1525:http {'compressedSize': 15889, 'arcSourceSegmentId': 1346876860777, 'arcFilePartition': 4752, 'arcFileDate': 1346910706993, 'arcFileOffset': 75900503}
$ bin/remote_read org.wikipedia.en/wiki | head -1
org.wikipedia.en/wiki/1647_in_literature:http {'compressedSize': 9294, 'arcSourceSegmentId': 1346876860777, 'arcFilePartition': 1900, 'arcFileDate': 1346910123817, 'arcFileOffset': 77716338}
$ bin/remote_read org.wikipedia.en/wiki/1 | head -1
org.wikipedia.en/wiki/1942:_Joint_Strike:http {'compressedSize': 10488, 'arcSourceSegmentId': 1346823846039, 'arcFilePartition': 1724, 'arcFileDate': 1346872155599, 'arcFileOffset': 10475238}
$ bin/remote_read org.wikipedia.en/wiki/1942 | head -1
[no output]
The last one didn't even find the existing URL. Am I doing something wrong?
Is this project deprecated? I see there have been no commits since 2013, and there appears to be a new index scheme available since 2015: http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
Since the example in the README.md is not working, I'm guessing that this project is no longer being (and no longer needs to be) maintained. If that's the case, I think we could help some people by updating the README.md to indicate that the project has been superseded by the new index.
If that's true, I am happy to create a pull request for it.