commoncrawl-fetcher-lite's People

Contributors

dependabot[bot], tballison

commoncrawl-fetcher-lite's Issues

Won't download older crawls

It appears that crawls older than CC-MAIN-2017-22 can't be downloaded. I tried several and all died with the same error, see below.

INFO [pool-2-thread-4] 10:14:01,348 org.tallison.cc.index.extractor.CCFileExtractor Finished fetching 745,235,260 bytes in 88,630 ms for index gz: cc-index/collections/CC-MAIN-2017-04/indexes/cdx-00002.gz
ERROR [main] 10:14:01,451 org.tallison.cc.index.extractor.CCFileExtractor main loop exception
java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:105) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor.main(CCFileExtractor.java:69) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFetcherCli.main(CCFetcherCli.java:38) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.util.regex.Matcher.getTextLength(Matcher.java:1769) ~[?:?]
at java.util.regex.Matcher.reset(Matcher.java:415) ~[?:?]
at java.util.regex.Matcher.<init>(Matcher.java:252) ~[?:?]
at java.util.regex.Pattern.matcher(Pattern.java:1149) ~[?:?]
at org.tallison.cc.index.selector.RegexSelector.select(RegexSelector.java:38) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.selector.RecordSelector.select(RecordSelector.java:61) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractorRecordProcessor.process(CCFileExtractorRecordProcessor.java:75) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.processFile(CCFileExtractor.java:190) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:160) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:129) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1589) ~[?:?]
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:119)
at org.tallison.cc.index.extractor.CCFileExtractor.main(CCFileExtractor.java:69)
at org.tallison.cc.index.extractor.CCFetcherCli.main(CCFetcherCli.java:38)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:105)
... 2 more
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
at java.base/java.util.regex.Matcher.reset(Matcher.java:415)
at java.base/java.util.regex.Matcher.<init>(Matcher.java:252)
at java.base/java.util.regex.Pattern.matcher(Pattern.java:1149)
at org.tallison.cc.index.selector.RegexSelector.select(RegexSelector.java:38)
at org.tallison.cc.index.selector.RecordSelector.select(RecordSelector.java:61)
at org.tallison.cc.index.extractor.CCFileExtractorRecordProcessor.process(CCFileExtractorRecordProcessor.java:75)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.processFile(CCFileExtractor.java:190)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:160)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:129)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1589)
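
The trace shows Pattern.matcher being handed a null string from RegexSelector.select. One possible explanation, which is an assumption and not something the log confirms, is that records in the older indexes lack a field the selector is configured to match. A minimal sketch of a null-tolerant selector follows. NullSafeRegexSelector is a hypothetical class, not the project's RegexSelector, and the choice of find() over matches() is also an assumption.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a regex-based selector; not the project's actual class. */
final class NullSafeRegexSelector {
    private final Pattern pattern;

    NullSafeRegexSelector(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns false for a missing (null) field instead of throwing. */
    boolean select(String value) {
        if (value == null) {
            // Pattern.matcher(null) fails with the NullPointerException seen in the
            // trace, so a missing field is treated as a non-match here.
            return false;
        }
        return pattern.matcher(value).find();
    }
}
```

With a guard like this, a record that lacks the field would simply not be selected, and the worker thread would keep going instead of ending the whole run.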

Backoff messages

Getting lots of backoff messages which result in sleeps. How does the throttling setting in Fetcher affect this?

WARN [pool-2-thread-3] 10:01:13,678 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#2) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764501555.34/warc/CC-MAIN-20230209081052-20230209111052-00825.warc.gz. Will sleep 120 seconds. Message: bad status code: 503 ::
SlowDown: Please reduce your request rate. C6X3AAMQA0VP9EPCx7TfH5kVcwDGJGNw4rwLR7gptqe/Nwh2MpYEfxIOrWt3czhmG3YU7Oa8oAF7EPxPvZAOVssap9ZEK8hV9vT6rQ==.
WARN [pool-2-thread-2] 10:02:54,901 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#3) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764500641.25/warc/CC-MAIN-20230207201702-20230207231702-00340.warc.gz. Will sleep 600 seconds. Message: bad status code: 503 ::
SlowDown: Please reduce your request rate. VPQZDEX9K6R69M1EkZ65q9fOau4XFXEiNqyCaMAViPP0TDWFo6PYSQ6udIWYKH0z41XNE1IuOZTBSKY0RBxP1cOd/Fg=.
WARN [pool-2-thread-3] 10:03:14,058 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#1) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764494976.72/warc/CC-MAIN-20230127101040-20230127131040-00572.warc.gz. Will sleep 30 seconds. Message: bad status code: 503 ::
SlowDown: Please reduce your request rate. P9TA1QBQZ93ZQV7ZV2SClLNr9PzQ7Dvdmsvs0aDhxPzD8Bb3HXyYRX+NVR9GTwCMs6ts2ASIRPU5nRmgdxE9Dum+zSM=.

Shutdown fetching when throttling fails

There's backoff logic to sleep for set amounts of time after getting a 503 from AWS. We should end the entire fetch process if one of the threads carries out all of the throttling steps and still gets a 503.
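
A rough sketch of that idea, under stated assumptions: the 30, 120 and 600 second steps are taken from the log lines in the backoff issue above, and whether the real fetcher has further steps is unknown. BackoffSketch, ThrottleException and Sleeper are hypothetical names, not the project's BackoffHttpFetcher. The request is modelled as a Callable that returns an HTTP status code.

```java
import java.util.concurrent.Callable;

/** Thrown when every backoff step has been used and the server still answers 503. */
final class ThrottleException extends Exception {
    ThrottleException(String msg) {
        super(msg);
    }
}

/** Hypothetical sketch of a backoff loop that gives up after its last step. */
final class BackoffSketch {
    /** Abstraction over Thread.sleep so the loop can be exercised without waiting. */
    interface Sleeper {
        void sleepSeconds(long seconds) throws InterruptedException;
    }

    private final long[] backoffSeconds;
    private final Sleeper sleeper;

    BackoffSketch(long[] backoffSeconds, Sleeper sleeper) {
        this.backoffSeconds = backoffSeconds.clone();
        this.sleeper = sleeper;
    }

    /** Runs the request, sleeping and retrying on 503 until the steps run out. */
    int fetch(Callable<Integer> request) throws Exception {
        int status = request.call();
        for (long seconds : backoffSeconds) {
            if (status != 503) {
                return status;
            }
            sleeper.sleepSeconds(seconds);
            status = request.call();
        }
        if (status == 503) {
            // All steps are used up; signal the caller instead of sleeping again.
            throw new ThrottleException(
                    "still throttled after " + backoffSeconds.length + " backoff steps");
        }
        return status;
    }
}
```

The main loop could then treat a ThrottleException from any worker as fatal, for example by calling ExecutorService.shutdownNow() on the pool, so that the other threads stop sending requests as well.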

/tmp filling up

Using lots of threads can result in /tmp filling up very quickly, which can be a bad thing on some systems. 20 threads filled 16GB /tmp in just a few minutes.

Perhaps a setting to configure a different /tmp location for the commoncrawl indexes?

On a related note, for multiple runs pulling different kinds of files from Common Crawl, it might make sense to store the indexes somewhere outside of /tmp so as not to re-download them every time.
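
One way such a setting could look, as a sketch only: the property name ccfetcher.indexDir and the class IndexCacheDir are made up for illustration and are not options the tool is known to support. The fallback uses the standard java.io.tmpdir system property.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that picks the directory used to store downloaded index files. */
final class IndexCacheDir {
    /** Hypothetical property name, used here only for illustration. */
    static final String PROPERTY = "ccfetcher.indexDir";

    /** Returns the configured directory, or a subdirectory of the JVM temp dir, creating it if needed. */
    static Path resolve() throws IOException {
        String configured = System.getProperty(PROPERTY);
        Path dir = (configured != null && !configured.isBlank())
                ? Paths.get(configured)
                : Paths.get(System.getProperty("java.io.tmpdir"), "cc-index-cache");
        return Files.createDirectories(dir);
    }
}
```

If the indexes are currently written through the JDK's temp-file methods (an assumption), starting the JVM with -Djava.io.tmpdir pointing at a larger volume may already work as a stopgap, since those methods honor that property.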

3. Allow a different extracted file naming scheme

I vote for preserving the original file extension in the file naming scheme. When working out which formats still need signatures created, the original file extension is a good first pass for splitting the files into groups of similar files. Otherwise the names are just a bunch of numbers and letters.

This is particularly useful when pulling application/octet-stream files.
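
A sketch of what that naming could look like, assuming the current names are digest strings (a guess based on the "numbers and letters" remark). ExtractedFileNamer is a hypothetical class, and the limit of ten alphanumeric characters for an extension is an arbitrary choice.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper that appends the URL's file extension to a digest-based name. */
final class ExtractedFileNamer {
    /** Returns digest + "." + extension, or the bare digest if the URL has no usable extension. */
    static String name(String digest, String url) {
        String ext = extension(url);
        return ext.isEmpty() ? digest : digest + "." + ext;
    }

    /** Extracts a lowercased extension from the last path segment of the URL, or "" if none. */
    static String extension(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // query string and fragment are excluded
        } catch (IllegalArgumentException e) {
            return ""; // URL could not be parsed
        }
        if (path == null) {
            return "";
        }
        String last = path.substring(path.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        if (dot <= 0 || dot == last.length() - 1) {
            return "";
        }
        String ext = last.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Keep only short alphanumeric extensions so odd URLs cannot produce unsafe file names.
        return ext.matches("[a-z0-9]{1,10}") ? ext : "";
    }
}
```

Files whose URL carries no extension would keep the bare name, so existing tooling that expects those names would still find them.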

Add loggers for extracted files and for truncated

We currently have a specially built truncated URL writer. We can swap in a CSV logger and get rid of that. We can also add CSV loggers for extracted URLs and for fuller information on the truncated URLs.

Redownloading cdx files

Does:

java -jar commoncrawl-fetcher-lite-X.Y.Z.jar FetchIndices fetch-indices-config.json

attempt to re-download existing index files? With all of the problems CC is having these days it's almost impossible to get the index files downloaded. I've restarted a few times but the code seems to start with index 00000 again.

Does the code ignore existing cdx files when restarting?

Thanks.
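
This does not answer what the tool does today, but a resume check of the kind asked about could be sketched as follows. IndexResume is a hypothetical class, and the file names in the usage note only follow the cdx-0000N.gz pattern visible in the logs above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that works out which index files still need to be fetched. */
final class IndexResume {
    /** Returns the names that are missing or empty in dir, in their original order. */
    static List<String> remaining(Path dir, List<String> indexNames) throws IOException {
        List<String> todo = new ArrayList<>();
        for (String name : indexNames) {
            Path local = dir.resolve(name);
            if (!Files.isRegularFile(local) || Files.size(local) == 0) {
                todo.add(name);
            }
        }
        return todo;
    }
}
```

Calling IndexResume.remaining(dir, names) before a restart would skip cdx-00000.gz if it is already on disk. A non-empty file can still be a partial download, so a real implementation would probably write to a temporary name and rename the file only once the download completes.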
