tballison / commoncrawl-fetcher-lite
Simplified version of a common crawl fetcher
License: Apache License 2.0
Over on https://issues.apache.org/jira/browse/TIKA-3992, we'd like to extract whatever CC has downloaded whether or not the file has been truncated.
The primary workflow has been to a) extract intact files and b) list URLs for truncated ones, but I think we should add an alternate workflow that just extracts whatever is in CC.
It appears that crawls older than CC-MAIN-2017-22 can't be downloaded. I tried several and all died with the same error (see below).
INFO [pool-2-thread-4] 10:14:01,348 org.tallison.cc.index.extractor.CCFileExtractor Finished fetching 745,235,260 bytes in 88,630 ms for index gz: cc-index/collections/CC-MAIN-2017-04/indexes/cdx-00002.gz
ERROR [main] 10:14:01,451 org.tallison.cc.index.extractor.CCFileExtractor main loop exception
java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:105) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor.main(CCFileExtractor.java:69) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFetcherCli.main(CCFetcherCli.java:38) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.util.regex.Matcher.getTextLength(Matcher.java:1769) ~[?:?]
at java.util.regex.Matcher.reset(Matcher.java:415) ~[?:?]
at java.util.regex.Matcher.<init>(Matcher.java:252) ~[?:?]
at java.util.regex.Pattern.matcher(Pattern.java:1149) ~[?:?]
at org.tallison.cc.index.selector.RegexSelector.select(RegexSelector.java:38) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.selector.RecordSelector.select(RecordSelector.java:61) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractorRecordProcessor.process(CCFileExtractorRecordProcessor.java:75) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.processFile(CCFileExtractor.java:190) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:160) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:129) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1589) ~[?:?]
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:119)
at org.tallison.cc.index.extractor.CCFileExtractor.main(CCFileExtractor.java:69)
at org.tallison.cc.index.extractor.CCFetcherCli.main(CCFetcherCli.java:38)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:105)
... 2 more
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
at java.base/java.util.regex.Matcher.reset(Matcher.java:415)
at java.base/java.util.regex.Matcher.<init>(Matcher.java:252)
at java.base/java.util.regex.Pattern.matcher(Pattern.java:1149)
at org.tallison.cc.index.selector.RegexSelector.select(RegexSelector.java:38)
at org.tallison.cc.index.selector.RecordSelector.select(RecordSelector.java:61)
at org.tallison.cc.index.extractor.CCFileExtractorRecordProcessor.process(CCFileExtractorRecordProcessor.java:75)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.processFile(CCFileExtractor.java:190)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:160)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:129)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1589)
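The innermost frames show Pattern.matcher being handed a null string from RegexSelector.select. One possible explanation, not confirmed here, is that records in the older indexes lack a field the selector matches on. A minimal sketch of a null-tolerant check is below; NullSafeRegexSelector and its select method are hypothetical names for illustration, not the project's actual classes.

```java
import java.util.regex.Pattern;

/** Hypothetical null-tolerant selector; not the project's RegexSelector. */
final class NullSafeRegexSelector {
    private final Pattern pattern;

    NullSafeRegexSelector(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /**
     * Returns true if the pattern is found in the value.
     * A null value (e.g. a field absent from the index record) is treated as
     * "no match" rather than being passed to Pattern.matcher, which throws
     * a NullPointerException on null input.
     */
    boolean select(String value) {
        if (value == null) {
            return false;
        }
        return pattern.matcher(value).find();
    }
}
```

Whether a missing field should count as "no match" or be handled some other way is a design decision for the selector; the sketch only shows how to avoid the crash.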
Getting lots of backoff messages which result in sleeps. How does the throttling setting in Fetcher affect this?
WARN [pool-2-thread-3] 10:01:13,678 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#2) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764501555.34/warc/CC-MAIN-20230209081052-20230209111052-00825.warc.gz. Will sleep 120 seconds. Message: bad status code: 503 ::
SlowDown
Please reduce your request rate. C6X3AAMQA0VP9EPCx7TfH5kVcwDGJGNw4rwLR7gptqe/Nwh2MpYEfxIOrWt3czhmG3YU7Oa8oAF7EPxPvZAOVssap9ZEK8hV9vT6rQ==.
WARN [pool-2-thread-2] 10:02:54,901 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#3) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764500641.25/warc/CC-MAIN-20230207201702-20230207231702-00340.warc.gz. Will sleep 600 seconds. Message: bad status code: 503 ::
SlowDown
Please reduce your request rate. VPQZDEX9K6R69M1EkZ65q9fOau4XFXEiNqyCaMAViPP0TDWFo6PYSQ6udIWYKH0z41XNE1IuOZTBSKY0RBxP1cOd/Fg=.
WARN [pool-2-thread-3] 10:03:14,058 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#1) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764494976.72/warc/CC-MAIN-20230127101040-20230127131040-00572.warc.gz. Will sleep 30 seconds. Message: bad status code: 503 ::
SlowDown
Please reduce your request rate. P9TA1QBQZ93ZQV7ZV2SClLNr9PzQ7Dvdmsvs0aDhxPzD8Bb3HXyYRX+NVR9GTwCMs6ts2ASIRPU5nRmgdxE9Dum+zSM=.
There's backoff logic to sleep for certain amounts of time after getting a 503 from AWS. We should end the entire fetch process if one of the threads carries out all of the throttling steps and still gets a 503.
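A rough sketch of that "give up after the last step" behaviour, assuming a schedule like the 30/120/600 second sleeps seen in the warnings above. BackoffRetry, ThrottleException, BackoffExhaustedException, Task and Sleeper are all hypothetical names made up for this example; none of this is the project's BackoffHttpFetcher.

```java
/** Hypothetical signal that the server answered with a throttling status (e.g. 503). */
final class ThrottleException extends Exception {
    ThrottleException(String message) {
        super(message);
    }
}

/** Hypothetical signal that every backoff step was used and the call is still throttled. */
final class BackoffExhaustedException extends RuntimeException {
    BackoffExhaustedException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class BackoffRetry {
    /** Sleep is injected so the schedule can be exercised without really waiting. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    interface Task<T> {
        T call() throws ThrottleException;
    }

    /**
     * Calls the task, sleeping backoffSeconds[i] after the i-th throttled attempt.
     * If the task is still throttled after the last sleep, throws
     * BackoffExhaustedException instead of retrying forever.
     */
    static <T> T run(Task<T> task, long[] backoffSeconds, Sleeper sleeper)
            throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                return task.call();
            } catch (ThrottleException e) {
                if (attempt >= backoffSeconds.length) {
                    throw new BackoffExhaustedException(
                            "still throttled after " + backoffSeconds.length
                                    + " backoff steps", e);
                }
                sleeper.sleep(backoffSeconds[attempt] * 1000L);
            }
        }
    }
}
```

In real use the sleeper would be Thread::sleep. If a worker lets BackoffExhaustedException escape, the main loop would see it wrapped in an ExecutionException from Future.get and could then call shutdownNow() on the executor to stop the other threads; that wiring is an assumption about how the main loop is structured.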
Using lots of threads can result in /tmp filling up very quickly, which can be a bad thing on some systems. 20 threads filled 16GB /tmp in just a few minutes.
Perhaps a setting to configure a different /tmp location for the commoncrawl indexes?
On a related note, for multiple runs pulling different kinds of files from Common Crawl, it might make sense to store the indexes somewhere outside of /tmp so as not to re-download them every time.
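Something along these lines might cover both points: a configurable directory for the indexes, plus a check that skips anything already downloaded. This is only a sketch. IndexCache and its methods are hypothetical, and the idea of flattening the index key into a file name is an assumption, not how the fetcher currently stores files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: keeps index files in a configurable directory instead of /tmp. */
final class IndexCache {
    private final Path cacheDir;

    IndexCache(Path cacheDir) throws IOException {
        // createDirectories does nothing if the directory already exists.
        this.cacheDir = Files.createDirectories(cacheDir);
    }

    /** Maps a remote key such as ".../indexes/cdx-00002.gz" to one stable local file. */
    Path localPathFor(String indexKey) {
        return cacheDir.resolve(indexKey.replace('/', '_'));
    }

    /** True if a completed download for this key is already on disk. */
    boolean isCached(String indexKey) {
        return Files.isRegularFile(localPathFor(indexKey));
    }

    /** Downloads are written here first so a partial file is never mistaken for a cached one. */
    Path newPartialFile() throws IOException {
        return Files.createTempFile(cacheDir, "partial-", ".tmp");
    }

    /** Moves a finished download into its final cached location. */
    Path commit(Path partial, String indexKey) throws IOException {
        return Files.move(partial, localPathFor(indexKey), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A caller would check isCached(key) before fetching, write the download to newPartialFile(), and call commit only once the download has finished. An interrupted run then leaves only "partial-" files behind, which can be deleted safely.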
Going to vote for preserving the original file extension in the file naming scheme. When working on formats that still need signatures created, having the original file extension is a good first pass at splitting the files up into groups of similar files. Otherwise it's just a bunch of numbers and letters.
This is particularly useful when pulling application/octet-stream files.
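As a rough illustration of what that could look like, the sketch below pulls a lowercase extension out of the source URL and appends it to whatever base name the extractor already uses. ExtensionNamer, extensionOf and targetName are hypothetical names, and the assumption that the base name is a digest-like string is mine.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for keeping the original extension in the output file name. */
final class ExtensionNamer {

    /** Returns the lowercase extension from the URL path, or "" if none can be found. */
    static String extensionOf(String url) {
        String path;
        try {
            // getPath() drops the query string and fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return ""; // not parseable as a URI
        }
        if (path == null) {
            return "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return ""; // no dot, leading dot only, or trailing dot
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Keep only short alphanumeric extensions so odd URLs cannot produce odd file names.
        return ext.matches("[a-z0-9]{1,10}") ? ext : "";
    }

    /** Appends the URL's extension, if any, to the base name (assumed here to be a digest). */
    static String targetName(String baseName, String url) {
        String ext = extensionOf(url);
        return ext.isEmpty() ? baseName : baseName + "." + ext;
    }
}
```

For example, targetName("ABC123", "https://example.com/a/b/data.bin") gives "ABC123.bin", while a URL with no extension leaves the base name unchanged.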
NPE is thrown on minimal-config.json. This is, um, not good.
We currently have a specially built truncated-URL writer. We can swap in a CSV logger and get rid of that. We can also add CSV loggers for extracted URLs and for fuller information on the truncated URLs.
Does:
java -jar commoncrawl-fetcher-lite-X.Y.Z.jar FetchIndices fetch-indices-config.json
attempt to re-download existing index files? With all of the problems CC is having these days, it's almost impossible to get the index files downloaded. I've restarted a few times, but the code seems to start with index 00000 again.
Does the code ignore existing cdx files when restarting?
Thanks.