tballison / commoncrawl-fetcher-lite

Simplified version of a common crawl fetcher

License: Apache License 2.0

Java 100.00%

commoncrawl-fetcher-lite's Issues

Allow users to download truncated bytes

Over on https://issues.apache.org/jira/browse/TIKA-3992, we'd like to extract whatever CC has downloaded whether or not the file has been truncated.

Although the primary workflow has been to a) extract intact files and b) list URLs for truncated files, I think we should add an alternate workflow that simply extracts whatever is in CC.

Won't download older crawls

It appears that crawls older than CC-MAIN-2017-22 can't be downloaded. I tried several and all died with the same error, see below.

INFO [pool-2-thread-4] 10:14:01,348 org.tallison.cc.index.extractor.CCFileExtractor Finished fetching 745,235,260 bytes in 88,630 ms for index gz: cc-index/collections/CC-MAIN-2017-04/indexes/cdx-00002.gz
ERROR [main] 10:14:01,451 org.tallison.cc.index.extractor.CCFileExtractor main loop exception
java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:105) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor.main(CCFileExtractor.java:69) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFetcherCli.main(CCFetcherCli.java:38) [commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.util.regex.Matcher.getTextLength(Matcher.java:1769) ~[?:?]
at java.util.regex.Matcher.reset(Matcher.java:415) ~[?:?]
at java.util.regex.Matcher.<init>(Matcher.java:252) ~[?:?]
at java.util.regex.Pattern.matcher(Pattern.java:1149) ~[?:?]
at org.tallison.cc.index.selector.RegexSelector.select(RegexSelector.java:38) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.selector.RecordSelector.select(RecordSelector.java:61) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractorRecordProcessor.process(CCFileExtractorRecordProcessor.java:75) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.processFile(CCFileExtractor.java:190) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:160) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:129) ~[commoncrawl-fetcher-lite-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1589) ~[?:?]
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:119)
at org.tallison.cc.index.extractor.CCFileExtractor.main(CCFileExtractor.java:69)
at org.tallison.cc.index.extractor.CCFetcherCli.main(CCFetcherCli.java:38)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.tallison.cc.index.extractor.CCFileExtractor.execute(CCFileExtractor.java:105)
... 2 more
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
at java.base/java.util.regex.Matcher.reset(Matcher.java:415)
at java.base/java.util.regex.Matcher.<init>(Matcher.java:252)
at java.base/java.util.regex.Pattern.matcher(Pattern.java:1149)
at org.tallison.cc.index.selector.RegexSelector.select(RegexSelector.java:38)
at org.tallison.cc.index.selector.RecordSelector.select(RecordSelector.java:61)
at org.tallison.cc.index.extractor.CCFileExtractorRecordProcessor.process(CCFileExtractorRecordProcessor.java:75)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.processFile(CCFileExtractor.java:190)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:160)
at org.tallison.cc.index.extractor.CCFileExtractor$IndexWorker.call(CCFileExtractor.java:129)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1589)
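The trace shows `Pattern.matcher` receiving a null value inside `RegexSelector.select`, which would happen if a record in an older index lacks the field being matched. As a minimal sketch of one possible guard (the class name `NullSafeRegexSelector` is hypothetical and is not the project's actual `RegexSelector`; treating a missing field as "not selected" is an assumption):

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-based selector; not the project's real class.
final class NullSafeRegexSelector {
    private final Pattern pattern;

    NullSafeRegexSelector(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns false for a missing value instead of passing null to Pattern.matcher. */
    boolean select(String value) {
        if (value == null) {
            // Assumption: a record without the field simply does not match.
            return false;
        }
        return pattern.matcher(value).find();
    }
}
```

Whether a missing field should count as a non-match or be handled some other way is a design decision for the project; the sketch only shows how the NullPointerException itself could be avoided.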

Add mime to logged files

Add checkstyle

Incorporate index urls into json file

Backoff messages

Getting lots of backoff messages which result in sleeps. How does the throttling setting in Fetcher affect this?

WARN [pool-2-thread-3] 10:01:13,678 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#2) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764501555.34/warc/CC-MAIN-20230209081052-20230209111052-00825.warc.gz. Will sleep 120 seconds. Message: bad status code: 503 ::
SlowDownPlease reduce your request rate.C6X3AAMQA0VP9EPCx7TfH5kVcwDGJGNw4rwLR7gptqe/Nwh2MpYEfxIOrWt3czhmG3YU7Oa8oAF7EPxPvZAOVssap9ZEK8hV9vT6rQ==.
WARN [pool-2-thread-2] 10:02:54,901 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#3) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764500641.25/warc/CC-MAIN-20230207201702-20230207231702-00340.warc.gz. Will sleep 600 seconds. Message: bad status code: 503 ::
SlowDownPlease reduce your request rate.VPQZDEX9K6R69M1EkZ65q9fOau4XFXEiNqyCaMAViPP0TDWFo6PYSQ6udIWYKH0z41XNE1IuOZTBSKY0RBxP1cOd/Fg=.
WARN [pool-2-thread-3] 10:03:14,058 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#1) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764494976.72/warc/CC-MAIN-20230127101040-20230127131040-00572.warc.gz. Will sleep 30 seconds. Message: bad status code: 503 ::
SlowDownPlease reduce your request rate.P9TA1QBQZ93ZQV7ZV2SClLNr9PzQ7Dvdmsvs0aDhxPzD8Bb3HXyYRX+NVR9GTwCMs6ts2ASIRPU5nRmgdxE9Dum+zSM=.

Shutdown fetching when throttling fails

There's backoff logic to sleep for certain amounts of time after getting a 503 from AWS. We should end the entire fetch process if one of the threads carries out all of the throttling steps and still gets a 503.
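A rough sketch of that behavior, assuming the sleep steps are the 30, 120 and 600 seconds seen in the backoff warnings logged in the previous issue. The class `BackoffSketch`, the exception type, and the injected `request` and `sleeper` parameters are all hypothetical and are not taken from `BackoffHttpFetcher`:

```java
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

final class BackoffSketch {
    // Assumed schedule, in seconds, based on the logged warnings (#1, #2, #3).
    static final long[] SLEEP_SECONDS = {30, 120, 600};

    /** Signals that every backoff step was used and the server still answered 503. */
    static final class ThrottleExhaustedException extends RuntimeException {
        ThrottleExhaustedException(String message) {
            super(message);
        }
    }

    /**
     * Issues the request, sleeping and retrying after each 503.
     * The request returns an HTTP status code; the sleeper receives seconds to wait.
     */
    static int fetchWithBackoff(IntSupplier request, LongConsumer sleeper) {
        int status = request.getAsInt();
        for (long seconds : SLEEP_SECONDS) {
            if (status != 503) {
                return status;
            }
            sleeper.accept(seconds);
            status = request.getAsInt();
        }
        if (status == 503) {
            throw new ThrottleExhaustedException(
                    "still 503 after " + SLEEP_SECONDS.length + " backoff steps");
        }
        return status;
    }
}
```

A caller that catches `ThrottleExhaustedException` from any worker could then stop the remaining workers, for example with `ExecutorService.shutdownNow()`, rather than letting each thread run through its own backoff sequence.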

/tmp filling up

Using lots of threads can result in /tmp filling up very quickly, which can be a bad thing on some systems. 20 threads filled 16GB /tmp in just a few minutes.

Perhaps a setting to configure a different /tmp location for the commoncrawl indexes?

On a related note, for multiple runs pulling different kinds of files from Common Crawl, it might make sense to store the indexes somewhere outside of /tmp so as not to re-download them every time.
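As a sketch of what such a setting might look like, the helper below creates the temporary index file in a configured directory when one is given and falls back to the JVM default otherwise. The class name, the `configuredDir` parameter, and the `cc-index-` prefix are hypothetical. Independently of any code change, the JVM's default temp directory can be redirected with the standard `java.io.tmpdir` system property (for example `-Djava.io.tmpdir=/data/tmp`).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class IndexTempFiles {
    /**
     * Creates a temp file for an index download.
     * Uses configuredDir when it is non-null, otherwise the JVM default temp directory.
     */
    static Path createTempIndexFile(Path configuredDir) throws IOException {
        if (configuredDir == null) {
            return Files.createTempFile("cc-index-", ".gz");
        }
        // Make sure the configured directory exists before writing into it.
        Files.createDirectories(configuredDir);
        return Files.createTempFile(configuredDir, "cc-index-", ".gz");
    }
}
```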

Allow writing extracted files to s3

Configure release

3. Allow a different extracted file naming scheme

Going to vote for preserving the original file extension in the file naming scheme. When working on formats that still need signatures created, having the original file extension is a good way to do a first pass at splitting the files into groups of similar files. Otherwise it's just a bunch of numbers and letters.

This is particularly useful when pulling application/octet-stream files.
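One way this could work is sketched below, assuming the current names are some identifier string (called `digest` here, which is an assumption) and that the extension is taken from the path of the original URL. The class and method names are hypothetical, and the 10-character, alphanumeric-only limit is an arbitrary safety choice:

```java
import java.net.URI;
import java.util.Locale;

final class ExtensionSketch {
    /** Returns the lowercased extension from the URL path, or "" if none is usable. */
    static String extensionOf(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return ""; // malformed URL: keep the name without an extension
        }
        if (path == null) {
            return "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        String ext = name.substring(dot + 1);
        // Only accept short alphanumeric extensions so odd URLs cannot produce odd file names.
        if (ext.length() > 10 || !ext.chars().allMatch(Character::isLetterOrDigit)) {
            return "";
        }
        return ext.toLowerCase(Locale.ROOT);
    }

    /** Appends the original extension, when there is one, to the existing identifier. */
    static String targetName(String digest, String url) {
        String ext = extensionOf(url);
        return ext.isEmpty() ? digest : digest + "." + ext;
    }
}
```

With this sketch, `targetName("abc123", "https://example.com/docs/report.PDF?x=1")` gives `abc123.pdf`, while a URL with no extension leaves the name unchanged.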

NPE if no fetcher is specified

NPE is thrown on minimal-config.json. This is, um, not good.

Add loggers for extracted files and for truncated

We currently have a purpose-built truncated URL writer. We can swap in a CSV logger and get rid of that. We can also add CSV loggers for extracted URLs and for fuller information on the truncated URLs.

Add maven enforcer to prevent version collisions

Redownloading cdx files

Does:

java -jar commoncrawl-fetcher-lite-X.Y.Z.jar FetchIndices fetch-indices-config.json

attempt to re-download existing index files? With all of the problems CC is having these days, it's almost impossible to get the index files downloaded. I've restarted a few times, but the code seems to start with index 00000 again.

Does the code ignore existing cdx files when restarting?

Thanks.
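For illustration only, a resumable fetch could check for a completed local copy before requesting each index file. The sketch below is not a description of what `FetchIndices` currently does; the class and method names are hypothetical, and it assumes downloads are written to a separate partial file and moved into place only when complete:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ResumeSketch {
    /** True when there is no usable local copy of the index file yet. */
    static boolean needsDownload(Path target) throws IOException {
        return !Files.isRegularFile(target) || Files.size(target) == 0L;
    }

    /** Moves a fully downloaded partial file to its final name. */
    static void completeDownload(Path partial, Path target) throws IOException {
        Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Writing to a partial file first means an interrupted download never leaves a truncated file under the final name, so the existence check stays trustworthy across restarts.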