recrm / archivetools
A collection of tools for archiving and analysing the internet.
License: GNU General Public License v3.0
warc-extractor.py:38: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping
Fix: change line 38 of warc-extractor.py from
from collections import MutableMapping
to
from collections.abc import MutableMapping
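If the script still has to run on very old interpreters as well, a guarded import is one option. This is only a sketch: the fallback branch and the closing check are illustrative and not taken from warc-extractor.py. Note that the old aliases in `collections` were eventually removed in Python 3.10, so the plain `collections.abc` import is the one that matters today.

```python
try:
    # Modern location of the abstract base classes (Python 3.3 and later).
    from collections.abc import MutableMapping
except ImportError:
    # Legacy location, only reached on interpreters older than 3.3.
    from collections import MutableMapping

# Quick sanity check: the built-in dict is registered as a MutableMapping.
is_mapping = issubclass(dict, MutableMapping)
```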
Hi, I get the following message when extracting a dump:
Count of unknown mime type.
{'': 17,
'application/binary': 7,
'application/font-woff2': 6,
'application/x-font-ttf': 6,
'application/x-font-woff': 1,
'application/x-javascript': 3458,
'binary/octet-stream': 17,
'font/truetype': 7,
'font/ttf': 1,
'font/woff': 2,
'font/woff2': 156,
'font/x-woff': 7,
'image/x-icon': 9,
'text/javascript': 34}
Best Regards, forgive my ignorance
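Assuming the message only means that the extractor had no file extension on record for these MIME types (the issue does not confirm this), one way to handle them outside the tool is a small lookup table. The names `EXTRA_EXTENSIONS` and `extension_for` are hypothetical and not part of warc-extractor.py; the MIME types are the ones from the report above.

```python
import mimetypes

# Hypothetical table: MIME types from the report mapped to common extensions.
EXTRA_EXTENSIONS = {
    "application/x-javascript": ".js",
    "text/javascript": ".js",
    "application/x-font-ttf": ".ttf",
    "font/ttf": ".ttf",
    "font/truetype": ".ttf",
    "application/font-woff2": ".woff2",
    "font/woff2": ".woff2",
    "application/x-font-woff": ".woff",
    "font/woff": ".woff",
    "font/x-woff": ".woff",
    "image/x-icon": ".ico",
}


def extension_for(mime_type, default=".bin"):
    """Return a file extension for a MIME type, or `default` if none is known."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    key = mime_type.split(";")[0].strip().lower()
    if key in EXTRA_EXTENSIONS:
        return EXTRA_EXTENSIONS[key]
    # Fall back to the standard library table, then to the default.
    return mimetypes.guess_extension(key) or default
```

Types such as 'application/binary' and 'binary/octet-stream' carry no real format information, so in this sketch they simply fall through to the default extension.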
Please add a pyproject.toml and publish everything to PyPI for pipx compatibility,
and add a console scripts entry point. If you're a poetry user, use these instructions.
$ warc-extractor.py
parsing 195.242.99.71-8181-2016-03-23-3324e7c6-00000.warc
Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 200, in __getitem__
    return super().__getitem__(name)
  File "/home/username/bin/warc-extractor.py", line 83, in __getitem__
    return self._d[name.lower()]
KeyError: 'content_type'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 828, in <module>
    parse(args)
  File "/home/username/bin/warc-extractor.py", line 713, in parse
    inc(record.http, "content_type", "http-content")
  File "/home/username/bin/warc-extractor.py", line 654, in inc
    obj = obj[header]
  File "/home/username/bin/warc-extractor.py", line 204, in __getitem__
    return self.content.type
  File "/home/username/bin/warc-extractor.py", line 230, in content
    self._content = ContentType(string)
  File "/home/username/bin/warc-extractor.py", line 267, in __init__
    data[test[0]] = test[1]
IndexError: list index out of range
The WARC file is from https://archive.org/details/warc-195-242-99-71-8181
Other utilities are able to extract at least some data from it.
If there is a bad spot in the file (I'm not sure whether there is), could there be an option to skip over it and continue processing?
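The last frame suggests that a Content-Type parameter without an `=` sign was split and then indexed as if it had two parts. The real `ContentType` class is not shown here, so the following is only a sketch of a more tolerant parser; `parse_content_type` is a hypothetical helper, not the script's code.

```python
def parse_content_type(value):
    """Split a Content-Type header value into (mime_type, parameters)."""
    parts = [part.strip() for part in value.split(";")]
    mime_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate trailing or doubled semicolons
        key, sep, val = part.partition("=")
        if not sep:
            continue  # no "=" present, so this is not a key=value pair
        params[key.strip().lower()] = val.strip().strip('"')
    return mime_type, params
```

Using `str.partition` instead of `split` plus indexing means a malformed parameter is skipped rather than raising IndexError, so one bad header does not abort the whole run.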
I am developing a small tool that detects and classifies objects in images extracted from WARC archives. I use some functionality from warc-extractor.py in my Python code. In order to do that, I have to rename 'warc-extractor.py' to 'warc_extractor.py', since the dash is not a valid character in a module name.
I propose to change the names of the Python scripts, replacing '-' by '_'. I can submit a pull request if you like?
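Until the files are renamed, a script with a dash in its name can still be loaded through importlib. A minimal sketch, where `load_script_as_module` is a hypothetical helper; it assumes the script's command-line logic is guarded by `if __name__ == "__main__":`, because loading a file this way executes all of its top-level code.

```python
import importlib.util


def load_script_as_module(path, module_name):
    """Load a Python file as a module, even if its file name is not a valid identifier."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    module = importlib.util.module_from_spec(spec)
    # Runs the file's top-level code inside the new module object.
    spec.loader.exec_module(module)
    return module
```

Usage would look like `warc_extractor = load_script_as_module("warc-extractor.py", "warc_extractor")`, after which the script's classes and functions are available as attributes of `warc_extractor`.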
Hello,
and thank you for this script. I'm a complete beginner with Python, but somehow I managed to finally get it set up on Windows. However, all I get when trying to run the warc-extractor is the following message:
It doesn't matter what command I put in, or what WARC file I use, the result is the same. It's probably on my end, though the other ones seem to run fine:
Your help would be greatly appreciated. Thank you
The warc extractor works like a charm on Windows when dumping all of the content...
I was wondering if you could provide some more info on the usage of filters?
For example... could we filter out sub-folders inside a warc archive based on folder name(s)? If so, how would I do that?
I also have a lot of index files in an archive, numbered like this: index.html, index(2).html, index(3).html etc.
Would it be possible to only dump the first one in each folder (so, only the first index.html file) instead of all numbered index files as mentioned above?
Thanks in advance. :)
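For the numbered index files, one workaround that does not depend on the extractor's filter syntax is to clean up after the dump. The sketch below is a post-processing step, not a feature of warc-extractor.py; `numbered_copies` is a hypothetical helper that only reports files such as 'index(2).html' when the un-numbered original sits in the same folder.

```python
import re
from pathlib import Path

# Matches names like "index(2).html": a stem, a number in parentheses, an optional extension.
NUMBERED = re.compile(r"^(?P<stem>.+)\((?P<n>\d+)\)(?P<ext>\.[^.]+)?$")


def numbered_copies(root):
    """Yield numbered files whose un-numbered original exists in the same folder."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        match = NUMBERED.match(path.name)
        if match is None:
            continue
        original = path.with_name(match.group("stem") + (match.group("ext") or ""))
        if original.exists():
            yield path
```

Printing the results first and deleting only after checking the list is the safer way to use this, since the function itself never removes anything.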
There is a known bug where warc-extractor.py does not handle windows paths properly.
Windows has far more restrictions on what counts as a valid path than Linux does. Unfortunately, dealing with every unusual path name that WARC scrapers can pull in is a big job. Thankfully, most of these requests are 404 errors. Until a proper fix is implemented, it is recommended to run the -dump content command along with the http:error:200 filter. If path errors still persist on Windows, also use the -error flag to skip unworkable path names.
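To illustrate the kind of restrictions involved, here is a sketch of sanitising a single path component for Windows. It is not the fix used by warc-extractor.py, and `sanitize_component` is a hypothetical helper; it covers the reserved characters, the reserved device names, and trailing dots or spaces, but not path length limits.

```python
import re

# Characters Windows does not allow in a file or folder name, plus control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Device names that Windows reserves regardless of extension.
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)


def sanitize_component(name, replacement="_"):
    """Return a version of `name` that is usable as one Windows path component."""
    cleaned = _INVALID.sub(replacement, name)
    cleaned = cleaned.rstrip(" .")  # Windows does not keep trailing dots or spaces
    if not cleaned:
        cleaned = replacement
    if cleaned.split(".")[0].upper() in _RESERVED:
        cleaned = replacement + cleaned
    return cleaned
```

Replacing characters can make two different URLs map to the same file name, so a real fix would also need some way to keep such names distinct.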