recrm / archivetools


68.0 68.0 15.0 53 KB

A collection of tools for archiving and analysing the internet.

License: GNU General Public License v3.0

Python 100.00%

archivetools's Issues

DeprecationWarning on import of MutableMapping

warc-extractor.py:38: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping

To fix this, change line 38 of warc-extractor.py from:

from collections import MutableMapping

to:

from collections.abc import MutableMapping
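If the script also needs to run on interpreters older than Python 3.3, where `collections.abc` does not exist, a guarded import is one option. This is a minimal sketch, not code taken from the repository:

```python
try:
    # Location of the abstract base classes since Python 3.3
    from collections.abc import MutableMapping
except ImportError:
    # Fallback for older interpreters only
    from collections import MutableMapping

# dict is registered as a MutableMapping, so this prints True
print(issubclass(dict, MutableMapping))
```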

Unknown mime type - Questions

Hi, I get the following message when extracting a dump:

Count of unknown mime type.
{'': 17,
'application/binary': 7,
'application/font-woff2': 6,
'application/x-font-ttf': 6,
'application/x-font-woff': 1,
'application/x-javascript': 3458,
'binary/octet-stream': 17,
'font/truetype': 7,
'font/ttf': 1,
'font/woff': 2,
'font/woff2': 156,
'font/x-woff': 7,
'image/x-icon': 9,
'text/javascript': 34}

Can I be sure all files inside the webcapture are being extracted? (like when you extract a zip file, for example)
This is the only script I have found that can perform full dumps. Do you know of any others? (My only worry, related to the previous question, is that not all files were extracted.)

Best Regards, forgive my ignorance

Missing pyproject.toml and PyPI release for pipx compatibility

Please add a pyproject.toml with a console scripts entry point and publish the project to PyPI so that it can be installed with pipx. If you're a poetry user, use these instructions.

warc-extractor.py fails -- "list index out of range"

$ warc-extractor.py
parsing 195.242.99.71-8181-2016-03-23-3324e7c6-00000.warc
Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 200, in __getitem__
    return super().__getitem__(name)
  File "/home/username/bin/warc-extractor.py", line 83, in __getitem__
    return self._d[name.lower()]
KeyError: 'content_type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 828, in <module>
    parse(args)
  File "/home/username/bin/warc-extractor.py", line 713, in parse
    inc(record.http, "content_type", "http-content")
  File "/home/username/bin/warc-extractor.py", line 654, in inc
    obj = obj[header]
  File "/home/username/bin/warc-extractor.py", line 204, in __getitem__
    return self.content.type
  File "/home/username/bin/warc-extractor.py", line 230, in content
    self._content = ContentType(string)
  File "/home/username/bin/warc-extractor.py", line 267, in __init__
    data[test[0]] = test[1]
IndexError: list index out of range

WARC file is from https://archive.org/details/warc-195-242-99-71-8181

other utilities are able to extract at least some data from it

If there's a bad spot in the file (which I'm not sure if there is or not), can there be an option to skip over it and continue processing?
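The traceback only shows that `test[1]` was missing at line 267 while a `ContentType` was being built; the exact header value in this WARC file is not known. One plausible cause is a Content-Type parameter without an `=` sign. The sketch below shows a tolerant way to split such a header. The function name `parse_content_type` is hypothetical and this is not the script's own parser:

```python
def parse_content_type(header):
    """Split a Content-Type header into (mime_type, params) without raising."""
    parts = [p.strip() for p in header.split(";")]
    mime_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # trailing or doubled semicolon
        key, sep, value = part.partition("=")
        # A parameter without "=" is kept with an empty value instead of failing
        params[key.strip().lower()] = value.strip().strip('"') if sep else ""
    return mime_type, params


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("text/html; charset"))
```

Using `str.partition` instead of indexing into the result of `split("=")` means a malformed parameter can never raise an `IndexError`.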

The Python scripts are not importable as modules

I am developing a small tool that detects and classifies objects in images extracted from WARC archives. I use some functionality from warc-extractor.py in my Python code. In order to do that, I have to rename 'warc-extractor.py' to 'warc_extractor.py', since the dash is not a valid character in a module name.
I propose changing the names of the Python scripts, replacing '-' with '_'. I can submit a pull request if you like.
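Until the files are renamed, a script with a dash in its name can still be loaded through `importlib`. The sketch below assumes the script is safe to execute on load, meaning its command-line entry point is guarded by `if __name__ == "__main__":`; whether warc-extractor.py does this is not confirmed here. The helper name `load_script_as_module` is hypothetical:

```python
import importlib.util
import sys


def load_script_as_module(path, module_name):
    """Load a Python file as a module, even if its file name is not a valid identifier."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load {!r}".format(path))
    module = importlib.util.module_from_spec(spec)
    # Register before executing so code inside the module can look itself up
    sys.modules[module_name] = module
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

Usage would look like `warc_extractor = load_script_as_module("warc-extractor.py", "warc_extractor")`, with the path adjusted to wherever the script lives.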

DeprecationWarning

Hello,

and thank you for this script. I'm a complete beginner with Python, but somehow I managed to finally get it set up on Windows. However, all I get when trying to run the warc-extractor is the following message:

It doesn't matter what command I put in, or what WARC file I use, the result is the same. It's probably on my end, though the other ones seem to run fine:

Your help would be greatly appreciated. Thank you

Advanced filters?

The warc extractor works like a charm on Windows when dumping all of the content...
I was wondering if you could provide some more info on the usage of filters?

For example... could we filter out sub-folders inside a warc archive based on folder name(s)? If so, how would I do that?

I also have a lot of index files in an archive, numbered like this: index.html, index(2).html, index(3).html etc.
Would it be possible to only dump the first one in each folder (so, only the first index.html file) instead of all numbered index files as mentioned above?

Thanks in advance. :)
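Whether the script's own filters can express this is not documented in this thread. As a workaround, the numbered copies could be filtered out after dumping. The sketch below works on plain file names and assumes the copies always follow the `name(N).ext` pattern described above; the helper names are hypothetical:

```python
import re

# Matches names such as "index(2).html" or "page(13)" (extension optional)
_NUMBERED = re.compile(r"^(?P<stem>.+)\((?P<n>\d+)\)(?P<ext>\.[^.]+)?$")


def is_numbered_copy(filename):
    """Return True if the name looks like a numbered duplicate."""
    return _NUMBERED.match(filename) is not None


def keep_first_copies(filenames):
    """Drop numbered duplicates and keep everything else in order."""
    return [name for name in filenames if not is_numbered_copy(name)]


names = ["index.html", "index(2).html", "index(3).html", "style.css"]
print(keep_first_copies(names))
```

Note that this would also drop a legitimate file whose real name happens to end in a parenthesised number.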

Paths in Windows

There is a known bug where warc-extractor.py does not handle Windows paths properly.

Windows places far more restrictions on what counts as a valid path than Linux does. Unfortunately, dealing with all of the unusual path names that WARC scrapers can pull in is a big job. Thankfully, most of these requests are 404 errors. Until a proper fix is implemented, it is recommended to run the -dump content command along with the http:error:200 filter. If path errors still persist on Windows, also use the -error flag to skip unworkable path names.
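To illustrate the kind of restrictions involved, here is a minimal sketch that cleans a single path component for Windows. It is not the project's fix and the function name `sanitize_windows_component` is hypothetical. It covers the characters Windows forbids in file names, the reserved device names, and trailing dots or spaces, but not the overall path length limit:

```python
import re

# Characters Windows does not allow in a file or directory name
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Device names that are reserved regardless of extension
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM{}".format(i) for i in range(1, 10)}
    | {"LPT{}".format(i) for i in range(1, 10)}
)


def sanitize_windows_component(name, replacement="_"):
    """Return a version of one path component that Windows will accept."""
    cleaned = _ILLEGAL.sub(replacement, name)
    cleaned = cleaned.rstrip(" .")  # trailing dots and spaces are not allowed
    if not cleaned:
        return replacement
    stem = cleaned.split(".", 1)[0]
    if stem.upper() in _RESERVED:
        cleaned = replacement + cleaned
    return cleaned


print(sanitize_windows_component("index.php?id=1"))
print(sanitize_windows_component("con.txt"))
```

Replacing characters can make two different URLs map to the same file name, so a real fix would also need a strategy for collisions.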
