archivetools's People

Contributors

ppival, recrm

archivetools's Issues

DeprecationWarning on import of MutableMapping

warc-extractor.py:38: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping

Change line 38 of warc-extractor.py from:

from collections import MutableMapping

to

from collections.abc import MutableMapping
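A version-tolerant form of the same import could look like the sketch below. It tries `collections.abc` first and falls back to `collections` only on very old interpreters that lack it. The `HeaderDict` class is hypothetical and exists only to show the imported base class in use; it is not the class from warc-extractor.py.

```python
try:
    # Location of the ABCs on every currently supported Python 3 release.
    from collections.abc import MutableMapping
except ImportError:
    # Fallback for very old interpreters without collections.abc.
    from collections import MutableMapping


class HeaderDict(MutableMapping):
    """Hypothetical case-insensitive mapping, only to exercise the import."""

    def __init__(self):
        self._d = {}

    def __getitem__(self, name):
        return self._d[name.lower()]

    def __setitem__(self, name, value):
        self._d[name.lower()] = value

    def __delitem__(self, name):
        del self._d[name.lower()]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)
```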

Unknown mime type - Questions

Hi, I get the following message when extracting a dump:

Count of unknown mime type.
{'': 17,
'application/binary': 7,
'application/font-woff2': 6,
'application/x-font-ttf': 6,
'application/x-font-woff': 1,
'application/x-javascript': 3458,
'binary/octet-stream': 17,
'font/truetype': 7,
'font/ttf': 1,
'font/woff': 2,
'font/woff2': 156,
'font/x-woff': 7,
'image/x-icon': 9,
'text/javascript': 34}

  1. Can I be sure all files inside the webcapture are being extracted? (like when you extract a zip file, for example)
  2. This is the only script I have found that can perform full dumps. Do you know of any others? (My only worry, related to the previous question, is that not all files were extracted.)

Best Regards, forgive my ignorance

warc-extractor.py fails -- "list index out of range"

$ warc-extractor.py
parsing 195.242.99.71-8181-2016-03-23-3324e7c6-00000.warc
Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 200, in __getitem__
    return super().__getitem__(name)
  File "/home/username/bin/warc-extractor.py", line 83, in __getitem__
    return self._d[name.lower()]
KeyError: 'content_type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/bin/warc-extractor.py", line 828, in <module>
    parse(args)
  File "/home/username/bin/warc-extractor.py", line 713, in parse
    inc(record.http, "content_type", "http-content")
  File "/home/username/bin/warc-extractor.py", line 654, in inc
    obj = obj[header]
  File "/home/username/bin/warc-extractor.py", line 204, in __getitem__
    return self.content.type
  File "/home/username/bin/warc-extractor.py", line 230, in content
    self._content = ContentType(string)
  File "/home/username/bin/warc-extractor.py", line 267, in __init__
    data[test[0]] = test[1]
IndexError: list index out of range

WARC file is from https://archive.org/details/warc-195-242-99-71-8181

Other utilities are able to extract at least some data from it.

If there's a bad spot in the file (I'm not sure whether there is one), could there be an option to skip over it and continue processing?
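The last frame of the traceback (`data[test[0]] = test[1]`) suggests that a Content-Type parameter was split on a separator it did not contain. That is an assumption, since the offending header is not shown. A minimal sketch of a more tolerant parser, which skips malformed parameters instead of raising, might look like this. The function name `parse_content_type` is hypothetical and is not part of warc-extractor.py.

```python
def parse_content_type(value):
    """Split a Content-Type header value into (mime_type, params).

    Parameters that lack an '=' (for example a trailing ';' or a bare
    'charset') are skipped rather than raising IndexError.
    """
    parts = [p.strip() for p in value.split(";")]
    mime_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        key, sep, val = part.partition("=")
        if not sep or not key.strip():
            continue  # malformed parameter: ignore it and keep going
        params[key.strip().lower()] = val.strip().strip('"')
    return mime_type, params
```

With this approach a record with a damaged header still yields a usable MIME type, so processing can continue with the next record.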

The Python scripts are not importable as modules

I am developing a small tool that detects and classifies objects in images extracted from WARC archives. I use some functionality from warc-extractor.py in my Python code. In order to do that, I have to rename 'warc-extractor.py' to 'warc_extractor.py', since the dash is not a valid character in a module name.
I propose changing the names of the Python scripts, replacing '-' with '_'. I can submit a pull request if you like.
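Until the files are renamed, one workaround is to load the script by path with the standard `importlib` machinery. The sketch below assumes the script can be executed safely at import time. The helper name `load_script` is hypothetical.

```python
import importlib.util
import sys


def load_script(path, module_name):
    """Import a Python file whose filename is not a valid module name."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load {!r}".format(path))
    module = importlib.util.module_from_spec(spec)
    # Register before executing so the module can be found by name.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

For example, `warc_extractor = load_script("warc-extractor.py", "warc_extractor")` would bind the module to a valid name. Any top-level code in the script that is not behind an `if __name__ == "__main__":` guard will run during the load.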

DeprecationWarning

Hello,

and thank you for this script. I'm a complete beginner with Python, but somehow I managed to finally get it set up on Windows. However, all I get when trying to run the warc-extractor is the following message:
It doesn't matter what command I put in, or what WARC file I use, the result is the same. It's probably on my end, though the other ones seem to run fine:
Your help would be greatly appreciated. Thank you

Advanced filters?

The warc extractor works like a charm on Windows when dumping all of the content...
I was wondering if you could provide some more info on the usage of filters?

For example, could we filter out sub-folders inside a WARC archive based on folder name(s)? If so, how would I do that?

I also have a lot of index files in an archive, numbered like this: index.html, index(2).html, index(3).html etc.
Would it be possible to only dump the first one in each folder (so, only the first index.html file) instead of all numbered index files as mentioned above?

Thanks in advance. :)

Paths in Windows.

There is a known bug where warc-extractor.py does not handle Windows paths properly.

Windows places far more restrictions on valid path names than Linux does. Unfortunately, handling every unusual path name that WARC scrapers can pull in is a big job. Thankfully, most of these requests are 404 errors. Until a proper fix is implemented, it is recommended to run the -dump content command along with the http:error:200 filter. If path errors still persist on Windows, also use the -error flag to skip unworkable path names.
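To illustrate the kind of restrictions involved, here is a minimal sketch of cleaning a single path component so that Windows will accept it. It is not the fix used by warc-extractor.py. The function name `sanitize_component` and the choice of `_` as the replacement character are assumptions. It covers illegal characters, trailing dots and spaces, and reserved device names, but not path-length limits.

```python
import re

# Device names that Windows reserves, with or without an extension.
_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    "{}{}".format(prefix, n) for prefix in ("COM", "LPT") for n in range(1, 10)
}
# Characters Windows does not allow in a file or directory name.
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_component(name, replacement="_"):
    """Return a version of one path component that is valid on Windows."""
    cleaned = _ILLEGAL.sub(replacement, name)
    cleaned = cleaned.rstrip(" .")  # trailing dots and spaces are not allowed
    if not cleaned:
        return replacement
    stem = cleaned.split(".", 1)[0]
    if stem.upper() in _RESERVED:
        cleaned = replacement + cleaned
    return cleaned
```

URL query strings are a common source of trouble: a captured name such as `index.html?page=2` contains `?`, which this sketch rewrites to `index.html_page=2`.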
