
warc2html's People

Contributors

anthmn, ato

warc2html's Issues

Option to skip on error

This is a great utility! It would be even greater if you offered a flag that allowed the extraction to skip specific records if they throw an error.

In my case, I have a WARC archive that contains some really long URLs, and during extraction it gets to the point that it throws this error, then stops:

Exception in thread "main" java.nio.file.FileSystemException: test-extract/maps.google.com/index;ll=44.969598%2C-93.247374&spn=0.007658%2C0.03006&ie=UTF8&hl=en_US&z=15&t=roadmap&sll=44.969598%2C-93.247374&sspn=0.007658%2C0.03006&q=414%20Cedar%20Ave%2C%20Minneapolis%2C%20MN%2055454%2C%20USA%20%28Malabari%20Kitchen%20Restaurant%29&output=embed.html: File name too long
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261)
	at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:482)
	at java.base/java.nio.file.Files.newOutputStream(Files.java:227)
	at org.netpreserve.warc2html.Warc2Html.writeTo(Warc2Html.java:227)
	at org.netpreserve.warc2html.Warc2Html.main(Warc2Html.java:70)

I'm OK with the extraction process skipping this record and proceeding to the next, if that is possible, though it would be good to get output of what records were skipped at the end. As it is now, I've got most of the site that I wanted to pull out of the archive, but I'm missing some of the CSS and JS files used to display it because they presumably occur later in the archive.
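The behaviour requested above could be sketched roughly as follows. This is only an illustration of "skip on error, report at the end": `SkipOnErrorSketch` and `writeAll` are hypothetical names, not part of warc2html, and the records are modelled as a plain map from relative path to bytes rather than real WARC records.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not warc2html's actual code or API.
class SkipOnErrorSketch {

    /** Writes each entry under outDir and returns the relative paths that could not be written. */
    static List<String> writeAll(Map<String, byte[]> records, Path outDir) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : records.entrySet()) {
            Path target = outDir.resolve(entry.getKey());
            try {
                Files.createDirectories(target.getParent());
                Files.write(target, entry.getValue());
            } catch (IOException ex) {
                // Covers FileSystemException ("File name too long") among others:
                // remember the record and carry on with the next one.
                skipped.add(entry.getKey());
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        Path outDir = Files.createTempDirectory("skip-sketch");
        Map<String, byte[]> records = new LinkedHashMap<>();
        records.put("index.html", "<html></html>".getBytes());
        // A name longer than most filesystems accept for a single path component.
        records.put("x".repeat(300) + ".html", new byte[0]);
        records.put("style.css", "body{}".getBytes());

        List<String> skipped = writeAll(records, outDir);
        System.out.println("Skipped " + skipped.size() + " record(s):");
        skipped.forEach(name -> System.out.println("  " + name));
    }
}
```

Catching `IOException` per record, instead of letting it propagate out of the loop, is what lets later CSS and JS records still be written.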

I'm running warc2html on macOS (Sonoma), which may affect the maximum filename length in this case.
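An alternative to skipping such records would be to shorten over-long file names before writing them. The sketch below is one possible approach, not something warc2html is known to do: `FileNameShortener` is a hypothetical helper, and the 255-byte limit per path component is an assumption that holds on many filesystems but should be checked for the target system.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of warc2html.
class FileNameShortener {
    // Assumed limit for one path component, in UTF-8 bytes.
    static final int MAX_BYTES = 255;

    /** Returns name unchanged if it fits, else a truncated stem plus "~<hash><ext>". */
    static String shorten(String name) {
        if (utf8Length(name) <= MAX_BYTES) {
            return name;
        }
        int dot = name.lastIndexOf('.');
        // Keep a short extension such as ".html" so the file type survives.
        String ext = (dot > 0 && name.length() - dot <= 10) ? name.substring(dot) : "";
        String stem = ext.isEmpty() ? name : name.substring(0, dot);
        // Hash of the full original name keeps distinct long names distinct.
        String suffix = "~" + sha256Hex(name).substring(0, 12) + ext;
        int budget = MAX_BYTES - utf8Length(suffix);

        StringBuilder out = new StringBuilder();
        int used = 0;
        for (int i = 0; i < stem.length(); ) {
            int cp = stem.codePointAt(i);
            int len = utf8Length(new String(Character.toChars(cp)));
            if (used + len > budget) {
                break; // never cut a multi-byte character in half
            }
            out.appendCodePoint(cp);
            used += len;
            i += Character.charCount(cp);
        }
        return out + suffix;
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static String sha256Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    public static void main(String[] args) {
        String longName = "index;q=" + "a".repeat(400) + "&output=embed.html";
        System.out.println(shorten(longName));
    }
}
```

Any links rewritten to point at such a file would need to use the shortened name as well.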

Publish OS packages for warc2html

Hi,

Can we please get some common OS packages for warc2html, in order to make it easier to install?

  • macOS (Homebrew)
  • Debian/Ubuntu (PPA)
  • RHEL (yum repo)
  • Windows (Chocolatey)

How to convert a WARC split into many files?

If I have a WARC split into numbered files (-00000.warc.gz, -00001.warc.gz, etc.), how can I load these into this tool? I'm fairly new to the WARC format, so sorry if this is a general question and not specific to this tool.
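One way to approach this, assuming the tool accepts several WARC paths in a single invocation, is to collect the numbered parts in sorted order and pass them all at once. The sketch below only builds and prints such a command line. The jar name `warc2html.jar` and the `-o` output flag are assumptions to verify against the project's README, and `MultiWarcCommand` is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; the command layout below is an assumption, not confirmed usage.
class MultiWarcCommand {

    /** Lists every *.warc.gz file directly inside dir (unsorted). */
    static List<Path> findParts(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.warc.gz")) {
            stream.forEach(parts::add);
        }
        return parts;
    }

    /** Builds one command that names all parts, sorted by file name. */
    static List<String> buildCommand(List<Path> parts, Path outDir) {
        List<Path> sorted = new ArrayList<>(parts);
        // Zero-padded suffixes (-00000, -00001, ...) sort correctly as plain strings.
        sorted.sort(Comparator.comparing(p -> p.getFileName().toString()));
        List<String> cmd = new ArrayList<>(
                List.of("java", "-jar", "warc2html.jar", "-o", outDir.toString()));
        for (Path part : sorted) {
            cmd.add(part.toString());
        }
        return cmd;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        List<String> cmd = buildCommand(findParts(dir), Path.of("outdir"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Sorting is there only to make the invocation deterministic; whether input order matters to the tool is not stated here.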

Exception in thread "main" java.lang.StringIndexOutOfBoundsException

Hello, I got an exception when converting a WARC from archive.org:


Exception in thread "main" java.lang.StringIndexOutOfBoundsException: index -1, length 0
        at java.base/java.lang.String.checkIndex(String.java:4563)
        at java.base/java.lang.AbstractStringBuilder.charAt(AbstractStringBuilder.java:351)
        at java.base/java.lang.StringBuilder.charAt(StringBuilder.java:91)
        at org.netpreserve.urlcanon.SemanticPreciseCanonicalizer.removeLeadingTrailingAndDuplicateChars(SemanticPreciseCanonicalizer.java:90)
        at org.netpreserve.urlcanon.AggressiveCanonicalizer.removeRedundantAmpersandsFromQuery(AggressiveCanonicalizer.java:100)
        at org.netpreserve.urlcanon.AggressiveCanonicalizer.canonicalize(AggressiveCanonicalizer.java:46)
        at org.netpreserve.warc2html.Warc2Html.makeUrlKey(Warc2Html.java:75)
        at org.netpreserve.warc2html.Warc2Html.rewriteLink(Warc2Html.java:259)
        at org.netpreserve.warc2html.Warc2Html.lambda$writeTo$0(Warc2Html.java:235)
        at org.netpreserve.warc2html.LinkRewriter.rewriteHTML(LinkRewriter.java:56)
        at org.netpreserve.warc2html.Warc2Html.writeTo(Warc2Html.java:235)
        at org.netpreserve.warc2html.Warc2Html.main(Warc2Html.java:70)
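The message "index -1, length 0" suggests that `charAt` was called on an empty `StringBuilder`, presumably for a URL whose query string is empty. The urlcanon source is not shown here, so the following is only a guess at the kind of guard that would avoid this. `AmpersandCleanup` is a hypothetical stand-in, not the library's code.

```java
// Hypothetical stand-in for the failing step; not the urlcanon implementation.
class AmpersandCleanup {

    /** Removes leading, trailing and repeated occurrences of ch, in place. */
    static void removeLeadingTrailingAndDuplicate(StringBuilder sb, char ch) {
        int write = 0;
        for (int read = 0; read < sb.length(); read++) {
            char c = sb.charAt(read);
            // Drop ch when it would start the output or directly follow another ch.
            if (c == ch && (write == 0 || sb.charAt(write - 1) == ch)) {
                continue;
            }
            sb.setCharAt(write++, c);
        }
        sb.setLength(write);
        // The length check keeps an empty builder from reaching charAt(-1).
        if (write > 0 && sb.charAt(write - 1) == ch) {
            sb.setLength(write - 1);
        }
    }

    static String clean(String query) {
        StringBuilder sb = new StringBuilder(query);
        removeLeadingTrailingAndDuplicate(sb, '&');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("") + "]");
        System.out.println("[" + clean("&&a=1&&b=2&") + "]");
    }
}
```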
