warcprotocol's Issues

Unexpected HTML body when parsing compressed "*.warc.gz" files from CommonCrawl

When using "toimik/WarcProtocol" to parse a "warc.gz" file from CommonCrawl, the parsed HTML body is not what we expect.

  • For example, one of the files: s3://commoncrawl/crawl-data/CC-MAIN-2023-14/segments/1679296948868.90/warc/CC-MAIN-20230328170730-20230328200730-00137.warc.gz

The parsed HTML body has extra content at the end. (One such case can be seen in the left part of the screenshot attached to the original issue.)

However, when we decompress the "warc.gz" file first and then parse it with "toimik/WarcProtocol", we get the correct body.

Has anyone else run into this issue? CommonCrawl "warc.gz" files are widely used now.
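The issue does not identify a root cause. As background that may help when reproducing it: CommonCrawl writes each WARC record as its own gzip member and concatenates the members into a single "warc.gz" file, so a reader has to handle member boundaries correctly. The sketch below uses Python's standard library only (it is not WarcProtocol's API, and the sample records are made up) to show one way to walk the members one at a time.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member found in `data`."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        member = decompressor.decompress(data)
        member += decompressor.flush()
        yield member
        # Whatever follows the end of this member belongs to the next one.
        data = decompressor.unused_data


# Two made-up records, each compressed as a separate gzip member.
record_a = b"WARC/1.0\r\nWARC-Type: response\r\n\r\nfirst body\r\n\r\n"
record_b = b"WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond body\r\n\r\n"
archive = gzip.compress(record_a) + gzip.compress(record_b)

members = list(iter_gzip_members(archive))
```

Comparing the per-member output with the output of a whole-file decompression is one way to check whether the extra bytes come from the decompression step or from the record parser.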

WarcParser progress

Large files take hours to process. Would it be possible to expose some information about progress? For example, each Record could carry the byte offset at which it starts in the source stream (or, if records are not read in sequential order, the parser could report a total number of records).
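Until the parser exposes offsets itself, progress can be estimated from outside by counting the bytes pulled from the source stream. The following is a minimal Python sketch of that idea, not part of WarcProtocol; `CountingReader` and `progress_fraction` are hypothetical names.

```python
import io


class CountingReader:
    """Wrap a binary stream and count the bytes read through it."""

    def __init__(self, stream):
        self._stream = stream
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._stream.read(size)
        self.bytes_read += len(chunk)
        return chunk


def progress_fraction(reader, total_bytes):
    """Return the share of the source consumed so far, between 0.0 and 1.0."""
    if total_bytes <= 0:
        return 0.0
    return min(reader.bytes_read / total_bytes, 1.0)


# Stand-in for a large file: 1000 bytes, of which 250 have been read.
source = io.BytesIO(b"x" * 1000)
reader = CountingReader(source)
reader.read(250)
fraction = progress_fraction(reader, 1000)
```

For a compressed file, the wrapper would sit below the decompressor so that the count is comparable to the file size on disk.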

"WARC-Identified-Payload-Type" on input WARC not used

I have an input WARC file with a response record. This record has a WARC-Identified-Payload-Type header, however its value is not used.

Specifically the ResponseRecord returned by the WarcParser.Parse() method doesn't set its IdentifiedPayloadType to the value of the WARC-Identified-Payload-Type header.

This isn't my expectation. The value exists in the WARC, and has a valid format. It should appear on the parsed response record.

http.warc.gz

(While I could create a custom PayloadTypeIdentifier, this is wasteful: it requires computation on every parsed record, and its logic for determining the payload type may not be as sophisticated as the logic that originally determined the payload content type when the WARC was created.)
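The requested behavior amounts to "use the recorded header if present, otherwise compute". A minimal Python sketch of that rule follows; it is not WarcProtocol code, the function names are hypothetical, and the simplified header parser ignores continuation lines.

```python
def parse_warc_headers(header_block: bytes) -> dict:
    """Parse a WARC header block into a dict keyed by lower-cased field name."""
    headers = {}
    lines = header_block.decode("utf-8").split("\r\n")
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.0"
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def identified_payload_type(headers, payload, fallback_identifier):
    """Prefer the type recorded in the WARC; compute one only if it is absent."""
    recorded = headers.get("warc-identified-payload-type")
    if recorded:
        return recorded
    return fallback_identifier(payload)


block = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Identified-Payload-Type: text/html\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
headers = parse_warc_headers(block)
payload_type = identified_payload_type(
    headers, b"", lambda payload: "application/octet-stream"
)
```

With this ordering, the fallback identifier runs only for records that lack the header, which avoids the per-record cost mentioned above.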

Payload Identification for RequestRecord and ResponseRecord is hardwired to HTTP

Currently it is impossible to create RequestRecord or ResponseRecord records with any Payload data for non-HTTP traffic. This means that headers like "WARC-Identified-Payload-Type" or "WARC-Payload-Digest" cannot be set for these records if they contain non-HTTP traffic, and there is no way to manually set those.

While PayloadTypeIdentifier can be extended to identify different payload content types, its Identify() method is only called with any payload bytes that have been extracted from the content block bytes. So it depends on Payload detection.

Payload detection itself is done in the Utils.IndexOfPayload() method, which is called with the content block bytes. IndexOfPayload() is hardcoded to search the byte array for the index of an HTTP-style double CRLF. Anything after that index is considered the payload. If no HTTP-style double CRLF is found, the Payload byte array is set to an empty array. This means even a custom PayloadTypeIdentifier can't help, since it receives an empty byte array for non-HTTP records.

Not having headers like "WARC-Identified-Payload-Type" creates major interoperability challenges, since tools in the WARC ecosystem, such as warcio and cdxj-indexer, use those headers when creating CDX files and more.
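The behavior described above can be illustrated with a short Python sketch. It mirrors the description in this issue and is not the library's actual code; the Gemini-style sample block is made up for illustration.

```python
CRLF_CRLF = b"\r\n\r\n"


def index_of_payload(content_block: bytes) -> int:
    """Return the index just past the first double CRLF, or -1 if none exists."""
    separator = content_block.find(CRLF_CRLF)
    if separator == -1:
        return -1
    return separator + len(CRLF_CRLF)


def extract_payload(content_block: bytes) -> bytes:
    """Return everything after the double CRLF, or an empty payload."""
    index = index_of_payload(content_block)
    return b"" if index == -1 else content_block[index:]


# An HTTP response separates headers from body with a double CRLF.
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
# A Gemini-style response has a single header line ending in one CRLF.
gemini_block = b"20 text/gemini\r\n# Hello\n"

http_payload = extract_payload(http_block)
gemini_payload = extract_payload(gemini_block)
```

Under this rule the non-HTTP block yields an empty payload, so any identifier called afterwards has nothing to inspect.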

WARC-Target-URI is not properly URL encoded, violates spec, generates errors with "warcio check"

WarcProtocol outputs WARC-Target-URI headers using targetUri.ToString(), which does not URL encode characters.

This causes URLs with spaces and other disallowed characters to appear in the WARC-Target-URI, which violates the spec. WARC tools like warcio flag this error:

$ warcio check converted.warc
Replacing spaces in invalid WARC-Target-URI: gemini://multiverse.thruhere.net/library/math_logic_comp/Unix System Administration Handbook.pdf

I believe the code should instead use targetUri.AbsoluteUri to get the URL with proper URL encoding.
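To show what the encoded header value would look like, here is a small Python sketch using the standard library. It only illustrates percent-encoding and is not how WarcProtocol (a .NET library) would implement the fix; `encode_target_uri` is a hypothetical helper.

```python
from urllib.parse import quote

# Characters that are legal in a URI as delimiters, plus "%" so that
# sequences which are already percent-encoded are not encoded twice.
URI_SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def encode_target_uri(uri: str) -> str:
    """Percent-encode characters that may not appear literally in a URI."""
    return quote(uri, safe=URI_SAFE_CHARS)


raw = (
    "gemini://multiverse.thruhere.net/library/math_logic_comp/"
    "Unix System Administration Handbook.pdf"
)
encoded = encode_target_uri(raw)
print(encoded)
```

Each space becomes `%20`, which is the form that "warcio check" expects in WARC-Target-URI.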
