Comments (7)

nurhafiz commented on June 17, 2024

In Set(string field, string value) of ResponseRecord and other record types that accept a payload, I left the following comment:

// NOTE: FieldForIdentifiedPayloadType, if any, is ignored because it is supposed to be auto detected when the content block is set

However, PayloadTypeIdentifier.Identify(byte[] payload) is not implemented; it returns null instead of throwing System.NotImplementedException. A PR for this is welcome.

I am unsure what the correct logic should be. Should WARC-Identified-Payload-Type be set to the parsed value or to the automatically identified one? Which value is correct if the two differ?

Should I introduce another property named AutoIdentifiedPayloadType to be set to the value of PayloadTypeIdentifier.Identify(byte[] payload) instead?

from warcprotocol.

acidus99 commented on June 17, 2024

// NOTE: FieldForIdentifiedPayloadType, if any, is ignored because it is supposed to be auto detected when the content block is set

Is it supposed to be auto detected? Why? The spec doesn't say that. Why would the parser be a better system to determine the content type of a blob of bytes, or which subset of the bytes is "meaningful" and a payload, or the digest of a payload, than whatever generated the WARC?

I think this comes down to one question: from a separation of concerns perspective, what functionality should the parser provide? I suggest it should simply parse input WARCs. If the record says the WARC-IP-Address is 127.0.0.1, the WARC-Identified-Payload-Type is application/pdf;lang=de, and the payload digest is magic:FFFFFF, then those values are reflected in the constructed records. The parser should do some format validation to ensure the WARC is properly formatted (for example, checking that a WARC-Record-ID is a valid URL).

The parser shouldn't be computing things like hashes, payloads, or content types. If you want the code to do data validation, great, but per the separation of concerns, that belongs outside the parser class, in a WarcValidator class. It shouldn't take a string representing the name of a hash; that approach won't support custom hash algorithms. It should use the .NET interface for hashes, so that even a custom hash could be used to validate hashes/digests, etc. It should be extensible.
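A minimal sketch of that idea, assuming a validator that lives outside the parser and accepts any System.Security.Cryptography.HashAlgorithm rather than an algorithm name. The names DigestValidator, DigestCheck, Register and Check are hypothetical and are not part of Toimik.WarcProtocol. The sketch also assumes the digest value is hexadecimal, whereas real WARC files often use Base32.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public enum DigestCheck { Match, Mismatch, UnknownAlgorithm, Malformed }

// Hypothetical sketch; none of these names exist in Toimik.WarcProtocol.
public sealed class DigestValidator
{
    // Maps a digest label (e.g. "sha1") to a factory for its algorithm.
    private readonly Dictionary<string, Func<HashAlgorithm>> factories =
        new Dictionary<string, Func<HashAlgorithm>>(StringComparer.OrdinalIgnoreCase);

    // Callers may register any HashAlgorithm subclass, including their own.
    public void Register(string label, Func<HashAlgorithm> factory)
    {
        factories[label] = factory;
    }

    // Checks a "label:value" digest against the bytes it claims to describe.
    // Assumption: the value is hexadecimal, compared case-insensitively.
    public DigestCheck Check(string labelledDigest, byte[] data)
    {
        int colon = labelledDigest.IndexOf(':');
        if (colon <= 0) return DigestCheck.Malformed;

        string label = labelledDigest.Substring(0, colon);
        string expected = labelledDigest.Substring(colon + 1);
        if (!factories.TryGetValue(label, out var factory))
            return DigestCheck.UnknownAlgorithm;

        using (HashAlgorithm algorithm = factory())
        {
            string actual = BitConverter.ToString(algorithm.ComputeHash(data)).Replace("-", "");
            return string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase)
                ? DigestCheck.Match
                : DigestCheck.Mismatch;
        }
    }
}
```

Because the validator only depends on the HashAlgorithm base class, a caller could register SHA1.Create() under "sha1" or a custom subclass under any label, and an unregistered label such as "magic" is reported as UnknownAlgorithm instead of failing the parse.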

The parser should be dumb and simple. Validators and other things can be built on top of that.

Anyway, just my thoughts on design.

nurhafiz commented on June 17, 2024

I took into account this from the specs:

The content-type of the record’s payload as determined by an independent check. This string shall not be arrived at by blindly promoting an HTTP Content-Type value up from a record block into the WARC header without direct analysis of the payload, as such values may often be unreliable.

(Emphasis mine)

https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-identified-payload-type

Hence, I assumed that auto detection must be done. What do you think?

acidus99 commented on June 17, 2024

I read that as advice for WARC creators, not WARC parsers.

Parsers should not present data that wasn't present in the source document. If a record doesn't have a WARC-Identified-Payload-Type header, the parser should not try to figure out the content type and then present it as if it were part of the WARC. If a WARC-Identified-Payload-Type is present, the user should see the value parsed from the source.
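To make that rule concrete, here is a small sketch of a header parser that reports only what the source contains: a field that is absent stays absent, and a field that is present is returned verbatim. HeaderFields and Parse are hypothetical names, not the library's API. The sketch assumes CRLF-terminated "Name: value" lines, treats field names case-insensitively, and ignores repeated fields and continuation lines.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not the Toimik.WarcProtocol parser.
public static class HeaderFields
{
    // Returns exactly the named fields found in the header block.
    // Nothing is computed or inferred from the record's content.
    public static Dictionary<string, string> Parse(string headerBlock)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = headerBlock.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0) continue; // skips the version line, e.g. "WARC/1.1"

            string name = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            fields[name] = value; // simplification: a repeated field keeps its last value
        }

        return fields;
    }
}
```

A caller that wants an auto-detected content type can run a separate identifier over the payload and keep that result next to the parsed record, so the two values are never confused.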

For example, perhaps a user is processing a bunch of WARCs and wants to gather statistics on the records. If the records returned by the parser contain different or more data than the source WARCs, the user cannot accurately do their task.

(To be clear, I'm not saying get rid of all the great PayloadTypeIdentifier code, or the code to compute digests, etc. That's all great. It should just be used to make creating WARCs easier. It shouldn't be used to enrich the data in a WARC that is being parsed.)

Does that make sense?

nurhafiz commented on June 17, 2024

I read that as advice for WARC creators, not WARC parsers.

If I am not mistaken, this paragraph implies that the advice applies to WARC parsers:

Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names.

https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#named-fields

I have not looked into this but do share if you are aware of how other WARC libraries (not necessarily in C#) handle this.

acidus99 commented on June 17, 2024

🤷 I don't know. Other parts talk about gzipping individual records, which is meant for creators. I just keep coming back to this: the parser shouldn't create and present data that doesn't exist in the source. Otherwise, how can a user of Toimik.WarcProtocol work with the actual data?

I (and I suspect most users) will need to use a WARC parsing library that returns the original values present in the WARC, in some form.

To answer some of your questions about auto-identified content, etc., here is how I think about grouping the different types of functionality:

  • WarcParser. This parses input WARCs and returns records. Those records should contain the data that was present in the source WARC. The parser does some basic format validation (IDs must be valid URLs, required headers are present, etc.).

  • WarcValidator. This does some data validation and returns warnings or errors: WARC-Block-Digest hashes that don't match, WARC-Concurrent-To values that point to records that don't exist, headers present on records where the spec doesn't allow them. This is like a linter, similar to other WARC tools such as warcio.

  • Aux classes/methods that can compute common digests and output them as Base32 or Base64 strings (see here). This also probably includes your PayloadTypeIdentifier code, with the default \r\n\r\n delimiter, etc. These classes can be used by users who want to compute more data for the WARCs they have parsed (tell me what the payload is, tell me what the block digest should be when it isn't present, etc.), or by users who are creating a WARC and need to compute various header values for their records.

  • WarcCreator. Given some records, write them in the appropriate format, handle gzipping individual records if need be, possibly handle splitting across files, etc.
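As an illustration of the linter-style validator in that grouping, the sketch below implements one of the listed checks: WARC-Concurrent-To values that point to records that don't exist. RecordInfo and ConcurrentToCheck are hypothetical types invented for this example. A real validator would work on whatever record type the parser returns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal record shape: an ID plus the IDs it is concurrent to.
public sealed class RecordInfo
{
    public RecordInfo(string id, params string[] concurrentTo)
    {
        Id = id;
        ConcurrentTo = concurrentTo;
    }

    public string Id { get; }

    public IReadOnlyList<string> ConcurrentTo { get; }
}

// Hypothetical validator check; it reads records and never modifies them.
public static class ConcurrentToCheck
{
    // Returns one warning per WARC-Concurrent-To value that matches no
    // WARC-Record-ID in the supplied set of records.
    public static List<string> FindDangling(IEnumerable<RecordInfo> records)
    {
        List<RecordInfo> all = records.ToList();
        var knownIds = new HashSet<string>(all.Select(r => r.Id), StringComparer.Ordinal);
        var warnings = new List<string>();

        foreach (RecordInfo record in all)
        {
            foreach (string target in record.ConcurrentTo)
            {
                if (!knownIds.Contains(target))
                {
                    warnings.Add($"{record.Id}: WARC-Concurrent-To references missing record {target}");
                }
            }
        }

        return warnings;
    }
}
```

Returning warnings instead of throwing keeps the check usable as a linter: the parsed records stay untouched, and the caller decides whether a dangling reference is fatal.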

nurhafiz commented on June 17, 2024

Please disregard the earlier fix as I am working on a new one without introducing AutoIdentifiedPayloadType.
