Payload Identification for RequestRecord and ResponseRecord is hardwired to HTTP about warcprotocol HOT 8 CLOSED

toimik commented on September 26, 2024

Payload Identification for RequestRecord and ResponseRecord is hardwired to HTTP

from warcprotocol.

Comments (8)

acidus99 commented on September 26, 2024

If you are open to a PR, I am happy to take a stab at fixing this. The approach I would take is to create an IPayloadDetector interface that is passed into the various Record constructors.

public interface IPayloadDetector
{
    // Could we detect a payload?
    bool Detect(byte[] contentBlock);

    // If so, what's the content type?
    string ContentType { get; }

    // If so, what are the bytes?
    byte[] PayloadBytes { get; }
}

This would allow for a pluggable system to detect payloads and their content types. I could call it in the same place you are calling Utils.IndexOfPayload().

nurhafiz commented on September 26, 2024

Could you give me example(s) of a RequestRecord / ResponseRecord record with Payload data for non-HTTP traffic?

acidus99 commented on September 26, 2024

Sure, here is a WARC (GitHub only accepts a .warc.gz) with a RequestRecord and ResponseRecord for the Gemini protocol:

gemini-request-response-records.warc.gz

Gemini requests are a single line, just a fully qualified URL and CRLF, and don't have the equivalent of an HTTP request body, such as with a POST. So the RequestRecord would have no payload.

Responses start with a single line containing a status code, a Meta/MimeType field, and CRLF. Everything else in the response is the payload. In the example WARC, the ResponseRecord would have a WARC-Identified-Payload-Type header of text/gemini.

I'm also looking for tools/example WARCs that include Request/Response records for other non-HTTP traffic such as DNS, FTP, gopher, etc.

acidus99 commented on September 26, 2024

There are some WARCs of Gemini in the Internet Archive as well:

https://archive.org/details/mozz-gemini-crawl-2020-1
https://archive.org/details/mozz-gemini-crawl-2020-2
https://archive.org/details/mozz-gemini-crawl-2020-3

nurhafiz commented on September 26, 2024

I am able to reproduce the Request and Response records in gemini-request-response-records.warc.gz using the following:

using var writer = new WarcWriter("your file");
var record = new RequestRecord(
    version: "1.1",
    recordId: new Uri("urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6"),
    date: DateTime.Now,
    new PayloadTypeIdentifier(),
    contentBlock: Encoding.UTF8.GetBytes($"gemini://gemi.dev/why-gemini.gmi{WarcParser.CrLf}"),
    contentType: "application/gemini; msgtype=request",
    infoId: new Uri("urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95"),
    targetUri: new Uri("gemini://gemi.dev/why-gemini.gmi"));
writer.Write(record);

var content = $"20 text/gemini\r\n# 🤓 Gemini Space for hacking \r\n\r\nAs of January 2022, there are only about there’s only around 200,000 pages across 1200 domains. This is small enough to be manageable, but large enough to be interesting to work on. To put things in perspective, Gemini space is roughly the size of WWW in 1992 right now.\r\n\r\nFun areas to hack on:\r\n* How can I surface cool and interesting content?\r\n* How do you search this space? How do you keep that up-to-date? This is great excuse to play around with full text search, reverse indexes, and hashtags?\r\n* How can you find meaningful meta data with a spartian protocol?\r\n* What does a graphic of Gemini space look like? How can we visualize the world of content?\r\n* How interconnecting are capsules and pages? How smol is this smol space?\r\n\r\nSome ideas in my head:\r\n* Antenna is freaking amazing! How can we better discovery content like this automatically? Polling gemlogs, etc.\r\n* Searching Gemini: This is a big one. We have a few search engines, but the results can be hit or miss. How can we improve this?\r\n* Gemini space is dynamic (e.g. ☠️ the Mailing List ☠️). Should we archive content? How would a Wayback machine for Geminispace work?\r\n* Capsule linter: Crawling gemini I see a lot of problems with various capsules. Broken links, bad Mimetypes, invalid gemtext, content in other languages missing \"lang=\" attributes, etc.{WarcParser.CrLf}";
using var writer = new WarcWriter("your file");
var concurrentTos = new HashSet<Uri>
{
    new Uri("urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6"),
};

var record = new ResponseRecord(
    version: "1.1",
    recordId: new Uri("urn:uuid:2f66d487-3130-4b2b-a227-9dfa2bc8247f"),
    date: DateTime.Now,
    new CustomPayloadTypeIdentifier(),
    contentBlock: Encoding.UTF8.GetBytes(content),
    contentType: "application/gemini; msgtype=response",
    infoId: new Uri("urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95"),
    targetUri: new Uri("gemini://gemi.dev/why-gemini.gmi"),
    // payloadDigest: "manually computed sha", // Uncomment to get the WARC-Payload-Digest header
    concurrentTos: concurrentTos,
    digestFactory: new DigestFactory("sha1"));
writer.Write(record);

....

//  This will produce the `WARC-Identified-Payload-Type: foobar` header
private class CustomPayloadTypeIdentifier : PayloadTypeIdentifier
{
    public override string? Identify(byte[] payload)
    {
        return "foobar";
    }
}

However, the ResponseRecord I generate has a different Content-Length from the one in your file, so the computed WARC-Block-Digest differs as well.

Could you check whether the source file's Content-Length is correct? Or is the emoji the culprit?
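
On the emoji question, one thing worth ruling out is a character count being used where a byte count is needed, since Content-Length counts the bytes of the content block. The snippet below is only a sketch of that difference using two characters from the sample page; it does not establish what actually caused the mismatch in this thread.

```csharp
using System;
using System.Text;

// Two characters that appear in the sample Gemini page.
string emoji = "🤓";       // U+1F913, outside the Basic Multilingual Plane
string apostrophe = "’";   // U+2019, typographic apostrophe

// string.Length counts UTF-16 code units, not bytes.
int emojiUnits = emoji.Length;                               // 2 (surrogate pair)
int emojiBytes = Encoding.UTF8.GetByteCount(emoji);          // 4
int apostropheUnits = apostrophe.Length;                     // 1
int apostropheBytes = Encoding.UTF8.GetByteCount(apostrophe); // 3

Console.WriteLine($"emoji: {emojiUnits} code units, {emojiBytes} UTF-8 bytes");
Console.WriteLine($"apostrophe: {apostropheUnits} code unit, {apostropheBytes} UTF-8 bytes");
```

Any length derived from the string rather than from the encoded byte array would therefore come out smaller than the real block length for this page.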

acidus99 commented on September 26, 2024

Hi. I can create a custom PayloadTypeIdentifier class. That's not the problem; sorry for not being clearer. The issue is with ResponseRecord's SetContentBlock() method and how it decides which bytes to pass to the custom PayloadTypeIdentifier. Specifically, the issue is here:

Detecting where the "meaningful" payload is inside the Response Record content bytes is done entirely in the line var index = Utils.IndexOfPayload(contentBlock);.

Utils.IndexOfPayload() searches the byte array for the sequence \r\n\r\n. This is looking for the blank line between the HTTP headers and the body of an HTTP response. In other words, Utils.IndexOfPayload() assumes the content block is an HTTP response, and will return -1 if it cannot find a \r\n\r\n in the content bytes.
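
To make the behavior concrete, here is a minimal sketch of a search of this kind. PayloadSearch.IndexAfterDoubleCrLf is a hypothetical stand-in written for illustration; it is not the library's code, and the exact index that Utils.IndexOfPayload() returns on a match is an assumption here.

```csharp
using System;
using System.Text;

byte[] http = Encoding.ASCII.GetBytes(
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello");
byte[] gemini = Encoding.UTF8.GetBytes(
    "20 text/gemini\r\n# A test file that doesn't have a double CRLF anywhere.\nJust A Test");

int httpIndex = PayloadSearch.IndexAfterDoubleCrLf(http);
int geminiIndex = PayloadSearch.IndexAfterDoubleCrLf(gemini);

// The HTTP body is found; the Gemini body is not, because it has no blank line.
Console.WriteLine(Encoding.ASCII.GetString(http, httpIndex, http.Length - httpIndex));
Console.WriteLine(geminiIndex);

public static class PayloadSearch
{
    // Hypothetical helper, not part of warcprotocol.
    // Returns the index of the first byte after the first CR LF CR LF, or -1.
    public static int IndexAfterDoubleCrLf(byte[] contentBlock)
    {
        for (int i = 0; i + 3 < contentBlock.Length; i++)
        {
            if (contentBlock[i] == (byte)'\r' && contentBlock[i + 1] == (byte)'\n' &&
                contentBlock[i + 2] == (byte)'\r' && contentBlock[i + 3] == (byte)'\n')
            {
                return i + 4;
            }
        }
        return -1;
    }
}
```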

For non-HTTP protocols, it is unlikely that a \r\n\r\n sequence exists in the content bytes. Even if it does, it's likely to be a false positive, because most protocols don't use \r\n\r\n to separate metadata such as headers from "meaningful" payload data. The example WARC I gave you isn't a great example because it just happens to contain a \r\n\r\n in the content block bytes, but that's a coincidence. The subset of the content bytes after that \r\n\r\n isn't meaningful.

Let me give you a better example of the problem. I'm using content block bytes that don't contain a \r\n\r\n (because that's not meaningful in Gemini or other non-HTTP protocols), and I've modified the CustomPayloadTypeIdentifier class to output the length of the byte array that was supplied to the Identify() method:

var content = $"20 text/gemini\r\n# A test file that doesn't have a double CRLF anywhere.\nJust A Test";
using (var writer = new WarcWriter("path-to-warc.warc"))
{
    var concurrentTos = new HashSet<Uri>
    {
        new Uri("urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6"),
    };

    var record = new ResponseRecord(
        version: "1.1",
        recordId: new Uri("urn:uuid:2f66d487-3130-4b2b-a227-9dfa2bc8247f"),
        date: DateTime.Now,
        new CustomPayloadTypeIdentifier(),
        contentBlock: Encoding.UTF8.GetBytes(content),
        contentType: "application/gemini; msgtype=response",
        infoId: new Uri("urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95"),
        targetUri: new Uri("gemini://gemi.dev/why-gemini.gmi"),
        // payloadDigest: "manually computed sha", // Uncomment to get the WARC-Payload-Digest header
        concurrentTos: concurrentTos,
        digestFactory: new DigestFactory("sha1"));
    writer.Write(record);
}

//  This will produce the `WARC-Identified-Payload-Type: foobar-payload-length-` and the length of the byte array received
public class CustomPayloadTypeIdentifier : PayloadTypeIdentifier
{
    public override string? Identify(byte[] payload)
    {
        return $"foobar-payload-length-{payload.Length}";
    }
}

Since contentBlock doesn't contain a \r\n\r\n sequence, Utils.IndexOfPayload() returns -1. The Payload bytes for the record are therefore set to an empty byte array, and that empty array is what CustomPayloadTypeIdentifier is given to work on. Here is the output WARC:

WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:2f66d487-3130-4b2b-a227-9dfa2bc8247f>
WARC-Date: 2023-04-26T15:20:02Z
Content-Length: 83
Content-Type: application/gemini; msgtype=response
WARC-Concurrent-To: <urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6>
WARC-Block-Digest: sha1:A2D708682D4A6534EA21B7A3E82BFA223C7215BD
WARC-Target-URI: gemini://gemi.dev/why-gemini.gmi
WARC-Warcinfo-ID: <urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95>
WARC-Identified-Payload-Type: foobar-payload-length-0

20 text/gemini
# A test file that doesn't have a double \r\n anywhere.
Just A Test

Notice the WARC header WARC-Identified-Payload-Type: foobar-payload-length-0, showing that the Identify() method was given a zero-length byte array.

The fundamental issue here is that identification of Payload content type is directly linked to being able to extract out Payload bytes from the content bytes of the response record. What bytes should be extracted out as the Payload varies from protocol to protocol. But right now, the logic is hardcoded to always use Utils.IndexOfPayload() to extract Payload bytes, and that is looking for an HTTP-specific byte sequence. That's the problem.

This is why I'm suggesting some kind of PayloadDetector class that the user can pass in. This class is responsible for determining which bytes in the content bytes make up the Payload (since that varies from protocol to protocol), and what the WARC-Identified-Payload-Type should be, if any. In short, like your PayloadTypeIdentifier class, but with an added method to detect and extract the Payload bytes.

Does that make sense?
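
As an illustration of the proposal, a Gemini detector built on the IPayloadDetector interface suggested earlier in the thread might look roughly like this. GeminiPayloadDetector is a hypothetical name, nothing here exists in warcprotocol, and using the whole meta field as the content type is an assumption made for brevity.

```csharp
using System;
using System.Text;

var detector = new GeminiPayloadDetector();
byte[] block = Encoding.UTF8.GetBytes("20 text/gemini\r\n# A test file\nJust A Test");

bool found = detector.Detect(block);
Console.WriteLine(found ? detector.ContentType : "no payload");

// Interface as proposed in the thread.
public interface IPayloadDetector
{
    bool Detect(byte[] contentBlock);
    string ContentType { get; }
    byte[] PayloadBytes { get; }
}

// Hypothetical detector for Gemini responses: "<status> <meta>\r\n<body>".
public sealed class GeminiPayloadDetector : IPayloadDetector
{
    public string ContentType { get; private set; } = "";
    public byte[] PayloadBytes { get; private set; } = Array.Empty<byte>();

    public bool Detect(byte[] contentBlock)
    {
        ContentType = "";
        PayloadBytes = Array.Empty<byte>();

        // The response header ends at the first single CRLF.
        int crlf = -1;
        for (int i = 0; i + 1 < contentBlock.Length; i++)
        {
            if (contentBlock[i] == (byte)'\r' && contentBlock[i + 1] == (byte)'\n')
            {
                crlf = i;
                break;
            }
        }
        if (crlf < 0) return false;

        // Only 2x (success) responses carry a body; their meta is a MIME type.
        string header = Encoding.UTF8.GetString(contentBlock, 0, crlf);
        if (header.Length < 2 || header[0] != '2' || !char.IsDigit(header[1])) return false;

        // Assumption: the full meta field is used as the content type.
        ContentType = header.Substring(2).Trim();

        // Everything after the header line is the payload.
        int start = crlf + 2;
        PayloadBytes = new byte[contentBlock.Length - start];
        Array.Copy(contentBlock, start, PayloadBytes, 0, PayloadBytes.Length);
        return true;
    }
}
```

The record constructors would then call Detect() in the place where Utils.IndexOfPayload() is called today, and an HTTP detector could keep the existing \r\n\r\n logic.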

nurhafiz commented on September 26, 2024

Thanks for the clarification.
Please see if the fix meets your needs. I'll commit to main if it does.

acidus99 commented on September 26, 2024

That helps. thanks
