Comments (8)

acidus99 commented on September 26, 2024

If you are open to a PR, I am happy to work on a fix. The approach I would take is to create an IPayloadDetector interface that is passed into the various Record constructors.

public interface IPayloadDetector
{
    // could we detect a payload?
    bool Detect(byte[] contentBlock);

    // if so, what's the content type
    string ContentType { get; }

    // if so, what are the bytes?
    byte[] PayloadBytes { get; }
}

This would allow for a pluggable system to detect payloads and their content types. I could call it in the same place you are calling Utils.IndexOfPayload().

from warcprotocol.

nurhafiz commented on September 26, 2024

Could you give me example(s) of a RequestRecord / ResponseRecord record with Payload data for non-HTTP traffic?

acidus99 commented on September 26, 2024

Sure, here is a WARC (GitHub only accepts a .warc.gz) with a RequestRecord and a ResponseRecord for the Gemini protocol:

gemini-request-response-records.warc.gz

Gemini requests are a single line, just a fully qualified URL and CRLF, and don't have the equivalent of an HTTP request body, such as with a POST. So the RequestRecord would have no payload.

Responses start with a single line containing a status code, a Meta/MimeType field, and a CRLF. Everything else in the response is the payload. In the example WARC, the ResponseRecord would have a WARC-Identified-Payload-Type header of text/gemini.

I'm also looking for tools/example WARCs that include Request/Response records for other non-HTTP traffic such as DNS, FTP, gopher, etc.

acidus99 commented on September 26, 2024

There are some WARCs of Gemini in the Internet Archive as well:

https://archive.org/details/mozz-gemini-crawl-2020-1
https://archive.org/details/mozz-gemini-crawl-2020-2
https://archive.org/details/mozz-gemini-crawl-2020-3

nurhafiz commented on September 26, 2024

I am able to reproduce the Request and Response records in gemini-request-response-records.warc.gz using the following:

using var writer = new WarcWriter("your file");
var record = new RequestRecord(
    version: "1.1",
    recordId: new Uri("urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6"),
    date: DateTime.Now,
    new PayloadTypeIdentifier(),
    contentBlock: Encoding.UTF8.GetBytes($"gemini://gemi.dev/why-gemini.gmi{WarcParser.CrLf}"),
    contentType: "application/gemini; msgtype=request",
    infoId: new Uri("urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95"),
    targetUri: new Uri("gemini://gemi.dev/why-gemini.gmi"));
writer.Write(record);

Request

var content = $"20 text/gemini\r\n# 🤓 Gemini Space for hacking \r\n\r\nAs of January 2022, there are only about there’s only around 200,000 pages across 1200 domains. This is small enough to be manageable, but large enough to be interesting to work on. To put things in perspective, Gemini space is roughly the size of WWW in 1992 right now.\r\n\r\nFun areas to hack on:\r\n* How can I surface cool and interesting content?\r\n* How do you search this space? How do you keep that up-to-date? This is great excuse to play around with full text search, reverse indexes, and hashtags?\r\n* How can you find meaningful meta data with a spartian protocol?\r\n* What does a graphic of Gemini space look like? How can we visualize the world of content?\r\n* How interconnecting are capsules and pages? How smol is this smol space?\r\n\r\nSome ideas in my head:\r\n* Antenna is freaking amazing! How can we better discovery content like this automatically? Polling gemlogs, etc.\r\n* Searching Gemini: This is a big one. We have a few search engines, but the results can be hit or miss. How can we improve this?\r\n* Gemini space is dynamic (e.g. ☠️ the Mailing List ☠️). Should we archive content? How would a Wayback machine for Geminispace work?\r\n* Capsule linter: Crawling gemini I see a lot of problems with various capsules. Broken links, bad Mimetypes, invalid gemtext, content in other languages missing \"lang=\" attributes, etc.{WarcParser.CrLf}";
using var writer = new WarcWriter("your file");
var concurrentTos = new HashSet<Uri>
{
    new Uri("urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6"),
};

var record = new ResponseRecord(
    version: "1.1",
    recordId: new Uri("urn:uuid:2f66d487-3130-4b2b-a227-9dfa2bc8247f"),
    date: DateTime.Now,
    new CustomPayloadTypeIdentifier(),
    contentBlock: Encoding.UTF8.GetBytes(content),
    contentType: "application/gemini; msgtype=response",
    infoId: new Uri("urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95"),
    targetUri: new Uri("gemini://gemi.dev/why-gemini.gmi"),
    // payloadDigest: "manually computed sha", // Uncomment to get the WARC-Payload-Digest header
    concurrentTos: concurrentTos,
    digestFactory: new DigestFactory("sha1"));
writer.Write(record);

....

//  This will produce the `WARC-Identified-Payload-Type: foobar` header
private class CustomPayloadTypeIdentifier : PayloadTypeIdentifier
{
    public override string? Identify(byte[] payload)
    {
        return "foobar";
    }
}

Response1

Response2

However, ResponseRecord has a different value for Content-Length. As a result, the computation of the Block-Digest is affected.

Could you check whether the source file's Content-Length is correct? Or is the emoji the culprit?
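One quick way to check the emoji question is to compare a string's character count with its UTF-8 byte count, since Content-Length counts bytes of the content block rather than characters. The helper below is only an illustrative sketch; the class and method names are made up for this example and are not part of the library.

```csharp
using System.Text;

// Hypothetical helper, not part of WarcProtocol.
public static class ByteCountSketch
{
    // Number of bytes the string occupies once encoded as UTF-8.
    public static int Utf8Length(string s) => Encoding.UTF8.GetByteCount(s);
}
```

For example, "🤓" has a string Length of 2 (a UTF-16 surrogate pair) but encodes to 4 UTF-8 bytes, and the curly apostrophe "’" is 1 character but 3 bytes, so a length computed from characters will come out smaller than one computed from bytes.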

acidus99 commented on September 26, 2024

Hi. I can create a custom PayloadTypeIdentifier class. That's not the problem; sorry for not being clearer. The issue is with ResponseRecord's SetContentBlock() method and how it decides which bytes to pass to the custom PayloadTypeIdentifier. Specifically, the issue is here:

image

Detecting where the "meaningful" payload is inside the Response Record content bytes is done entirely in the line var index = Utils.IndexOfPayload(contentBlock);.

Utils.IndexOfPayload() searches the byte array for the sequence \r\n\r\n. This is the blank line that separates the HTTP headers from the HTTP body in an HTTP response. In other words, Utils.IndexOfPayload() assumes the content block is an HTTP response, and will return -1 if it cannot find a \r\n\r\n in the content bytes.
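To make the behavior concrete, here is a rough sketch of what a search for \r\n\r\n looks like. This is not the library's actual implementation; the names are hypothetical, and returning the position just after the separator is an assumption made for this example.

```csharp
// Hypothetical sketch, not the real Utils.IndexOfPayload().
public static class DoubleCrLfSketch
{
    private static readonly byte[] Separator = { 13, 10, 13, 10 }; // \r\n\r\n

    // Returns the index of the first byte after the first \r\n\r\n,
    // or -1 when the sequence does not occur in the block.
    public static int IndexAfterDoubleCrLf(byte[] block)
    {
        for (var i = 0; i + Separator.Length <= block.Length; i++)
        {
            var match = true;
            for (var j = 0; j < Separator.Length; j++)
            {
                if (block[i + j] != Separator[j])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                return i + Separator.Length;
            }
        }

        return -1;
    }
}
```

A search like this finds the body of an HTTP response, but it returns -1 for a Gemini response whose body happens not to contain a blank line.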

For non-HTTP protocols, it is unlikely that a \r\n\r\n sequence exists in the content bytes. Even if it did, it's likely to be a false positive because most protocols don't use \r\n\r\n to separate meta data like headers from "meaningful" payload data. The example WARC I gave you isn't a great example because it just happened to contain a \r\n\r\n in the content block bytes, but that's just a coincidence. The subset of the content bytes after that \r\n\r\n isn't meaningful.

Let me give you a better example of the problem. I'm using content block bytes that don't contain a \r\n\r\n (because that's not meaningful in Gemini or other non-HTTP protocols), and I've modified the CustomPayloadTypeIdentifier class to output the length of the byte array that was supplied to the Identify() method:

var content = $"20 text/gemini\r\n# A test file that doesn't have a double CRLF anywhere.\nJust A Test";
using(var writer = new WarcWriter("path-to-warc.warc")) {
    var concurrentTos = new HashSet<Uri>
    {
        new Uri("urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6"),
    };

    var record = new ResponseRecord(
        version: "1.1",
        recordId: new Uri("urn:uuid:2f66d487-3130-4b2b-a227-9dfa2bc8247f"),
        date: DateTime.Now,
        new CustomPayloadTypeIdentifier(),
        contentBlock: Encoding.UTF8.GetBytes(content),
        contentType: "application/gemini; msgtype=response",
        infoId: new Uri("urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95"),
        targetUri: new Uri("gemini://gemi.dev/why-gemini.gmi"),
        // payloadDigest: "manually computed sha", // Uncomment to get the WARC-Payload-Digest header
        concurrentTos: concurrentTos,
        digestFactory: new DigestFactory("sha1"));
    writer.Write(record);
}

//  This will produce the `WARC-Identified-Payload-Type: foobar-payload-length-` and the length of the byte array received
public class CustomPayloadTypeIdentifier : PayloadTypeIdentifier
{
    public override string? Identify(byte[] payload)
    {
        return $"foobar-payload-length-{payload.Length}";
    }
}

Since contentBlock doesn't contain a \r\n\r\n sequence, Utils.IndexOfPayload() returns -1. The Payload bytes for the record are therefore set to an empty byte array, and the CustomPayloadTypeIdentifier is given an empty byte array to work on. Here is the output WARC:

WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:2f66d487-3130-4b2b-a227-9dfa2bc8247f>
WARC-Date: 2023-04-26T15:20:02Z
Content-Length: 83
Content-Type: application/gemini; msgtype=response
WARC-Concurrent-To: <urn:uuid:e540b81b-09f6-4aff-a818-4a132778a3f6>
WARC-Block-Digest: sha1:A2D708682D4A6534EA21B7A3E82BFA223C7215BD
WARC-Target-URI: gemini://gemi.dev/why-gemini.gmi
WARC-Warcinfo-ID: <urn:uuid:d1905d86-66a1-4910-8c69-81ec8f9c2c95>
WARC-Identified-Payload-Type: foobar-payload-length-0

20 text/gemini
# A test file that doesn't have a double \r\n anywhere.
Just A Test

Notice the WARC header WARC-Identified-Payload-Type: foobar-payload-length-0, showing that the Identify() method was given a zero-length byte array.

The fundamental issue here is that identification of Payload content type is directly linked to being able to extract out Payload bytes from the content bytes of the response record. What bytes should be extracted out as the Payload varies from protocol to protocol. But right now, the logic is hardcoded to always use Utils.IndexOfPayload() to extract Payload bytes, and that is looking for an HTTP-specific byte sequence. That's the problem.

This is why I'm suggesting some kind of PayloadDetector class that the user can pass in. This class is responsible for determining which bytes in the content block make up the Payload (since that varies from protocol to protocol), and what the WARC-Identified-Payload-Type should be, if any. In short, it is like your PayloadTypeIdentifier class, but with an added method to detect and extract the Payload bytes.
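As a rough illustration of the idea, a Gemini detector built on the IPayloadDetector interface proposed earlier in this thread might look like the sketch below. GeminiPayloadDetector is a hypothetical name, nothing here exists in the library, and the sketch assumes that only 2x (success) responses carry a body and that their Meta field is the MIME type.

```csharp
using System;
using System.Text;

// Interface as proposed earlier in this thread (a proposal, not an existing library type).
public interface IPayloadDetector
{
    bool Detect(byte[] contentBlock);
    string ContentType { get; }
    byte[] PayloadBytes { get; }
}

// Hypothetical detector for Gemini responses.
public sealed class GeminiPayloadDetector : IPayloadDetector
{
    public string ContentType { get; private set; } = string.Empty;
    public byte[] PayloadBytes { get; private set; } = Array.Empty<byte>();

    public bool Detect(byte[] contentBlock)
    {
        ContentType = string.Empty;
        PayloadBytes = Array.Empty<byte>();

        // Find the first CRLF, which ends the single response header line.
        var headerEnd = -1;
        for (var i = 0; i + 1 < contentBlock.Length; i++)
        {
            if (contentBlock[i] == (byte)'\r' && contentBlock[i + 1] == (byte)'\n')
            {
                headerEnd = i;
                break;
            }
        }
        if (headerEnd < 0) return false;

        // The header line is "<status> <meta>", for example "20 text/gemini".
        var header = Encoding.UTF8.GetString(contentBlock, 0, headerEnd);
        var parts = header.Split(' ', 2);

        // Assumption: only 2x responses have a body, and their meta is a MIME type.
        if (parts.Length < 2 || !parts[0].StartsWith("2", StringComparison.Ordinal)) return false;

        var bodyStart = headerEnd + 2;
        if (bodyStart >= contentBlock.Length) return false;

        ContentType = parts[1];
        PayloadBytes = contentBlock.AsSpan(bodyStart).ToArray();
        return true;
    }
}
```

With the 83-byte test content from the example above, Detect() would return true, ContentType would be text/gemini, and PayloadBytes would hold the 67 bytes that follow the 16-byte header line. Note that the Meta field may carry parameters (for example a lang parameter), which this sketch passes through unchanged.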

Does that make sense?

nurhafiz commented on September 26, 2024

Thanks for the clarification.
Please see if the fix meets your needs. I'll commit to main if it does.

acidus99 commented on September 26, 2024

That helps. Thanks!
