probe-lab / network-measurements Goto Github PK

View Code? Open in Web Editor NEW

48.0 48.0 12.0 440.73 MB

License: MIT License

Python 0.14% Shell 0.02% Jupyter Notebook 99.84%

network-measurements's People

Contributors

Stargazers

Watchers

Forkers

dennis-tra leobago cortze pradeephubgit noryev mrd0ll4r gitaaron pedroakos teshomegit sukunrt

network-measurements's Issues

Network self-organisation

Hi,

I had a thought that IPFS nodes may self-organize into higher level clusters because of the way connections are formed and maintained.

More specifically, knowing the high level of node churn, do longer running nodes tend towards connecting to each others?
I would think not because of the way latency is prioritised, which result in nodes organising based on distance. Is this good?

More generally, how do we measure self-organization, and could we not use this to our benefit too?

Kubo Version 12-Month Trend

Right now, weekly report includes a snapshot in time for a specific week:

This is useful for understanding current distribution, but does not help with building intuition about trends, how slow is adoption new versions, or if there is difference in ramp-down of specific older version over multiple weeks.

We have historical data, so perhaps we could create a visualization: a plot line where X-axis is time (last 12 months) and Y-axis is the % of peers running specific kubo that week (week-to-week). Similar to this (webextension version from firefox add-on store):

This would be similar to existing:

But focused on % and bigger time window (12 months)

@dennis-tra is this feasible with existing data and tooling, or too much of an ask?

Large Number of Unavailable Peers

Context

We're seeing a very large number of offline peers each week (graph below, latest graph here). Offline peers are defined as those that are seen online for 10% of time or less (https://probelab.io/ipfsdht/#availability). This might be affecting the churn that we're seeing in the network as the churn CDF shows median lifetime of ~20 minutes but in reality will be lower since churn excludes nodes we have never contacted.

Such short-lived peers do not actually contribute to the network, as they fill other peers' routing tables, but do not stay online to provide records, if they happen to store any.

This is a tracking issue for figuring out more details, together with some thoughts on what we can do to find out where this large number is coming from.

Facts

We see:

~13-20k unique peers offline each week, which make up 30-40% of all peers seen.
~1250 connection errors per crawl

What might be happening

It could be very short lived nodes whose lifetime fits between crawler runs (30m intervals).
On startup, a node contacts neighbours and they will add the new node to their routing tables, the node could then go offline and never be seen by the crawler.

Ways forward

We need to:

find what proportion of 20k have never been contacted?
catch peers with short lifetimes - get user agent and lifetime estimate
- possible experiment: run instance of nebula with 5 minute crawl interval
find what is the in-degree of the unresponsive peers - how many other peers have them in their routing table?

As a solution, we could avoid adding peers to the routing table immediately after they're seen online. We could wait for some amount of time before adding them. In the meantime, new peers can be pinged more frequently when they are first added to routing table, gradually decreasing ping frequency over time as peer is known to be stable.

The primary question here would be how long should we wait before adding peers to the routing table.

Other thoughts and ideas more than welcome.

Is IPFS serving the closest copy of cached content

I'm wondering what would be the outcome of the following experiment.

A publisher publishes a file from a US-based node.
An EU-based node requests and fetches the file and then either pins it permanently or provides it temporarily.
At this point, the provider record should include the PeerID of both (the US and the EU) nodes.
Another EU-based peer is requesting the same file.

Have we verified that they will receive the EU-based copy? @dennis-tra did we look into this aspect for the experiments we reported here: https://gateway.ipfs.io/ipfs/bafybeidbzzyvjuzuf7yjet27sftttod5fowge3nzr3ybz5uxxldsdonozq ?

Step 3 above would also be worth a look, i.e., do both PeerIDs end up in all the provider records published in the system? Or if not, at which fraction of the records do we have both peers?

Track and measure number of Brave browser IPFS nodes

Brave browser ships a feature which downloads and runs Kubo.

We want to measure the number of Brave IPFS nodes on the public network.

@lidel said they announce themselves as kubo/0.16.0/brave and that we could find them by:

collecting peerids from peer records on DHT
read agent version of each via ipfs id QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN | jq .AgentVersion

DHT Lookup Latency Increase since mid-June 2023

Context

We've been observing a slight increase in the DHT Lookup Latency since around the mid of June 2023. The increase is in the order of ~10% and is captured in our measurement plots at: https://probelab.io/ipfskpi/#dht-lookup-performance-long-plot. This is a tracking issue to identify the cause of the latency increase.

Evidence

Below the short-term latency graph (https://probelab.io/ipfsdht/#dht-lookup-performance-overall-plot):

Observing the CDFs of the DHT Lookup latency across different regions over time, we see a clear move towards the right of the plot for several regions, most notably for eu-central, but also ap-south-1 and also af-south-1 (in Week 27).

Week 24 (2023-06-12/18)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-24/ipfs#dht-performance

Week 25 (2023-06-19-25)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-25/ipfs#dht-performance

Week 26 (2023-06-26 - 2023-07-02)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-26/ipfs#dht-performance

Week 27 (2023-07-03/09)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-27/ipfs#dht-performance

Thoughts

The latency seems to be heading back down, but we're not sure if there's a specific reason for this behaviour. Some thoughts:

The latency bump seems to coincide with kubo-v0.21.0-rc1 and later releases at the end of June: https://github.com/ipfs/kubo/releases/tag/v0.21.0-rc1. There doesn't seem to be something that could affect performance there, other than "Saving previously seen nodes for later bootstrapping" (https://github.com/ipfs/kubo/blob/release-v0.21.0/docs/changelogs/v0.21.md#saving-previously-seen-nodes-for-later-bootstrapping), but even in this case, the original bootstrappers are not removed.
The latency bump also comes about a month later than the Boxo release at kubo-v0.20.0: https://github.com/ipfs/kubo/releases#boxo-under-the-covers. Not sure if something in there could affect performance (?)
The majority of nodes still use kubo-v0.18 as per: https://probelab.io/ipfsdht/#kubo-version-distribution, but there are about 3.5k nodes in v0.20.0 and v0.21.0, which could be enough in order to cause this slight increase.
At the same time, we're observing a decrease in the number of available DHT server nodes in the EU region (https://probelab.io/ipfsdht/#dht-availability-classified-region-plot - not sure if we can find historical data for this @iand @dennis-tra ?), which could impact performance from our eu-central node.

Any other thoughts @Jorropo @aschmahmann @lidel @hacdias ?

Website Monitoring feedback 202302 and 202303

I have expanded the scope of this issue to be feedback on the various website-monitoring reports that have come in during 202302 and 202303. I'll consider this done when we have a first draft that I would feel comfortable sharing with other leaders and not needing to be there to answer/explain it. After that we can develop a separate process for how we report ongoing observations, questions, and suggestions.

This concerns https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-7/ipfs/README.md#website-monitoring

First off, thanks for adding this! Good stuff.

A few things that I think would be helpful to document:

What is the configuration of the node monitoring these sites? For example, is it stalk Chromium phantoms node? (I think we should be explicit that Companion (for intercepting IPFS URLs) is not in the mix.
Is the cache cleared between each run?
I assume "Page Load" is https://github.com/macbre/phantomas/blob/devel/docs/metrics.md#performancetimingpageload . I don't find their docs helpful. There is so much that goes into loading a page. That said, I assume this is the "Load" metric that shows up in one's web inspector (screenshot - red vertical bar). I could imagine it would be better to get DOMContentLoaded (blue vertical) since that isn't as susceptible to the JS processing on the page I believe (but does capture the network traffic up front fetching JS). (That said, this isn't my expertise and I know there are a lot of intricacies. @lidel will likely have a good suggestion here.). Regardless, I'd love to be more specific than "page load", or at least point people to something like https://developer.mozilla.org/en-US/docs/Web/API/PerformanceNavigationTiming so they have more insight into what that means)

Week over week trends - It would be great to have a mechanism to detect if this radically changes week over week. One idea would be to pick a few sites and a few regions and plot the p50 and p90 of time to first bye (since that shouldn't be susceptible to the content of the page).
Other sites I could imagine adding:

ipfs.tech
blog.ipfs.tech
docs.libp2p.io
blog.libp2p.io
Or maybe it would be worth just syncing to whatever set of sites are being pinned to the collab cluster

Measurement breakdown of IPFS TTFB performance based on content resolution path

Specifically trying to understand the differences in TTFB performance between ipfs.io gateways and go-ipfs nodes across the following resolution paths:

DHT (that hits a Hydra node) - may be related to libp2p/hydra-booster#93
DHT (that does not hit a Hydra node)
Peered cluster (i.e. Pinata)

These performance metrics can help inform where bottlenecks are happening, and how to think through setting a reasonable SLA for services that build on top of IPFS.

cc @yiannisbot @guseggert

Unreachable providers for popular CIDs

We've recently started measuring the performance of PL websites over kubo. We've been presenting some of these results in our weekly reports and we're also now putting more results at probelab.io (e.g., https://probelab.io/websites/protocol.ai/ for protocol.ai). As a way to get more insight into why the performance is what it is, we have collected the number of providers for each one of them. That will enable us to see if, for instance, there are no providers for a site.

We've found an unexpected result, which might makes sense if one gives a deeper thought into it: there are a ton of unreachable providers for most of the websites we're monitoring as shown in the graph below for protocol.ai. Note that the stable providers for protocol.ai should be two, i.e., that's where we currently pin content.

This happens because clients fetch the site, re-provide it and then leave the network, leaving stale records behind. In turn, this means that popular content, which is supposed to be privileged due to the content addressing nature of IPFS, is basically disadvantaged because clients would have to contact tens of "would be" providers before they find one that is actually available.

I'm starting this issue to raise attention to the issue, which should be addressed asap, IMO. We've previously discussed in slack a couple of fixes, such as setting a TTL for provider records equal to the average uptime of the node publishing the provider record. However, this would be a breaking protocol change and would therefore not be easy to deploy before the Composable DHT is in place. Turning off reproviding (temporarily, until we have Composable DHT) could be another avenue to fix this issue.

Other ideas are more than welcome. Tagging people who contributed to the discussion earlier, or would likely have ideas, or be aware of previous discussion around this issue: @Jorropo @guillaumemichel @aschmahmann @lidel @dennis-tra

Gateway Measurements: Helia

Corresponding Helia Issue: ipfs/helia#275

Tasks

Beta Give feedback

Build a helia-docker container: ipfs/helia-http-gateway#1
Run Gateway measurment tests for helia-node in docker.
Publish results on https://probelab.io/ipfsgateways/
Options

As discussed with @dennis-tra.

RFM: IPNI Lookup Performance

Request from @BigLep in FIL slack (#probe-lab channel).

The ProbeLab team is currently running a continuous experiment to measure the IPFS DHT Publish & Lookup performance (see details here: https://probelab.io/ipfsdht/#performance). There is a request to do the same for IPNI indexers, ideally using the same set of nodes to avoid extra costs.

From @BigLep: "Stopwatch starts when we begin the GET /routing/v1/providers/{CID} (link) call and the stopwatch ends when the HTTP request completes."

PL Websites not continuously pinned at PL pinning cluster or Fleek

Context: ProbeLab is monitoring the uptime and performance of several PL websites at: https://probelab.io/websites/ Those sites are pinned in two stable providers (among other nodes that decide to pin these sites in the P2P network): i) PL's pinning cluster, and ii) Fleek's cluster.

One of the things we're monitoring is whether those stable providers are continuously making those sites available.

Assumption: We've worked closely with both the team that operates PL's pinning cluster and Fleek to make sure everything is in place and correctly configured (e.g., all nodes are running the Accelerated DHT Client) to reprovide the CIDs for the websites, so we've been expecting the situation to be rather stable. Stable here means websites are pinned to 7 nodes from PL's pinning cluster and 2 nodes from Fleek's fleet.

Results: Our results are presented under each website's results page, e.g., https://probelab.io/websites/blog.ipfs.tech/#website-trend-hosters-blogipfstech for https://blog.ipfs.tech.

We see that stable providers are far from stable for several websites.
We've seen a period roughly during the days of 24th to 27th July, where all sites were not available from either of the two known stable providers, e.g.:
- blog.libp2p.io: https://probelab.io/websites/blog.libp2p.io/
For some websites the situation continues. E.g.,
- conensuslab.world: https://probelab.io/websites/consensuslab.world/#website-trend-hosters-consensuslabworld
- green.filecoin.io: https://probelab.io/websites/green.filecoin.io/#website-trend-hosters-greenfilecoinio
- research.protocol.ai: https://probelab.io/websites/research.protocol.ai/#website-trend-hosters-researchprotocolai
- specs.ipfs.tech: https://probelab.io/websites/specs.ipfs.tech/#website-trend-hosters-specsipfstech
- web3.storage: https://probelab.io/websites/web3.storage/#website-trend-hosters-web3storage
From then on, the situation continues to be pretty unstable, e.g.:
- drand.love: https://probelab.io/websites/drand.love/#website-trend-hosters-drandlove
- docs.libp2p.io: https://probelab.io/websites/docs.libp2p.io/

This is a tracking issue for the resolution of the situation. Tagging @gmasgras and @cewood for the PL team and will propagate further to Fleek folks.

RFM-16 Proposal: An alternative to measuring bitswap efficiacy

RFM-16 suggests the following method for testing bitswap efficacy:

Pick a large number of random CIDs (as many as needed in order to be able to arrive to a statistically safe conclusion) and share them with all the nodes involved in the experiment.

and then:

Carry out Bitswap discovery for these CIDs.

This might very well do the trick, particularly in a closed network.

However, I'd like to suggest an alternative to consider that could work "in the wild." Given a set of peers, i.e. ipfs swarm peers, request their current wantlist, i.e. ipfs bitswap wantlist --peer={PEER_ID}. Then, poll that wantlist to see how long certain CIDs stay on the wantlist. This metric, the average lifespan of a CID on a wantlist, could be very useful towards getting a sense of the overall user experience of an IPFS node user.

It was also suggested (I believe by @guseggert) that this "average lifespan of a wantlist entry" metric could be rolled into ipfs stat

Thank you for your consideration! 🙏

Add table of all advertised stream handlers and how many peerid advertised thoses stream handlers.

I would like data like this:

Stream Handler	Seen
`/ipfs/bitswap/1.0.0`	123456
`/ipfs/id/1.0.0`	234567

RFM Proposal: Number of Client nodes across various networks and implementations

We are currently capturing the number of clients observed in the IPFS public DHT network and we report this as part of our weekly reports (currently in this repo - see example for Week 17 as well as at probelab.io: https://probelab.io/ipfsdht/#client-vs-server-node-estimate.

As per this discussion thread in Slack, this is great, but only captures part of the story, i.e., it focuses on the public IPFS DHT only, which in turn, means that it is mostly focusing on Kubo. However, IPFS is more than the kubo implementation and more than the public IPFS DHT. A request from @BigLep is to be able to "show the number of peer ids observed across various "networks" and break out by implementation".

In order to go about doing this, we'd need to identify data sources (i.e., how to collect the data) from different: i) IPFS implementations (e.g., Kubo, Helia, Iroh), and ii) networks that run IPFS nodes (e.g., the IPFS DHT, the Lotus DHT, cid.contact/IPNI, etc). We should also ideally deduplicate the PeerIDs to avoid double-counting a peer that participates in more than one network (?).

I'm starting this issue to capture first what we want to target and then come up with data collection ideas (e.g., through measurement tools, logs etc.).

cc: @BigLep @dennis-tra

Impact of peers that rotate their PeerIDs

I'm wondering what is the impact of peers that join the IPFS DHT and rotate their PeerIDs excessively. We've seen in recent reports, e.g., Week 5 Nebula Report, that there are 5 peers which rotate their PeerID 5000 times each, within the space of a week. This comes down to peers having a separate PeerID every couple of minutes. The number of rotating PeerIDs seen are roughly as many as the relatively stable nodes in the network (aka network size). The routing table of DHT peers is updated every 10mins, so the impact is likely not sticking around for longer than that, but given the excessive number of rotations, I feel that this requires a second thought.

I can see three cases where this might have an impact (although there might be more):

In the GET process, when looking for closer peers and hitting a peer that has disappeared from the network (rotated their PeerID)
In Provider Record availability, when looking for a record that has been stored with a peer that rotated their PeerID
In content availability, when the peer that advertises their PeerID has advertised some content and then is not reachable anymore.

The first case should be covered by the concurrency factor, although the large number of rotations might be causing issues. We could check the second case through the CID Hoarder - @cortze it's worth spinning up an experiment to cross-check what happens with previous results. Not sure what can be done for the third case :)

Thoughts on whether this is actually a problem or not:

It's worth checking whether those PeerIDs co-exist in parallel in the network, or whether when we see a new PeerID from the same IP address, the previous one(s) we've seen from the same IP address have disappeared. @dennis-tra do we know that already? Is there a way to check that from the Nebula logs?

Also, from @mcamou:

re: thousands of PeerIDs with the same IP, I don't think that we can completely rule out that they are different peers mainly due to NAT. On the one hand, some ISPs implement CG-NAT, where they do use a single IP for multiple customers. On the other hand, you might have large companies who have a single Internet PoP for their whole network.

Depending on how many IP's we have in this state, we might want to make a study regarding the above 2 cases (and others that we might think about). One thing to look at would be whether the same PeerID shows up consistently or whether it's a one-off.

Extra thoughts more than welcome.

Track number of client nodes in the IPFS DHT Network

Summarising several approaches from offband discussions here to have them documented.

Approach 1: kubo README file - idea initially circulated by @BigLep

Description: The kubo README file is stored and advertised by every node in the network (ipfs/kubo#9590 (comment)), regardless of whether the node is a client or a server in the beginning. The provider records for this README are becoming stale after a while, either because peers are categorised as clients (and are therefore unreachable), or because the leave the network (churn). But the records are still there until they expire. We could count the number of providers across the network for the kubo README CID and approximate the network-wide client vs server ratio.
Downside: This approach would only count kubo nodes (which is a good start and likely the vast majority of clients).

Approach 2: Honeypot - idea circulated by @dennis-tra

Description: We have:

the honeypot that tracks inbound connections/time,
the crawls that give us information about in how many routing tables the honeypot is.

Maybe we can estimate what share of queries should come across the honeypot and then estimate the total number of clients in the network, based on the number of unique clients the honeypot sees. This would be a low overhead setup and may allow better estimates with more honeypots.
Downside: The approach would need maintenance and infrastructure cost of the honeypot(s).

Approach 3: Baby-Hydras - idea circulated by @guillaumemichel

Description: Another approximation we could get is by running multiple DHT servers. Think of a few baby hydras. Each DHT server would log all peerids sending DHT requests, and get the % of client vs servers by correlating the logs with crawls results. This gives the % of clients and servers observed, we average the results of all DHT servers, and extrapolate this number to get the total number of client, given that we know the total number of servers.
Downside: The approach would need maintenance and infrastructure cost of the DHT servers/baby-hydras.

Approach 4: Bootstrapper + Nebula - info gathered by @yiannisbot

Description: We capture the total number of Unique PeerIDs through the bootstrapper. What this gives us is the "Total number of nodes that joined the network as either clients or servers". Given that we have the total number of DHT server nodes from the Nebula crawler, we can have a pretty good estimation of the number of clients that join the network. The calculation would simply be: Total number of Unique PeerIDs (seen by bootstrappers) - DHT Server PeerIDs (found by Nebula). In this case, clients will include: other non-kubo clients (whether based on the Go IPFS codebase, Iroh, etc.) and js-ipfs based ones too (nodejs, and maybe browser, although the browser ones shouldn't be talking to the bootstrappers anyway).
Downside: We rely on data from a central point - the bootstrappers.

Approach 4 seems like the easiest to get us quick results. All of the rest would be good to have to compare results and have extra data points.

Any other views, or suggested approaches?

website monitoring: add specs.ipfs.tech

I just realized from looking at https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-21/ipfs/README.md#website-monitoring that we're missing specs.ipfs.tech. Please add. This is needed as part of external monitoring of specs.ipfs.tech per ipfs/specs#418

RFM Proposal: Data on usage of libp2p circuit relay v1

In https://github.com/ipfs/interop, we still have tests running libp2p circuit relay v1, which makes sense because it has functionality that relay v2 does not; however, It has caused some issues. See

ipfs/interop#493
ipfs/interop#496
and the discussions within those PRs.

I'm wondering if we can get metrics on which relay versions are being used and how much traffic exists for each. I understand that we should be able to query the DHT for multiaddr that indicate which relay version(s) are available.

As far as which metrics would be useful, I think the following is a good start:

p99/90/50 of relay enablement:
- is relay enabled for a node?
- Which versions are they supporting?
relay usage:
- For relay traffic on the network (X), how much is v1(Y), how much is v2(Z), how much is vN? (X/Y vs X/Z vs X/N...)

Questions I want to answer with this data:

Do we need to support relayv1 in interop tests?
- Can we remove tests for relayV1 in ipfs/interop? If it's not in use anywhere, yes. If its use is under some percentage when compared to relayV2, then probably..

Please let me know if this request/issue is better suited elsewhere! Thanks.

Balance Kademlia Buckets

eta: 2023Q1

Study the impact of balancing Kademlia buckets over each bucket subkeyspace.

Associated PR: #36

Broadcast latencies in the Filecoin network

Hi,

We're working on analyzing the security of Filecoin's Consensus mechanism, which significantly relies on timing assumptions.
To define a model that best captures the reality it would be extremely helpful to know the current latencies in Filecoin's mainnet. In particular, the latencies associated with broadcasting.
Any information would be valuable! (Mean latency per sender/receiver, complete distribution of latencies, 95th percentile, etc.)

cc @yiannisbot @sa8 @jsoares

probe-lab / network-measurements Goto Github PK

network-measurements's People

Contributors

Stargazers

Watchers

Forkers

network-measurements's Issues

Context

Facts

What might be happening

Ways forward

Context

Evidence

Thoughts

Tasks

Approach 1: kubo README file - idea initially circulated by @BigLep

Approach 2: Honeypot - idea circulated by @dennis-tra

Approach 3: Baby-Hydras - idea circulated by @guillaumemichel

Approach 4: Bootstrapper + Nebula - info gathered by @yiannisbot

Recommend Projects

Recommend Topics

Recommend Org