
file-hosting-service's Introduction

File Hosting Service

Introduction

File Hosting Service (FHS) is a marketplace for sharing file data and is part of The Graph Network's World of Data Services.

FHS is a decentralized, peer-to-peer data sharing platform designed for efficient, trust-minimised, payments-enabled file sharing. It leverages a combination of technologies: hash commitments on IPFS for file discovery and verification, chunked data transfer with micropayments to reduce trust requirements between clients and servers, and secure, efficient data transfer over HTTP2. The system is built with scalability, performance, integrity, and security in mind, aiming to create a robust market for file sharing.

Target Audience

This documentation is tailored for individuals with a basic understanding of decentralized technologies, peer-to-peer networks, and cryptographic principles. Whether you are an indexer running various blockchain nodes looking to share and verify your data, an indexer looking to launch a service for a new chain, or simply a user interested in the world of decentralized file sharing, this guide aims to provide a clear and comprehensive understanding of how the File Hosting Service operates.

Features

  • Decentralized File Sharing: FHS uses direct connections for file transfers, eliminating central points of failure.
  • IPFS Integration: Employs IPFS for efficient and reliable file discovery and content verification.
  • SHA2-256 Hashing: Ensures data integrity through robust, incremental cryptographic hashing.
  • HTTP2 and TLS: Leverages modern web protocols for secure and efficient data transfer.

To be supported:

  • Micropayments Support: Implement a system of micropayments to facilitate fair compensation and reduce trust requirements.
  • Scalability and Performance: Designed with a focus on handling large volumes of data and high user traffic.
  • User-Friendly Interface: Intuitive design for easy navigation and operation.

More details can be found in Feature Checklist

Upgrading

The project will follow conventional semantic versioning, as specified here. The server will expose an endpoint for package versioning to ensure correct versions are used during exchanges.

Background Resources

You can find background information on the various components of the exchange:

  1. Cryptography: SHA2-256 Generic guide, Hashed Data Structure slides

  2. Networking: HTTPS with SSL/TLS.

  3. Specifications: IPFS file storage, retrieval, and content addressing.

  4. Blockchain: World of data services, flatfiles for Ethereum, use case.

Documentation

Quickstarts and Configuring

Contributing

We welcome and appreciate your contributions! Please see the Contributor Guide, Code Of Conduct and Security Notes for this repository.

file-hosting-service's People

Contributors

chriswessels, cjorge-graphops, hopeyen


file-hosting-service's Issues

Clear and succinct Presentation

Prepare a presentation to share among Core developers.
Describe protocol at a high level (how it works, requirements, user personas, launch plans), get feedback and ask burning questions

Refactor: Custom service error

  • Generalize errors in subfile_exchange
  • make a SubfileExchangeError enum to replace the current anyhow::Error
  • Document potential causes/fixes in an Errors.md file

Building subfiles

For files to become subfiles we need

  • FileHasher
    • Use SHA2-256 as it is more commonly used and faster than SHA3-256; neither has known practical attacks (and switching should be easy)
    • Read file metadata (check whether metadata is included in a flatfile; use a flatfile decoder?)
    • Chunk files to a fixed size
    • Hash each chunk to build a Merkle tree
    • Construct a chunk_file containing the hashes and metadata, publishable to IPFS
  • Subfile builder / publisher - CLI service create [subfile_name] [file_path]
    • Take a file, use FileHasher to get an IPFS hash for the file source
      • Later, take a list of files, use FileHasher to hash all files and get IPFS hashes
    • Construct a subfile manifest with metainfo using the YAML builder
    • May include a status endpoint for the "canonical publisher" for easy access, though the endpoint may change later on
    • Publish the subfile to IPFS and receive an IPFS hash for the subfile
      • No on-chain activities for now
  • IPFS client
    • Connect to an IPFS gateway
    • Post files
    • Cat files
  • YAML parser and builder
    • Deserialize and serialize yaml files
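The FileHasher flow above (chunk a file, hash each chunk, fold the hashes into a Merkle root) could be sketched roughly as below. This is a minimal illustration, not the actual implementation: it uses std's DefaultHasher as a stand-in for SHA2-256 (the real code would use something like the sha2 crate), and all names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Stand-in for SHA2-256; the real implementation would use the `sha2` crate.
fn hash_chunk(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(chunk);
    h.finish()
}

// Split a file's bytes into fixed-size chunks and hash each one.
fn chunk_hashes(data: &[u8], chunk_size: usize) -> Vec<u64> {
    data.chunks(chunk_size).map(hash_chunk).collect()
}

// Fold pairs of hashes level by level into a single root (assumes >= 1 chunk).
fn merkle_root(mut level: Vec<u64>) -> u64 {
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                let mut h = DefaultHasher::new();
                for x in pair {
                    h.write_u64(*x);
                }
                h.finish()
            })
            .collect();
    }
    level[0]
}

fn main() {
    let data = vec![0u8; 10_000];
    let hashes = chunk_hashes(&data, 1024); // 10 chunks: 9 full + 1 partial
    assert_eq!(hashes.len(), 10);
    let root = merkle_root(hashes);
    println!("merkle root: {root:x}");
}
```

The chunk hashes plus the root are what the chunk_file would publish to IPFS, so a downloader can verify each chunk independently.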

Feat: Docs update

Slim down the documentation: succinct, up-to-date, and accurate explanations

Feat: Downloader send requests based on missing chunks

Currently, the downloader loops through chunk_files and chunk indices, attempting at most max_retry requests for a single chunk. This is a trivial way to handle failure, but it cannot guarantee a full file download once max_retry is exceeded.

New approach may be

  1. At the start of a download, resolve a map from each file to a Vec of all chunk indices in range. This is partially implemented in 8bd5ddf
  2. When a chunk has been successfully downloaded, remove the chunk index.
  3. Instead of looping through i in 0..(chunk_file.total_bytes / chunk_size + 1), send requests based on the entries in Vec<index>.
  4. A file is considered complete when the Vec is empty. When the Vec is not empty, continuously send requests until there are no available indexer endpoints.
  5. max_retry currently applies to all errors from any request on (file, chunk, indexer_endpoint); after it is exceeded, the indexer_endpoint for the subfile is added to the blocklist and not used again. Since the client only runs with one target subfile at a time, the indexer will not be used for other files. Update this to block an indexer immediately when verification fails, while allowing retries for other errors (likely timeouts).
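Steps 1, 2, and 4 above can be sketched with a plain HashMap; the names are illustrative, not the actual subfile_exchange API.

```rust
use std::collections::HashMap;

// Map each file hash to the chunk indices still missing (step 1).
fn init_missing(files: &[(&str, usize)]) -> HashMap<String, Vec<usize>> {
    files
        .iter()
        .map(|(hash, n)| (hash.to_string(), (0..*n).collect()))
        .collect()
}

// Remove a chunk index once it verifies (step 2); returns true when the
// file's Vec drains and the file is complete (step 4).
fn mark_downloaded(missing: &mut HashMap<String, Vec<usize>>, file: &str, idx: usize) -> bool {
    let complete = match missing.get_mut(file) {
        Some(v) => {
            v.retain(|&i| i != idx);
            v.is_empty()
        }
        None => false,
    };
    if complete {
        missing.remove(file); // nothing left to request for this file
    }
    complete
}

fn main() {
    let mut missing = init_missing(&[("Qm_file_x", 3)]);
    assert!(!mark_downloaded(&mut missing, "Qm_file_x", 0));
    assert!(!mark_downloaded(&mut missing, "Qm_file_x", 2));
    assert!(mark_downloaded(&mut missing, "Qm_file_x", 1)); // last chunk completes the file
    assert!(missing.is_empty());
}
```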

Tracking: PoC checklist

This issue should track all the items needed for a proof of concept. Aim to finish by Dec 1st, basic testing done by Dec 20.

Start with firehose flatfiles; data verifiability is guaranteed by the OVC and the files decoder

More updated details can be found in README.md under the PoC checklist section.

General

Minimal components

  • Standardise subfile.yaml manifest formats (in subfile_manifest.md)
  • Basic indexer selection algorithm done in the client CLI (data availability checked with each indexer endpoint)
  • Draft out GRC
  • Basic documentation, usage guides, and architectural diagrams

Next steps

  • verifiable TAP payments on data chunks
  • verifiable committed blockchain data
  • verifiable staking for computed blockchain data
  • More intelligent indexer selection algorithms - Economics research
  • Subfile data service enabled on the staking contract

Publisher/Provider

We can expect that the provider will use a CLI to interact with the continuous service. They can create a subfile service (as a deployment unit) on-chain and/or start serving the file off-chain. We assume that the file is accessible by the continuous service, and not necessarily by the CLI.

Minimal components

  • CLI: Create a subfile and publish to IPFS
  • CLI: Start serving a particular subfile by IPFS hash
  • Service: Receive request from CLI command at admin API and do corresponding actions
  • Service: host a file availability endpoint (/status)
  • Service: start with free service for a single file (/subfiles/:id)
    • continuously host the subfile
    • require a free_query_auth_token for service
    • take partial download request, verify validity of the request (valid range)
    • grab the chunk by the range and respond
    • support hosting multiple subfiles
  • CLI: Delete subfile from service

Next steps

  • Paid query flow: parse, validate, store, and redeem receipts with TAP
  • cost models: store a db of cost models for each torrent file according to chunk sizes, serve a dedicated route on subfile-service

Client

Clearly state the limitation of our approach: while payment is minimized to one chunk at a time, we require 1-of-n (provider) trust, as there is no guarantee that a subfile can be completely downloaded if no indexer serves the target files.

Minimal components

Assume the client is capable of identifying the correct IPFS hash and maintaining a budget balance on-chain

  • CLI takes in request (ipfs_hash)
    • Indexer selection - This may live somewhere else
      • Start with a static indexer status endpoint, with a free query token
      • Ping indexer endpoints for availability
      • Construct and send requests (must be parallelizable) to indexer endpoints
    • Wait for the responses (for now, assume that the response chunks correspond with the verifiable chunks)
      • Keep track of the downloaded and missing pieces
      • Make multiple attempts to request missing pieces
      • Upon receiving a response, verify the chunk data against the chunk_file
        • If verification fails, blacklist the indexer
      • Once all chunks for a file have been received, verify the file in the subfile (should be vacuously true)
    • Once all files have been received and verified, terminate the client
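The failure handling sketched in the list above (blacklist an indexer on verification failure, retry other errors up to max_retry) might look roughly like this; the types and names are illustrative stand-ins, not the real client's API.

```rust
use std::collections::HashSet;

#[derive(Debug)]
enum ChunkError {
    VerificationFailed, // chunk hash did not match the chunk_file
    Transient,          // e.g. a timeout; worth retrying
}

// Returns true if the same indexer should be retried for this chunk.
fn handle_failure(
    indexer: &str,
    err: ChunkError,
    retries: &mut u32,
    max_retry: u32,
    blacklist: &mut HashSet<String>,
) -> bool {
    match err {
        // Bad data is never worth retrying: blacklist immediately.
        ChunkError::VerificationFailed => {
            blacklist.insert(indexer.to_string());
            false
        }
        // Transient errors get up to max_retry attempts before blocking.
        ChunkError::Transient => {
            *retries += 1;
            if *retries > max_retry {
                blacklist.insert(indexer.to_string());
                false
            } else {
                true
            }
        }
    }
}

fn main() {
    let mut blacklist = HashSet::new();
    let mut retries = 0;
    assert!(handle_failure("indexer_a", ChunkError::Transient, &mut retries, 2, &mut blacklist));
    assert!(!handle_failure("indexer_b", ChunkError::VerificationFailed, &mut retries, 2, &mut blacklist));
    assert!(blacklist.contains("indexer_b"));
}
```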

Next steps

  • client: expose an endpoint for showing download progress
  • CLI: stop download by subfile hash
  • Payment: Read subfile manifest and construct receipts using budget and chunk sizes
  • Payment: build and send TAP receipts

Testing and Documentation

  • Conduct basic testing to ensure functionality, reliability, and security
  • Benchmark performance-sensitive functions
  • Create foundational documentation outlining the usage, architecture, and known limitations.

Goals of MVP

  • Validate the feasibility and utility of the data exchange service.
  • Gather early feedback from users and identify areas for improvement.
  • Identify unforeseen challenges or limitations that arise during development.

Scope outside of MVP

  • Advanced dispute resolution mechanisms: this may live on Horizon
  • Optimal service and price matching algorithms: conduct Economics research on information markets, building on top of this paper
  • Formal verifications on data validation and integrity checks.
  • Detailed and polished user interface.

Post-MVP Developments

Once the MVP is successfully developed, tested, and validated, subsequent iterations would focus on

  • refining the existing features,
  • adding the excluded advanced features,
  • optimizing performance,
  • enhancing security,
  • and improving user experience based on feedback and requirements.

Spike: Generalize storage paths

Currently all the file paths are declared in a local path setting, but the reality is many users are using cloud storage instead of local stores.

While each individual user may specify their relative paths locally, it is important to also allow them to specify a cloud storage path.

Look into ways to support accessing files in both local and cloud storage:

  • Rust crate object_store; readme
  • Go library dstore, written and used by StreamingFast

Identify subfile data types regarding object storage versus file storage

Refactor: Split into smaller crates

Problem statement

Everything sits in one crate at the moment. As functionality and complexity grow, it might make sense to break the single crate down into separate parts.

Expectation proposal

Specify a workspace structure, such as subfile-common, subfile-service, and subfile-cli

Repo checklist

Before open sourcing

  • README.md
  • Has CONTRIBUTORS.md guide
  • Has SECURITY.md
  • Has Pull Request Template
  • Has CODE_OF_CONDUCT.md
  • License (any public work must be licensed Apache 2)
  • CD to automatically build and push any artifacts (Docker Images, Helm Charts, binary blobs, etc)
  • Has automated release/changelog generation
  • Has commit hooks that enforce (via husky)
  • Has been added to admin repo public list if public
  • Has a description and tags
  • The repo has decent documentation for users

Feat: Server TUI

TUI crate choice: crossterm

Allow server operators to manage subfile services in the terminal, similar to managing them through the admin API

Feat: failure mode - Downloader switch indexer endpoint after max_retry

The base unit of failing to download a subfile is failing to download a chunk.

Given an indexer query endpoint and a chunk byte range, the downloader currently retries a configured max_retry number of times before adding the endpoint to a blocklist.

The downloader currently tracks a HashMap of files and the chunk indices yet to be downloaded.

After trying to download from the same indexer multiple times and still having missing pieces, the downloader should use the HashMap of missing pieces and switch to a new indexer endpoint.

Feat: Basic Receipt fee

Problem statement

The previous receipt construction used a placeholder constant of 1 for the fee values. Update it to be more realistic and to satisfy the indexer's price requirements.

Do this after porting the indexer service, such that the indexer serves some type of cost model (not standardised in indexer-rs; it depends on how we set up the price/cost schema).

Expectation proposal

Basic calculations

  1. Estimate a price from the available indexer endpoints: get each indexer's individual price posting from /cost. The simplest definition is for indexer $i$ to return a price per byte $p_i$; build a map of pricing <Indexer, Price>.
  2. When making a chunk query request, include $fee = p_i * chunk\_size$.
  3. On the server side, the fee value should be checked against the posted price; an indexer should not accept values lower than its posted price. (I'm not sure where this check lives in the existing indexer software, or if it is checked at all.)
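The pricing steps above could be sketched as follows; the indexer names, integer units, and map layout are assumptions for illustration, not the real schema.

```rust
use std::collections::HashMap;

// Build the <Indexer, Price> map from each indexer's /cost posting (step 1).
fn price_map(postings: &[(&str, u128)]) -> HashMap<String, u128> {
    postings.iter().map(|(i, p)| (i.to_string(), *p)).collect()
}

// fee = p_i * chunk_size, attached to each chunk request (step 2).
fn chunk_fee(prices: &HashMap<String, u128>, indexer: &str, chunk_size: u128) -> Option<u128> {
    prices.get(indexer).map(|p| p * chunk_size)
}

fn main() {
    let prices = price_map(&[("indexer_1", 2), ("indexer_2", 3)]);
    assert_eq!(chunk_fee(&prices, "indexer_1", 1024), Some(2048));
    assert_eq!(chunk_fee(&prices, "unknown", 1024), None); // no posting, no fee estimate
}
```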

Feat: Client wallet connection and dumb TAP payment

  • To pay for file exchanges, the client must also pass in a mnemonic/private key for connecting to their wallet.
  • The client is in charge of approving GRT spending and Escrow contract deposits
  • The client will construct a TAP receipt for each chunk request if the on-chain setup is correct; otherwise, a free query token is required
  • Refactor the query path to make either paid or free queries

Feat: Service allocation management

Add a CLI for the server for allocation management

  • send allocate tx to open an allocation with some tokens to an IPFS hash
  • send unallocate to close an allocation against the IPFS hash with 0x0 POI
    We are not considering indexing rewards, so always close with 0x0, and no need to consider expiring allocation lifetime.

Feat: GraphQL API service

Problem statement

To align with indexer-service, some routes should use a GraphQL API instead of a RESTful API.

  • graphql query for cost
  • graphql query for status
  • ~~graphql query and mutation for authenticated admin~~

Expectation proposal

  • set up graphql playground
  • provide output types for SubfileManifest, FileMetaInfo, ChunkFile, CostModel, ...
  • add input parameters for specific queries or filtering
  • Query and Mutation objects

Alternative considerations
Easier to do after porting to indexer-rs

Feat: File Discovery and matching across datasets

Indexers serve a /status endpoint that shows the subfile IPFS hashes they host. This is sufficient for matching at the subfile level, but there is no matching for specific files.

It is necessary to allow matching across subfiles for a specific file, so that servers can more freely select which subfile IPFS hashes to serve without affecting actual file availability.

Imagine a server serving $subfile_a = \{file_x, file_y, file_z\}$ and a client requesting $subfile_b = \{file_x\}$. The current check will not match $subfile_a$ with $subfile_b$. We add an additional check (run server-side, client-side, or by a third party) that resolves $subfile_a$ and $subfile_b$ into sets of files for matching.

 // done in the current check_availability
  1. read status from indexer_endpoints for serving_list
  2. if target_subfile is in one of the serving_list, return indexer_endpoint as available

pub fn file_availability(indexer_endpoints, target_subfile) {
  1. resolve target_subfile's vector of FileMetaInfo for all contained chunk files, each represented by a file hash
  2. resolve each subfile in serving_lists to get a vec of FileMetaInfo
  3. for each target file, check if any serving subfile's FileMetaInfo contains it; record the serving indexer_endpoint
  4. if any target file is not served by any indexer's subfile, immediately return unavailability, as the target subfile cannot be completed
  5. return a map of file hash to (serving indexer_endpoint, serving subfile)
}

When the client constructs a range download request, it should construct the request for the corresponding indexer_endpoint, served subfile, and file hash.
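The file_availability pseudocode above boils down to a set-membership check per target file. A minimal sketch, with plain string sets standing in for FileMetaInfo and the endpoint structs:

```rust
use std::collections::{HashMap, HashSet};

// For every file in the target subfile, record which endpoints can serve it.
// Returns None as soon as any target file has no server, since the target
// subfile can then never be completed.
fn file_availability(
    serving: &[(&str, HashSet<&str>)], // (indexer_endpoint, files in its served subfile)
    target_files: &HashSet<&str>,
) -> Option<HashMap<String, Vec<String>>> {
    let mut map = HashMap::new();
    for file in target_files {
        let endpoints: Vec<String> = serving
            .iter()
            .filter(|(_, files)| files.contains(file))
            .map(|(endpoint, _)| endpoint.to_string())
            .collect();
        if endpoints.is_empty() {
            return None; // immediately report unavailability
        }
        map.insert(file.to_string(), endpoints);
    }
    Some(map)
}

fn main() {
    let subfile_a: HashSet<&str> = ["file_x", "file_y", "file_z"].into();
    let subfile_b: HashSet<&str> = ["file_x"].into();
    // subfile_a covers subfile_b even though the subfile hashes differ.
    let result = file_availability(&[("http://indexer_a", subfile_a)], &subfile_b);
    assert!(result.is_some());
}
```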

Future consideration

Consequently, it may make sense to simplify the routing path subfiles/id/:subfile_hash with a header for file_hash down to the path files/id/:file_hash, but this means the server does not have to opt into a specific subfile. Consider whether this makes sense from a server perspective, or add an additional configuration option.

Http services

A subfile server should

  • Initialize service; for one subfile, take (ipfs_hash, local_path)
    • Take a subfile IPFS hash and get the file using IPFS client
    • Parse the YAML file for all the chunk_file hashes using the YAML parser and construct the subfile object
      • Take the metainfo of each chunk_file and check that the file is accessible at the local_path
      • Verify that the local version satisfies the chunk hashes
    • Add the files to the service availability endpoint
  • Upon receiving a service request (ipfs_hash, range, receipt)
    • Check if ipfs_hash is available
    • Check if range is valid against the subfile and the specific chunk_file
    • Validate and store the receipt
    • Read in the requested chunk
    • Construct response and respond (determine if streaming is necessary)
  • Start with a free service that requires a free query auth token

Perf: Limit concurrency for parallel requests

A file may contain a large number of chunks.

To prevent overwhelming system resources, use tokio::sync::Semaphore to limit concurrency. A semaphore maintains a set of permits, and a task must acquire a permit from the semaphore before proceeding.

// Bring the semaphore into scope
use std::sync::Arc;
use tokio::sync::Semaphore;

// Declare the number of permits
let semaphore = Arc::new(Semaphore::new(max_concurrent_tasks));
// Acquire a permit before the task starts (waits if none are available)
let permit = semaphore.clone().acquire_owned().await.expect("Failed to acquire semaphore permit");
// ... perform the chunk request ...
// Release the permit when the task finishes
drop(permit);

Feat: graphQL client

The GraphQL client queries the network subgraph to get allocation IDs and the corresponding indexer and deployment hashes.

  • also read registered indexers
    Optionally, add a client for the Escrow subgraph to check available balances, but everything important should already be handled

Feat: Server token management

All served files currently use the same auth token, and it makes sense to use something more sophisticated. However, this would take attention away from payments, so it is low priority at the moment.

We could potentially add token management that is mutable and specific to subfiles and clients

Segregated admin service

After porting to indexer-rs, the admin endpoint has been temporarily disabled due to indexer-rs trait constraints.

For security, separately set up an admin server with a different port and endpoints to manage bundles and cost models (perhaps also allocations?).

Perf: Benchmark performance-sensitive functions

Basic functions to benchmark

  • read a range from a local file
  • read chunk file in local path
  • generate chunk file from a file
  • verify chunk bytes
  • verify a subfile in local path

using criterion

Deploy on cluster and test workflows

Expectation proposal

Support deployment on our cluster and test free query workflows.
The file server should

  • be plugged into an S3 bucket
  • use the CLI to publish files and bundles and add them to the file server
  • expose the service port
  • set a free query auth token

Test with a separate client:

  • query the status of the file server
  • download some bundles with the free query auth token

Tracking: MVP checklist

This issue should track all the items needed for a minimum viable product.

More updated details can be found in Feature_checklist.md.

General

  • Standardise subfile/chunk_file manifest formats #2
  • Draft out GRC #25
  • Better naming #18

User experience

  • Dataset Discovery on a marketplace
    • File Discovery and matching across datasets #19
      • Search Functionality
    • Dataset Listing from server
  • Matching Algorithm #21
    • User Preference Analysis (price per byte, response rate)
    • Transaction History Utilization

Subfile transfer

  • work with Horizon for data service interfaces
  • Server
    • port into indexer-service framework (should take care of TAP receipt handling)
    • add cost model scheme, allow updates for pricing per byte #20
  • Use generic path to be compatible with cloud storages #15

Subfile Client

  • Verifiable Payment #21
    • Take private key/mnemonic for wallet connections
    • take budget for the overall subfile
      • construct receipts using budget and chunk sizes
      • add receipt to request
  • Parallelize requests #16
  • Multiple connections (HTTPS over HTTP2)
  • Continually requesting missing pieces until the complete file is obtained #22

Testing and Deployment

  • Service custom errors #23
  • Track metrics #24
  • Track code test coverage
  • Deployment Planning
    • Add deployment options (docker + binary)
    • Internal Infrastructure Support
    • Reach out to key users for early feedbacks
  • Documentation and Support
    • Support System Setup (Channel)
    • Tutorials, FAQs

Beyond MVP scope

  • Multiple hashing options/schemes; taking file sizes and the number of files into account, analyze the performance and memory of using a Merkle tree vs a hash list

Standardise subfile.yaml manifest formats

Acceptance criteria:

  • Reference subgraph manifest structures
  • Take torrent file structures into account, create manifest for subfiles
  • Clear documentation
  • Implementation
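As one possible shape for such a manifest, the sketch below shows a subfile.yaml referencing per-file chunk_files plus metainfo. Every field name and value here is hypothetical, offered only as a discussion starting point; the actual format is what this issue should standardise.

```yaml
# Hypothetical subfile.yaml sketch - all fields are illustrative, not final
version: 0.0.1
description: "Ethereum firehose flatfiles, blocks 0-1000"
file_type: flatfiles
chunk_size: 1048576        # bytes per chunk used when hashing
block_range:
  start_block: 0
  end_block: 1000
files:
  - name: "0000000000.dbin"
    chunk_file: "QmChunkFileHashA"   # IPFS hash of this file's chunk_file
  - name: "0000000100.dbin"
    chunk_file: "QmChunkFileHashB"
```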

Test: basic unit & e2e tests

Unit tests

  • File verification by chunks
  • Subfile verification
  • Chunk file generation
  • Publishing

E2E test

  • Basic scenario
    1. Initialize the server and downloader
    2. Server serves a subfile
    3. Client requests a download
    4. Check the download result

Feat: Server cost model

To add price matching in subfile exchanges,

  1. Start with adding a CLI config --price-per-byte to ServerArgs.
    The price is not stored in a database or subfile-specific at the moment. There may be cases where the bytes in one subfile are more valuable than another's; then we can consider more complex management.

  2. Add a /cost endpoint that returns the price per byte for client/third-party price matching

  3. When receiving paid queries, parse for receipts and condition on the receipt value where $\text{price per byte} * \text{bytes range} \leq \text{receipt value}$

  4. Add an admin/cost endpoint for adjusting price_per_byte on the fly, taking a method set_price and a param price_per_byte to replace the previous price.

Additionally, consider whether the indexer framework can take care of cost models
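Step 3's acceptance condition is a one-line check; a minimal sketch, with hypothetical names and integer units for simplicity:

```rust
// Accept a paid range query only if the receipt covers
// price_per_byte * number of bytes requested.
fn accept_receipt(price_per_byte: u128, range: (u64, u64), receipt_value: u128) -> bool {
    let bytes = (range.1 - range.0 + 1) as u128; // inclusive byte range
    price_per_byte * bytes <= receipt_value
}

fn main() {
    assert!(accept_receipt(2, (0, 1023), 2048));  // exactly covers 1024 bytes at 2 per byte
    assert!(!accept_receipt(2, (0, 1023), 2047)); // underpays by 1
}
```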

Feat: Automatic escrow deposits

Problem statement

Without the gateway, we should take the payment UX into careful consideration. There is a cost-availability tradeoff between depositing to multiple indexers versus a minimal number of indexers (just 1, for instance).

Expectation proposal

  1. Fetch cost from each available indexer.
  2. Price range: $$\text{average pricing} = \frac{1}{n}\sum_{indexer_i} p_i = p_{avg}$$
    $$\text{max} = \max_{indexer_i} p_i = p_{max} \space \space \space, \space \space \space \space \text{min} = \min_{indexer_i} p_i = p_{min} $$
  3. Estimate the balance required for the target. Suggest a balance of $B = \text{total bytes} * p_{avg}$; the absolute minimum is $B_{min} = \text{total bytes} * p_{min}$.
  4. Check for available balance - warn if the escrow allowance is not enough ($B > allowance$); do not proceed if $B_{min} > allowance$.
  5. Automatic deposit for the target file/bundle. Take a user config num_download_channels ($n$) as the number of indexers to deposit towards. Order indexers by their cost and select the cheapest $n$ (later, use better indexer selection, such as preference for latency or geo location). For $indexer_i$, deposit $\frac{\text{total bytes}}{n}*p_i$. If a download path becomes unavailable, consider withdrawing tokens and depositing new tokens to other available indexers.

Consider adding e2e testing for deposit and redeem; could involve indexer-agent for redeem, or add automatic redeem call after unallocate
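The budget arithmetic in steps 2-5 can be sketched as follows; prices and byte counts are illustrative, and integer math is used for simplicity.

```rust
// Recommended budget B = total_bytes * p_avg and floor B_min = total_bytes * p_min.
fn budget(total_bytes: u128, prices: &[u128]) -> (u128, u128) {
    let avg = prices.iter().sum::<u128>() / prices.len() as u128;
    let min = *prices.iter().min().unwrap();
    (total_bytes * avg, total_bytes * min)
}

// Deposit (total_bytes / n) * p_i toward each of the n cheapest indexers.
fn deposits(total_bytes: u128, mut prices: Vec<u128>, n: usize) -> Vec<u128> {
    prices.sort(); // cheapest first
    prices
        .iter()
        .take(n)
        .map(|p| total_bytes / n as u128 * p)
        .collect()
}

fn main() {
    let (b, b_min) = budget(1_000, &[2, 4, 6]);
    assert_eq!((b, b_min), (4_000, 2_000)); // p_avg = 4, p_min = 2
    // 500 bytes each toward the two cheapest indexers (prices 2 and 4).
    assert_eq!(deposits(1_000, vec![6, 2, 4], 2), vec![1_000, 2_000]);
}
```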

Feat: Resume download

When a download is stopped midway, relaunching the downloader client should resume the download by chunks

Options:

  • add storage of some metadata (remaining target chunks)
  • chunk hash the target subfile and identify missing chunks

Refactor: update config names

service

  • bundle -> initial_bundles (make it clear the set of bundles served can be managed through the admin endpoint)
  • price_per_byte -> default_pricing (make it clear the pricing is per byte, can be updated through admin, is unit of GRT)

Feat: Metrics for the server

Tracking metrics helps with managing data distribution and file handling by measuring performance and efficiency

  • Response Time: Average response time for file requests.
  • Throughput: Number of requests handled per unit of time (e.g., requests per second).
  • Error Rate: Percentage of requests resulting in errors.
  • Data Transfer Efficiency: Amount of data successfully transferred versus requested.
  • Uptime and Availability: Percentage of time the server is operational and accessible.
  • Request Distribution: Distribution of requests across different files.

Consider client-side satisfaction metrics: perceived latency, download speed, and timeouts.

Feat: deployment specific payment management in admin

Problem statement

When file-service was integrated with the indexer framework, the cost mutation was deleted; we should add it back and allow for better configuration.

Expectation proposal

  • Server tracks a map of manifest hash to prices.
  • Add mutation functions to admin endpoint to update price per byte for a deployment: set_price(deployment, price), remove_price(deployment)
  • Update the query functions such that, if specific manifests are queried (costModel(hash: ...) or costModels(hashes: ...)), they find the specific pricing or use the default fallback. If all manifests are queried (costModels()), only return the ones with specific pricing.

Alternative considerations
Later explore storing the prices for future sessions
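The per-deployment pricing with a default fallback might look roughly like this sketch; the names mirror the proposed set_price/remove_price mutations but are otherwise hypothetical.

```rust
use std::collections::HashMap;

// Map of manifest (deployment) hash to price, with a default fallback.
struct Pricing {
    default_price: u128,
    per_deployment: HashMap<String, u128>,
}

impl Pricing {
    // Admin mutation: set a deployment-specific price.
    fn set_price(&mut self, deployment: &str, price: u128) {
        self.per_deployment.insert(deployment.to_string(), price);
    }
    // Admin mutation: drop the specific price, reverting to the default.
    fn remove_price(&mut self, deployment: &str) {
        self.per_deployment.remove(deployment);
    }
    // Query: specific pricing if set, otherwise the default fallback.
    fn cost_model(&self, deployment: &str) -> u128 {
        *self.per_deployment.get(deployment).unwrap_or(&self.default_price)
    }
}

fn main() {
    let mut pricing = Pricing { default_price: 1, per_deployment: HashMap::new() };
    pricing.set_price("QmDeployment", 5);
    assert_eq!(pricing.cost_model("QmDeployment"), 5);
    pricing.remove_price("QmDeployment");
    assert_eq!(pricing.cost_model("QmDeployment"), 1); // falls back to the default
}
```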

Feat: Server admin API with admin token

Server API on file management

  • Add Subfile at /admin/subfiles/add: Add a subfile to the subfiles hashmap, with parameter Subfile ipfs hash and server accessible path.
  • Delete Subfile at /admin/subfiles/delete/{subfile_id}: Removes a subfile from the subfiles hashmap.
  • Subfile Statistics at /admin/subfiles/stats: Provides statistical data about the subfiles (e.g., query count, size distribution).

Optionally, require an admin token, configured at server start-up

Perf: Client parallelize requests for one file at a time

The client already makes parallel requests for indexer status; make further obvious client-side improvements

  • Optimize the nested loop for calculating chunk ranges and making requests
  • Open the file once, write multiple times
  • Make several requests in parallel

Feat: Direct file level discovery and matching

Problem statement

There may be use cases where users want to request a transfer for a specific file instead of a set of files. While we can wrap a single file to make it a singleton set, it may be easier from the user's perspective to request a single file directly.

Expectation proposal

  • Add a new discovery method matching for a single file
  • hosting method for a single file
  • potentially add metadata fields to file schema

Alternative considerations
Update docs

Tracking: on-chain transactions

Problem statement

On-chain components are missing. The original plan was to wait for protocol v2, but we might as well explore options to make it compatible with v1.

Expectation proposal

Options to be compatible with protocol v1

  • Allocation based on subfiles
  • Allocation based on an IPFS file to an indexer url

Create a Transaction manager entity that

  • Allows the Publisher to deploy an IPFS hash on-chain
    • Server registers its URL
  • Add a CLI for the server for allocation management #36
    • send an allocate transaction against the IPFS hash -> get allocation_id
    • send a close transaction against the IPFS hash with 0x0 POI
  • Service payment collection
    • requires moving the service to the indexer-rs framework #30
    • send a redeem tx to the Escrow contracts (this should be handled by indexer-agent, graphprotocol/indexer#831)
  • Client wallet query payments #21
    • send deposit transactions - identify indexers as receivers
    • Receipt Signer
    • send receipts in query requests (chunking with respect to package sizes and budget)

Renaming

Problem statement

General File Service? Or something a bit more specific, like File Sharing Service?

Expectation proposal

Rename to remove or replace Hosting

Additional context
"Hosting" isn't accurate for this data service

Feat: Downloader progress bar

Add a progress bar TUI for the downloader

Options

  • a single bar showing the number of downloaded files / total files
  • a single bar showing downloaded bytes / total bytes in the subfile
  • multi-bar: one bar per file, showing downloaded bytes / total bytes for that file
  • multi-bar: primary bar for downloaded files / total files, secondary bar for downloaded bytes / total bytes in the file currently downloading

Potential crate: indicatif

Feat: Subfile finder

Problem statement

The subfile client is currently responsible for making discoveries, but as indexer selection algorithms grow, it makes sense to have a separate entity that handles discovery, finding, matching, etc.

Expectation proposal

Refactor the current discovery into its own struct, and have client call the struct for all things related to finding a query endpoint.

Add sufficient testing for discovery to alleviate future manual testing workload

Consumer client

We assume that the consumer runs a CLI where the download should happen, and we leave decisions about what happens after the data has been downloaded to the consumer.

The CLI should handle

  • Request (ipfs_hash, budget) from the chain after reading the subfile manifest
    • Basic client-side indexer-selection without payments
      • Read subfile manifest and construct range queries
      • Ping indexer endpoints for availability of the requested subfile
      • Construct and send requests (may be parallel) to indexer endpoints
  • Wait for the responses (for now, assume that the response chunks correspond with the verifiable chunks)
    • Keep track of the downloaded and missing pieces
    • Attempt to download each chunk multiple times - max_retry
    • Upon receiving a response, verify the chunk data against the chunk_file
      • If verification fails, blacklist the indexer
    • Once all chunks for a file have been received, verify the file in the subfile (vacuously true)
  • Once all files have been received and verified, terminate
  • Notify the client (just some logs)
