Add a global file-based repository of interesting scientific and math

global repository for data (private cloud, maybe S3, across projects, ...) about cloud HOT 8 OPEN

sagemath commented on August 11, 2024

global repository for data (private cloud, maybe S3, across projects, ...)

from cloud.

Comments (8)

williamstein commented on August 11, 2024

One issue to think through is that NFS is a single point of failure and doesn't scale well to a potentially large number of clients... Something like you describe could be very useful for LMFDB though.

From a scalability and data management perspective it would be much nicer to just use Cassandra. Then everything is highly scalable, robust, redundant, etc. The drawback is that all data must be something you can put into the database, which is a potentially substantial hurdle. This would be basically an analogue of the Amazon S3 API, but for our cloud. If you think about scaling up to potentially a million users, a big-data database (with good docs) seems more manageable than a massive NFS share...

haraldschilly commented on August 11, 2024

What about replacing NFS with "Ceph" in that idea? I fully get your point that plain NFS will neither scale nor be reliable enough ... I have the impression Ceph could be the one for the job. (It's just not so easy to set up, but it should be doable to run without a SPOF; at least the documentation makes me believe that.)

Ceph also provides an S3 API, besides traditional filesystems. It would also open up the possibility to offer general block devices per project, etc.

And since you mentioned LMFDB: There was even a workshop where others from fields like astronomy were invited to share how they manage their data. My strong impression is, most of them will not adapt to storing this in any kind of database (or even care to post-process it) and are very much attached to a traditional file-based system.
That's also the target for their scripts and software. That's why I think this could be sort of a great selling point - and channels/embraces what the users want and do.

Cassandra is #32

williamstein commented on August 11, 2024

Ceph sounds very interesting. Keith Clawson has been using Ceph a lot lately for the new VMs (to replace boxen.math.washington.edu), so he will have input. Great idea!

tangentspace commented on August 11, 2024

Let's do it. I have Ceph running on 2 nodes and I'm expanding that cluster to 4 nodes. My main concern has been that, while it's relatively easy to use and to scale up, it's rather complicated internally, and it will take some time and experience to feel confident about operating it for a large number of users. Setting up a data repository would be a great next step because it could be as large as we want and it could get some real-world use, but it's not mission critical. Most of the data wouldn't need to be backed up if it's publicly available elsewhere, and it should be easy to back up user data using the snapshot feature.

We don't have a collection of free drives to start this right now, but there are a lot of 2 TB drives in the older cloud servers that could be used for Ceph since we're already planning on upgrading to 4 TB drives. It would be best to spread the data over as many servers as possible for the trial to maximize redundancy and to get a feel for any scaling issues. One limitation could be the 2 gigabit network connection, which can definitely become saturated when you write large amounts of data to a replicated partition or replace a failed partition. For a read-only archive we should be fine, and it could potentially be much faster than NFS because read operations can be distributed among all nodes that have copies of the data.
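To get a rough feel for the network limit mentioned above, here is a back-of-the-envelope sketch. It assumes, as a crude worst case, that every replica copy crosses the same 2 gigabit link and that the link is fully usable; the replica count of 3 and the function name `transfer_hours` are illustrative assumptions, not settings from the actual cluster.

```python
def transfer_hours(data_tb, link_gbit_per_s=2.0, replicas=1):
    """Estimate hours needed to push `data_tb` decimal terabytes through one link.

    Assumption: all replica copies share the same link (worst case),
    and protocol overhead is ignored.
    """
    bits_to_move = data_tb * 1e12 * 8 * replicas
    seconds = bits_to_move / (link_gbit_per_s * 1e9)
    return seconds / 3600.0


if __name__ == "__main__":
    # Writing 1 TB with a single copy vs. a hypothetical 3 replicas.
    print(f"1 TB, 1 copy:   {transfer_hours(1):.2f} h")
    print(f"1 TB, 3 copies: {transfer_hours(1, replicas=3):.2f} h")
    # Re-replicating the contents of one full 2 TB drive after a failure.
    print(f"2 TB rebuild:   {transfer_hours(2):.2f} h")
```

Under these assumptions a single 1 TB write takes a bit over an hour of saturated link time, and the cost grows linearly with the number of replicas, which is why bulk writes and rebuilds are the cases to watch rather than reads.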

haraldschilly commented on August 11, 2024

@kclawson that sounds great to me. We can easily just start with a small data set from an open machine learning data library. Nothing fancy, just a demo. Then we could add a few TB from the LMFDB project mentioned above, also merely as a test.

Since you work on that, what would be the best way to sort out the following?

- quotas: I've read that quotas for CephFS don't exist yet. So it would be better to create several rbd block devices and treat them independently for each set of data, right? That would make management of them (snapshots + backup, deletion, etc.) easier, too.
- permissions: one or more users want read-write access, while all the others get just read access or no access. ACLs and XFS should do this.
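A minimal sketch of the "one block device per data set" idea, assuming the image size acts as the quota and POSIX ACLs handle the read-write vs. read-only split. It only builds the command lines (nothing is executed); the pool, data set, mount point, and user names are hypothetical, and the map/mkfs/mount steps between creating the image and setting ACLs are left out.

```python
def provision_commands(pool, dataset, size_gb, mountpoint, writers, readers):
    """Return command lines (as argument lists) for one data set.

    Assumptions: `rbd create --size` takes megabytes, and the image has
    already been mapped, formatted, and mounted at `mountpoint` before
    the ACL commands are run.
    """
    image = f"{pool}/{dataset}"
    cmds = [
        # The image size is the effective quota for this data set.
        ["rbd", "create", "--size", str(size_gb * 1024), image],
        # Everyone not named in an ACL entry gets no access.
        ["chmod", "o-rwx", mountpoint],
    ]
    for user in writers:
        cmds.append(["setfacl", "-R", "-m", f"u:{user}:rwX", mountpoint])
    for user in readers:
        cmds.append(["setfacl", "-R", "-m", f"u:{user}:rX", mountpoint])
    return cmds


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in provision_commands("data", "lmfdb", 2048, "/mnt/lmfdb",
                                  writers=["alice"], readers=["bob"]):
        print(" ".join(cmd))
```

Keeping each data set in its own image means deletion, resizing, and snapshots can be handled per data set without touching the others.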

williamstein commented on August 11, 2024

We also need a backup strategy...

haraldschilly commented on August 11, 2024

Keith will probably know how best to back up. There are several possibilities; below is a list of what could be done:

Part of Ceph:
- The underlying blocks are replicated: the policy for the pool could require at least one copy at each site (data center)?
- Each of these rbd block devices allows snapshots, which can be layered on top of each other (thanks to COW). This means it's possible to roll back if the file system fails.

Outside Ceph:
- These snapshots can be cloned read-only and mounted to create a backup: e.g. let bup index the files and store the actual content, not just the raw image? (or however it's done best)
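The snapshot, clone, mount, and bup steps above could be strung together roughly as follows. This is only a sketch that builds the command lines without running them; the pool, image, snapshot, device, and mount point names are hypothetical, the device path is whatever `rbd map` reports on the backup host, and the mount options may need adjusting for the filesystem in use.

```python
def backup_commands(pool, image, snap, device, mountpoint):
    """Return the ordered command lines for one snapshot-based backup run.

    Assumption: `device` is the block device reported by `rbd map`
    for the clone; it is passed in rather than guessed.
    """
    source = f"{pool}/{image}@{snap}"
    clone = f"{pool}/{image}-backup-{snap}"
    return [
        ["rbd", "snap", "create", source],    # point-in-time snapshot
        ["rbd", "snap", "protect", source],   # required before cloning
        ["rbd", "clone", source, clone],      # copy-on-write clone
        ["rbd", "map", clone],                # exposes the clone as a device
        ["mount", "-o", "ro", device, mountpoint],
        ["bup", "index", mountpoint],         # index files, not the raw image
        ["bup", "save", "-n", image, mountpoint],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in backup_commands("data", "lmfdb", "snap1",
                               "/dev/rbd0", "/mnt/backup"):
        print(" ".join(cmd))
```

Backing up from a clone of the snapshot keeps the live image untouched while bup reads a consistent view of the files.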

haraldschilly commented on August 11, 2024

Here is another take on this: Riak.
That's a distributed, p2p, fault-tolerant/self-repairing, Apache 2 licensed key/value store. On top of it, there is S3 support: Riak Cloud Storage with different backends (e.g. LevelDB).

I haven't really looked at the details, and I also don't know enough about the actual requirements, but I think it's worth a look. For example, it mentions that it does support user accounts...

Also, there seems to be a major update to 2.0. It's maybe risky, but just for testing there are packages of this technical preview for Ubuntu, too.
