
Comments (8)

kotlarmilos avatar kotlarmilos commented on June 16, 2024

It seems that ArangoDB has an issue with disk space allocation. Try increasing the disk space. We will also release a set of features that addresses memory optimization issues.

from ot-node.

calr0x avatar calr0x commented on June 16, 2024

The disk is currently growing at about 2 GB a week, and it is imperative that the cause be addressed. Early xDai node runners are already unable to make backups on 40 GB drives due to space constraints. At the current job rate, stepping up to a larger disk means they lose money running their nodes, as that step-up is pricey.

Several of my projects have addressed ways to keep the vps size the same and still function:

DockSucker: Removes the docker container entirely.

Smoothbrain: Uses cold Arango backups instead of arangodump. I simply stop the node/arango and upload the raw data directory to S3. The nodes no longer have the room to store a second copy.
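For anyone wanting to replicate the cold-backup approach, here is a minimal sketch. The service names, data directory, and bucket are all assumptions (adjust to your setup); by default it only prints the commands so you can review them before running for real:

```shell
#!/bin/sh
# Cold-backup sketch: stop the node and ArangoDB, archive the raw data
# directory, upload it, restart. Paths, service names and the bucket
# below are assumptions -- check them against your own installation.
ARANGO_DATA=${ARANGO_DATA:-/var/lib/arangodb3}   # assumed data directory
BUCKET=${BUCKET:-s3://my-node-backups}           # hypothetical bucket
STAMP=$(date +%Y-%m-%d)

cold_backup() {
  run=$1   # pass "echo" for a dry run, "" to actually execute
  $run systemctl stop otnode arangodb3
  $run tar czf "/tmp/arango-cold-$STAMP.tar.gz" \
    -C "$(dirname "$ARANGO_DATA")" "$(basename "$ARANGO_DATA")"
  $run aws s3 cp "/tmp/arango-cold-$STAMP.tar.gz" "$BUCKET/"
  $run systemctl start arangodb3 otnode
}

cold_backup echo   # dry run: prints the commands instead of executing them
```

The key point versus arangodump is that tar-ing the stopped database's directory needs no second on-disk copy of the data inside the DB itself, only the compressed archive in /tmp (which can also be streamed straight to S3 if even that is too much).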

My oldest node (40 GB) has every project enabled. Once the disk fills up (in about 6 weeks) I will have no choice but to either upgrade the server (and begin losing money, depending on the host) or shut down the nodes in this state.

Another person and I are investigating Arango as best we can. We are currently testing disabling prefill and reducing the journal size, since we don't believe performance tweaks are needed, but we need help with this as we can both only work on it in our free time.
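For others who want to experiment along the same lines: the journal/WAL size knobs live in arangod.conf (or the matching `--wal.*` / `--database.*` startup flags). A hedged sketch for the MMFiles engine used by ot-node's ArangoDB 3.x, with purely illustrative values; I'm not aware of a documented "prefill" switch, so that part isn't shown:

```ini
# arangod.conf -- illustrative values only (MMFiles engine, ArangoDB 3.x)
[wal]
# WAL logfiles default to 32 MB each; smaller files mean less preallocated space
logfile-size = 8388608
# keep fewer spare logfiles on disk (default is 3)
reserve-logfiles = 1

[database]
# cap the journal size for collection datafiles (default 32 MB)
maximal-journal-size = 8388608
```

Smaller journals can cost write throughput under heavy load, which is why we think it's acceptable here: these nodes are storage-bound, not performance-bound.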

I moderate 2 Telegram channels, one for novices with basic issues and another for advanced usage. The rate at which people need assistance with full drives is increasing substantially, and it is not an easy problem to walk them through: we have to add storage temporarily to generate a backup and then switch them to Smoothbrain. (Their normal disk usage is around 25 GB, and the arangodump maxes it out.)

We are also currently testing renting large servers and running multiple (6+) containers on one server as a way to reduce server costs. Some people in the group are running a large number of servers (20+), so cutting VPS costs by 6x or more adds up.

Disk space will cross into a very critical stage in the next several weeks. As node runners, the number 1 issue we face is disk space filling up unusually fast.

Thanks!


Valcyclovir avatar Valcyclovir commented on June 16, 2024

+1

Seeing many node runners experiencing the same disk space issues. calr0x has developed temporary tools to help deal with them, but we, as a community, need the team's help on this. We have tried many things over the past couple of weeks, but we have no ArangoDB 3 experts to help us deal with the elephant in the room. It is getting bigger by the minute, and node runners have to keep upgrading their servers to accommodate it while being paid only a fraction of the cost; that in itself is also an issue.
One way to solve this, and to make it fair to node runners operating several nodes (each of which currently keeps its own copy of the Arango database), would be one central Arango database that node runners can attach as many nodes to as they choose. According to our community of node runners this is possible, and it would be a very welcome optimization of space and fees for everyone.
Really hoping this issue gets far more attention. Most of us are in this for the tech, and a lot of us are running multiple nodes that are far from optimized in resource usage right now. If we want to see adoption on the front end, this tech needs to be sustainable and optimized on the back end.


calr0x avatar calr0x commented on June 16, 2024

Another solution we are working on, which the above commenter mentioned, is separating the node from the DB.

We would rent a beefy server to host the DB and run multiple dockerless nodes that each connect to their OWN database on that server. This has the advantage of needing only one VPS with large storage capacity, so nodes can be packed tightly onto far fewer servers:

Node1 uses node1db on the server
Node2 uses node2db on the server
Etc.
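Per node this is just a database-connection change. A sketch of what the relevant section might look like, assuming v5-style `.origintrail_noderc` keys (the field names here are assumptions, so check them against your own config):

```json
{
  "database": {
    "provider": "arangodb",
    "host": "db.example.internal",
    "port": 8529,
    "database": "node1db",
    "username": "root",
    "password": "changeme"
  }
}
```

Node2's file would be identical except for `"database": "node2db"`, and so on. Only the shared DB server needs the large disk; the node hosts themselves stay small.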

The day will come when jobs pay enough that, financially and resource-wise, this won't be necessary or ideal compared to simply renting an adequate VPS. At a disk-space growth rate of 2 GB a week, however, no current solution is long-term.


lukeskinner avatar lukeskinner commented on June 16, 2024

Just to add some numbers... I made a new xDai node at launch (March 2021). By the end of May 2021 the docker container was taking up 28GB of data. By June it was taking up 36GB of data.

As mentioned in calr0x's comments above, I've since moved to his containerless solution (DockSucker). This eliminates the penalty of how docker stores data, but I'm still at 24GB of data, of which 99% is just ArangoDB.

My node has won 54 jobs in total, and according to the blockchain the total size of the data across all those jobs is under 40MB. Yet I'm seeing 24GB of ArangoDB space used.

I've had a look inside my ArangoDB and I've got around 10 rows/records in one of the collections, with barely any data in them.
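For anyone else trying to work out where the space goes: ArangoDB exposes per-collection storage figures over its HTTP API (`GET /_api/collection/<name>/figures`), which shows datafile/journal sizes rather than just document counts. A sketch below; the endpoint, credentials, database name, and collection names are assumptions, and by default it only prints the curl commands:

```shell
#!/bin/sh
# Print per-collection storage figures from ArangoDB's HTTP API.
# Endpoint, database name, credentials and collection names are assumptions.
ENDPOINT=${ENDPOINT:-http://127.0.0.1:8529}
DB=${DB:-origintrail}

figures() {
  run=$1   # "echo" for a dry run, "" to actually query
  shift
  for coll in "$@"; do
    $run curl -s -u root: "$ENDPOINT/_db/$DB/_api/collection/$coll/figures"
  done
}

figures echo ot_datasets ot_vertices   # hypothetical collection names
```

Comparing the figures output against the document counts should show whether the space sits in datafiles/journals that compaction never reclaims, which is what the row counts above suggest.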

I personally think this is a pretty fatal flaw in either the technologies used or how they're configured. It may sound a bit dramatic, but I think you're going to lose a lot of node runners in the long term because this is not sustainable. I personally won't keep running a node if I hit over 80GB of ArangoDB disk space after winning only ~200 jobs...


calr0x avatar calr0x commented on June 16, 2024

I'd like to request a dialogue between us all, with more timely responses from OT than is the norm. We can share requested info and knowledge to help get to the bottom of this.

I can't stress enough that this is a significant issue within the node running community. Anything OT can do to help is greatly appreciated!


branarakic avatar branarakic commented on June 16, 2024

Hi tracers,

First of all, thanks for the constructive contributions and suggestions on this issue. As I understand it, there hasn't been much communication from the dev team on this topic, at least not publicly on GitHub, so I'll add some context on the ongoing work to attack this issue.

There are several aspects to this topic: 1) efficiency of storage utilization, 2) protocol requirements and 3) growth considerations. Let me dive deeper into each.

  1. Efficiency of storage utilization

There are several factors that affect disk usage, the most important being the efficiency of the database storage engine (ArangoDB), docker, and node data management. As some of you have mentioned in the discussion above, there's room for improvement in all of these, and the team has been working on them since the introduction of the pruning feature, as all of these factors affect its performance. The upcoming release (5.1.0, scheduled for this week) improves utilization efficiency on the node data management side, further removing some unnecessary data duplicates that appeared due to a bug in the node. We'd like to ask for the community's help in assessing the impact of this release, as it is not that easy to test (details on how to do so are coming up).
In the midterm we will be addressing docker and DB implementation details as well.

  2. Protocol requirements

As publishing on ODN replicates datasets to all nodes that are effectively bidding for data holding services, even the nodes that do not get selected by smart contracts still opportunistically keep dataset copies. Strictly speaking this is not a protocol requirement, and optimizations on this front are scheduled for v6 which is already in the works and will be detailed in an upcoming RFC. The main focus of v6 is to introduce data handling improvements, extend interoperability (RDF) and provide new features for the knowledge tools. We expect to achieve at least a 10x improvement in storage utilization from the upcoming v6 with the introduction of a new indexing component in the node implementation.

  3. Growth considerations

All of the above being said, as adoption increases, so will the network's need for storage capacity. Put simply, if optimizations bring a 100x improvement (100x less storage used) and network utilization grows 100x (100x more data to be stored), these two forces will ‘cancel out’, so each node runner's decision to scale up will again come down to individual node profitability. With additional tools to monitor profitability and direct node behavior on the network (such as the AAP plugin) this should be easier to do; however, the protocol implementation cannot guarantee that nodes will never require additional resources in the future.

In summary - there are several improvements on the way in the short, mid and long term across several aspects of the network. Expect a release as early as this week, along with a request from the dev team for assistance in testing its effect.

I want to give a shoutout to everyone contributing in the discussion and would like to invite you to join us in making v6 as powerful as we can together - we will be issuing an RFC after the draft is completed by the team, and would really appreciate getting your input and feedback.

Thanks and #traceon


kotlarmilos avatar kotlarmilos commented on June 16, 2024

Hi Tracers,

We’ve been reviewing the issues on the repo and we believe that the issue should no longer persist when pruning is enabled. We will close it for now.

