
cosmos-omnibus's Issues

date: invalid date '-"30 days"' right after snapshot completes, potentially breaking the old snapshot removal

The snapshot itself completes without issues.

The error is most likely related to this line https://github.com/ovrclk/cosmos-omnibus/blob/c61281464/snapshot.sh#L51 and probably only impacts the old snapshot removal.

root-node-1  |  701GiB 1:47:15 [ 107MiB/s] [ 111MiB/s] [====================> ] 99% ETA 0:00:13
root-node-1  |  701GiB 1:47:20 [ 113MiB/s] [ 111MiB/s] [====================> ] 99% ETA 0:00:08
root-node-1  |  702GiB 1:47:25 [ 120MiB/s] [ 111MiB/s] [====================> ] 99% ETA 0:00:03
root-node-1  |  703GiB 1:47:30 [ 110MiB/s] [ 111MiB/s] [====================>] 100% ETA 0:00:00
root-node-1  |  703GiB 1:47:31 [ 111MiB/s] [ 111MiB/s] [====================>] 100%            
root-node-1  | 
root-node-1  | date: invalid date '-"30 days"'
root-node-1  | 14:51:45: Starting server
root-node-1  | 14:51:45: Snapshot will run at 13:04:00 on day 5
root-node-1  | 2:52PM INF starting node with ABCI Tendermint in-process

Bitcanna deployment crashes after max 10 minutes of running - image outdated

Deploying a Bitcanna node on any provider causes it to crash after 3-10 minutes of running.

  • Tested using 4 and 8 VCPU
  • Tested using 8Gi, 16Gi and 32Gi of RAM
  • Tested using 100Gi ephemeral storage
  • Tested using 5Gi of ephemeral storage + 500Gi of persistent storage (beta2 and beta3)
  • Tested on Europlots.com provider (ephemeral only and ephemeral+beta3) and boxedcloud.net provider (ephemeral+beta2)

These are the deployment events:

[node]: [Normal] [Pulled] [Pod] Container image "ghcr.io/akash-network/cosmos-omnibus:v0.3.27-bitcanna-v1.5.3" already present on machine
[node]: [Normal] [Created] [Pod] Created container node
[node]: [Normal] [Started] [Pod] Started container node
[node]: [Warning] [BackOff] [Pod] Back-off restarting failed container
[node]: [Normal] [Pulled] [Pod] Container image "ghcr.io/akash-network/cosmos-omnibus:v0.3.27-bitcanna-v1.5.3" already present on machine
[node]: [Normal] [Created] [Pod] Created container node
[node]: [Normal] [Started] [Pod] Started container node
[node]: [Warning] [BackOff] [Pod] Back-off restarting failed container
[node]: [Warning] [BackOff] [Pod] Back-off restarting failed container
[node]: [Normal] [Pulled] [Pod] Container image "ghcr.io/akash-network/cosmos-omnibus:v0.3.27-bitcanna-v1.5.3" already present on machine
[node]: [Normal] [Created] [Pod] Created container node
[node]: [Normal] [Started] [Pod] Started container node
[node]: [Warning] [BackOff] [Pod] Back-off restarting failed container
[node]: [Warning] [BackOff] [Pod] Back-off restarting failed container

And these are the logs

2:03PM INF ABCI Replay Blocks appHeight=8082400 module=consensus stateHeight=8082400 storeHeight=8082401
2:03PM INF Replay last block using real app module=consensus
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x154660d]

As Andy noticed, the current version of Bitcanna is 1.6.3 while the image being pulled is version 1.5.3; this is likely the reason for the crashes.

Another issue, not directly related to this one: when using persistent storage, the default ephemeral storage amount is 100Mi, which produces another error:
[node]: [Warning] [Evicted] [Pod] Pod ephemeral local storage usage exceeds the total limit of containers 104857600.
Increasing ephemeral storage to 500Mi resolves this error.

I suggest increasing the default amount of ephemeral storage to 1Gi when persistent storage is used.
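
For illustration, a hedged sketch of what the compute profile could look like in the SDL with a slightly larger ephemeral volume next to the persistent one (sizes, profile name and storage class here are examples, not the repo's defaults):

profiles:
  compute:
    node:
      resources:
        storage:
          - size: 1Gi                # ephemeral scratch space for the container
          - name: data
            size: 500Gi
            attributes:
              persistent: true
              class: beta3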

preseed `priv_validator_state.json` when not found

There is a known tendermint issue which prevents some chains from starting unless priv_validator_state.json has been created manually.

[node-0] + exec /bin/sh -c '$START_CMD'
[node-0] 8:39PM INF running app args=["start"] module=cosmovisor path=/root/.sifnoded/cosmovisor/genesis/bin/sifnoded
[node-0] 8:39PM INF starting node with ABCI Tendermint in-process
[node-0] open /root/.sifnoded/data/priv_validator_state.json: no such file or directory
[node-0] Error: exit status 1
[node-0] 8:39PM ERR error="exit status 1" module=cosmovisor

Workaround:

mkdir /root/.sifnoded/data
echo '{"height":"0","round":0,"step":0}' > /root/.sifnoded/data/priv_validator_state.json

usr/bin/run.sh: line 133: /bin/cheqd-noded: cannot execute binary file: Exec format error

I have tried to run a cheqd node on Akash, but got the error usr/bin/run.sh: line 133: /bin/cheqd-noded: cannot execute binary file: Exec format error.
The issue is reproducible locally with cosmos-omnibus/cheqd/docker-compose.yaml.
Full output from my local docker-compose run:

cheqd git:(master) docker-compose run --rm node_1
Pulling node_1 (ghcr.io/ovrclk/cosmos-omnibus:v0.1.7-cheqd-v0.5.0)...
v0.1.7-cheqd-v0.5.0: Pulling from ovrclk/cosmos-omnibus
982cba7e471c: Already exists
b02d86f59850: Pull complete
8b047e8f2e47: Pull complete
478cfe935c2f: Pull complete
c6620b71e668: Pull complete
3dd5012f8777: Pull complete
765cedd16b53: Pull complete
0bcbb0a46bc0: Pull complete
8732c2e3ec66: Pull complete
a9c135496e31: Pull complete
c98c1734796d: Pull complete
691b1aed8b8e: Pull complete
aeedf56cb501: Pull complete
4f4fb700ef54: Pull complete
Digest: sha256:f098b917e564f555ad4989ec71e612271d7e4aacc8ff9d6732ab5bee6efbdf48
Status: Downloaded newer image for ghcr.io/ovrclk/cosmos-omnibus:v0.1.7-cheqd-v0.5.0
Creating cheqd_node_1_run ... done
/usr/bin/run.sh: line 133: /bin/cheqd-noded: cannot execute binary file: Exec format error
ERROR: 126

Any ideas why the binary is in the wrong format?

Setup API and swagger

I want to run an Akash node on Akash, but it seems that the image in this repo does not enable the API and swagger. What should I do to enable them?
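
If the namespaced env-var overrides used elsewhere in omnibus apply here, something like the following should turn on the SDK API server and swagger (the exact variable names are an assumption and would need to match the binary's namespace); port 1317 then needs to be exposed:

  - AKASH_API_ENABLE=true
  - AKASH_API_SWAGGER=true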

add PIGZ to accelerate .tar.gz by 14x !

We should add, and probably default to, the pigz compression tool.

The results speak for themselves:

  • tar - 600 MiB/s
  • tar.gz - 19.7 MiB/s
  • tar.zst - 46.6 MiB/s
  • tar.gz (PIGZ) - 275 MiB/s !

The secret is simple: gzip is constrained to a single thread, while pigz can spread the compression across multiple threads.

pigz produces the same gzip-compatible output, but it distributes the work across multiple processors and cores while compressing, considerably speeding up compression and decompression.

# TAR (no compression)
root@rpc:~# tar c -C $SNAPSHOT_DIR . 2>/dev/null | pv -petrafb -i 1 -s $SNAPSHOT_SIZE > /dev/null
8.20GiB 0:00:07 [ 988MiB/s] [1.17GiB/s] [>                              ]  1% ETA 0:10:26
19.0GiB 0:00:28 [ 403MiB/s] [ 693MiB/s] [==>                            ]  2% ETA 0:17:48
22.4GiB 0:00:37 [ 383MiB/s] [ 620MiB/s] [====>                          ]  3% ETA 0:19:49
24.4GiB 0:00:42 [ 405MiB/s] [ 594MiB/s] [====>                          ]  3% ETA 0:20:37
^C.8GiB 0:00:43 [ 385MiB/s] [ 589MiB/s] [====>                          ]  3% ETA 0:20:46

# TAR.GZ (`gzip -1`)
root@rpc:~# tar c -C $SNAPSHOT_DIR . 2>/dev/null | gzip -1 | pv -petrafb -i 1 -s $SNAPSHOT_SIZE >/dev/null
 137MiB 0:00:07 [19.4MiB/s] [19.7MiB/s] [>                             ]  0% ETA 10:43:41
 451MiB 0:00:23 [20.3MiB/s] [19.6MiB/s] [>                             ]  0% ETA 10:45:54
 550MiB 0:00:28 [20.1MiB/s] [19.7MiB/s] [>                             ]  0% ETA 10:43:50
 610MiB 0:00:31 [20.5MiB/s] [19.7MiB/s] [>                             ]  0% ETA 10:42:45
^C31MiB 0:00:32 [20.2MiB/s] [19.7MiB/s] [>                             ]  0% ETA 10:42:11

# TAR.ZST (`zstd`, v1.4.8)
root@rpc:~# tar c -C $SNAPSHOT_DIR . 2>/dev/null | zstd -c $zstd_extra_arg | pv -petrafb -i 1 -s $SNAPSHOT_SIZE > /dev/null
 452MiB 0:00:10 [47.1MiB/s] [45.3MiB/s] [>                              ]  0% ETA 4:39:39
 691MiB 0:00:15 [48.2MiB/s] [46.1MiB/s] [>                              ]  0% ETA 4:34:46
1.18GiB 0:00:26 [47.0MiB/s] [46.4MiB/s] [>                              ]  0% ETA 4:32:28
1.46GiB 0:00:32 [45.5MiB/s] [46.6MiB/s] [>                              ]  0% ETA 4:31:35
^C50GiB 0:00:33 [47.6MiB/s] [46.6MiB/s] [>                              ]  0% ETA 4:31:22

# TAR.GZ with PIGZ
root@rpc:~# tar c -C $SNAPSHOT_DIR . 2>/dev/null | pigz --fast | pv -petrafb -i 1 -s $SNAPSHOT_SIZE > /dev/null
 861MiB 0:00:03 [ 294MiB/s] [ 287MiB/s] [>                              ]  0% ETA 0:44:05
2.15GiB 0:00:08 [ 276MiB/s] [ 275MiB/s] [>                              ]  0% ETA 0:45:55
4.35GiB 0:00:16 [ 278MiB/s] [ 278MiB/s] [>                              ]  0% ETA 0:45:18
7.07GiB 0:00:26 [ 287MiB/s] [ 278MiB/s] [>                              ]  0% ETA 0:45:06
8.60GiB 0:00:32 [ 267MiB/s] [ 275MiB/s] [>                              ]  1% ETA 0:45:30
^C
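
A rough sketch of how the compression step could prefer pigz when it is installed, falling back to plain gzip (variable names are illustrative, not the actual ones in snapshot.sh):

# Hypothetical compression selection: use pigz when available, else gzip
if command -v pigz >/dev/null 2>&1; then
  gzip_cmd="pigz --fast"
else
  gzip_cmd="gzip -1"
fi

tar c -C "$SNAPSHOT_DIR" . | $gzip_cmd | pv -petrafb -i 1 -s "$SNAPSHOT_SIZE" > "$snapshot_file"   # e.g. chain_<date>.tar.gz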

Update ImpactHub version

It looks like ImpactHub has changed its versioning; the v1.6.0 tag doesn't exist anymore and the latest version appears to be v0.17.0. Builds will fail for impacthub until this is changed.

This will need updating in the chain registry as well.

Fat Gaia

The v4.2.1 config originally found here for gaia will work and begin to sync, since the beginning of gaia's current state is compatible with those consensus rules. Then she will change over to v5.0.0 consensus rules, and the current binary is v5.0.2.

In order to make fat gaia happy, you'll need to give her cake, which is available at quicksync.io

Gaia is very fat, and she is sure to get fatter.

cosmos/gaia#704

Validators

Strongly held opinion:

Because validators secure their chains with cryptographic material, validator operation is best done from people's homes and offices, and automation is a minus, not a plus.

[question] AKASH_MODE env. variable

Is AKASH_MODE actually used anywhere, or is it only there to help the operator see which mode the node is running in?

What would AKASH_MODE=full mean? It's not the same as pruning=nothing (i.e. an archiving node).

Maybe we should remove AKASH_MODE, rename it, or update the readme, to reduce confusion?

cosmos-omnibus$ git grep AKASH_MODE
_examples/validator-and-private-sentries/deploy.yml:      - AKASH_MODE=validator
_examples/validator-and-private-sentries/deploy.yml:      - AKASH_MODE=full
_examples/validator-and-private-sentries/deploy.yml:      - AKASH_MODE=full
_examples/validator-and-public-sentries/sentries-deploy.yml:      - AKASH_MODE=full
_examples/validator-and-public-sentries/sentries-deploy.yml:      - AKASH_MODE=full
_examples/validator-and-public-sentries/validator-deploy.yml:      - AKASH_MODE=validator

I cannot run this image successfully

When I run an Akash node on Akash, there are no logs and it seems to have crashed.
When I run it on my machine, it always exits with code 7 or 35, and sometimes it does nothing at all and shows no output.
Only once did it seem to run successfully.
I read the tutorials in the documentation, and it looks like the first startup is expected to fail. Is the node meant to restart repeatedly before it runs normally?

Automated snapshot generation

Need a solution for snapshotting a node's data directory and uploading the resulting archive somewhere accessible.

Ideally this should be able to run on Akash in a single container. Need a script to start/stop a tendermint server at a certain day/time, create an archive and upload it, then start the server again. Should exit if the tendermint server exits. Should also clean up backups automatically.
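
A bare-bones sketch of one snapshot cycle under those requirements, with the binary name, data directory and bucket as placeholders (scheduling and retention are left out):

# Hypothetical snapshot cycle: stop the node, archive the data dir, upload, restart
kill -TERM "$NODE_PID" && wait "$NODE_PID"      # stop tendermint gracefully

snapshot="chain_$(date -u +%Y-%m-%dT%H:%M:%S).tar.gz"
tar c -C "$HOME/.node/data" . | gzip -1 > "/tmp/$snapshot"
aws s3 cp "/tmp/$snapshot" "s3://my-snapshot-bucket/$snapshot" && rm -f "/tmp/$snapshot"
# ... prune old snapshots in the bucket here ...

exec /bin/node-binary start                     # restart; the container exits if the server exits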

add fail-safe mode by default to avoid double-sign when the new pod started while the old one is terminating but is still up

Add a fail-safe mode by default to avoid double-signing when a new pod starts while the old one is terminating but is still up for any reason.

The fail-safe mode would check whether any of the last few blocks (5, 10 or 50) have been signed by this validator before starting it.
If recent blocks have been signed, it would fail safely with a message such as: echo "WARNING!!! An active validator with the same pubkey is validating blocks! Safely exiting..."
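
For illustration, a hedged sketch of such a check against the chain's RPC before the validator process starts (the RPC_URL and VALIDATOR_ADDRESS variables and the 10-block window are assumptions):

# Hypothetical pre-start check: refuse to start if this validator signed any recent block
latest=$(curl -s "$RPC_URL/status" | jq -r .result.sync_info.latest_block_height)
for h in $(seq $((latest - 10)) "$latest"); do
  if curl -s "$RPC_URL/block?height=$h" \
     | jq -e --arg addr "$VALIDATOR_ADDRESS" \
         '.result.block.last_commit.signatures[] | select(.validator_address == $addr)' >/dev/null; then
    echo "WARNING!!! An active validator with the same pubkey is validating blocks! Safely exiting..."
    exit 1
  fi
done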

This has happened to Dimokus, who woke up to a jailed validator because of it.

Relay docs

Document how to use this to provide relay nodes.

You can assign this to me.

BitSong binary update to v0.11.0

Recently BitSong went through a chain upgrade and released a new version of the binary. The old version v0.10.0 is deprecated.
The repository needs to be updated to the new version: v0.11.0.

snapshots: automatically detect directory structure

Some snapshots have data and wasm directories; I think these should be detected automatically before extraction.

They are currently set via these variables:

- SNAPSHOT_WASM_PATH=wasm
- SNAPSHOT_DATA_PATH=data

ref. #372
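
One way the detection could work is to peek at the archive's top-level entries before extracting, as in this sketch (assuming a local or streamed-to-disk archive; the snapshot.tar.gz name is a placeholder):

# Hypothetical auto-detection of wasm/data directories inside the snapshot
top_dirs=$(tar tf snapshot.tar.gz | sed 's|^\./||' | cut -d/ -f1 | sort -u)

echo "$top_dirs" | grep -qx "wasm" && SNAPSHOT_WASM_PATH=wasm
echo "$top_dirs" | grep -qx "data" && SNAPSHOT_DATA_PATH=data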

snapshots: add Storj DCS support for backup & restore

Storj is cheap and reliable!

It offers enterprise-grade 99.95% data availability: https://www.storj.io/solutions-brief/data-sovereignty


Example config:

      - SNAPSHOT_METADATA_URL=https://link.storjshare.io/s/<REDACTED>/akash-snapshots/rpc.la
      - SNAPSHOT_PATH=akash-snapshots/rpc.la
      - SNAPSHOT_JSON=https://link.storjshare.io/s/<REDACTED>/akash-snapshots/rpc.la/snapshot.json?download=1
      - SNAPSHOT_FORMAT=tar.zst
      - ZSTD_NBTHREADS=0 # use all cores => faster
$ curl -s https://link.storjshare.io/s/<REDACTED>/akash-snapshots/rpc.nl/snapshot.json?download=1 | jq -r .latest
https://link.storjshare.io/s/<REDACTED>/akash-snapshots/rpc.nl/akashnet-2_2022-12-08T22:32:30.tar.zst

This is already working with Storj DCS S3 credentials; I've tested it.
The snapshot script only needs to append the ?download=1 suffix to the fileUrl.
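
For example, the URL written into snapshot.json could be formed roughly like this (variable names are illustrative):

# Hypothetical: append ?download=1 so link.storjshare.io serves the file directly
file_url="${SNAPSHOT_METADATA_URL}/${snapshot_file}?download=1"
echo "{\"latest\": \"${file_url}\"}" > snapshot.json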


When sharing the snapshot

The download speed via link.storjshare.io can be very slow.

While uploading via the S3 gateway https://gateway.storjshare.io is quite fast, downloading is a bit slow and aws s3 cp does not download in parallel.

Hence, it is suggested to use the uplink tool, which connects to the Storj DCS nodes directly, for downloading.

Example uplink usage:

curl -L https://github.com/storj/storj/releases/latest/download/uplink_linux_amd64.zip -o uplink_linux_amd64.zip
unzip -o uplink_linux_amd64.zip
sudo install uplink /usr/local/bin/uplink
rm uplink

uplink access import --force --interactive=false akash-snapshots <ACCESS_GRANT_TOKEN/or a file>

uplink cp -p 16 sj://akash-snapshots/rpc.nl/akashnet-2_2022-12-08T22:32:30.tar.zst .

uplink can also stream:

  • upload:
tar c -C $SNAPSHOT_DIR . | ZSTD_NBTHREADS=0 zstd -c | uplink cp -p 4 -t 4 - sj://akash-snapshots/test1/test.tar.zst
  • download:
uplink cp -p 4 -t 4 sj://akash-snapshots/rpc.nl/akashnet-2_2022-12-08T22:32:30.tar.zst - | pv -petrafb | ZSTD_NBTHREADS=0 zstd -cd | tar xf -

We can add --parallelism-chunk-size 128M (to let the user specify the chunk size).

Todo

  • append the ?download=1 suffix when forming snapshot.json;
  • update the readme with the uplink example for fast downloads when sharing the snapshot;
  • snapshot restore: add the uplink tool for restoring the snapshot;

Refs

Launching nodes seems broken? Osmosis and Cosmos

I'm having issues getting cosmos and osmosis nodes up and running using the cloudmos templates (via the app).

Osmosis node starts and then silently stops

Cosmos emits this message and then stops:

node: Issue obtaining the Polkachu snapshot. Likely the API issue.
node: Sleeping for 10 minutes before restarting...

Set Moniker environment var correctly

Currently the MONIKER env variable is only used on init, meaning if you change it in subsequent runs it won't change the moniker of the node.

Omnibus just needs to set {NAMESPACE}_MONIKER in addition to MONIKER so the node uses the env var config instead of the config file created by init.

This can be worked around in the meantime by specifying both MONIKER and {NAMESPACE}_MONIKER in your environment variables.
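
For example, for an Akash node the workaround could look like this in the deployment environment (assuming AKASH is the namespace for the binary):

  - MONIKER=my-omnibus-node
  - AKASH_MONIKER=my-omnibus-node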

CI server

If the Overclock Labs team is OK with it, I can configure a pretty nice CI server and would be happy to donate a powerful one to the cosmos-omnibus project. This could be used to automatically confirm whether or not chains can state-sync, and probably a lot more.

Or you could.

Basically the constraint would be the 7GB of storage that GitHub provides.

properly handle signals (SIGINT, SIGTERM) to terminate the server correctly

A shell script runs the binary; when the container receives SIGTERM (say on Pod termination, or docker stop / kill with the default SIGTERM, or Ctrl+C / SIGINT), the shell script does not forward these signals to the binary.

You can quickly test this and see that SIGTERM is not working: just issue docker [compose] stop -t600 node. It won't do anything until it times out and issues SIGKILL, which does not stop the binary gracefully.

Cosmos SDK understands SIGTERM & SIGINT for graceful server termination.

Signal handling can be done directly in the shell scripts; in cosmos-omnibus that would be both run.sh and snapshot.sh.

We can reuse the signal handling from one of my projects I've been working on - self-update.
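
For illustration, a minimal sketch of shell-level signal forwarding: run the binary in the background, trap SIGINT/SIGTERM, forward them to the child, and exit with its status (this is a generic pattern, not the exact code from self-update):

# Hypothetical signal handling for run.sh
/bin/sh -c "$START_CMD" &
child=$!

trap 'kill -TERM "$child" 2>/dev/null' TERM
trap 'kill -INT "$child" 2>/dev/null' INT

wait "$child"; status=$?
# wait returns early when a trapped signal arrives, so wait again until the child is really gone
while kill -0 "$child" 2>/dev/null; do
  wait "$child"; status=$?
done
exit "$status"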

Workaround

Get into the container and then kill <pid-of-cosmos-binary> (issues SIGTERM by default).

REST API not working on nodes

Hello, I'm trying to run an evmos node with ghcr.io/ovrclk/cosmos-omnibus:v0.3.5-evmos-v8.2.0. Everything works except port 1317. In #85 @tombeynon replied:

@kamsz The REST API is enabled using the tendermint config api.enable, so with environment variables for Chihuahua that would be CHIHUAHUAD_API_ENABLE=true. Then just expose 1317.

I tried the same with the evmos node using the environment variables EVMOS_API_ENABLE=1 and EVMOS_API_ENABLE=true, but it's not working. Then I tried an akash node with AKASH_API_ENABLE=1 and it worked. How can I enable and expose port 1317 from Docker?
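
If the env var namespace follows the binary name (evmosd) rather than the chain name, the variables would look something like the lines below, mirroring the Chihuahua example above; this is an assumption, not a confirmed fix, and port 1317 still needs to be published in docker-compose or the SDL:

  - EVMOSD_API_ENABLE=true
  - EVMOSD_API_ADDRESS=tcp://0.0.0.0:1317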

The generic image is not published on tag

This is the response I got when trying to run docker run with the generic image:

docker run ghcr.io/ovrclk/cosmos-omnibus:v0.2.2-generic                                                                                                                                                                                                                                                                                                                      
Unable to find image 'ghcr.io/ovrclk/cosmos-omnibus:v0.2.2-generic' locally
docker: Error response from daemon: manifest unknown.

Checking out the tag workflows, it does not look like the generic image is being pushed to the container registry on tag.

Setup automated tests to run on push

Need a test script which at least checks that the binary runs, but ideally also checks that the node boots within a configurable amount of time and that RPC responds (something to verify the namespace is correct).
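
A rough sketch of what such a smoke test could look like in CI, assuming the image exposes RPC on 26657 (the tag, wait time and image name are placeholders):

# Hypothetical smoke test: boot the image, give it time, then poll the RPC
docker run -d --name omnibus-test -p 26657:26657 "ghcr.io/akash-network/cosmos-omnibus:$TAG"
sleep 120

if curl -sf http://localhost:26657/status > /dev/null; then
  echo "RPC is responding"
else
  echo "node failed to boot"
  docker logs omnibus-test
  exit 1
fi
docker rm -f omnibus-test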

[akash] should probably use pruned snapshot by default? ~85x smaller (400G => 4.7G) and speeds up the deployment

Akash should probably use the pruned snapshot by default: it is roughly 85x smaller (400G => 4.7G), and it speeds up the deployment.

Change is simple =>

from:

  SNAPSHOT_JSON=https://cosmos-snapshots.s3.filebase.com/akash/snapshot.json

to:

  SNAPSHOT_JSON=https://cosmos-snapshots.s3.filebase.com/akash/pruned/snapshot.json

Ref.

Getting all the data from chain.json

Just a question, because I'm not too familiar with Docker deployments, so sorry if this doesn't make sense. It seems each chain's Docker files have some data hardcoded in the Dockerfile, primarily the git repo/version, the link to the genesis file, the name of the home directory, and gas prices. However, all of this data is available in chain.json, so is there a reason not to pass in chain.json as the only input?
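
For what it's worth, most of those hardcoded values do appear to exist in the chain registry's chain.json, so something like this sketch could derive them at build or run time (the field paths follow the chain-registry schema, but treat them as assumptions):

# Hypothetical: read values from chain.json instead of hardcoding them
chain_json=$(curl -s https://raw.githubusercontent.com/cosmos/chain-registry/master/akash/chain.json)

NODE_HOME=$(echo "$chain_json" | jq -r .node_home)                        # e.g. $HOME/.akash
GENESIS_URL=$(echo "$chain_json" | jq -r .codebase.genesis.genesis_url)
VERSION=$(echo "$chain_json" | jq -r .codebase.recommended_version)
MINIMUM_GAS_PRICES=$(echo "$chain_json" | jq -r '.fees.fee_tokens[0] | "\(.fixed_min_gas_price)\(.denom)"')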
