Code Monkey home page Code Monkey logo

Comments (15)

hitchhooker avatar hitchhooker commented on August 22, 2024 2

workaround has been to identify node being stuck and restarting service something like this

#!/bin/bash

# Variables
LOG_FILE="bsc.log"
PATTERN="No block bodies to write"
MIN_OCCURRENCES=5
NODE_RESTART_CMD="systemctl restart your-node.service"

# Function to count consecutive occurrences of the pattern with the same block number
count_occurrences() {
  local log_file="$1"
  local pattern="$2"
  local count=0
  local prev_block_number=""

  grep -F "$pattern" "$log_file" | while read -r line; do
    if [[ $line == *"$pattern"* ]]; then
      block_number=$(echo "$line" | grep -oP 'block number=\K\d+')
      if [[ "$block_number" == "$prev_block_number" ]]; then
        ((count++))
      else
        count=1
      fi

      if (( count >= MIN_OCCURRENCES )); then
        return 0
      fi

      prev_block_number="$block_number"
    fi
  done

  return 1
}

# Main loop
while true; do
  if count_occurrences "$LOG_FILE" "$PATTERN"; then
    echo "Restarting node due to consecutive occurrences of '$PATTERN' with the same block number in $LOG_FILE"
    $NODE_RESTART_CMD
  fi
  sleep 60
done

from bsc-erigon.

yusben avatar yusben commented on August 22, 2024 2

@calmbeing thank you for the update.

Despite running with the suggested parameters on v1.0.7 we are still facing issue with the syncing. We observe the error no block headers/bodies to write... to be less frequent, but overall the syncing seems to be slower than before. Please find attached the entire log of one of our node that was restarted by the cronjob 9h ago. bsc-mainnet-erigon-archive.log

Also, these are the startup command parameters for reference

- '--chain=bsc'
- '--datadir=/data/erigon'
- '--log.console.verbosity=info'
- '--nat=extip:$(POD_IP)'
- '--maxpeers=1000'
- '--p2p.protocol=66,67'
- '--torrent.download.slots=20'
- '--torrent.download.rate=512mb'
- '--bodies.cache=214748364800'
- '--batchSize=4096M'
- '--db.pagesize=16k'
- '--sentry.drop-useless-peers'

I assume you should use v1.0.8 with the latest fixes. I deployed it yesterday with the flag as was recommended by calmbeing and so far erigon has been working for 24+ hours flawlessly and without no block headers in the logs.

from bsc-erigon.

calmbeing avatar calmbeing commented on August 22, 2024 2

@broyeztony , good to hear that, your sync slow problem should be resolved by this latest commit cc23b96, we'll draft a release later.

from bsc-erigon.

calmbeing avatar calmbeing commented on August 22, 2024 1

Suggest to use the latest mdbx snapshot for archive node: https://github.com/bnb-chain/bsc-snapshots

from bsc-erigon.

broyeztony avatar broyeztony commented on August 22, 2024 1

Following our call today, here are the 'ETL' logs we see in the client container

[INFO] [05-11|09:11:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ae5f14a9
[INFO] [05-11|09:12:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=b41b9ac5
[INFO] [05-11|09:12:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.3GB sys=9.7GB
[INFO] [05-11|09:12:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=bb56a527
[INFO] [05-11|09:13:16.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=c4bfd229
[INFO] [05-11|09:13:27.226] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:13:28.027] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.0GB sys=9.7GB
[INFO] [05-11|09:13:49.650] [] ETL [2/2] Loading                     into=PlainState current_prefix=cca4e84c
[INFO] [05-11|09:14:17.557] [] ETL [2/2] Loading                     into=PlainState current_prefix=d72ffbc2
[INFO] [05-11|09:14:28.013] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.2GB sys=9.7GB
[INFO] [05-11|09:14:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ddb59a10
[INFO] [05-11|09:15:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=e7f08bbf
[INFO] [05-11|09:15:27.225] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:15:28.024] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.4GB sys=9.7GB
[INFO] [05-11|09:15:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=ea0d2132
[INFO] [05-11|09:16:16.634] [] ETL [2/2] Loading                     into=PlainState current_prefix=f387ef35
[INFO] [05-11|09:16:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.5GB sys=9.7GB
[INFO] [05-11|09:16:46.633] [] ETL [2/2] Loading                     into=PlainState current_prefix=fd36e2c2
[INFO] [05-11|09:17:27.225] [p2p] GoodPeers                          eth66=992
[INFO] [05-11|09:17:28.032] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.3GB sys=9.7GB
[INFO] [05-11|09:18:14.291] [7/15 Execution] Completed on            block=28111500
[INFO] [05-11|09:18:14.291] [7/15 Execution] DONE                    in=47m50.839811187s
[INFO] [05-11|09:18:14.292] [8/15 HashState] Promoting plain state   from=28101756 to=28111500
[INFO] [05-11|09:18:14.292] [8/15 HashState] Incremental promotion   from=28101756 to=28111500 codes=true csbucket=AccountChangeSet
[INFO] [05-11|09:18:28.009] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=7.0GB sys=9.7GB
[INFO] [05-11|09:18:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acce32
[INFO] [05-11|09:19:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acd406
[INFO] [05-11|09:19:27.225] [p2p] GoodPeers                          eth66=994
[INFO] [05-11|09:19:28.024] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=4.6GB sys=9.7GB
[INFO] [05-11|09:19:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acd957
[INFO] [05-11|09:20:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acde06
[INFO] [05-11|09:20:28.011] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=3.7GB sys=9.7GB
[INFO] [05-11|09:20:44.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001ace3b5
[INFO] [05-11|09:21:14.293] [8/15 HashState] ETL [1/2] Extracting    from=AccountChangeSet current_prefix=0000000001acea5a
[INFO] [05-11|09:21:27.225] [p2p] GoodPeers                          eth66=998
[INFO] [05-11|09:21:28.027] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.8GB sys=9.7GB
[INFO] [05-11|09:22:14.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=4a728024
[INFO] [05-11|09:22:28.010] [txpool] stat                            pending=10000 baseFee=0 queued=30000 alloc=5.0GB sys=9.7GB
[INFO] [05-11|09:22:44.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=95b338d1
[INFO] [05-11|09:23:14.187] [8/15 HashState] ETL [2/2] Loading       into=HashedCodeHash current_prefix=dfe51ad1

We are not sure that these logs are relevant to the issue, but we are also running other network on Erigon without issue and we don't the see these 'ETL' logs there.

Also could you provide the configuration on which you are running your own archive and full nodes?
Thank you very much

from bsc-erigon.

setunapo avatar setunapo commented on August 22, 2024 1

got it, we are checking on it.

from bsc-erigon.

calmbeing avatar calmbeing commented on August 22, 2024 1

@broyeztony @fylsan3 After careful debug, the root cause is for two reason:

  1. Request packet params is kind of different from bsc-geth.
  2. Always connect to useless peers even return empty response.
  • The first one was fixed in the source code,

  • For the second one: please specify flag : --sentry.drop-useless-peers, which would drop useless peers with empty response

Please let us know if any issue

from bsc-erigon.

yusben avatar yusben commented on August 22, 2024 1

@yusben Thank you for your inputs. Would you mind sharing the setup you are using to run your node? Not so much the startup command but rather:

  • are you running a full (pruned) node or an archive node?
  • disk / storage type / filesystem ? is it local to the machine that runs erigon or a scalable volume like AWS EBS. What filesystem are you using there?

Using v1.0.8 and ... --sentry.drop-useless-peers --p2p.protocol=66... we are not encountering the no block headers/bodies error logs either (except on 1 testnet node, same issue as reported there) but our syncing loop on mainnet archive is just too slow (our mainnet/full nodes however perform better and remains relatively 'close' to head)

On average, it takes our best nodes to sync ~400 blocks over ~20m. Pasting a example below:

[INFO] [05-23|06:07:00.025] RPC Daemon notified of new headers       from=28453505 to=28453885 hash=0xbcc966a30ad2e90786e2c585868671bdcf813fd2b2c49f4de0c42121df41324e header sending=371.552µs log sending=316ns
... <stages 1 to 15>
[INFO] [05-23|06:26:40.867] Timings (slower than 50ms)               Headers=1.112s BlockHashes=546ms Bodies=1.251s Senders=288ms Execution=2m54.614s HashState=3m19.178s IntermediateHashes=3m2.516s CallTraces=2m13.753s AccountHistoryIndex=1m9.658s StorageHistoryIndex=2m10.962s LogIndex=1m34.543s TxLookup=1m48.088s

As a consequence, our nodes always fall behind head by > 400 blocks. @harveyff @calmbeing FYI.

I am running a pruned node on baremetal server with 3 nvme ssds in raid0 ext4. Now it is working for 10 days without No block bodies to write in this log period error and is healthy for 99%+ of time.

command: | --chain=bsc --datadir=/home/erigon/erigon-data --db.pagesize=16k --prune=hrtc --ws --http --http.addr 0.0.0.0 --http.port 8545 --http.vhosts=* --maxpeers 100 --http.api=eth,erigon,web3,net,debug,trace,txpool,admin --port 30303 --p2p.allowed-ports=30303,30304 --p2p.protocol=66 --log.dir.path=/home/erigon/erigon-data/logs --metrics --metrics.addr 0.0.0.0

from bsc-erigon.

calmbeing avatar calmbeing commented on August 22, 2024

@broyeztony For very short-term, you could use something like following script to detect "No block write.." and restart automatically. For the long-term, we're analyzing carefully about the p2p sentry code and will optimize it soon.

#!/bin/bash
set -x
# Define the number of log lines to check
LOG_LINES=50

# Define the number of occurrences before triggering a restart
OCCURRENCES=10

# Continuously check the logs
while true; do
    # Get the last 30 log lines
    LOGS=$(journalctl -u erigon-bsc --lines=$LOG_LINES)

    # Count the number of occurrences of the error message
    COUNT=$(echo "$LOGS" | grep -o "No block headers to write in this log period block number" | wc -l)

    # Check if the error occurred more than 10 times
    if [ $COUNT -gt $OCCURRENCES ]; then
        # Restart the erigon-bsc service
        echo "Error occurred more than $OCCURRENCES times. Restarting service..."
        systemctl restart erigon-bsc
    fi

    # Sleep for 60 seconds before checking the logs again
    sleep 60
done

from bsc-erigon.

calmbeing avatar calmbeing commented on August 22, 2024

workaround has been to identify node being stuck and restarting service something like this

#!/bin/bash

# Variables
LOG_FILE="bsc.log"
PATTERN="No block bodies to write"
MIN_OCCURRENCES=5
NODE_RESTART_CMD="systemctl restart your-node.service"

# Function to count consecutive occurrences of the pattern with the same block number
count_occurrences() {
  local log_file="$1"
  local pattern="$2"
  local count=0
  local prev_block_number=""

  grep -F "$pattern" "$log_file" | while read -r line; do
    if [[ $line == *"$pattern"* ]]; then
      block_number=$(echo "$line" | grep -oP 'block number=\K\d+')
      if [[ "$block_number" == "$prev_block_number" ]]; then
        ((count++))
      else
        count=1
      fi

      if (( count >= MIN_OCCURRENCES )); then
        return 0
      fi

      prev_block_number="$block_number"
    fi
  done

  return 1
}

# Main loop
while true; do
  if count_occurrences "$LOG_FILE" "$PATTERN"; then
    echo "Restarting node due to consecutive occurrences of '$PATTERN' with the same block number in $LOG_FILE"
    $NODE_RESTART_CMD
  fi
  sleep 60
done

Also, can refer to this script

from bsc-erigon.

koraykoska avatar koraykoska commented on August 22, 2024

A restart does not work at all for us. This error continues to come without the node syncing. We used the latest snapshot on the official page. Is there any workaround?

from bsc-erigon.

koraykoska avatar koraykoska commented on August 22, 2024

After every restart:

[INFO] [05-16|16:55:51.333] [txpool] Started
[INFO] [05-16|16:56:10.518] [2/15 Headers] Wrote block headers       number=28265891 blk/second=5.000 alloc=2.9GB sys=3.1GB
[INFO] [05-16|16:56:30.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:56:50.492] [txpool] stat                            pending=1 baseFee=0 queued=4 alloc=3.1GB sys=3.3GB
[INFO] [05-16|16:56:50.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:10.518] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:30.519] [2/15 Headers] No block headers to write in this log period block number=28265891
[INFO] [05-16|16:57:49.054] [p2p] GoodPeers                          eth66=73 eth67=4 eth68=1
[INFO] [05-16|16:57:50.494] [txpool] stat                            pending=3641 baseFee=0 queued=2182 alloc=2.4GB sys=3.6GB
[INFO] [05-16|16:57:50.519] [2/15 Headers] Wrote block headers       number=28265933 blk/second=2.100 alloc=2.4GB sys=3.6GB
[INFO] [05-16|16:58:10.518] [2/15 Headers] No block headers to write in this log period block number=28265933

And it continues to never sync, only No block headers to write in this log period

from bsc-erigon.

broyeztony avatar broyeztony commented on August 22, 2024

@calmbeing thank you for the update.

Despite running with the suggested parameters on v1.0.7 we are still facing issue with the syncing.
We observe the error no block headers/bodies to write... to be less frequent, but overall the syncing seems to be slower than before.
Please find attached the entire log of one of our node that was restarted by the cronjob 9h ago.
bsc-mainnet-erigon-archive.log

Also, these are the startup command parameters for reference

- '--chain=bsc'
- '--datadir=/data/erigon'
- '--log.console.verbosity=info'
- '--nat=extip:$(POD_IP)'
- '--maxpeers=1000'
- '--p2p.protocol=66,67'
- '--torrent.download.slots=20'
- '--torrent.download.rate=512mb'
- '--bodies.cache=214748364800'
- '--batchSize=4096M'
- '--db.pagesize=16k'
- '--sentry.drop-useless-peers'

from bsc-erigon.

broyeztony avatar broyeztony commented on August 22, 2024

@yusben Thank you for your inputs.
Would you mind sharing the setup you are using to run your node? Not so much the startup command but rather:

  • are you running a full (pruned) node or an archive node?
  • disk / storage type / filesystem ? is it local to the machine that runs erigon or a scalable volume like AWS EBS. What filesystem are you using there?

Using v1.0.8 and ... --sentry.drop-useless-peers --p2p.protocol=66... we are not encountering the no block headers/bodies error logs either (except on 1 testnet node, same issue as reported there) but our syncing loop on mainnet archive is just too slow (our mainnet/full nodes however perform better and remains relatively 'close' to head)

On average, it takes our best nodes to sync ~400 blocks over ~20m. Pasting a example below:

[INFO] [05-23|06:07:00.025] RPC Daemon notified of new headers       from=28453505 to=28453885 hash=0xbcc966a30ad2e90786e2c585868671bdcf813fd2b2c49f4de0c42121df41324e header sending=371.552µs log sending=316ns
... <stages 1 to 15>
[INFO] [05-23|06:26:40.867] Timings (slower than 50ms)               Headers=1.112s BlockHashes=546ms Bodies=1.251s Senders=288ms Execution=2m54.614s HashState=3m19.178s IntermediateHashes=3m2.516s CallTraces=2m13.753s AccountHistoryIndex=1m9.658s StorageHistoryIndex=2m10.962s LogIndex=1m34.543s TxLookup=1m48.088s

As a consequence, our nodes always fall behind head by > 400 blocks.
@harveyff @calmbeing FYI.

from bsc-erigon.

setunapo avatar setunapo commented on August 22, 2024

closed, seems the sync is much smooth, could reopen if reoccur

from bsc-erigon.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.