
Comments (39)

id commented on June 16, 2024

Still not able to reproduce, either with EMQX on the host or with EMQX in Docker with a memory limit.

React-less Node.js code to simplify troubleshooting:

process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";
//process.env.DEBUG = "*";
process.env.NODE_ENV = "dev";
const mqtt = require("mqtt");
const mqttClient = mqtt.connect({
  port: 8084,
  path: '/mqtt',
  clientId: 'test',
  username: 'test',
  password: 'test',
  protocol: 'wss',
  hostname: '127.0.0.1',
  keepalive: 120,
  clean: false,
  connectTimeout: 1000,
  reconnectPeriod: 0,
})

mqttClient.on('packetsend', (packet) => {
  console.log(packet)
})

mqttClient.on('connect', () => {
  // Change the last number to vary the number of megabytes in the payload (roughly accurate).
  const payload = JSON.stringify({ field: 'x'.repeat(1000 * 1000 * 90) })
  mqttClient.publish('t/test', payload)
})

Run with:

npm install mqtt --save
node ./test.js


bernard-bear commented on June 16, 2024

I presume that I should run the command at resting state?

Here are the files for both pods:
alloc-emqx-cluster-core-5854b66996-0.txt
alloc-emqx-cluster-core-5854b66996-1.txt


id commented on June 16, 2024

Tried with vanilla EMQX 5.3.2 installed via emqx-operator on GKE and the test script above; could not reproduce. No OOM, and no memory-usage spike.

gcloud container clusters create emqx
gcloud container clusters get-credentials emqx
helm repo add jetstack https://charts.jetstack.io
helm repo add emqx https://repos.emqx.io/charts
helm repo update
helm upgrade --install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --set installCRDs=true
helm upgrade --install emqx-operator emqx/emqx-operator --namespace emqx-operator-system --create-namespace
kubectl wait --for=condition=Ready pods -l "control-plane=controller-manager" -n emqx-operator-system
kubectl create namespace emqx
kubectl apply -f emqx.yaml
kubectl -n emqx wait --for=condition=Ready emqx emqx --timeout=120s
kubectl -n emqx get svc

emqx.yaml

apiVersion: apps.emqx.io/v2beta1
kind: EMQX
metadata:
  name: emqx
  namespace: emqx
spec:
  image: emqx/emqx:5.3.2
  coreTemplate:
    spec:
      replicas: 3
      volumeClaimTemplates:
        resources:
          requests:
            storage: 10Gi
        accessModes:
          - ReadWriteOnce
  listenersServiceTemplate:
    metadata:
      annotations:
        cloud.google.com/l4-rbs: "enabled"
    spec:
      type: LoadBalancer
  dashboardServiceTemplate:
    spec:
      type: LoadBalancer


qzhuyan commented on June 16, 2024

@bernard-bear glad to know, thanks!

A larger driver buffer puts less stress on the memory allocator: it creates fewer blocks of larger size, causes less fragmentation (I think this is the main reason), and incurs fewer context switches, which basically reduces the allocator's work.

To be clear, this is about memory spikes, not sustained memory usage; the node reclaims and defragments the memory once it has the resources to do so.

If you want to limit memory usage, consider setting the limit to about 1.3 times the spike measured in testing (for example, a ~1.7GB spike would suggest a limit of roughly 2.2GB).


id commented on June 16, 2024

Hi @bernard-bear, thanks for the report. I could not reproduce this locally, will try on GKE.


ieQu1 commented on June 16, 2024

How many subscribers to the topic were there? Do you have any rule engine rules?

If the broker has to fan-out a message to N subscribers, it will create N copies of the message.


zmstone commented on June 16, 2024

The behaviour described above happens even if the max packet size is set to a low value (e.g. 1mb)

@bernard-bear frame_too_large should be logged if the mqtt.max_packet_size limit is hit. Could you check whether you observe such logs by setting this limit to 1MB and sending a 2MB or 10MB message?

If this limit indeed works, then we can significantly narrow the scope of the investigation.
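
For illustration, a rough sketch of running that check against the pods from this thread (the pod name is taken from a later comment; kubectl access and the message being visible in the console log are assumptions):

# send a ~2MB payload with the Node.js script from the first comment
# (change 'x'.repeat(1000 * 1000 * 90) to 'x'.repeat(1000 * 1000 * 2)),
# with mqtt.max_packet_size = 1MB on the broker, then look for the log line:
kubectl logs emqx-cluster-core-5854b66996-0 | grep frame_too_large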


zmstone commented on June 16, 2024

@bernard-bear I noticed that you have debug-level logging.
Could you try testing it at info level?


bernard-bear commented on June 16, 2024

How many subscribers to the topic were there?

@ieQu1 The client was the only client, so there were no subscribers to the topic.

Do you have any rule engine rules?

@ieQu1 Nope.

frame_too_large

@zmstone After sending a 2MB message with a 1MB limit, I do not observe any frame_too_large via log trace on my client id.

These are the logs (info level):

2024-01-19T03:05:32.784021+00:00 [MQTT] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: mqtt_packet_received, packet: CONNECT(Q0, R0, D0, ClientId=bernard@<redacted>, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=bernard@<redacted>, Password=******)
2024-01-19T03:05:32.784264+00:00 [AUTHN] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: jwt_verify_error, jwk: {jose_jwk,undefined,{jose_jwk_kty_rsa,{'RSAPublicKey',<redacted>}},#{}}, jwt: <redacted>, provider: emqx_authn_jwt, reason: {badarg,[<<"<redacted>">>]}
2024-01-19T03:05:32.784399+00:00 [AUTHN] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: invalid_jwt_signature, jwks: [{jose_jwk,undefined,{jose_jwk_kty_rsa,{'RSAPublicKey',<redacted>}},#{}}], jwt: <redacted>, provider: emqx_authn_jwt
2024-01-19T03:05:32.784490+00:00 [AUTHN] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: authenticator_result, authenticator: jwt, result: ignore
2024-01-19T03:05:32.784583+00:00 [AUTHN] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: authenticator_result, authenticator: password_based:built_in_database, result: {ok,#{is_superuser => false}}
2024-01-19T03:05:32.784620+00:00 [AUTHN] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: authentication_result, reason: chain_result, result: {stop,{ok,#{is_superuser => false}}}
2024-01-19T03:05:32.786528+00:00 [WS-MQTT] bernard@<redacted>@XXX.XXX.XXX.XXX:XXXXX msg: mqtt_packet_sent, packet: CONNACK(Q0, R0, D0, AckFlags=0, ReasonCode=0)
<MQTT client gets disconnected at this point, and subsequently reconnects>
2024-01-19T03:05:49.691282+00:00 [MQTT] bernard@<redacted>@YYY.YYY.YYY.YYY:YYYYY msg: mqtt_packet_received, packet: CONNECT(Q0, R0, D0, ClientId=bernard@<redacted>, ProtoName=MQTT, ProtoVsn=4, CleanStart=false, KeepAlive=120, Username=bernard@<redacted>, Password=******)

@bernard-bear noticed that you have debug level logging.
could you try to test it at info level ?

@zmstone It was only on debug level for a short duration. I've tested it at info level as well multiple times and the same issue occurs.


zmstone commented on June 16, 2024

The mqtt_packet_received log is only produced at debug level or as a trace log.
Maybe your config changes did not take effect?
Could you please share emqx.conf and data/cluster.hocon, or the configs in your k8s yaml files?


bernard-bear commented on June 16, 2024

The mqtt_packet_received log is only produced at debug level or as a trace log.
Maybe your config changes did not take effect?

I tried again and verified in cluster.hocon that the console logging level is info, but mqtt_packet_received is still produced. I was using the "Log Trace" feature on the web UI dashboard. Is that what you mean by trace log?


bernard-bear commented on June 16, 2024

Here are the configs as requested:

/opt/emqx/etc/emqx.conf
mqtt {
  peer_cert_as_username = "cn"
  max_packet_size = 1MB
}
retainer {
  msg_expiry_interval = 2160h
  max_payload_size = 1MB
  msg_clear_interval = 1h
  backend {
      storage_type = disc
  }
}
dashboard {
  listeners {
    https {
      bind = "0.0.0.0:18084"
      ssl_options {
        cacertfile = "/opt/emqx/etc/certs/ca.crt"
        certfile = "/opt/emqx/etc/certs/tls.crt"
        keyfile = "/opt/emqx/etc/certs/tls.key"
      }
    }
  }
}
api_key.bootstrap_file = "/opt/emqx/etc/bootstrap_api_key"
authorization.no_match = deny
authentication = [
  {
    algorithm = "public-key"
    enable = true
    from = password
    mechanism = jwt
    public_key = "<redacted>"
    use_jwks = false
  }
]
listeners.ssl.default {
  bind = "0.0.0.0:8883"
  enable_authn = false
  ssl_options {
    cacertfile = "/opt/emqx/etc/certs/ca.crt"
    certfile = "/opt/emqx/etc/certs/tls.crt"
    keyfile = "/opt/emqx/etc/certs/tls.key"
    verify = verify_peer
    fail_if_no_peer_cert = true
  }
}
listeners.ws.default {
  bind = "0.0.0.0:8083"
  enable = false
}
listeners.wss.default {
  bind = "0.0.0.0:8084"
  ssl_options {
    cacertfile = "/opt/emqx/etc/certs/ca.crt"
    certfile = "/opt/emqx/etc/certs/tls.crt"
    keyfile = "/opt/emqx/etc/certs/tls.key"
  }
}
log.console {
  enable = true
  level = info
}

/opt/emqx/data/configs/cluster.hocon
authentication = [
  {
    acl_claim_name = acl
    algorithm = public-key
    enable = true
    from = password
    mechanism = jwt
    public_key = "<redacted>"
    use_jwks = false
    verify_claims = ""
  },
  {
    backend = built_in_database
    mechanism = password_based
    password_hash_algorithm {name = sha256, salt_position = suffix}
    user_id_type = username
  }
]
authorization {
  cache {
    enable = true
    max_size = 32
    ttl = 1m
  }
  deny_action = ignore
  no_match = deny
  sources = [
    {
      body {clientid = "${clientid}", topic = "${topic}"}
      connect_timeout = 15s
      enable = true
      enable_pipelining = 100
      headers {
        accept = "application/json"
        cache-control = no-cache
        connection = keep-alive
        content-type = "application/json"
        keep-alive = "timeout=30, max=1000"
      }
      method = post
      pool_size = 8
      request_timeout = 30s
      ssl {
        ciphers = []
        depth = 10
        enable = false
        hibernate_after = 5s
        log_level = notice
        reuse_sessions = true
        secure_renegotiate = true
        verify = verify_peer
        versions = [tlsv1.3, tlsv1.2]
      }
      type = http
      url = "<redacted>"
    },
    {
      enable = true
      path = "${EMQX_ETC_DIR}/acl.conf"
      type = file
    }
  ]
}
log {
  console {
    enable = true
    formatter = text
    level = info
    time_offset = system
  }
  file {
    default {
      enable = false
      formatter = text
      level = warning
      path = "/opt/emqx/log/emqx.log"
      rotation_count = 10
      rotation_size = 50MB
      time_offset = system
    }
  }
}
mqtt {
  await_rel_timeout = 300s
  exclusive_subscription = false
  idle_timeout = 15s
  ignore_loop_deliver = false
  keepalive_multiplier = 1.5
  max_awaiting_rel = 100
  max_clientid_len = 65535
  max_inflight = 32
  max_mqueue_len = 1000
  max_packet_size = 1MB
  max_qos_allowed = 2
  max_subscriptions = infinity
  max_topic_alias = 65535
  max_topic_levels = 128
  mqueue_default_priority = lowest
  mqueue_priorities = disabled
  mqueue_store_qos0 = true
  peer_cert_as_clientid = disabled
  peer_cert_as_username = cn
  response_information = ""
  retain_available = true
  retry_interval = 30s
  server_keepalive = disabled
  session_expiry_interval = 2h
  shared_subscription = true
  shared_subscription_strategy = round_robin
  strict_mode = false
  upgrade_qos = false
  use_username_as_clientid = false
  wildcard_subscription = true
}


qzhuyan commented on June 16, 2024

I was using the "Log Trace" feature on the web UI dashboard. Is that what you mean by trace log?

yes


qzhuyan commented on June 16, 2024

@bernard-bear have you tried to disable the trace? do you still get memory spikes?


bernard-bear commented on June 16, 2024

@bernard-bear have you tried to disable the trace? do you still get memory spikes?

Yes, the memory spike was reproduced multiple times consistently, both with log trace on and off.


qzhuyan commented on June 16, 2024

I cannot reproduce it with the default EMQX config and mqtt.max_packet_size set to 200M: publishing a 150M payload, the memory heap spike stays below 600M (up from 270M).

However, I could easily reproduce what you described when turning on debug tracing, which bumps the memory usage to >2GB.

On average, it seems that the memory spike is about 20 times the payload size (i.e. if the payload is 50MB, memory usage increases by around 1000MB)

Could you run these commands in your container to see if they are disabled?

emqx eval 'persistent_term:get(emqx_trace_filter, [])'

emqx eval 'emqx_logger:get_primary_log_level()'


bernard-bear commented on June 16, 2024

These are the results from the commands (same for both nodes):

emqx eval 'persistent_term:get(emqx_trace_filter, [])'
[]
emqx eval 'emqx_logger:get_primary_log_level()'
info


qzhuyan commented on June 16, 2024

thanks for the update.

I am comparing the difference between your setup and mine.

to clarify:

You have the configs listed in #12344 (comment)

And only one client sends a single publish message (QoS 0?) with a 90M payload to EMQX over wss (secure WebSocket), and then EMQX gets OOM-killed?

The behaviour described above happens even if the max packet size is set to a low value (e.g. 1mb), which means that the memory spike occurs even before the publish packet has been accepted by the broker.

Do you mean that you could reproduce the issue (OOM kill) by setting "max packet size" to 1MB and sending a 90M payload,
AND that you could also reproduce the issue by setting "max packet size" to 100MB and sending a 90M payload?

Could you provide an example message? (You can strip the payload; only the headers are of interest.)


bernard-bear commented on June 16, 2024

You have the configs listed in #12344 (comment)

Yup

And only one client sends a single publish message (QoS 0?) with a 90M payload to EMQX over wss (secure WebSocket), and then EMQX gets OOM-killed?

Yes, this is correct. It is QoS 0. I have also verified that bytes.received in the metrics dashboard increases by the correct number of bytes (e.g. 90M) after the message is published.

Do you mean that you could reproduce the issue (OOM kill) by setting "max packet size" to 1MB and sending a 90M payload,
AND that you could also reproduce the issue by setting "max packet size" to 100MB and sending a 90M payload?

Yes, this is also correct. In the former case, the client is forcefully disconnected, and the message doesn't actually get published. In the latter case, the client remains connected, and the message does get published. But in both cases the memory spike occurs, and if the spike exceeds the memory limit, the broker gets OOM-killed.

Could you provide an example message? (you could strip the payload, only headers are interested).

I am using the MQTT.js library (version 5.1.3) with all the standard defaults, which includes connecting via the MQTT v3.1.1 protocol. Here's a minimal reproducible example (I am using React, but plain JavaScript should behave the same). Previously, I had also seen the same memory-spike OOM kill with a different MQTT client (Python Paho), but I have not done any further testing with that.

import { useEffect } from 'react'
import { connect } from 'mqtt/dist/mqtt.min'

const MqttPage = () => {
  useEffect(() => {
    const mqttClient = connect({
      port: 8084,
      path: '/mqtt',
      clientId: '<redacted>',
      username: '<redacted>',
      password: '<redacted>',
      protocol: 'wss',
      hostname: '<redacted>',
      keepalive: 120,
      clean: false,
    })

    mqttClient.on('packetsend', (packet) => {
      console.log(packet)
    })

    mqttClient.on('connect', () => {
      // Change the last number to vary the number of megabytes in the payload (roughly accurate).
      const payload = JSON.stringify({ field: 'x'.repeat(1000 * 1000 * 90) })
      mqttClient.publish('topic_name', payload)
    })

    // Close the connection when the component unmounts.
    return () => {
      mqttClient.end()
    }
  }, []) // run once on mount so the publish happens a single time

  return null
}

export default MqttPage

Packet (captured via the console.log):

{
    "cmd": "publish",
    "topic": "topic_name",
    "payload": "{\"field\":\"<truncated>"}",
    "qos": 0,
    "retain": false,
    "messageId": 0,
    "dup": false
}


qzhuyan commented on June 16, 2024

I suspect something related to ACL; could you try disabling ACL? I got a memory bump when ACL via HTTP failed.


qzhuyan commented on June 16, 2024

I think the issue also correlates with clean: false. Once I managed to reproduce it, I could always reproduce it with clean: false; only using clean: true got rid of the memory bump.

There must be some garbage left in the system.


qzhuyan commented on June 16, 2024

Please try setting this environment variable and restarting the EMQX pod to see if it makes any difference.

This is for testing a 100M payload, assuming a network MTU of 1400:

ERL_FLAGS='+MBmbcgs 50 +MBsmbcs 8192'
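
For illustration, a minimal sketch of applying these flags when (re)starting a node from a shell inside the container; how to inject environment variables through the operator-managed pod spec may differ, so treat this as an assumption about the setup:

# the variable has to be in the emqx process environment when it starts
export ERL_FLAGS='+MBmbcgs 50 +MBsmbcs 8192'
emqx stop
emqx start

# or inline on the same command line
ERL_FLAGS='+MBmbcgs 50 +MBsmbcs 8192' emqx start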


bernard-bear commented on June 16, 2024

I suspect something related to ACL; could you try disabling ACL? I got a memory bump when ACL via HTTP failed.

I can still reproduce the memory spike after turning off both authentication methods (JWT and built-in database) and both authorization methods (HTTP server and file). I've also tried with both clean: false and clean: true.

I think the issue also correlates with clean: false

I can still reproduce the memory spike after changing to clean: true.

Please try setting this environment variable and restarting the EMQX pod to see if it makes any difference.

I restarted both pods via these commands:

emqx stop
ERL_FLAGS='+MBmbcgs 50 +MBsmbcs 8192'; emqx start

However, I am still reproducing the same memory spike.

By the way, have you tried with GKE?


qzhuyan commented on June 16, 2024

However, I am still reproducing the same memory spike.

The spike is unavoidable due to the buffering, but the peak should be lower, e.g. from 2GB down to 900M.

By the way, have you tried with GKE?

I don't think it has anything to do with GKE in terms of memory usage, unless the memory-usage report is wrong.
I assume that when you say memory spikes, you are reading them from ps or top for the emqx process, right?


bernard-bear commented on June 16, 2024

The spike is unavoidable due to the buffering, but the peak should be lower, e.g. from 2GB down to 900M.

The peak was not lowered, unfortunately. The broker still crashed due to OOM.

I assume that when you say memory spikes, you are reading them from ps or top for the emqx process, right?

I'm reading it from the EMQX web dashboard, as seen in the video above.

Is there anything else we can try?


qzhuyan commented on June 16, 2024

I'm reading it from the EMQX web dashboard, as seen in the video #12344 (comment)

That is OS memory, not the memory the emqx process uses. In a container environment it may be the host memory usage, which includes the other pods.
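
For illustration, two ways to read the emqx process memory more directly (a sketch; it assumes ps is available in the container and reuses emqx eval as in the commands above):

# RSS of the BEAM process only, from a shell inside the pod
ps -o pid,rss,vsz,comm -C beam.smp

# or ask the Erlang VM itself how many bytes it has allocated
emqx eval 'erlang:memory(total).'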


bernard-bear commented on June 16, 2024

That is OS memory, not the memory the emqx process uses. In a container environment it may be the host memory usage, which includes the other pods.

I just checked the memory via top using the following steps (the steps below are for one pod, but I did the same for the other pod):

bernard@...........$ kubectl exec -it emqx-cluster-core-5854b66996-0 bash
emqx@emqx-cluster-core-5854b66996-1:/opt/emqx$ top

At resting state, the usage hovers around these values:

top - 09:11:38 up 22 days, 38 min,  0 users,  load average: 0.14, 0.31, 0.30
Tasks:   8 total,   1 running,   7 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.8 us,  1.4 sy,  0.0 ni, 90.4 id,  1.1 wa,  0.0 hi,  0.3 si,  0.1 st
MiB Mem :  16006.2 total,  11354.6 free,   1504.5 used,   3147.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14040.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                      
      1 emqx      20   0 2982.8m 247.8m  97.6m S   2.6   1.5   3:53.33 beam.smp                     
    279 emqx      20   0    2.3m   0.6m   0.5m S   0.0   0.0   0:00.00 erl_child_setup              
    466 emqx      20   0    3.6m   0.9m   0.8m S   0.0   0.0   0:00.02 inet_gethost                 
    467 emqx      20   0    3.8m   1.8m   1.7m S   0.0   0.0   0:00.24 inet_gethost                 
    470 emqx      20   0    2.2m   0.5m   0.4m S   0.0   0.0   0:01.49 memsup                       
    471 emqx      20   0    2.3m   0.6m   0.5m S   0.0   0.0   0:00.02 cpu_sup                      
    742 emqx      20   0    5.9m   3.8m   3.3m S   0.0   0.0   0:00.36 bash                         
    751 emqx      20   0    8.7m   3.5m   3.0m R   0.0   0.0   0:00.00 top 

After sending a message, the memory usage for beam.smp increases by a significant amount, similar to what I saw in the dashboard. Is that the emqx process memory usage? Or is there another way to get just the memory for the emqx process?


qzhuyan commented on June 16, 2024

I read RES: 247.8 MB.

Yes, beam.smp is the emqx process.

Do you have a top snapshot from when you get the spikes?


bernard-bear commented on June 16, 2024

Do you have a top snapshot from when you get the spikes?

Here it is:

This was with a 60MB payload. The broker did not crash.

top - 09:20:16 up 22 days, 47 min,  0 users,  load average: 0.36, 0.17, 0.21
Tasks:   9 total,   1 running,   8 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.6 us,  7.4 sy,  0.0 ni, 82.4 id,  1.0 wa,  0.0 hi,  0.5 si,  0.1 st
MiB Mem :  16006.2 total,   9833.9 free,   3023.8 used,   3148.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12521.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                      
      1 emqx      20   0 4573.9m   1.7g  97.6m S  50.2  10.9   4:13.39 beam.smp                     
    279 emqx      20   0    2.3m   0.6m   0.5m S   0.0   0.0   0:00.00 erl_child_setup              
    466 emqx      20   0    3.6m   0.9m   0.8m S   0.0   0.0   0:00.02 inet_gethost                 
    467 emqx      20   0    3.8m   1.8m   1.7m S   0.0   0.0   0:00.27 inet_gethost                 
    470 emqx      20   0    2.2m   0.5m   0.4m S   0.0   0.0   0:01.57 memsup                       
    471 emqx      20   0    2.3m   0.6m   0.5m S   0.0   0.0   0:00.02 cpu_sup                      
    742 emqx      20   0    5.9m   3.8m   3.3m S   0.0   0.0   0:00.36 bash                         
    752 emqx      20   0    5.9m   3.7m   3.2m S   0.0   0.0   0:00.37 bash                         
    758 emqx      20   0    8.7m   3.6m   3.1m R   0.0   0.0   0:00.02 top                  


qzhuyan commented on June 16, 2024

Yes, it is indeed an issue.

Could you run this command to fetch the allocator counters?

emqx eval 'recon_alloc:snapshot(), recon_alloc:snapshot_save("/tmp/alloc.txt").'

and send a copy of /tmp/alloc.txt from the container?


qzhuyan commented on June 16, 2024

I checked the allocator counters. The memory spike is caused by many blocks that have not been GCed.

It could be caused by slow memory allocation, or by a busy CPU that doesn't have enough resources to do the GCs in that short period. Once the workload is low, GC gets done and memory usage goes back to normal.
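
For illustration, a couple of quick allocator checks using the recon library already used for the snapshot above (a sketch; interpret the numbers loosely):

# fraction of allocated memory actually in use; a low value points at fragmentation / un-GCed blocks
emqx eval 'recon_alloc:memory(usage).'

# per-allocator fragmentation details
emqx eval 'recon_alloc:fragmentation(current).'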

What resource limits did you set on the EMQX pod in terms of CPU and memory?

@id in your GKE test, did you set resource limits?


bernard-bear commented on June 16, 2024

What resource limits did you set on the EMQX pod in terms of CPU and memory?

These are the resource limits:

resources:
  limits:
    cpu: 500m
    ephemeral-storage: 1Gi
    memory: 2Gi
  requests:
    cpu: 500m
    ephemeral-storage: 1Gi
    memory: 2Gi


qzhuyan commented on June 16, 2024

OK, try removing the CPU limit and see what happens.


bernard-bear commented on June 16, 2024

Tried again with higher resource limits:

resources:
  limits:
    cpu: "2"
    ephemeral-storage: 1Gi
    memory: 2Gi
  requests:
    cpu: "2"
    ephemeral-storage: 1Gi
    memory: 2Gi

I still observe a similar memory spike. Resting values:

top - 07:33:28 up  1:40,  0 users,  load average: 0.45, 0.22, 0.36
Tasks:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):   0.0/1.5     2[|                                                    ]
MiB Mem :  16006.2 total,  10876.8 free,   1170.9 used,   3958.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14381.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
      1 emqx      20   0 2958.8m 229.4m  97.7m S   0.0   1.4   0:59.55 beam.smp 
    271 emqx      20   0    2.3m   0.5m   0.4m S   0.0   0.0   0:00.00 erl_chi+ 
    299 emqx      20   0    3.6m   0.8m   0.7m S   0.0   0.0   0:00.00 inet_ge+ 
    300 emqx      20   0    3.8m   1.7m   1.6m S   0.0   0.0   0:00.00 inet_ge+ 
    301 emqx      20   0    2.2m   0.6m   0.5m S   0.0   0.0   0:00.44 memsup   
    302 emqx      20   0    2.3m   0.6m   0.5m S   0.0   0.0   0:00.05 cpu_sup  
    307 emqx      20   0    3.8m   1.7m   1.6m S   0.0   0.0   0:00.00 inet_ge+ 
    308 emqx      20   0    5.9m   3.7m   3.2m S   0.0   0.0   0:00.37 bash     
    315 emqx      20   0    8.7m   3.6m   3.1m S   0.0   0.0   0:00.05 top      
    316 emqx      20   0    5.9m   3.8m   3.3m S   0.0   0.0   0:00.38 bash     
    323 emqx      20   0    8.7m   3.5m   3.1m R   0.0   0.0   0:00.05 top  

When a 60MB payload is published once (QoS 0):

top - 07:36:25 up  1:43,  0 users,  load average: 0.90, 0.58, 0.47
Tasks:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):  19.2/16.5   36[|||||||||||||||||||                                  ]
MiB Mem :  16006.2 total,   9820.0 free,   2226.9 used,   3959.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  13325.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
      1 emqx      20   0 4109.0m   1.3g  97.7m S  98.7   8.2   1:21.72 beam.smp 
    271 emqx      20   0    2.3m   0.5m   0.4m S   0.0   0.0   0:00.00 erl_chi+ 
    299 emqx      20   0    3.6m   0.8m   0.7m S   0.0   0.0   0:00.00 inet_ge+ 
    300 emqx      20   0    3.8m   1.7m   1.6m S   0.0   0.0   0:00.00 inet_ge+ 
    301 emqx      20   0    2.2m   0.6m   0.5m S   0.0   0.0   0:00.49 memsup   
    302 emqx      20   0    2.3m   0.6m   0.5m S   0.0   0.0   0:00.06 cpu_sup  
    307 emqx      20   0    3.8m   1.7m   1.6m S   0.0   0.0   0:00.00 inet_ge+ 
    308 emqx      20   0    5.9m   3.7m   3.2m S   0.0   0.0   0:00.37 bash     
    315 emqx      20   0    8.7m   3.6m   3.1m S   0.0   0.0   0:00.13 top      
    316 emqx      20   0    5.9m   3.8m   3.3m S   0.0   0.0   0:00.38 bash     
    323 emqx      20   0    8.7m   3.5m   3.1m R   0.0   0.0   0:00.12 top



qzhuyan commented on June 16, 2024

Is that 1.3g the peak you got? That looks reduced from 1.7g.


bernard-bear commented on June 16, 2024

Is that 1.3g the peak you got? That looks reduced from 1.7g.

Possibly, but I suspect it could also just be because top updates at a low frequency. Are you suggesting increasing the CPU limit even further? Is it normal to consume this amount of CPU?


qzhuyan commented on June 16, 2024

I could reproduce the issue if I use cgroups to limit the CPU resources.

The peak goes up from 600MB to 950MB. When there is memory pressure, the peak can go up to 1.3GB.

However, I found that the default wss socket buffer is too small in your case. Could you try setting this environment variable:

EMQX_LISTENERS__WSS__default__tcp_options__buffer=8388608
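
For illustration, a minimal sketch of applying it to a node started from a shell; for the operator-managed pods the variable would need to go into the pod environment instead, which is an assumption about your deployment:

# 8388608 bytes = 8 MiB socket buffer for the default wss listener
# (the double-underscore name maps onto the config path listeners.wss.default.tcp_options.buffer)
export EMQX_LISTENERS__WSS__default__tcp_options__buffer=8388608
emqx stop
emqx start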


qzhuyan commented on June 16, 2024

Is it normal to consume this amount of CPU?

I cannot tell, because the platforms differ; only testing can tell.


bernard-bear commented on June 16, 2024

However, I found that the default wss socket buffer is too small in your case. Could you try setting this environment variable:
EMQX_LISTENERS__WSS__default__tcp_options__buffer=8388608

Hi @qzhuyan, this seems to resolve the memory-spike problem. Now, with a 60MB payload, the memory usage increases from ~200MB to ~500-800MB, which is much lower than before (previously it would increase to >1GB). The CPU limit didn't seem to matter; the memory usage was roughly the same with 0.5 CPU vs 2 CPU.

I think we can consider this issue closed. Thanks so much to you and your colleagues for the prompt assistance with this!

Out of curiosity, do you have an idea how the TCP buffer value might have affected the memory usage?

