Comments (17)
@sourabh912 this is failing while adding dynamic partitions.
from spark-acid.
@amoghmargoor Thanks for confirming the issue, looking forward to the fix 👍
@shuwnyuantee we tried reproducing this issue in-house but could not. Can you give us some pointers on how to reproduce it?
Along with the pointers, the following information would also help:
- The complete driver stack trace (28 frames are missing).
- Which version of the Hive Metastore are you using?
- Steps to reproduce:
We built the Spark-Acid jar from commit ada7e2c586133a43 (dated Jun 4, 2020) and redeployed it to production. Then we restarted our Spark app and it crashed with the error below. We are now stuck in a bad loop, where any further restart throws a similar error.
- Complete driver stack trace:
com.qubole.shaded.hadoop.hive.metastore.api.NoSuchTxnException: No such transaction txnid:22534
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result$add_dynamic_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result$add_dynamic_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result.read(ThriftHiveMetastore.java)
at com.qubole.shaded.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_add_dynamic_partitions(ThriftHiveMetastore.java:5527)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.add_dynamic_partitions(ThriftHiveMetastore.java:5514)
at com.qubole.shaded.hadoop.hive.metastore.HiveMetaStoreClient.addDynamicPartitions(HiveMetaStoreClient.java:2669)
at com.qubole.spark.hiveacid.transaction.HiveAcidTxnManager.addDynamicPartitions(HiveAcidTxnManager.scala:350)
at com.qubole.spark.hiveacid.transaction.HiveAcidTxn.addDynamicPartitions(HiveAcidTxn.scala:108)
at com.qubole.spark.hiveacid.writer.TableWriter.process(TableWriter.scala:213)
at com.qubole.spark.hiveacid.HiveAcidTable$$anonfun$addBatch$1.apply$mcV$sp(HiveAcidTable.scala:275)
at com.qubole.spark.hiveacid.HiveAcidTable.inTxnRetry$1(HiveAcidTable.scala:107)
at com.qubole.spark.hiveacid.HiveAcidTable.inTxn(HiveAcidTable.scala:131)
at com.qubole.spark.hiveacid.HiveAcidTable.addBatch(HiveAcidTable.scala:273)
at com.qubole.spark.hiveacid.streaming.HiveAcidSink.addBatch(HiveAcidSink.scala:87)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
org.apache.spark.sql.streaming.StreamingQueryException: Query hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = 25008e19-fce0-497f-be1c-1020fd1d6942] terminated with exception: No such transaction txnid:22534
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
- Hive & Metastore version:
3.1.2
@amoghmargoor
Were you able to reproduce the issue? Any estimate on a potential fix?
Previously we built the Spark-Acid jar from commit 0e5bbf8074891 (dated Apr 8, 2020).
Is it OK if we revert to that commit for the moment? Will it cause any issues with Hive compaction?
Let me know if you have any suggestions, thanks! 🙏
@shuwnyuantee we were not able to reproduce it internally, but we have been discussing what could have gone wrong. I have created a new branch https://github.com/qubole/spark-acid/tree/issue-54 with the required log messages. Can you build a jar from that branch, run it on your side, and send us these two logs:
- Driver logs
- Hive Metastore logs
Another thing I wanted to know is whether you have:
- switched on the initiator (hive.compactor.initiator.on) on your end?
- changed hive.txn.timeout on your side?
Regarding your question: if your streaming job is creating new partitions, then this fix is required; if it is writing only to existing partitions, it is not. But we suspect that even without the fix it may fail while committing the transaction, hence we wanted to dig further and see what's going on.
@amoghmargoor I see your changes about logInfo. Do I have to change the log level somewhere in order to capture the logs?
Here's our current Hive config:
hive.compactor.initiator.on = true
hive.txn.timeout = 300 (default value, in seconds)
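(For reference, in a hive-site.xml these settings would typically look something like the fragment below; the values shown match our setup.)

```xml
<!-- Illustrative hive-site.xml fragment, matching the values listed above -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.timeout</name>
  <!-- plain numbers are interpreted as seconds -->
  <value>300</value>
</property>
```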
@shuwnyuantee you may not need to change anything if your driver logs already capture INFO logs.
Thanks for the info on the Hive config.
@amoghmargoor In which cases will those logs be emitted (as in commit 57b6916b2b753c)?
I built the jar as you suggested and tested the changes locally, but didn't see any related logs in:
- the driver
- the worker
- the Hive Metastore
Do they start appearing once data is streamed into the Hive table? Or do they only appear under a very specific condition that I might not have triggered in local testing?
P.S.: I haven't deployed it to production yet; I wanted to make sure I did the right thing before proceeding with the production deploy.
@shuwnyuantee yes, they appear once streaming starts, as they are in the addBatch() function and in the heartbeats. Your log level should be INFO or lower, i.e., TRACE.
@amoghmargoor I did something wrong previously, sorry, my bad.
I have captured the logs you requested here:
hive-metastore.log
spark-driver.log
spark-worker.log
Thanks for help 👍
Thanks @shuwnyuantee. This helps a great deal.
Let me first walk you through exactly what happened, in order:
- A new transaction is created. From the driver log:
04:26:59.654 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] INFO : Transaction created: 22565
- The Hive Metastore treats this transaction as aborted and removes it. This is the responsibility of the Initiator in the Hive Metastore. From the Hive Metastore logs:
2020-06-12T04:38:04,033 INFO [Thread-10]: compactor.Initiator (Initiator.java:run(97)) - Checking to see if we should compact default.tmp_clst5.year=2019/month=7/day=20
2020-06-12T04:38:04,064 INFO [Thread-10]: txn.CompactionTxnHandler (CompactionTxnHandler.java:cleanEmptyAbortedTxns(605)) - Removed 1 empty Aborted transactions from TXNS
2020-06-12T04:38:04,064 INFO [Thread-10]: txn.CompactionTxnHandler (CompactionTxnHandler.java:cleanEmptyAbortedTxns(607)) - Aborted transactions removed from TXNS: [22565]
- Since the Initiator removed the transaction, either committing it or adding dynamic partitions to it now fails, saying no such transaction exists. From the driver log:
04:38:53.219 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] INFO : Adding dynamic partition txnId: 22565 writeId: 740 dbName: default tableName: clickstream_events_sink_v1 partitions: year=2020/month=6/day=4
04:38:53.240 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] ERROR: Exception NoSuchTxnException(message:No such transaction txnid:22565)
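Note that the timestamps are consistent with the 300 s hive.txn.timeout: the transaction was opened at 04:26:59 and cleaned up at 04:38:04, about 665 seconds later, with no heartbeat in between. A quick arithmetic check (a sketch for illustration, not project code):

```java
// Sanity check: the gap between "Transaction created" and the Initiator's
// cleanup far exceeds the hive.txn.timeout default of 300 seconds.
public class TimeoutCheck {
    static final int OPENED_SEC  = 4 * 3600 + 26 * 60 + 59; // 04:26:59, driver log
    static final int CLEANED_SEC = 4 * 3600 + 38 * 60 + 4;  // 04:38:04, metastore log
    static final int TXN_TIMEOUT_SEC = 300;                 // hive.txn.timeout default

    public static void main(String[] args) {
        int elapsed = CLEANED_SEC - OPENED_SEC;
        System.out.println("Seconds without a heartbeat: " + elapsed);        // prints 665
        System.out.println("Exceeds timeout: " + (elapsed > TXN_TIMEOUT_SEC)); // prints true
    }
}
```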
So @shuwnyuantee, as you can see, even if you revert the commit that adds dynamic partitions, it will still fail, because the issue is elsewhere: without that commit it will fail while committing the transaction instead.
So coming back to the issue here: why does the Hive Metastore regard an ongoing transaction as aborted?
The most probable reason: to keep a transaction alive, the Spark application has to keep sending heartbeats to the Hive Metastore; otherwise the Metastore will abort that transaction. See the file HiveAcidTxnManager.scala: there is a class named HeartBeatRunnable that takes care of sending those heartbeats. I have added logs to it too.
Do you see any heartbeat logs in the driver, related to either failure or success?
This is good progress.
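To make the heartbeat mechanism concrete, here is a minimal, hypothetical sketch (not the actual spark-acid implementation; the class name and the sendHeartbeat callback are illustrative): the client must renew its transaction lease more often than hive.txn.timeout, or the Metastore eventually treats the transaction as timed out and aborts it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongPredicate;

// Hypothetical heartbeat loop: renew the transaction lease well before
// hive.txn.timeout (300 s by default) expires, so the Metastore never
// considers the open transaction abandoned.
public class HeartbeatSketch {
    static final long TXN_TIMEOUT_MS = 300_000L;                  // hive.txn.timeout
    static final long HEARTBEAT_INTERVAL_MS = TXN_TIMEOUT_MS / 2; // renew early

    // Schedules periodic heartbeats; in a real client, sendHeartbeat would
    // wrap the Metastore's heartbeat RPC. The scheduler is returned so the
    // caller can shut it down when the transaction commits or aborts.
    static ScheduledExecutorService start(long txnId, LongPredicate sendHeartbeat) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (sendHeartbeat.test(txnId)) {
                System.out.println("Heartbeat sent for txnId " + txnId);
            } else {
                System.out.println("Failure to heartbeat for txnId " + txnId);
            }
        }, 0, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Picking half the timeout as the interval leaves slack for RPC delays; the key point is that if this loop ever stops (as it seems to have here), the Initiator cleans the transaction up and every later operation on it fails with NoSuchTxnException.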
In the Spark driver log, I didn't find any logs about spark-acid sending heartbeats. The keywords I searched for were:
- Heartbeat failure for transaction id
- Heartbeat sent for txnId
- Failure to heartbeat for txnId
@shuwnyuantee That's concerning... so the HeartbeatRunner has somehow failed or is not running in your setup. Is it possible to send the entire driver log of the application? This needs to be solved, otherwise it will affect your production.
Here's the complete driver log:
@shuwnyuantee I have hopefully figured out the issue and am going to make a fix ASAP. This should be fixed in a day or two.
@shuwnyuantee this is fixed and merged. Thanks for reporting.