Comments (17)

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@sourabh912 This is failing while adding dynamic partitions.

from spark-acid.

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor Thanks for confirming the issue, looking forward to the fix 👍

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee we tried reproducing this issue in-house but could not. Can you share some pointers that would help us reproduce it?
Along with those pointers, the following information would also help:

  1. The complete stack trace from the driver (there are 28 frames missing).
  2. Which version of the Hive Metastore are you using?

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor

  1. Steps to reproduce:

We built the Spark-Acid jar from commit ada7e2c586133a43 (dated Jun 4, 2020) and redeployed it to production. Then we restarted our Spark app and it crashed with the error below. We are now stuck in a crash loop, where any further restart throws a similar error.

  2. Complete stack trace from the driver log:
com.qubole.shaded.hadoop.hive.metastore.api.NoSuchTxnException: No such transaction txnid:22534
    at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result$add_dynamic_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
    at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result$add_dynamic_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
    at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result.read(ThriftHiveMetastore.java)
    at com.qubole.shaded.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
    at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_add_dynamic_partitions(ThriftHiveMetastore.java:5527)
    at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.add_dynamic_partitions(ThriftHiveMetastore.java:5514)
    at com.qubole.shaded.hadoop.hive.metastore.HiveMetaStoreClient.addDynamicPartitions(HiveMetaStoreClient.java:2669)
    at com.qubole.spark.hiveacid.transaction.HiveAcidTxnManager.addDynamicPartitions(HiveAcidTxnManager.scala:350)
    at com.qubole.spark.hiveacid.transaction.HiveAcidTxn.addDynamicPartitions(HiveAcidTxn.scala:108)
    at com.qubole.spark.hiveacid.writer.TableWriter.process(TableWriter.scala:213)
    at com.qubole.spark.hiveacid.HiveAcidTable$$anonfun$addBatch$1.apply$mcV$sp(HiveAcidTable.scala:275)
    at com.qubole.spark.hiveacid.HiveAcidTable.inTxnRetry$1(HiveAcidTable.scala:107)
    at com.qubole.spark.hiveacid.HiveAcidTable.inTxn(HiveAcidTable.scala:131)
    at com.qubole.spark.hiveacid.HiveAcidTable.addBatch(HiveAcidTable.scala:273)
    at com.qubole.spark.hiveacid.streaming.HiveAcidSink.addBatch(HiveAcidSink.scala:87)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
org.apache.spark.sql.streaming.StreamingQueryException: Query hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = 25008e19-fce0-497f-be1c-1020fd1d6942] terminated with exception: No such transaction txnid:22534
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
  3. Hive & Metastore version: 3.1.2

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor
Are you able to reproduce the issue? Any estimate on a potential fix?

Previously we built the Spark-Acid jar from commit 0e5bbf8074891 (dated Apr 8, 2020).
Is it OK if we revert to that commit for the moment? Would it cause any issues with Hive compaction?

Let me know if you have any suggestions, thanks! 🙏

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee we were not able to reproduce it internally, but we have been discussing what could possibly have gone wrong. I have created a new branch, https://github.com/qubole/spark-acid/tree/issue-54, with the required log messages. Can you build a jar from that branch, run it on your side, and send us these two logs:

  1. Driver logs
  2. Hive metastore logs

Another thing I wanted to know:

  1. Have you switched on the initiator (hive.compactor.initiator.on) on your end?
  2. Have you changed hive.txn.timeout on your side?

Btw, regarding your question:
If your streaming job creates new partitions, then this fix is required. If it only writes to existing partitions, the fix is not required. But we suspect that even without the fix it may fail while committing the transaction, hence we wanted to dig further to see what's going on.

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor I see your changes adding logInfo calls. Do I have to change the log level to INFO somewhere in order to capture the logs?

Here's our current Hive config:

hive.compactor.initiator.on = true
hive.txn.timeout = 300 (default value, should be in seconds)

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee you may not need to change anything if your driver logs already capture INFO-level messages.

Thanks for the info on the Hive config.

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor In what cases will those logs be emitted (as in commit 57b6916b2b753c)?

I tried building the jar as you suggested and tested the changes locally, but didn't see any related logs in:

  • driver
  • worker
  • hive-metastore

Do they start appearing once data are streamed into the Hive table? Or do they only appear under a very specific condition that I might not have triggered in local testing?

P.S.: I haven't deployed it to production; I wanted to make sure I did the right thing before proceeding with the production deploy.

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee yes, they appear once streaming starts, since they are in the addBatch() function and in the heartbeat path. Your log level should be INFO or more verbose, e.g., DEBUG or TRACE.
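For reference, assuming the driver uses Spark's default log4j setup, one way to make sure these messages reach the driver log is a log4j.properties entry like the following. The logger name is inferred from the com.qubole.spark.hiveacid package seen in the stack trace, so treat it as an assumption and adjust to your own logging configuration:

```properties
# Hypothetical driver-side log4j.properties entry: keep spark-acid's own
# classes at INFO even if the root logger is set to WARN.
log4j.logger.com.qubole.spark.hiveacid=INFO
```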

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor I did something wrong previously, sorry my bad.

I have captured the logs you requested here:

hive-metastore.log
spark-driver.log
spark-worker.log

Thanks for help 👍

amoghmargoor avatar amoghmargoor commented on July 19, 2024

Thanks @shuwnyuantee. This helps a great deal.
Let me tell you exactly what happened here, in the order it happened:

  1. A new transaction is created. From the driver log:

04:26:59.654 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] INFO : Transaction created: 22565

  2. The Hive Metastore treats this transaction as aborted and removes it. This is the responsibility of the Initiator in the Hive Metastore. From the Hive Metastore logs:

2020-06-12T04:38:04,033 INFO [Thread-10]: compactor.Initiator (Initiator.java:run(97)) - Checking to see if we should compact default.tmp_clst5.year=2019/month=7/day=20
2020-06-12T04:38:04,064 INFO [Thread-10]: txn.CompactionTxnHandler (CompactionTxnHandler.java:cleanEmptyAbortedTxns(605)) - Removed 1 empty Aborted transactions from TXNS
2020-06-12T04:38:04,064 INFO [Thread-10]: txn.CompactionTxnHandler (CompactionTxnHandler.java:cleanEmptyAbortedTxns(607)) - Aborted transactions removed from TXNS: [22565]

  3. Since the Initiator removed it, either committing this transaction or adding dynamic partitions to it will now fail, saying no such transaction exists. From the driver log:

04:38:53.219 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] INFO : Adding dynamic partition txnId: 22565 writeId: 740 dbName: default tableName: clickstream_events_sink_v1 partitions: year=2020/month=6/day=4
04:38:53.240 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] ERROR: Exception NoSuchTxnException(message:No such transaction txnid:22565)

So @shuwnyuantee, as you can see, even if you remove the commit that adds dynamic partitions, it will still fail, because the issue is elsewhere: it would then fail while committing the transaction instead.

So coming back to the issue here: why does the Hive Metastore regard an ongoing transaction as aborted?
The most probable reason: to keep a transaction alive, the Spark application has to keep sending heartbeats to the Hive Metastore; otherwise the metastore will abort that transaction. See the file HiveAcidTxnManager.scala; there is a class named HeartBeatRunnable that takes care of sending those heartbeats. I have added logs to it too.
Do you see any heartbeat logs in the driver log, either for failures or successes?

This is good progress.
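The heartbeat mechanism described above can be sketched roughly as follows. Everything here (class and method names, intervals) is a hypothetical simplification for illustration, not spark-acid's actual implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (NOT the actual HiveAcidTxnManager code) of the
// heartbeat idea: a background task periodically pings the metastore so
// an open transaction is not treated as timed out and aborted.
public class HeartbeatSketch {
    static final AtomicInteger beats = new AtomicInteger();

    // Stand-in for the real metastore heartbeat RPC; a real client would
    // log success or failure here, which is what the added INFO logs show.
    static void sendHeartbeat(long txnId) {
        beats.incrementAndGet();
    }

    // Heartbeat every intervalMs for durationMs, then stop and report how
    // many heartbeats were sent.
    static int run(long txnId, long intervalMs, long durationMs) {
        beats.set(0);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> sendHeartbeat(txnId), 0, intervalMs, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(durationMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdownNow();
        return beats.get();
    }

    public static void main(String[] args) {
        // With hive.txn.timeout = 300s, a real client would heartbeat well
        // within the timeout; intervals are shortened for this demo.
        System.out.println("heartbeats sent: " + run(22565L, 20L, 110L));
    }
}
```

If this background loop never starts, or dies silently, the metastore eventually treats the open transaction as timed out and aborts it, which is exactly the sequence the logs above show.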

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor

In the Spark driver log, I didn't find any logs related to spark-acid sending heartbeats. Keywords I searched for:

  • Heartbeat failure for transaction id
  • Heartbeat sent for txnId
  • Failure to heartbeat for txnId

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee That's concerning... so the HeartbeatRunner has somehow failed or is not running in your setup. Is it possible to send the entire driver log of the application? This needs to be solved, or it will affect your production.

shuwnyuantee avatar shuwnyuantee commented on July 19, 2024

@amoghmargoor

Here's the complete driver log:

complete_driver.log

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee I have hopefully figured out the issue and am going to make a fix for it ASAP. It should be fixed in a day or two.

amoghmargoor avatar amoghmargoor commented on July 19, 2024

@shuwnyuantee This is fixed and merged. Thanks for reporting.