Comments (17)
@sourabh912 this is failing while adding dynamic partitions.
from spark-acid.
@amoghmargoor Thanks for confirming the issue, looking forward to the fix 👍
@shuwnyuantee we tried reproducing this issue in-house but could not. Can you give us some pointers on how to reproduce it?
Along with the pointers, the following information would also help:
- The complete driver stack trace (28 frames are missing).
- Which version of the Hive Metastore are you using?
- Steps to reproduce:
We built the Spark-Acid jar from commit ada7e2c586133a43 (dated Jun 4, 2020) and redeployed it to production. Then we restarted our Spark app and it crashed with the error below. We are now stuck in a bad loop, where any further restart throws a similar error.
- Complete driver stack trace:
com.qubole.shaded.hadoop.hive.metastore.api.NoSuchTxnException: No such transaction txnid:22534
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result$add_dynamic_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result$add_dynamic_partitions_resultStandardScheme.read(ThriftHiveMetastore.java)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$add_dynamic_partitions_result.read(ThriftHiveMetastore.java)
at com.qubole.shaded.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_add_dynamic_partitions(ThriftHiveMetastore.java:5527)
at com.qubole.shaded.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.add_dynamic_partitions(ThriftHiveMetastore.java:5514)
at com.qubole.shaded.hadoop.hive.metastore.HiveMetaStoreClient.addDynamicPartitions(HiveMetaStoreClient.java:2669)
at com.qubole.spark.hiveacid.transaction.HiveAcidTxnManager.addDynamicPartitions(HiveAcidTxnManager.scala:350)
at com.qubole.spark.hiveacid.transaction.HiveAcidTxn.addDynamicPartitions(HiveAcidTxn.scala:108)
at com.qubole.spark.hiveacid.writer.TableWriter.process(TableWriter.scala:213)
at com.qubole.spark.hiveacid.HiveAcidTable$$anonfun$addBatch$1.apply$mcV$sp(HiveAcidTable.scala:275)
at com.qubole.spark.hiveacid.HiveAcidTable.inTxnRetry$1(HiveAcidTable.scala:107)
at com.qubole.spark.hiveacid.HiveAcidTable.inTxn(HiveAcidTable.scala:131)
at com.qubole.spark.hiveacid.HiveAcidTable.addBatch(HiveAcidTable.scala:273)
at com.qubole.spark.hiveacid.streaming.HiveAcidSink.addBatch(HiveAcidSink.scala:87)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
org.apache.spark.sql.streaming.StreamingQueryException: Query hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = 25008e19-fce0-497f-be1c-1020fd1d6942] terminated with exception: No such transaction txnid:22534
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
- Hive & Metastore version:
3.1.2
@amoghmargoor
Were you able to reproduce the issue? Any estimate on a potential fix?
Previously we built the Spark-Acid jar from commit 0e5bbf8074891 (dated Apr 8, 2020).
Is it OK if we revert to that commit for the moment? Will it cause any issues with Hive compaction?
Let me know if you have any suggestions, thanks! 🙏
@shuwnyuantee we were not able to reproduce it internally, but we have been discussing what could have gone wrong. I have created a new branch https://github.com/qubole/spark-acid/tree/issue-54 with the required log messages. Can you build a jar from that branch, run it on your side, and send us these two logs:
- Driver logs
- Hive Metastore logs
Another thing I wanted to know is whether you have:
- switched on the initiator (hive.compactor.initiator.on) on your end?
- changed hive.txn.timeout on your side?
Regarding your question: if your streaming job is creating new partitions, then this fix is required; if it is writing only to existing partitions, it is not. But we suspect that even without the fix it may fail while committing the transaction, hence we wanted to dig further and see what's going on.
@amoghmargoor I see your changes about logInfo. Do I have to change the log level somewhere in order to capture the logs?
Here's our current Hive config:
hive.compactor.initiator.on = true
hive.txn.timeout = 300 (default value, in seconds)
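(For reference, in a hive-site.xml these settings would typically look something like the fragment below; the values shown match our setup.)

```xml
<!-- Illustrative hive-site.xml fragment, matching the values listed above -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.timeout</name>
  <!-- plain numbers are interpreted as seconds -->
  <value>300</value>
</property>
```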
@shuwnyuantee you may not need to change anything if your driver logs already capture INFO logs.
Thanks for the info on the Hive config.
@amoghmargoor In which cases will those logs be emitted (as in commit 57b6916b2b753c)?
I built the jar as you suggested and tested the changes locally, but didn't see any related logs in:
- the driver
- the worker
- the Hive Metastore
Do they start appearing once data is streamed into the Hive table? Or do they only appear under a very specific condition that I might not have triggered in local testing?
P.S.: I haven't deployed it to production yet; I wanted to make sure I did the right thing before proceeding with the production deploy.
@shuwnyuantee yes, they appear once streaming starts, as they are in the addBatch() function and in the heartbeats. Your log level should be INFO or lower, i.e., TRACE.
@amoghmargoor I did something wrong previously, sorry, my bad.
I have captured the logs you requested here:
hive-metastore.log
spark-driver.log
spark-worker.log
Thanks for help 👍
Thanks @shuwnyuantee. This helps a great deal.
Let me first walk you through exactly what happened, in order:
- A new transaction is created. From the driver log:
04:26:59.654 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] INFO : Transaction created: 22565
- The Hive Metastore treats this transaction as aborted and removes it. This is the responsibility of the Initiator in the Hive Metastore. From the Hive Metastore logs:
2020-06-12T04:38:04,033 INFO [Thread-10]: compactor.Initiator (Initiator.java:run(97)) - Checking to see if we should compact default.tmp_clst5.year=2019/month=7/day=20
2020-06-12T04:38:04,064 INFO [Thread-10]: txn.CompactionTxnHandler (CompactionTxnHandler.java:cleanEmptyAbortedTxns(605)) - Removed 1 empty Aborted transactions from TXNS
2020-06-12T04:38:04,064 INFO [Thread-10]: txn.CompactionTxnHandler (CompactionTxnHandler.java:cleanEmptyAbortedTxns(607)) - Aborted transactions removed from TXNS: [22565]
- Since the Initiator removed the transaction, either committing it or adding dynamic partitions to it now fails, saying no such transaction exists. From the driver log:
04:38:53.219 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] INFO : Adding dynamic partition txnId: 22565 writeId: 740 dbName: default tableName: clickstream_events_sink_v1 partitions: year=2020/month=6/day=4
04:38:53.240 [stream execution thread for hiveSink [id = 3267e047-4848-4821-a2f7-04d6b8dd84c9, runId = e989f15a-b4de-4ccf-95f7-605d6fa44aec]] ERROR: Exception NoSuchTxnException(message:No such transaction txnid:22565)
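Note that the timestamps are consistent with the 300 s hive.txn.timeout: the transaction was opened at 04:26:59 and cleaned up at 04:38:04, about 665 seconds later, with no heartbeat in between. A quick arithmetic check (a sketch for illustration, not project code):

```java
// Sanity check: the gap between "Transaction created" and the Initiator's
// cleanup far exceeds the hive.txn.timeout default of 300 seconds.
public class TimeoutCheck {
    static final int OPENED_SEC  = 4 * 3600 + 26 * 60 + 59; // 04:26:59, driver log
    static final int CLEANED_SEC = 4 * 3600 + 38 * 60 + 4;  // 04:38:04, metastore log
    static final int TXN_TIMEOUT_SEC = 300;                 // hive.txn.timeout default

    public static void main(String[] args) {
        int elapsed = CLEANED_SEC - OPENED_SEC;
        System.out.println("Seconds without a heartbeat: " + elapsed);        // prints 665
        System.out.println("Exceeds timeout: " + (elapsed > TXN_TIMEOUT_SEC)); // prints true
    }
}
```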
So @shuwnyuantee, as you can see, even if you revert the commit that adds dynamic partitions, it will still fail, because the issue is elsewhere: without that commit it will fail while committing the transaction instead.
So coming back to the issue here: why does the Hive Metastore regard an ongoing transaction as aborted?
The most probable reason: to keep a transaction alive, the Spark application has to keep sending heartbeats to the Hive Metastore; otherwise the Metastore will abort that transaction. See the file HiveAcidTxnManager.scala: there is a class named HeartBeatRunnable that takes care of sending those heartbeats. I have added logs to it too.
Do you see any heartbeat logs in the driver, related to either failure or success?
This is good progress.
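To make the heartbeat mechanism concrete, here is a minimal, hypothetical sketch (not the actual spark-acid implementation; the class name and the sendHeartbeat callback are illustrative): the client must renew its transaction lease more often than hive.txn.timeout, or the Metastore eventually treats the transaction as timed out and aborts it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongPredicate;

// Hypothetical heartbeat loop: renew the transaction lease well before
// hive.txn.timeout (300 s by default) expires, so the Metastore never
// considers the open transaction abandoned.
public class HeartbeatSketch {
    static final long TXN_TIMEOUT_MS = 300_000L;                  // hive.txn.timeout
    static final long HEARTBEAT_INTERVAL_MS = TXN_TIMEOUT_MS / 2; // renew early

    // Schedules periodic heartbeats; in a real client, sendHeartbeat would
    // wrap the Metastore's heartbeat RPC. The scheduler is returned so the
    // caller can shut it down when the transaction commits or aborts.
    static ScheduledExecutorService start(long txnId, LongPredicate sendHeartbeat) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (sendHeartbeat.test(txnId)) {
                System.out.println("Heartbeat sent for txnId " + txnId);
            } else {
                System.out.println("Failure to heartbeat for txnId " + txnId);
            }
        }, 0, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Picking half the timeout as the interval leaves slack for RPC delays; the key point is that if this loop ever stops (as it seems to have here), the Initiator cleans the transaction up and every later operation on it fails with NoSuchTxnException.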
In the Spark driver log, I didn't find any logs about spark-acid sending heartbeats. The keywords I searched for were:
- Heartbeat failure for transaction id
- Heartbeat sent for txnId
- Failure to heartbeat for txnId
@shuwnyuantee That's concerning... so the HeartbeatRunner has somehow failed or is not running in your setup. Is it possible to send the entire driver log of the application? This needs to be solved, otherwise it will affect your production.
Here's the complete driver log:
@shuwnyuantee I have hopefully figured out the issue and am going to make a fix ASAP. This should be fixed in a day or two.
@shuwnyuantee this is fixed and merged. Thanks for reporting.