taosdata / datax
This project forked from alibaba/datax
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration. Based on DataX, TAOS Data has developed TDengine Writer and Reader plugins that provide users with ETL and data-migration tooling.
License: Other
Describe the bug
When DataX imports data from multiple MongoDB collections into the same TDengine super table, it hits an OOM. Reproducibly, importing the first MongoDB collection succeeds, but every collection from the second onward fails with OOM.
Database and DataX versions used:
MongoDB: 4.0.3
TDengine: 3.0.5.0
DataX: mongodbreader, tdengine30writer
To Reproduce
1: Create a new database in TDengine.
2: MongoDB holds N collections to migrate, each with hundreds of millions of documents.
3: While the target TDengine super table is empty, migrating any single MongoDB collection succeeds.
4: Once the target super table already contains hundreds of millions of rows, migrating any MongoDB collection (including ones that previously migrated successfully) makes DataX OOM.
Troubleshooting:
1: Raising the DataX heap to 6 GB still produces the OOM.
2: This rules out a DataX configuration problem.
3: While DataX is OOMing, TDengine CPU usage spikes and the source MongoDB shows no export traffic, so the OOM happens inside DataX before any MongoDB data is exported.
System monitoring screenshot:
After tracing the hprof file, screenshot of the DataX source code where the problem was located:
Executing the same SQL directly in TDengine reproduced the problem, so the conclusion is that DataX loads all tag ids into its own memory, causing the OOM.
Expected behavior
It is unclear why DataX needs to run the code below; it seems to add little value. Could it be disabled, or changed to fetch only each child table's tag id rather than loading the full per-row tag-id data?
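If the goal is only to discover the existing child tables and their tags, a query that returns one row per child table would avoid scanning the detail rows entirely. A minimal sketch in TDengine SQL, where `stb` and `tagid` are placeholder names for the super table and its tag column:

```sql
-- Returns one tag value per child table instead of one per detail row,
-- so the result set stays proportional to the number of child tables.
SELECT DISTINCT tbname, tagid FROM stb;
```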
2023-06-08 17:45:35.245 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
2023-06-08 17:45:45.248 [job-0] INFO StandAloneJobContainerCommunicator - Total 10000 records, 256450 bytes | Speed 25.04KB/s, 1000 records/s | Error 2000 records, 51270 bytes | All Task WaitWriterTime 0.006s | All Task WaitReaderTime 0.338s | Percentage 0.00%
2023-06-08 17:45:45.249 [job-0] ERROR JobContainer - 运行scheduler 模式[standalone]出错.
2023-06-08 17:45:45.249 [job-0] ERROR JobContainer - Exception when job run
com.alibaba.datax.common.exception.DataXException: Code:[Framework-14], Description:[DataX传输脏数据超过用户预期,该错误通常是由于源端数据存在较多业务脏数据导致,请仔细检查DataX汇报的脏数据日志信息, 或者您可以适当调大脏数据阈值 .]. - 脏数据条数检查不通过,限制是[0]条,但实际上捕获了[2000]条.
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:30) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.util.ErrorRecordChecker.checkRecordLimit(ErrorRecordChecker.java:58) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.scheduler.AbstractScheduler.schedule(AbstractScheduler.java:89) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.schedule(JobContainer.java:535) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:119) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.start(Engine.java:93) [datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.entry(Engine.java:175) [datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.main(Engine.java:208) [datax-core-0.0.1-SNAPSHOT.jar:na]
2023-06-08 17:45:45.250 [job-0] INFO StandAloneJobContainerCommunicator - Total 10000 records, 256450 bytes | Speed 250.44KB/s, 10000 records/s | Error 2000 records, 51270 bytes | All Task WaitWriterTime 0.006s | All Task WaitReaderTime 0.338s | Percentage 0.00%
2023-06-08 17:45:45.252 [job-0] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-14], Description:[DataX传输脏数据超过用户预期,该错误通常是由于源端数据存在较多业务脏数据导致,请仔细检查DataX汇报的脏数据日志信息, 或者您可以适当调大脏数据阈值 .]. - 脏数据条数检查不通过,限制是[0]条,但实际上捕获了[2000]条.
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:30)
at com.alibaba.datax.core.util.ErrorRecordChecker.checkRecordLimit(ErrorRecordChecker.java:58)
at com.alibaba.datax.core.job.scheduler.AbstractScheduler.schedule(AbstractScheduler.java:89)
at com.alibaba.datax.core.job.JobContainer.schedule(JobContainer.java:535)
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:119)
at com.alibaba.datax.core.Engine.start(Engine.java:93)
at com.alibaba.datax.core.Engine.entry(Engine.java:175)
at com.alibaba.datax.core.Engine.main(Engine.java:208)
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "tdengine30reader",
                    "parameter": {
                        "username": "root",
                        "password": "taosdata",
                        "connection": [
                            {
                                "table": ["weather"],
                                "jdbcUrl": ["jdbc:TAOS-RS://127.0.0.1:6041/wanyanjun?timestampFormat=TIMESTAMP"]
                            }
                        ],
                        "column": ["ts", "temperature", "humidity", "location", "groupid"]
                    }
                },
                "writer": {
                    "name": "tdengine30writer",
                    "parameter": {
                        "username": "root",
                        "password": "taosdata",
                        "column": ["ts", "temperature", "humidity", "location", "groupid"],
                        "connection": [
                            {
                                "table": ["weather"],
                                "jdbcUrl": "jdbc:TAOS-RS://192.168.3.212:6041/wanyanjun"
                            }
                        ],
                        "encoding": "UTF-8",
                        "batchSize": 1000,
                        "ignoreTagsUnmatched": true
                    }
                }
            }
        ],
        "setting": {
            "speed": { "channel": 5 },
            "errorLimit": { "record": 0 }
        }
    }
}
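The configuration above sets errorLimit.record to 0, so the job aborts on the first dirty record, which matches the Framework-14 error in the log. If a bounded amount of dirty data is acceptable, the threshold can be raised; the values below are illustrative only:

```json
"errorLimit": {
    "record": 10000,
    "percentage": 0.02
}
```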
The relevant error log follows:
ERROR convert nchar string to UCS4_LE failed:{"carNumber":"","image":"","outParkingSpace":1638927251000,"parkingSpaceStatus":0}
02/20 18:21:18.947000 00150464 TSC ERROR 0xf bind column 4: type mismatch or invalid
02/20 18:21:18.947000 00150464 JNI ERROR jobj:00000069469FE648, conn:000001CC512F7AA0, code:Invalid operation
Column names should be enclosed in back quotes to handle uppercase column names.
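As noted above, back quotes make identifiers case-sensitive in TDengine, which is how an uppercase column name can be addressed; a hypothetical example with made-up names:

```sql
-- Without back quotes the identifier would be folded to lowercase;
-- `Temperature` keeps its uppercase spelling.
CREATE TABLE `t1` (`TS` TIMESTAMP, `Temperature` FLOAT);
```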
TDengine version: 3.0.0
The taos-jdbcdriver in the source tree is 2.0.37; changing it to 3.0.0 still does not work.
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from central: https://maven.aliyun.com/repository/central/com/alibaba/datax/tdenginewriter/tdenginewriter/0.0.1-SNAPSHOT/maven-metadata.xml
Downloading from central: https://maven.aliyun.com/repository/central/com/alibaba/datax/tdenginewriter/tdenginewriter/0.0.1-SNAPSHOT/tdenginewriter-0.0.1-SNAPSHOT.pom
[WARNING] The POM for com.alibaba.datax.tdenginewriter:tdenginewriter:jar:0.0.1-SNAPSHOT is missing, no dependency information available
Downloading from central: https://maven.aliyun.com/repository/central/com/alibaba/datax/tdenginewriter/tdenginewriter/0.0.1-SNAPSHOT/tdenginewriter-0.0.1-SNAPSHOT.jar
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Skipping datax-all
[INFO] This project has been banned from the build due to previous failures.
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] datax-all 0.0.1-SNAPSHOT ........................... SUCCESS [ 0.100 s]
[INFO] datax-common ....................................... SUCCESS [ 2.106 s]
[INFO] datax-transformer .................................. SUCCESS [ 0.886 s]
[INFO] datax-core ......................................... SUCCESS [ 2.224 s]
[INFO] plugin-rdbms-util .................................. SUCCESS [ 0.387 s]
[INFO] tdenginereader ..................................... FAILURE [ 0.861 s]
[INFO] plugin-unstructured-storage-util 0.0.1-SNAPSHOT .... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.752 s
[INFO] Finished at: 2022-07-21T15:28:25+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project tdenginereader: Could not resolve dependencies for project com.alibaba.datax:tdenginereader:jar:0.0.1-SNAPSHOT: Could not find artifact com.alibaba.datax.tdenginewriter:tdenginewriter:jar:0.0.1-SNAPSHOT in central (https://maven.aliyun.com/repository/central) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :tdenginereader
java.lang.UnsatisfiedLinkError: Native Library C:\Windows\System32\taos.dll already loaded in another classloader
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1900) ~[na:1.8.0_261]
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1850) ~[na:1.8.0_261]
at java.lang.Runtime.loadLibrary0(Runtime.java:871) ~[na:1.8.0_261]
at java.lang.System.loadLibrary(System.java:1122) ~[na:1.8.0_261]
at com.taosdata.jdbc.TSDBJNIConnector.<clinit>(TSDBJNIConnector.java:28) ~[taos-jdbcdriver-2.0.42.jar:na]
at com.taosdata.jdbc.TSDBDriver.connect(TSDBDriver.java:162) ~[taos-jdbcdriver-2.0.42.jar:na]
at java.sql.DriverManager.getConnection(DriverManager.java:664) ~[na:1.8.0_261]
at java.sql.DriverManager.getConnection(DriverManager.java:247) ~[na:1.8.0_261]
at com.alibaba.datax.plugin.writer.tdengine20writer.DefaultDataHandler.handle(DefaultDataHandler.java:77) ~[tdengine20writer-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdengine20writer.TDengineWriter$Task.startWrite(TDengineWriter.java:109) ~[tdengine20writer-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
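The UnsatisfiedLinkError above comes from the native taos.dll that the JNI connection loads; a native library can only be loaded by one classloader per JVM. A common workaround is to switch to the REST connection, which needs no native library; host, port, and database name below are placeholders:

```json
"jdbcUrl": "jdbc:TAOS-RS://127.0.0.1:6041/dbname"
```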
With DataX and TDengine, can multiple super tables be handled in one job? Can the column list be omitted (there would be a problem with duplicated field names)? Or can a whole database be synchronized directly?
Exception in thread "main" java.lang.NoSuchMethodError: com.alibaba.fastjson.JSONArray.getTimestamp(I)Ljava/lang/Object;
at com.taosdata.jdbc.rs.RestfulResultSet.parseTimestampColumnData(RestfulResultSet.java:255)
at com.taosdata.jdbc.rs.RestfulResultSet.parseColumnData(RestfulResultSet.java:183)
at com.taosdata.jdbc.rs.RestfulResultSet.<init>(RestfulResultSet.java:98)
at com.taosdata.jdbc.rs.RestfulStatement.execute(RestfulStatement.java:88)
at com.taosdata.jdbc.rs.RestfulStatement.executeQuery(RestfulStatement.java:37)
at Test.main(Test.java:16)
com.alibaba.datax.common.exception.DataXException: Code:[TDengineWriter-02], Description:[runtime exception]. - cannot find col: ts in columns: [ts, i_a, i_b, i_c, i_sum, elc, u_a, u_b, u_c, power, corp_id, equipid, line_id]
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:30) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdenginewriter.DefaultDataHandler.indexOf(DefaultDataHandler.java:552) [tdenginewriter-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdenginewriter.DefaultDataHandler.writeBatchToSupTableBySchemaless(DefaultDataHandler.java:317) [tdenginewriter-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdenginewriter.DefaultDataHandler.writeBatch(DefaultDataHandler.java:158) [tdenginewriter-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdenginewriter.DefaultDataHandler.writeEachRow(DefaultDataHandler.java:129) [tdenginewriter-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdenginewriter.DefaultDataHandler.handle(DefaultDataHandler.java:96) [tdenginewriter-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.plugin.writer.tdenginewriter.TDengineWriter$Task.startWrite(TDengineWriter.java:109) [tdenginewriter-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56) [datax-core-0.0.1-SNAPSHOT.jar:na]
at java.lang.Thread.run(Thread.java:750) [na:1.8.0_345]
The job script is as follows:
{
    "content": [
        {
            "reader": {
                "name": "tdenginereader",
                "parameter": {
                    "beginDateTime": "2022-10-25 17:47:31",
                    "column": ["ts", "i_a", "i_b", "i_c", "i_sum", "elc", "u_a", "u_b", "u_c", "power", "corpid", "equipid", "lineid"],
                    "connection": [
                        {
                            "jdbcUrl": ["jdbc:TAOS-RS://xxxxxxx:6041/gkjk?timestampFormat=TIMESTAMP"],
                            "table": ["equip_min"]
                        }
                    ],
                    "endDateTime": "2022-10-25 17:50:31",
                    "password": "",
                    "splitInterval": "5m",
                    "username": "hdec"
                }
            },
            "writer": {
                "name": "tdenginewriter",
                "parameter": {
                    "batchSize": 1000,
                    "column": ["ts", "i_a", "i_b", "i_c", "i_sum", "elc", "u_a", "u_b", "u_c", "power", "corp_id", "equipid", "line_id"],
                    "connection": [
                        {
                            "jdbcUrl": "jdbc:TAOS://xxxxxxx:6030/test?timestampFormat=TIMESTAMP",
                            "table": ["abandoned_water_elc_data"]
                        }
                    ],
                    "ignoreTagsUnmatched": true,
                    "password": "",
                    "username": "root"
                }
            }
        }
    ],
    "setting": {
        "speed": { "channel": 1 }
    }
}
2024-01-23 13:40:27.865 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:
{"exception":"TDengine ERROR (0x80003002): Invalid data format","record":[{"byteSize":8,"index":0,"rawData":1705770149000,"type":"DATE"},{"byteSize":1,"index":1,"rawData":2,"type":"LONG"},{"byteSize":10,"index":2,"rawData":1705770318,"type":"LONG"},{"byteSize":3,"index":3,"rawData":169,"type":"LONG"},{"byteSize":9,"index":4,"rawData":"4.7264E-4","type":"DOUBLE"},{"byteSize":10,"index":5,"rawData":"0.00571764","type":"DOUBLE"},{"byteSize":23,"index":6,"rawData":"id_15","type":"STRING"},{"byteSize":5,"index":7,"rawData":"15","type":"STRING"}],"type":"writer"}
行数据:[balanced_state,tname=id_15,device_id=15 battery_state=3,end_time=1705770742,duration=424,energy=6.4E-7f64,capacity=6.4E-7f64 1705770318000]
2024-01-23 13:40:27.865 [0-0-0-writer] ERROR DefaultDataHandler - TDengine ERROR (0x80003002): Invalid data format
Reading works fine, but writing reports a data-format error. The super table was already created in the 3.2.1.0 database, so during schemaless insertion a value written as battery_state=3 is treated as a double, whereas it should be written as battery_state=3u8 to match the column type; this mismatch ultimately causes the format error.
Is there a solution? It appears DataX converts the data to a Long type but cannot map it back to the older TDengine database's column type.
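Following the diagnosis above, the schemaless line needs an explicit type suffix so the value matches the pre-created column type. For instance, assuming battery_state was declared TINYINT UNSIGNED (u8), the line from the log would become:

```text
balanced_state,tname=id_15,device_id=15 battery_state=3u8,end_time=1705770742,duration=424,energy=6.4E-7f64,capacity=6.4E-7f64 1705770318000
```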