

dumpling's People

Contributors

3pointer, abingcbc, andrewdi, csuzhangxc, cxt90730, docsir, dveeden, ehco1996, fossabot, gmhdbjd, hi-rustin, ianthereal, jiyfhust, kennytm, lance6716, leavrth, lichunzhu, little-wallace, okjiang, recall704, se-bin, siddontang, sylzd, tangenta, tiancaiamao, tirsen, tisonkun, yujuncen, zhangjinpeng87, zhongzc


dumpling's Issues

Support `--where` clause

Support the --where clause from mysqldump. The condition is applied to all tables.

  • The file size of the SQL dump should still be close to the configuration. If we dumped 1000 files of 256 MB each originally, and we use --where to filter out half of the rows, the result should be 500 files of 256 MB each, not 1000 files of 128 MB each.

  • If the expression cannot be evaluated, abort and report the error.

  • The condition used should be written out for SQL dumps (can be skipped for CSV and binary dumps) like:

    /* WHERE (id > 500) */
    INSERT INTO ...
  • Even if all rows are filtered out, it should still generate an SQL file with content:

    /* All data are filtered out by this condition: */
    /* WHERE (id > 500) */

This should work by simply appending the condition to the SELECT statements; a sketch follows.
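
A minimal sketch of appending the condition to the per-table SELECT (function and names here are illustrative, not dumpling's actual API):

package main

import "fmt"

// buildSelectQuery appends the user-supplied --where condition
// to the SELECT statement used for one table.
func buildSelectQuery(database, table, where string) string {
	query := fmt.Sprintf("SELECT * FROM `%s`.`%s`", database, table)
	if where != "" {
		query += " WHERE " + where
	}
	return query
}

func main() {
	fmt.Println(buildSelectQuery("test", "t", "id > 500"))
	// SELECT * FROM `test`.`t` WHERE id > 500
}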

Score

  • 600

Support output compression

Support writing the data file as *.gz.

Well, just use https://godoc.org/compress/gzip and it should be done.

(Note: we have a configuration about the size of each data file. You could decide whether this means the size before or after compression, whichever is simpler to implement.)
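
A minimal sketch with compress/gzip, wrapping the data-file writer (file name and payload are placeholders):

package main

import (
	"compress/gzip"
	"log"
	"os"
)

func main() {
	f, err := os.Create("test.t.0.sql.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zw := gzip.NewWriter(f) // everything written to zw is compressed into f
	defer zw.Close()        // Close flushes buffered data and writes the gzip footer

	if _, err := zw.Write([]byte("INSERT INTO `t` VALUES (1);\n")); err != nil {
		log.Fatal(err)
	}
}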

Score

  • 50

Support --rows parameter

Support a parameter --rows (or -r) like mydumper's, which means:

  • Split table into chunks of this many rows, default unlimited
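
A rough sketch of what -r implies for an integer primary key (purely illustrative; real splitting must also handle gaps and non-integer keys):

package main

import "fmt"

func main() {
	min, max, rows := int64(1), int64(100000), int64(25000)
	for lo := min; lo <= max; lo += rows {
		// each chunk covers roughly `rows` rows and can be dumped independently
		fmt.Printf("SELECT * FROM `test`.`t` WHERE `id` >= %d AND `id` < %d\n", lo, lo+rows)
	}
}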

Replace go-sql-driver/mysql with siddontang/go-mysql/client to improve dumpling's performance

Description

Background

Currently dumpling's performance is only 1/2 to 2/3 of mydumper's. Two parts cost dumpling a lot of time.

  1. After analyzing the torch graph and doing some simple tests, we found that dumpling spends a lot of time fetching each row.

That's because driver.Value in the database/sql package is an interface{} variable.

When we convert a []byte variable to interface{}, runtime.convTslice calls runtime.mallocgc to do the conversion, which costs a lot of time, but we can't change driver.Value.

One solution is to abandon database/sql and use the []byte values read from the MySQL server directly. But this is a huge change for dumpling.

  2. Dumpling currently does these things serially:

read a row -> escape string -> write to buffer -> read next row ...

Actually, while we escape one value we could already start reading the next row to improve performance. But this seems hard to implement with the database/sql package, which means we may have to implement the MySQL client ourselves; a pipelining sketch follows.
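
A hedged sketch of that pipelining: a goroutine fetches rows while the main goroutine escapes and writes (the row source is a stub):

package main

import (
	"bufio"
	"bytes"
	"os"
)

// escape stands in for dumpling's string escaping (here: doubling single quotes).
func escape(row []byte) []byte {
	return bytes.ReplaceAll(row, []byte("'"), []byte("''"))
}

func main() {
	rows := make(chan []byte, 64) // buffering lets reading run ahead of writing
	go func() {
		defer close(rows)
		for _, r := range [][]byte{[]byte("'hello'"), []byte("world")} { // stub for reading from MySQL
			rows <- r
		}
	}()

	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for r := range rows { // escaping and writing overlap with the reader goroutine
		w.Write(escape(r))
		w.WriteByte('\n')
	}
}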

Reference

  1. conversion code in go-sql-driver/mysql: https://github.com/go-sql-driver/mysql/blob/73dc904a9ece5c074295a77abb7a135797a351bf/packets.go#L770
  2. dumpling torch graph: [image]
  3. mydumper torch graph: [image]
  4. code to test the efficiency of assigning []byte to interface{}:
    assigning []byte to interface{} costs much more time than assigning to []byte in this test (see the benchmark sketch after this list).
    https://gist.github.com/lichunzhu/2433d332b4bfc57fb7c1aa3f404b4c58
  5. Test: if we run dumpling with only the scan step (escape and write disabled), it costs the same time as mydumper doing both read and write.
    Relevant torch graph: [image]
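
A micro-benchmark sketch in the spirit of reference 4 (not the gist's exact code): boxing a []byte into an interface{} heap-allocates the slice header via runtime.convTslice, while a plain slice assignment does not.

package bench

import "testing"

var (
	iface interface{}
	slice []byte
)

func BenchmarkAssignToInterface(b *testing.B) {
	buf := make([]byte, 16)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		iface = buf // allocates: the slice header is boxed into the interface
	}
}

func BenchmarkAssignToSlice(b *testing.B) {
	buf := make([]byte, 16)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		slice = buf // no allocation: plain slice-header copy
	}
}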

Tasks

  • Improve dumpling's performance; make it better than mydumper (for both single-threaded and multi-threaded runs)
    • One possible approach is to replace go-sql-driver/mysql with siddontang/go-mysql/client. What's more, we need to refactor this package to parallelize reading from the database, escaping characters, and writing to disk.

Score

  • 6600


Recommended Skills

  • performance optimization in Go

Dumpling will crash when running without any command line argument

When running dumpling without any command line arguments, it crashes with the following errors:

Release version:
Git commit hash: 2655b67e4a5e01367a0f2cd9355b5d8e98f1583e
Git branch:      master
Build timestamp: 2020-05-11 08:17:14Z
Go version:      go version go1.14.1 linux/amd64

dump failed: err = err = dial tcp 127.0.0.1:4000: connect: connection refused
goroutine 1 [running]:
runtime/debug.Stack(0xa13a60, 0xc000080190, 0xa14400)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa13a60, 0xc000080190, 0xc0000aa040, 0x95bdcf)
	/home/tidb/luancheng/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc000144180, 0x95bdcf, 0x10, 0xc000073998, 0xc000010088, 0x0)
	/home/tidb/luancheng/dumpling/v4/export/sql.go:367 +0x152
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc000144180, 0x2a, 0x434466, 0xc00002a0f0, 0xc00007a1a0)
	/home/tidb/luancheng/dumpling/v4/export/sql.go:94 +0x81
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc000144180, 0x5, 0xc00002a0f0, 0x2a, 0xc000144180)
	/home/tidb/luancheng/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc00010e780, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e380, 0xda2790, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e380, 0xc0000921b0, 0x0, 0x0, 0xd6e380, 0xc0000921b0)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e380, 0x0, 0x1, 0xc000082058)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:131 +0x2d

goroutine 1 [running]:
runtime/debug.Stack(0xa14400, 0xc000095260, 0xa14400)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa14400, 0xc000095260, 0x10, 0xc000073998)
	/home/tidb/luancheng/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc000144180, 0x2a, 0x434466, 0xc00002a0f0, 0xc00007a1a0)
	/home/tidb/luancheng/dumpling/v4/export/sql.go:96 +0x9e
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc000144180, 0x5, 0xc00002a0f0, 0x2a, 0xc000144180)
	/home/tidb/luancheng/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc00010e780, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e380, 0xda2790, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e380, 0xc0000921b0, 0x0, 0x0, 0xd6e380, 0xc0000921b0)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e380, 0x0, 0x1, 0xc000082058)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:131 +0x2d

It would be better to print help info when dumpling is run without arguments; a possible sketch follows.
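
A possible fix sketch with cobra, which dumpling already uses (names are illustrative):

package main

import "github.com/spf13/cobra"

func main() {
	rootCmd := &cobra.Command{
		Use: "dumpling",
		RunE: func(cmd *cobra.Command, args []string) error {
			if cmd.Flags().NFlag() == 0 && len(args) == 0 {
				return cmd.Help() // no arguments: show usage instead of dialing 127.0.0.1:4000
			}
			return nil // run the actual dump here
		},
	}
	rootCmd.Execute()
}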

Benchmark against mydumper

Compare the speed of Dumpling against Mydumper on a large (> 10 GB) database.

Score

  • 450

Add context in `export.Dump`

As a library, dumpling should accept a context.Context in export.Dump so callers can shut down the dump process via context.CancelFunc; a sketch follows.
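
A hedged sketch of the proposed shape (the real Config and dump plan live in dumpling; stubs are used here):

package export

import "context"

type Config struct{} // stub for dumpling's real Config

// Dump accepts a context so library callers can cancel a running dump
// via the matching context.CancelFunc.
func Dump(ctx context.Context, conf *Config) error {
	tables := []string{"test.t1", "test.t2"} // stub for the real dump plan
	for range tables {
		select {
		case <-ctx.Done():
			return ctx.Err() // cancelled by the caller
		default:
			// dump one table here, passing ctx down to QueryContext etc.
		}
	}
	return nil
}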

E2E test

  • Create integration tests on CI showing MySQL/TiDB → Dumpling → Loader/Lightning → MySQL/TiDB works with small-scale data.
  • Run the same test with large-scale data (of size > 100 GB).

You can use sync_diff_inspector to compare databases for the integration tests.

Score

  • 900

Add --version/-V flags

dumpling should support --version/-V flags to match the behavior of other PingCAP tools; a sketch follows.
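
A sketch with cobra: setting Version enables --version, and registering the flag ourselves attaches the -V shorthand (treat the details as an assumption to verify):

package main

import "github.com/spf13/cobra"

func main() {
	rootCmd := &cobra.Command{Use: "dumpling"}
	rootCmd.Version = "v4.0.0-dev" // assumption: filled from build info at link time
	// cobra only auto-registers --version if the flag is absent, so defining
	// it ourselves lets us add the -V shorthand while cobra handles printing.
	rootCmd.Flags().BoolP("version", "V", false, "print version and exit")
	rootCmd.Execute()
}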

`-F` flag doesn't seem to work with `--filetype csv` and `--sql`

When I try to dump data selected by a SQL statement to CSV (using current master):

./dumpling \
-F $(( 1024 * 3 ))  \
--filetype csv \
--sql "select * from test.sbtest1;" \
-o csv-and-size-2

What I want to see

All selected rows dumped into ~3 KB file chunks:

hillium@continuation ~/D/d/b/csv-and-size-5> ls -lh ../csv-and-size-5/
total 480
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1443635317331776148.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1460320609597786623.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1598098976185383115.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1874068156324778273.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1905388747193831650.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1976235410884491574.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.2015796113853353331.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.2050257992909156333.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.2338498362660772719.csv

What I got instead

I got just one file, without the complete set of records:

hillium@continuation ~/D/d/b/csv-and-size-5> ls -lh ../csv-and-size-2/
total 24
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:46 1590126407.csv
-rwxr-xr-x  1 hillium  staff   140B  5 22 13:46 metadata

Notes

According to the debug log, it seems we build the CSV file name badly, so the output files become hidden files:

[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:256] ["dumping table"] [table=] ["record counts"=17]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:331] ["opened file"] [path=csv-and-size-2/..45.csv]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:341] ["tear down lazy file writer..."]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:256] ["dumping table"] [table=] ["record counts"=17]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:331] ["opened file"] [path=csv-and-size-2/..46.csv]

If a table IR gives us an empty database or table name, we use the Unix timestamp to name the file:

if ir.DatabaseName() != "" && ir.TableName() != "" {
	fileName = fmt.Sprintf("%s.%s.%d.csv", ir.DatabaseName(), ir.TableName(), ir.ChunkIndex())
} else {
	fileName = fmt.Sprintf("%d.csv", time.Now().Unix())
}

But during the loop we name the output file directly from the (DB name, table name, chunk index) 3-tuple:

fileName = fmt.Sprintf("%s.%s.%d.csv", ir.DatabaseName(), ir.TableName(), chunkIndex)

One more thing: naming the file by Unix time alone won't work, because we may dump multiple files in the same second. We could solve this by using nanoseconds, or a ${UNIX time}-${chunk index}.csv format, as in the sketch below.
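
A hedged fix sketch following that suggestion (names illustrative):

package main

import (
	"fmt"
	"time"
)

// chunkFileName falls back to "<unix>-<chunkIndex>.csv" when the database or
// table name is empty, so chunks written in the same second cannot collide
// (and we never produce hidden names like "..45.csv").
func chunkFileName(db, table string, chunkIndex int) string {
	if db != "" && table != "" {
		return fmt.Sprintf("%s.%s.%d.csv", db, table, chunkIndex)
	}
	return fmt.Sprintf("%d-%d.csv", time.Now().Unix(), chunkIndex)
}

func main() {
	fmt.Println(chunkFileName("", "", 45)) // e.g. 1590126407-45.csv
}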

dumpling should handle cases where the estimated count is not accurate

Now dumpling computes estimatedChunks from TiDB's estRows.

	// every chunk would have eventual adjustments
	estimatedChunks := count / conf.Rows
	estimatedStep := (max-min)/estimatedChunks + 1

When a table is created but not yet used in TiDB (so it has no statistics), estRows will be 10000. Then estimatedChunks will be very large if conf.Rows is small, and dumpling may dump many more SQL files than expected.

Support dump databases and tables with special name

Dumpling needs to support dumping databases and tables with special names.
When I try to run ./dumpling -u root -P 3306 -H 127.0.0.1 -B "live-test" to dump the database live-test, dumpling returns the error below.
We can escape database and table names to solve this problem.

Release version:
Git commit hash: a36198fef0eff1752f6bd645bb5f884e3c30ed98
Git branch:      master
Build timestamp: 2020-04-21 02:46:31Z
Go version:      go version go1.13.4 linux/amd64

[2020/04/21 22:46:41.256 +08:00] [INFO] [config.go:111] ["detect server type"] [type=MySQL]
[2020/04/21 22:46:41.256 +08:00] [INFO] [config.go:129] ["detect server version"] [version=5.7.26-log]
dump failed: SHOW CREATE DATABASE live-test: err = Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '-test' at line 1
goroutine 1 [running]:
runtime/debug.Stack(0xa2c9c0, 0xc000120280, 0xa2d680)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa2c9c0, 0xc000120280, 0xc0000c8078, 0xc000128080)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc0001540c0, 0xc000128080, 0x1e, 0xc0001198c0, 0x1, 0xc000128080)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/sql.go:345 +0x152
github.com/pingcap/dumpling/v4/export.ShowCreateDatabase(0xc0001540c0, 0x7ffce0e6c37e, 0x9, 0x1ed, 0x0, 0x1ed, 0x0)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/sql.go:38 +0x10c
github.com/pingcap/dumpling/v4/export.dumpDatabases(0xa371c0, 0xc0000c8078, 0xc0000fea00, 0xc0001540c0, 0xa35540, 0xc000124030, 0x0, 0xc0)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/dump.go:102 +0x2e3
github.com/pingcap/dumpling/v4/export.Dump(0xc0000fea00, 0x0, 0x0)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/dump.go:90 +0x83e
main.run()
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/cmd/dumpling/main.go:119 +0x64b
main.glob..func1(0xd80aa0, 0xc000152aa0, 0x0, 0xa)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/cmd/dumpling/main.go:57 +0x20
github.com/spf13/cobra.(*Command).execute(0xd80aa0, 0xc0000d0010, 0xa, 0xa, 0xd80aa0, 0xc0000d0010)
        /home/pingcap/goPath/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xd80aa0, 0x0, 0x1, 0xc0000a6058)
        /home/pingcap/goPath/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
        /home/pingcap/goPath/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/cmd/dumpling/main.go:128 +0x2d

dumpling spends too much time getting schemas' tables

Bug Report


  1. What did you do? If possible, provide a recipe for reproducing the error.

Run dumpling with the -T argument to dump only one table. The TiDB cluster has more than 1000 schemas and 100000 tables. Each select query costs about 3s.

  2. What did you expect to see?

Dump the data quickly and successfully.

  3. What did you see instead?

Dumpling is blocked in listAllTables.

How to dump views?

I don't see how to dump views; the only related option is -W, --no-views Do not dump views (default true)

`--escape-backslash=false` leads to data mismatched

mysql [email protected]:test> create table t (a varchar(100))
Query OK, 0 rows affected
Time: 0.005s
mysql [email protected]:test> insert into t values("'hello")
Query OK, 1 row affected
Time: 0.003s
mysql [email protected]:test> select * from t
+--------+
| a      |
+--------+
| 'hello |
+--------+
1 row in set
Time: 0.014s

bash> ./bin/dumpling --filetype csv --escape-backslash=false
bash> cat export-2020-06-09T16:31:45+08:00/test.t.0.csv
"a"
"''hello"

The value 'hello became ''hello.

Support `--tidb-force-priority` in dumpling like mydumper

Feature Request

Is your feature request related to a problem? Please describe:

Some users don't want dumpling to affect their normal TiDB workload.

Describe the feature you'd like:

Use the session variable tidb_force_priority to lower the priority of dumpling's SQL statements and reduce the resource usage of the TiDB cluster; a sketch follows.
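
A minimal sketch, assuming TiDB's tidb_force_priority session variable and that dumpling would run this on every connection it opens:

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Lower the priority of all statements this session issues;
	// dumpling would apply it per connection before its scans.
	if _, err := db.Exec("SET SESSION tidb_force_priority = 'LOW_PRIORITY'"); err != nil {
		log.Fatal(err)
	}
}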


Support dumping SQL files to a local filesystem

Create an MVP for this project, showing performance and correctness. See pingcap/community#123 for the full list of expectations.

Checklist:

  • Make sure your architecture takes account of all features in Phases 1 and 2. Implementing those features in the future should not require a complete refactoring.
  • Resulting data are sorted by primary key (_tidb_rowid, if it exists, is always considered the primary key)
  • SQL files are split into sizes close to the given configuration
  • Single tables can be dumped in parallel, if a primary key or unique btree key exists (Postponed into #12, and TiDB may not be able to handle it)
  • Consistency: dumping a snapshot instead of live data (either acquire a read lock or ignore new updates)
  • Error handling and recovery
  • Support all data types allowed by TiDB (text, blob, int, float, decimal, bit, enum, set, timestamp, datetime, date, time, year, json)
  • Does not fail with edge cases for data (e.g. naughty strings, character sets, time zones)
  • Properly handle special features (e.g. special characters in table name, table partitions, views, generated columns, MariaDB sequences, etc.)
  • Has adequate logging and Prometheus metrics
  • Unit test coverage ≥ 75%

Score

  • 5000 (reduced from 6200)

dump TiDB INFORMATION_SCHEMA failed

[2020/03/01 21:07:14.710 +08:00] [WARN] [session.go:1123] ["compile SQL failed"] [conn=51] [error="[planner:1054]Unknown column '_tidb_rowid' in 'field list'"] [SQL="SELECT _tidb_rowid from INFORMATION_SCHEMA.INSPECTION_RESULT LIMIT 0"]

It seems table INSPECTION_RESULT has no _tidb_rowid.

Support output metadata after the dump operation

Support writing a metadata file into the output directory after the dump operation. DM needs the position in the metadata to get the start checkpoint for the syncer in all mode.
An example of Mydumper's metadata:

Started dump at: 2020-03-10 15:53:25
SHOW MASTER STATUS:
        Log: ON.000001
        Pos: 7502
        GTID:6ce40be3-e359-11e9-87e0-36933cb0ca5a:1-29

Finished dump at: 2020-03-10 15:53:25

Consider making dumpling's `-F` flag use the `MB` unit

Dumpling and Mydumper have the same command line flag -F, but the unit is totally different: Mydumper's -F unit is MB while Dumpling's is bytes.

Also, there is no unit in the --help output, and bytes are not user-friendly: if users want to configure 64 MB, they have to specify 67108864, which is cumbersome.

Users who used Mydumper previously can easily make mistakes, especially since we link the Mydumper documentation from the Dumpling documentation.

Let Dumpling set GC safepoint ttl with new pd interface

When Dumpling dumps a large amount of data, it needs to set the TiDB snapshot at a certain ts. When the dump takes too long, TiKV may GC this snapshot ts and Dumpling will fail to dump, exiting with an error.

To avoid this problem, users need to set tikv_gc_life_time to a longer value, as explained at the end of https://pingcap.com/docs-cn/stable/export-or-backup-using-dumpling. However, this is not convenient: users need to run

update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';

before dumping and run

update mysql.tidb set VARIABLE_VALUE = '10m' where VARIABLE_NAME = 'tikv_gc_life_time';

after dumping.

However, Dumpling shouldn't change this variable itself, because the dump process may fail halfway and leave it modified.

In v4.0.0, PD has a new interface, UpdateServiceGCSafePoint, which allows a service to set the TiKV GC safe point with a TTL; this suits Dumpling's scenario very well. What's more, TiCDC already uses this interface in its service.

So, in summary, I think Dumpling can also use this interface to make sure the dumped ts is not GCed; a sketch follows.
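
A hedged sketch against the PD client interface (import path and signature as of the PD v4.0 era; verify before relying on it). Dumpling would refresh the safe point periodically while the dump runs:

package dump

import (
	"context"
	"log"
	"time"

	pd "github.com/pingcap/pd/v4/client"
)

// keepGCSafePoint refreshes a service GC safe point with a TTL so TiKV
// won't GC the snapshot ts while the dump is still running.
func keepGCSafePoint(ctx context.Context, pdCli pd.Client, snapshotTS uint64) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		// serviceID "dumpling", TTL 300 seconds, refreshed every minute.
		if _, err := pdCli.UpdateServiceGCSafePoint(ctx, "dumpling", 300, snapshotTS); err != nil {
			log.Println("update service GC safe point failed:", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}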

Omit generated columns when generating SQL

When a table has a GENERATED column, current dumpling also generates SQL containing the generated values, which causes an error in downstream DBs.
We need to omit GENERATED columns; a sketch of one approach follows the examples below.

example

table-structure:

CREATE TABLE `t1` (
  `id` int(11) DEFAULT NULL,
  `name` varchar(20) DEFAULT NULL,
  `info` json DEFAULT NULL,
  `gen_id` int(11) GENERATED ALWAYS AS (json_extract(`info`,'$.id')) VIRTUAL,
  `info2` varchar(40) DEFAULT NULL,
  UNIQUE KEY `gen_idx` (`gen_id`),
  KEY `multi_col` (`id`,`gen_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

SQL generated by dumpling:

/*!40101 SET NAMES binary*/;
INSERT INTO `t1` VALUES
(1, 'Arya Stark', '{"id": 120}', 120, NULL),
(2, 'Catelyn Stark', '{"id": 121}', 121, NULL),
(3, 'Eddard Stark', '{"id": 122}', 122, NULL);

SQL generated by mydumper:

/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;
INSERT INTO `t1` (`id`,`name`,`info`,`info2`) VALUES
(1,'Arya Stark',CONVERT('{\"id\": 120}' USING UTF8MB4),NULL),
(2,'Catelyn Stark',CONVERT('{\"id\": 121}' USING UTF8MB4),NULL),
(3,'Eddard Stark',CONVERT('{\"id\": 122}' USING UTF8MB4),NULL);
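
A hedged sketch of how dumpling could pick the non-generated columns and emit an explicit column list like mydumper does (MySQL 5.7+ marks generated columns with "VIRTUAL GENERATED"/"STORED GENERATED" in information_schema.columns.EXTRA):

package dump

import (
	"database/sql"
	"strings"
)

// selectableColumns returns the columns to dump, skipping generated ones.
func selectableColumns(db *sql.DB, schema, table string) ([]string, error) {
	rows, err := db.Query(
		`SELECT COLUMN_NAME FROM information_schema.COLUMNS
		 WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ? AND EXTRA NOT LIKE '%GENERATED%'
		 ORDER BY ORDINAL_POSITION`, schema, table)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var cols []string
	for rows.Next() {
		var c string
		if err := rows.Scan(&c); err != nil {
			return nil, err
		}
		cols = append(cols, "`"+c+"`")
	}
	return cols, rows.Err()
}

// insertPrefix builds e.g. INSERT INTO `t1` (`id`,`name`,`info`,`info2`) VALUES
func insertPrefix(table string, cols []string) string {
	return "INSERT INTO `" + table + "` (" + strings.Join(cols, ",") + ") VALUES"
}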

--snapshot argument doesn't work

Bug Report


  1. What did you do? If possible, provide a recipe for reproducing the error.
    Run dumpling with --snapshot argument.

  2. What did you expect to see?
    Dump tables with given snapshot.

  3. What did you see instead?
    Dump tables with current snapshot.

Support dumping CSV

Support dumping data as CSV files instead of SQL files. (Schema files should still be SQL.) A minimal sketch follows.
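
A minimal sketch, assuming the standard encoding/csv package for the data files:

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	f, err := os.Create("test.t.0.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := csv.NewWriter(f)
	defer w.Flush() // Flush writes any buffered records to f
	w.Write([]string{"id", "name"}) // header row (could be optional)
	w.Write([]string{"1", "Arya Stark"})
	w.Write([]string{"2", "Catelyn Stark"})
}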

Score

  • 300

Reduce the time dumpling holds the FTWRL lock

Feature Request

Is your feature request related to a problem? Please describe:

When dumping MySQL, dumpling uses FTWRL to lock all tables and releases the lock only after all tables have been dumped, which strongly affects the MySQL server.

Describe the feature you'd like:

mydumper uses a special approach to guarantee consistency while keeping the lock time short: FTWRL -> start THREADS transactions (WITH CONSISTENT SNAPSHOT) -> UNLOCK TABLES -> dump data.
I think we can also do this in dumpling; a sketch of the sequence follows.
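
A hedged sketch of that sequence, assuming one control connection plus one connection per dump thread (error handling kept minimal):

package dump

import (
	"context"
	"database/sql"
)

// shortLock mirrors mydumper's trick: FTWRL is held only long enough to open
// one consistent-snapshot transaction per dump thread.
func shortLock(ctx context.Context, ctl *sql.Conn, workers []*sql.Conn) error {
	if _, err := ctl.ExecContext(ctx, "FLUSH TABLES WITH READ LOCK"); err != nil {
		return err
	}
	for _, w := range workers {
		// Each worker pins the same consistent view (REPEATABLE READ)
		// before the global lock is released.
		if _, err := w.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
			return err
		}
	}
	// All snapshots are open; writes may resume while the workers dump.
	_, err := ctl.ExecContext(ctx, "UNLOCK TABLES")
	return err
}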


Usability improvements for command line parameters `-t` and `-F`

Background

  • -t controls how many tables can be exported concurrently. If the user also provides -r, each table can be split into several concurrent export jobs, so -t is not exactly the upper limit of export threads.
  • -F only accepts an integer, which represents the size of the output file in bytes. Users may want to provide a more human-readable value such as 64MB or 1GB.

Requirement

  • Make -t control the global upper limit of export threads, or provide another parameter to limit the concurrency of a single table's export, so the global concurrency can be controlled.
  • Make -F accept human-readable file sizes (a parsing sketch follows).
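
A parsing sketch for the human-readable -F values (helper name and accepted suffixes are hypothetical):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFileSize turns "64MB"/"1GB"-style input into bytes;
// a bare integer keeps the old bytes behavior.
func parseFileSize(s string) (uint64, error) {
	units := []struct {
		suffix string
		mul    uint64
	}{{"GB", 1 << 30}, {"MB", 1 << 20}, {"KB", 1 << 10}, {"B", 1}}
	s = strings.ToUpper(strings.TrimSpace(s))
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(s, u.suffix), 10, 64)
			return n * u.mul, err
		}
	}
	return strconv.ParseUint(s, 10, 64) // bare number: bytes, as today
}

func main() {
	n, _ := parseFileSize("64MB")
	fmt.Println(n) // 67108864
}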

Maybe we need to quote table names when querying DB info

Currently, we inject the table name directly into the SQL query.

query := fmt.Sprintf("SHOW CREATE DATABASE %s", database)

query := fmt.Sprintf("SHOW CREATE TABLE %s.%s", database, table)

Even though those table names are obtained directly from the database connection, so we need not fear SQL injection, dumpling will still fail when a table name uses an SQL keyword. An example is running integration_test on MariaDB (a quoting sketch follows the trace):

dump failed: SHOW CREATE DATABASE rows: err = Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'rows' at line 1
goroutine 1 [running]:
runtime/debug.Stack(0x1615be0, 0xc00000e4e0, 0x1616980)
	/usr/local/Cellar/go/1.14/libexec/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0x1615be0, 0xc00000e4e0, 0xc0000260c0, 0xc0000243e0)
	/Users/hillium/Developer/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc0001f4180, 0xc0000243e0, 0x19, 0xc0001858c0, 0x1, 0xc0000243e0)
	/Users/hillium/Developer/dumpling/v4/export/sql.go:367 +0x152
github.com/pingcap/dumpling/v4/export.ShowCreateDatabase(0xc0001f4180, 0x7ffeefbff7b5, 0x4, 0xc0001859e0, 0x14200df, 0x7ffeefbff7c1, 0x26)
	/Users/hillium/Developer/dumpling/v4/export/sql.go:38 +0x10c
github.com/pingcap/dumpling/v4/export.dumpDatabases(0x1620180, 0xc0000260c0, 0xc0001b4780, 0xc0001f4180, 0x161e3c0, 0xc000010078, 0x0, 0x100)
	/Users/hillium/Developer/dumpling/v4/export/dump.go:109 +0x124
github.com/pingcap/dumpling/v4/export.Dump(0xc0001b4780, 0x0, 0x0)
	/Users/hillium/Developer/dumpling/v4/export/dump.go:92 +0x84f
main.run()
	/Users/hillium/Developer/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0x19774e0, 0xc0001ec600, 0x0, 0x10)
	/Users/hillium/Developer/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0x19774e0, 0xc000020130, 0x10, 0x11, 0x19774e0, 0xc000020130)
	/Users/hillium/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0x19774e0, 0x0, 0x1, 0xc00007a058)
	/Users/hillium/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
	/Users/hillium/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
	/Users/hillium/Developer/dumpling/cmd/dumpling/main.go:131 +0x2d
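
A hedged fix sketch: quote identifiers with backticks (doubling embedded backticks, the MySQL quoting rule), so reserved words like rows parse as identifiers:

package main

import (
	"fmt"
	"strings"
)

// quoteIdentifier wraps a name in backticks and doubles any embedded backtick.
func quoteIdentifier(name string) string {
	return "`" + strings.ReplaceAll(name, "`", "``") + "`"
}

func main() {
	fmt.Printf("SHOW CREATE DATABASE %s\n", quoteIdentifier("rows"))
	// SHOW CREATE DATABASE `rows`
}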

Support a custom binary output format

SQL and CSV are not easy to parse. The purpose of this task is for Dumpling to support a binary format which allows Lightning to read data more quickly.

  • Decide the format specification
  • Implement encoder
  • Implement decoder (read file and output a row of values)
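
Purely as an illustration of the kind of encoding that could be specified (not a decided format), a length-prefixed value encoder with its matching decoder:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeValue emits a 4-byte little-endian length followed by the raw bytes.
func writeValue(w io.Writer, v []byte) error {
	var lenBuf [4]byte
	binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(v)))
	if _, err := w.Write(lenBuf[:]); err != nil {
		return err
	}
	_, err := w.Write(v)
	return err
}

// readValue is the inverse: read the length, then exactly that many bytes.
func readValue(r io.Reader) ([]byte, error) {
	var lenBuf [4]byte
	if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
		return nil, err
	}
	v := make([]byte, binary.LittleEndian.Uint32(lenBuf[:]))
	_, err := io.ReadFull(r, v)
	return v, err
}

func main() {
	var buf bytes.Buffer
	writeValue(&buf, []byte("Arya Stark"))
	v, _ := readValue(&buf)
	fmt.Printf("%s\n", v) // Arya Stark
}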

Score

  • 600

Error occurs with argument '-r'

./bin/dumpling -B test -B d1 -r 100
zsh: correct 'test' to 'tests' [nyae]? n
Release version:
Git commit hash: e1ffd1a5e64ea4f7e2a3be4c234d11eea0f7ec29
Git branch:      master
Build timestamp: 2020-06-05 08:00:49Z
Go version:      go version go1.14.2 darwin/amd64

[2020/06/05 17:13:02.370 +08:00] [INFO] [config.go:118] ["detect server type"] [type=TiDB]
[2020/06/05 17:13:02.370 +08:00] [INFO] [config.go:136] ["detect server version"] [version=4.0.0-beta.2-562-ge85b2c516]
[2020/06/05 17:13:02.415 +08:00] [WARN] [sql.go:485] ["can't get estimate count from db"] [query="EXPLAIN SELECT `a` FROM `test`.`t`"] [error="sql: Scan error on column index 2, name \"task\": destination not a pointer"]

Support specifying output file format

Feature Request

Is your feature request related to a problem? Please describe:

Some users need to specify the output filename format to avoid duplication.

Describe the feature you'd like:

Support specifying the output file name format instead of a fixed, built-in file name; a sketch follows.
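
One possible shape, sketched with text/template (the flag and pattern syntax are hypothetical, not a decided design):

package main

import (
	"log"
	"os"
	"text/template"
)

func main() {
	// Pattern as a user might supply it via a hypothetical
	// --output-filename-template flag.
	tmpl := template.Must(template.New("name").Parse("{{.DB}}.{{.Table}}.{{.Index}}\n"))
	data := struct {
		DB, Table string
		Index     int
	}{"test", "sbtest1", 0}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		log.Fatal(err)
	}
	// Output: test.sbtest1.0
}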


Cannot dump heavily skewed data in parallel

Example:

create table x(id bigint primary key);
insert into x (id) values (1);
insert into x (id) values (2);
...
insert into x (id) values (9999);
insert into x (id) values (10000);

insert into x (id) values (9223372036854775807);

Currently Dumpling splits the range using only min and max, ignoring the actual data distribution, so this causes one thread to dump all the small IDs while the rest dump zero or one rows each.

We could either introduce mydumper's bisection algorithm, or use a work-stealing algorithm in which completed threads "steal" unprocessed ranges from the still-working threads.

Make `-t` control the global upper limit of export threads

Separated from #76.

Currently, -t controls how many tables can be exported concurrently. If the user also provides -r, each table can be split into several concurrent export jobs, so -t is not exactly the upper limit of export threads.

We want to change -t to control the global upper limit of export threads, or provide another parameter to limit the concurrency of a single table's export, so the global concurrency can be controlled; a worker-pool sketch follows.
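
A worker-pool sketch: a global semaphore bounds the total number of export goroutines, regardless of how many chunks each table splits into (names illustrative):

package main

import "sync"

func dumpAll(chunks []int, threads int) {
	sem := make(chan struct{}, threads) // -t becomes the semaphore capacity
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		sem <- struct{}{} // blocks once `threads` dumps are in flight
		go func(chunk int) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = chunk // dump one chunk here
		}(c)
	}
	wg.Wait()
}

func main() { dumpAll([]int{1, 2, 3, 4}, 2) }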

Unify `--filesize` and `--statement-size` definition with mydumper's

Feature Request

Is your feature request related to a problem? Please describe:

Currently dumpling's --filesize and --statement-size only count the column values' size. However, mydumper counts the text file's size, which may confuse users.

Describe the feature you'd like:

Unify the --filesize and --statement-size definitions with mydumper's, meaning that statement text like 'INSERT' is also counted toward the file size.


Add CI pipelines, official documents and more tests in dumpling

  • Add official documents in https://github.com/pingcap/docs-cn @YuJuncen

  • Add a -T/--tables-list argument like mydumper's

  • Add a black-white list filter to replace mydumper's -x argument

  • Improve performance

  • Add more tests to dump data in different scenarios [ask QA for test cases] @lichunzhu

    Correctness test:

    • TiDB feature

    Performance test:

    • dump TB level data and make sure it's correct.
    • dump TB level data and each table has
  • Add release pipelines [ask PM] @lichunzhu

  • Publish dumpling [Due time: 4.0.1]

invalid metadata for multiple GTID sets

  1. Dump data from a MySQL server with multiple GTID sets:
> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+--------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                          |
+------------------+----------+--------------+------------------+--------------------------------------------+
| mysql-bin.000003 | 1274     |              |                  | 5b5a8e4e-9b43-11ea-900d-0242ac170002:1-10, |
|                  |          |              |                  | 5b642cb6-9b43-11ea-8914-0242ac170003:1-7   |
+------------------+----------+--------------+------------------+--------------------------------------------+
  2. An extra newline appears after the first GTID set:
# cat metadata
Started dump at: 2020-05-21 10:09:16
SHOW MASTER STATUS:
		Log: mysql-bin.000003
		Pos: 1274
		GTID:5b5a8e4e-9b43-11ea-900d-0242ac170002:1-10,
5b642cb6-9b43-11ea-8914-0242ac170003:1-7
Finished dump at: 2020-05-21 10:09:16
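
A hedged fix sketch: normalize the Executed_Gtid_Set value before writing the metadata, since SHOW MASTER STATUS returns it with embedded newlines:

package main

import (
	"fmt"
	"strings"
)

func main() {
	gtid := "5b5a8e4e-9b43-11ea-900d-0242ac170002:1-10,\n5b642cb6-9b43-11ea-8914-0242ac170003:1-7"
	// Strip the newline MySQL inserts between GTID sets so the
	// metadata field stays on one line.
	clean := strings.ReplaceAll(gtid, "\n", "")
	fmt.Println("\t\tGTID:" + clean)
}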

Argument `threads` should be validated

Users are able to run dumpling -t 0 (which gets stuck) and dumpling -t -1 (which panics). As a product, it should validate the effective ranges of numeric arguments; a sketch follows.
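
A minimal validation sketch (flag name per dumpling's -t/--threads; the exact bounds are an assumption):

package main

import (
	"fmt"
	"os"
)

func validateThreads(threads int) error {
	if threads <= 0 {
		return fmt.Errorf("--threads must be a positive integer, got %d", threads)
	}
	return nil
}

func main() {
	if err := validateThreads(0); err != nil {
		fmt.Fprintln(os.Stderr, err) // fail fast instead of hanging or panicking
		os.Exit(2)
	}
}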

Unclean exit if connection fails

Trying to execute dumpling without any command-line arguments assumes a connection to 127.0.0.1:4000, and if that fails, the command exits uncleanly:

$ dumpling/bin/dumpling
Release version:
Git commit hash: ba51a591f93ca26325f01ef484074697e7d9c451
Git branch:      master
Build timestamp: 2020-05-11 05:56:17Z
Go version:      go version go1.14.2 linux/amd64

dump failed: err = err = dial tcp 127.0.0.1:4000: connect: connection refused
goroutine 1 [running]:
runtime/debug.Stack(0xa13be0, 0xc00006c410, 0xa14580)
        /home/centos/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa13be0, 0xc00006c410, 0xc000024098, 0x95bdcf)
        /home/centos/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc0000ec180, 0x95bdcf, 0x10, 0xc0000c9998, 0xc0000100c0, 0x0)
        /home/centos/dumpling/v4/export/sql.go:367 +0x152
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc0000ec180, 0x2a, 0x4344c6, 0xc00002a180, 0xc00007ab60)
        /home/centos/dumpling/v4/export/sql.go:94 +0x81
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc0000ec180, 0x5, 0xc00002a180, 0x2a, 0xc0000ec180)
        /home/centos/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc0000b2a00, 0x0, 0x0)
        /home/centos/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
        /home/centos/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e3a0, 0xda27b0, 0x0, 0x0)
        /home/centos/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e3a0, 0xc0000201d0, 0x0, 0x0, 0xd6e3a0, 0xc0000201d0)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e3a0, 0x0, 0x1, 0xc00006e058)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
        /home/centos/dumpling/cmd/dumpling/main.go:131 +0x2d

goroutine 1 [running]:
runtime/debug.Stack(0xa14580, 0xc0000e86f0, 0xa14580)
        /home/centos/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa14580, 0xc0000e86f0, 0x10, 0xc0000c9998)
        /home/centos/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc0000ec180, 0x2a, 0x4344c6, 0xc00002a180, 0xc00007ab60)
        /home/centos/dumpling/v4/export/sql.go:96 +0x9e
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc0000ec180, 0x5, 0xc00002a180, 0x2a, 0xc0000ec180)
        /home/centos/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc0000b2a00, 0x0, 0x0)
        /home/centos/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
        /home/centos/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e3a0, 0xda27b0, 0x0, 0x0)
        /home/centos/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e3a0, 0xc0000201d0, 0x0, 0x0, 0xd6e3a0, 0xc0000201d0)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e3a0, 0x0, 0x1, 0xc00006e058)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
        /home/centos/dumpling/cmd/dumpling/main.go:131 +0x2d

This also occurs if an explicit but unreachable host is given. Dumpling should exit with an error message (rather than a stack trace) if a host cannot be reached. If no arguments are given when it's invoked, help output should be shown.

Support customized csv separator/delimiter

Feature Request

Is your feature request related to a problem? Please describe:

Some users need to specify a custom separator to avoid ambiguity.

Describe the feature you'd like:

Support a --csv-separator flag to specify a custom separator/delimiter for dumpling. The default value is ",". A minimal sketch follows.
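
A minimal sketch: with encoding/csv, the writer's Comma field is the delimiter, so the value of the proposed flag could be applied like this:

package main

import (
	"encoding/csv"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)
	w.Comma = '|' // value taken from the proposed --csv-separator flag
	w.Write([]string{"1", "Arya Stark"})
	w.Flush()
	// Output: 1|Arya Stark
}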


TiDB OOM while dumping massive data

[root@localhost data2]# ./dumpling -h 172.16.4.36 -P 4000 -u root -o dumpling-exported --filetype csv -t 40
Release version:
Git commit hash: adf4a53efc9fec87ea96f92f5829019679f8580b
Git branch:      master
Build timestamp: 2020-06-19 01:00:52Z
Go version:      go version go1.14.1 linux/amd64

[2020/06/19 09:06:49.259 +08:00] [INFO] [config.go:118] ["detect server type"] [type=TiDB]
[2020/06/19 09:06:49.259 +08:00] [INFO] [config.go:136] ["detect server version"] [version=4.0.0-beta.2-616-gf139821c8]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=mysql]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=INFORMATION_SCHEMA]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=METRICS_SCHEMA]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=PERFORMANCE_SCHEMA]
[2020/06/19 09:12:23.539 +08:00] [ERROR] [main.go:175] ["dump failed error stack info"] [error="Error 1105: Out Of Memory Quota![conn_id=50]"] [stack="github.com/pingcap/log.Error\n\t/home/tidb/go/pkg/mod/github.com/pingcap/[email protected]/global.go:42\nmain.main\n\t/home/tidb/cy/src/dumpling/cmd/dumpling/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]

dump failed: Error 1105: Out Of Memory Quota![conn_id=50]
