

dumpling's People

Contributors

3pointer, abingcbc, andrewdi, csuzhangxc, cxt90730, docsir, dveeden, ehco1996, fossabot, gmhdbjd, hi-rustin, ianthereal, jiyfhust, kennytm, lance6716, leavrth, lichunzhu, little-wallace, okjiang, recall704, se-bin, siddontang, sylzd, tangenta, tiancaiamao, tirsen, tisonkun, yujuncen, zhangjinpeng87, zhongzc


dumpling's Issues

Support `--where` clause

Support the --where clause from mysqldump. The condition is applied to all tables.

  • The file size of the SQL dump should still be close to the configuration. If we dumped 1000 files of 256 MB each originally, and we use --where to filter out half of the rows, the result should be 500 files of 256 MB each, not 1000 files of 128 MB each.

  • If the expression cannot be evaluated, abort and report the error.

  • The condition used should be written out for SQL dumps (can be skipped for CSV and binary dumps) like:

    /* WHERE (id > 500) */
    INSERT INTO ...
  • Even if all rows are filtered out, it should still generate an SQL file with content:

    /* All data are filtered out by this condition: */
    /* WHERE (id > 500) */

This should work by simply appending the condition to the SELECT statements; a sketch follows.
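
A minimal sketch of appending the condition to the per-table SELECT (function and names here are illustrative, not dumpling's actual API):

package main

import "fmt"

// buildSelectQuery appends the user-supplied --where condition
// to the SELECT statement used for one table.
func buildSelectQuery(database, table, where string) string {
	query := fmt.Sprintf("SELECT * FROM `%s`.`%s`", database, table)
	if where != "" {
		query += " WHERE " + where
	}
	return query
}

func main() {
	fmt.Println(buildSelectQuery("test", "t", "id > 500"))
	// SELECT * FROM `test`.`t` WHERE id > 500
}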

Score

  • 600

Support output compression

Support writing the data file as *.gz.

Well, just use https://godoc.org/compress/gzip and it should be done.

(Note: we have a configuration about the size of each data file. You could decide whether this means the size before or after compression, whichever is simpler to implement.)
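
A minimal sketch with compress/gzip, wrapping the data-file writer (file name and payload are placeholders):

package main

import (
	"compress/gzip"
	"log"
	"os"
)

func main() {
	f, err := os.Create("test.t.0.sql.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zw := gzip.NewWriter(f) // everything written to zw is compressed into f
	defer zw.Close()        // Close flushes buffered data and writes the gzip footer

	if _, err := zw.Write([]byte("INSERT INTO `t` VALUES (1);\n")); err != nil {
		log.Fatal(err)
	}
}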

Score

  • 50

Support --rows parameter

Support a parameter --rows (or -r) like mydumper's, which means:

  • Split table into chunks of this many rows, default unlimited
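
A rough sketch of what -r implies for an integer primary key (purely illustrative; real splitting must also handle gaps and non-integer keys):

package main

import "fmt"

func main() {
	min, max, rows := int64(1), int64(100000), int64(25000)
	for lo := min; lo <= max; lo += rows {
		// each chunk covers roughly `rows` rows and can be dumped independently
		fmt.Printf("SELECT * FROM `test`.`t` WHERE `id` >= %d AND `id` < %d\n", lo, lo+rows)
	}
}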

Replace go-sql-driver/mysql with siddontang/go-mysql/client to improve dumpling's performance

Description

Background

Currently dumpling's performance is only 1/2 to 2/3 of mydumper's. Two parts cost dumpling a lot of time.

  1. After analyzing the torch graph and doing some simple tests, we found that dumpling spends a lot of time fetching each row.

That's because driver.Value in the database/sql package is an interface{} variable.

When we convert a []byte variable to interface{}, runtime.convTslice calls runtime.mallocgc to do the conversion, which costs a lot of time, but we can't change driver.Value.

One solution is to abandon database/sql and use the []byte values read from the MySQL server directly. But this is a huge change for dumpling.

  2. Dumpling currently does these things serially:

read a row -> escape string -> write to buffer -> read next row ...

Actually, while we escape one value we could already start reading the next row to improve performance. But this seems hard to implement with the database/sql package, which means we may have to implement the MySQL client ourselves; a pipelining sketch follows.
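
A hedged sketch of that pipelining: a goroutine fetches rows while the main goroutine escapes and writes (the row source is a stub):

package main

import (
	"bufio"
	"bytes"
	"os"
)

// escape stands in for dumpling's string escaping (here: doubling single quotes).
func escape(row []byte) []byte {
	return bytes.ReplaceAll(row, []byte("'"), []byte("''"))
}

func main() {
	rows := make(chan []byte, 64) // buffering lets reading run ahead of writing
	go func() {
		defer close(rows)
		for _, r := range [][]byte{[]byte("'hello'"), []byte("world")} { // stub for reading from MySQL
			rows <- r
		}
	}()

	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for r := range rows { // escaping and writing overlap with the reader goroutine
		w.Write(escape(r))
		w.WriteByte('\n')
	}
}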

Reference

  1. conversion code in go-sql-driver/mysql: https://github.com/go-sql-driver/mysql/blob/73dc904a9ece5c074295a77abb7a135797a351bf/packets.go#L770
  2. dumpling torch graph: [image]
  3. mydumper torch graph: [image]
  4. code to test the efficiency of assigning []byte to interface{}:
    assigning []byte to interface{} costs much more time than assigning to []byte in this test (see the benchmark sketch after this list).
    https://gist.github.com/lichunzhu/2433d332b4bfc57fb7c1aa3f404b4c58
  5. Test: if we run dumpling with only the scan step (escape and write disabled), it costs the same time as mydumper doing both read and write.
    Relevant torch graph: [image]
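
A micro-benchmark sketch in the spirit of reference 4 (not the gist's exact code): boxing a []byte into an interface{} heap-allocates the slice header via runtime.convTslice, while a plain slice assignment does not.

package bench

import "testing"

var (
	iface interface{}
	slice []byte
)

func BenchmarkAssignToInterface(b *testing.B) {
	buf := make([]byte, 16)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		iface = buf // allocates: the slice header is boxed into the interface
	}
}

func BenchmarkAssignToSlice(b *testing.B) {
	buf := make([]byte, 16)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		slice = buf // no allocation: plain slice-header copy
	}
}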

Tasks

  • Improve dumpling's performance; make it better than mydumper (for both single-threaded and multi-threaded runs)
    • One possible approach is to replace go-sql-driver/mysql with siddontang/go-mysql/client. What's more, we need to refactor this package to parallelize reading from the database, escaping characters, and writing to disk.

Score

  • 6600


Recommended Skills

  • performance optimization in Go

Dumpling will crash when running without any command line argument

When running dumpling without any command line arguments, it crashes with the following errors:

Release version:
Git commit hash: 2655b67e4a5e01367a0f2cd9355b5d8e98f1583e
Git branch:      master
Build timestamp: 2020-05-11 08:17:14Z
Go version:      go version go1.14.1 linux/amd64

dump failed: err = err = dial tcp 127.0.0.1:4000: connect: connection refused
goroutine 1 [running]:
runtime/debug.Stack(0xa13a60, 0xc000080190, 0xa14400)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa13a60, 0xc000080190, 0xc0000aa040, 0x95bdcf)
	/home/tidb/luancheng/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc000144180, 0x95bdcf, 0x10, 0xc000073998, 0xc000010088, 0x0)
	/home/tidb/luancheng/dumpling/v4/export/sql.go:367 +0x152
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc000144180, 0x2a, 0x434466, 0xc00002a0f0, 0xc00007a1a0)
	/home/tidb/luancheng/dumpling/v4/export/sql.go:94 +0x81
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc000144180, 0x5, 0xc00002a0f0, 0x2a, 0xc000144180)
	/home/tidb/luancheng/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc00010e780, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e380, 0xda2790, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e380, 0xc0000921b0, 0x0, 0x0, 0xd6e380, 0xc0000921b0)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e380, 0x0, 0x1, 0xc000082058)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:131 +0x2d

goroutine 1 [running]:
runtime/debug.Stack(0xa14400, 0xc000095260, 0xa14400)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa14400, 0xc000095260, 0x10, 0xc000073998)
	/home/tidb/luancheng/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc000144180, 0x2a, 0x434466, 0xc00002a0f0, 0xc00007a1a0)
	/home/tidb/luancheng/dumpling/v4/export/sql.go:96 +0x9e
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc000144180, 0x5, 0xc00002a0f0, 0x2a, 0xc000144180)
	/home/tidb/luancheng/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc00010e780, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e380, 0xda2790, 0x0, 0x0)
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e380, 0xc0000921b0, 0x0, 0x0, 0xd6e380, 0xc0000921b0)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e380, 0x0, 0x1, 0xc000082058)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
	/home/tidb/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
	/home/tidb/luancheng/dumpling/cmd/dumpling/main.go:131 +0x2d

It would be better to print help info when dumpling is run without arguments; a possible sketch follows.
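
A possible fix sketch with cobra, which dumpling already uses (names are illustrative):

package main

import "github.com/spf13/cobra"

func main() {
	rootCmd := &cobra.Command{
		Use: "dumpling",
		RunE: func(cmd *cobra.Command, args []string) error {
			if cmd.Flags().NFlag() == 0 && len(args) == 0 {
				return cmd.Help() // no arguments: show usage instead of dialing 127.0.0.1:4000
			}
			return nil // run the actual dump here
		},
	}
	rootCmd.Execute()
}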

Benchmark against mydumper

Compare the speed of Dumpling against Mydumper on a large (> 10 GB) database.

Score

  • 450

Add context in `export.Dump`

As a library, dumpling should accept a context.Context in export.Dump so callers can shut down the dump process via context.CancelFunc; a sketch follows.
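
A hedged sketch of the proposed shape (the real Config and dump plan live in dumpling; stubs are used here):

package export

import "context"

type Config struct{} // stub for dumpling's real Config

// Dump accepts a context so library callers can cancel a running dump
// via the matching context.CancelFunc.
func Dump(ctx context.Context, conf *Config) error {
	tables := []string{"test.t1", "test.t2"} // stub for the real dump plan
	for range tables {
		select {
		case <-ctx.Done():
			return ctx.Err() // cancelled by the caller
		default:
			// dump one table here, passing ctx down to QueryContext etc.
		}
	}
	return nil
}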

E2E test

  • Create integration tests on CI showing MySQL/TiDB → Dumpling → Loader/Lightning → MySQL/TiDB works with small-scale data.
  • Run the same test with large-scale data (of size > 100 GB).

You can use sync_diff_inspector to compare databases for the integration tests.

Score

  • 900

Add --version/-V flags

dumpling should support --version/-V flags to match the behavior of other PingCAP tools; a sketch follows.
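
A sketch with cobra: setting Version enables --version, and registering the flag ourselves attaches the -V shorthand (treat the details as an assumption to verify):

package main

import "github.com/spf13/cobra"

func main() {
	rootCmd := &cobra.Command{Use: "dumpling"}
	rootCmd.Version = "v4.0.0-dev" // assumption: filled from build info at link time
	// cobra only auto-registers --version if the flag is absent, so defining
	// it ourselves lets us add the -V shorthand while cobra handles printing.
	rootCmd.Flags().BoolP("version", "V", false, "print version and exit")
	rootCmd.Execute()
}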

`-F` flag doesn't seem to work with `--filetype csv` and `--sql`

When I try to dump data selected by a SQL statement to CSV (using current master):

./dumpling \
-F $(( 1024 * 3 ))  \
--filetype csv \
--sql "select * from test.sbtest1;" \
-o csv-and-size-2

What I want to see

All selected rows dumped into ~3 KB file chunks:

hillium@continuation ~/D/d/b/csv-and-size-5> ls -lh ../csv-and-size-5/
total 480
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1443635317331776148.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1460320609597786623.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1598098976185383115.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1874068156324778273.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1905388747193831650.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.1976235410884491574.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.2015796113853353331.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.2050257992909156333.csv
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:52 1590126769.2338498362660772719.csv

What I got instead

I got just one file, without the complete set of records:

hillium@continuation ~/D/d/b/csv-and-size-5> ls -lh ../csv-and-size-2/
total 24
-rwxr-xr-x  1 hillium  staff   3.2K  5 22 13:46 1590126407.csv
-rwxr-xr-x  1 hillium  staff   140B  5 22 13:46 metadata

Notes

According to the debug log, it seems we build the CSV file name badly, so the output files become hidden files:

[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:256] ["dumping table"] [table=] ["record counts"=17]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:331] ["opened file"] [path=csv-and-size-2/..45.csv]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:341] ["tear down lazy file writer..."]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:256] ["dumping table"] [table=] ["record counts"=17]
[2020/05/22 13:46:47.163 +08:00] [DEBUG] [writer_util.go:331] ["opened file"] [path=csv-and-size-2/..46.csv]

If a table IR gives us an empty database or table name, we use the Unix timestamp to name the file:

if ir.DatabaseName() != "" && ir.TableName() != "" {
	fileName = fmt.Sprintf("%s.%s.%d.csv", ir.DatabaseName(), ir.TableName(), ir.ChunkIndex())
} else {
	fileName = fmt.Sprintf("%d.csv", time.Now().Unix())
}

But during the loop we name the output file directly from the (DB name, table name, chunk index) 3-tuple:

fileName = fmt.Sprintf("%s.%s.%d.csv", ir.DatabaseName(), ir.TableName(), chunkIndex)

One more thing: naming the file by Unix time alone won't work, because we may dump multiple files in the same second. We could solve this by using nanoseconds, or a ${UNIX time}-${chunk index}.csv format, as in the sketch below.
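
A hedged fix sketch following that suggestion (names illustrative):

package main

import (
	"fmt"
	"time"
)

// chunkFileName falls back to "<unix>-<chunkIndex>.csv" when the database or
// table name is empty, so chunks written in the same second cannot collide
// (and we never produce hidden names like "..45.csv").
func chunkFileName(db, table string, chunkIndex int) string {
	if db != "" && table != "" {
		return fmt.Sprintf("%s.%s.%d.csv", db, table, chunkIndex)
	}
	return fmt.Sprintf("%d-%d.csv", time.Now().Unix(), chunkIndex)
}

func main() {
	fmt.Println(chunkFileName("", "", 45)) // e.g. 1590126407-45.csv
}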

dumpling should handle cases where the estimated count is not accurate

Now dumpling computes estimatedChunks from TiDB's estRows.

	// every chunk would have eventual adjustments
	estimatedChunks := count / conf.Rows
	estimatedStep := (max-min)/estimatedChunks + 1

When a table is created but not yet used in TiDB (so it has no statistics), estRows will be 10000. Then estimatedChunks will be very large if conf.Rows is small, and dumpling may dump many more SQL files than expected.

Support dump databases and tables with special name

Dumpling needs to support dumping databases and tables with special names.
When I try to run ./dumpling -u root -P 3306 -H 127.0.0.1 -B "live-test" to dump the database live-test, dumpling returns the error below.
We can escape database and table names to solve this problem.

Release version:
Git commit hash: a36198fef0eff1752f6bd645bb5f884e3c30ed98
Git branch:      master
Build timestamp: 2020-04-21 02:46:31Z
Go version:      go version go1.13.4 linux/amd64

[2020/04/21 22:46:41.256 +08:00] [INFO] [config.go:111] ["detect server type"] [type=MySQL]
[2020/04/21 22:46:41.256 +08:00] [INFO] [config.go:129] ["detect server version"] [version=5.7.26-log]
dump failed: SHOW CREATE DATABASE live-test: err = Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '-test' at line 1
goroutine 1 [running]:
runtime/debug.Stack(0xa2c9c0, 0xc000120280, 0xa2d680)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa2c9c0, 0xc000120280, 0xc0000c8078, 0xc000128080)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc0001540c0, 0xc000128080, 0x1e, 0xc0001198c0, 0x1, 0xc000128080)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/sql.go:345 +0x152
github.com/pingcap/dumpling/v4/export.ShowCreateDatabase(0xc0001540c0, 0x7ffce0e6c37e, 0x9, 0x1ed, 0x0, 0x1ed, 0x0)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/sql.go:38 +0x10c
github.com/pingcap/dumpling/v4/export.dumpDatabases(0xa371c0, 0xc0000c8078, 0xc0000fea00, 0xc0001540c0, 0xa35540, 0xc000124030, 0x0, 0xc0)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/dump.go:102 +0x2e3
github.com/pingcap/dumpling/v4/export.Dump(0xc0000fea00, 0x0, 0x0)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/v4/export/dump.go:90 +0x83e
main.run()
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/cmd/dumpling/main.go:119 +0x64b
main.glob..func1(0xd80aa0, 0xc000152aa0, 0x0, 0xa)
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/cmd/dumpling/main.go:57 +0x20
github.com/spf13/cobra.(*Command).execute(0xd80aa0, 0xc0000d0010, 0xa, 0xa, 0xd80aa0, 0xc0000d0010)
        /home/pingcap/goPath/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xd80aa0, 0x0, 0x1, 0xc0000a6058)
        /home/pingcap/goPath/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
        /home/pingcap/goPath/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
        /home/pingcap/goPath/src/github.com/pingcap/dumpling/cmd/dumpling/main.go:128 +0x2d

dumpling spends too much time getting schemas' tables

Bug Report


  1. What did you do? If possible, provide a recipe for reproducing the error.

Run dumpling with the -T argument to dump only one table. The TiDB cluster has more than 1000 schemas and 100000 tables. Each select query costs about 3s.

  2. What did you expect to see?

Dump the data quickly and successfully.

  3. What did you see instead?

Dumpling is blocked in listAllTables.

How to dump views?

I don't see how to dump views; the only related option is -W, --no-views Do not dump views (default true)

`--escape-backslash=false` leads to data mismatched

mysql [email protected]:test> create table t (a varchar(100))
Query OK, 0 rows affected
Time: 0.005s
mysql [email protected]:test> insert into t values("'hello")
Query OK, 1 row affected
Time: 0.003s
mysql [email protected]:test> select * from t
+--------+
| a      |
+--------+
| 'hello |
+--------+
1 row in set
Time: 0.014s

bash> ./bin/dumpling --filetype csv --escape-backslash=false
bash> cat export-2020-06-09T16:31:45+08:00/test.t.0.csv
"a"
"''hello"

The value 'hello became ''hello.

Support `--tidb-force-priority` in dumpling like mydumper

Feature Request

Is your feature request related to a problem? Please describe:

Some users don't want dumpling to affect their normal TiDB workload.

Describe the feature you'd like:

Use the session variable tidb_force_priority to lower the priority of dumpling's SQL statements and reduce the resource usage of the TiDB cluster; a sketch follows.
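
A minimal sketch, assuming TiDB's tidb_force_priority session variable and that dumpling would run this on every connection it opens:

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Lower the priority of all statements this session issues;
	// dumpling would apply it per connection before its scans.
	if _, err := db.Exec("SET SESSION tidb_force_priority = 'LOW_PRIORITY'"); err != nil {
		log.Fatal(err)
	}
}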


Support dumping SQL files to a local filesystem

Create an MVP for this project, showing performance and correctness. See pingcap/community#123 for the full list of expectations.

Checklist:

  • Make sure your architecture takes account of all features in Phases 1 and 2. Implementing those features in the future should not require a complete refactoring.
  • Resulting data are sorted by primary key (_tidb_rowid, if it exists, is always considered the primary key)
  • SQL files are split into sizes close to the given configuration
  • Single tables can be dumped in parallel, if a primary key or unique btree key exists (Postponed into #12, and TiDB may not be able to handle it)
  • Consistency: dumping a snapshot instead of live data (either acquire a read lock or ignore new updates)
  • Error handling and recovery
  • Support all data types allowed by TiDB (text, blob, int, float, decimal, bit, enum, set, timestamp, datetime, date, time, year, json)
  • Does not fail with edge cases for data (e.g. naughty strings, character sets, time zones)
  • Properly handle special features (e.g. special characters in table name, table partitions, views, generated columns, MariaDB sequences, etc.)
  • Has adequate logging and Prometheus metrics
  • Unit test coverage ≥ 75%

Score

  • 5000 (reduced from 6200)

dump TiDB INFORMATION_SCHEMA failed

[2020/03/01 21:07:14.710 +08:00] [WARN] [session.go:1123] ["compile SQL failed"] [conn=51] [error="[planner:1054]Unknown column '_tidb_rowid' in 'field list'"] [SQL="SELECT _tidb_rowid from INFORMATION_SCHEMA.INSPECTION_RESULT LIMIT 0"]

It seems table INSPECTION_RESULT has no _tidb_rowid.

Support output metadata after the dump operation

Support writing a metadata file into the output directory after the dump operation. DM needs the position in the metadata to get the start checkpoint for the syncer in all mode.
An example of Mydumper's metadata:

Started dump at: 2020-03-10 15:53:25
SHOW MASTER STATUS:
        Log: ON.000001
        Pos: 7502
        GTID:6ce40be3-e359-11e9-87e0-36933cb0ca5a:1-29

Finished dump at: 2020-03-10 15:53:25

Consider making dumpling's `-F` flag use the `MB` unit

Dumpling and Mydumper have the same command line flag -F, but the unit is totally different: Mydumper's -F unit is MB while Dumpling's is bytes.

Also, there is no unit in the --help output, and bytes are not user-friendly: if users want to configure 64 MB, they have to specify 67108864, which is cumbersome.

Users who used Mydumper previously can easily make mistakes, especially since we link the Mydumper documentation from the Dumpling documentation.

Let Dumpling set GC safepoint ttl with new pd interface

When Dumpling dumps a large amount of data, it needs to set the TiDB snapshot at a certain ts. When the dump takes too long, TiKV may GC this snapshot ts and Dumpling will fail to dump, exiting with an error.

To avoid this problem, users need to set tikv_gc_life_time to a longer value, as explained at the end of https://pingcap.com/docs-cn/stable/export-or-backup-using-dumpling. However, this is not convenient: users need to run

update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';

before dumping and run

update mysql.tidb set VARIABLE_VALUE = '10m' where VARIABLE_NAME = 'tikv_gc_life_time';

after dumping.

However, Dumpling shouldn't change this variable itself, because the dump process may fail halfway and leave it modified.

In v4.0.0, PD has a new interface, UpdateServiceGCSafePoint, which allows a service to set the TiKV GC safe point with a TTL; this suits Dumpling's scenario very well. What's more, TiCDC already uses this interface in its service.

So, in summary, I think Dumpling can also use this interface to make sure the dumped ts is not GCed; a sketch follows.
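
A hedged sketch against the PD client interface (import path and signature as of the PD v4.0 era; verify before relying on it). Dumpling would refresh the safe point periodically while the dump runs:

package dump

import (
	"context"
	"log"
	"time"

	pd "github.com/pingcap/pd/v4/client"
)

// keepGCSafePoint refreshes a service GC safe point with a TTL so TiKV
// won't GC the snapshot ts while the dump is still running.
func keepGCSafePoint(ctx context.Context, pdCli pd.Client, snapshotTS uint64) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		// serviceID "dumpling", TTL 300 seconds, refreshed every minute.
		if _, err := pdCli.UpdateServiceGCSafePoint(ctx, "dumpling", 300, snapshotTS); err != nil {
			log.Println("update service GC safe point failed:", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}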

Omit generated columns when generating SQL

When a table has a GENERATED column, current dumpling also generates SQL containing the generated values, which causes an error in downstream DBs.
We need to omit GENERATED columns; a sketch of one approach follows the examples below.

example

table-structure:

CREATE TABLE `t1` (
  `id` int(11) DEFAULT NULL,
  `name` varchar(20) DEFAULT NULL,
  `info` json DEFAULT NULL,
  `gen_id` int(11) GENERATED ALWAYS AS (json_extract(`info`,'$.id')) VIRTUAL,
  `info2` varchar(40) DEFAULT NULL,
  UNIQUE KEY `gen_idx` (`gen_id`),
  KEY `multi_col` (`id`,`gen_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

SQL generated by dumpling:

/*!40101 SET NAMES binary*/;
INSERT INTO `t1` VALUES
(1, 'Arya Stark', '{"id": 120}', 120, NULL),
(2, 'Catelyn Stark', '{"id": 121}', 121, NULL),
(3, 'Eddard Stark', '{"id": 122}', 122, NULL);

SQL generated by mydumper:

/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;
INSERT INTO `t1` (`id`,`name`,`info`,`info2`) VALUES
(1,'Arya Stark',CONVERT('{\"id\": 120}' USING UTF8MB4),NULL),
(2,'Catelyn Stark',CONVERT('{\"id\": 121}' USING UTF8MB4),NULL),
(3,'Eddard Stark',CONVERT('{\"id\": 122}' USING UTF8MB4),NULL);
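
A hedged sketch of how dumpling could pick the non-generated columns and emit an explicit column list like mydumper does (MySQL 5.7+ marks generated columns with "VIRTUAL GENERATED"/"STORED GENERATED" in information_schema.columns.EXTRA):

package dump

import (
	"database/sql"
	"strings"
)

// selectableColumns returns the columns to dump, skipping generated ones.
func selectableColumns(db *sql.DB, schema, table string) ([]string, error) {
	rows, err := db.Query(
		`SELECT COLUMN_NAME FROM information_schema.COLUMNS
		 WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ? AND EXTRA NOT LIKE '%GENERATED%'
		 ORDER BY ORDINAL_POSITION`, schema, table)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var cols []string
	for rows.Next() {
		var c string
		if err := rows.Scan(&c); err != nil {
			return nil, err
		}
		cols = append(cols, "`"+c+"`")
	}
	return cols, rows.Err()
}

// insertPrefix builds e.g. INSERT INTO `t1` (`id`,`name`,`info`,`info2`) VALUES
func insertPrefix(table string, cols []string) string {
	return "INSERT INTO `" + table + "` (" + strings.Join(cols, ",") + ") VALUES"
}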

--snapshot argument doesn't work

Bug Report


  1. What did you do? If possible, provide a recipe for reproducing the error.
    Run dumpling with --snapshot argument.

  2. What did you expect to see?
    Dump tables with given snapshot.

  3. What did you see instead?
    Dump tables with current snapshot.

Support dumping CSV

Support dumping data as CSV files instead of SQL files. (Schema files should still be SQL.) A minimal sketch follows.
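
A minimal sketch, assuming the standard encoding/csv package for the data files:

package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	f, err := os.Create("test.t.0.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	w := csv.NewWriter(f)
	defer w.Flush() // Flush writes any buffered records to f
	w.Write([]string{"id", "name"}) // header row (could be optional)
	w.Write([]string{"1", "Arya Stark"})
	w.Write([]string{"2", "Catelyn Stark"})
}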

Score

  • 300

Reduce the time dumpling holds the FTWRL lock

Feature Request

Is your feature request related to a problem? Please describe:

When dumping MySQL, dumpling uses FTWRL to lock all tables and releases the lock only after all tables have been dumped, which strongly affects the MySQL server.

Describe the feature you'd like:

mydumper uses a special approach to guarantee consistency while keeping the lock time short: FTWRL -> start THREADS transactions (WITH CONSISTENT SNAPSHOT) -> UNLOCK TABLES -> dump data.
I think we can also do this in dumpling; a sketch of the sequence follows.
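
A hedged sketch of that sequence, assuming one control connection plus one connection per dump thread (error handling kept minimal):

package dump

import (
	"context"
	"database/sql"
)

// shortLock mirrors mydumper's trick: FTWRL is held only long enough to open
// one consistent-snapshot transaction per dump thread.
func shortLock(ctx context.Context, ctl *sql.Conn, workers []*sql.Conn) error {
	if _, err := ctl.ExecContext(ctx, "FLUSH TABLES WITH READ LOCK"); err != nil {
		return err
	}
	for _, w := range workers {
		// Each worker pins the same consistent view (REPEATABLE READ)
		// before the global lock is released.
		if _, err := w.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
			return err
		}
	}
	// All snapshots are open; writes may resume while the workers dump.
	_, err := ctl.ExecContext(ctx, "UNLOCK TABLES")
	return err
}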


Usability improvements for command line parameters `-t` and `-F`

Background

  • -t controls how many tables can be exported concurrently. If the user also provides -r, each table can be split into several concurrent export jobs, so -t is not exactly the upper limit of export threads.
  • -F only accepts an integer, which represents the size of the output file in bytes. Users may want to provide a more human-readable value such as 64MB or 1GB.

Requirement

  • Make -t control the global upper limit of export threads, or provide another parameter to limit the concurrency of a single table's export, so the global concurrency can be controlled.
  • Make -F accept human-readable file sizes (a parsing sketch follows).
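
A parsing sketch for the human-readable -F values (helper name and accepted suffixes are hypothetical):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFileSize turns "64MB"/"1GB"-style input into bytes;
// a bare integer keeps the old bytes behavior.
func parseFileSize(s string) (uint64, error) {
	units := []struct {
		suffix string
		mul    uint64
	}{{"GB", 1 << 30}, {"MB", 1 << 20}, {"KB", 1 << 10}, {"B", 1}}
	s = strings.ToUpper(strings.TrimSpace(s))
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(s, u.suffix), 10, 64)
			return n * u.mul, err
		}
	}
	return strconv.ParseUint(s, 10, 64) // bare number: bytes, as today
}

func main() {
	n, _ := parseFileSize("64MB")
	fmt.Println(n) // 67108864
}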

Maybe we need to quote table names when querying DB info

Currently, we inject the table name directly into the SQL query.

query := fmt.Sprintf("SHOW CREATE DATABASE %s", database)

query := fmt.Sprintf("SHOW CREATE TABLE %s.%s", database, table)

Even though those table names are obtained directly from the database connection, so we need not fear SQL injection, dumpling will still fail when a table name uses an SQL keyword. An example is running integration_test on MariaDB (a quoting sketch follows the trace):

dump failed: SHOW CREATE DATABASE rows: err = Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'rows' at line 1
goroutine 1 [running]:
runtime/debug.Stack(0x1615be0, 0xc00000e4e0, 0x1616980)
	/usr/local/Cellar/go/1.14/libexec/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0x1615be0, 0xc00000e4e0, 0xc0000260c0, 0xc0000243e0)
	/Users/hillium/Developer/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc0001f4180, 0xc0000243e0, 0x19, 0xc0001858c0, 0x1, 0xc0000243e0)
	/Users/hillium/Developer/dumpling/v4/export/sql.go:367 +0x152
github.com/pingcap/dumpling/v4/export.ShowCreateDatabase(0xc0001f4180, 0x7ffeefbff7b5, 0x4, 0xc0001859e0, 0x14200df, 0x7ffeefbff7c1, 0x26)
	/Users/hillium/Developer/dumpling/v4/export/sql.go:38 +0x10c
github.com/pingcap/dumpling/v4/export.dumpDatabases(0x1620180, 0xc0000260c0, 0xc0001b4780, 0xc0001f4180, 0x161e3c0, 0xc000010078, 0x0, 0x100)
	/Users/hillium/Developer/dumpling/v4/export/dump.go:109 +0x124
github.com/pingcap/dumpling/v4/export.Dump(0xc0001b4780, 0x0, 0x0)
	/Users/hillium/Developer/dumpling/v4/export/dump.go:92 +0x84f
main.run()
	/Users/hillium/Developer/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0x19774e0, 0xc0001ec600, 0x0, 0x10)
	/Users/hillium/Developer/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0x19774e0, 0xc000020130, 0x10, 0x11, 0x19774e0, 0xc000020130)
	/Users/hillium/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0x19774e0, 0x0, 0x1, 0xc00007a058)
	/Users/hillium/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
	/Users/hillium/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
	/Users/hillium/Developer/dumpling/cmd/dumpling/main.go:131 +0x2d
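
A hedged fix sketch: quote identifiers with backticks (doubling embedded backticks, the MySQL quoting rule), so reserved words like rows parse as identifiers:

package main

import (
	"fmt"
	"strings"
)

// quoteIdentifier wraps a name in backticks and doubles any embedded backtick.
func quoteIdentifier(name string) string {
	return "`" + strings.ReplaceAll(name, "`", "``") + "`"
}

func main() {
	fmt.Printf("SHOW CREATE DATABASE %s\n", quoteIdentifier("rows"))
	// SHOW CREATE DATABASE `rows`
}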

Support a custom binary output format

SQL and CSV are not easy to parse. The purpose of this task is for Dumpling to support a binary format which allows Lightning to read data more quickly.

  • Decide the format specification
  • Implement encoder
  • Implement decoder (read file and output a row of values)
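
Purely as an illustration of the kind of encoding that could be specified (not a decided format), a length-prefixed value encoder with its matching decoder:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeValue emits a 4-byte little-endian length followed by the raw bytes.
func writeValue(w io.Writer, v []byte) error {
	var lenBuf [4]byte
	binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(v)))
	if _, err := w.Write(lenBuf[:]); err != nil {
		return err
	}
	_, err := w.Write(v)
	return err
}

// readValue is the inverse: read the length, then exactly that many bytes.
func readValue(r io.Reader) ([]byte, error) {
	var lenBuf [4]byte
	if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
		return nil, err
	}
	v := make([]byte, binary.LittleEndian.Uint32(lenBuf[:]))
	_, err := io.ReadFull(r, v)
	return v, err
}

func main() {
	var buf bytes.Buffer
	writeValue(&buf, []byte("Arya Stark"))
	v, _ := readValue(&buf)
	fmt.Printf("%s\n", v) // Arya Stark
}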

Score

  • 600

Error occurs with argument '-r'

./bin/dumpling -B test -B d1 -r 100
zsh: correct 'test' to 'tests' [nyae]? n
Release version:
Git commit hash: e1ffd1a5e64ea4f7e2a3be4c234d11eea0f7ec29
Git branch:      master
Build timestamp: 2020-06-05 08:00:49Z
Go version:      go version go1.14.2 darwin/amd64

[2020/06/05 17:13:02.370 +08:00] [INFO] [config.go:118] ["detect server type"] [type=TiDB]
[2020/06/05 17:13:02.370 +08:00] [INFO] [config.go:136] ["detect server version"] [version=4.0.0-beta.2-562-ge85b2c516]
[2020/06/05 17:13:02.415 +08:00] [WARN] [sql.go:485] ["can't get estimate count from db"] [query="EXPLAIN SELECT `a` FROM `test`.`t`"] [error="sql: Scan error on column index 2, name \"task\": destination not a pointer"]

Support specifying output file format

Feature Request

Is your feature request related to a problem? Please describe:

Some users need to specify the output filename format to avoid duplication.

Describe the feature you'd like:

Support specifying the output file name format instead of a fixed, built-in file name; a sketch follows.
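
One possible shape, sketched with text/template (the flag and pattern syntax are hypothetical, not a decided design):

package main

import (
	"log"
	"os"
	"text/template"
)

func main() {
	// Pattern as a user might supply it via a hypothetical
	// --output-filename-template flag.
	tmpl := template.Must(template.New("name").Parse("{{.DB}}.{{.Table}}.{{.Index}}\n"))
	data := struct {
		DB, Table string
		Index     int
	}{"test", "sbtest1", 0}
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		log.Fatal(err)
	}
	// Output: test.sbtest1.0
}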


Cannot dump heavily skewed data in parallel

Example:

create table x(id bigint primary key);
insert into x (id) values (1);
insert into x (id) values (2);
...
insert into x (id) values (9999);
insert into x (id) values (10000);

insert into x (id) values (9223372036854775807);

Currently Dumpling splits the range using only min and max, ignoring the actual data distribution, so this causes one thread to dump all the small IDs while the rest dump zero or one rows each.

We could either introduce mydumper's bisection algorithm, or use a work-stealing algorithm in which completed threads "steal" unprocessed ranges from the still-working threads.

Make `-t` control the global upper limit of export threads

Separated from #76.

Currently, -t controls how many tables can be exported concurrently. If the user also provides -r, each table can be split into several concurrent export jobs, so -t is not exactly the upper limit of export threads.

We want to change -t to control the global upper limit of export threads, or provide another parameter to limit the concurrency of a single table's export, so the global concurrency can be controlled; a worker-pool sketch follows.
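
A worker-pool sketch: a global semaphore bounds the total number of export goroutines, regardless of how many chunks each table splits into (names illustrative):

package main

import "sync"

func dumpAll(chunks []int, threads int) {
	sem := make(chan struct{}, threads) // -t becomes the semaphore capacity
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		sem <- struct{}{} // blocks once `threads` dumps are in flight
		go func(chunk int) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = chunk // dump one chunk here
		}(c)
	}
	wg.Wait()
}

func main() { dumpAll([]int{1, 2, 3, 4}, 2) }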

Unify `--filesize` and `--statement-size` definition with mydumper's

Feature Request

Is your feature request related to a problem? Please describe:

Currently dumpling's --filesize and --statement-size only count the column values' size. However, mydumper counts the text file's size, which may confuse users.

Describe the feature you'd like:

Unify the --filesize and --statement-size definitions with mydumper's, meaning that statement text like 'INSERT' is also counted toward the file size.


Add CI pipelines, official documents and more tests in dumpling

  • Add official documents in https://github.com/pingcap/docs-cn @YuJuncen

  • Add a -T/--tables-list argument like mydumper's

  • Add a black-white list filter to replace mydumper's -x argument

  • Improve performance

  • Add more tests to dump data in different scenarios [ask QA for test cases] @lichunzhu

    Correctness test:

    • TiDB feature

    Performance test:

    • dump TB level data and make sure it's correct.
    • dump TB level data and each table has
  • Add release pipelines [ask PM] @lichunzhu

  • Publish dumpling [Due time: 4.0.1]

invalid metadata for multiple GTID sets

  1. Dump data from a MySQL server with multiple GTID sets:
> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+--------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                          |
+------------------+----------+--------------+------------------+--------------------------------------------+
| mysql-bin.000003 | 1274     |              |                  | 5b5a8e4e-9b43-11ea-900d-0242ac170002:1-10, |
|                  |          |              |                  | 5b642cb6-9b43-11ea-8914-0242ac170003:1-7   |
+------------------+----------+--------------+------------------+--------------------------------------------+
  2. An extra newline appears after the first GTID set:
# cat metadata
Started dump at: 2020-05-21 10:09:16
SHOW MASTER STATUS:
		Log: mysql-bin.000003
		Pos: 1274
		GTID:5b5a8e4e-9b43-11ea-900d-0242ac170002:1-10,
5b642cb6-9b43-11ea-8914-0242ac170003:1-7
Finished dump at: 2020-05-21 10:09:16
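
A hedged fix sketch: normalize the Executed_Gtid_Set value before writing the metadata, since SHOW MASTER STATUS returns it with embedded newlines:

package main

import (
	"fmt"
	"strings"
)

func main() {
	gtid := "5b5a8e4e-9b43-11ea-900d-0242ac170002:1-10,\n5b642cb6-9b43-11ea-8914-0242ac170003:1-7"
	// Strip the newline MySQL inserts between GTID sets so the
	// metadata field stays on one line.
	clean := strings.ReplaceAll(gtid, "\n", "")
	fmt.Println("\t\tGTID:" + clean)
}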

Argument `threads` should be validated

Users are able to run dumpling -t 0 (which gets stuck) and dumpling -t -1 (which panics). As a product, it should validate the effective ranges of numeric arguments; a sketch follows.
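
A minimal validation sketch (flag name per dumpling's -t/--threads; the exact bounds are an assumption):

package main

import (
	"fmt"
	"os"
)

func validateThreads(threads int) error {
	if threads <= 0 {
		return fmt.Errorf("--threads must be a positive integer, got %d", threads)
	}
	return nil
}

func main() {
	if err := validateThreads(0); err != nil {
		fmt.Fprintln(os.Stderr, err) // fail fast instead of hanging or panicking
		os.Exit(2)
	}
}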

Unclean exit if connection fails

Trying to execute dumpling without any command-line arguments assumes a connection to 127.0.0.1:4000, and if that fails, the command exits uncleanly:

$ dumpling/bin/dumpling
Release version:
Git commit hash: ba51a591f93ca26325f01ef484074697e7d9c451
Git branch:      master
Build timestamp: 2020-05-11 05:56:17Z
Go version:      go version go1.14.2 linux/amd64

dump failed: err = err = dial tcp 127.0.0.1:4000: connect: connection refused
goroutine 1 [running]:
runtime/debug.Stack(0xa13be0, 0xc00006c410, 0xa14580)
        /home/centos/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa13be0, 0xc00006c410, 0xc000024098, 0x95bdcf)
        /home/centos/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.simpleQuery(0xc0000ec180, 0x95bdcf, 0x10, 0xc0000c9998, 0xc0000100c0, 0x0)
        /home/centos/dumpling/v4/export/sql.go:367 +0x152
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc0000ec180, 0x2a, 0x4344c6, 0xc00002a180, 0xc00007ab60)
        /home/centos/dumpling/v4/export/sql.go:94 +0x81
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc0000ec180, 0x5, 0xc00002a180, 0x2a, 0xc0000ec180)
        /home/centos/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc0000b2a00, 0x0, 0x0)
        /home/centos/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
        /home/centos/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e3a0, 0xda27b0, 0x0, 0x0)
        /home/centos/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e3a0, 0xc0000201d0, 0x0, 0x0, 0xd6e3a0, 0xc0000201d0)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e3a0, 0x0, 0x1, 0xc00006e058)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
        /home/centos/dumpling/cmd/dumpling/main.go:131 +0x2d

goroutine 1 [running]:
runtime/debug.Stack(0xa14580, 0xc0000e86f0, 0xa14580)
        /home/centos/go/src/runtime/debug/stack.go:24 +0x9d
github.com/pingcap/dumpling/v4/export.withStack(0xa14580, 0xc0000e86f0, 0x10, 0xc0000c9998)
        /home/centos/dumpling/v4/export/error.go:40 +0x8d
github.com/pingcap/dumpling/v4/export.SelectVersion(0xc0000ec180, 0x2a, 0x4344c6, 0xc00002a180, 0xc00007ab60)
        /home/centos/dumpling/v4/export/sql.go:96 +0x9e
github.com/pingcap/dumpling/v4/export.detectServerInfo(0xc0000ec180, 0x5, 0xc00002a180, 0x2a, 0xc0000ec180)
        /home/centos/dumpling/v4/export/prepare.go:30 +0x2f
github.com/pingcap/dumpling/v4/export.Dump(0xc0000b2a00, 0x0, 0x0)
        /home/centos/dumpling/v4/export/dump.go:35 +0x2cb
main.run()
        /home/centos/dumpling/cmd/dumpling/main.go:122 +0x5b9
main.glob..func1(0xd6e3a0, 0xda27b0, 0x0, 0x0)
        /home/centos/dumpling/cmd/dumpling/main.go:58 +0x20
github.com/spf13/cobra.(*Command).execute(0xd6e3a0, 0xc0000201d0, 0x0, 0x0, 0xd6e3a0, 0xc0000201d0)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:844 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0xd6e3a0, 0x0, 0x1, 0xc00006e058)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:945 +0x317
github.com/spf13/cobra.(*Command).Execute(...)
        /home/centos/go/pkg/mod/github.com/spf13/[email protected]/command.go:885
main.main()
        /home/centos/dumpling/cmd/dumpling/main.go:131 +0x2d

This also occurs if an explicit but unreachable host is given. Dumpling should exit with an error message (rather than a stack trace) if a host cannot be reached. If no arguments are given when it's invoked, help output should be shown.

Support customized csv separator/delimiter

Feature Request

Is your feature request related to a problem? Please describe:

Some users need to specify a custom separator to avoid ambiguity.

Describe the feature you'd like:

Support a --csv-separator flag to specify a custom separator/delimiter for dumpling. The default value is ",". A minimal sketch follows.
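
A minimal sketch: with encoding/csv, the writer's Comma field is the delimiter, so the value of the proposed flag could be applied like this:

package main

import (
	"encoding/csv"
	"os"
)

func main() {
	w := csv.NewWriter(os.Stdout)
	w.Comma = '|' // value taken from the proposed --csv-separator flag
	w.Write([]string{"1", "Arya Stark"})
	w.Flush()
	// Output: 1|Arya Stark
}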


TiDB OOM while dumping massive data

[root@localhost data2]# ./dumpling -h 172.16.4.36 -P 4000 -u root -o dumpling-exported --filetype csv -t 40
Release version:
Git commit hash: adf4a53efc9fec87ea96f92f5829019679f8580b
Git branch:      master
Build timestamp: 2020-06-19 01:00:52Z
Go version:      go version go1.14.1 linux/amd64

[2020/06/19 09:06:49.259 +08:00] [INFO] [config.go:118] ["detect server type"] [type=TiDB]
[2020/06/19 09:06:49.259 +08:00] [INFO] [config.go:136] ["detect server version"] [version=4.0.0-beta.2-616-gf139821c8]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=mysql]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=INFORMATION_SCHEMA]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=METRICS_SCHEMA]
[2020/06/19 09:06:49.316 +08:00] [WARN] [block_allow_list.go:15] ["unsupported dump schema in TiDB now"] [schema=PERFORMANCE_SCHEMA]
[2020/06/19 09:12:23.539 +08:00] [ERROR] [main.go:175] ["dump failed error stack info"] [error="Error 1105: Out Of Memory Quota![conn_id=50]"] [stack="github.com/pingcap/log.Error\n\t/home/tidb/go/pkg/mod/github.com/pingcap/[email protected]/global.go:42\nmain.main\n\t/home/tidb/cy/src/dumpling/cmd/dumpling/main.go:175\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]

dump failed: Error 1105: Out Of Memory Quota![conn_id=50]
