
spanner-migration-tool's Introduction

Spanner migration tool: Spanner Evaluation and Migration


Important

We have changed the architecture of the minimal downtime migration and added a Pub/Sub notifications component. There are some changes to the required permissions because of the new component. Please go through the Permissions page and the Design page of the documentation.

Overview

Spanner migration tool is a stand-alone open source tool for Cloud Spanner evaluation and migration, using data from an existing PostgreSQL, MySQL, SQL Server, Oracle or DynamoDB database. The tool ingests schema and data from either a pg_dump/mysqldump file or directly from the source database, and supports both schema and data migration. For schema migration, Spanner migration tool automatically builds a Spanner schema from the schema of the source database. This schema can be customized using the Spanner migration tool schema assistant, and a new Spanner database is created using the resulting Spanner schema.

Installation

Installation instructions for Spanner migration tool are available here.

Documentation

Detailed documentation on how to use spanner-migration-tool is available at: https://googlecloudplatform.github.io/spanner-migration-tool


spanner-migration-tool's Issues

Improve error reporting in harbourbridge

Currently, HarbourBridge just panics for errors such as "can't create database" or "can't get instance". We just print the RPC error message, which is not enough to understand the cause of the error. We should add better error messages that include the input to the failed function, such as the request body.
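A minimal sketch of the kind of error wrapping this could use, assuming a hypothetical createDatabase helper built on the Cloud Spanner admin client (the function name and structure are illustrative, not HarbourBridge's actual code):

package main

import (
	"context"
	"fmt"

	database "cloud.google.com/go/spanner/admin/database/apiv1"
	databasepb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
)

// createDatabase wraps RPC errors with the request that produced them
// instead of panicking, so the error shows what was being attempted.
func createDatabase(ctx context.Context, adminClient *database.DatabaseAdminClient, req *databasepb.CreateDatabaseRequest) error {
	op, err := adminClient.CreateDatabase(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateDatabase failed for parent=%q, statement=%q: %w",
			req.GetParent(), req.GetCreateStatement(), err)
	}
	if _, err := op.Wait(ctx); err != nil {
		return fmt.Errorf("CreateDatabase operation failed for parent=%q: %w", req.GetParent(), err)
	}
	return nil
}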

Source Refactoring

During the process of adding support for various database sources in HarbourBridge, a large amount of duplicated code has built up for each database. Three aspects need to be handled as part of the code cleanup:

  1. Structure of the codebase - move the code of each source DB into its own directory, grouped under a common parent directory. This keeps the structure of the codebase easy to understand.
  2. Extracting common code across databases - a large amount of code in each database implementation is common, since it handles things like reading from dump files, information schema, etc., which are broadly similar across most databases. We need to extract this common functionality across the various sources to maintain a common code flow, reduce bloat, make adding new databases easier, and ensure that bug fixes reach all the source databases.
  3. Move implementation details of specific databases out of the common top-level code (probably via a factory or registration type structure; see the sketch after this list). This ensures that when a source database is added, no source-DB logic leaks into the common processing code in HarbourBridge.

Each of the above can be handled in an independent change request, but the exercise should be taken up before adding support for more databases.
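One possible shape for the registration structure mentioned in item 3 (the package name and the InfoSchema interface are illustrative, not the tool's actual API):

package sources

import "fmt"

// InfoSchema abstracts the source-specific logic; a real version would also
// cover dump parsing, type mapping, etc.
type InfoSchema interface {
	GetTables() ([]string, error)
}

// registry maps a driver name (e.g. "mysql", "postgres") to a constructor,
// so the common top-level code never imports source-specific packages.
var registry = map[string]func(connStr string) (InfoSchema, error){}

// Register is called from each source package's init().
func Register(driver string, ctor func(connStr string) (InfoSchema, error)) {
	registry[driver] = ctor
}

// ForDriver looks up the source implementation for a driver.
func ForDriver(driver, connStr string) (InfoSchema, error) {
	ctor, ok := registry[driver]
	if !ok {
		return nil, fmt.Errorf("unsupported driver: %s", driver)
	}
	return ctor(connStr)
}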

Preserve Primary Key name of original schema in spanner

Expected Behavior

Save the primary key name from the original schema while migrating to spanner.

Actual Behavior

The name of the primary key is lost during the migration process

Steps to Reproduce the Problem

  1. Trigger a migration in any way; the primary key names are lost

Specifications

  • Version: All
  • Platform: All

Consider support for indexes

HarbourBridge maps PostgreSQL primary keys to Spanner primary keys, but it ignores other indexes. These could be mapped to Spanner secondary indexes.

Validate data written to Spanner

After data is written to Spanner, read it back and check it matches the data from the pg_dump.

At a minimum, we could check row counts for each table.

We could also:
a) check actual data, but just do this for a sampled set of data, or
b) check all data (probably requires another pass over the pg_dump for large datasets).
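As a sketch of the minimal row-count check (the helper name and surrounding plumbing are illustrative):

package main

import (
	"context"

	"cloud.google.com/go/spanner"
)

// rowCount returns the number of rows in a Spanner table; comparing it with
// a count derived from the pg_dump is the minimal validation described above.
func rowCount(ctx context.Context, client *spanner.Client, table string) (int64, error) {
	iter := client.Single().Query(ctx, spanner.Statement{SQL: "SELECT COUNT(*) FROM " + table})
	defer iter.Stop()
	row, err := iter.Next()
	if err != nil {
		return 0, err
	}
	var count int64
	if err := row.Columns(&count); err != nil {
		return 0, err
	}
	return count, nil
}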

Use UNIQUE + NOT NULL when there is no primary key

Currently, if a table does not have a primary key, we create a synthetic primary key.

However, some tables have columns that are UNIQUE and NOT NULL but are not primary keys. Such columns would be good candidates to use as the primary key (and better than adding synthetic keys).

Fix: mysql maps tinyint(1) to INT64 instead of BOOL

Example issue: active column of customer table in sakila example.

What we get in the spanner schema:

active INT64 NOT NULL

but we should get:

active BOOL NOT NULL

Note that mysqldump gives the right mapping.

The issue seems to be in infoschema.go, where we process data from the information_schema.COLUMNS table.
What we get back from the information_schema query for table customer and column active is:
column_name: active
data_type: tinyint
column_type: tinyint(1)
is_nullable: NO
column_default: 1
character_maximum_length: NULL
numeric_precision: 3
numeric_scale: 0
extra:

We pass this into toType, and get back the type {Name:tinyint Mods:[] ArrayBounds:[]} i.e. it is missing the mods. Note that toType only looks at data_type and doesn't look at column_type.
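A sketch of the kind of fix this suggests: have toType also look at column_type so that tinyint(1) carries its display-width mod, letting the DDL mapping later choose BOOL. The signature and the schema import path are assumptions based on the issue text, not the actual code:

package mysql

import (
	"strings"

	"github.com/cloudspannerecosystem/harbourbridge/schema"
)

// toType parses both data_type and column_type; for tinyint(1) it records
// the (1) mod so toddl can map the column to BOOL instead of INT64.
func toType(dataType, columnType string) schema.Type {
	if dataType == "tinyint" && strings.HasPrefix(columnType, "tinyint(1)") {
		return schema.Type{Name: "tinyint", Mods: []int64{1}}
	}
	// ... existing handling based on dataType alone ...
	return schema.Type{Name: dataType}
}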

Data dependency in foreign key

Expected Behavior

The tool should be able to migrate data with dependent values in foreign keys.

Actual Behavior

Spanner throws a reference error when migrating data that references a row in another table before the referenced row exists.

Error while writing 833 rows to Spanner: spanner: code = "FailedPrecondition", desc = "Foreign key constraint my_fk_83 is violated on table workorderrouting. Cannot find referenced values in location(LocationID)."

Here, the location table is populated after the workorderrouting table, so Spanner cannot find the referenced value, since insert statements are executed in the order defined in the dump file.

Solution: first create the schema without foreign keys, then migrate the data, then add the foreign keys with ALTER TABLE statements.
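A sketch of the last step, adding the foreign keys after the data migration via a schema update (the constraint and table names are taken from the error above and are illustrative):

package main

import (
	"context"

	database "cloud.google.com/go/spanner/admin/database/apiv1"
	databasepb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
)

// addForeignKeys applies ALTER TABLE statements once all data has been
// written, so insert order no longer matters.
func addForeignKeys(ctx context.Context, adminClient *database.DatabaseAdminClient, dbName string) error {
	op, err := adminClient.UpdateDatabaseDdl(ctx, &databasepb.UpdateDatabaseDdlRequest{
		Database: dbName,
		Statements: []string{
			"ALTER TABLE workorderrouting ADD CONSTRAINT my_fk_83 FOREIGN KEY (LocationID) REFERENCES location (LocationID)",
		},
	})
	if err != nil {
		return err
	}
	return op.Wait(ctx)
}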

Refactor: cleanup handling of UNIQUE

Our handling of UNIQUE is confusing and our generation of schema (schema.go) is broken. Fortunately this doesn't cause us to generate broken Spanner schema.

Here are some more details:

a) In schema.go: we represent UNIQUE constraints within Column and also within Index. This means that UNIQUE constraints are redundantly represented which is pretty confusing. We should have one method for representing all UNIQUE constraints. See Discussion section for details of choices we have.

b) UpdateCols is broken: it should not push multi-column UNIQUE constraints down to the column level --- see the two example 'test' tables in the Discussion section.

c) We should avoid doing Spanner-specific transformations when we map into schema (schema.go). Where possible, Spanner-specific transformations should be in toddl.go. At present we have comments in e.g. mysqldump.go indicating Spanner-specific transformations.

DISCUSSION
How should we represent UNIQUE constraints in schema.go?
First, note that the UNIQUE constraint at the Column level is not sufficient for representing multi-column UNIQUE constraints. In particular:
CREATE TABLE test (id bigint, x bigint, UNIQUE(id, x));
CREATE TABLE test (id bigint UNIQUE, x bigint UNIQUE);
are not the same -- the first says that, when combined, id and x are unique, but the latter says they are individually unique (which is a stronger constraint).

Now, one option would be to add a multi-column constraint to schema.go and represent all UNIQUE constraints as multi-column constraints (dropping Unique from Column and from Index). However, this is not clear-cut since source databases have multiple ways to define UNIQUE constraints: you can either do it declaratively (as a property of a column or table), or you can do it using an index (essentially focusing on how the constraint is implemented). For example, in MySQL DDL a UNIQUE constraint can be:
i) A column level constraint (as part of a CREATE TABLE statement)
ii) A table level constraint (as part of a CREATE TABLE statement)
iii) An index (as part of a CREATE TABLE statement)
iv) Added via an ALTER TABLE statement (changing column constraints, table constraints or index definitions)
v) Added via a CREATE UNIQUE INDEX statement (which internally maps to an ALTER TABLE statement).

In MySQL infoschema tables, UNIQUE is represented both as part of the TABLE_CONSTRAINTS table as well as part of the STATISTICS table (which describes indexes).

Under the hood, all databases implement UNIQUE as an index.
Maybe we should just handle them as indexes after all?

Spanner throws an error when foreign keys referring to the same column are created concurrently

When multiple create-foreign-key requests that reference the same column and require internal index creation are sent to Spanner concurrently, Spanner throws an error of the form:
The schema change adding index IDX_contract_ContractID_U_09F242C8CD7BD34 for foreign key FK_contract_sku_3 must finish before adding foreign key FK_fingerprint_contract_2.

Handle case sensitivity for used index, table, FK names

The map usedNames ensures we generate unique names when we map from MySQL to Spanner, since Spanner requires all table names, foreign key names and index names to be distinct. Currently, names differing only in case are allowed in HarbourBridge but not in Spanner. We need to make the name handling case-insensitive.
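A minimal sketch of case-insensitive handling, assuming a hypothetical getUniqueName helper around the usedNames map (the real function and map shape may differ):

package internal

import (
	"fmt"
	"strings"
)

// getUniqueName keys usedNames by a case-folded form so that names differing
// only in case still collide and are uniquified, matching Spanner's rules.
func getUniqueName(usedNames map[string]bool, name string) string {
	if !usedNames[strings.ToLower(name)] {
		usedNames[strings.ToLower(name)] = true
		return name
	}
	// Append a numeric suffix until the folded name is unused.
	for i := 1; ; i++ {
		candidate := fmt.Sprintf("%s_%d", name, i)
		if !usedNames[strings.ToLower(candidate)] {
			usedNames[strings.ToLower(candidate)] = true
			return candidate
		}
	}
}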

Password prompt can get lost when using mysqldump or pg_dump

When piping mysqldump or pg_dump to HarbourBridge, if a password is required, we get the following output:

Enter password: Using driver (source DB): mysqldump
Using Google Cloud project: test-project
Using only available Spanner instance: demo
Using Cloud Spanner instance: demo
...

and so the password prompt gets lost in the preliminary output we generate from HarbourBridge.

In this case, we should wait for the password to be entered before printing the project, instance etc.

One option would be to refactor main.go so that we check if driver is mysqldump or pg_dump and get a seekable input file, before we proceed with printing out Cloud project and instance etc.

Improve performance of Foreign Key processing

The current behavior in HarbourBridge is to process schema change statements for creation of foreign keys in a linear order. Each schema update for foreign key creation can take a long time in Cloud Spanner since it needs to do data validation as well as create backing indexes for the foreign key, where applicable. For a source database with a large number of foreign keys, this can become problematic since each foreign key creation takes a long time and the overall time taken can be in hours. This can be solved using the following approaches:

Problem Statement 1: Improve the throughput for foreign key creation

Alternative 1.1: Batch creation of multiple foreign keys in the same schema update.

There are some caveats to the same though:

Caveat 1.1.1: Cloud Spanner suggests at most 10 schema change statements in a single schema update that requires validation or backfill. See https://cloud.google.com/spanner/docs/schema-updates#large-updates for reference.

Caveat 1.1.2: Some of the foreign key creation may fail e.g., due to constraint violation. In such a case, the foreign key creations that were successful are not rolled back and the schema update API will return at the first error. See https://cloud.google.com/spanner/docs/schema-updates#order_of_execution_of_statements_in_batches for more details.

Alternative 1.2: Perform multiple schema changes in parallel

Caveat 1.2.1: If there are too many schema updates happening at the same time, Cloud Spanner may throttle some of them. See https://cloud.google.com/spanner/docs/schema-updates#frequency for more details.

We will need to do some investigation and come up with a heuristic to combine Alternative 1.1 and Alternative 1.2 for optimal performance.

Foreign keys in HarbourBridge are today created after all the data has already been migrated. This is done to avoid the issue where referencing rows are written before the referenced rows during data migration. This could be solved by doing the data migration in the order implied by the DAG obtained by following foreign key dependencies. However, this is not an option when using mysqldump or pg_dump, since those cannot be read out of order to migrate tables in a specific order.

Problem Statement 2: Create foreign keys during data migration for direct connect mode

There is some investigation/experimentation needed to understand if this would actually help with the overall data migration time.

For some customers, it may make sense for HarbourBridge not to perform the foreign key creation at all, if they care more about reducing the downtime and are willing to apply foreign keys themselves after the migration is over. To support such use cases, we should also provide a command line flag to skip foreign key creation during data migration. This is especially important if the data migration is followed by CDC migration, since foreign keys may cause the migration of CDC change events to fail and be retried multiple times, increasing the overall application downtime during migration.

Problem Statement 3: Add flag to skip foreign key creation on the HarbourBridge command line.

Note that we should also take into account that some of the foreign keys may be converted to interleaved tables when designing the solutions to Problem Statements 1 and 2.
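A sketch of Alternative 1.1 under the caveats above: apply foreign key statements in batches of at most 10 and record per-batch errors instead of stopping at the first failure (the function name and error handling are illustrative):

package main

import (
	"context"

	database "cloud.google.com/go/spanner/admin/database/apiv1"
	databasepb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
)

// applyForeignKeys batches ALTER TABLE ... ADD FOREIGN KEY statements, since
// Cloud Spanner suggests at most 10 validated statements per schema update.
func applyForeignKeys(ctx context.Context, adminClient *database.DatabaseAdminClient, dbName string, fkStatements []string) []error {
	const batchSize = 10
	var errs []error
	for start := 0; start < len(fkStatements); start += batchSize {
		end := start + batchSize
		if end > len(fkStatements) {
			end = len(fkStatements)
		}
		op, err := adminClient.UpdateDatabaseDdl(ctx, &databasepb.UpdateDatabaseDdlRequest{
			Database:   dbName,
			Statements: fkStatements[start:end],
		})
		if err == nil {
			err = op.Wait(ctx)
		}
		if err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}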

Add sample pg_dump file

Allow users to try out HarbourBridge without being connected to an existing PostgreSQL db by providing some sample pg_dump files.

Replace verbose.go with a logging library

Replace internal/verbose.go with one of the existing go logging libraries that provides support for different log-levels. Our main goal is to support '-v' (verbose mode) where we log lots of low-level details of conversion for debugging. Likely candidate: go.uber.org/zap.

Notes:

  1. the standard Go log package isn't what we want, since it doesn't provide the log levels we really need here.
  2. there is a TODO for this issue in internal/verbose.go.
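A minimal sketch of what the zap-based replacement could look like (the package name and Init function are illustrative):

package logger

import "go.uber.org/zap"

// Log is the shared logger. Verbose mode (-v) switches to a development
// config at debug level; otherwise low-level conversion details are dropped.
var Log = zap.NewNop()

func Init(verbose bool) error {
	cfg := zap.NewProductionConfig()
	if verbose {
		cfg = zap.NewDevelopmentConfig()
		cfg.Level = zap.NewAtomicLevelAt(zap.DebugLevel)
	}
	var err error
	Log, err = cfg.Build()
	return err
}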

Document caveats with running HarbourBridge server and web UI on different machines

When we open a dump file from the web UI, we look for the file on the file system that the server is running on.

If the web UI is running on a browser on the same machine as the server, then all will be good.

But if the web UI is running in a browser on a different machine, this might be confusing.

For example, a user might run HarbourBridge on their workstation, and then open the web UI in a browser on their laptop.

When selecting a file for upload, the user might think it should be a file on their laptop.

Thus, we should document the caveats with running HarbourBridge server and web UI on different machines.

Support for larger databases

The current HarbourBridge implementation is only intended for databases up to a few GB.
We use a 4GB database pg_dump file regularly for testing.

While the tool can be used for larger DBs, it is not well tested or tuned for that usage. Potential issues are:
a) it could send data too fast to Spanner, especially for single node instances.
b) it could be too slow for large multi-node instances.
c) missing support for restarting import (see #20).

Handle data conversion of non-boolean values for MySQL TINYINT(1)/BOOLEAN

NOTE: TINYINT(1) is a synonym for BOOL and BOOLEAN.

MySQL does not require values for BOOLEAN to actually be boolean values.
Instead, it allows any legal value for tinyint to be stored in a BOOLEAN i.e. -128 to 127.

At the moment, HarbourBridge generates an error and rejects rows where boolean columns have non-boolean values.

Proposed resolution: instead of generating an error, convert integer values to booleans using the standard heuristic: 0 is false and non-zero is true.
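A sketch of that heuristic as a conversion helper (the function name and where it hooks into data conversion are assumptions):

package main

import (
	"fmt"
	"strconv"
)

// convBool accepts any value legal for TINYINT(1) and maps 0 to false and
// any non-zero value to true, instead of rejecting the row.
func convBool(val string) (bool, error) {
	if b, err := strconv.ParseBool(val); err == nil {
		return b, nil
	}
	i, err := strconv.ParseInt(val, 10, 64)
	if err != nil || i < -128 || i > 127 {
		return false, fmt.Errorf("value %q cannot be converted to BOOL", val)
	}
	return i != 0, nil
}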

Consider making data import restartable

To improve reliability of data import, we could keep some state about progress of data writes (say another Spanner table), and restart where we left off. We could take advantage of transactions to know exactly what data was written by an Apply e.g. as part of each Apply batch, we could include info about what rows are included in the batch.

Report version of HarbourBridge when run from command line

We have had a few instances in the past where customers reported issues but it was suspected that they may have been using an older version of the tool. Since we don't report the version of HarbourBridge on stdout, it becomes quite hard to reproduce a given issue or to state whether upgrading to the latest version of the tool would have fixed it.

Until we get full release versioning in place, the suggestion is to report the commit hash and, if possible, the date and time when that commit was checked in to master.
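A sketch of one way to do this with build-time variables (the variable names and ldflags invocation are illustrative, not the tool's actual build setup):

package main

import "fmt"

// Populated at build time, e.g.:
//   go build -ldflags "-X main.commitHash=$(git rev-parse --short HEAD) -X main.commitDate=$(git show -s --format=%cI HEAD)"
var (
	commitHash = "unknown"
	commitDate = "unknown"
)

func printVersion() {
	fmt.Printf("HarbourBridge version: commit %s (%s)\n", commitHash, commitDate)
}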

Feature: Support fulltext indexes

HarbourBridge's current treatment of fulltext indexes (mysql) and equivalent functionality in postgres (GIN and GiST indexes) is partial and inconsistent.

With the mysql driver, fulltext indexes are mapped to regular Spanner indexes.

With the mysqldump driver, fulltext indexes are silently ignored/dropped.

We should:

  • figure out how we want to handle fulltext indexes (only realistic options at present are (a) drop them and document this in the report, or (b) map them to regular indexes, which probably isn't useful)
  • implement this consistently across dump and direct drivers (and for both mysql and postgres)
  • update this treatment as new features are available on Spanner

Harbourbridge error while installing

Expected Behavior

Installation ready to use after command:

GO111MODULE=on go get github.com/cloudspannerecosystem/harbourbridge

Actual Behavior

Error when running installation command:

# github.com/lfittl/pg_query_go/parser
In file included from C:\Users\<user>\go\pkg\mod\github.com\lfittl\p[email protected]\parser\include/c.h:1129,
                 from C:\Users\<user>\go\pkg\mod\github.com\lfittl\p[email protected]\parser\include/postgres.h:47,
                 from contrib_pgcrypto_sha1.c:50:
C:\Users\<user>\go\pkg\mod\github.com\lfittl\p[email protected]\parser\include/port.h:17:10: fatal error: netdb.h: No such file or directory
   17 | #include <netdb.h>
      |          ^~~~~~~~~
compilation terminated.

Steps to Reproduce the Problem

  1. Install Go on Windows
  2. Install MinGW
  3. Install HarbourBridge

Specifications

  • Version:
    • go: go1.15.7 windows/amd64
    • mingw: mingw 10.2.0
    • harbourbridge: latest
  • Platform:
    • windows 10

Fix treatment of Postgres INHERITS clause when using pg_dump

PostgreSQL supports an INHERITS clause: see https://www.postgresql.org/docs/9.5/ddl-inherit.html.

HarbourBridge handles this correctly when using direct database access i.e. when using driver=postgres.

However, pg_dump support is broken. For example running:

pg_dump ... | $GOPATH/bin/harbourbridge -driver=pg_dump

on a database containing a table created with:

  create table test_inherits () inherits (test);

will ignore the INHERITS clause and essentially generate an empty table with no columns (which will be augmented with a synthetic synth_id column because there is no primary key). Any data for this table will almost certainly be dropped.

Consider adding schema_only, data_only, and dry_run options

There is overlap in functionality here. Need to consider which options make sense vs adding too many flags and possible confusion.

Potential functionality we could add:

dry_run: generate and write out the schema and report, and do data conversion, but don't create a Spanner DB or load data into it.

schema_only: build the schema and create the DB, but skip data conversion and writing data to Spanner.

data_only: skip schema conversion and creation of DB, and instead do data conversion and write data to an existing DB. Optionally: check that the existing DB has a schema that is compatible with the one we would build (or we can just send data and see what breaks).

Fix: mysqldump parser unable to parse the ordering of index key columns

Expected Behavior

We should be able to get the ordering (i.e., ascending/descending) of the key columns involved in the index.

Actual Behavior

The mysqldump parser has no attribute in *ast.IndexPartSpecification that can be mapped to the ordering of the key column.

We need to find an alternative way to get the ordering of the key columns in the index; currently the default ordering of ascending is used.
The fix is needed in the toSchemaKeys function in mysql/mysqldump.go:

func toSchemaKeys(columns []*ast.IndexPartSpecification) (keys []schema.Key) {
	// *ast.IndexPartSpecification carries no ASC/DESC information, so each
	// key is added without an ordering and defaults to ascending.
	for _, colname := range columns {
		keys = append(keys, schema.Key{Column: colname.Column.Name.String()})
	}
	return keys
}

Early exit if pg_dump not parsable

If pg_dump output is not parsable, we still proceed and create a DB (with an empty schema) and then go on to do data conversion (which will fail since there is no schema). This all takes time, and while the output from the tool states what happened (it says 'no schema and no data'), it doesn't pinpoint the likely cause.

A better user experience would be to recognize that the pg_dump wasn't parsable, print an appropriate error and exit early.

MySQL dump inconsistent snapshots

Expected Behavior

While working with any database, the dump we use should be a consistent snapshot of the data we are restoring.

Actual Behavior

While generating a dump using mysqldump, writes to the database continue. Even when using the --single-transaction flag, the dump is not guaranteed to be consistent, as it is taken with REPEATABLE READ isolation. This only ensures a consistent snapshot of the rows being read; the lock is at a table level, not across the whole database. So there is a possibility of getting snapshots of different tables taken at different times.

Steps to Reproduce the Problem

  1. Populate a few GB of data in each table of the database (can start with 2 tables)
  2. Keep writing data into both the tables. Keep timestamp as a column to verify the inconsistency
  3. Start Mysqldump
  4. Check the dump to see different start times on both the tables.

Specifications

  • Version: All versions
  • Platform: All platforms

Using go get harbourbridge fails post go version 1.17

Using GO111MODULE=on go get github.com/cloudspannerecosystem/harbourbridge to download HarbourBridge with Go version 1.17+ throws
"go get is deprecated, try go install", which further throws:

The go.mod file for the module providing named packages contains one or more replace directives. It must not contain directives that would cause it to be interpreted differently than if it were the main module.

Consider switching report.txt to report.md

We currently use text for the report file.

This means:

  • formatting options are limited
  • we have to do our own formatting (clunky, more code)
  • we can't show links to documentation
  • we can't link from one part of the report to another

Markdown is now basically universal. Switching to it would give a nicer output and allow us to delete some code.

Support CRLF (aka DOS file format)

If pg_dump output is in DOS file format ("\r\n" instead of just "\n"), HarbourBridge basically fails: you'll get some kind of schema (but possibly missing the last statement), but data conversion will mostly break (you'll get some data, but we won't recognize the end of the COPY-BLOCK).

Note that you can run into this issue with docker. For example, if you get pg_dump data by running a docker command like "docker run -it ..." (see https://stackoverflow.com/questions/55034214/why-do-docker-run-t-outputs-include-r-in-the-command-output).
The '-t' gives you a new interactive pseudo-TTY, which maps newlines into CRLF ("\r\n").
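A sketch of one possible fix, assuming we wrap the dump-file reader before parsing (not the tool's actual implementation):

package main

import (
	"bufio"
	"bytes"
	"io"
)

// crlfReader strips the '\r' that immediately precedes '\n', so pg_dump
// output in DOS format is seen by the parser with plain "\n" line endings.
// Lone '\r' characters inside data are preserved.
type crlfReader struct {
	r   *bufio.Reader
	buf []byte // converted bytes not yet returned to the caller
}

func newCRLFReader(r io.Reader) io.Reader {
	return &crlfReader{r: bufio.NewReader(r)}
}

func (c *crlfReader) Read(p []byte) (int, error) {
	for len(c.buf) == 0 {
		line, err := c.r.ReadBytes('\n')
		c.buf = bytes.ReplaceAll(line, []byte("\r\n"), []byte("\n"))
		if err != nil {
			if len(c.buf) == 0 {
				return 0, err
			}
			break
		}
	}
	n := copy(p, c.buf)
	c.buf = c.buf[n:]
	return n, nil
}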

Support for PostgreSQL attached partition by hash?

Found Behavior

I've been running HarbourBridge on some of my PostgreSQL schemas to test how hard a migration to Cloud Spanner would be.

One of the main characteristics of our project is that some of our tables use the PostgreSQL 12 partitioning strategy.

Running a first pass on a config schema went perfectly. Running a test on a more complex schema ended up with a "failed to parse the data file" error.

I'm doing a simple pg_dump in schema-only mode:
pg_dump -h myhost -d mydb -U myusername -s -n myschemaname | $GOPATH/bin/harbourbridge -driver=pg_dump -dbname=my-destination-db
which results in:

Using driver (source DB): pg_dump target-db: spanner
Generating schema: 100%
Failed to parse the data file: Error parsing last 6404 line(s) of inputpanic: failed to parse the data file

goroutine 1 [running]:
main.main()
        /app/go/pkg/mod/github.com/cloudspannerecosystem/[email protected]/main.go:157 +0xecc
root@2462e825148a:/app#

So I exported my pg_dump to a file to see what line 6404 was all about; it is the exact spot where the first partitioned table gets created in the script order.

sed -n '6403,6405'p pgdump.txt
ALTER INDEX affiliate.affiliate_referral_code_pk ATTACH PARTITION affiliate.referral_code_0_pkey;

Expected Behavior

If a statement isn't supported, I would expect to see it in the report and have HarbourBridge go ahead and complete the dump migration.

Steps to Reproduce the Problem

  1. Get a simple table that has a partition by hash strategy on a PG12 instance.
  2. Pass a pg_dump of that table to harbourbridge.

Specifications

  • Version:
  • Platform:

Runtime error: invalid memory address or nil pointer dereference

Expected Behavior

Running HarbourBridge on a full Postgres database dump that includes objects such as unique indexes and user-defined functions.
No errors expected when executing, since it is a plain-text SQL dump.

Actual Behavior

Running into an error, I think when encountering "))".

The following Error is raised:

Using driver (source DB): pg_dump
Generating schema: 61%panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf8e55c]

goroutine 1 [running]:
github.com/cloudspannerecosystem/harbourbridge/postgres.toIndexKeys(0xc00007e740, 0x3, 0x4, 0x1d, 0xc000436ef8, 0x1)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/postgres/pgdump.go:670 +0x13c
github.com/cloudspannerecosystem/harbourbridge/postgres.processIndexStmt(0xc000501e40, 0xc0003e8cb0, 0xc000548ed0, 0xc0003e8d50, 0x0, 0xc00007e740, 0x3, 0x4, 0x0, 0x0, ...)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/postgres/pgdump.go:213 +0x2e6
github.com/cloudspannerecosystem/harbourbridge/postgres.processStatements(0xc000501e40, 0xc0003e9030, 0x1, 0x1, 0xbf)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/postgres/pgdump.go:190 +0x59b
github.com/cloudspannerecosystem/harbourbridge/postgres.ProcessPgDump(0xc000501e40, 0xc000643ad0, 0x1235b01, 0xc0005bf9e8)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/postgres/pgdump.go:58 +0xb3
main.ProcessDump(0x1281dfb, 0x7, 0xc000501e40, 0xc0005bfad0, 0xc000092810, 0x2)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/main.go:941 +0xe9
main.schemaFromDump(0x1281dfb, 0x7, 0xc000643f28, 0x7fef244896e0, 0x0, 0x414050)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/main.go:432 +0x445
main.schemaConv(0x1281dfb, 0x7, 0xc0005bff28, 0x1d, 0x7fef24480108, 0xc000040780)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/main.go:239 +0x156
main.toSpanner(0x1281dfb, 0x7, 0x0, 0x0, 0x0, 0x0, 0xc000040760, 0x1c, 0xc000643f28, 0xc000040780, ...)
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/main.go:184 +0xc05
main.main()
        /home/<user>/gopath/pkg/mod/github.com/cloudspannerecosystem/[email protected]/main.go:167 +0x3f4

Steps to Reproduce the Problem

1. Create an SQL dump file including the following line:

CREATE UNIQUE INDEX some_index ON some_schema.some_table USING btree (some_column1, some_column2, some_schema.some_plsql_function(some_column3, some_column4));

2. Run Harbourbridge

harbourbridge -schema-only < some_dump.pg_dump

Specifications

  • Version:
    • go version go1.15.7 linux/amd64
    • harbourbridge version from master
  • Platform: Windows 10 / GCP Cloud Shell

Renaming a column then reverting to original name raises error in the UI

Expected Behavior

If a column is renamed and then reverted to its original name, the change should save successfully.

Actual Behavior

Instead, an error is thrown.

Steps to Reproduce the Problem

This may not happen in all cases. Use the cart.mysqldump file in the repo and the user_id column for the steps to reproduce.
Open the web UI, then:

  1. Change a column name (new_col_name) then save changes
  2. Rename back to original name (old_col_name)
  3. Clicking save may throw "old_col_name already exists"

Web UI can crash (nil-pointer dereference) if pages accessed in odd order

Steps to reproduce:

  1. Start harbourbridge with -web
  2. Open a browser and go to http://localhost:8080/#/schema-report

Note: this is not the usual workflow, since you'd typically start a session at the home page.

Sample crash trace
2021/05/24 12:05:21 http: panic serving [::1]:42012: runtime error: invalid memory address or nil pointer dereference
goroutine 11 [running]:
net/http.(*conn).serve.func1(0xc0001b50e0)
/usr/lib/google-golang/src/net/http/server.go:1823 +0x153
panic(0x12d9ba0, 0x1db39c0)
/usr/lib/google-golang/src/runtime/panic.go:971 +0x499
github.com/cloudspannerecosystem/harbourbridge/web.getDDL(0x165cd70, 0xc0000e4000, 0xc0000d8200)
/usr/local/google/home/nch/cloud-spanner/HarbourBridge/user-guided-schema-migration-tmp/harbourbridge/web/web.go:191 +0x39
net/http.HandlerFunc.ServeHTTP(0x14bb420, 0x165cd70, 0xc0000e4000, 0xc0000d8200)
/usr/lib/google-golang/src/net/http/server.go:2068 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000168600, 0x165cd70, 0xc0000e4000, 0xc00058e000)
/usr/local/google/home/nch/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:212 +0xd3
github.com/gorilla/handlers.(*cors).ServeHTTP(0xc0000ba120, 0x165cd70, 0xc0000e4000, 0xc00058e000)
/usr/local/google/home/nch/go/pkg/mod/github.com/gorilla/[email protected]/cors.go:54 +0x103e
net/http.serverHandler.ServeHTTP(0xc0001e01c0, 0x165cd70, 0xc0000e4000, 0xc00058e000)
/usr/lib/google-golang/src/net/http/server.go:2886 +0xa3
net/http.(*conn).serve(0xc0001b50e0, 0x1660600, 0xc00061c100)
/usr/lib/google-golang/src/net/http/server.go:1951 +0x8cd
created by net/http.(*Server).Serve
/usr/lib/google-golang/src/net/http/server.go:3014 +0x3e5

Cause: calling web.getDDL with sessionState.conv set to nil (its default value).

Options to fix

  1. Ensure that sessionState.conv is initialized properly e.g. add
    sessionState.conv = internal.MakeConv()
    to web.go’s init().
  2. Have safety checks whenever we access sessionState.conv e.g. like we already do in getTypeMap.

Plan to go with option 1.
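A minimal sketch of option 1, based on the snippet quoted above (the surrounding init() contents are assumed):

// In web.go: make sure sessionState.conv is never nil, so handlers such as
// getDDL can be called in any order.
func init() {
	sessionState.conv = internal.MakeConv()
	// ... existing init() work, e.g. route registration ...
}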

Support sharded PostgreSQL databases

Support PostgreSQL database configurations where a database is partitioned into different datasets, each hosted on a separate PostgreSQL server.

A first step in this direction is probably a data_only option (see #17). But we might need to do more. For example, we might need to check to see if a table exists, and if it doesn't, generate a schema for it and send it to Spanner. This would cover the sharding case where each PostgreSQL server shard doesn't necessarily have an identical set of tables.

No longer need to drop comments from DDL

Spanner now accepts DDL comments so we no longer need to drop them when sending DDL to Spanner.

We should:

  • stop dropping comments when we call GetDDL to build DDL string to send to Spanner
  • delete the stale comment about DDL comments

Support user-guided schema conversion

Currently we fully automate schema conversion.

However, many aspects of schema conversion represent choices. We should provide a way for users to provide input to control these choices. Some options to consider:
a) allow a user to edit the schema we build and feed it back as input to HarbourBridge
b) provide a preference configuration for common translations e.g. numeric
c) provide a UI that presents choices and advice for each choice, prioritized by an estimate of the importance of each choice

Consider character sets other than UTF-8

HarbourBridge currently supports only one character encoding: UTF-8.

Specifically, we map PostgreSQL character types (including TEXT and CHAR types) to Spanner's STRING type, which is UTF-8 encoded. Note that HarbourBridge itself does not do any encoding/decoding or UTF-8 checks: it passes through data from pg_dump to Spanner. Internally, we use go's string type, which supports UTF-8.

We could go beyond UTF-8 by building character encoding conversions into HarbourBridge. For example, we could wrap io.Reader with https://godoc.org/golang.org/x/net/html/charset.
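A sketch of that wrapping (the encoding label would have to come from a flag or from the source database; the helper name is illustrative):

package main

import (
	"io"

	"golang.org/x/net/html/charset"
)

// utf8Reader transcodes input in a known non-UTF-8 encoding (e.g. "latin1")
// to UTF-8 before it reaches the dump parser.
func utf8Reader(r io.Reader, encodingLabel string) (io.Reader, error) {
	return charset.NewReaderLabel(encodingLabel, r)
}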

Multi-row INSERT statements broken for PostgreSQL

We only handle single row INSERT statements. Insert statements with multiple rows are broken: we construct a row with all of the values in the multiple rows concatenated together.

Some history ...

In pg_dump, the default way to write out data to the dump file is to use COPY FROM. However, you can also get pg_dump to generate INSERT statements. The very first version of HarbourBridge used the INSERT version (it was easier to start with). When pg_dump generates INSERTs, it only generates them one row at a time, i.e.

INSERT INTO public.products (product_id, description, price, date_added) VALUES ('abc-123', 'Blue suede shoes', '141.99', '2020-06-06');

And so, to get HarbourBridge bootstrapped, that's all that was implemented. Then we switched to using COPY FROM, and (mostly) stopped using INSERT.

However, PostgreSQL does support multi-row inserts i.e.:

INSERT INTO public.products (product_id, description, price, date_added) VALUES
('abc-123', 'Blue suede shoes', '141.99', '2020-06-06'),
('axd-673', 'Antique typewriter', '99.99', '2020-06-07'),
('zxi-631', 'Glass vase', '55.50', '2020-06-10');

These get completely mangled at the moment! The main issue is in getVals, which builds the following row for this multi-row INSERT:

vals = ["abc-123" "Blue suede shoes" "141.99" "2020-06-06" "axd-673" "Antique typewriter" "99.99" "2020-06-07" "zxi-631" "Glass vase" "55.50" "2020-06-10"]

i.e. all of the values from all of the rows concatenated together!

To fix this, we should:
a) return [][]string from getVals
b) change copyOrInsert to handle [][]string for vals (maybe we should call it rows?)
c) change ProcessPgDump to handle multi-row inserts i.e. iterate over the [][]string and call ProcessDataRow on each row.
d) add a test for multi-row INSERT statements (in fact, this should probably be the first step!)
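A sketch of steps (a)-(c); the type and function names follow the issue text, but the real signatures differ:

// copyOrInsert carries per-row values rather than one flattened slice
// (steps a and b).
type copyOrInsert struct {
	table string
	cols  []string
	rows  [][]string // one entry per row of a (possibly multi-row) INSERT
}

// processRows sketches step (c): iterate over the rows and convert each one
// independently via ProcessDataRow (signature simplified here).
func processRows(ci copyOrInsert, processDataRow func(table string, cols, vals []string)) {
	for _, vals := range ci.rows {
		processDataRow(ci.table, ci.cols, vals)
	}
}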

Here's a complete dump file to test this out:

CREATE TABLE public.products (
product_id character varying(20) NOT NULL,
description character varying(1000),
price numeric(10,2),
date_added date
);

INSERT INTO public.products (product_id, description, price, date_added) VALUES
('abc-123', 'Blue suede shoes', '141.99', '2020-06-06'),
('axd-673', 'Antique typewriter', '99.99', '2020-06-07'),
('zxi-631', 'Glass vase', '55.50', '2020-06-10');

Improve error messages during data conversion.

In general, HarbourBridge provides a summary of errors and unusual conditions as part of the 'Unexpected Conditions' section of report.txt. However, this section does not include errors encountered while writing rows to Spanner.

Note that such errors are accumulated in BatchWriter and are available via the Errors() method, but HarbourBridge does not call Errors() and so these errors are essentially dropped.

There are several options to improve this:

a) Improve error handling in data.go so that more errors are caught before we call Spanner (e.g. we could catch common errors like non-matching column/value counts, and field size mismatches). This has other benefits: we can provide more succinct error messages and avoid batch retries/splits.

b) Change report.go to include errors from BatchWriter (requires some minor refactoring of main.go).

Foreign key reference is case sensitive in spanner

Expected Behavior

The tool should be able to migrate foreign keys whose referenced column names differ only in case, since MySQL and Postgres allow case-insensitive referenced column names.

Actual Behavior

Spanner throws an error when migrating a foreign key whose referenced column is written in a different case.

A fix is needed in toddl: first check whether the referenced column actually exists; if not, compare all columns against the lower-cased referenced column name to identify which column it is actually referring to.

Add support for reading source database dump file from GCS buckets

Today, users can pipe a dump file to HarbourBridge (HB) from stdin or pass a local path to the dump file using the dump-file flag. But often it may not be possible to copy the dump file onto the same machine as the one where HB is running.

Thus, we should allow users to specify the path to a dump file in a GCS bucket as a valid argument to the dump-file flag.
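A sketch of how the dump-file handling could accept gs:// paths (the flag plumbing and helper name are illustrative):

package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	"cloud.google.com/go/storage"
)

// openDumpFile streams the dump either from a local path or, for gs://
// paths, from a GCS object.
func openDumpFile(ctx context.Context, path string) (io.ReadCloser, error) {
	if !strings.HasPrefix(path, "gs://") {
		return os.Open(path)
	}
	parts := strings.SplitN(strings.TrimPrefix(path, "gs://"), "/", 2)
	if len(parts) != 2 {
		return nil, fmt.Errorf("invalid GCS path: %s", path)
	}
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	return client.Bucket(parts[0]).Object(parts[1]).NewReader(ctx)
}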
