
tap-mysql's Introduction

tap-mysql


Singer tap that extracts data from a MySQL database and produces JSON-formatted data following the Singer spec.

$ mkvirtualenv -p python3 tap-mysql
$ pip install tap-mysql
$ tap-mysql --config config.json --discover
$ tap-mysql --config config.json --properties properties.json --state state.json

Usage

This section dives into basic usage of tap-mysql by walking through extracting data from a table. It assumes that you can connect to and read from a MySQL database.

Install

$ mkvirtualenv -p python3 tap-mysql
$ pip install tap-mysql

or

$ git clone git@github.com:singer-io/tap-mysql.git
$ cd tap-mysql
$ mkvirtualenv -p python3 tap-mysql
$ pip install .

Have a source database

There's some important business data siloed in this MySQL database -- we need to extract it. Here's the table we'd like to sync:

mysql> select * from example_db.animals;
+----+----------+----------------------+
| id | name     | likes_getting_petted |
+----+----------+----------------------+
|  1 | aardvark |                    0 |
|  2 | bear     |                    0 |
|  3 | cow      |                    1 |
+----+----------+----------------------+
3 rows in set (0.00 sec)

Create the configuration file

Create a config file containing the database connection credentials, e.g.:

{
  "host": "localhost",
  "port": "3306",
  "user": "root",
  "password": "password"
}

These are the same basic configuration properties used by the MySQL command-line client (mysql).
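Because the keys mirror the MySQL client's options, translating the config into client connection arguments is direct. A minimal sketch (the helper name `connection_kwargs` is illustrative, not part of the tap; note that the example config stores the port as a string):

```python
def connection_kwargs(config):
    """Translate a parsed config.json dict into keyword arguments for a
    MySQL client such as PyMySQL (illustrative helper, not part of the tap)."""
    return {
        "host": config["host"],
        "port": int(config["port"]),  # the example config stores the port as a string
        "user": config["user"],
        "password": config["password"],
    }
```

The resulting dict could then be passed to pymysql.connect(**kwargs); tap-mysql itself connects through PyMySQL.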

Discovery mode

The tap can be invoked in discovery mode to find the available tables and columns in the database:

$ tap-mysql --config config.json --discover

A discovered catalog is output, with a JSON-schema description of each table. A source table directly corresponds to a Singer stream.

{
  "streams": [
    {
      "tap_stream_id": "example_db-animals",
      "table_name": "animals",
      "schema": {
        "type": "object",
        "properties": {
          "name": {
            "inclusion": "available",
            "type": [
              "null",
              "string"
            ],
            "maxLength": 255
          },
          "id": {
            "inclusion": "automatic",
            "minimum": -2147483648,
            "maximum": 2147483647,
            "type": [
              "null",
              "integer"
            ]
          },
          "likes_getting_petted": {
            "inclusion": "available",
            "type": [
              "null",
              "boolean"
            ]
          }
        }
      },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "row-count": 3,
            "table-key-properties": [
              "id"
            ],
            "database-name": "example_db",
            "selected-by-default": false,
            "is-view": false
          }
        },
        {
          "breadcrumb": [
            "properties",
            "id"
          ],
          "metadata": {
            "sql-datatype": "int(11)",
            "selected-by-default": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "name"
          ],
          "metadata": {
            "sql-datatype": "varchar(255)",
            "selected-by-default": true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "likes_getting_petted"
          ],
          "metadata": {
            "sql-datatype": "tinyint(1)",
            "selected-by-default": true
          }
        }
      ],
      "stream": "animals"
    }
  ]
}
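Since the catalog is plain JSON, it is easy to inspect programmatically. For example, a hypothetical helper that lists the discovered streams from a catalog dict (e.g. one parsed with json.load from the --discover output):

```python
def discovered_streams(catalog):
    """List (tap_stream_id, table_name) pairs for every stream in a
    discovered catalog dict (illustrative helper, not part of the tap)."""
    return [(s["tap_stream_id"], s["table_name"]) for s in catalog["streams"]]
```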

Field selection

In sync mode, tap-mysql consumes the catalog and looks for tables and fields that have been marked as selected in their associated metadata entries.

Redirect output from the tap's discovery mode to a file so that it can be modified:

$ tap-mysql -c config.json --discover > properties.json

Then edit properties.json to make selections. In this example we want the animals table. The stream's metadata entry (associated with "breadcrumb": []) gets a top-level selected flag, as do its columns' metadata entries. Additionally, we will mark the animals table to replicate using a FULL_TABLE strategy. For more information, see Replication methods and state file.

[
  {
    "breadcrumb": [],
    "metadata": {
      "row-count": 3,
      "table-key-properties": [
        "id"
      ],
      "database-name": "example_db",
      "selected-by-default": false,
      "is-view": false,
      "selected": true,
      "replication-method": "FULL_TABLE"
    }
  },
  {
    "breadcrumb": [
      "properties",
      "id"
    ],
    "metadata": {
      "sql-datatype": "int(11)",
      "selected-by-default": true,
      "selected": true
    }
  },
  {
    "breadcrumb": [
      "properties",
      "name"
    ],
    "metadata": {
      "sql-datatype": "varchar(255)",
      "selected-by-default": true,
      "selected": true
    }
  },
  {
    "breadcrumb": [
      "properties",
      "likes_getting_petted"
    ],
    "metadata": {
      "sql-datatype": "tinyint(1)",
      "selected-by-default": true,
      "selected": true
    }
  }
]
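The selection edits above can also be scripted rather than made by hand. A minimal sketch (select_stream is a hypothetical helper, not part of the tap; it edits one entry of the catalog's "streams" list in place):

```python
def select_stream(stream, replication_method="FULL_TABLE"):
    """Mark one discovered stream and all of its columns as selected,
    mirroring the manual properties.json edits shown above."""
    for entry in stream["metadata"]:
        entry["metadata"]["selected"] = True
        if entry["breadcrumb"] == []:
            # the table-level entry (empty breadcrumb) also carries the replication method
            entry["metadata"]["replication-method"] = replication_method
    return stream
```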

Sync mode

With a properties catalog that describes field and table selections, the tap can be invoked in sync mode:

$ tap-mysql -c config.json --properties properties.json

Messages are written to standard output following the Singer specification. The resultant stream of JSON data can be consumed by a Singer target.

{"value": {"currently_syncing": "example_db-animals"}, "type": "STATE"}

{"key_properties": ["id"], "stream": "animals", "schema": {"properties": {"name": {"inclusion": "available", "maxLength": 255, "type": ["null", "string"]}, "likes_getting_petted": {"inclusion": "available", "type": ["null", "boolean"]}, "id": {"inclusion": "automatic", "minimum": -2147483648, "type": ["null", "integer"], "maximum": 2147483647}}, "type": "object"}, "type": "SCHEMA"}

{"stream": "animals", "version": 1509133344771, "type": "ACTIVATE_VERSION"}

{"record": {"name": "aardvark", "likes_getting_petted": false, "id": 1}, "stream": "animals", "version": 1509133344771, "type": "RECORD"}

{"record": {"name": "bear", "likes_getting_petted": false, "id": 2}, "stream": "animals", "version": 1509133344771, "type": "RECORD"}

{"record": {"name": "cow", "likes_getting_petted": true, "id": 3}, "stream": "animals", "version": 1509133344771, "type": "RECORD"}

{"stream": "animals", "version": 1509133344771, "type": "ACTIVATE_VERSION"}

{"value": {"currently_syncing": "example_db-animals", "bookmarks": {"example_db-animals": {"initial_full_table_complete": true}}}, "type": "STATE"}

{"value": {"currently_syncing": null, "bookmarks": {"example_db-animals": {"initial_full_table_complete": true}}}, "type": "STATE"}

Replication methods and state file

In the above example, we invoked tap-mysql without providing a state file and without specifying a replication method. The two ways to replicate a given table are FULL_TABLE and INCREMENTAL.

Full Table

Full-table replication extracts all data from the source table each time the tap is invoked.

Incremental

Incremental replication works in conjunction with a state file to only extract new records each time the tap is invoked. This requires a replication key to be specified in the table's metadata as well.
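Conceptually, an incremental sync issues a SELECT filtered and ordered by the replication key. This sketch mirrors the shape of the query (the bookmark value is bound as the %(replication_key_value)s query parameter rather than interpolated); the helper itself is illustrative, not part of the tap:

```python
def incremental_select(table, columns, replication_key, has_bookmark=False):
    """Shape of the SELECT an incremental sync issues: when a bookmark
    exists in state, only rows at or beyond it are fetched, ordered by
    the replication key."""
    sql = "SELECT {} FROM {}".format(
        ", ".join("`{}`".format(c) for c in columns), table)
    if has_bookmark:
        sql += " WHERE `{0}` >= %(replication_key_value)s ORDER BY `{0}` ASC".format(
            replication_key)
    return sql
```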

Example

Let's sync the animals table again, but this time using incremental replication. The replication method and replication key are set in the table's metadata entry in the properties file:

{
  "streams": [
    {
      "tap_stream_id": "example_db-animals",
      "table_name": "animals",
      "schema": { ... },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "row-count": 3,
            "table-key-properties": [
              "id"
            ],
            "database-name": "example_db",
            "selected-by-default": false,
            "is-view": false,
            "replication-method": "INCREMENTAL",
            "replication-key": "id"
          }
        },
        ...
      ],
      "stream": "animals"
    }
  ]
}

We have no meaningful state so far, so just invoke the tap in sync mode again without a state file:

$ tap-mysql -c config.json --properties properties.json

The output messages look very similar to when the table was replicated using the default FULL_TABLE replication method. One important difference is that the STATE messages now contain a replication_key_value -- a bookmark or high-water mark -- for data that was extracted:

{"type": "STATE", "value": {"currently_syncing": "example_db-animals"}}

{"stream": "animals", "type": "SCHEMA", "schema": {"type": "object", "properties": {"id": {"type": ["null", "integer"], "minimum": -2147483648, "maximum": 2147483647, "inclusion": "automatic"}, "name": {"type": ["null", "string"], "inclusion": "available", "maxLength": 255}, "likes_getting_petted": {"type": ["null", "boolean"], "inclusion": "available"}}}, "key_properties": ["id"]}

{"stream": "animals", "type": "ACTIVATE_VERSION", "version": 1509135204169}

{"stream": "animals", "type": "RECORD", "version": 1509135204169, "record": {"id": 1, "name": "aardvark", "likes_getting_petted": false}}

{"stream": "animals", "type": "RECORD", "version": 1509135204169, "record": {"id": 2, "name": "bear", "likes_getting_petted": false}}

{"stream": "animals", "type": "RECORD", "version": 1509135204169, "record": {"id": 3, "name": "cow", "likes_getting_petted": true}}

{"type": "STATE", "value": {"bookmarks": {"example_db-animals": {"version": 1509135204169, "replication_key_value": 3, "replication_key": "id"}}, "currently_syncing": "example_db-animals"}}

{"type": "STATE", "value": {"bookmarks": {"example_db-animals": {"version": 1509135204169, "replication_key_value": 3, "replication_key": "id"}}, "currently_syncing": null}}

Note that the final STATE message has a replication_key_value of 3, reflecting that the extraction ended on a record that had an id of 3. Subsequent invocations of the tap will pick up from this bookmark.

Normally, the target will echo the last STATE after it's finished processing data. For this example, let's manually write a state.json file using the STATE message:

{
  "bookmarks": {
    "example_db-animals": {
      "version": 1509135204169,
      "replication_key_value": 3,
      "replication_key": "id"
    }
  },
  "currently_syncing": null
}
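A target's bookkeeping can be approximated by keeping only the last STATE message from the tap's output. A minimal sketch (final_state is an illustrative helper, not part of any Singer library):

```python
import json

def final_state(tap_output_lines):
    """Return the value of the last STATE message in a tap's output --
    the payload a target would normally persist to state.json."""
    state = None
    for line in tap_output_lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "STATE":
            state = message["value"]
    return state
```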

Let's add some more animals to our farm:

mysql> insert into animals (name, likes_getting_petted) values ('dog', true), ('elephant', true), ('frog', false);
$ tap-mysql -c config.json --properties properties.json --state state.json

This invocation extracts any data since (and including) the replication_key_value:

{"type": "STATE", "value": {"bookmarks": {"example_db-animals": {"replication_key": "id", "version": 1509135204169, "replication_key_value": 3}}, "currently_syncing": "example_db-animals"}}

{"key_properties": ["id"], "schema": {"properties": {"name": {"maxLength": 255, "inclusion": "available", "type": ["null", "string"]}, "id": {"maximum": 2147483647, "minimum": -2147483648, "inclusion": "automatic", "type": ["null", "integer"]}, "likes_getting_petted": {"inclusion": "available", "type": ["null", "boolean"]}}, "type": "object"}, "type": "SCHEMA", "stream": "animals"}

{"type": "ACTIVATE_VERSION", "version": 1509135204169, "stream": "animals"}

{"record": {"name": "cow", "id": 3, "likes_getting_petted": true}, "type": "RECORD", "version": 1509135204169, "stream": "animals"}
{"record": {"name": "dog", "id": 4, "likes_getting_petted": true}, "type": "RECORD", "version": 1509135204169, "stream": "animals"}
{"record": {"name": "elephant", "id": 5, "likes_getting_petted": true}, "type": "RECORD", "version": 1509135204169, "stream": "animals"}
{"record": {"name": "frog", "id": 6, "likes_getting_petted": false}, "type": "RECORD", "version": 1509135204169, "stream": "animals"}

{"type": "STATE", "value": {"bookmarks": {"example_db-animals": {"replication_key": "id", "version": 1509135204169, "replication_key_value": 6}}, "currently_syncing": "example_db-animals"}}

{"type": "STATE", "value": {"bookmarks": {"example_db-animals": {"replication_key": "id", "version": 1509135204169, "replication_key_value": 6}}, "currently_syncing": null}}

Copyright © 2017 Stitch

tap-mysql's People

Contributors

ajmers, bi1yeu, ccapurso, cosimon, dmosorast, gelks, iterati, kallan357, kspeer825, luandy64, lumberj, mdelaurentis, nick-mccoy, psantacl, zachharris1


tap-mysql's Issues

showed {"type": "STATE", "value": {"currently_syncing": null}} even though the table is not empty

Hi, I would like to check whether I have missed anything in my properties.json file: when I test it, it keeps showing the message above instead of the data in JSON format. Here is the file's content:

{
  "streams": [
    {
      "replication_method": "FULL_TABLE",
      "replication_key": "id",
      "key_properties": ["id"],
      "tap_stream_id": "quiz-animal",
      "table_name": "animal",
      "schema": {
        "properties": {
          "id": {
            "inclusion": "automatic",
            "minimum": -2147483648,
            "maximum": 2147483647,
            "type": ["null", "integer"]
          },
          "name": {
            "inclusion": "available",
            "maxLength": 10,
            "type": ["null", "string"]
          },
          "likes": {
            "inclusion": "available",
            "minimum": -2147483648,
            "maximum": 2147483647,
            "type": ["null", "integer"]
          }
        },
        "type": "object"
      },
      "stream": "animal",
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "selected-by-default": false,
            "database-name": "quiz",
            "row-count": 6,
            "is-view": false,
            "table-key-properties": ["id"]
          }
        },
        {
          "breadcrumb": ["properties", "id"],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "int(11)"
          }
        },
        {
          "breadcrumb": ["properties", "name"],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "varchar(10)"
          }
        },
        {
          "breadcrumb": ["properties", "likes"],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "int(11)"
          }
        }
      ]
    }
  ]
}

Has anyone faced this? Please let me know if I'm missing something.

I'd appreciate any suggestions.

How to pull data from a multiple-table join

Hi,

I would like to check whether the tap can pull data from a join across multiple tables with several WHERE clauses.

Could anyone share a tweak that would make this possible?

Thanks, and I appreciate any advice.

Cannot use SSL key and cert without an SSL CA

From what I can tell, no amount of config fiddling will make tap-mysql pass the right connection params through to PyMySQL with MySQL 5.6.
What I need is the following, but tap-mysql assumes you must need a custom CA if you are turning off check_hostname:

pymysql.connect(
    host=...,
    user=...,
    password=...,
    ssl={
        'cert': '/path/to/cert',
        'key': '/path/to/key',
        'check_hostname': False
    })

Only figured this out after reading this github issue and hacking tap-mysql to pass the exact ssl config I thought i would need. PyMySQL/PyMySQL#430 (comment)

Set minimum correctly for unsigned decimal types

It looks like the UNSIGNED qualifier on DECIMAL types does not allow for larger positive numbers; it just prevents negative ones:

mysql> create table decimal_test (a_signed decimal(5, 2), a_unsigned decimal(5, 2) unsigned);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into decimal_test (a_signed, a_unsigned) values (100000, 100000);
Query OK, 1 row affected, 2 warnings (0.00 sec)

mysql> insert into decimal_test (a_signed, a_unsigned) values (-100000, -100000);
Query OK, 1 row affected, 2 warnings (0.01 sec)

mysql> select * from decimal_test;
+----------+------------+
| a_signed | a_unsigned |
+----------+------------+
|   999.99 |     999.99 |
|  -999.99 |       0.00 |
+----------+------------+
2 rows in set (0.00 sec)
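The bounds a schema could advertise for DECIMAL(p, s) follow from the precision and scale: the magnitude limit is 10^(p-s) - 10^(-s), and UNSIGNED only raises the minimum to zero rather than enlarging the maximum. A sketch of the corrected bounds (the helper name is illustrative, not the tap's actual code):

```python
from decimal import Decimal

def decimal_bounds(precision, scale, unsigned=False):
    """JSON-schema-style (minimum, maximum) for a MySQL DECIMAL(precision, scale).
    UNSIGNED clamps the minimum at zero; the maximum is unchanged."""
    limit = Decimal(10) ** (precision - scale) - Decimal(10) ** -scale
    minimum = Decimal(0) if unsigned else -limit
    return minimum, limit
```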

Incremental update selects most-recently-changed record

I'm testing this tap with Meltano, and I think I've found a bug -- I'm seeing that incremental updates don't select zero rows when the data hasn't changed; instead I get a single row, the last one to change. I think this is down to the following code:

select_sql += ' WHERE `{}` >= %(replication_key_value)s ORDER BY `{}` ASC'.format(

From the specs (https://github.com/singer-io/getting-started/blob/master/docs/CONFIG_AND_STATE.md#example-1), it seems to suggest that the replication_key_value is expected to be the last-seen value, and therefore we should look for records after this one?

I.e. I'd expect the generated SQL to be:

select_sql += ' WHERE `{}` > %(replication_key_value)s ORDER BY `{}` ASC'.format(

(> instead of >=).

Or is it deliberately doing this to guard against a pathological case where another record, added later, has the same timestamp?

{"type": "STATE", "value": {"currently_syncing": null}}

I'm trying to understand how the Singer code works. I created the same table as in the example, but when I run the following command I get the output below, and I don't understand where the output file is generated. I have read the documentation a couple of times. Can anyone help me sort it out?

tap-mysql -c config.json --properties properties.json
/usr/local/lib/python3.4/dist-packages/pymysql/cursors.py:161: Warning: (1158, 'Got an error reading communication packets')
  while self.nextset():
INFO Server Parameters: version: 10.2.11-MariaDB-10.2.11+maria~trusty-log, wait_timeout: 2700, innodb_lock_wait_timeout: 2700, max_allowed_packet: 16777216, interactive_timeout: 28800
{"type": "STATE", "value": {"currently_syncing": null}}

Allow setting MySQL session variables

This would allow specifying behaviors like set session transaction isolation level read committed; which is better for long-running queries against read replicas.

Documentation example not working with RDS and typos

  1. I spent the better part of a day trying to get tap-mysql to pull from an AWS RDS MySQL instance. I was able to discover the DB just fine but never got any records when trying to sync. The current examples in the documentation do not match what I had to do to finally get it working.

The example shows that the following is how one would select a table for replication:

{
  "selected": "true",
  "properties": {
    "likes_getting_petted": {
      "selected": "true",
      "inclusion": "available",
      "type": ["null", "boolean"]
    },
    "name": {
      "selected": "true",
      "maxLength": 255,
      "inclusion": "available",
      "type": ["null", "string"]
    },

My own investigation and instrumentation of the source code revealed that it specifically looks for "metadata": followed by an empty breadcrumb list "breadcrumb": []. It is in that portion of the properties/catalog file that it ultimately searches for "selected": true. Here is a working example from my own code:

"metadata": [
  {
    "breadcrumb": [],
    "metadata": {
      "selected": true,
      "replication-method": "FULL_TABLE",
      "selected-by-default": false,
      "database-name": "pimcore",
      "row-count": 76,
      "is-view": false,
      "table-key-properties": ["id"]
    }
  },

  2. The error message for a failed replication type has a typo. The current message reads, "Exception: only INCREMENTAL, LOG_BASED, and FULL TABLE replication methods are supported"

There is no "FULL TABLE", the correct value is "FULL_TABLE".

Unsupported data type: JSON

As referenced in the issue below, the JSON data type is not supported. This data type has been available since MySQL 5.7.8 (mid-2015), so it is likely to be in common use.

It looks like the json datatype is also not supported.

Originally posted by @louisrankin in #70 (comment)

Log SSL Cipher Being Used for Connection

It is beneficial to know which SSL cipher is being used when connecting to the MySQL instance, to ensure that communication is secured with up-to-date encryption methods.

The proposed method would be to add logging to this function to include the values of the Ssl_version, Ssl_cipher, and tls_version parameters described here.

What updates the state.json file?

Should running the tap in sync mode update the state.json file? Or do I need to manually pipe the last line to the state file as a bookmark?

Time to update attrs?

I have a conflict in requirements; wouldn't it be a good idea to update attrs to a more recent version? (I haven't tested it, though.)

Here's my conflict, I imagine more and more packages will cause conflicts in the future.

attrs==16.3.0
  - aiohttp==3.5.4 [requires: attrs>=17.3.0]
    - slackclient==2.0.1 [requires: aiohttp>3.5.2]

add option to select all streams/columns by default

This would be helpful as it won't require post-processing of the JSON to set all fields as included.

Useful for when there are database views of normalized data, since the tables and columns are typically defined beforehand and should all be included in a replication job.

cryptography is required for sha256_password or caching_sha2_password

Hi,

I installed tap-mysql and tried to run meltano --log-level=debug invoke tap-mysql. It throws an exception because it can't find cryptography installed.

CRITICAL cryptography is required for sha256_password or caching_sha2_password
Traceback (most recent call last):
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/bin/tap-mysql", line 8, in <module>
    sys.exit(main())
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/singer/utils.py", line 229, in wrapped
    return fnc(*args, **kwargs)
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/tap_mysql/__init__.py", line 722, in main
    log_server_params(mysql_conn)
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/tap_mysql/__init__.py", line 683, in log_server_params
    with connect_with_backoff(mysql_conn) as open_conn:
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/tap_mysql/connection.py", line 24, in connect_with_backoff
    connection.connect()
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/pymysql/connections.py", line 599, in connect
    self._request_authentication()
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/pymysql/connections.py", line 882, in _request_authentication
    auth_packet = _auth.caching_sha2_password_auth(self, auth_packet)
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/pymysql/_auth.py", line 264, in caching_sha2_password_auth
    data = sha2_rsa_encrypt(conn.password, conn.salt, conn.server_public_key)
  File "/srv/bps/meltano/releases/20220518_215252_b2234918/.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages/pymysql/_auth.py", line 142, in sha2_rsa_encrypt
    raise RuntimeError("cryptography is required for sha256_password or caching_sha2_password")

I wonder what I'm doing wrong.

It is not installed in the virtualenv of tap-mysql indeed:

% .meltano/extractors/tap-mysql/venv/bin/pip freeze

attrs==16.3.0
backoff==1.8.0
cffi==1.15.0
ciso8601==2.2.0
jsonschema==2.6.0
mysql-replication==0.22
pendulum==1.2.0
pycparser==2.21
PyMySQL==0.9.3
python-dateutil==2.8.2
pytz==2018.4
pytz-deprecation-shim==0.1.0.post0
pytzdata==2020.1
simplejson==3.11.1
singer-python==5.9.0
six==1.16.0
tap-mysql @ git+https://github.com/singer-io/tap-mysql@c09e9931d19a792ace0aa1a2434eff0fdb97dfa7
tzdata==2022.1
tzlocal==4.2

But I don't see it among the dependencies of the tap-mysql package either.

However, I can see that it depends on PyMySQL. The PyMySQL website says:

To use “sha256_password” or “caching_sha2_password” for authenticate, you need to install additional dependency:

$ python3 -m pip install PyMySQL[rsa]

Installing that package will install cryptography too, and then tap-mysql works again. Is it possible that tap-mysql just depends on the wrong package?

# .meltano/extractors/tap-mysql/venv/bin/pip install "PyMySQL[rsa]"
Requirement already satisfied: PyMySQL[rsa] in ./.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages (0.9.3)
Collecting cryptography
  Using cached cryptography-37.0.2-cp36-abi3-manylinux_2_24_x86_64.whl (4.0 MB)
Requirement already satisfied: cffi>=1.12 in ./.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages (from cryptography->PyMySQL[rsa]) (1.15.0)
Requirement already satisfied: pycparser in ./.meltano/extractors/tap-mysql/venv/lib/python3.9/site-packages (from cffi>=1.12->cryptography->PyMySQL[rsa]) (2.21)
Installing collected packages: cryptography
Successfully installed cryptography-37.0.2

Case-insensitive column selection

Is there an easy way to make the column selection case-insensitive?

We're using tap-mysql to import data from an external database into a Postgres database, but the admins of the source database keep changing the configuration of some MySQL views, so the column names are sometimes all lower case and other times all upper case.

Unable to get mariadb table state using tap-mysql

Hi There,

I am running the sample tap-mysql with MariaDB on an EC2 instance; the primary key is an incremental numeric value. I get no errors generating the properties, which show the stream and row count, but no state is ever populated and it is always null (I even tried creating a state file, which also didn't help):
tap-mysql -c config.json --properties properties.json --state state.json
INFO Server Parameters: version: 5.5.64-MariaDB, wait_timeout: 2700, innodb_lock_wait_timeout: 2700, max_allowed_packet: 1048576, interactive_timeout: 28800
INFO Server SSL Parameters (blank means SSL is not active): [ssl_version: ], [ssl_cipher: ]
{"type": "STATE", "value": {"bookmarks": {"esinger-tutorials_tbl": {"version": 1509135204169, "replication_key_value": 2, "replication_key": "tutorial_id"}}, "currently_syncing": null}}

Meltano without a catalog file

Simply following the steps specified here:

https://hub.meltano.com/extractors/tap-mysql/
leads to an exception being raised in the meltano config tap-mysql test step
only INCREMENTAL, LOG_BASED, and FULL TABLE replication methods are supported

I don't see a way of specifying the replication method without passing in a catalog file, completely bypassing the "simplified" Meltano YAML file. I also get a similar error after adding a couple of simple selects to the YAML file.

Is that the expected behaviour? That is, are the installation and configuration steps not quite right, and does tap-mysql always require the catalog to be passed in as a file?

Many thanks

Ali

Use Transform before singer.RecordMessage()

As I had run into issues with decimal values before (TypeError: Object of type Decimal is not JSON serializable), I was trying to use the format: "singer.decimal" property in the catalog file. I realized that this does not work because the tap I was working with previously (tap-shopify) passes the row and metadata to Transformer.transform before recording the message.

LOG_BASED replication produces duplicate rows

I'm finding that if the tap runs after multiple updates to a single row while using LOG_BASED replication, duplicate records are generated. What follows is a slightly modified example that I ran into in production.

I have the following lines in my binlog:

/*!100001 SET @@session.gtid_seq_no=45555058*//*!*/;
BEGIN
/*!*/;
# at 30263912
# at 30264365
#201223  9:16:23 server id 1  end_log_pos 30264365 CRC32 0x99c7ce0c     Annotate_rows:
#Q> insert into `contracts`.`contracts` (`autorenewal`, `autorenewal_cancelled_at`, `autorenewal_end_date`, `contract_signed_at`, `exclusive`, `guid`, `id`, `iso`, `other_charges`, `other_fee`, `payment_terms_id`, `price`, `production_fee`, `production_number`, `sales_period`, `service_fee`, `status`) values ('yes', NULL, NULL, '2020-12-23 17:13:56', false, 43124, NULL, 'TSD', 0, 0, 4, 2689, 0, 'f1kOuqCqjH', '202103', 0, 'open')
#201223  9:16:23 server id 1  end_log_pos 30264465 CRC32 0x72aa7fe7     Table_map: `contracts`.`contracts` mapped to number 7644
# at 30264465
#201223  9:16:23 server id 1  end_log_pos 30264573 CRC32 0xa4a70f8f     Write_rows: table id 7644 flags: STMT_END_F

BINLOG '
...
'/*!*/;
# Number of rows: 1
# at 30264573
# at 30264688
#201223  9:16:23 server id 1  end_log_pos 30264688 CRC32 0x878b81b4     Annotate_rows:
#Q> update `contracts`.`contracts` set `production_number` = '[ACTUAL VALUE]' where `id` = 1768
#201223  9:16:23 server id 1  end_log_pos 30264788 CRC32 0xe0cf2065     Table_map: `contracts`.`contracts` mapped to number 7644
# at 30264788
#201223  9:16:23 server id 1  end_log_pos 30264976 CRC32 0x21fc0f4c     Update_rows: table id 7644 flags: STMT_END_F

BINLOG '
...
'/*!*/;

In the first query, a row is inserted into the table containing a placeholder production_number. In the second query, the production_number is updated to the actual value.

When I look at the tap-mysql output (and what is uploaded via target-bigquery), I see two records distinguishable only by their order in the output:

{"type": "RECORD", "stream": "contracts", "record": {"id": 1768, "guid": 43124, "status": "open", "iso": "MNM", "sales_period": "202103", "production_number": "f1kOuqCqjH", "exclusive": false, "autorenewal": "yes", "autorenewal_end_date": null, "price": 2689.00, "production_fee": 0.00, "other_fee": 0.00, "service_fee": 0.00, "other_charges": 0.00, "payment_terms_id": 4, "contract_signed_at": "2020-12-23T08:00:00.000000Z", "autorenewal_cancelled_at": null, "entered_at": "2020-12-23T17:16:23.000000Z", "_sdc_deleted_at": null}, "version": 1607545454071, "time_extracted": "2020-12-23T17:20:20.966775Z"}
{"type": "RECORD", "stream": "contracts", "record": {"id": 1768, "guid": 43124, "status": "open", "iso": "MNM", "sales_period": "202103", "production_number": "[ACTUAL VALUE]", "exclusive": false, "autorenewal": "yes", "autorenewal_end_date": null, "price": 2689.00, "production_fee": 0.00, "other_fee": 0.00, "service_fee": 0.00, "other_charges": 0.00, "payment_terms_id": 4, "contract_signed_at": "2020-12-23T08:00:00.000000Z", "autorenewal_cancelled_at": null, "entered_at": "2020-12-23T17:16:23.000000Z", "_sdc_deleted_at": null}, "version": 1607545454071, "time_extracted": "2020-12-23T17:20:20.966775Z"}

As you can see from the binlog, the two records are distinct and should be output. However, there is no way to dedupe the rows once they have been uploaded to another SQL service like Google BigQuery - all metadata (version and time_extracted) is identical in both rows.

Error setting up integration with Google Cloud SQL MySQL

I'm getting the error below when attempting to integrate with Google Cloud SQL. Any thoughts on how to resolve this?

2021-05-11 23:51:37,933Z main - INFO Running tap-mysql version 1.17.5 and target-stitch version 3.2.0
2021-05-11 23:51:38,099Z main - INFO [smart-services] building smart service producer: topic( com.stitchdata.extractionJobStarted ), use_ssl( true )
2021-05-11 23:51:38,102Z main - INFO [smart-services] building smart service producer: topic( com.stitchdata.extractionJobFinished ), use_ssl( true )
2021-05-11 23:51:38,103Z main - INFO [smart-services] building smart service producer: topic( com.stitchdata.streamRecordCount ), use_ssl( true )
2021-05-11 23:51:39,118Z main - INFO [smart-services] event successfully sent to kafka: com.stitchdata.extractionJobStarted [1] at offset None
2021-05-11 23:51:39,118Z main - INFO Starting tap to discover schemas: tap-env/bin/tap-mysql --config /tmp/tap_discover_config.json --discover
2021-05-11 23:51:39,475Z tap - CRITICAL cryptography is required for sha256_password or caching_sha2_password
2021-05-11 23:51:39,477Z tap - Traceback (most recent call last):
2021-05-11 23:51:39,477Z tap - File "tap-env/bin/tap-mysql", line 33, in <module>
2021-05-11 23:51:39,477Z tap - sys.exit(load_entry_point('tap-mysql==1.17.5', 'console_scripts', 'tap-mysql')())
2021-05-11 23:51:39,477Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/singer/utils.py", line 229, in wrapped
2021-05-11 23:51:39,477Z tap - return fnc(*args, **kwargs)
2021-05-11 23:51:39,477Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_mysql/__init__.py", line 722, in main
2021-05-11 23:51:39,477Z tap - log_server_params(mysql_conn)
2021-05-11 23:51:39,477Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_mysql/__init__.py", line 683, in log_server_params
2021-05-11 23:51:39,477Z tap - with connect_with_backoff(mysql_conn) as open_conn:
2021-05-11 23:51:39,477Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/backoff/_sync.py", line 94, in retry
2021-05-11 23:51:39,477Z tap - ret = target(*args, **kwargs)
2021-05-11 23:51:39,478Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_mysql/connection.py", line 24, in connect_with_backoff
2021-05-11 23:51:39,478Z tap - connection.connect()
2021-05-11 23:51:39,478Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/pymysql/connections.py", line 599, in connect
2021-05-11 23:51:39,478Z tap - self._request_authentication()
2021-05-11 23:51:39,478Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/pymysql/connections.py", line 871, in _request_authentication
2021-05-11 23:51:39,478Z tap - auth_packet = self._process_auth(plugin_name, auth_packet)
2021-05-11 23:51:39,478Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/pymysql/connections.py", line 902, in _process_auth
2021-05-11 23:51:39,478Z tap - return _auth.sha256_password_auth(self, auth_packet)
2021-05-11 23:51:39,478Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/pymysql/_auth.py", line 179, in sha256_password_auth
2021-05-11 23:51:39,478Z tap - data = sha2_rsa_encrypt(conn.password, conn.salt, conn.server_public_key)
2021-05-11 23:51:39,478Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/pymysql/_auth.py", line 142, in sha2_rsa_encrypt
2021-05-11 23:51:39,478Z tap - raise RuntimeError("cryptography is required for sha256_password or caching_sha2_password")
2021-05-11 23:51:39,478Z tap - RuntimeError: cryptography is required for sha256_password or caching_sha2_password
2021-05-11 23:51:39,511Z main - INFO Tap exited abnormally with status 1
2021-05-11 23:51:40,111Z main - INFO [smart-services] event successfully sent to kafka: com.stitchdata.extractionJobFinished [31] at offset None
2021-05-11 23:51:40,112Z main - INFO No tunnel subprocess to tear down
2021-05-11 23:51:40,112Z main - INFO Exit status is: Discovery failed with code 1 and error message: "cryptography is required for sha256_password or caching_sha2_password".
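The trace bottoms out in PyMySQL, which tap-mysql uses to connect: PyMySQL needs the third-party cryptography package to complete the sha256_password / caching_sha2_password handshake (caching_sha2_password is the default authentication plugin on MySQL 8.0+). A minimal pre-flight sketch, assuming it runs in the same virtualenv as the tap:

```python
import importlib.util

# PyMySQL raises the RuntimeError above when 'cryptography' is absent and the
# server demands sha256_password or caching_sha2_password authentication.
# Check for the package before invoking the tap.
if importlib.util.find_spec("cryptography") is None:
    print("cryptography is missing: run 'pip install cryptography' in the tap's virtualenv")
else:
    print("cryptography is installed; the auth handshake should succeed")
```

Alternatively, configuring the MySQL user with the mysql_native_password plugin avoids the dependency entirely.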

Unsupported data types: point, binary, tinytext

While setting up the binlog integration in Stitch, the UI indicates that the point, binary(1), and tinytext data types are unsupported. I assume that is a limitation of the tap, so I'm creating an issue here to document that those data types aren't supported, and to hopefully encourage the tap to handle them.

CRITICAL only INCREMENTAL, LOG_BASED, and FULL TABLE replication methods are supported

Following a simple example, this is the only error I come across when I run the tap in sync mode using tap-mysql -c config.json --properties properties.json.

Console Output:

{"type": "STATE", "value": {"currently_syncing": "testdata-sim_crm"}}
INFO Beginning sync for InnoDB table testdata.sim_crm
INFO METRIC: {"metric": "job_duration", "type": "timer", "tags": {"table": "sim_crm", "job_type": "sync_table", "status": "failed", "database": "testdata"}, "value": 0.8691658973693848}
CRITICAL only INCREMENTAL, LOG_BASED, and FULL TABLE replication methods are supported

My properties.json is:

{
  "streams": [
    {
      "table_name": "sim_crm",
      "schema": {
        "selected": "true",
        "properties": {
          "sim_crm_lead_name": {
            "selected": "true",
            "inclusion": "available",
            "type": [
              "null",
              "string"
            ],
            "maxLength": 45
          },
          "sim_crm_last_update_time": {
            "selected": "true",
            "inclusion": "available",
            "type": [
              "null",
              "string"
            ],
            "maxLength": 45
          },
          "idsim_crm": {
            "selected": "true",
            "inclusion": "automatic",
            "type": [
              "null",
              "integer"
            ],
            "minimum": -2147483648,
            "maximum": 2147483647
          },
          "sim_crm_lead_status": {
            "selected": "true",
            "inclusion": "available",
            "type": [
              "null",
              "string"
            ],
            "maxLength": 45
          },
          "sim_crm_lead_company": {
            "selected": "true",
            "inclusion": "available",
            "type": [
              "null",
              "string"
            ],
            "maxLength": 45
          }
        },
        "type": "object"
      },
      "tap_stream_id": "testdata-sim_crm",
      "stream": "sim_crm",
      "metadata": [
        {
          "breadcrumb": [
            "properties",
            "sim_crm_lead_company"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "varchar(45)"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "sim_crm_lead_name"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "varchar(45)"
          }
        },
        {
          "breadcrumb": [],
          "metadata": {
            "selected-by-default": false,
            "database-name": "testdata",
            "table-key-properties": [
              "idsim_crm"
            ],
            "row-count": 4,
            "is-view": false
          }
        },
        {
          "breadcrumb": [
            "properties",
            "idsim_crm"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "int(11)"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "sim_crm_last_update_time"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "varchar(45)"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "sim_crm_lead_status"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "varchar(45)"
          }
        }
      ]
    },
    {
      "table_name": "sim_salesforce",
      "schema": {
        "selected": "true",
        "properties": {
          "sim_salesvalue": {
            "selected": "true",
            "inclusion": "available",
            "type": [
              "null",
              "integer"
            ],
            "minimum": -2147483648,
            "maximum": 2147483647
          },
          "sim_salesnum": {
            "selected": "true",
            "inclusion": "available",
            "type": [
              "null",
              "integer"
            ],
            "minimum": -2147483648,
            "maximum": 2147483647
          },
          "idsim_salesforce": {
            "selected": "true",
            "inclusion": "automatic",
            "type": [
              "null",
              "integer"
            ],
            "minimum": -2147483648,
            "maximum": 2147483647
          },
          "sim_salesdate": {
            "selected": "true",
            "inclusion": "available",
            "format": "date-time",
            "type": [
              "null",
              "string"
            ]
          }
        },
        "type": "object"
      },
      "tap_stream_id": "testdata-sim_salesforce",
      "stream": "sim_salesforce",
      "metadata": [
        {
          "breadcrumb": [
            "properties",
            "sim_salesvalue"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "int(11)"
          }
        },
        {
          "breadcrumb": [],
          "metadata": {
            "selected-by-default": false,
            "database-name": "testdata",
            "table-key-properties": [
              "idsim_salesforce"
            ],
            "row-count": 3,
            "is-view": false
          }
        },
        {
          "breadcrumb": [
            "properties",
            "sim_salesnum"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "int(11)"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "idsim_salesforce"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "int(11)"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "sim_salesdate"
          ],
          "metadata": {
            "selected-by-default": true,
            "sql-datatype": "datetime"
          }
        }
      ]
    }
  ]
}
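In recent versions of tap-mysql, the replication method is read from a replication-method key in each stream's top-level metadata entry (the one with an empty breadcrumb), and the CRITICAL above is what you get when that key is absent or misspelled. A minimal sketch of patching a discovered catalog before running sync; set_replication_method is a hypothetical helper, not part of the tap:

```python
import json

VALID_METHODS = {"FULL_TABLE", "INCREMENTAL", "LOG_BASED"}

def set_replication_method(catalog, method="FULL_TABLE"):
    """Write the replication method into each stream's top-level metadata
    entry, i.e. the entry whose breadcrumb is the empty list."""
    if method not in VALID_METHODS:
        raise ValueError("unsupported replication method: " + method)
    for stream in catalog["streams"]:
        for entry in stream["metadata"]:
            if entry["breadcrumb"] == []:
                entry["metadata"]["replication-method"] = method
    return catalog

# Tiny stand-in for a discovered catalog, to show the shape of the change.
catalog = {"streams": [{"metadata": [{"breadcrumb": [], "metadata": {"selected-by-default": False}}]}]}
patched = set_replication_method(catalog)
print(json.dumps(patched["streams"][0]["metadata"][0]["metadata"]))
```

Note that INCREMENTAL additionally requires a replication-key in the same metadata entry, and that current tap-mysql versions expect stream selection ("selected": true) in the metadata as well, rather than the schema-level "selected" keys shown above.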

How to get table relationships for all tables in a database?

I want to extract relationships between tables (one-to-many, many-to-many, etc.).
Currently, using tap-mysql with the --discover option, I am able to get a list of all tables. It contains the schema name, table name, table schema (columns and column types), and metadata.
The metadata only contains table-key-properties:
"metadata": { "selected-by-default": false, "database-name": "employee", "row-count": 0, "is-view": false, "table-key-properties": [ "id" ] }
Is there a way to extract foreign and primary key information?
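Not through the tap itself, as far as I can tell: discovery only surfaces primary keys via table-key-properties. MySQL does expose foreign keys in information_schema.KEY_COLUMN_USAGE, though, so one workaround is to query that directly with any client (e.g. pymysql). A sketch that only builds the parameterized query; foreign_key_query is a hypothetical helper, not part of tap-mysql:

```python
def foreign_key_query(database):
    """Return (sql, params) listing the foreign keys of a database.

    Rows where REFERENCED_TABLE_NAME is non-NULL describe foreign keys:
    (table, column) -> (referenced table, referenced column).
    """
    sql = (
        "SELECT TABLE_NAME, COLUMN_NAME, "
        "REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME "
        "FROM information_schema.KEY_COLUMN_USAGE "
        "WHERE TABLE_SCHEMA = %s AND REFERENCED_TABLE_NAME IS NOT NULL"
    )
    return sql, (database,)

sql, params = foreign_key_query("employee")
print(sql)
```

Each returned row is one side of a one-to-many relationship; a many-to-many relationship shows up as a join table carrying two such foreign keys.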
