
gcsb's Introduction

GCSB

It's like YCSB but with more Google. A simple tool meant to generate load against Google Spanner databases. The primary goals of the project are to

  • Write randomized data to tables (to seed tests and create database splits)
  • Generate read/write load against user-provided schemas

Quickstart

To initiate a simple load test against your Spanner instance using one of our test schemas:

Create a test table

You can use your own schema if you'd prefer, but we provide a few test schemas to help you get started. First, create a table named SingleSingers:

gcloud spanner databases ddl update YOUR_DATABASE_ID --instance=YOUR_INSTANCE_ID --ddl-file=schemas/single_table.sql

Load data into table

Load some data into the table to seed the upcoming load test. In the example below, we load 10,000 rows of random data into the SingleSingers table:

gcsb load -p YOUR_GCP_PROJECT_ID -i YOUR_INSTANCE_ID -d YOUR_DATABASE_ID -t SingleSingers -o 10000

Run a load test

Now you can perform a load test using the run subcommand. The command below will generate 10,000 operations: 75% will be READ operations and 25% will be writes, performed across 50 threads.

gcsb run -p YOUR_GCP_PROJECT_ID -i YOUR_INSTANCE_ID -d YOUR_DATABASE_ID -t SingleSingers -o 10000 --reads 75 --writes 25 --threads 50

Operations

Tool usage is generally broken down into two categories: load and run operations.

Load

GCSB provides batched loading functionality to facilitate load testing as well as to assist with performing database splits on your tables.

At runtime, GCSB will detect the schema of the table you're loading data into and create data generators appropriate for the column types in your database. Each type of generator has configurable functionality that allows you to refine the type, length, or range of the data the tool generates. For in-depth information on the various configuration values, please read the comments in example_gcsb.yaml.
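As a hedged sketch of what such an override might look like (hypothetical: the column names and the length/range knobs here are assumptions drawn from the description above; consult example_gcsb.yaml for the authoritative field names):

# gcsb.yaml (hypothetical generator overrides)
tables:
- name: SingleSingers
  columns:
  - name: FirstName
    generator:
      type: STRING
      length: 16        # assumed knob: cap the generated string length
  - name: BirthYear
    generator:
      type: INT64
      range:            # assumed knob: bound the generated integers
        begin: 1950
        end: 2005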

Single table load

By default, GCSB will detect the table schema and create default random data generators based on the columns it finds. To tune the values a generator creates, you must add override configurations to the gcsb.yaml file. Please see that file's documentation for more information.

gcsb load -t TABLE_NAME -o NUM_ROWS

See gcsb load --help for additional configuration options.

Multiple table load

As with the single table load above, you may specify multiple tables by repeating the -t TABLE_NAME argument. By default, the number of operations applies to each table: specifying 2 tables with 1000 operations yields 2000 total operations, 1000 per table.

gcsb load -t TABLE1 -t TABLE2 -o NUM_ROWS

Operations per table can be configured in the YAML configuration. For example:

tables:
  - name: TABLE1
    operations:
      total: 500
  - name: TABLE2
    operations:
      total: 500

Loading into interleaved tables

Loading data into interleaved tables is not supported yet. If you want to create splits in the database, you can load data into parent tables.

Run

Single table run

By default, GCSB will detect the table schema and create default random data generators based on the columns it finds. To tune the values a generator creates, you must add override configurations to the gcsb.yaml file. Please see that file's documentation for more information.

gcsb run -p YOUR_GCP_PROJECT_ID -i YOUR_INSTANCE_ID -d YOUR_DATABASE_ID -t SingleSingers -o 10000 --reads 75 --writes 25 --threads 50

See gcsb run --help for additional configuration options.

Multiple table run

As with the single table run above, you may specify multiple tables by repeating the -t TABLE_NAME argument. By default, the number of operations applies to each table: specifying 2 tables with 1000 operations yields 2000 total operations, 1000 per table.

gcsb run -t TABLE1 -t TABLE2 -o NUM_ROWS

Operations per table can be configured in the YAML configuration. For example:

tables:
  - name: TABLE1
    operations:
      total: 500
  - name: TABLE2
    operations:
      total: 500

Running against interleaved tables

Run operations against interleaved tables are only supported on the apex (top-level parent) table.

Using our test INTERLEAVE schema, we see an INTERLEAVE relationship between the Singers, Albums, and Songs tables.

You will notice that if you try to run against any child table, an error occurs:

gcsb run -t Songs -o 10

unable to execute run operation: can only execute run against apex table (try 'Singers')

Distributed testing

GCSB is intended to run in a stateless manner. This design choice allows massive horizontal scaling of gcsb to stress your database to its absolute limits. During development we identified Kubernetes as the preferred tool for the job. We provide two separate tutorials for running gcsb inside Kubernetes:

  • GKE - For running GCSB inside GKE using a service account key. This can be used for non-GKE clusters as well, since it contains instructions for mounting a service account key into the container.
  • GKE with Workload Identity - For running GCSB inside a GKE cluster that has Workload Identity turned on. This is most useful in organizations whose security policies prevent generating or downloading a service account key.

Configuration

The tool can receive configuration input in several ways. It loads the file gcsb.yaml if it detects one in the current working directory; alternatively, you can use the global flag -c to specify a path to a configuration file. Each subcommand has a number of configuration flags relevant to that operation. These flags are bound to their counterparts in the YAML configuration file and take precedence over it; think of them as overrides. The same is true for environment variables.
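For illustration, a minimal gcsb.yaml might look like the sketch below (the key names are assumptions that mirror the CLI flags; consult example_gcsb.yaml for the authoritative names). Running gcsb run with --threads 100 would then override the value from the file:

# gcsb.yaml (minimal sketch; keys assumed to mirror the CLI flags)
project: YOUR_GCP_PROJECT_ID    # overridden by -p
instance: YOUR_INSTANCE_ID      # overridden by -i
database: YOUR_DATABASE_ID      # overridden by -d
threads: 50                     # overridden by --threads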

Please note that, at present, the YAML configuration file is the only way to specify generator overrides for data loading and write operations. Without this file, the tool uses a random data generator appropriate for the table schema it detects at runtime.

For in-depth information on the various configuration values, please read the comments in example_gcsb.yaml.

Supported generator types

The tool supports the following generator types in the configuration.

  • UUID_V4: Generates a UUID v4 value. Supported column types are STRING and BYTES. Note that UUID generation is automatically inferred for a STRING(36) column without any configuration.
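For example, to explicitly request UUID v4 generation for a column, the configuration looks like this (mirroring the UUID v4 design issue further below):

# gcsb.yaml
tables:
- name: User
  columns:
  - name: UserId
    generator:
      type: UUID_V4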

Roadmap

Not Supported (yet)

  • Interleaved tables for Load and Run phases.
  • Generating read operations utilizing ReadByIndex
  • Generating NULL values for load operations. If a column is NULLable, gcsb will still generate a value for it.
  • JSON column types
  • STRUCT Objects.
  • VIEWS
  • Inserting data across multiple tables in the same transaction
  • SCAN and DELETE operations
  • Tables with foreign key relationships
  • Testing multiple tables at once

Development

Build

make build

Test

make test

gcsb's People

Contributors

snehashah16, tomo241, yfuruyama


gcsb's Issues

UUID v4 Support

Basic Idea

Supports UUID v4 data generation.

Design

Infer UUID v4 by default

Google recommends using UUID v4 for primary keys in Cloud Spanner, and it has been widely adopted as a best practice.

Hence, it might make sense to infer whether a certain column is supposed to hold a UUID v4 and, if it looks like one, automatically generate a UUID v4 value for the column. This inference would make it easy to load data without forcing users to write configuration.

How to infer UUID v4

There are several ways for users to store a UUID v4 value. For instance:

  • In a STRING(36) column (e.g. "f13c1af5-07cd-4db6-8891-ffa4acbd4991").
  • In a STRING(32) column without hyphens (e.g. "f13c1af507cd4db68891ffa4acbd4991").
  • In a BYTES(16) column (e.g. <BYTE_ARRAY>).

Among them, most users would likely choose STRING(36) for UUID v4, so we infer UUID v4 only if the column type is exactly STRING(36).
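A minimal sketch of that inference rule (illustrative only; the function name and signature are not gcsb's actual internals):

package main

import "fmt"

// inferUUIDv4 reports whether a column should default to UUID v4
// generation under the proposed rule: only an exact STRING(36) qualifies.
func inferUUIDv4(spannerType string) bool {
	return spannerType == "STRING(36)"
}

func main() {
	fmt.Println(inferUUIDv4("STRING(36)")) // true
	fmt.Println(inferUUIDv4("STRING(32)")) // false: too ambiguous to infer
}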

GCSB Configuration

We can provide an explicit way to use the UUID v4 generation through the config. This is an example of the config for UUID v4 support.

# gcsb.yaml
tables:
- name: User
  columns:
  - name: UserId
    generator:
      type: UUID_V4

Currently there is a field called type in the generator field. For this purpose we can create a new value for the type field: UUID_V4.

(Question: Can we use that field to specify the actual data type? Or should we treat it as a Spanner data type like STRING and define a new field like sub_type for UUID_V4?)

This UUID_V4 is only valid if the column type is one of the following.

  • STRING(36)
  • STRING(32)
  • BYTES(16)

The actual data generated by the tool will vary depending on the column type. For example, if the column type is STRING(32), we generate f13c1af507cd4db68891ffa4acbd4991 (no hyphens) for the value. If the column type is not in the above list, we should report an explicit error.
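A sketch of how the three representations relate, using the github.com/google/uuid package (an illustration only; not necessarily what gcsb itself depends on):

package main

import (
	"fmt"
	"strings"

	"github.com/google/uuid"
)

func main() {
	u := uuid.New() // random UUID v4
	fmt.Println(u.String())                              // STRING(36): hyphenated form
	fmt.Println(strings.ReplaceAll(u.String(), "-", "")) // STRING(32): hyphens stripped
	fmt.Println(u[:])                                    // BYTES(16): raw 16-byte form
}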

Other Details

  • This design only focuses on UUID v4. In other words, we will treat each UUID version differently. If users would like to use UUID v1, we can think about adding a UUID_V1 type afterward.

gcsb

Hi,
How do I download and use this tool?

Ability to provide custom read queries during run phase

Customer feedback:

Looks like it doesn’t support providing your own queries to the load test. The solution seems to just read from a table. Would be neat if you could also do that as well to test out a difficult query involving joins of tables etc and how it performs - would this be possible in the future?

How to set the date / Timestamp type range in the yaml config file

How do I set a range for DATE / TIMESTAMP types in the YAML config file? Is there a sample?
The target table's column type is DATE or TIMESTAMP, and I would like to generate random values between "2022-01-01" and "2022-12-31", or between "2022-01-01 00:00:00" and "2022-12-31 00:00:00".
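For what it's worth, the kind of configuration being asked for might look like the sketch below, but note that this is purely hypothetical: the range/begin/end field names are assumptions, and example_gcsb.yaml is the place to confirm whether such knobs exist:

# gcsb.yaml (hypothetical; range/begin/end are assumed field names)
tables:
- name: MyTable
  columns:
  - name: CreatedAt
    generator:
      type: TIMESTAMP
      range:
        begin: "2022-01-01T00:00:00Z"
        end: "2022-12-31T00:00:00Z"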

Unexpected load happens to tables which don't have a parent-child relationship with the target table

Issue

With the following schema, an unexpected load happens to tables that don't have a parent-child relationship with the target table.

CREATE TABLE l1_P (
  Id INT64,
) PRIMARY KEY(Id);

CREATE TABLE l1_P_l2_A (
  Id INT64,
) PRIMARY KEY(Id),
  INTERLEAVE IN PARENT l1_P ON DELETE NO ACTION;

CREATE TABLE l1_P_l2_B (
  Id INT64,
) PRIMARY KEY(Id),
  INTERLEAVE IN PARENT l1_P ON DELETE NO ACTION;

CREATE TABLE l1_P_l2_B_l3_X (
  Id INT64,
) PRIMARY KEY(Id),
  INTERLEAVE IN PARENT l1_P_l2_B ON DELETE NO ACTION;

CREATE TABLE l1_P_l2_B_l3_Y (
  Id INT64,
) PRIMARY KEY(Id),
  INTERLEAVE IN PARENT l1_P_l2_B ON DELETE NO ACTION;

CREATE TABLE l1_P_l2_C (
  Id INT64,
) PRIMARY KEY(Id),
  INTERLEAVE IN PARENT l1_P ON DELETE NO ACTION;

# Case 1)
Expected:

  • l1_P
  • l1_P_l2_A
$ ./gcsb load -p MY_PROJECT -i gcsbtest -d db1 -t l1_P_l2_A -o 1
  :
2023/01/10 21:29:07 Executing load phase
2023/01/10 21:29:07 +-----------+------------+------+-------+---------+
2023/01/10 21:29:07 |   TABLE   | OPERATIONS | READ | WRITE | CONTEXT |
2023/01/10 21:29:07 +-----------+------------+------+-------+---------+
2023/01/10 21:29:07 | l1_P_l2_A |          5 | N/A  | N/A   | LOAD    |
2023/01/10 21:29:07 | l1_P      |          1 | N/A  | N/A   | LOAD    |
2023/01/10 21:29:07 | l1_P_l2_C |          5 | N/A  | N/A   | LOAD    |
2023/01/10 21:29:07 +-----------+------------+------+-------+---------+
2023/01/10 21:29:07 +-----------------------+-------+--------------+--------------+--------------+-----------+--------------+--------------+--------------+
2023/01/10 21:29:07 |        METRIC         | COUNT |     MIN      |     MAX      |     MEAN     |  STDDEV   |    MEDIAN    |     95%      |     99%      |
2023/01/10 21:29:07 +-----------------------+-------+--------------+--------------+--------------+-----------+--------------+--------------+--------------+
2023/01/10 21:29:07 | schema.inference      |     1 | 6.955609345s | 6.955609345s | 6.955609345s | 0s        | 6.955609345s | 6.955609345s | 6.955609345s |
2023/01/10 21:29:07 | run                   |     1 | 43.32597ms   | 43.32597ms   | 43.32597ms   | 0s        | 43.32597ms   | 43.32597ms   | 43.32597ms   |
2023/01/10 21:29:07 | operations.read.data  |     0 | 0s           | 0s           | 0s           | 0s        | 0s           | 0s           | 0s           |
2023/01/10 21:29:07 | operations.read.time  |     0 | 0s           | 0s           | 0s           | 0s        | 0s           | 0s           | 0s           |
2023/01/10 21:29:07 | operations.write.data |    11 | 4.421µs      | 57.797µs     | 21.242µs     | 17.532µs  | 10.785µs     | 57.797µs     | 57.797µs     |
2023/01/10 21:29:07 | operations.write.time |    11 | 16.706932ms  | 36.485169ms  | 25.84112ms   | 5.87515ms | 25.223941ms  | 36.485169ms  | 36.485169ms  |
2023/01/10 21:29:07 +-----------------------+-------+--------------+--------------+--------------+-----------+--------------+--------------+--------------+

# Case 2)
Expected

  • l1_P
  • l1_P_l2_B
  • l1_P_l2_B_l3_X
$ ./gcsb load -p MY_PROJECT -i gcsbtest -d db1 -t l1_P_l2_B_l3_X -o 1
  :
2023/01/10 21:29:42 Executing load phase
2023/01/10 21:29:42 +----------------+------------+------+-------+---------+
2023/01/10 21:29:42 |     TABLE      | OPERATIONS | READ | WRITE | CONTEXT |
2023/01/10 21:29:42 +----------------+------------+------+-------+---------+
2023/01/10 21:29:42 | l1_P_l2_B_l3_X |          5 | N/A  | N/A   | LOAD    |
2023/01/10 21:29:42 | l1_P           |          1 | N/A  | N/A   | LOAD    |
2023/01/10 21:29:42 | l1_P_l2_C      |          5 | N/A  | N/A   | LOAD    |
2023/01/10 21:29:42 +----------------+------------+------+-------+---------+
2023/01/10 21:29:42 +-----------------------+-------+--------------+--------------+--------------+------------+--------------+--------------+--------------+
2023/01/10 21:29:42 |        METRIC         | COUNT |     MIN      |     MAX      |     MEAN     |   STDDEV   |    MEDIAN    |     95%      |     99%      |
2023/01/10 21:29:42 +-----------------------+-------+--------------+--------------+--------------+------------+--------------+--------------+--------------+
2023/01/10 21:29:42 | schema.inference      |     1 | 6.292371079s | 6.292371079s | 6.292371079s | 0s         | 6.292371079s | 6.292371079s | 6.292371079s |
2023/01/10 21:29:42 | run                   |     1 | 30.534362ms  | 30.534362ms  | 30.534362ms  | 0s         | 30.534362ms  | 30.534362ms  | 30.534362ms  |
2023/01/10 21:29:42 | operations.read.data  |     0 | 0s           | 0s           | 0s           | 0s         | 0s           | 0s           | 0s           |
2023/01/10 21:29:42 | operations.read.time  |     0 | 0s           | 0s           | 0s           | 0s         | 0s           | 0s           | 0s           |
2023/01/10 21:29:42 | operations.write.data |    11 | 4.999µs      | 44.35µs      | 22.819µs     | 15.622µs   | 29.294µs     | 44.35µs      | 44.35µs      |
2023/01/10 21:29:42 | operations.write.time |    11 | 15.929295ms  | 23.794854ms  | 20.231022ms  | 2.644308ms | 20.875363ms  | 23.794854ms  | 23.794854ms  |
2023/01/10 21:29:42 +-----------------------+-------+--------------+--------------+--------------+------------+--------------+--------------+--------------+

# Case 3)
Expected:

  • l1_P
  • l1_P_l2_B
$ ./gcsb load -p MY_PROJECT -i gcsbtest -d db1 -t l1_P_l2_B -o 1
  :
2023/01/10 21:33:11 Executing load phase
2023/01/10 21:33:11 +-----------+------------+------+-------+---------+
2023/01/10 21:33:11 |   TABLE   | OPERATIONS | READ | WRITE | CONTEXT |
2023/01/10 21:33:11 +-----------+------------+------+-------+---------+
2023/01/10 21:33:11 | l1_P_l2_B |          5 | N/A  | N/A   | LOAD    |
2023/01/10 21:33:11 | l1_P      |          1 | N/A  | N/A   | LOAD    |
2023/01/10 21:33:11 | l1_P_l2_C |          5 | N/A  | N/A   | LOAD    |
2023/01/10 21:33:11 +-----------+------------+------+-------+---------+
2023/01/10 21:33:11 +-----------------------+-------+--------------+--------------+--------------+------------+--------------+--------------+--------------+
2023/01/10 21:33:11 |        METRIC         | COUNT |     MIN      |     MAX      |     MEAN     |   STDDEV   |    MEDIAN    |     95%      |     99%      |
2023/01/10 21:33:11 +-----------------------+-------+--------------+--------------+--------------+------------+--------------+--------------+--------------+
2023/01/10 21:33:11 | schema.inference      |     1 | 6.686091806s | 6.686091806s | 6.686091806s | 0s         | 6.686091806s | 6.686091806s | 6.686091806s |
2023/01/10 21:33:11 | run                   |     1 | 34.79919ms   | 34.79919ms   | 34.79919ms   | 0s         | 34.79919ms   | 34.79919ms   | 34.79919ms   |
2023/01/10 21:33:11 | operations.read.data  |     0 | 0s           | 0s           | 0s           | 0s         | 0s           | 0s           | 0s           |
2023/01/10 21:33:11 | operations.read.time  |     0 | 0s           | 0s           | 0s           | 0s         | 0s           | 0s           | 0s           |
2023/01/10 21:33:11 | operations.write.data |    11 | 30.599µs     | 41.917µs     | 35.147µs     | 3.826µs    | 34.844µs     | 41.917µs     | 41.917µs     |
2023/01/10 21:33:11 | operations.write.time |    11 | 15.64244ms   | 25.150921ms  | 22.768607ms  | 2.641282ms | 23.397655ms  | 25.150921ms  | 25.150921ms  |
2023/01/10 21:33:11 +-----------------------+-------+--------------+--------------+--------------+------------+--------------+--------------+--------------+

Cause

(t *table) GetAllRelationNames() relies on child table information of an apex table.

gcsb/pkg/schema/table.go

Lines 359 to 369 in 0a71c86

func (t *table) GetAllRelationNames() []string {
    apex := t.GetApex()
    ret := []string{apex.Name()}
    child := apex.Child()
    for ok := true; ok; ok = (child != nil) {
        ret = append(ret, child.Name())
        child = child.Child()
    }
    return ret
}

The child reference gets overwritten whenever a different interleaved table is found while creating parental relationships.

gcsb/pkg/schema/tables.go

Lines 87 to 104 in 0a71c86

func (t *tables) Traverse() error {
    // Iterate over tables setting parental relationships
    for _, child := range t.tables {
        if child.ParentName() != "" {
            // fetch the parent table
            parent := t.GetTable(child.ParentName())
            if parent == nil {
                return fmt.Errorf("table '%s' references a parent table '%s' that is not in information schema", child.Name(), child.ParentName())
            }
            // Set parent as this tables parent
            child.SetParent(parent)
            // Set parents child
            parent.SetChildName(child.Name())
            parent.SetChild(child)
        }
    }
    return nil
}

Therefore, l1_P_l2_C eventually becomes the child of the table l1_P. This stale info was used in all the cases above, even though loading l1_P_l2_C was not part of the interleaving chain for the target table.

Moreover, if the unrelated l1_P_l2_C had child tables of its own, load operations would happen on those descendant tables as well.

The issue is troublesome in terms of workload and execution time, and it is risky when the unexpected load happens against an existing production DB.

Solution

It doesn't look appropriate to define and rely on a 1:1 parent-child relationship, because the actual relationship is 1:many. So the needed work is:

  • Use parent info instead of child info in GetAllRelationNames.
  • Change the table struct's child Table field so that it can contain multiple children, or just count the number of child tables (see the sketch after this list). It seems the child info is only stored to determine whether a table is interleaved:

    gcsb/pkg/schema/table.go

    Lines 323 to 325 in 0a71c86

    // IsInterleaved will return true if the table has a parent or child
    func (t *table) IsInterleaved() bool {
        return t.HasChild() || t.HasParent()
    }

    If that is the only purpose of storing the child info, keeping just the number of child tables should be enough, instead of these c and child fields:

    gcsb/pkg/schema/table.go

    Lines 75 to 76 in 0a71c86

    c     string // child name
    child Table
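A minimal sketch of the proposed 1:many change (illustrative only; the type and field names are simplified stand-ins, not gcsb's actual internals):

package schema

// Table is a simplified stand-in for gcsb's table type.
type Table struct {
    parent   *Table
    children []*Table // was a single child; now supports 1:many interleaving
}

// IsInterleaved reports whether the table participates in an interleave chain.
func (t *Table) IsInterleaved() bool {
    return t.parent != nil || len(t.children) > 0
}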

Remarks

  1. This issue can also be the cause of an implied bug and its workaround at https://github.com/cloudspannerecosystem/gcsb/blob/master/pkg/workload/core.go#L165-L167
					if n == t { // Avoid inserting t twice for some reason... i dont have time to figure out why this is happenign
						continue
					}
  2. Another odd thing is that Case 2) succeeds even though, according to the stats, there was no load operation on l1_P_l2_B.

Multiple table run doesn't work

Multiple tables cannot be passed as input as described in the README; the flag actually accepts a single string instead of a []string. The subsequent processing also appears not to support multiple tables.

flags.StringVarP(&runTable, "table", "t", "", "Table name to load")
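A plausible fix, sketched with pflag's string-array variant so the flag can be repeated (a sketch, not the actual patch):

var runTables []string
flags.StringArrayVarP(&runTables, "table", "t", []string{}, "Table name(s) to run against; repeat -t for multiple tables")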

Table insert sample with comments doesn't work

When I tried to create a table as per the README, it failed due to a DDL parsing error on the comment block.

When I removed the comments and ran it again, it worked fine. Is there something wrong with my settings, or do the comments or the README need fixing?

$ gcloud spanner databases ddl update {db-name} --instance={instance-name} --ddl-file=schemas/single_table.sql
ERROR: (gcloud.spanner.databases.ddl.update) INVALID_ARGUMENT: Error parsing Spanner DDL statement: /*\nCopyright 2022 Google LLC\n\nLicensed under the Apache License, Version 2.0 (the \"License\") : Syntax error on line 1, column 1: Encountered \'/\' while parsing: ddl_statement
- '@type': type.googleapis.com/google.rpc.LocalizedMessage
  locale: en-US
  message: |-
    Error parsing Spanner DDL statement: /*
    Copyright 2022 Google LLC

    Licensed under the Apache License, Version 2.0 (the "License") : Syntax error on line 1, column 1: Encountered '/' while parsing: ddl_statement
