
cq-provider-sdk's Introduction



CloudQuery is an open-source, high-performance data integration framework built for developers, with support for a wide range of plugins.

CloudQuery extracts, transforms, and loads configuration from cloud APIs, files, or databases to a variety of supported destinations, such as databases, data lakes, or streaming platforms, for further analysis.

Installation

See the Quickstart guide for instructions on how to start syncing data with CloudQuery.

Why CloudQuery?

  • Blazing fast: CloudQuery is optimized for performance, utilizing Go's excellent concurrency model with lightweight goroutines.
  • Deploy anywhere: CloudQuery plugins are single-binary executables and can be deployed and run anywhere.
  • Open source: Language-agnostic, extensible plugin architecture using Apache Arrow: develop your own plugins in Go, Python, Java or JavaScript using the CloudQuery SDK.
  • Pre-built queries: CloudQuery maintains a number of out-of-the-box security and compliance policies for cloud infrastructure.
  • Unlimited scale: CloudQuery plugins are stateless and can be scaled horizontally on any platform, such as EC2, Kubernetes, batch jobs or any other compute.

Use Cases

  • Cloud Security Posture Management: Use as an open source CSPM solution to monitor and enforce security policies across your cloud infrastructure for AWS, GCP, Azure and many more.
  • Cloud Asset Inventory: First-class support for major cloud infrastructure providers such as AWS, GCP and Azure allows you to collect and unify configuration data.
  • Cloud FinOps: Collect and unify billing data from cloud providers to drive financial accountability.
  • ELT Platform: With hundreds of plugin combinations and extensible architecture, CloudQuery can be used for reliable, efficient export from any API to any database, or from one database to another.
  • Attack Surface Management: Open source solution for continuous discovery, analysis and monitoring of potential attack vectors that make up your organization's attack surface.
  • Eliminate data silos: Eliminate data silos across your organization, unifying data between security, infrastructure, marketing and finance teams.

License

By contributing to CloudQuery you agree that your contributions will be licensed as defined in the LICENSE file.

Hiring

If you are into Go, backend, cloud, GCP, or AWS, ping us at jobs [at] our domain.

Contribution

Feel free to open a pull request for small fixes and changes. For bigger changes and new plugins, please open an issue first to prevent duplicated work and to have the relevant discussions.

Open source and open core

The CloudQuery framework, SDK and CLI are open source, while the plugins available under the plugins directory are open core. Not all contributions to the plugins directory will be accepted if they are part of the commercial plugin offering, so please file an issue before opening a PR.

cq-provider-sdk's People

Contributors

amanenk, bbernays, candiduslynx, cq-bot, daniil-ushkov, disq, erezrokah, hermanschaaf, irmatov, michelvocks, roneli, shimonp21, spangenberg, yevgenypats, zagronitay


cq-provider-sdk's Issues

feat(resolvers): Remove column level `IgnoreError`

SDK consumers can define IgnoreError handlers per table and per column.

The column level functionality seems to be used in a single resource in the GCP provider https://github.com/cloudquery/cq-provider-gcp/blob/8db6d1e043aec6b17f1f30afc3a0cb551595e34e/resources/services/kubernetes/clusters.go#L145

As this feature is not properly documented (only in our code) and is barely used, we should remove it from the SDK and change the GCP provider to add a custom resolver for IP addresses that ignores the relevant errors. A possible shape of that resolver is sketched below.
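
A minimal, hedged sketch of such a custom resolver, assuming the ColumnResolver signature from this SDK; the clusterItem type and its Addresses field are hypothetical placeholders for the real GCP cluster struct:

// Hypothetical custom resolver: parse IP strings and drop values that fail to
// parse, instead of relying on the column-level IgnoreError hook.
func resolveClusterIPs(ctx context.Context, meta schema.ClientMeta, r *schema.Resource, c schema.Column) error {
	item, ok := r.Item.(clusterItem) // clusterItem is a placeholder for the real GCP type
	if !ok {
		return fmt.Errorf("unexpected item type %T", r.Item)
	}
	ips := make([]net.IP, 0, len(item.Addresses))
	for _, s := range item.Addresses {
		if ip := net.ParseIP(s); ip != nil {
			ips = append(ips, ip)
		}
		// invalid values are skipped silently, which is what IgnoreError achieved
	}
	return r.Set(c.Name, ips)
}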

Potential performance and developer experience improvements/investigation around database inserts and PKs

Describe the resource.

Currently the insert logic is as follows:

  • Resolve the main table (for each account and region)
  • CopyFrom the array/map and insert into the database
    • If CopyFrom fails, insert rows one by one (it would be interesting to have statistics on how often this fails and why)
    • If there is a conflict on a primary key, delete-cascade the parent and insert the new resource
  • After the fetch for a resource and a specific client (i.e. an AWS account_id and region) finishes, delete data from the previous fetch cycle using the delete filter

Given that in a big account most of the resources stay the same, will CopyFrom always fail? Can we benchmark the on-conflict replace?

Another thought is that there are two highly related fields: primary keys and DeleteFilter.

Might be interesting to benchmark the following insert algorithm:

  • Resolve main table
  • CopyFrom to temp table
  • When the fetch for an account_id, region and resource finishes, run the following transaction (a Go sketch follows the list):
    - Delete from table where account_id=something, region=something # (or any deletefilter) ;
    - insert into table from table_tmp where delete_filter;
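
A hedged Go sketch of that transaction step, assuming pgx v4 and a staging table named <table>_tmp already populated by CopyFrom; table and filter column names are illustrative, and the table-name interpolation is only for brevity:

// Swap in the latest fetch results atomically: delete the previous cycle's
// rows for this account/region, then move the staged rows over.
func swapFetchResults(ctx context.Context, conn *pgx.Conn, table, accountID, region string) error {
	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op once Commit succeeds

	if _, err := tx.Exec(ctx,
		fmt.Sprintf("DELETE FROM %s WHERE account_id = $1 AND region = $2", table),
		accountID, region); err != nil {
		return err
	}
	if _, err := tx.Exec(ctx,
		fmt.Sprintf("INSERT INTO %s SELECT * FROM %s_tmp WHERE account_id = $1 AND region = $2", table, table),
		accountID, region); err != nil {
		return err
	}
	return tx.Commit(ctx)
}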

Things to measure:
- database performance in both cases

Advantages:
- Developer experience: a PK might not be needed in addition to the delete filter, and bugs caused by wrong PKs might occur less often.
- The main table might be more consistent, as it won't contain deleted resources.
- Given that most of the bugs were incorrect PKs rather than duplicate data (apart from a few global resources), those bugs and the related maintenance could be eliminated.

Disadvantages:
- (Depending on the performance test) it might take more time for the main table to be updated with new data.
- It might be harder to detect duplicate data

Use Case

Performance and dev experience

Additional context

No response

automated resolver can't handle *[]string to inet array

Describe the bug

automated resolver can't handle *[]string to inet array

Expected Behavior

The SDK should be able to convert *[]string to an inet array using a path resolver.

Steps to Reproduce

In the Azure provider resource azure_network_virtual_networks, replace:

 			{
				Name:        "dhcp_options_dns_servers",
				Description: "The list of DNS servers IP addresses.",
				Type:        schema.TypeInetArray,
				Resolver:    resolveNetworkVirtualNetworksDhcpOptionsDnsServers,
			},

with

 			{
				Name:        "dhcp_options_dns_servers",
				Description: "The list of DNS servers IP addresses.",
				Type:        schema.TypeInetArray,
				Resolver: schema.PathResolver("DhcpOptions.DNSServers"),
			},

and run the azure_network_virtual_networks mock test.

Possible Solution

No response

Provider and CloudQuery version

cloudquery 0.19.2, azure 0.3.11

Additional Context

No response

Support parallelization rate limit

Requested in cloudquery/cloudquery#159

Support resource fetch rate limit

  • Limit the number of resources fetched in parallel
  • Add a delay between client multiplexes (i.e. a client multiplexes over accounts/regions)
  • Limit the number of concurrent client multiplexes that execute in parallel
  • Create a global parallel client limit, i.e. with X resources and Y clients per resource, M <= X * Y

Add support for common resolvers

Add support for common resolvers from strings (a sketch of one such helper follows the list):

  • IP Resolver
  • INET resolver
  • Mac resolver
  • UUID resolver
  • Datetime Resolver
  • Date Resolver
  • Add type transforms: string -> int, int -> string, etc.
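
A hedged sketch of what one such helper could look like; the name, signature and use of funk.Get are illustrative, not a final API:

// DateResolver reads a string at the given path on the fetched item and
// stores it in the column as time.Time, using the supplied layout.
func DateResolver(path, layout string) schema.ColumnResolver {
	return func(ctx context.Context, meta schema.ClientMeta, r *schema.Resource, c schema.Column) error {
		s, ok := funk.Get(r.Item, path).(string)
		if !ok || s == "" {
			return nil // nothing to resolve
		}
		t, err := time.Parse(layout, s)
		if err != nil {
			return err
		}
		return r.Set(c.Name, t)
	}
}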

Question: What is the difference between SmallInt, Int, BigInt

How should a user know which integer type to use? Why not always use BigInt?

Also - https://github.com/cloudquery/cq-provider-sdk/blob/main/provider/schema/column.go#L191

uint16 is TypeInt, while *uint16 looks like it is not supported at all, yet int16 is TypeSmallInt. How is this decided?

	switch val := v.(type) {
	case int8, *int8, uint8, *uint8, int16, *int16:
		return c.Type == TypeSmallInt
	case uint16, int32, *int32:
		return c.Type == TypeInt
	case int, *int, uint32, *uint32, int64, *int64:
		return c.Type == TypeBigInt
	case []byte:

RFC: Make IgnoreError propagate to resolvers and child tables

Is your feature request related to a problem? Please describe.

Create an RFC to discuss all the edge cases that can happen if we want to propagate IgnoreError to all column resolvers as well as child tables.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Allow the SDK to convert from int64 to int automatically

Describe the bug

Columns that are described as int but have an int64 source type are not converted to int automatically.

Expected Behavior

Columns described as int type should be automatically converted from int64 to int.

Steps to Reproduce

Create a resource from a structure that has an int64 field and describe the column type as int.

Possible Solution

add automatic conversion for this pair of types.

Provider and CloudQuery version

cloudquery 0.22.10, aws 0.11.1

Additional Context

No response

Ensure cq_id is unique across the entire fetch

In order to decouple database write issues from API/resolving issues, we need the SDK to keep track of all cq_ids that are inserted and to ensure that every cq_id (at a top-level table) is unique. When one is not unique, we can be 100% sure it is a fetching problem.
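
A hedged sketch of the kind of bookkeeping this implies, assuming one tracker per fetch and cq_ids of the github.com/google/uuid type; names are proposals only:

// cqIDTracker records every top-level cq_id seen in the current fetch and
// rejects duplicates so fetch problems are caught before the database write.
type cqIDTracker struct {
	mu   sync.Mutex
	seen map[uuid.UUID]struct{}
}

func newCqIDTracker() *cqIDTracker {
	return &cqIDTracker{seen: make(map[uuid.UUID]struct{})}
}

func (t *cqIDTracker) markUnique(id uuid.UUID) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, dup := t.seen[id]; dup {
		return fmt.Errorf("duplicate cq_id %s in a top-level table: this is a fetching problem, not a database one", id)
	}
	t.seen[id] = struct{}{}
	return nil
}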

Fix "error: no errors"

Describe the bug

An interesting bug that stems from the fact that diag.Diagnostics implements the error interface (i.e. it has a func Error() string method): both diag.BaseError and diag.Diagnostics (a slice of diagnostics) implement the error interface:

type error interface {
	Error() string
}

This means that if a function returns diag.Diagnostics but its result is treated as an error, it will be implicitly converted. Thus, even though the function returned nil, the caller will actually see error: no errors.
To reproduce:

func Test_DiagAndErrors(t *testing.T) {
	var err error

	err = functionReturningDiags0()
	if err != nil {
		fmt.Printf("error: %vֿ\n", err)
	}

	err = functionReturningDiagsNil()
	if err != nil {
		fmt.Printf("error: %vֿ\n", err)
	}
}

func functionReturningDiagsNil() diag.Diagnostics {
	return nil
}

func functionReturningDiags0() diag.Diagnostics {
	return diag.Diagnostics{}
}

Output:

error: no errors
error: no errors

Insights:
A customer complained about error: no errors in Discord a couple of weeks ago (in relation to policies).
Implicit conversion is bad! The code for converting Diagnostics to errors makes me believe that this is a side effect the writer did not intend:

func (diags Diagnostics) Error() string {

My proposal would be to rename Diagnostics.Error() to Diagnostics.ToError(), so the type no longer implements the error interface and this implicit conversion becomes impossible.

Expected Behavior

Steps to Reproduce

Possible Solution

No response

Provider and CloudQuery version

0.24.7

Additional Context

No response

resolve columns error: invalid CIDR address: --- Test FAIL:

In tests, and I guess in general, it's very hard to debug when resolvers fail.

For example, the following error doesn't indicate which column failed or what it contains:

resolve columns error: invalid CIDR address: --- FAIL: TestEc2Eips (0.04s)
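
A hedged sketch of the kind of context the SDK could attach when a column resolver fails; the surrounding names are illustrative of where this would sit, not the current code:

// Wrap resolver failures with the table and column name so test output points
// straight at the problem instead of a bare parsing error.
if err := col.Resolver(ctx, meta, resource, col); err != nil {
	return fmt.Errorf("table %s: failed to resolve column %q: %w", table.Name, col.Name, err)
}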

JSON column validation rejects []struct

When trying to store a JSON object in PG that has a []struct type in Go, we receive the following error:

insert failed column fields expected TypeJSON got []struct

This seems very similar to #36

Add a helper message for debugging

When running a provider it will usually print the following message:

./cq-provider-template 
This binary is a plugin. These are not meant to be executed directly.
Please execute the program that consumes these plugins, which will
load any plugins automatically

I suggest also adding the following.

To run the provider in debug mode please use CQ_PROVIDER_DEBUG=1

Generate random cq_id if some primary keys are null

Currently, if any primary key is null we return an error in GenerateCqID. Instead of returning an error, log a warning and generate a random ID. This should only be allowed for tables that have a parent; main tables must always return an error.

Support global table

Some tables are static/global, meaning fetching them each time returns the same data.
Global tables should not fail if data already exists, but rather replace it, i.e. the latest fetch takes precedence.

postResolver error is not logged to console

Some errors during tests are not shown in the log. For example, the error parsing time "jgZGgCfHJebDjhZNAYTUuLXoP" as "2006-01-02T15:04:05Z07:00": cannot parse "jgZGgCfHJebDjhZNAYTUuLXoP" as "2006" happened, but the console didn't show it.

JSON column validation rejects []interface{}

TypeJSON columns currently only accept map, []byte, string, and *string values: https://github.com/cloudquery/cq-provider-sdk/blob/main/provider/schema/column.go#L139

	// Maps are jsons
	if reflect2.TypeOf(v).Kind() == reflect.Map {
		return c.Type == TypeJSON
	}

	switch val := v.(type) {
	...
	case []byte:
		...
		return c.Type == TypeByteArray || c.Type == TypeJSON
	...
	case string:
		...
		if c.Type == TypeJSON {
			return true
		}
		...
	case *string:
		if c.Type == TypeJSON {
			return true
		}

... whereas postgres json & jsonb columns support JSON arrays as well as primitives and pgx accepts virtually anything it can marshal: https://github.com/jackc/pgtype/blob/master/json.go#L52-L57

At the very least, it would be useful to be able to store JSON-serializable slices in TypeJSON columns.
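
A hedged sketch of a looser validation branch, delegating to encoding/json the way pgx ultimately does instead of enumerating Go kinds; where exactly this lands inside the existing switch is illustrative:

// Accept anything json.Marshal can encode for TypeJSON columns, which covers
// []interface{}, []SomeStruct, maps and primitives alike.
if c.Type == TypeJSON {
	if v == nil {
		return true
	}
	_, err := json.Marshal(v)
	return err == nil
}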

resource.go:157: row at table has an empty column

Currently when testing you can get the following error:

    resource.go:157: row at table aws_ec2_ebs_snapshots has an empty column

It's very hard to debug this without knowing which column is empty, or at least printing the violating row so it can be inspected manually.

Duplicate Provider Name

Currently provider Name is required in main.go

	serve.Serve(&serve.Options{
		// CHANGEME: change to your provider name
		Name:                "YourProviderName",
		Provider:            resources.Provider(),
		Logger:              nil,
		NoLogOutputOverride: false,
	})

and in resources/provider.go

	return &provider.Provider{
		// CHANGEME: Change to your provider name
		Name:      "YourProviderName",
		Configure: client.Configure,
		ResourceMap: map[string]*schema.Table{
			"demo_resource": DemoResource(),
		},
		Migrations: providerMigrations,
		Config: func() provider.Config {
			return &client.Config{}
		},
	}

It looks like it should only be needed in provider.Provider.

Remove Default in schema.Table

Describe the bug

The Default option was added to set a default value when an error occurs or the getter returns nil.

The logic during execution is a little inconsistent: with funk.Get, if nil is returned the default value is set (if the default is not nil), but if a resolver is set, the default is only applied when the resolver returns an error. In some cases, like PathResolver, it acts the same as funk.Get and returns nil.

Expected Behavior

The default value should either always be set or be removed.

Steps to Reproduce

Possible Solution

Remove the Default option from schema.Table and use schema.PathResolverWithDefault, or let resolvers define defaults or accept them as arguments.

Before removing the Default option, we can first add schema.PathResolverWithDefault as a starter (a possible shape is sketched below).

Another option is to define the expected logic of when a default should be set.
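
A hedged sketch of what schema.PathResolverWithDefault could look like; this is a proposed helper, not an existing SDK function, and the funk.Get usage simply mirrors the current default logic described above:

// PathResolverWithDefault behaves like schema.PathResolver but substitutes a
// caller-supplied default when the path resolves to nil.
func PathResolverWithDefault(path string, defaultValue interface{}) schema.ColumnResolver {
	return func(ctx context.Context, meta schema.ClientMeta, r *schema.Resource, c schema.Column) error {
		v := funk.Get(r.Item, path)
		if v == nil {
			v = defaultValue
		}
		return r.Set(c.Name, v)
	}
}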

Provider and CloudQuery version

all

Additional Context

No response


partition-service-region file not respected when customer specifies custom "regions" in config

Describe the bug

The partition-service-region file is not respected when the customer specifies custom "regions" in the config.

Syncing CloudQuery providers [aws@latest]

Initializing CloudQuery Providers...

✓ [email protected] verified    1s  100 %

Finished provider initialization...

Checking available provider updates...


Finished syncing providers...

Starting provider fetch...

⚠️ cq-provider-aws@latest diagnostics: 2  3m1s   Finished Resources: 161/161

Provider fetch complete.

Fetch Diagnostics:

Resource: qldb.ledgers Type: Resolving Severity: Error
        Summary: failed to resolve table "aws_qldb_ledgers": error at github.com/cloudquery/cq-provider-aws/resources/services/qldb.fetchQldbLedgers[ledgers.go:264] operation error QLDB: ListLedgers, exceeded maximum number of attempts, 10, https response error StatusCode: 0, RequestID: , request send failed, Get "https://qldb.eu-west-3.amazonaws.com/ledgers": dial tcp: lookup qldb.eu-west-3.amazonaws.com on 10.100.102.1:53: no such host
Resource: workspaces.directories Type: Resolving Severity: Error
        Summary: failed to resolve table "aws_workspaces_directories": error at github.com/cloudquery/cq-provider-aws/resources/services/workspaces.fetchWorkspacesDirectories[directories.go:247] operation error WorkSpaces: DescribeWorkspaceDirectories, exceeded maximum number of attempts, 10, https response error StatusCode: 0, RequestID: , request send failed, Post "https://workspaces.eu-west-3.amazonaws.com/": dial tcp: lookup workspaces.eu-west-3.amazonaws.com on 10.100.102.1:53: no such host
Resource: workspaces.workspaces Type: Resolving Severity: Error
        Summary: failed to resolve table "aws_workspaces_workspaces": error at github.com/cloudquery/cq-provider-aws/resources/services/workspaces.fetchWorkspacesWorkspaces[workspaces.go:170] operation error WorkSpaces: DescribeWorkspaces, exceeded maximum number of attempts, 10, https response error StatusCode: 0, RequestID: , request send failed, Post "https://workspaces.eu-west-3.amazonaws.com/": dial tcp: lookup workspaces.eu-west-3.amazonaws.com on 10.100.102.1:53: no such host

Provider aws fetch summary: ✓ Total Resources fetched: 241       ⚠️ Warnings: 0   ❌ Errors: 3

Expected Behavior

The partition-service-region file should be respected

Steps to Reproduce

cloudquery {
  plugin_directory = "./cq/providers"
  policy_directory = "./cq/policies"

  provider "aws" {
    version = "latest"
  }

  connection {
    username = "postgres"
    password = "pass"
    host     = "localhost"
    port     = 5432
    database = "postgres"
    sslmode  = "disable"
  }
}

provider "aws" {
  configuration {
    // Optional, Repeated. Add an 'accounts' block for every account you want to assume-role into and fetch data from.
    // accounts "<UNIQUE ACCOUNT IDENTIFIER>" {
    // Optional. Role ARN we want to assume when accessing this account
    // role_arn = < YOUR_ROLE_ARN >
    // Optional. Named profile in config or credential file from where CQ should grab credentials
    // local_profile = < PROFILE_NAME >
    // }
    // Optional. by default assumes all regions
     regions = ["eu-west-3"]
    // Optional. Enable AWS SDK debug logging.
    aws_debug = false
    // The maximum number of times that a request will be retried for failures. Defaults to 10 retry attempts.
    // max_retries = 10
    // The maximum back off delay between attempts. The backoff delays exponentially with a jitter based on the number of attempts. Defaults to 30 seconds.
    // max_backoff = 30
  }

  // list of resources to fetch
  // resources = ["wafv2.managed_rule_groups"]
   resources = ["*"]
  // resources = ["lambda.functions"]
}

Possible Solution

No response

Provider and CloudQuery version

Version: 0.24.8, aws-provider 0.12.4

Additional Context

No response

ulimit buffer

Instead of setting max_goroutines as high as possible, I think we should set it to something like ulimit * 0.8, so we don't try to take all of the descriptors; we know there are other things that use file descriptors besides our goroutines.
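
A hedged sketch of deriving that cap on Unix-like systems via syscall.Getrlimit; the 0.8 factor and the fallback value are assumptions:

// maxGoroutinesFromUlimit returns roughly 80% of the file-descriptor limit,
// leaving headroom for sockets, database connections and log files.
func maxGoroutinesFromUlimit() uint64 {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 1024 // conservative fallback when the limit cannot be read
	}
	return uint64(float64(lim.Cur) * 0.8)
}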

Feature: add recover from panic inside fetch

Fetching with the SDK can sometimes panic, since we use a lot of runtime reflection and type assertions in the code. We should recover from the panic, print the error to the log with a full stack trace, and return an error.
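
A hedged sketch of the wrapper this implies, assuming hclog for logging and runtime/debug for the stack trace; names are illustrative:

// safeFetch runs a table fetch, converts any panic into a returned error and
// logs the full stack trace instead of crashing the provider.
func safeFetch(ctx context.Context, tableName string, logger hclog.Logger, fetch func(context.Context) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			logger.Error("fetch panicked", "table", tableName, "panic", r, "stack", string(debug.Stack()))
			err = fmt.Errorf("fetch of table %s panicked: %v", tableName, r)
		}
	}()
	return fetch(ctx)
}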

Add Error Types

Add error type interfaces, i.e. ThrottleError, AccessError, InsertionError, etc.
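
A hedged sketch of one possible shape for such typed errors; names and fields are proposals only, not an existing SDK API:

// ThrottleError marks failures caused by API rate limiting so callers can
// back off and retry; AccessError marks permission problems that should be
// surfaced as warnings rather than retried.
type ThrottleError struct{ Err error }

func (e ThrottleError) Error() string { return "throttled: " + e.Err.Error() }
func (e ThrottleError) Unwrap() error { return e.Err }

type AccessError struct{ Err error }

func (e AccessError) Error() string { return "access denied: " + e.Err.Error() }
func (e AccessError) Unwrap() error { return e.Err }

// Callers can then branch with errors.As:
//	var t ThrottleError
//	if errors.As(err, &t) {
//		// back off and retry
//	}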

Remove DefaultConfigGenerator

Generate the config automatically from config object tags and from resources to prevent bugs and inconsistencies between the config and the real implementation.

Allow custom field to be used for relation

Currently, only the id field can be used for the relation.
That generates unnecessary fields where the relation could be made via an already present field.
Example:
Types Volume and VolumeAttachment: VolumeAttachment has a VolumeId field, but it cannot be used for the relation mapping.

Feature\Allow partial fetch

In some use cases, for example fetching many pages of resources, some resources may fail in a relation resolver, a column resolver or a post-resource resolver, or even panic.

In this case we should allow the user to pass a partial fetch option, allowing us to skip the broken resource and continue fetching.

The SDK execution should return an execution result with all errors that occurred in a table fetch and which resources failed, with any information that will help the user fix the issue.

Users should enable this feature explicitly, acknowledging that they accept partial fetches and that some data might be missing.

Implement a unified type assertion errors handler

There are a lot of type assertions that require checks, and each time the developer needs to build an error by hand.
An integrated solution would ease development.
The first construction that comes to mind:

r, ok := resource.Item.(types.Type)
if !ok {
	return fmt.Errorf("wrong type assertion: got %T instead of %s",
		resource.Item, reflect.TypeOf(types.Type{}).Name())
}
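
A hedged sketch of a shared helper, assuming Go 1.18 generics are available; the name and placement are proposals only:

// AssertItem performs the assertion once and produces a consistent error
// message, so providers no longer build it by hand.
func AssertItem[T any](item interface{}) (T, error) {
	v, ok := item.(T)
	if !ok {
		return v, fmt.Errorf("wrong type assertion: got %T, expected %s",
			item, reflect.TypeOf((*T)(nil)).Elem())
	}
	return v, nil
}

// Usage inside a resolver:
//	vol, err := AssertItem[types.Volume](resource.Item)
//	if err != nil {
//		return err
//	}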
