
gluon's People

Contributors

gangtao, jingyaokang, jovezhong, ye11ow, yokofly, zliang-min


gluon's Issues

HTTP post fails when running multiple processes with Gluon

When running 40 ingestions and 40 queries in parallel, the following errors are reported:

2022-05-21T06:07:08.303417+0000 - ERROR - http post failed HTTPConnectionPool(host='172.31.50.206', port=8000): Max retries exceeded with url: /api/v1beta1/streams/github_6/ingest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e322cf430>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
2022-05-21T06:07:08.306757+0000 - ERROR - http get failed HTTPConnectionPool(host='172.31.50.206', port=8000): Max retries exceeded with url: /api/v1beta1/queries/55ff1c73-47c9-4303-9346-c91f65054d38 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e322d4610>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
2022-05-21T06:07:08.309252+0000 - ERROR - http get failed HTTPConnectionPool(host='172.31.50.206', port=8000): Max retries exceeded with url: /api/v1beta1/queries/348b2a28-d182-4e6d-a9c1-118d3d59a7f2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e322d4340>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))

@zliang-min already did some investigation; see:

https://stackoverflow.com/questions/30943866/requests-cannot-assign-requested-address-out-of-ports

Right, gluon is not using Session; see e.g. https://github.com/timeplus-io/gluon/blob/develop/python/timeplus/env.py#L182

I think it's mostly the ingestion causing the problem in your case. I wonder if calling r.close() explicitly would help. Maybe you can try adding r.close() to the methods in env.py and see if that helps (though I am not sure how this would impact performance, e.g. it might break keep-alive).
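
A minimal sketch of the Session-based alternative, assuming a helper shape similar to the plain requests calls in env.py (the helper name post_json is hypothetical):

import requests

# A shared Session reuses TCP connections through urllib3's pool, so
# parallel ingestion doesn't exhaust the client's ephemeral ports
# (the "Cannot assign requested address" errors above).
_session = requests.Session()

def post_json(url, payload, timeout=10):
    # Hypothetical replacement for the bare requests.post() calls in env.py.
    r = _session.post(url, json=payload, timeout=timeout)
    r.raise_for_status()
    return r.json()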

"failed to recieve data from websocket" in 0.2.1

After I upgraded to 0.2.1 and ran the same Python script, I got many errors:

2022-09-22T01:33:20.515584+0000 - ERROR - failed to recieve data from websocket
error Expecting value: line 1 column 1 (char 0)
2022-09-22T01:33:20.633927+0000 - ERROR - failed to recieve data from websocket
error Expecting value: line 1 column 1 (char 0)
2022-09-22T01:33:20.755501+0000 - ERROR - failed to recieve data from websocket
error Expecting value: line 1 column 1 (char 0)

stream auto_create doesn't work when creating a source

  # import path assumed; adjust to your SDK version
  from timeplus import KafkaSource, KafkaProperties, SourceConnection

  source = (
      KafkaSource()
      .name(name)
      .properties(
          KafkaProperties()
          .topic("github_events")
          .brokers(BROKER)
          .sasl("plain")
          .username(KAFKA_API_KEY)
          .password(KAFKA_SECRET_KEY)
      )
  )

  source.connection(SourceConnection().stream("kafka").auto_create(True))
  source.create().start()

The source is created and started, but no stream is created, and there is no Neutron error.
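
A possible workaround until auto_create is fixed: create the target stream explicitly before starting the source, using the Stream API from the other issues here (the import path and column schema are assumptions):

from timeplus import Stream, StreamColumn  # import path assumed

# The stream name must match SourceConnection().stream("kafka");
# the single raw string column is a hypothetical schema.
s = (
    Stream()
    .name("kafka")
    .column(StreamColumn().name("raw").type("string"))
)
if s.get() is None:
    s.create()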

cannot insert data with _tp_time

I created a stream with a single cnt column (actually the stream was created automatically by a Timeplus Sink).

If I use the Python SDK to insert data, I cannot set the _tp_time column; otherwise the insert fails:

s = (
    Stream().name("a")
    .column(StreamColumn().name("_tp_time").type("datetime64(3, 'UTC')"))
    .column(StreamColumn().name("cnt").type("uint64"))
)
s.insert([["2022-09-09T09:00:00",9595]])

Error

timeplus.error.TimeplusAPIError: http method post, response code 500, failed to send http post due to {"code":500,"message":"failed to ingest data to stream a: code: 27, message: Code: 27. DB::ParsingException: Cannot parse input: expected '"' before: '-09-09T09:00:00",9595]]': (while reading the value of key cnt): \nRow 1:\nColumn 0, name: cnt, type: uint64, ERROR: text "[\u003cDOUBLE QUOTE\u003e2022-09-" is not like uint64\n\n: (at row 1)\n. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 1.0.82)"}

If I remove the _tp_time column, the data can be inserted:

s = (
    Stream().name("a")
    .column(StreamColumn().name("cnt").type("uint64"))
)
s.insert([[950]])

cannot insert data if the column names are not set, but it still shows `insert success`

In my case, the stream was created with SQL, not via the Python code:

s = (
    Stream()
    .name("github_events")
)

When I insert data, the console output is:

post http://localhost:8000/api/v1beta1/streams/github_events/ingest
insert {'columns': [], 'data': [['id', '20835192348'], ['created_at', '2022-03-20T04:03:40'], ['type', 'CreateEvent'], ['repo', 'Bobsang132/GIT-BASH'], ['payload', {'ref': 'main', 'ref_type': 'branch', 'master_branch': 'main', 'description': None, 'pusher_type': 'user'}]]}
insert success

But actually no data is inserted. It seems I need to declare the columns even when the stream was created elsewhere. This is counter-intuitive, since the column names are already passed to the insert method. Re-declaring the columns locally works around it, as in the sketch below.
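
A workaround sketch that re-declares the columns so the SDK sends a non-empty 'columns' list in the ingest payload; the import path and the column types are assumptions about the SQL-created schema:

from timeplus import Stream, StreamColumn  # import path assumed

s = (
    Stream()
    .name("github_events")
    # declared locally only so the ingest payload carries column names
    .column(StreamColumn().name("id").type("string"))
    .column(StreamColumn().name("created_at").type("datetime"))
    .column(StreamColumn().name("type").type("string"))
    .column(StreamColumn().name("repo").type("string"))
    .column(StreamColumn().name("payload").type("string"))
)
# row format assumed positional (matching the declared column order),
# as in the _tp_time issue above
s.insert([["20835192348", "2022-03-20T04:03:40", "CreateEvent",
           "Bobsang132/GIT-BASH", "{}"]])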

cannot insert datetime directly

I got a Python datetime object and wanted to insert data via:

s.insert([ ["id",e.id],["created_at",e.created_at],..)

I got this error message:

failed to insert Object of type datetime is not JSON serializable

The workaround is to call e.created_at.isoformat().

Ideally the Python SDK would accept native datetime objects; one possible approach is sketched below.
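
A minimal sketch of how the SDK could serialize datetime values automatically, using the default hook of json.dumps:

import json
from datetime import date, datetime

def json_default(value):
    # Fall back to ISO 8601 for datetime/date; anything else still raises.
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")

payload = [["created_at", datetime(2022, 3, 20, 4, 3, 40)]]
print(json.dumps(payload, default=json_default))
# [["created_at", "2022-03-20T04:03:40"]]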

build issue

(gluon) jameshao@JamesdeMacBook-Pro python % python3 -m build

  • Creating venv isolated environment...
  • Installing packages in isolated environment... (setuptools>=42)
  • Getting dependencies for sdist...
    running egg_info
    creating src/timeplus.egg-info
    writing src/timeplus.egg-info/PKG-INFO
    writing dependency_links to src/timeplus.egg-info/dependency_links.txt
    writing requirements to src/timeplus.egg-info/requires.txt
    writing top-level names to src/timeplus.egg-info/top_level.txt
    writing manifest file 'src/timeplus.egg-info/SOURCES.txt'
    error: package directory 'src/tests' does not exist

solution app resource loader

For a solution, the user might create:

  1. sources
  2. stream
  3. view
  4. query
  5. sinks

The solution app resource loader is an application that will create these resources from a configuration file (JSON, YAML, or conf); a loader sketch follows.
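
A minimal sketch of such a loader for the stream portion, assuming a YAML file with one top-level key per resource type; the config layout and import path are assumptions:

import yaml  # requires PyYAML

from timeplus import Stream, StreamColumn  # import path assumed

def load_resources(path):
    # Create resources in dependency order:
    # sources -> streams -> views -> queries -> sinks.
    with open(path) as f:
        config = yaml.safe_load(f)
    for spec in config.get("streams", []):
        s = Stream().name(spec["name"])
        for col in spec.get("columns", []):
            s = s.column(StreamColumn().name(col["name"]).type(col["type"]))
        s.create()
    # sources, views, queries, and sinks would follow the same pattern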

Customer Feedback

json stream ingest does not work

swagger_client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 22 Jun 2023 20:11:26 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '111', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token, Authorization, accept, origin, Cache-Control, X-Requested-With', 'Access-Control-Allow-Methods': 'POST, HEAD,PATCH, DELETE, OPTIONS, GET, PUT', 'Access-Control-Allow-Origin': '*', 'X-Request-Id': 'H0eB3RGybTYSUHAuH1TP4'})
HTTP response body: b'{"code":400,"message":"bad event: json: cannot unmarshal string into Go value of type map[string]interface {}"}'

Restart the query if neutron (the API server) is restarted

Today, if neutron gets restarted, the streaming SQL will be cancelled.
For the Python SDK client, it would be great if the query could be resubmitted, or the client could decide how to resubmit.
Today, when the client uses the Python SDK to run a streaming query and neutron gets restarted, the query just hangs. A client-side resubmission sketch follows.
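
A client-side sketch that resubmits a streaming query when the connection drops; run_streaming_query is a hypothetical stand-in for the SDK's query API:

import time

def run_with_resubmit(sql, run_streaming_query, max_retries=5):
    # Resubmit the streaming query if neutron restarts and the
    # underlying connection drops; back off between attempts.
    for attempt in range(max_retries):
        try:
            for row in run_streaming_query(sql):
                yield row
            return  # the query finished normally
        except ConnectionError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"query failed after {max_retries} resubmissions")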

able to control logging

Today, the Python SDK prints many log messages with the print method.
This is overwhelming when inserting a lot of data; a logging-based sketch follows.
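
A minimal sketch of moving to the standard logging module, so users can tune verbosity; the logger name "timeplus" is an assumption:

import logging

# SDK side: a module-level logger instead of print()
logger = logging.getLogger("timeplus")
logger.debug("insert success")  # was: print("insert success")

# User side: silence the SDK's debug chatter during bulk inserts
logging.getLogger("timeplus").setLevel(logging.WARNING)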

Python SDK users can skip calling `address(..)`; also show a warning for the workspace input

The README of the Python SDK shows Environment().address(api_address).workspace(workspace).apikey(api_key). This is confusing for new users: they don't know what api_address and workspace are.

Most of our users are on https://us.timeplus.cloud, so this should be the default value:

api_key = os.environ.get("TIMEPLUS_API_KEY")      # 60 characters
workspace = os.environ.get("TIMEPLUS_WORKSPACE")  # 8 characters, such as weioctg9, not MyWorkspaceName
env = Environment().workspace(workspace).apikey(api_key)
# if you are not using https://us.timeplus.cloud, call address(..), e.g.
# Environment().address("https://my.timeplus.cloud")

Add a check in workspace() that the input is 8 characters; if not, show an error, e.g. "Did you set the workspace name? Please set the workspace ID (usually 8 characters long)".
Add a check in apikey() that the input is 60 characters; if not, show an error, e.g. "The API key should be 60 characters". A sketch of both checks follows.
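
A sketch of those checks, assuming workspace() and apikey() are fluent setters on Environment; the internal attribute names are placeholders:

def workspace(self, workspace_id):
    # The workspace ID is 8 characters (e.g. weioctg9), not the display name.
    if len(workspace_id) != 8:
        raise ValueError(
            "Did you set the workspace name? "
            "Please set the workspace ID (usually 8 characters long)"
        )
    self._workspace = workspace_id
    return self

def apikey(self, key):
    if len(key) != 60:
        raise ValueError("The API key should be 60 characters")
    self._apikey = key
    return self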

cannot insert json string directly

Somewhat similar to #3.

I got a dict from a Python API and inserted the column via ["jsonColumn", jsonValue].
This generates JSON to the REST API like ['payload', {'push_id': 93968

I got insert success, but the data is actually not inserted.

I guess I need to turn the JSON object into a string, e.g.:

import json

["payload",json.dumps(e.payload)]

publish 1.3.0 official build

The latest version today is 1.3.0b2 ("b" means beta). We should release the official version before Sep 15. Soft GA was early Aug.

query.delete / metadata

  1. After creating a new query and doing nothing, deleting it without calling query.cancel() first causes an error.
  2. A KeyError: 'id' in the metadata if there is no sleep(3) after ingesting data.

A workaround sketch for both points follows.
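
A workaround sketch, using the cancel() call mentioned above; 'query' is an existing query object from the SDK and the metadata accessor name is hypothetical:

import time

# 1. Cancel before deleting, even if the query never ran.
query.cancel()
query.delete()

# 2. Give ingestion a moment to land before reading metadata,
#    otherwise the lookup raises KeyError: 'id'.
time.sleep(3)
metadata = query.metadata()  # hypothetical accessor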

view interface

I'd like to use gluon to create, list, delete, and update a view or materialized view.

After upgrading to 0.1.5, my previous script can't run and gets an 'api_key_id' exception

s.create() exception, error = 'api_key_id'

My code that throws the exception:

import sys

from timeplus import Env, Stream, StreamColumn  # import path assumed

def create_stream(stream_name, tp_schema, tp_host, tp_port, client_id, client_secret):
    env = (
        Env()
        .schema(tp_schema).host(tp_host).port(tp_port)
        #.login(client_id=client_id, client_secret=client_secret)
    )
    Env.setCurrent(env)

    try:
        s = (
            Stream()
            .name(stream_name)
            .column(StreamColumn().name("id").type("string"))
            .column(StreamColumn().name("created_at").type("datetime"))
            .column(StreamColumn().name("actor").type("string"))
            .column(StreamColumn().name("type").type("string"))
            .column(StreamColumn().name("repo").type("string"))
            .column(StreamColumn().name("payload").type("string"))
            .ttl_expression("created_at + INTERVAL 2 WEEK")
        )
        if s.get() is None:
            s.create()
            print(f"Created a new stream {s.name()}")
        return stream_name
    except Exception as e:
        print(f"s.create() exception, error = {e}")
        sys.exit(f"Failed to list or create data streams from {tp_schema}://{tp_host}:{tp_port}. Please make sure you are connecting to the right server.")

.neutron.yaml:

enable-authentication: false

# uncomment/enable the following setting for playground
#frontend-applications: playground
#proton-username: demo
#proton-password: demo@t+
#proton-query-black-list:
#  - insert
#  - drop
#  - system.
#  - create
#proton-query-timeout: 300

# except playground, other deployments need those auth0 settings
auth0-domain: timeplus.us.auth0.com
auth0-callback-url: https://FIXME/callback
auth0-client-id:
auth0-client-secret:
auth0-audience: https://FIXME.beta.timeplus.com/api/v1beta1
auth0-api-client-id:
auth0-api-client-secret:

I checked neutron.log with tail -f; no error or new log line is emitted when the 'api_key_id' exception is caught.

cannot insert data

DDL

CREATE STREAM github_events(
    id string,
    created_at string,
    type string,
    repo string,
    _tp_time datetime64 default to_datetime(created_at)
)

Python

s.insert([
    ["id", e.id], ["created_at", e.created_at.isoformat()], ["type", e.type], ["repo", e.repo.name]  # ,["payload",json.dumps(e.payload)]
])

Error
post http://localhost:8000/api/v1beta1/streams/github_events/ingest
insert {'columns': ['id', 'created_at', 'type', 'repo'], 'data': [['id', '20839378587'], ['created_at', '2022-03-20T18:07:17'], ['type', 'PushEvent'], ['repo', 'b-tekinli/b-tekinli']]}
failed to insert 500 {"code":0,"message":"failed to ingest data to stream github_events: request failed with status code 400, response body {"code":27,"error_msg":"Code: 27. DB::ParsingException: Cannot parse input: expected ',' before: '],[\"created_at\",\"2022-03-20T18:07:17\"],[\"type\",\"PushEvent\"],[\"repo\",\"b-tekinli\/b-tekinli\"]]': \nRow 1:\nColumn 0, name: id, type: string, parsed text: \u003cEMPTY\u003eERROR\nCode: 26. DB::ParsingException: Cannot parse JSON string: expected opening quote: (while reading the value of key id). (CANNOT_PARSE_QUOTED_STRING) (version 1.0.20)\n\n: While executing ParallelParsingBlockInputFormat: (at row 1)\n. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 1.0.20)","request_id":""}\n"}

resource path for swagger generated code is wrong

The path-parameter replacement does not work, so the client calls '/v1beta2/streams/:name' literally to get a stream.
This issue impacts both the stream and view objects.

The current workaround is to list all objects and then filter, which works; see the sketch below.
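
A sketch of the list-and-filter workaround; the list() method and name() getter are assumptions based on the fluent API used elsewhere in these issues:

from timeplus import Stream  # import path assumed

def get_stream_by_name(name):
    # Avoid the broken '/v1beta2/streams/:name' path by listing
    # every stream and filtering client-side.
    for s in Stream().list():  # list() is an assumption
        if s.name() == name:
            return s
    return None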

improve the query performance

In my test with car_live_data, the console query reaches about 700 eps, while the Python SDK only achieves 7-10 eps, roughly 100 times slower.

We need to dig into the performance bottleneck here.

don't refresh the token if the initial token is invalid

In 0.1.1 gluon, for any HTTP request, if the token is invalid, the SDK will try to refresh the token, up to 24 times.
However, if the very first token is invalid, we can skip the token refresh and show the user an error that the token is not valid.

The main use case is a user with a valid token who needs to run a script for a long time. We should auto-refresh the token so the user doesn't need to stop the script and rerun it. But if the token is invalid from the beginning, this should be treated as a misconfiguration. A sketch of this behavior follows.
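
A sketch of the proposed behavior: fail fast when the initial token never worked, and refresh only after a previously valid token expires (the class and method names are hypothetical):

class TokenManager:
    def __init__(self, token, refresh_fn, max_refreshes=24):
        self._token = token
        self._refresh_fn = refresh_fn
        self._refreshes_left = max_refreshes
        self._ever_valid = False  # flipped once any request succeeds

    def on_success(self):
        self._ever_valid = True

    def on_unauthorized(self):
        if not self._ever_valid:
            # The initial token never worked: misconfiguration, don't retry.
            raise ValueError("the token is not valid")
        if self._refreshes_left <= 0:
            raise RuntimeError("token refresh limit exceeded")
        self._refreshes_left -= 1
        self._token = self._refresh_fn()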
