
gluon's People

Contributors

gangtao, jingyaokang, jovezhong, ye11ow, yokofly, zliang-min


gluon's Issues

HTTP post fails when running multiple processes with Gluon

When running 40 ingestions and 40 queries in parallel, the following errors are reported:

2022-05-21T06:07:08.303417+0000 - ERROR - http post failed HTTPConnectionPool(host='172.31.50.206', port=8000): Max retries exceeded with url: /api/v1beta1/streams/github_6/ingest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e322cf430>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
2022-05-21T06:07:08.306757+0000 - ERROR - http get failed HTTPConnectionPool(host='172.31.50.206', port=8000): Max retries exceeded with url: /api/v1beta1/queries/55ff1c73-47c9-4303-9346-c91f65054d38 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e322d4610>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))
2022-05-21T06:07:08.309252+0000 - ERROR - http get failed HTTPConnectionPool(host='172.31.50.206', port=8000): Max retries exceeded with url: /api/v1beta1/queries/348b2a28-d182-4e6d-a9c1-118d3d59a7f2 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9e322d4340>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))

@zliang-min already did some investigation; see:

https://stackoverflow.com/questions/30943866/requests-cannot-assign-requested-address-out-of-ports

Right, gluon is not using Session; see e.g. https://github.com/timeplus-io/gluon/blob/develop/python/timeplus/env.py#L182

I think it's mostly the ingestion causing the problem in your case. I wonder if calling r.close() explicitly would help. Maybe you can try adding r.close() to the methods in env.py and see if that helps (though I am not sure how this would impact performance, e.g. it might break keep-alive).
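
A minimal sketch of the Session-based alternative, assuming a helper shape similar to the plain requests calls in env.py (the helper name post_json is hypothetical):

import requests

# A shared Session reuses TCP connections through urllib3's pool, so
# parallel ingestion doesn't exhaust the client's ephemeral ports
# (the "Cannot assign requested address" errors above).
_session = requests.Session()

def post_json(url, payload, timeout=10):
    # Hypothetical replacement for the bare requests.post() calls in env.py.
    r = _session.post(url, json=payload, timeout=timeout)
    r.raise_for_status()
    return r.json()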

"failed to recieve data from websocket" in 0.2.1

After I upgraded to 0.2.1 and ran the same Python script, I got many errors:

2022-09-22T01:33:20.515584+0000 - ERROR - failed to recieve data from websocket
error Expecting value: line 1 column 1 (char 0)
2022-09-22T01:33:20.633927+0000 - ERROR - failed to recieve data from websocket
error Expecting value: line 1 column 1 (char 0)
2022-09-22T01:33:20.755501+0000 - ERROR - failed to recieve data from websocket
error Expecting value: line 1 column 1 (char 0)

stream auto_create doesn't work when creating a source

  # import path assumed; adjust to your SDK version
  from timeplus import KafkaSource, KafkaProperties, SourceConnection

  source = (
      KafkaSource()
      .name(name)
      .properties(
          KafkaProperties()
          .topic("github_events")
          .brokers(BROKER)
          .sasl("plain")
          .username(KAFKA_API_KEY)
          .password(KAFKA_SECRET_KEY)
      )
  )

  source.connection(SourceConnection().stream("kafka").auto_create(True))
  source.create().start()

The source is created and started, but no stream is created, and there is no Neutron error.
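
A possible workaround until auto_create is fixed: create the target stream explicitly before starting the source, using the Stream API from the other issues here (the import path and column schema are assumptions):

from timeplus import Stream, StreamColumn  # import path assumed

# The stream name must match SourceConnection().stream("kafka");
# the single raw string column is a hypothetical schema.
s = (
    Stream()
    .name("kafka")
    .column(StreamColumn().name("raw").type("string"))
)
if s.get() is None:
    s.create()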

cannot insert data with _tp_time

I created a stream with a single cnt column (actually the stream was created automatically by a Timeplus Sink).

If I use the Python SDK to insert data, I cannot set the _tp_time column; otherwise the insert fails:

s = (
    Stream().name("a")
    .column(StreamColumn().name("_tp_time").type("datetime64(3, 'UTC')"))
    .column(StreamColumn().name("cnt").type("uint64"))
)
s.insert([["2022-09-09T09:00:00",9595]])

Error

timeplus.error.TimeplusAPIError: http method post, response code 500, failed to send http post due to {"code":500,"message":"failed to ingest data to stream a: code: 27, message: Code: 27. DB::ParsingException: Cannot parse input: expected '"' before: '-09-09T09:00:00",9595]]': (while reading the value of key cnt): \nRow 1:\nColumn 0, name: cnt, type: uint64, ERROR: text "[\u003cDOUBLE QUOTE\u003e2022-09-" is not like uint64\n\n: (at row 1)\n. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 1.0.82)"}

If I remove the _tp_time column, the data can be inserted:

s = (
    Stream().name("a")
    .column(StreamColumn().name("cnt").type("uint64"))
)
s.insert([[950]])

cannot insert data if the column names are not set, but it still shows `insert success`

In my case, the stream was created with SQL, not via the Python code:

s = (
    Stream()
    .name("github_events")
)

When I insert data, the console output is:

post http://localhost:8000/api/v1beta1/streams/github_events/ingest
insert {'columns': [], 'data': [['id', '20835192348'], ['created_at', '2022-03-20T04:03:40'], ['type', 'CreateEvent'], ['repo', 'Bobsang132/GIT-BASH'], ['payload', {'ref': 'main', 'ref_type': 'branch', 'master_branch': 'main', 'description': None, 'pusher_type': 'user'}]]}
insert success

But actually no data is inserted. It seems I need to declare the columns even when the stream was created elsewhere. This is counter-intuitive, since the column names are already passed to the insert method. Re-declaring the columns locally works around it, as in the sketch below.
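
A workaround sketch that re-declares the columns so the SDK sends a non-empty 'columns' list in the ingest payload; the import path and the column types are assumptions about the SQL-created schema:

from timeplus import Stream, StreamColumn  # import path assumed

s = (
    Stream()
    .name("github_events")
    # declared locally only so the ingest payload carries column names
    .column(StreamColumn().name("id").type("string"))
    .column(StreamColumn().name("created_at").type("datetime"))
    .column(StreamColumn().name("type").type("string"))
    .column(StreamColumn().name("repo").type("string"))
    .column(StreamColumn().name("payload").type("string"))
)
# row format assumed positional (matching the declared column order),
# as in the _tp_time issue above
s.insert([["20835192348", "2022-03-20T04:03:40", "CreateEvent",
           "Bobsang132/GIT-BASH", "{}"]])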

cannot insert datetime directly

I got a Python datetime object and wanted to insert data via:

s.insert([ ["id",e.id],["created_at",e.created_at],..)

I got this error message:

failed to insert Object of type datetime is not JSON serializable

The workaround is to call e.created_at.isoformat().

Ideally the Python SDK would accept native datetime objects; one possible approach is sketched below.
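
A minimal sketch of how the SDK could serialize datetime values automatically, using the default hook of json.dumps:

import json
from datetime import date, datetime

def json_default(value):
    # Fall back to ISO 8601 for datetime/date; anything else still raises.
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")

payload = [["created_at", datetime(2022, 3, 20, 4, 3, 40)]]
print(json.dumps(payload, default=json_default))
# [["created_at", "2022-03-20T04:03:40"]]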

build issue

(gluon) jameshao@JamesdeMacBook-Pro python % python3 -m build

  • Creating venv isolated environment...
  • Installing packages in isolated environment... (setuptools>=42)
  • Getting dependencies for sdist...
    running egg_info
    creating src/timeplus.egg-info
    writing src/timeplus.egg-info/PKG-INFO
    writing dependency_links to src/timeplus.egg-info/dependency_links.txt
    writing requirements to src/timeplus.egg-info/requires.txt
    writing top-level names to src/timeplus.egg-info/top_level.txt
    writing manifest file 'src/timeplus.egg-info/SOURCES.txt'
    error: package directory 'src/tests' does not exist

solution app resource loader

For a solution, the user might create:

  1. sources
  2. stream
  3. view
  4. query
  5. sinks

The solution app resource loader is an application that will create these resources from a configuration file (JSON, YAML, or conf); a loader sketch follows.
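
A minimal sketch of such a loader for the stream portion, assuming a YAML file with one top-level key per resource type; the config layout and import path are assumptions:

import yaml  # requires PyYAML

from timeplus import Stream, StreamColumn  # import path assumed

def load_resources(path):
    # Create resources in dependency order:
    # sources -> streams -> views -> queries -> sinks.
    with open(path) as f:
        config = yaml.safe_load(f)
    for spec in config.get("streams", []):
        s = Stream().name(spec["name"])
        for col in spec.get("columns", []):
            s = s.column(StreamColumn().name(col["name"]).type(col["type"]))
        s.create()
    # sources, views, queries, and sinks would follow the same pattern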

Customer Feedback

json stream ingest does not work

swagger_client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 22 Jun 2023 20:11:26 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '111', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token, Authorization, accept, origin, Cache-Control, X-Requested-With', 'Access-Control-Allow-Methods': 'POST, HEAD,PATCH, DELETE, OPTIONS, GET, PUT', 'Access-Control-Allow-Origin': '*', 'X-Request-Id': 'H0eB3RGybTYSUHAuH1TP4'})
HTTP response body: b'{"code":400,"message":"bad event: json: cannot unmarshal string into Go value of type map[string]interface {}"}'

Restart the query if neutron (the API server) is restarted

Today, if neutron gets restarted, the streaming SQL will be cancelled.
For the Python SDK client, it would be great if the query could be resubmitted, or the client could decide how to resubmit.
Today, when the client uses the Python SDK to run a streaming query and neutron gets restarted, the query just hangs. A client-side resubmission sketch follows.
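
A client-side sketch that resubmits a streaming query when the connection drops; run_streaming_query is a hypothetical stand-in for the SDK's query API:

import time

def run_with_resubmit(sql, run_streaming_query, max_retries=5):
    # Resubmit the streaming query if neutron restarts and the
    # underlying connection drops; back off between attempts.
    for attempt in range(max_retries):
        try:
            for row in run_streaming_query(sql):
                yield row
            return  # the query finished normally
        except ConnectionError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"query failed after {max_retries} resubmissions")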

able to control logging

Today, the Python SDK prints many log messages with the print method.
This is overwhelming when inserting a lot of data; a logging-based sketch follows.
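
A minimal sketch of moving to the standard logging module, so users can tune verbosity; the logger name "timeplus" is an assumption:

import logging

# SDK side: a module-level logger instead of print()
logger = logging.getLogger("timeplus")
logger.debug("insert success")  # was: print("insert success")

# User side: silence the SDK's debug chatter during bulk inserts
logging.getLogger("timeplus").setLevel(logging.WARNING)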

Python SDK users can skip calling `address(..)`; also show a warning for the workspace input

The README of the Python SDK shows Environment().address(api_address).workspace(workspace).apikey(api_key). This is confusing for new users: they don't know what api_address and workspace are.

Most of our users are on https://us.timeplus.cloud, so this should be the default value:

api_key = os.environ.get("TIMEPLUS_API_KEY")      # 60 characters
workspace = os.environ.get("TIMEPLUS_WORKSPACE")  # 8 characters, such as weioctg9, not MyWorkspaceName
env = Environment().workspace(workspace).apikey(api_key)
# if you are not using https://us.timeplus.cloud, call address(..), e.g.
# Environment().address("https://my.timeplus.cloud")

Add a check in workspace() that the input is 8 characters; if not, show an error, e.g. "Did you set the workspace name? Please set the workspace ID (usually 8 characters long)".
Add a check in apikey() that the input is 60 characters; if not, show an error, e.g. "The API key should be 60 characters". A sketch of both checks follows.
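
A sketch of those checks, assuming workspace() and apikey() are fluent setters on Environment; the internal attribute names are placeholders:

def workspace(self, workspace_id):
    # The workspace ID is 8 characters (e.g. weioctg9), not the display name.
    if len(workspace_id) != 8:
        raise ValueError(
            "Did you set the workspace name? "
            "Please set the workspace ID (usually 8 characters long)"
        )
    self._workspace = workspace_id
    return self

def apikey(self, key):
    if len(key) != 60:
        raise ValueError("The API key should be 60 characters")
    self._apikey = key
    return self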

cannot insert json string directly

Somewhat similar to #3.

I got a dict from a Python API and inserted the column via ["jsonColumn", jsonValue].
This generates JSON to the REST API like ['payload', {'push_id': 93968

I got insert success, but the data is actually not inserted.

I guess I need to turn the JSON object into a string, e.g.:

import json

["payload",json.dumps(e.payload)]

publish 1.3.0 official build

The latest version today is 1.3.0b2 ("b" means beta). We should release the official version before Sep 15. Soft GA was early Aug.

query.delete / metadata

  1. After creating a new query and doing nothing, deleting it without calling query.cancel() first causes an error.
  2. A KeyError: 'id' in the metadata if there is no sleep(3) after ingesting data.

A workaround sketch for both points follows.
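
A workaround sketch, using the cancel() call mentioned above; 'query' is an existing query object from the SDK and the metadata accessor name is hypothetical:

import time

# 1. Cancel before deleting, even if the query never ran.
query.cancel()
query.delete()

# 2. Give ingestion a moment to land before reading metadata,
#    otherwise the lookup raises KeyError: 'id'.
time.sleep(3)
metadata = query.metadata()  # hypothetical accessor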

view interface

I'd like to use gluon to create, list, delete, and update a view or materialized view.

After upgrading to 0.1.5, my previous script can't run and gets an 'api_key_id' exception

s.create() exception, error = 'api_key_id'

My code that throws the exception:

import sys

from timeplus import Env, Stream, StreamColumn  # import path assumed

def create_stream(stream_name, tp_schema, tp_host, tp_port, client_id, client_secret):
    env = (
        Env()
        .schema(tp_schema).host(tp_host).port(tp_port)
        #.login(client_id=client_id, client_secret=client_secret)
    )
    Env.setCurrent(env)

    try:
        s = (
            Stream()
            .name(stream_name)
            .column(StreamColumn().name("id").type("string"))
            .column(StreamColumn().name("created_at").type("datetime"))
            .column(StreamColumn().name("actor").type("string"))
            .column(StreamColumn().name("type").type("string"))
            .column(StreamColumn().name("repo").type("string"))
            .column(StreamColumn().name("payload").type("string"))
            .ttl_expression("created_at + INTERVAL 2 WEEK")
        )
        if s.get() is None:
            s.create()
            print(f"Created a new stream {s.name()}")
        return stream_name
    except Exception as e:
        print(f"s.create() exception, error = {e}")
        sys.exit(f"Failed to list or create data streams from {tp_schema}://{tp_host}:{tp_port}. Please make sure you are connecting to the right server.")

.neutron.yaml:

enable-authentication: false

# uncomment/enable the following setting for playground
#frontend-applications: playground
#proton-username: demo
#proton-password: demo@t+
#proton-query-black-list:
#  - insert
#  - drop
#  - system.
#  - create
#proton-query-timeout: 300

# except playground, other deployments need those auth0 settings
auth0-domain: timeplus.us.auth0.com
auth0-callback-url: https://FIXME/callback
auth0-client-id:
auth0-client-secret:
auth0-audience: https://FIXME.beta.timeplus.com/api/v1beta1
auth0-api-client-id:
auth0-api-client-secret:

I checked neutron.log with tail -f; no error or new log line is emitted when the 'api_key_id' exception is caught.

cannot insert data

DDL

CREATE STREAM github_events(
    id string,
    created_at string,
    type string,
    repo string,
    _tp_time datetime64 default to_datetime(created_at)
)

Python

s.insert([
    ["id", e.id], ["created_at", e.created_at.isoformat()], ["type", e.type], ["repo", e.repo.name]  # ,["payload",json.dumps(e.payload)]
])

Error
post http://localhost:8000/api/v1beta1/streams/github_events/ingest
insert {'columns': ['id', 'created_at', 'type', 'repo'], 'data': [['id', '20839378587'], ['created_at', '2022-03-20T18:07:17'], ['type', 'PushEvent'], ['repo', 'b-tekinli/b-tekinli']]}
failed to insert 500 {"code":0,"message":"failed to ingest data to stream github_events: request failed with status code 400, response body {"code":27,"error_msg":"Code: 27. DB::ParsingException: Cannot parse input: expected ',' before: '],[\"created_at\",\"2022-03-20T18:07:17\"],[\"type\",\"PushEvent\"],[\"repo\",\"b-tekinli\/b-tekinli\"]]': \nRow 1:\nColumn 0, name: id, type: string, parsed text: \u003cEMPTY\u003eERROR\nCode: 26. DB::ParsingException: Cannot parse JSON string: expected opening quote: (while reading the value of key id). (CANNOT_PARSE_QUOTED_STRING) (version 1.0.20)\n\n: While executing ParallelParsingBlockInputFormat: (at row 1)\n. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 1.0.20)","request_id":""}\n"}

resource path for swagger generated code is wrong

The path-parameter replacement does not work, so the client calls '/v1beta2/streams/:name' literally to get a stream.
This issue impacts both the stream and view objects.

The current workaround is to list all objects and then filter, which works; see the sketch below.
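
A sketch of the list-and-filter workaround; the list() method and name() getter are assumptions based on the fluent API used elsewhere in these issues:

from timeplus import Stream  # import path assumed

def get_stream_by_name(name):
    # Avoid the broken '/v1beta2/streams/:name' path by listing
    # every stream and filtering client-side.
    for s in Stream().list():  # list() is an assumption
        if s.name() == name:
            return s
    return None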

improve the query performance

In my test with car_live_data, the console query reaches about 700 eps, while the Python SDK only achieves 7-10 eps, roughly 100 times slower.

We need to dig into the performance bottleneck here.

don't refresh the token if the initial token is invalid

In 0.1.1 gluon, for any HTTP request, if the token is invalid, the SDK will try to refresh the token, up to 24 times.
However, if the very first token is invalid, we can skip the token refresh and show the user an error that the token is not valid.

The main use case is a user with a valid token who needs to run a script for a long time. We should auto-refresh the token so the user doesn't need to stop the script and rerun it. But if the token is invalid from the beginning, this should be treated as a misconfiguration. A sketch of this behavior follows.
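
A sketch of the proposed behavior: fail fast when the initial token never worked, and refresh only after a previously valid token expires (the class and method names are hypothetical):

class TokenManager:
    def __init__(self, token, refresh_fn, max_refreshes=24):
        self._token = token
        self._refresh_fn = refresh_fn
        self._refreshes_left = max_refreshes
        self._ever_valid = False  # flipped once any request succeeds

    def on_success(self):
        self._ever_valid = True

    def on_unauthorized(self):
        if not self._ever_valid:
            # The initial token never worked: misconfiguration, don't retry.
            raise ValueError("the token is not valid")
        if self._refreshes_left <= 0:
            raise RuntimeError("token refresh limit exceeded")
        self._refreshes_left -= 1
        self._token = self._refresh_fn()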
