crate / cratedb-examples Goto Github PK

A collection of clear and concise examples how to work with CrateDB.

License: Apache License 2.0

Makefile 0.01% C# 0.24% Java 2.46% Shell 0.23% Ruby 0.05% PHP 0.19% Python 2.06% Jupyter Notebook 94.74% Dockerfile 0.02%

cratedb cratedb-driver cratedb-examples database database-example database-testing databases educational rdbms sql-database

cratedb-examples's Introduction

CrateDB Examples

✨ A collection of clear and concise examples how to work with CrateDB. ✨

🔗 Quick links: Application • Dataframe • Language • Testing • Topic

📖 More information: Drivers and Integrations • Integration Tutorials • Reference Documentation

👨‍💻 Usage

You can explore the content by browsing folders within the repository. Main sections can be explored by using the quick links in the header area.
If you are looking for something specific, please use GitHub search, for example, searching for "jdbc".
You can use the code snippets for educational and knowledge base purposes, or as blueprints within your own projects.
The repository is also used to support QA processes. Each example is designed to be invoked as an integration test case, accompanied by a corresponding CI validation job.

🧐 What's inside

This section gives you an overview about what's inside the relevant folders.

by-dataframe contains example code snippets how to work with dataframe libraries like pandas, Polars, Dask, Spark, and friends.
by-language contains demo programs / technical investigations outlining how to get started quickly with CrateDB using different programming languages and frameworks.
application contains integration scenarios with full-fledged applications and software frameworks.
testing contains reference implementations about how to use different kinds of test layers for testing your applications with CrateDB.
topic mostly contains Jupyter Notebooks outlining different use cases around working with time-series data, and demonstrating machine learning technologies together with CrateDB.

✅ CI Status

Please visit the Build Status page to inspect the build status of relevant drivers, applications, and integrations for CrateDB, on one page.

🏕️ Testing

In the same way as on CI, you can invoke the example programs easily on your workstation, in order to quickly get started on behalf of working example code, or to verify connectivity within your computing environment.

Prerequisites

For invoking the software integration tests, you will need installations of Docker, Python, and Git on your workstation.

Before running the tests, make sure to supply an instance of CrateDB. In order to use and verify the most recent available code, let's select the OCI image crate/crate:nightly.

docker run --rm -it --pull=always \
    --name=cratedb --publish=4200:4200 --publish=5432:5432 \
    --env=CRATE_HEAP_SIZE=4g \
    crate/crate:nightly -Cdiscovery.type=single-node

Test Runner `ngr`

The repository uses a universal test runner to invoke test suites of different languages and environments, called ngr.

In order to run specific sets of test cases, you do not need to leave the top-level directory, or run any kind of environment setup procedures. If all goes well, just select one of the folders of interest, and invoke ngr test on it, like that:

ngr test by-language/java-jdbc
ngr test by-language/python-sqlalchemy
ngr test by-language/php-amphp
ngr test by-dataframe/dask
ngr test application/apache-superset
ngr test testing/testcontainers/java
ngr test topic/machine-learning/llm-langchain

Note

It is recommended to invoke ngr from within a Python virtualenv, in order to isolate its installation from the system Python. Installing ngr works like this:

git clone https://github.com/crate/cratedb-examples
cd cratedb-examples
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Test Matrix Support

Some examples optionally obtain parameters on invocation time.

One example is the test suite for Npgsql, which accepts the version number of the Npgsql driver release to be obtained from the environment at runtime, overriding any internally specified versions. Example:

ngr test by-language/csharp-npgsql --npgsql-version=6.0.9

Tip

This feature is handy if you are running a test matrix, which is responsible for driving the version numbers, instead of using the version numbers nailed within local specification files of any sort.

💁 Contributing

Interested in contributing to this project? Thanks so much for your interest!

As an open-source project, we are always looking for improvements in form of contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

Your bug reports, feature requests, and patches are greatly appreciated.

🌟 Contributors

cratedb-examples's People

Contributors

Stargazers

Watchers

Forkers

proddata liyang-love mackerl surister harshil1712 potgie bvy007

cratedb-examples's Issues

[Testing] Demonstrate CrateDB test layers with parameterization

About

GH-280 only demonstrates basic usage of CrateDB test layer variants. On a subsequent iteration, we may want to demonstrate how to parameterize them.

Requirements

The minimum set of parameters needed to cover common use cases in software testing.

Software version of CrateDB (GA release, testing, nightly).
TCP ports (HTTP and PG) CrateDB should be listening on.
Heap size used by CrateDB, CRATE_HEAP_SIZE.
Path to its data directory when aiming to expose it.

As an outlook...

cr8's test layer is also capable of running clusters of multiple nodes, right? That is probably happening somewhere in crate-qa already? It should also be demonstrated in this context here.

References

Those are pointers to where parameterization is used, and not demonstrated here, yet.

native/pytest: CrateLayer accepts settings, see pytest_crate/plugin.py.
native/unittest: Dito, see crash::tests/test_integration.py.
testcontainers/pytest: @pilosus added parameterization capabilities to cratedb-toolkit's CrateDBTestAdapter: crate/crash#408. Thanks.
testcontainers/unittest: Dito.

Problem invoking Docker Compose on tutorial about Apache Kafka, Apache Flink and CrateDB

Hi there,

at [1], @jainhemant163 shared with us that he isn't able to invoke the docker-compose.yml file, neither on his workstation nor on AWS EC2 instances. The invocation croaks like:

ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services: 'kafka-zookeeper'
Unsupported config option for networks: 'scada-demo'

With kind regards,
Andreas.

[1] https://dev.to/jainhemant163/comment/1eckd

Issue with time series forecasting with pycaret notebook

When running automl_timeseries_forecasting_with_pycaret.ipynb notebook the mlflow-cratedb module gets installed but it is not found when importing:

ModuleNotFoundError                       Traceback (most recent call last)
[<ipython-input-3-c9198aa96905>](https://localhost:8080/#) in <cell line: 6>()
      4 import plotly
      5 import plotly.graph_objects as go
----> 6 import mlflow_cratedb  # Required to enable the CrateDB MLflow adapter.
      7 from dotenv import load_dotenv
      8 

ModuleNotFoundError: No module named 'mlflow_cratedb'

Can you check/test the code? The issue persists both locally and in Colab.

RAG: Problems resolving dependencies on Google Colab

Problem

@hammerhead reported a flaw with the cratedb_rag_customer_support_langchain.ipynb Notebook when invoked on Google Colab.

Dependency resolution around Dask fails, bzw. takes ages to complete, if at all.

Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.5.2-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 65.4 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.5.1-py3-none-any.whl (871 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 871.6/871.6 kB 54.8 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.5.1-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 57.8 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.5.0-py3-none-any.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/856.7 kB 60.6 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.5.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 59.9 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.4.2-py3-none-any.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/856.7 kB 61.6 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.4.2-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 61.3 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.4.1-py3-none-any.whl (855 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 855.5/855.5 kB 59.3 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.4.1-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 60.5 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.4.0-py3-none-any.whl (853 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 853.8/853.8 kB 52.3 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.4.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 56.7 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.3.0-py3-none-any.whl (851 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 851.2/851.2 kB 54.7 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.3.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 60.3 MB/s eta 0:00:00
Requirement already satisfied: httplib2>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from oauth2client>=1.5.2->gcsfs->fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9)) (0.22.0)
  Downloading dask-2023.8.1-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 7.8 MB/s eta 0:00:00
INFO: pip is looking at multiple versions of distributed to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of distributed to determine which version is compatible with other requirements. This could take a while.
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
ERROR: Operation cancelled by user

Thoughts

It looks like it is clearly related to the Python 3.11.9 vs. Dask hiccup from last week.

References

Maybe related; I will execute this first; maybe, it will yield some insights.

GH-422

@hammerhead also provided a fix already.

GH-423

Dependency installation fails on Google Colab for `cratedb-vectorstore-rag-openai-sql.ipynb`

Steps to reproduce:

Go to the README file of the folder for the RAG pipeline notebook: https://github.com/crate/cratedb-examples/blob/main/topic/machine-learning/llm-langchain/README.md
Next to cratedb-vectorstore-rag-openai-sql.ipynb, click the Open in Colab button
Uncomment the remote requirements.txt installation and run it: !pip install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt

After some time, it fails with:

INFO: pip is looking at multiple versions of langchain[cratedb,openai] to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 4), crate[sqlalchemy] and langchain[cratedb,openai]==0.1.4 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested crate[sqlalchemy]
    cratedb-toolkit 0.0.3 depends on crate[sqlalchemy]>=0.34
    langchain[cratedb,openai] 0.1.4 depends on crate[sqlalchemy]<0.35.0 and >=0.34.0; extra == "cratedb"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

I could so far not reproduce the issue locally with a Python 3.10 environment.

LangChain: `cratedb_rag_customer_support.ipynb` trips with `CellTimeoutError`

Problem

cratedb_rag_customer_support.ipynb has been introduced just recently.

It looks like the call to embeddings.embed_documents(pages_text) might take longer than expected / uses more compute resources / stalls for any other reasons?

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 120 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           embeddings = OpenAIEmbeddings(deployment='my-embedding-model', chunk_size=1)
E           pages_embeddings = embeddings.embed_documents(pages_text)
E           -------------------

/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError
------------------------------ Captured log call -------------------------------
ERROR    pytest_notebook.execution:client.py:795 Timeout waiting for execute reply (120s).

-- https://github.com/crate/cratedb-examples/actions/runs/7881120885/job/21504241805#step:6:848

Q & A

Can you dig a bit into this, @marijaselakovic? Do you have any idea where this may be coming from, or how it can be improved?

NB: It's not a unique thing. We are also taking care about the same details at GH-170 and GH-299.

Apache Kafka Streaming with PyFlink

About

apache-kafka-flink-streaming has a nice new example contributed by @surister. Thanks!

Backlog

Add a GHA workflow recipe, and add the outcome to the build status page.
Add an item to the integration tutorials section about Apache Flink.

/cc @matkuliak

AutoML: CI trips with `ValueError: Input contains NaN.`

Originally coming from an issue that mixed things up, GH-170, let's get things straight here.

Problem

CI on the AutoML job occasionally trips like this, failing the CI run.

FAILED test.py::test_file[automl_timeseries_forecasting_with_pycaret.py] - ValueError: Input contains NaN.

self = <joblib.parallel.BatchCompletionCallBack object at 0x7f4f737cb910>

    def _return_or_raise(self):
        try:
            if self.status == TASK_ERROR:
>               raise self._result
E               ValueError: Input contains NaN.

-- https://github.com/crate/cratedb-examples/actions/runs/7884792002/job/21514554253#step:6:1146

Outlook

@andnig shared his suggestions at #170 (comment) already. Maybe you can add them here instead?

CI job trips on Apache Superset 2.x: `AttributeError: module 'flask.json' has no attribute 'JSONEncoder'`

Problem

Invoking superset db upgrade fails.

-- https://github.com/crate/cratedb-examples/actions/runs/8005637022/job/21865556761#step:7:290

History

It failed for the first time on Tue, 20 Feb 2024 03:06:45 GMT. Before, it worked well, and nothing was changed on our ends.

-- https://github.com/crate/cratedb-examples/actions/workflows/application-apache-superset.yml

`Received unexpected backend message of type ParseComplete`: `by-language/csharp-npgsql` starts failing with CrateDB 4.8.4

About

At GH-14, we discovered that the csharp-npgsql program would fail its test suite. While it still worked with CrateDB 4.8.3, it starts croaking with CrateDB 4.8.4.

Exception

The exception can be reproduced by running those commands:

docker run -it --rm --publish=4200:4200 --publish=5432:5432 crate:4.8.4
dotnet test --framework=net5.0

[xUnit.net 00:00:00.93]     demo.tests.DemoProgramTest.TestSystemQueryExample [FAIL]
[xUnit.net 00:00:00.95]     demo.tests.DemoProgramTest.TestBasicConversationExample [FAIL]
[xUnit.net 00:00:00.96]     demo.tests.DemoProgramTest.TestAsyncUnnestExample [FAIL]
  Failed demo.tests.DemoProgramTest.TestSystemQueryExample [1 ms]
  Error Message:
   System.AggregateException : One or more errors occurred. (Received unexpected backend message of type ParseComplete) (The following constructor parameters did not have matching fixture data: DatabaseFixture fixture)
---- System.Exception : Received unexpected backend message of type ParseComplete
---- The following constructor parameters did not have matching fixture data: DatabaseFixture fixture
  Stack Trace:

----- Inner Stack Trace #1 (System.Exception) -----
   at Npgsql.NpgsqlDataReader.ProcessMessage(IBackendMessage msg)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior)
   at Npgsql.PostgresDatabaseInfo.LoadBackendTypes(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async)
   at Npgsql.PostgresDatabaseInfo.LoadPostgresInfo(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async)
   at Npgsql.PostgresDatabaseInfoFactory.Load(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async)
   at Npgsql.NpgsqlDatabaseInfo.Load(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async)
   at Npgsql.NpgsqlConnector.LoadDatabaseInfo(Boolean forceReload, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.<>c__DisplayClass38_0.<<Rent>g__RentAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.<>c__DisplayClass41_0.<<Open>g__OpenAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.Open()
   at demo.tests.DatabaseFixture..ctor() in /Users/amo/dev/crate/docs/cratedb-examples/by-language/csharp-npgsql/tests/DemoProgramTest.cs:line 22
----- Inner Stack Trace #2 (Xunit.Sdk.TestClassException) -----

Screenshot

-- https://github.com/crate/cratedb-examples/actions/runs/4009337440

Improve tutorials about Apache Superset

Carried over from #217.

Backlog

The tutorial Set up an Apache Superset development sandbox with CrateDB should be updated on a few spots: CrateDB version used should be latest, and CSRF_TOKEN as well as HTTP session is no longer needed.
There should also be a separate user-oriented tutorial on the community forum, where all developer-like steps like git clone are omitted. Most probably, a user-focused tutorial should be based on Docker Compose for running both Superset and CrateDB, but also not like the upstream documentation Installing Superset Locally Using Docker Compose is doing it, because it also uses a git clone inside. This new tutorial should be the primary resource to advertise when educating users about this integration, and it should also outline how to connect to CrateDB Cloud.
Monitor and add support for Python 3.12, when available.

LangChain: Resources need upgrades re. `openai>=1`

Problem

# APIRemovedInV1: You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0
openai==0.28

E           APIRemovedInV1: 
E           
E           You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.
E           
E           You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. 
E           
E           Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`
E           
E           A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742

References

https://github.com/crate/cratedb-examples/actions/runs/7072399658/job/19251176606#step:6:916

Workaround

I've chosen to downgrade for now. Can you have a look, @andnig?
-- 04d46e3

CI: Collection of flukes

About

This ticket collects all sorts of flukes and anomalies observed when running validation jobs on CI.

[Java] crate-java-testing

Problem

Both patches reflecting upon software packages which can be used for testing CrateDB did not include presenting the crate-java-testing package for Java. It is well alive.

#54
#280

Solution

Add a relevant snippet to cratedb-examples, and a corresponding documentation section to cratedb-guide, at https://cratedb.com/docs/guide/integrate/testing.html.

Use Python 3.10 on CI, at least for Jupyter Notebooks

About

Because we are aiming to run a signifcant portion of the assets here on Google Colab, most notably the Jupyter Notebooks, we may want to follow their cadence of Python updates.

Currently, Google Colab still seems to be on Python 3.10 ¹, so we may want to adjust the corresponding CI jobs to validate just that, in order to avoid any surprises.

/cc @marijaselakovic, @surister, @ckurze

https://colab.google/articles/py3.10 ↩

Testcontainers for Python

About

Similar to GH-54, we would like to demonstrate CrateDB with testcontainers-python, a »Testcontainers« implementation for Python.
I've started a corresponding implementation on behalf of the LorryStream project the other day, and already reused it at the CrateDB Retention project. It needs to be reviewed and submitted to the upstream repository as a contribution before further elaborating on it.

timeseries-queries-and-visualization.ipynb - Loading initial data doesnt return output

The loading of data doesnt return the output that is described.

`The result contains information about the successfully written rows and potential errors that might have occurred. The output is expected to look roughly like this:

[({'id': '<SOME_ID>', 'name': '<SOME_NAME>'},
'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz', 70000, 0, {} )]
The response indicates that 70,000 records have been successfully loaded and that no errors happened.'

My suggestion would be to fully get rid of this piece of text as this doesn't match the reality.

TSML: Error in `timeseries-anomaly-detection.ipynb`

Problem

The timeseries-anomaly-detection.ipynb notebook errors out, both on Python 3.10 and 3.11 ¹².

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by SimpleImputer.

Observations

Because it happens on both versions of Python, it is most probably unrelated to the change per se where it started tripping.

GH-425

Thoughts

Most probably another dependency flaw?

Testcontainers for Java: Backlog

Hi.

At GH-54, we are bringing in some ready-to-run code examples for basic use of Testcontainers for Java with CrateDB. There are some items which should be addressed within subsequent iterations.

With kind regards,
Andreas.

Startup

Propagating command-line options to CrateDB

Propagating command-line options to CrateDB will become important, for example when aiming to use this in a multi-node scenario like what crate-java-testing provides per CrateTestCluster ¹, right?

[Test worker] WARN tc.crate:5.2 - Reuse was requested but the environment does not support the reuse of containers
To enable reuse of containers, you must set 'testcontainers.reuse.enable=true' in a file located at /Users/amo/.testcontainers.properties

Database provisioning

Other test frameworks

Testcontainers for Java also provides integrations for other test frameworks. Currently, all test cases are based on JUnit 4.

See also https://github.com/crate/crate-jdbc/issues/377. ↩

Time Series: Some notebooks are not compatible with pandas 2.x

Observations

When running a few notebooks on pandas 2.x, errors like those can be observed:

TypeError: Could not convert string 'BerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlinBerlin' to numeric

-- https://github.com/crate/cratedb-examples/actions/runs/8975962618/job/24651748395?pr=430#step:6:1839
-- https://github.com/crate/cratedb-examples/actions/runs/8975962618/job/24651748121?pr=430#step:6:825

References

Evaluations

It looks like it is a data shape error.
Apparently, Google Colab now strictly uses and requires pandas 2.x since 2024-05-13?

Thoughts

It looks like it is a data shape error. Maybe the way the notebooks are working with pandas needs an update when using more recent pandas 2.x? The string repetition flaw reminds me of the famous »Wat« talk by Gary Bernhardt. ;]

Workaround

As a temporary measure, tests stopped including the corresponding notebook. It will get skipped per cfd1a6c, on behalf of the relevant modernization patch.

GH-430

Time Series: Skip testing notebooks not compatible with pandas 2.x

exploratory_data_analysis.ipynb

time-series-decomposition.ipynb

They are not ready for pandas 2.x yet, and block others from being
upgraded.

Originally posted by @amotl in #430 (comment)

ODBC on Windows, pyodbc, and unixODBC

About

Material discovered about ODBC.

Code examples: https://github.com/mikethebeer/crate-pgodbc
Tutorial: https://community.cratedb.com/t/connecting-to-cratedb-with-excel-odbc-driver-on-macos/1373

References

crate/crate-clients-tools#62

Use Markdown or Python for writing Jupyter Notebooks

About

Markdown as the lingua franca for many technical writing tasks should be used more, as it is roughly interoperable with, for example, GitHub and Discourse. ¹

Conversion from Jupyter Notebooks

Just use nbconvert.

pip install nbconvert
jupyter nbconvert --to markdown automl_classification_with_pycaret.ipynb

Authoring Jupyter Notebooks

Instead of converting from, Jupyter Notebooks can be written in Markdown itself, see Notebooks written entirely in Markdown.

The easiest way to create a MyST notebook is to use Jupytext, a tool that allows for two-way conversion between .ipynb and a variety of text files. See also Notebooks as Markdown.

Also with HubSpot, when throwing https://github.com/crate-workbench/hubspot-tech-writing into the mix. ↩

Time Series: Modernize notebooks to use recent versions of pandas and SQLAlchemy

Problem

Even if only just conceived, the new notebooks are immediately outdated already, using legacy technologies like pandas 1.x and SQLAlchemy 1.x.

#379
#383

Solution

Provide updates to use more recent / modern versions of both pandas and SQLAlchemy.

[Tech Writing] CrateDB for Pythonistas with SQLAlchemy

About

The narrative of the article CrateDB for Pythonistas with SQLAlchemy is nice, but the implementation can be improved.

Backlog for Python

About

Coming from a few recent patches, this ticket collects and/or summarizes a few backlog items.

by-language/python-sqlalchemy

Some adjustments may be added to bring all insert_*.py programs into the same shape.
Other than demonstrating only write operations, also demonstrate read operations?
Demonstrate applicability on behalf of relevant Jupyter Notebook(s).

References

GH-64

LLM example `document_loader.ipynb` fails

There is a failure in document_loader.ipynb which fails this PR.

Yeah, I also discovered that at GH-257 yesterday. Something needs a fix, but I have not been able to discover the root cause, yet.

Originally posted by @amotl in #215 (comment)

/cc @marijaselakovic, @ckurze

Issue with requirements.txt

When installing requirements.txt in cratedb-vectorstore-rag-openai-sql.ipynb notebook in Google Colab I got the following ResolutionImpossible Error:

ERROR: Cannot install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 4), crate[sqlalchemy]>=0.34 and langchain[cratedb,openai]==0.1.4 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested crate[sqlalchemy]>=0.34
    cratedb-toolkit 0.0.5 depends on crate[sqlalchemy]>=0.34
    langchain[cratedb,openai] 0.1.4 depends on crate[sqlalchemy]<0.35.0 and >=0.34.0; extra == "cratedb"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

@amotl is it possible for you to check this and loose the range of packages?

Time Series: Incompatibility between Google Colab and pandas 2

When running timeseries-queries-and-visualization.ipynb in Colab you get the following error when installing the needed pip packages.

!pip install --upgrade kaleido 'pandas>=2' plotly sqlalchemy-cratedb
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 2.2.2 which is incompatible.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.2 which is incompatible.

This can be fixed by running:

!pip install --upgrade kaleido 'pandas==2.0.3' plotly sqlalchemy-cratedb

Evaluate mainlining of `cratedb_toolkit.sqlalchemy.patch.patch_inspector`

About

@ckurze and @amotl discovered a case where the flaw is reproducible, that SQLAlchemy's introspection/reflection machinery is not able to pick up the schema name correctly.

Details

832c3bc fixes it. Indeed, we apparently need a runtime fix here, when using a non-standard schema (doc vs. testdrive).

# TODO: Bring this into the `crate-python` driver.
from cratedb_toolkit.sqlalchemy.patch import patch_inspector
patch_inspector()

Originally posted by @amotl in #136 (comment)

ML/AutoML: `ModuleNotFoundError: No module named 'crate.client.sqlalchemy'`

2024/06/18 12:28:28 INFO mlflow: Amalgamating MLflow for CrateDB
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-c9198aa96905> in <cell line: 6>()
      4 import plotly
      5 import plotly.graph_objects as go
----> 6 import mlflow_cratedb  # Required to enable the CrateDB MLflow adapter.
      7 from dotenv import load_dotenv
      8 
3 frames
/usr/local/lib/python3.10/dist-packages/cratedb_toolkit/sqlalchemy/patch.py in patch_inspector()
     23         return schema_name
     24 
---> 25     from crate.client.sqlalchemy.dialect import CrateDialect
     26 
     27     get_table_names_dist = CrateDialect.get_table_names

ModuleNotFoundError: No module named 'crate.client.sqlalchemy'
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

AutoML: Test harness trips when installing `catboost` on macOS with Python 3.11, works with Python 3.10

About

When invoking the test cases on the automl folder,

git clone https://github.com/crate/cratedb-examples
cd cratedb-examples
pip install -r requirements.txt
ngr test topic/machine-learning/automl

the process fails at installation time already.

ERROR: No matching distribution found for catboost<1.2,>=0.23.2; platform_system == "Darwin" and extra == "models"

References

May be relevant.

Add examples on how to use CrateDB with Polars in Write and Read operations.

Main track issue: https://github.com/crate/roadmap/issues/56

AutoML: CI trips with `CellTimeoutError` / `ValueError: Input contains NaN.`

Dear @andnig,

the CI caught an error from automl_timeseries_forecasting_with_pycaret.py ¹.

FAILED test.py::test_file[automl_timeseries_forecasting_with_pycaret.py] - ValueError: Input contains NaN.

Apparently, it started tripping like this only yesterday ², so it is likely the error is related to changed input data.

However, the result of debugging this error may well converge into a corresponding issue at PyCaret, because its promises are so high. On the other hand, the code may just need a particular data cleansing step, to accomodate the situation. May I ask you to have a look?

With kind regards,
Andreas.

AutoML: CI trips with `CellTimeoutError`

Originally coming from an issue that mixed things up, GH-170, let's get things straight here.

Problem

CI on the AutoML job occasionally trips like this, failing the CI run.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- #170 (comment)

Outlook

@andnig suggested at #170 (comment), that maybe the PYTEST_CURRENT_TEST environment variable, and what it is guarding, is not being evaluated correctly.

However, at #170 (comment), we have been able to confirm it works well.

ML/LLM: `cratedb_rag_customer_support_langchain.ipynb` croaks with `ImportError: cannot import name 'patch_inspector' from 'cratedb_toolkit.sqlalchemy.patch'`

@wierdvanderhaar reported another problem. Thank you.

WARNING: langchain 0.2.3 does not provide the extra 'cratedb'
WARNING: langchain 0.2.3 does not provide the extra 'openai'

ImportError                               Traceback (most recent call last)
<ipython-input-2-451439e6f556> in <cell line: 11>()
      9 from langchain_openai import OpenAIEmbeddings
     10 from langchain_community.document_loaders import CSVLoader
---> 11 from langchain_community.vectorstores import CrateDBVectorSearch
     12 
     13 warnings.filterwarnings('ignore')

3 frames
/usr/local/lib/python3.10/dist-packages/langchain_community/vectorstores/cratedb/base.py in <module>
     15 
     16 import sqlalchemy
---> 17 from cratedb_toolkit.sqlalchemy.patch import patch_inspector
     18 from cratedb_toolkit.sqlalchemy.polyfill import (
     19     refresh_table,

ImportError: cannot import name 'patch_inspector' from 'cratedb_toolkit.sqlalchemy.patch' (/usr/local/lib/python3.10/dist-packages/cratedb_toolkit/sqlalchemy/patch.py)

LangChain: FileNotFoundError: [Errno 2] No such file or directory: 'mlb_teams_2012.sql'

About

Nightly scheduled tests tripped here.

FAILED test.py::test_file[document_loader.py] - FileNotFoundError: [Errno 2] No such file or directory: 'mlb_teams_2012.sql'

-- https://github.com/crate/cratedb-examples/actions/runs/6938705593/job/18874863128#step:6:697

And there.

FAILED test.py::test_notebook[document_loader.ipynb] - Failed: Direct construction of pytest_notebook.plugin.JupyterNbCollector has been deprecated, please use pytest_notebook.plugin.JupyterNbCollector.from_parent.

-- https://github.com/crate/cratedb-examples/actions/runs/6938705593/job/18874863128#step:6:689

[ML]: `automl_timeseries_forecasting_with_pycaret.{ipynb,py}` are defunct

Problem

Requests to those resources yield HTTP 500 Internal Server Error responses, observed through the Build Status page.

Failing test runs

References

crate/mlflow-cratedb#149

/cc @hammerhead, @surister, @ckurze

CI: Npgsql test matrix is incorrect

Problem

It is unexpected that GH-169 is green, because Npgsql 8.0 does not support .NET 6 any longer. So, why doesn't it fail?

Observation

This test matrix slot, which should invoke .NET 5.0.x, apparently also uses .NET 8, which is wrong.

dotnet test --framework=net8.0 --collect:"XPlat Code Coverage"

-- https://github.com/crate/cratedb-examples/actions/runs/7033351740/job/19138983476?pr=169#step:7:136

Conclusion

Test matrix slot value propagation is flawed somewhere and needs to be fixed.

.NET: Test methods should not use blocking task operations

There are a few admonitions on PRs related to .NET/C#.

Test methods should not use blocking task operations, as they can cause deadlocks. Use an async test method and await instead. (https://xunit.net/xunit.analyzers/rules/xUnit1031)

-- https://github.com/crate/cratedb-examples/pull/111/files, see section "Unchanged files with check annotations".

Check tutorial about Kafka, Flink and CrateDB with the vanilla PostgreSQL JDBC Driver

Hi there,

because @proddata just asked about the state of the CrateDB JDBC Driver, I would like to put down that note here.

When refreshing the resources [1,2] the other day, building upon [3] by @kovrus, @carlotas19 converged the resources into [4] (cheers!). While supporting that, I also created some other accompanying resources at [5], which add some infrastructure and documentation to run the example in a reproducible manner out of the box.

My memories about the details are a bit faded, but I remember that the example did not work with the vanilla PostgreSQL JDBC Driver. Apparently, I already took a little note about it at the place where you would be able to switch the driver ¹, alongside ²:

Currently, org.postgresql:postgresql croaks with
org.postgresql.util.PSQLException: No hstore extension installed.

We should get back to this and use latest software versions of the corresponding components when testing again, this time specifically focused on shedding some more light onto the problem discovered here.

With kind regards,
Andreas.

/cc @hammerhead

[1] https://www.ververica.com/blog/smart-systems-iot-use-case-open-source-kafka-flink-cratedb
[2] https://crate.io/resources/white-papers/lp-wp-flink-kafka-cratedb
[3] https://github.com/crate/cratedb-flink-jobs
[4] https://dev.to/crate/build-a-data-ingestion-pipeline-using-kafka-flink-and-cratedb-1h5o
[5] https://github.com/crate/cratedb-examples/tree/main/spikes/kafka-flink

crate / cratedb-examples Goto Github PK

cratedb-examples's Introduction

CrateDB Examples

👨‍💻 Usage

🧐 What's inside

✅ CI Status

🏕️ Testing

Prerequisites

Test Runner ngr

Test Matrix Support

💁 Contributing

🌟 Contributors

cratedb-examples's People

Contributors

Stargazers

Watchers

Forkers

cratedb-examples's Issues

About

Requirements

References

Problem

Thoughts

References

Problem

Q & A

About

Backlog

Problem

Outlook

Problem

History

About

Exception

Screenshot

Backlog

Problem

References

Workaround

About

Problem

Solution

About

Footnotes

About

Problem

Observations

Thoughts

Footnotes

Startup

Database provisioning

Other test frameworks

Footnotes

Observations

References

Evaluations

Thoughts

Workaround

About

References

About

Conversion from Jupyter Notebooks

Authoring Jupyter Notebooks

Footnotes

Problem

Solution

About

About

by-language/python-sqlalchemy

References

About

Details

About

References

Footnotes

Problem

Outlook

About

Problem

Failing test runs

Test Runner `ngr`