pacman82 / arrow-odbc Goto Github PK

View Code? Open in Web Editor NEW

50.0 50.0 9.0 406 KB

Fill Apache Arrow record batches from an ODBC data source in Rust.

License: MIT License

Rust 100.00%

arrow-odbc's People

Contributors

Stargazers

Watchers

Forkers

icodein bcmcmill jonashaag keraion matthewgapp nobie-org mach-kernel

arrow-odbc's Issues

mysql json type

Display size of column 7: -4 Is this normal? If this is -4, the program will be painc.

[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 1. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'username';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 2. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'fullname';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 3. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'password';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 4. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'email';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 5. Relational type: Varchar { length: 2083 }; Nullability: Nullable; Name: 'profile_img';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 6. Relational type: TinyInt; Nullability: Nullable; Name: 'deleted';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 7. Relational type: Unknown; Nullability: Nullable; Name: 'properties';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 8. Relational type: TinyInt; Nullability: Nullable; Name: 'is_group';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 1: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 2: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 3: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 4: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 5: Varchar { length: 2083 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 7: Unknown
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Display size of column 7: -4
thread 'odbc::odbc_drive::test::test_insert_into' panicked at /Users/simonyi/mosdb/arrow-odbc/src/reader/text.rs:52:14:
called `Result::unwrap()` on an `Err` value: TryFromIntError(())

TooLargeValueForBuffer error despite increasing max_bytes_per_batch and setting with_max_text_size to absurd numbers

Hey, @pacman82! Again, I can't thank you enough for this lib!

Running into an issue where we continue getting the following error despite increasing MAX_TEXT_SIZE to a value greater than 1384. Any help would be much appreciated!

Arrow { source: ExternalError(TooLargeValueForBuffer { indicator: Some(1384), buffer_index: 1 }) }

As additional context, I've tried increasing the MAX_BYTES_PER_BATCH to some absurd numbers but no dice. I've also tried removing with_fallable_allocations and also removing with_max_text_size

Update: the SQLType for the problematic field was coming in as VARCHAR(255). When converted to octets (utf8), it produces an expected length of 1020 (255 * 4). As you can see see in the error message, the actual length of the column was 1384. The source is databricks. So perhaps databricks isn't respecting its own schema? As a workaround, I modified this line to be the max of max_str_len and the len as reported by the type. But this causes issues with my other sources because now the row size is determined as the roughly max_str_len * number of columns which turns into a huge number.

So, to accommodate the messiness of the world I'm working with, I think I'd prefer as a user of this API to be able to provide an expected text size and a max text size. The expected text size would be used to allocate the vets, and the max text size would be used to return an error or panic.

8.3.0: the trait `arrow_odbc::odbc_api::Cursor` is not implemented

The 8.3.0 update is breaking my code with:
the trait arrow_odbc::odbc_api::Cursor is not implemented for odbc_api::CursorImpl<odbc_api::StatementConnection<'static>>

Panic with divide by zero on empty schema

Since release of arrow_odbc 3.0.0 (arrow_odbc_py 2.0.0), when running a multi-statement query that doesn't generate a result set in the second or later statement now fails due to a division by zero error when calculating the buffer_size_in_rows. The first statement returns a result set to allow the Cursor to be created, and the statement handle still has more results; however, the calculation of the buffer size now panics.

import arrow_odbc as ao

reader = ao.read_arrow_batches_from_odbc(
    connection_string="Driver={DB2};"
    "Hostname=host;"
    "Port=50000;"
    "Protocol=TCPIP;"
    "Database=db2db;"
    "UID=user;"
    "PWD=password;",
    query="""
SELECT 1 
FROM SYSIBM.SYSDUMMY1
;

DECLARE GLOBAL TEMPORARY TABLE MY_DGTT AS ( 
    SELECT *, 5 AS FIVE
    FROM SYSIBM.SYSDUMMY1 
) 
WITH DATA 
ON COMMIT PRESERVE ROWS 
NOT LOGGED 
WITH REPLACE
;

SELECT * 
FROM SESSION.MY_DGTT
;
""",
)

while True:
    print("-------------------------")
    print(reader.schema or "No Schema")
    if reader.schema:
        print("-------------------------")
        for batch in reader:
            print(batch)
    if not reader.more_results():
        break

Error Message:

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-odbc-3.0.0/src/reader/odbc_reader.rs:467:30:
attempt to divide by zero
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected Results:

-------------------------
1: int32 not null
-------------------------
pyarrow.RecordBatch
1: int32 not null
----
1: [1]
-------------------------
No Schema
-------------------------
IBMREQD: string not null
FIVE: int32 not null
-------------------------
pyarrow.RecordBatch
IBMREQD: string not null
FIVE: int32 not null
----
IBMREQD: ["Y"]
FIVE: [5]

[Feature Request] Convert Schema string from ODBC to ArrowSchema from pub API

Any plan on api to get tables and schemas from odbc connection?

about ConcurrentOdbcReader send

problem

This is the problem I'm having

error[E0277]: `(dyn arrow_odbc::reader::ReadStrategy + 'static)` cannot be sent between threads safely
   --> src/odbc/odbc_drive.rs:105:25
    |
105 |                 Ok(Some(Box::pin(s)))
    |                         ^^^^^^^^^^^ `(dyn arrow_odbc::reader::ReadStrategy + 'static)` cannot be sent between threads safely
    |
    = help: the trait `std::marker::Send` is not implemented for `(dyn arrow_odbc::reader::ReadStrategy + 'static)`
    = note: required for `std::ptr::Unique<(dyn arrow_odbc::reader::ReadStrategy + 'static)>` to implement `std::marker::Send`
note: required because it appears within the type `Box<dyn ReadStrategy>`
   --> /Users/simonyi/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/boxed.rs:195:12

where

Can it be changed?
impl<C: Cursor + Send + 'static> ConcurrentOdbcReader {

I think

I think it is better to use SendableRecordBatchStream type here

pub trait RecordBatchStream: Stream<Item = Result<RecordBatch>> {
    fn schema(&self) -> SchemaRef;
}

pub type SendableRecordBatchStream = Pin<Box<dyn RecordBatchStream + Send>>;

application run crash

        let cursor = conn
            .execute("SHOW COLUMNS FROM  hello", ())
            .unwrap()
            .expect("");
        // Each batch shall only consist of maximum 10.000 rows.
        let max_batch_size = 10_000;
        // Read result set as arrow batches. Infer Arrow types automatically using the meta
        // information of `cursor`.
        let mut arrow_record_batches = OdbcReader::new(cursor, max_batch_size).unwrap();

When "SHOW COLUMNS FROM hello" is executed, if batch_size is set to 10000, the program will crash. If it is set to 10, it will not crash.

ReadStrategy is not Send

Is the ConcurrentOdbcReader thread safe? I'm getting an error saying that ReadStrategy is not Send:
error[E0277]: (dyn arrow_odbc::reader::ReadStrategy + 'static) cannot be sent between threads safely

Is there a way to use this concurrently? I'm trying to use this with the new WASI component model, saving the reader across multiple calls from the component.

Performance extremely sensitive to `max_text_size`

It seems like arrow-odbc's performance can be extremely sensitive to the value of max_text_size:

In [16]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=30)

In [17]: %timeit -n1 -r1 list(reader)
81.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [18]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=300)

In [19]: %timeit -n1 -r1 list(reader)
117 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [20]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=3000)

In [21]: %timeit -n1 -r1 list(reader)
6.42 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

When writing library code I have to pick a value large enough to accommodate the longest string the library will ever deal will. In practice this can easily be the 3000 bytes used in the example above, for a 80x slowdown in this case :-(

Version: arrow-odbc-py 2.0.5

Optimize parsing of Decimal(x, 0)

I think we can make a (small?) optimization for parsing decimals with scale=0. Snowflake will send any integers as Decimal(38, 0) by default. At the very least I guess we can skip removing the . char but maybe we can also just use one of the integer parsers.

how to execute multi-statements where intermediate results don't return a cursor?

Hey @pacman82, how would I execute a multi-statement query like below? Based on the documentation and integration tests, you expect a cursor to be returned (only when each statement returns at least one row).

How would you advise me on how to get something like this to execute (where intermediate results don't return anything and the last result does)?

USE DATABASE PAYMENTDATA;

SELECT * FROM "PAYMENTSCHEMA"."CUSTOMERS" LIMIT 10

mysql support

Is mysql currently incompatible?

  const CONNECTION_STRING: &str = r#"
    Driver={MySQL ODBC 8.2 Unicode Driver};
    Server=127.0.0.1;
    DB=morse;
    UID=root;
    PWD=123456;
    Port=3306;
    "#;

odbcinst.ini

[MySQL ODBC 8.2 Unicode Driver]
Description=MySQL ODBC 8.2 Unicode Driver
Driver = /usr/local/lib/libmyodbc8w.so


running 1 test
Error: ODBC emitted an error calling 'SQLDriverConnect':
State: H000, Native error: 0, Message: [

Stack backtrace:
   0: std::backtrace_rs::backtrace::libunwind::trace
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/backtrace.rs:331:13
   3: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
             at /Users/simonyi/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/anyhow-1.0.75/src/error.rs:551:25
   4: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/result.rs:1963:27
   5: morse_engine::odbc::test::test
             at ./src/odbc/mod.rs:27:26
   6: morse_engine::odbc::test::test::{{closure}}
             at ./src/odbc/mod.rs:15:18
   7: core::ops::function::FnOnce::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
   8: core::ops::function::FnOnce::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
   9: test::__rust_begin_short_backtrace
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:628:18
  10: test::run_test_in_process::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:651:60
  11: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/panic/unwind_safe.rs:272:9
  12: std::panicking::try::do_call
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:552:40
  13: std::panicking::try
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:516:19
  14: std::panic::catch_unwind
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panic.rs:142:14
  15: test::run_test_in_process
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:651:27
  16: test::run_test::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:574:43
  17: test::run_test::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:602:41
  18: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/sys_common/backtrace.rs:154:18
  19: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/thread/mod.rs:529:17
  20: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/panic/unwind_safe.rs:272:9
  21: std::panicking::try::do_call
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:552:40
  22: std::panicking::try
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:516:19
  23: std::panic::catch_unwind
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panic.rs:142:14
  24: std::thread::Builder::spawn_unchecked_::{{closure}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/thread/mod.rs:528:30
  25: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
  26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/alloc/src/boxed.rs:2007:9
  27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/alloc/src/boxed.rs:2007:9
  28: std::sys::unix::thread::Thread::new::thread_start
             at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/sys/unix/thread.rs:108:17
  29: __pthread_joiner_wake
test odbc::test::test ... FAILED

failures:

failures:
    odbc::test::test

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.37s

error: test failed, to rerun pass `-p morse-engine --lib`

Polars and arrow-odbc

Hey @pacman82 thanks for your job you have done a incredible work here, please did you have some articles or examples of how to integrate this crate with polars? I had find some posts talking about this crate with polars but no single one example how to do it.

See ya!

arrow version conflict

cargo.lock

[[package]]
name = "arrow-odbc"
version = "3.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2278bb408bd1cbf82f725517ff182d06b024657040f9bafac23b9bd0ca61883"
dependencies = [
 "arrow 49.0.0",
 "atoi",
 "chrono",
 "log",
 "odbc-api",
 "thiserror",
]

my cargo.toml

[dependencies]
arrow = { version = "46.0.0", features = ["prettyprint","ffi"] }
arrow-array = { version = "46.0.0", default-features = false, features = ["chrono-tz"] }
arrow-buffer = { version = "46.0.0", default-features = false }
arrow-flight = { version = "46.0.0", features = ["flight-sql-experimental"] }
arrow-schema = { version = "46.0.0", default-features = false }
parquet = { version = "46.0.0", features = ["arrow", "async", "object_store"] }

What should I do to make arrow-odbc use arrow:46.0.0 version? Although arrow-odbc supports version = ">= 29, < 50", the actual compiled result is as above. Why is this?

ADBC support?

Just an idea: would be nice to use this as a bridge from odbc to adbc

https://arrow.apache.org/adbc/0.5.1/format/specification.html

Feel free to close this if too complex / time consuming or otherwise not wanted

Would probably need to use this lib: https://github.com/apache/arrow-adbc/blob/main/rust/Cargo.toml

Semicolon after INSERT fails on DB2 for IBM i.

I was originally trying to do bulk inserts with arrow-odbc-py, but the stack traces didn't show me the original database error code and message. So, I switched to Rust, for a simpler reproducible example.

My goal was to get insert_into_table working with DB2 for IBM i. I believe the root cause is that function insert_statement_text hard-codes a semicolon, and the ODBC drivers for DB2 for IBM i does not allow statements to end in a semicolon, nor does it allow a script of statements separated by a semicolon.

Code

For a table named TSTPANDAS (originally testing inserts from Pandas) with columns ROW_NUM and RAND_FLOAT.

Warning: I really don't know Rust, though it's reminding me of C++, Java, and Kotlin. I hacked this together from examples in your docs.

use std::env;
use std::sync::Arc;

use arrow::{
    array::{Float32Array, Int32Array},
    record_batch::RecordBatch,
};
use arrow_array::RecordBatchIterator;
use arrow_odbc::{
    insert_into_table,
    odbc_api::{ConnectionOptions, Environment, Error},
};

fn main() -> Result<(), Error> {
    // If you do not do anything fancy it is recommended to have only one Environment in the
    // entire process.
    let environment = Environment::new()?;

    let user = env::var("db2_user").unwrap_or_default();
    let password = env::var("db2_password").unwrap_or_default();

    // Connect using a DSN. Alternatively we could have used a connection string
    let connection = environment.connect(
        "scupddatawhse_dev",
        &user,
        &password,
        ConnectionOptions::default(),
    )?;

    // https://docs.rs/arrow/latest/arrow/#tabular-representation
    let col_1 = Arc::new(Int32Array::from_iter([1, 2, 3])) as _;
    let col_2 = Arc::new(Float32Array::from_iter([1., 6.3, 4.])) as _;

    let record_batch =
        RecordBatch::try_from_iter([("row_num", col_1), ("rand_float", col_2)]).unwrap();

    // https://docs.rs/arrow-array/44.0.0/arrow_array/trait.RecordBatchReader.html
    let batches: Vec<RecordBatch> = vec![record_batch.clone(), record_batch.clone()];

    let mut reader = RecordBatchIterator::new(batches.into_iter().map(Ok), record_batch.schema());
    let batch_size = 1000;
    insert_into_table(&connection, &mut reader, "TSTPANDAS", batch_size).expect("Inserted into TSTPANDAS");

    Ok(())
}

Output:

rcoleman12@ross-T480-linux:~/git/test_arrow_odbc$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/test_arrow_odbc`
thread 'main' panicked at 'Inserted into TSTPANDAS: PreparingInsertStatement { source: Diagnostics { record: State: 42000, Native error: -104, Message: [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0104 - Token ; was not valid. Valid tokens: <END-OF-STATEMENT>., function: "SQLPrepare" }, sql: "INSERT INTO TSTPANDAS (row_num, rand_float) VALUES (?, ?);" }', src/main.rs:42:74
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

compile error

os:macos m1

-nodefaultlibs"
  = note: ld: warning: ignoring duplicate libraries: '-lc++'
          ld: library 'odbc' not found
          clang: error: linker command failed with exit code 1 (use -v to see invocation)