pacman82 / arrow-odbc Goto Github PK
View Code? Open in Web Editor NEWFill Apache Arrow record batches from an ODBC data source in Rust.
License: MIT License
Fill Apache Arrow record batches from an ODBC data source in Rust.
License: MIT License
Display size of column 7: -4 Is this normal? If this is -4, the program will be painc.
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 1. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'username';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 2. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'fullname';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 3. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'password';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 4. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'email';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 5. Relational type: Varchar { length: 2083 }; Nullability: Nullable; Name: 'profile_img';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 6. Relational type: TinyInt; Nullability: Nullable; Name: 'deleted';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 7. Relational type: Unknown; Nullability: Nullable; Name: 'properties';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::schema] ODBC driver reported for column 8. Relational type: TinyInt; Nullability: Nullable; Name: 'is_group';
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 1: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 2: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 3: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 4: Varchar { length: 255 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 5: Varchar { length: 2083 }
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Relational type of column 7: Unknown
[2023-11-19T04:49:41Z DEBUG arrow_odbc::reader] Display size of column 7: -4
thread 'odbc::odbc_drive::test::test_insert_into' panicked at /Users/simonyi/mosdb/arrow-odbc/src/reader/text.rs:52:14:
called `Result::unwrap()` on an `Err` value: TryFromIntError(())
Hey, @pacman82! Again, I can't thank you enough for this lib!
Running into an issue where we continue getting the following error despite increasing MAX_TEXT_SIZE to a value greater than 1384. Any help would be much appreciated!
Arrow { source: ExternalError(TooLargeValueForBuffer { indicator: Some(1384), buffer_index: 1 }) }
As additional context, I've tried increasing the MAX_BYTES_PER_BATCH to some absurd numbers but no dice. I've also tried removing with_fallable_allocations
and also removing with_max_text_size
Update: the SQLType for the problematic field was coming in as VARCHAR(255). When converted to octets (utf8), it produces an expected length of 1020 (255 * 4). As you can see see in the error message, the actual length of the column was 1384. The source is databricks. So perhaps databricks isn't respecting its own schema? As a workaround, I modified this line to be the max of max_str_len
and the len
as reported by the type. But this causes issues with my other sources because now the row size is determined as the roughly max_str_len * number of columns which turns into a huge number.
So, to accommodate the messiness of the world I'm working with, I think I'd prefer as a user of this API to be able to provide an expected text size and a max text size. The expected text size would be used to allocate the vets, and the max text size would be used to return an error or panic.
The 8.3.0 update is breaking my code with:
the trait arrow_odbc::odbc_api::Cursor
is not implemented for odbc_api::CursorImpl<odbc_api::StatementConnection<'static>>
Since release of arrow_odbc 3.0.0 (arrow_odbc_py 2.0.0), when running a multi-statement query that doesn't generate a result set in the second or later statement now fails due to a division by zero error when calculating the buffer_size_in_rows
. The first statement returns a result set to allow the Cursor
to be created, and the statement handle still has more results; however, the calculation of the buffer size now panics.
import arrow_odbc as ao
reader = ao.read_arrow_batches_from_odbc(
connection_string="Driver={DB2};"
"Hostname=host;"
"Port=50000;"
"Protocol=TCPIP;"
"Database=db2db;"
"UID=user;"
"PWD=password;",
query="""
SELECT 1
FROM SYSIBM.SYSDUMMY1
;
DECLARE GLOBAL TEMPORARY TABLE MY_DGTT AS (
SELECT *, 5 AS FIVE
FROM SYSIBM.SYSDUMMY1
)
WITH DATA
ON COMMIT PRESERVE ROWS
NOT LOGGED
WITH REPLACE
;
SELECT *
FROM SESSION.MY_DGTT
;
""",
)
while True:
print("-------------------------")
print(reader.schema or "No Schema")
if reader.schema:
print("-------------------------")
for batch in reader:
print(batch)
if not reader.more_results():
break
Error Message:
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-odbc-3.0.0/src/reader/odbc_reader.rs:467:30:
attempt to divide by zero
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Expected Results:
-------------------------
1: int32 not null
-------------------------
pyarrow.RecordBatch
1: int32 not null
----
1: [1]
-------------------------
No Schema
-------------------------
IBMREQD: string not null
FIVE: int32 not null
-------------------------
pyarrow.RecordBatch
IBMREQD: string not null
FIVE: int32 not null
----
IBMREQD: ["Y"]
FIVE: [5]
Any plan on api to get tables and schemas from odbc connection?
This is the problem I'm having
error[E0277]: `(dyn arrow_odbc::reader::ReadStrategy + 'static)` cannot be sent between threads safely
--> src/odbc/odbc_drive.rs:105:25
|
105 | Ok(Some(Box::pin(s)))
| ^^^^^^^^^^^ `(dyn arrow_odbc::reader::ReadStrategy + 'static)` cannot be sent between threads safely
|
= help: the trait `std::marker::Send` is not implemented for `(dyn arrow_odbc::reader::ReadStrategy + 'static)`
= note: required for `std::ptr::Unique<(dyn arrow_odbc::reader::ReadStrategy + 'static)>` to implement `std::marker::Send`
note: required because it appears within the type `Box<dyn ReadStrategy>`
--> /Users/simonyi/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/boxed.rs:195:12
Can it be changed?
impl<C: Cursor + Send + 'static> ConcurrentOdbcReader {
I think it is better to use SendableRecordBatchStream type here
pub trait RecordBatchStream: Stream<Item = Result<RecordBatch>> {
fn schema(&self) -> SchemaRef;
}
pub type SendableRecordBatchStream = Pin<Box<dyn RecordBatchStream + Send>>;
let cursor = conn
.execute("SHOW COLUMNS FROM hello", ())
.unwrap()
.expect("");
// Each batch shall only consist of maximum 10.000 rows.
let max_batch_size = 10_000;
// Read result set as arrow batches. Infer Arrow types automatically using the meta
// information of `cursor`.
let mut arrow_record_batches = OdbcReader::new(cursor, max_batch_size).unwrap();
When "SHOW COLUMNS FROM hello" is executed, if batch_size is set to 10000, the program will crash. If it is set to 10, it will not crash.
Is the ConcurrentOdbcReader thread safe? I'm getting an error saying that ReadStrategy is not Send:
error[E0277]:
(dyn arrow_odbc::reader::ReadStrategy + 'static) cannot be sent between threads safely
Is there a way to use this concurrently? I'm trying to use this with the new WASI component model, saving the reader across multiple calls from the component.
It seems like arrow-odbc's performance can be extremely sensitive to the value of max_text_size
:
In [16]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=30)
In [17]: %timeit -n1 -r1 list(reader)
81.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [18]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=300)
In [19]: %timeit -n1 -r1 list(reader)
117 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [20]: reader = arrow_odbc.read_arrow_batches_from_odbc(..., max_text_size=3000)
In [21]: %timeit -n1 -r1 list(reader)
6.42 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
When writing library code I have to pick a value large enough to accommodate the longest string the library will ever deal will. In practice this can easily be the 3000 bytes used in the example above, for a 80x slowdown in this case :-(
Version: arrow-odbc-py 2.0.5
I think we can make a (small?) optimization for parsing decimals with scale=0. Snowflake will send any integers as Decimal(38, 0) by default. At the very least I guess we can skip removing the .
char but maybe we can also just use one of the integer parsers.
Hey @pacman82, how would I execute a multi-statement query like below? Based on the documentation and integration tests, you expect a cursor to be returned (only when each statement returns at least one row).
How would you advise me on how to get something like this to execute (where intermediate results don't return anything and the last result does)?
USE DATABASE PAYMENTDATA;
SELECT * FROM "PAYMENTSCHEMA"."CUSTOMERS" LIMIT 10
Is mysql currently incompatible?
const CONNECTION_STRING: &str = r#"
Driver={MySQL ODBC 8.2 Unicode Driver};
Server=127.0.0.1;
DB=morse;
UID=root;
PWD=123456;
Port=3306;
"#;
odbcinst.ini
[MySQL ODBC 8.2 Unicode Driver]
Description=MySQL ODBC 8.2 Unicode Driver
Driver = /usr/local/lib/libmyodbc8w.so
running 1 test
Error: ODBC emitted an error calling 'SQLDriverConnect':
State: H000, Native error: 0, Message: [
Stack backtrace:
0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/backtrace.rs:331:13
3: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
at /Users/simonyi/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/anyhow-1.0.75/src/error.rs:551:25
4: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/result.rs:1963:27
5: morse_engine::odbc::test::test
at ./src/odbc/mod.rs:27:26
6: morse_engine::odbc::test::test::{{closure}}
at ./src/odbc/mod.rs:15:18
7: core::ops::function::FnOnce::call_once
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
8: core::ops::function::FnOnce::call_once
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
9: test::__rust_begin_short_backtrace
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:628:18
10: test::run_test_in_process::{{closure}}
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:651:60
11: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/panic/unwind_safe.rs:272:9
12: std::panicking::try::do_call
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:552:40
13: std::panicking::try
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:516:19
14: std::panic::catch_unwind
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panic.rs:142:14
15: test::run_test_in_process
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:651:27
16: test::run_test::{{closure}}
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:574:43
17: test::run_test::{{closure}}
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/test/src/lib.rs:602:41
18: std::sys_common::backtrace::__rust_begin_short_backtrace
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/sys_common/backtrace.rs:154:18
19: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/thread/mod.rs:529:17
20: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/panic/unwind_safe.rs:272:9
21: std::panicking::try::do_call
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:552:40
22: std::panicking::try
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panicking.rs:516:19
23: std::panic::catch_unwind
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/panic.rs:142:14
24: std::thread::Builder::spawn_unchecked_::{{closure}}
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/thread/mod.rs:528:30
25: core::ops::function::FnOnce::call_once{{vtable.shim}}
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/core/src/ops/function.rs:250:5
26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/alloc/src/boxed.rs:2007:9
27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/alloc/src/boxed.rs:2007:9
28: std::sys::unix::thread::Thread::new::thread_start
at /rustc/4b85902b438f791c5bfcb6b1c5b476d5b88e2bef/library/std/src/sys/unix/thread.rs:108:17
29: __pthread_joiner_wake
test odbc::test::test ... FAILED
failures:
failures:
odbc::test::test
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.37s
error: test failed, to rerun pass `-p morse-engine --lib`
Hey @pacman82 thanks for your job you have done a incredible work here, please did you have some articles or examples of how to integrate this crate with polars? I had find some posts talking about this crate with polars but no single one example how to do it.
See ya!
cargo.lock
[[package]]
name = "arrow-odbc"
version = "3.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2278bb408bd1cbf82f725517ff182d06b024657040f9bafac23b9bd0ca61883"
dependencies = [
"arrow 49.0.0",
"atoi",
"chrono",
"log",
"odbc-api",
"thiserror",
]
my cargo.toml
[dependencies]
arrow = { version = "46.0.0", features = ["prettyprint","ffi"] }
arrow-array = { version = "46.0.0", default-features = false, features = ["chrono-tz"] }
arrow-buffer = { version = "46.0.0", default-features = false }
arrow-flight = { version = "46.0.0", features = ["flight-sql-experimental"] }
arrow-schema = { version = "46.0.0", default-features = false }
parquet = { version = "46.0.0", features = ["arrow", "async", "object_store"] }
What should I do to make arrow-odbc use arrow:46.0.0 version? Although arrow-odbc supports version = ">= 29, < 50", the actual compiled result is as above. Why is this?
Just an idea: would be nice to use this as a bridge from odbc to adbc
https://arrow.apache.org/adbc/0.5.1/format/specification.html
Feel free to close this if too complex / time consuming or otherwise not wanted
Would probably need to use this lib: https://github.com/apache/arrow-adbc/blob/main/rust/Cargo.toml
I was originally trying to do bulk inserts with arrow-odbc-py, but the stack traces didn't show me the original database error code and message. So, I switched to Rust, for a simpler reproducible example.
My goal was to get insert_into_table
working with DB2 for IBM i. I believe the root cause is that function insert_statement_text
hard-codes a semicolon, and the ODBC drivers for DB2 for IBM i does not allow statements to end in a semicolon, nor does it allow a script of statements separated by a semicolon.
For a table named TSTPANDAS (originally testing inserts from Pandas) with columns ROW_NUM and RAND_FLOAT.
Warning: I really don't know Rust, though it's reminding me of C++, Java, and Kotlin. I hacked this together from examples in your docs.
use std::env;
use std::sync::Arc;
use arrow::{
array::{Float32Array, Int32Array},
record_batch::RecordBatch,
};
use arrow_array::RecordBatchIterator;
use arrow_odbc::{
insert_into_table,
odbc_api::{ConnectionOptions, Environment, Error},
};
fn main() -> Result<(), Error> {
// If you do not do anything fancy it is recommended to have only one Environment in the
// entire process.
let environment = Environment::new()?;
let user = env::var("db2_user").unwrap_or_default();
let password = env::var("db2_password").unwrap_or_default();
// Connect using a DSN. Alternatively we could have used a connection string
let connection = environment.connect(
"scupddatawhse_dev",
&user,
&password,
ConnectionOptions::default(),
)?;
// https://docs.rs/arrow/latest/arrow/#tabular-representation
let col_1 = Arc::new(Int32Array::from_iter([1, 2, 3])) as _;
let col_2 = Arc::new(Float32Array::from_iter([1., 6.3, 4.])) as _;
let record_batch =
RecordBatch::try_from_iter([("row_num", col_1), ("rand_float", col_2)]).unwrap();
// https://docs.rs/arrow-array/44.0.0/arrow_array/trait.RecordBatchReader.html
let batches: Vec<RecordBatch> = vec![record_batch.clone(), record_batch.clone()];
let mut reader = RecordBatchIterator::new(batches.into_iter().map(Ok), record_batch.schema());
let batch_size = 1000;
insert_into_table(&connection, &mut reader, "TSTPANDAS", batch_size).expect("Inserted into TSTPANDAS");
Ok(())
}
rcoleman12@ross-T480-linux:~/git/test_arrow_odbc$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.05s
Running `target/debug/test_arrow_odbc`
thread 'main' panicked at 'Inserted into TSTPANDAS: PreparingInsertStatement { source: Diagnostics { record: State: 42000, Native error: -104, Message: [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0104 - Token ; was not valid. Valid tokens: <END-OF-STATEMENT>., function: "SQLPrepare" }, sql: "INSERT INTO TSTPANDAS (row_num, rand_float) VALUES (?, ?);" }', src/main.rs:42:74
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
os:macos m1
-nodefaultlibs"
= note: ld: warning: ignoring duplicate libraries: '-lc++'
ld: library 'odbc' not found
clang: error: linker command failed with exit code 1 (use -v to see invocation)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.