
duckdb_azure's Introduction

DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install and use DuckDB, please see our website for installation and usage instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The website contains a reference of functions and SQL constructs available in DuckDB.

Development

For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. You should run make unit and make allunit to verify that your version works properly after making changes. To test performance, you can run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then perform several standard benchmarks from the root directory by executing ./build/release/benchmark/benchmark_runner. The details of benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.

duckdb_azure's Issues

Segfault with extension built on Ubuntu

The process below successfully builds the extension, which INSTALLs from a local path, but on LOAD azure the duckdb CLI crashes.

Oddly, this does not happen with the GitHub release binary; I am testing a local build on Ubuntu to address SSL CA errors: #8 (comment)

$ duckdb -unsigned
v0.9.2 3c695d7ba9
Enter ".help" for usage hints.
D force install '/home/kmatt/source/duckdb_azure/build/release/extension/azure/azure.duckdb_extension';
D load azure;
Floating point exception (core dumped)

$ dmesg -T
[Fri Dec  8 18:31:53 2023] traps: duckdb[4008970] trap divide error ip:7f7cd3a3f275 sp:7ffc0a07f7e0 error:0 in azure.duckdb_extension[7f7cd350a000+1768000]

I'm not much of a C++ debugger, but I am building DuckDB with debug symbols to see whether gdb shows anything helpful.

Build steps:

# Checkout source and submodule
git clone --recurse-submodules https://github.com/duckdb/duckdb_azure.git && cd duckdb_azure

# Install vcpkg in root of azure_duckdb source path
git clone https://github.com/Microsoft/vcpkg.git

# Run the bootstrap script to build vcpkg
vcpkg/bootstrap-vcpkg.sh

# Install dependencies
vcpkg/vcpkg install

# Make
VCPKG_TOOLCHAIN_PATH=$PWD/vcpkg/scripts/buildsystems/vcpkg.cmake make

Problem with the SSL CA cert

It works fine on Windows, but when running from a notebook on Linux, I get this error:

duckdb.sql(''' create or replace view lineitem as select * from 'azure://tpch/lineitem/*.parquet';''')

Error: Invalid Error: Fail to get a new connection for: https://dddddddd.blob.core.windows.net./ Problem with the SSL CA cert (path? access rights?)
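On Linux this message usually means the curl-based client could not find the system CA bundle. As a quick diagnostic (a sketch run outside DuckDB, not part of the extension), Python's stdlib can show where OpenSSL expects certificates on the machine:

```python
import os
import ssl

# Where OpenSSL looks for CA certificates by default on this machine
paths = ssl.get_default_verify_paths()
print("cafile :", paths.cafile, "(exists:", os.path.exists(paths.cafile or ""), ")")
print("capath :", paths.capath)
print("env var:", paths.openssl_cafile_env, "=", os.environ.get(paths.openssl_cafile_env))
```

If the reported cafile does not exist, installing the distro's `ca-certificates` package (or pointing the environment variable above at a valid bundle) is a common first fix on Ubuntu.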

Authorization error for query containing wildcard

Performing a query over multiple blobs gives an authentication error. The same query and connection string appear to work with adlfs.

import duckdb
import adlfs

sas = <sas>
blob_endpoint=<endpoint>
table_path=<path>
connection_string=f"""BlobEndpoint={blob_endpoint};SharedAccessSignature={sas}"""

duckdb.sql('load azure')
duckdb.sql(f"set azure_storage_connection_string = '{connection_string}'")

The following fails with "This request is not authorized to perform this operation":

duckdb.sql(f"select * from 'azure://{table_path}/*.parquet'")

This works using the same connection string.

fs = adlfs.AzureBlobFileSystem(connection_string=connection_string)
con = duckdb.connect()
con.register_filesystem(fs)

con.sql(f"select * from 'abfs://{table_path}/*.parquet'")

Ideally I'd like to perform Hive-partitioned reads over Azure Blob Storage. Is this supported?
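For reference, an Azure connection string like the one above is just a semicolon-separated list of key=value pairs. A small stdlib helper (hypothetical, not part of duckdb or adlfs) makes it easy to inspect which fields are actually being passed to each client:

```python
def parse_connection_string(cs: str) -> dict:
    """Split an Azure-style connection string into its key/value fields."""
    parts = {}
    for item in cs.split(";"):
        if not item:
            continue
        # partition (not split) so '=' signs inside a SAS token are preserved
        key, _, value = item.partition("=")
        parts[key] = value
    return parts

cs = "BlobEndpoint=https://acct.blob.core.windows.net;SharedAccessSignature=sv=2022&sig=abc"
print(parse_connection_string(cs))
```

Comparing the parsed fields against what adlfs receives can reveal, for example, a missing endpoint suffix or a truncated SAS.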

Segmentation fault when copying to Azure storage

This works:

INSTALL azure;
LOAD azure;

CREATE SECRET test (
      TYPE AZURE,
      PROVIDER CREDENTIAL_CHAIN,
      ACCOUNT_NAME 'test'
);

select * from 'az://test/test.csv';

This triggers a segmentation fault (in same session as above):

copy (select 1 as A, 2 as B) to 'az://test/test2.csv';

System Info

OS: Windows 11 and Linux (WSL)
Environment: Both PowerShell and WSL

❯ duckdb --version
v0.10.0 20b1486d11

Unable to query entire parquet directory using anonymous authentication

Hello,

I'm not able to query Parquet data partitioned into several folders using anonymous authentication, though I am able to query individual files. I can confirm that I am able to download the entire folder anonymously using other tools, so I think the access permissions are set up properly.

Here is the code.

SET azure_storage_connection_string = 'DefaultEndpointsProtocol=https;AccountName=overturemapswestus2;AccountKey=;EndpointSuffix=core.windows.net';
SELECT
           subType,
           localityType,
           adminLevel,
           isoCountryCodeAlpha2,
           JSON(names) AS names,
           JSON(sources) AS sources,
           ST_GeomFromWkb(geometry) AS geometry
      FROM read_parquet('azure://release/2023-07-26-alpha.0/theme=admins/type=*/*')
     WHERE adminLevel = 2
       AND ST_GeometryType(ST_GeomFromWkb(geometry)) IN ('POLYGON','MULTIPOLYGON')
	   LIMIT 10;

Error

Error: Invalid Error: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:d101e329-701e-0015-5fa8-f8f163000000
Time:2023-10-06T22:55:53.1707452Z
Request ID: d101e329-701e-0015-5fa8-f8f163000000

Working query

SELECT
           subType,
           localityType,
           adminLevel,
           isoCountryCodeAlpha2,
           JSON(names) AS names,
           JSON(sources) AS sources,
           ST_GeomFromWkb(geometry) AS geometry
      FROM read_parquet('azure://release/2023-07-26-alpha.0/theme=admins/type=locality/20230725_211237_00132_5p54t_0fa79fec-7a39-4f51-90a6-aa94b553befd')
     WHERE adminLevel = 2
       AND ST_GeometryType(ST_GeomFromWkb(geometry)) IN ('POLYGON','MULTIPOLYGON')
	   LIMIT 10;

Connection timeout behind proxy network

Using DuckDB 0.9.2 to connect to ADLS Gen2 and read Parquet, it is unable to connect during stmt.execute(query).
It gives a 12002 connection timeout on a network where a proxy is required; on other networks it works.
Is there any workaround? I have set the proxy in the system properties, but that does not work.
Is there any way to pass a proxy to the extension? Thanks.
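One thing worth checking is whether the proxy is visible to the process at all; curl-based stacks generally look at the conventional `http_proxy`/`https_proxy` environment variables rather than Java-style system properties. A stdlib diagnostic sketch (whether the extension's transport honors these variables may depend on the build):

```python
import urllib.request

# Proxy settings exposed via the conventional http_proxy / https_proxy
# environment variables (the form most curl-based clients honor)
proxies = urllib.request.getproxies_environment()
print(proxies)  # e.g. {'https': 'http://proxy.corp.example:8080'} when set
```

If the dict is empty, exporting `https_proxy` before launching the process is the usual first thing to try.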

AzureStorageFileSystem Directory Exists not implemented

What happens?

duckdb.duckdb.NotImplementedException: Not implemented Error: AzureStorageFileSystem: DirectoryExists is not implemented!

I'm facing this while copying a DuckDB table to Azure.

To Reproduce

Simply copying the table will reproduce it.

OS:

Ubuntu

DuckDB Version:

0.10.0

DuckDB Client:

Python

Full Name:

Tejinderpal Singh

Affiliation:

Atlan

Have you tried this on the latest nightly build?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Support for Device Code Flow Authentication

As suggested at this link, I'm creating an issue for the possible implementation of authentication via Device Code Flow. I noticed that this flow is not exclusive to Azure; it comes from OAuth 2.0. It yields two tokens: a Bearer token and a refresh token (to reauthenticate when the Bearer token expires). However, I think it doesn't make sense to implement reauthentication within DuckDB; I believe that is better left as the user's responsibility. For more information on how this flow works in Azure, please refer to this link:

https://learn.microsoft.com/en-us/entra/identity-platform/v2-oauth2-device-code
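For context, the first leg of the flow described in the linked doc is a plain form-encoded POST to the tenant's devicecode endpoint. A minimal sketch of the request construction (the tenant, client_id, and scope values are placeholders, and the actual POST and polling loop are omitted):

```python
from urllib.parse import urlencode

tenant = "common"                                    # placeholder tenant
client_id = "00000000-0000-0000-0000-000000000000"   # placeholder app registration

# First leg of the device code flow: ask for a device_code + user_code pair.
# offline_access requests the refresh token mentioned above.
url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/devicecode"
body = urlencode({
    "client_id": client_id,
    "scope": "https://storage.azure.com/.default offline_access",
})
print(url)
print(body)  # POST this body; the response carries user_code and verification_uri
```

The client then shows user_code to the user and polls the token endpoint until the user completes sign-in in a browser.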

MainDistributionPipeline build failed in main

Hello,

I saw that the MainDistributionPipeline didn't succeed on Windows with the main branch after my last PRs :(

azure-identity.lib(azure_cli_credential.cpp.obj) : error LNK2019: unresolved external symbol _Thrd_sleep_for referenced in function "void __cdecl std::this_thread::sleep_for<__int64,struct std::ratio<1,1000> >(class std::chrono::duration<__int64,struct std::ratio<1,1000> > const &)" (??$sleep_for@_JU?$ratio@$00$0DOI@@std@@@this_thread@std@@YAXAEBV?$duration@_JU?$ratio@$00$0DOI@@std@@@chrono@1@@Z) [D:\a\duckdb_azure\duckdb_azure\build\release\test\extension\loadable_extension_demo_loadable_extension.vcxproj]

Build link

The issue itself is strange because the mentioned function is part of the STL...
I tried to reproduce it on my computer, but the build succeeded without any issue...

Do you have any idea, or some input on the build that is not shown in the logs? (I already checked the versions of the compiler, Python, ...)

Also, something is strange: the build triggered on [PR #33](51f680e) and [PR #34](74b2330), but not on [PR #35](ec2cd30). Do you know why?

Regards,
Quentin

DuckDB 0.10 seems to be broken

Hi there

I had expected this to work according to the docs for 0.10:

import duckdb

with duckdb.connect() as con:
    con.execute(
        """INSTALL azure;
LOAD azure;
CREATE SECRET secret2 (
    TYPE AZURE,
    ACCOUNT_NAME 'account_name'
);"""
    )

However, I get this: duckdb.duckdb.InvalidInputException: Invalid Input Error: Secret type 'azure' not found

Can't connect, authentication errors with seemingly valid connection string

I believe I have set a proper connection string for Azure and loaded the extension in DuckDB 0.9.0. The same connection string works with the AzureBlobFileSystem in Python, authenticating and listing blobs, etc.

But with DuckDB on the command line, the same connection string produces errors like this:

Error: IO Error: AzureStorageFileSystem open file  <file path stuff> failed with Reason Phrase: Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

Is there a way to see this www-authenticate header?
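One way to see it is to issue the same request outside DuckDB (e.g. curl -v against the blob URL) and read the response headers. A WWW-Authenticate challenge is just a scheme followed by quoted key/value parameters; a small stdlib sketch for pulling them apart (the example values are illustrative):

```python
import re

def parse_www_authenticate(value: str):
    """Split a WWW-Authenticate challenge into (scheme, params)."""
    scheme, _, rest = value.partition(" ")
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme, params

# A challenge of the shape Azure Storage can return for bearer auth
scheme, params = parse_www_authenticate(
    'Bearer authorization_uri="https://login.microsoftonline.com/tenant/oauth2/authorize", '
    'resource_id="https://storage.azure.com"'
)
print(scheme, params)
```

The parameters often point at the tenant and resource the server expected, which helps spot a connection string aimed at the wrong account or endpoint.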

Unable to install the extension

Hello,

I see the error below when I try to install the extension on Windows. Please take a look.

D INSTALL azure;
Error: HTTP Error: Failed to download extension "azure" at URL "http://extensions.duckdb.org/v0.8.1/windows_amd64/azure.duckdb_extension.gz"

Candidate extensions: "icu"

Performance compared to querying virtual filesystem through `rclone mount`

I have a 1.4 GB parquet file in Azure storage.

I can mount the storage location using rclone as follows:

rclone mount account_name:container_name ~/local_mount_point --vfs-cache-mode full --max-read-ahead 1024Ki --read-only

In this scenario, the following query runs in 103 seconds:

select * from '~/local_mount_point/directory/file.parquet' limit 10;

I can also use the duckdb_azure extension to connect to the file directly:

force install azure from 'http://nightly-extensions.duckdb.org';
LOAD azure;
set azure_account_name='account_name';
set azure_credential_chain='cli';
select * from 'azure://container/directory/file.parquet' limit 10;

In this case, the query completes in 240 seconds.

I'm guessing there may be performance enhancements possible to make the direct query as fast as (or faster than?) the query through the rclone mount.

(I'm aware this is a very early preview. Thanks for your amazing work!)

Note that rclone provides some local caching, so subsequent runs of the same query are much faster.
For example, re-running the query above completes in only 7 seconds on the second run through rclone,
while re-running the direct query took 208 seconds.

I'm not sure how much of the first-run performance difference is due to rclone caching.

Connecting without connection string.

Hi DuckDB azure team,

thanks for an amazing extension.

I am running into an issue connecting: my company does not allow connection strings to be revealed for security reasons, and their connector reads the relevant properties directly from the `az login` state in the CLI.

Can the DuckDB Azure extension do something similar?

Support write operation

Hi,

It's not really an issue, but more an insight into what I plan to work on.
I haven't started yet, but when I do I will post a message here. If someone starts before me, please let me know ;)

Querying data from public $web container without authentication

Thanks for the great extension! I am using Azure Storage to host a static website. I have data stored in the public $web container (where the HTML is hosted) that I can successfully query in DuckDB using a connection_string containing an AccountKey or SharedAccessSignature (as described in the extension's docs):

CREATE SECRET secret1 (
    TYPE AZURE,
    CONNECTION_STRING '⟨value⟩'
);

I would like to query the data without authentication though. I've tried removing the AccountKey or SharedAccessSignature portion of the secret config above, and tried the config suggested in the docs for authentication-less access:

CREATE SECRET secret2 (
    TYPE AZURE,
    PROVIDER CONFIG,
    ACCOUNT_NAME '⟨storage account name⟩'
);

Unfortunately, both attempts result in an error: NoAuthenticationInformation (Reason Phrase: Server failed to authenticate the request. Please refer to the information in the www-authenticate header.).

I've also tried using https:// instead of az://. Without establishing any secret, `from 'https://{storage_account}.web.core.windows.net/my_table.parquet/partition_1=1/partition_2=hello/part-0.parquet'` works, but `from 'https://{storage_account}.web.core.windows.net/my_table.parquet/**/*.parquet'` gives a 404 error (the requested content does not exist). In other words, I'm unable to figure out how to read the full Parquet dataset with Hive-style partitions when using https://.

Am I doing something wrong, or is this not supported?
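As long as globbing over plain https:// doesn't work, one workaround is to enumerate the partition files explicitly, since read_parquet also accepts a list of files. A sketch that builds such a list; the account name, partition values, and file name here are hypothetical:

```python
from itertools import product

base = "https://mystorageaccount.web.core.windows.net/my_table.parquet"  # hypothetical account
partition_1 = [1, 2]
partition_2 = ["hello", "world"]

# Expand the Hive-style partition directories into explicit file URLs
urls = [
    f"{base}/partition_1={p1}/partition_2={p2}/part-0.parquet"
    for p1, p2 in product(partition_1, partition_2)
]
print(urls)
# e.g. pass the list to read_parquet with hive_partitioning enabled
```

This only helps when the partition values (and file names within each partition) are known up front, which is the limitation the glob would otherwise solve.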

Support for specifying token directly

Hi there

Thanks for this cool extension; it will enable lots of use cases for us.

If you acquire the token outside DuckDB, it would be nice to be able to do something like this:

SET azure_storage_bearer_token = '<your_token>';

This is especially useful if you use Managed Identity / Interactive Browser Credentials or the like.

Support of hierarchical namespace (dfs)

Hi @samansmink ,

This is not really an issue, but one of the features I am planning to work on.

I want to add support for dfs endpoints, firstly to improve the way files are discovered when performing a glob.

My idea is to add a new filesystem with a dedicated scheme (which still has to be defined).
It could also be managed by simply choosing between the endpoints, but I think that way may be harder, because MS might have its own endpoints...

Do you have an opinion on this ?
Regards,
Quentin
