Comments (8)
Hi, thanks for sharing your idea.
I agree that the starting point for the analysis is that function.
In my understanding, the root cause of the error is that pg_bulkload fails to account for pivot tuples (e.g. in b-tree root and intermediate nodes), whose ip_posid
may be 0. Pivot tuples of this kind were introduced in the commit (postgres/postgres@8224de4) that added support for the INCLUDE clause on indexes in PostgreSQL v11.
The error occurs when the b-tree has grown into a multi-level hierarchy, because that is when pivot tuples exist. That's why the error occurred only when a lot of data is loaded.
As you said, I assume we need to fix BTReaderInit()
to handle the case where the tuple is a pivot tuple. We will make a patch.
from pg_bulkload.
Thanks. I could reproduce it.
- 500 lines: ERROR
# current master
$ pg_bulkload -V
pg_bulkload 3.1.20
# generate data
$ for i in `seq 1 500`; do echo "$i,$((i+1)),$((i+2))"; done > /tmp/in.csv
# create table
$ psql -c "SELECT version();"
version
--------------------------------------------------------------------------------------------------------
PostgreSQL 13.11 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)
$ psql -c "CREATE EXTENSION pg_bulkload;"
CREATE EXTENSION
$ psql -c "CREATE TABLE test (id1 int, id2 int, id3 int, UNIQUE(id1, id2, id3));"
CREATE TABLE
# pg_bulkload control file
$ cat /tmp/test.ctl
# output table
OUTPUT = test
# input data
WRITER = DIRECT
INPUT = /tmp/in.csv
TYPE = CSV
QUOTE = "\""
ESCAPE = \
DELIMITER = ","
# error handling
CHECK_CONSTRAINTS = YES
PARSE_ERRORS = INFINITE
DUPLICATE_ERRORS = INFINITE
# logging
LOGFILE = /tmp/pg_bulkload.log
PARSE_BADFILE = /tmp/parse_badfile.csv
DUPLICATE_BADFILE = /tmp/duplicate_badfile.csv
# test
$ pg_bulkload /tmp/test.ctl
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
0 Rows skipped.
500 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
$ pg_bulkload /tmp/test.ctl
NOTICE: BULK LOAD START
ERROR: query failed: ERROR: could not create unique index "test_id1_id2_id3_key"
DETAIL: Key (id1, id2, id3)=(126, 127, 128) is duplicated.
DETAIL: query was: SELECT * FROM pgbulkload.pg_bulkload($1)
- 200 lines: NO ERROR
$ for i in `seq 1 200`; do echo "$i,$((i+1)),$((i+2))"; done > /tmp/in.csv
$ psql -c "DROP TABLE test; CREATE TABLE test (id1 int, id2 int, id3 int, UNIQUE(id1, id2, id3));"
$ pg_bulkload /tmp/test.ctl
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
0 Rows skipped.
200 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.
$ pg_bulkload /tmp/test.ctl
NOTICE: BULK LOAD START
NOTICE: BULK LOAD END
0 Rows skipped.
200 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
200 Rows replaced with new rows.
$ wc -l /tmp/duplicate_badfile.csv
200 /tmp/duplicate_badfile.csv
$ head -n 10 /tmp/duplicate_badfile.csv
1,2,3
2,3,4
3,4,5
4,5,6
5,6,7
6,7,8
7,8,9
8,9,10
9,10,11
10,11,12
Hey, I was also looking at the code and this is what I found.
The duplicate removal code never triggers (line 587 in 8caced4), because the merge variable is set to -1 here (https://github.com/ossc-db/pg_bulkload/blob/8caced46019119e2adf6f3dbdf96b72de9df08e9/lib/pg_btree.c#L489C6-L489C12), and that, in turn, happens because of this piece of code inside BTReaderInit():

```c
firstid = PageGetItemId(reader->page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(reader->page, firstid);
if ((itup->t_tid).ip_posid == 0)
{
	elog(DEBUG1, "pg_bulkload: failded in BTReaderInit for \"%s\"",
		 RelationGetRelationName(rel));
	return -1;
}
```
I did a quick test with that **condition removed**, which is of course a very naive solution... (I believe the check wasn't there in older versions of pg_bulkload, as it was added in 2020.) With the check removed, the program performs as expected: it properly identifies and removes duplicates, and the index gets created properly.
That check is there because of some other issue(s); it is supposed to address crashes. I ran a test with memory lowered to 1MB in PostgreSQL to try to make it crash with that change, but that didn't happen.
The PR that added the check also had a few other changes, so perhaps those other changes address it? Or perhaps my dataset was too small to make it crash with the check removed.
@mikecaat thank you for figuring this out!
Do you have an initial ETA for this issue, perhaps?
Multi-column indexes are something we use very often, so there is real impact. Any help is appreciated!
> Do you have an initial ETA for this issue, perhaps?

In addition to v16 support, we plan to release a fix for the issue by January.
IIUC, the workarounds are:
- to drop the index before loading the data and re-create it after the load, or
- to truncate the table before loading the data.

> Multi-column indexes are something we use very often, so there is real impact. Any help is appreciated!

Yes. In addition, the problem also occurs with single-column indexes.
In our testing, we noticed this doesn't occur if we load the data straight into an empty table (and this still eliminates duplicates in the data fed to pg_bulkload, which is also part of our use case).
Is there anything that could go wrong when we are loading into an empty table with indexes on it? We haven't observed problems, but perhaps, once again, it's a matter of how much data we throw at it.
> In our testing, we noticed this doesn't occur if we load the data straight into an empty table ... Is there anything that could go wrong when we are loading into an empty table with indexes on it?

IIUC, if the table is empty, which means the table's b-tree index contains no data, the issue doesn't occur. That's one of the workarounds.
Hello @mikecaat,
Just wanted to check: is the fix for this issue still on schedule for January? Has anything changed?