Comments (7)
I tried with NOT MATERIALIZED
and it didn't change what the query planner was doing. So it looks like at least in this negative form it is advisory only.
from osm2pgsql.
Do you have an EXPLAIN ANALYZE of the good and bad plan?
from osm2pgsql.
An EXPLAIN ANALYSE is naturally not possible short of waiting for a couple of hours.
Bad plan:
nominatim=# EXPLAIN
WITH changed_buckets AS (
SELECT array_agg(id) AS node_ids, id >> 5 AS bucket
FROM (SELECT osm_id as id FROM place LIMIT 1000) as x GROUP BY id >> 5)
SELECT DISTINCT w.id
FROM "public"."planet_osm_ways" w, changed_buckets b
WHERE w.nodes && b.node_ids
AND planet_osm_index_bucket(w.nodes) && (ARRAY[b.bucket])::bigint[];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46.34..55985610375.59 rows=20752577 width=8)
-> Nested Loop (cost=46.34..55985558494.15 rows=20752577 width=8)
Join Filter: ((w.nodes && (array_agg(x.id))) AND (planet_osm_index_bucket(w.nodes) && ARRAY[((x.id >> 5))]))
-> Index Scan using planet_osm_ways_pkey on planet_osm_ways w (cost=0.57..472413790.47 rows=965236160 width=108)
-> Materialize (cost=45.77..52.22 rows=215 width=40)
-> HashAggregate (cost=45.77..48.99 rows=215 width=40)
Group Key: (x.id >> 5)
-> Subquery Scan on x (cost=0.57..40.77 rows=1000 width=16)
-> Limit (cost=0.57..28.27 rows=1000 width=8)
-> Index Only Scan using place_id_idx on place (cost=0.57..9085193.42 rows=328039168 width=8)
(10 rows)
Good plan:
nominatim=# EXPLAIN
WITH changed_buckets AS (
SELECT array_agg(id) AS node_ids, id >> 5 AS bucket
FROM (SELECT osm_id as id FROM place LIMIT 1000) as x GROUP BY id >> 5)
SELECT DISTINCT w.id
FROM "public"."planet_osm_ways" w, changed_buckets b
WHERE w.nodes && b.node_ids
AND planet_osm_index_bucket(w.nodes) && (ARRAY[b.bucket])::bigint[];
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Unique (cost=59521649340.32..59521753103.21 rows=20752577 width=8)
-> Sort (cost=59521649340.32..59521701221.77 rows=20752577 width=8)
Sort Key: w.id
-> Nested Loop (cost=72489.62..59518154417.96 rows=20752577 width=8)
-> HashAggregate (cost=45.77..48.99 rows=215 width=40)
Group Key: (x.id >> 5)
-> Subquery Scan on x (cost=0.57..40.77 rows=1000 width=16)
-> Limit (cost=0.57..28.27 rows=1000 width=8)
-> Index Only Scan using place_id_idx on place (cost=0.57..9085193.42 rows=328039168 width=8)
-> Bitmap Heap Scan on planet_osm_ways w (cost=72443.86..276827659.72 rows=96524 width=108)
Recheck Cond: (planet_osm_index_bucket(nodes) && ARRAY[((x.id >> 5))])
Filter: (nodes && (array_agg(x.id)))
-> Bitmap Index Scan on planet_osm_ways_nodes_bucket_idx (cost=0.00..72419.73 rows=9652362 width=0)
Index Cond: (planet_osm_index_bucket(nodes) && ARRAY[((x.id >> 5))])
(14 rows)
from osm2pgsql.
Bad plan: https://explain.depesz.com/s/S6gG
Good plan: https://explain.depesz.com/s/jm3H
On the server getting the bad plan, can you force it to the good plan by overwriting SET enable_*
variables?
Can you get an EXPLAIN ANALYZE for just the good plan?
The bad plan is materializing the CTE. The documentation for 14 and 15 states
However, if a WITH query is non-recursive and side-effect-free (that is, it is a SELECT containing no volatile functions) then it can be folded into the parent query, allowing joint optimization of the two query levels. By default, this happens if the parent query references the WITH query just once, but not if it references the WITH query more than once.
I guess it can still materialize a CTE just like it can materialize a query plan node if it's required for the plan.
Both plans are estimating 20 million rows. This sounds very off, as we're doing batches of 1000, and indicates the problem is likely statistics-related.
The HashAggregate node statistics look reasonable as it is <1000. Reality is probably higher than 215 if places are randomly selected.
The estimate of 9.6 million rows for the good plan's bitmap index scan seems high.
I would expect PostgreSQL is already tracking the univariate statistics for planet_osm_index_bucket(nodes)
. In case it's not, can you try CREATE STATISTICS planet_osm_ways_bucket_stats ( planet_osm_index_bucket(nodes) ) FROM planet_osm_ways
and see if that changes the behavior?
from osm2pgsql.
Sorry if that wasn't clear but the underlying issue is indeed the bad statistics for GIN indexes on array columns. That's a well known and long-standing issue. Last time it did bite us was when JIT and parallel indexing was introduced (#1045).
Problem is that our tables have now reached a size where Postgres resorts to sequential scans instead of using the index. This is bad because forcing Postgres to use an index is much more difficult than preventing it from using an index or JIT or parallel processing. The ideas above are about how to best do that without breaking things in a slightly different setup.
Tried CREATE STATISTICS
but it made no difference. Maybe I did something wrong because it only took a few seconds to create it. I would have expected to take longer.
enable_material = off
helps and is the workaround I'm using currently to get updates through on that machine. It looks like it slows down other queries of the update process. Updates are still very slow but at least not at a stand-still. So not a permanent solution.
from osm2pgsql.
Can the query be updated to WITH changed_buckets AS NOT MATERIALIZED (...)
? Might help encourage the good plan without fiddling with enable_material
when that might benefit other queries.
from osm2pgsql.
Workaround is merged. Closing here.
from osm2pgsql.
Related Issues (20)
- osm2pgsql-replication init fails on get_dsn_parameters() HOT 1
- Test failure in bdd-flex with 1.9.0 HOT 2
- Confusing middle pgsql processing. HOT 7
- 1.9.2 failed to build on several architectures (error: static assertion failed) HOT 2
- Segmentation fault (core dumped) HOT 2
- Allow more than 32 generalization jobs HOT 2
- Chunky rivers when generalizing water areas HOT 2
- Question: could not extend file "base/361191441/368615606.137": No space left on device
- osm2pgsql should not perform analyze by itself HOT 15
- `object:as_multipolygon()` does not take `object.members` into account. HOT 2
- Some research on middle performance HOT 4
- highway=rest_area treatet as line HOT 2
- free(): invalid size Aborted Core dumped HOT 4
- North America import fails HOT 1
- nlohmann-json is missing from the Alpine build dependencies command
- Error in reprocessing of ways in relation, if osmc_symbols-tag of the relation contains the word 'backslash' HOT 6
- Deprecating -i,--tablespace-index? HOT 5
- Out of bounds tile coordinates produced HOT 3
- as_linestring() and likely other geometry transform functions can only be called from the process_way() function, not custom functions. HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from osm2pgsql.