pg_plan_inspector

pg_plan_inspector is being developed as a framework to monitor and improve the performance of PostgreSQL using Machine Learning methods.

One movie is worth a thousand words. I demonstrate that a tool provided by this framework shows the progress of a running query. Watch this: query-progress-example01.mp4

query-progress-example01.mp4

This is a POC model, and the primary purposes of this framework are:

To implement an external module that monitors the state of the running queries. (Refer to [3].)
To show that the PostgreSQL optimizer can be improved from providing feedback on the analysis of the executed plans.

To achieve these purposes, I am currently developing two modules: pg_query_plan and plan_analyzer.

The development of this framework has just begun. The movie shown above is the result of one of the by-products of developing these modules.

Terminology

Executed plan
An executed plan is a plan that contains both estimated and actual values, i.e. it is the same as the result of the EXPLAIN command with ANALYZE option.
Query plan
A query plan is a snapshot of the executed plan, which is taken by the pg_query_plan module. It also contains both estimated and actual values, and the actual values are the snapshot values when it is taken.
Machine Learning
Many methods I learned in statistics as a student are now classified as Machine Learning methods. Thanks to the new AI era, I can use the word Machine Learning without hesitation.

Supported Versions
Installation and Usage
Tentative Conclusion
Future work
Related works
Limitations and Warning
Version
References
Change Log

1. Supported Versions

This framework supports PostgreSQL versions 13 and 14 beta2.

2. Installation and Usage

This framework is composed of two modules: pg_query_plan and plan_analyzer.

The installations and their usages are described in README-pg_query_plan.md and README-plan_analyzer.md, respectively.

3. Tentative Conclusion

pg_query_plan

As shown in README-pg_query_plan.md, the pg_query_plan module can be monitored through the state of running queries. In other words, the first main purpose has been achieved by this module.

plan_analyzer

In README-tools.md, I introduced an example that a tool, called analyze.py, detects the functional dependency between the attributes of the table by applying a simple heuristics rule to the executed plans. This is an ad hoc; however, at least it shows that the second main purpose is not wrong and is possible.

As shown in query-progress-example01.mp4 and README-plan_analyzer.md, I have implemented the tool that shows the progress of a running query with some accuracy. More precisely, the tool gives relatively accurate results if the Plan Rows estimated by the optimizer can be corrected by a linear regression model, otherwise the results will be inaccurate, for example, if the distribution of the table data has changed considerably. In addition, the linear regression for correction of the estimated rows is generally often inaccurate when hash or merge joins are included. These causes are due to the validity limit of the linear regression model for the correction of the estimated rows.

Considering these, it is obvious that the essential task is to improve the cardinality estimation of the PostgreSQL optimizer.

4. Future Work

Step 1

In the first step, I will implement a bridge software between the plan_analyzer module and PostgreSQL server to feedback the analysis results of the executed plan to the optimizer.

A naive approach is that the plan_analyzer module sends the regression parameters stored in the repository to the PostgreSQL optimizer, and the optimizer corrects the estimated Plan Rows using the regression parameters. PostgreSQL fortunately already has a mechanism to intervene in the optimizer's processing and at least two modules have been made to improve the optimizer's processing for better results: pg_dbms_stats and pg_plan_advsr.

In addition, I will improve the algorithm of finding functional dependency in the first step.

Info:

pg_plan_advsr has already achieved one of the goals of this step using another approach.

Step 2

The approach mentioned above works to a certain extent if the linear regression model for the correction of the estimated rows is valid, otherwise it will not work.

Recently, lot of research to improve DBMS functions using AI technology[6] is being conducted, and hundreds of papers have been published on cardinality estimation. These state-of-the-art methods[7,9,8,1,2,5] are attempting to go beyond the traditional methods, such as using histograms. As PostgreSQL uses a traditional method, it would be worthwhile to add ML methods to the optimizer.

This framework will be able to provide feedback on the differences between the estimated cardinality (Plan Rows) and the actual cardinality in Step 1; therefore, the feedback can be used to improve learning.

5. Related Works

Two projects have similar features to the pg_query_plan module.

pg_query_state also shows the query plan on the working backend. However, this module needs to modify the PostgreSQL core by patching. On the other hand, pg_query_plan is a pure module and does not need to modify the core.
pg_show_plans shows the execution plans of all current running SQL statements. Unlike pg_query_plan and pg_query_state, pg_show_plans shows the execution plan that only contains the estimated values and not the actual values.

Query progress indicators were previously studied [4]. However, it seems that it is not currently being extensively researched.

Two interesting demonstrations will be shown at VLDB 2021: Demonstrations.

PostCENN: PostgreSQL with Machine Learning Models for Cardinality Estimation
DBMind: A Self-Driving Platform in openGauss

6. Limitations and Warning

6.1. Limitations

These modules cannot work if the query handles the partitioned tables.
These modules cannot work if the query has custom scans or foreign scans.
These modules do not consider the effects of triggers.
The pg_query_plan module is currently not available on standby.

6.2. Warning

Use this framework at your own risk.

With the pg_query_plan module, the performance of query processing is reduced by a few percent due to the overhead of data collection.
The values provided by the plan_analyzer module are estimates, not guaranteed to be 100% accurate.

Version

Version 0.1 (POC model)

References

[1] Andreas Kipf, et al. "Learned Cardinalities:Estimating Correlated Joins with Deep Learning". In CIDR, 2019.
[2] Benjamin Hilprecht, et al. "DeepDB: Learn from Data, not from Queries!". Proceedings of the VLDB, Vol. 13, Issue 7, March 2020, pages 992–1005.
[3] Lukas Fittl. "What's Missing for Postgres Monitoring". PGCon 2020.
[4] Patil L.V., and Mane Urmila P. "Survey on SQL Query Progress Indicator". International Journal of Engineering Research & Technology (IJERT) Vol. 2, Issue 3, March 2013.
[5] Rong Zhu, Ziniu Wu, et al. "FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation". arXiv preprint arXiv:2011.09022(2020).
[6] X. Zhou, C. Chai, G. Li, and J. Sun. "Database Meets Artificial Intelligence: A Survey". TKDE, 2020.
[7] Xiaoying Wang, et al. "Are We Ready For Learned Cardinality Estimation?". arXiv:2012.14743, December 2020.
[8] Ziniu Wu, et al. "BayesCard: Revitalizing Bayesian Networks for Cardinality Estimation". arXiv:2012.14743, December 2020.
[9] Zongheng Yang, et al. "NeuroCard: One Cardinality Estimator for All Tables". Proceedings of the VLDB, Vol. 14, Issue 1, September 2020.

Change Log

28th July 2021: Version 0.1 Released.

cuulee / pg_plan_inspector Goto Github PK