Comments (7)
I have investigated this and it appears to be working as expected, but as a user you must be aware of the caveat. In short, this is not a memory leak!
RepoDB requires this caching in order to perform the batch operations more efficiently.
Explanation
Below is the screenshot and a very small project that we used for simulation and replication. The project requires a SQL Server instance, as that is where we ran the simulation. Unfortunately, SQL Server (and MDS) only allows a maximum of 2100 parameters, therefore you will see that I can only hit a maximum batchSize of 20, i.e. 20 rows with 50-column data entities.
Project:
This project is good for a small simulation of this kind of issue.
InsertAllMemoryLeaks-InsertAll.zip
Screenshot:
What does the program do? (A minimal sketch follows the list.)
- It first creates a table named [dbo].[BigTable] with 120 columns.
- The program then iterates infinitely, doing an InsertAll operation on every iteration.
- In every iteration, the program creates a list of BigTable entities with a maximum of 50 columns. The size of the list varies from 1 up to the maximum batchSize; this ensures that RepoDB creates and caches each buffer's command text in memory.
- It inserts the created list of BigTable entities into the table.
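Here is a minimal sketch of such a simulation loop, for reference. The BigTable entity (trimmed to a couple of properties), the connection string and the table-creation step are assumptions for illustration; only InsertAll and its batchSize parameter are the actual RepoDB API.

    using System;
    using System.Collections.Generic;
    using Microsoft.Data.SqlClient;
    using RepoDb;

    // Hypothetical, trimmed-down entity; the real simulation maps up to 50 columns.
    public class BigTable
    {
        public long Id { get; set; }
        public string Column1 { get; set; }
    }

    public class Program
    {
        public static void Main()
        {
            // Initialize the SQL Server extensions (RepoDb v1.13+; older versions use SqlServerBootstrap.Initialize()).
            GlobalConfiguration.Setup().UseSqlServer();

            const int maxBatchSize = 20; // capped by SQL Server's 2100-parameter limit
            using var connection = new SqlConnection("Server=.;Database=TestDb;Integrated Security=true;");
            var iteration = 0;

            // Table creation ([dbo].[BigTable] with 120 columns) is omitted here.
            while (true)
            {
                // Vary the list size from 1 up to maxBatchSize so that every row count
                // gets its own command text created and cached by RepoDB.
                var rows = (iteration++ % maxBatchSize) + 1;
                var entities = new List<BigTable>();
                for (var i = 0; i < rows; i++)
                {
                    entities.Add(new BigTable { Column1 = Guid.NewGuid().ToString() });
                }
                connection.InsertAll(entities, batchSize: maxBatchSize);
            }
        }
    }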
Observation
Fluctuations:
In the first few seconds the memory fluctuates a lot. This is because, when the INSERT statement is being created for the given number of rows, the library puts it into the cache.
Truth: If you insert 1 row, it will create 1 INSERT statement and cache it. If you insert 2 rows, it will create 2 INSERT statements and cache them, and so forth.
The batchSize is the maximum number of INSERT statements it will create and cache in memory. So in short, you will have 20 INSERT statements cached in memory (each of them with a different number of parameters based on the columns provided by the data entities).
Flat-Lines:
You will notice the flat lines after those 20 INSERT statements have been cached in memory. This is because the library is no longer creating an INSERT statement; instead, it simply reuses the one in the cache based on the number of rows you are inserting.
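To make the fluctuation and flat-line pattern concrete, here is a toy model (not RepoDB's actual internals) of a command-text cache keyed by row count: every previously unseen row count allocates and caches a new command text (the fluctuation), and once every row count from 1 up to the batchSize has been seen, no new allocations happen (the flat line).

    using System.Collections.Concurrent;
    using System.Linq;

    // Toy stand-in for RepoDB's command text cache; the real cache is keyed on more
    // than the row count (entity type, fields, hints, etc.).
    public static class CommandTextCache
    {
        private static readonly ConcurrentDictionary<int, string> Cache = new();

        public static string GetOrCreate(int rowCount, string[] columns)
        {
            // Only the first request for a given row count builds (and allocates)
            // the command text; every later request reuses the cached string.
            return Cache.GetOrAdd(rowCount, count =>
                string.Concat(Enumerable.Range(0, count).Select(row =>
                    $"INSERT INTO [Table] ({string.Join(", ", columns)}) VALUES " +
                    $"({string.Join(", ", columns.Select(c => $"@{c}_{row}"))});\n")));
        }

        // Grows while new row counts are seen, then flat-lines at batchSize.
        public static int CachedStatementCount => Cache.Count;
    }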
Behavior Extent
This kind of behavior is expected and is also present in both MergeAll and UpdateAll.
Conclusion
The number of cached statements varies with the batchSize you pass to the batch operations (i.e. InsertAll, MergeAll and UpdateAll). The size of each cached statement varies with the size of the entity schema (i.e. the number of columns).
Optimizations
Currently, RepoDB creates multiple per-row INSERT statements for each row-count's batch insertion. See below.
- InsertAll with 1 row, statement cached:
      INSERT INTO [Table] VALUES (ColumnFirst, ..., ColumnLast);
- InsertAll with 5 rows, statements cached:
      INSERT INTO [Table] VALUES (ColumnFirst, ..., ColumnLast);
      INSERT INTO [Table] VALUES (ColumnFirst2, ..., ColumnLast2);
      INSERT INTO [Table] VALUES (ColumnFirst3, ..., ColumnLast3);
      INSERT INTO [Table] VALUES (ColumnFirst4, ..., ColumnLast4);
      INSERT INTO [Table] VALUES (ColumnFirst5, ..., ColumnLast5);
- And so forth...
The statements above are verbose and also do not use the more optimal multi-row insert. This can be optimized as shown below.
- InsertAll with 3 rows, statement cached:
      INSERT INTO [Table] VALUES (ColumnFirst, ..., ColumnLast), (ColumnFirst2, ..., ColumnLast2), (ColumnFirst3, ..., ColumnLast3);
With that approach, it eliminates a lot of characters from the memory.
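As a rough illustration of the saving, here is a sketch (not RepoDB code) that builds both shapes of command text for the same batch and compares their lengths; the table and column names are placeholders.

    using System;
    using System.Linq;

    public static class StatementShapes
    {
        // Per-row INSERT statements: the shape that is currently cached per batch.
        public static string PerRowInserts(int rows, string[] columns) =>
            string.Concat(Enumerable.Range(0, rows).Select(row =>
                $"INSERT INTO [Table] ({string.Join(", ", columns)}) " +
                $"VALUES ({string.Join(", ", columns.Select(c => $"@{c}_{row}"))});\n"));

        // A single multi-row VALUES statement: the proposed optimization.
        public static string MultiRowInsert(int rows, string[] columns) =>
            $"INSERT INTO [Table] ({string.Join(", ", columns)}) VALUES " +
            string.Join(", ", Enumerable.Range(0, rows).Select(row =>
                $"({string.Join(", ", columns.Select(c => $"@{c}_{row}"))})")) + ";";

        public static void Main()
        {
            var columns = Enumerable.Range(1, 50).Select(i => $"Column{i}").ToArray();
            // The per-row shape repeats the column list 20 times; the multi-row shape only once.
            Console.WriteLine(PerRowInserts(20, columns).Length);
            Console.WriteLine(MultiRowInsert(20, columns).Length);
        }
    }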
For more information, I am referencing these discussions from Twitter.
Link: https://twitter.com/mike_pendon/status/1636489405098491910
Referencing: #380
We will create the simulation in the PostgreSQL database; we are hopeful that this 2100-parameter limit is not present there so we can simulate your use case. We will post the result here once done.
If you use BulkInsert (the equivalent of BinaryBulkImport), you are leveraging the real bulk operations and these caches are skipped. In the screenshot below, you will notice that we adjusted the batchSize to 1000 with 100 columns and yet the memory usage is not increasing.
You will also notice that the higher the batchSize, the faster it inserts the data. This is because the real bulk insert works best when all the data is brought from the client to the database server at once - there it performs much better.
Note: If you bulk insert 1000 rows and you set the batchSize to 1, it will iterate 1000 times (this is the behavior of SqlBulkCopy itself). Therefore, if you bulk insert 1000 rows and you set the batchSize to 1000 as well, it will bring all data at once.
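For reference, a minimal sketch of the BulkInsert call being described, assuming the RepoDb.SqlServer.BulkOperations package, a hypothetical connection string and the BigTable entity from the earlier sketch:

    using System;
    using System.Linq;
    using Microsoft.Data.SqlClient;
    using RepoDb;

    // Requires the RepoDb.SqlServer.BulkOperations package (BulkInsert wraps SqlBulkCopy).
    GlobalConfiguration.Setup().UseSqlServer();
    using var connection = new SqlConnection("Server=.;Database=TestDb;Integrated Security=true;");

    // 1000 rows of the (hypothetical) BigTable entity from the earlier sketch.
    var entities = Enumerable.Range(0, 1000)
        .Select(_ => new BigTable { Column1 = Guid.NewGuid().ToString() })
        .ToList();

    // batchSize matches the row count, so SqlBulkCopy streams all rows in one go;
    // with batchSize: 1 it would iterate 1000 times instead.
    connection.BulkInsert(entities, batchSize: 1000);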
EDIT: Attaching InsertAllMemoryLeaks-BulkInsert.zip.
Hmmm - interestingly, it seems I can replicate your issue in PostgreSQL. I modified the program and enabled a maximum batch size of 100 with 120 columns in the table.
Below is a screenshot of the first few seconds of the run. The memory first bursts up to 900 MB, drops suddenly and then slowly climbs up, exactly as you explained on Twitter.
A few minutes later, the memory is not going down. It even reaches 2 GB, and is still climbing.
But, once the library has cached all the row-batch statements (100 statements in total), the execution suddenly becomes faster and the memory allocations flat-line.
After 5 minutes, the line is flat and not climbing anymore.
Here is the project that replicates your case.
InsertAllMemoryLeaksPostgreSql.zip
In conclusion, this is not a memory leak. Since your big data entities require large cached statements, the library needs that much memory to execute your insertions fast.
The BinaryBulkInsert only requires 40 MB as it does not cache anything.
Project: InsertAllMemoryLeaksPostgreSql-BinaryBulkInsert.zip
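For completeness, a minimal sketch of the PostgreSQL call being described, assuming the RepoDb.PostgreSql.BulkOperations package (the binary COPY based operations), a hypothetical connection string and the BigTable entity from the earlier sketches:

    using System;
    using System.Linq;
    using Npgsql;
    using RepoDb;

    // Initialize the PostgreSQL extensions (RepoDb v1.13+; older versions use PostgreSqlBootstrap.Initialize()).
    GlobalConfiguration.Setup().UsePostgreSql();
    using var connection = new NpgsqlConnection("Host=localhost;Database=TestDb;Username=postgres;Password=postgres;");

    // 1000 rows of the (hypothetical) BigTable entity from the earlier sketches.
    var entities = Enumerable.Range(0, 1000)
        .Select(_ => new BigTable { Column1 = Guid.NewGuid().ToString() })
        .ToList();

    // Streams the rows via PostgreSQL's binary COPY protocol; no per-row-count
    // command texts are generated or cached, hence the flat ~40 MB memory usage.
    connection.BinaryBulkInsert(entities);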