
Fast SQLite Inserts

This project explores the fastest way to create an SQLite DB with one billion random rows.

Read this blog post for more context: Towards Inserting One Billion Rows in SQLite Under A Minute.

Leaderboard

(for 100M insertions)

| Variant | Time        |
|---------|-------------|
| Rust    | 23 seconds  |
| PyPy    | 126 seconds |
| CPython | 210 seconds |

Current Benchmark

Python

These are the current fastest CPython and PyPy numbers.

$ ./bench.sh

Sat May  8 19:42:44 IST 2021 [PYTHON] running sqlite3_opt_batched.py (100_000_000) inserts
      517.53 real       508.24 user         7.35 sys

Sat May  8 20:03:04 IST 2021 [PYPY] running sqlite3_opt_batched.py (100_000_000) inserts
      159.70 real       153.46 user         5.81 sys

Rust

These are the current fastest Rust numbers:

Mon Nov 22 18:47:26 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts

real   0m23.826s
user   0m21.685s
sys    0m2.057s

Mon Nov 22 18:47:50 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts

real   0m23.070s
user   0m27.512s
sys    0m2.465s

In Memory

Instead of writing to disk, I used a :memory: DB. These are the numbers:

Mon May 10 17:40:39 IST 2021 [RUST] basic_batched.rs (100_000_000) inserts
       31.38 real        30.55 user         0.56 sys

Mon May 10 17:39:39 IST 2021 [RUST] threaded_batched.rs (100_000_000) inserts
       28.94 real        45.02 user         2.03 sys
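
Switching to an in-memory database is essentially a one-line change. Below is a minimal sketch, assuming the rusqlite crate and a simplified stand-in for the benchmark's user table:

```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    // `open_in_memory` creates a `:memory:` database: same SQL, no disk I/O.
    let conn = Connection::open_in_memory()?;
    conn.execute(
        "CREATE TABLE user (id INTEGER PRIMARY KEY, area CHAR(6), age INTEGER, active INTEGER)",
        [],
    )?;
    // A single hard-coded row, standing in for the generated random rows.
    conn.execute("INSERT INTO user VALUES (NULL, '01234', 25, 1)", [])?;
    Ok(())
}
```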

Busy loop time

The amount of time these scripts take just to run the for loops, with no SQL inserts at all:

$ ./busy.sh

Sun May  9 13:16:01 IST 2021 [PYTHON] busy_loop.py (100_000_000) iterations
      351.14 real       347.53 user         3.39 sys

Sun May  9 13:21:52 IST 2021 [PYPY] busy_loop.py (100_000_000) iterations
       81.58 real        77.73 user         3.80 sys

Sun May  9 13:23:14 IST 2021 [RUST] busy.rs (100_000_000) iterations
       17.97 real        16.29 user         1.67 sys

Sun May  9 13:23:32 IST 2021 [RUST] threaded_busy.rs (100_000_000) iterations
        7.18 real        42.52 user         7.20 sys
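
The busy-loop variants just generate the random values and throw them away, which isolates loop and RNG overhead from SQLite itself. A minimal sketch of the idea, assuming the rand crate (the value ranges are hypothetical stand-ins for the ones the benchmark uses):

```rust
use rand::Rng; // assumes the `rand` crate

fn main() {
    let mut rng = rand::thread_rng();
    let mut sink: u64 = 0;
    // Generate values as the insert benchmarks would, but never touch SQLite.
    for _ in 0..100_000_000u64 {
        let age: u64 = rng.gen_range(0..100); // hypothetical range
        let active: u64 = rng.gen_range(0..2); // hypothetical range
        sink = sink.wrapping_add(age + active);
    }
    println!("{sink}"); // keeps the loop from being optimized away
}
```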

Community Contributions

| PR  | Author     | Result |
|-----|------------|--------|
| #2  | captn3m0   | Reduced the CPython running time by half (from 7.5 minutes to 3.5 minutes) |
| #12 | red15      | Saved 2s from Rust's running time (bringing it to 30s) |
| #19 | kerollmops | Saved 5s from Rust's running time (bringing it to 23s) |

Contributing

All contributions are welcome. If you have any ideas for improving performance, feel free to submit a PR. You may also pick one of the current open issues to work on.

License

Released under the MIT License. Check the LICENSE file for more info.

Contributors

avinassh, captn3m0, kerollmops, red15


Open Issues

better presentation

The README can be improved with graphs, instructions for running the benchmark code, and more info.

Use SQL for generation

What about using SQLite itself for test data generation?

Example:

CREATE TABLE test (
  id INTEGER PRIMARY KEY NOT NULL,
  x REAL NOT NULL,
  y REAL NOT NULL,
  z REAL NOT NULL
);
INSERT INTO test
  WITH RECURSIVE
    cnt(id, x, y, z) AS (
      VALUES (1, random(), random(), random())
      UNION ALL
      SELECT id + 1, random(), random(), random() FROM cnt WHERE id < 1000
    )
  SELECT * FROM cnt;

https://paulbradley.org/sqlite-test-data/
https://stackoverflow.com/questions/17931320/how-to-insert-random-data-into-a-sqlite-table-using-only-queries

Rust reuse memory

Rather than creating the Vec in the inner loop, you can create it once in the outer loop using with_capacity with a capacity of the batch size, so it never allocates inside the inner loop; then just reuse the Vec.

Once the execute is done, Vec::clear will drop the items without deallocating the memory. I wonder how much faster that would make it. A sketch of the pattern follows.
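
A minimal sketch of the suggested pattern; the batch size, the row values, and the commented-out insert call site are hypothetical stand-ins:

```rust
fn main() {
    let batch_size = 1_000;
    let total = 100_000;
    // Allocate the buffer once, outside the loop, with the full batch capacity.
    let mut batch: Vec<i64> = Vec::with_capacity(batch_size);
    for i in 0..total {
        batch.push(i as i64);
        if batch.len() == batch_size {
            // ... bind `batch` to the prepared INSERT and execute it here ...
            // `clear` drops the items but keeps the allocation,
            // so the next batch reuses the same memory.
            batch.clear();
        }
    }
}
```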

`basic_async` not actually asynchronous?

I might be missing something here, as I'm still new to async in Rust, but as I understand it, calling await on a future forces the runtime to serialize the execution of the program. This means that this code:

conn.execute("PRAGMA cache_size = 1000000;").await?;
conn.execute("PRAGMA locking_mode = EXCLUSIVE;").await?;
conn.execute("PRAGMA temp_store = MEMORY;").await?;

is completely synchronized, and each call to the database needs to complete before the next one starts.
Now, this seems fine, as it only happens once, outside of the benchmark loop.

But all the code in the faker function also uses await on each async call.
For example, this code:

let stmt_with_area = tx
    .prepare("INSERT INTO user VALUES (NULL, ?, ?, ?)")
    .await?;
let stmt = tx
    .prepare("INSERT INTO user VALUES (NULL, NULL, ?, ?)")
    .await?;

It seems to me that there is no reason these need to be serialized, and we should join these calls, with something like:

let (stmt_with_area, stmt) = tokio::join!(
    tx.prepare("INSERT INTO user VALUES (NULL, ?, ?, ?)"),
    tx.prepare("INSERT INTO user VALUES (NULL, NULL, ?, ?)")
);
let (stmt_with_area, stmt) = (stmt_with_area?, stmt?);

Finally, it also seems that we don't actually need to await the execution of the statements themselves inside the for loop, but that would require saving all the futures in a big vector and then joining them, which seems a bit odd and would consume a lot of memory when running the loop for 10e6 iterations.

proper benchmarking script

The current benchmarking code is a bash script hacked together. It can be improved to take parameters and to reduce the code duplication involved in calling the same code.

Statically linked with SQLite3 with LTO?

I wonder if there would be any improvement if the Rust version were statically linked with SQLite, with LTO enabled.

It may inline a few functions, drop many others, and might improve performance. A hypothetical sketch of the build changes follows.
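
A hypothetical sketch, assuming the Rust variants use the rusqlite crate (the version number is a placeholder): the `bundled` feature compiles SQLite from source and links it statically, and `lto = true` enables link-time optimization across crates for release builds.

```toml
# Cargo.toml (sketch): statically link SQLite and enable LTO.
[dependencies]
rusqlite = { version = "0.26", features = ["bundled"] }  # `bundled` builds and statically links SQLite

[profile.release]
lto = true  # link-time optimization across crates
```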

Code clean up / refactor

The Rust versions are messy and could use some refactoring love.

Secondly, a lot of code is repeated; applying DRY would make it better.
