jamis / bulk_insert
Efficient bulk inserts with ActiveRecord
License: MIT License
rails: 6.0
ruby: 2.7.2
bulk_insert: 1.8.1
I used:
WrittenPadQuestion.bulk_insert(.....) do |worker|
.....
.....
.....
end
app/models/questions.rb:
belongs_to :original_question, optional: true
has_many :question_tag_relations, dependent: :destroy
has_many :tags, through: :question_tag_relations
before_create :setup_tags
private
def setup_tags
self.tag_ids = original_question.tag_ids
end
The before_create callback is not firing. Is there any solution?
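Because bulk_insert composes raw multi-row INSERT statements, model callbacks such as before_create never run. One possible workaround is to do the callback's work in plain Ruby before handing rows to the worker; a minimal sketch with hypothetical data (the Original struct and names below are illustrative, not the gem's API):

```ruby
# bulk_insert writes raw multi-row INSERTs, so before_create never fires.
# Workaround sketch: compute what setup_tags would have done in plain Ruby
# first, then bulk-insert the derived rows. All names here are hypothetical.
Original = Struct.new(:id, :tag_ids)
originals = [Original.new(1, [10, 11]), Original.new(2, [12])]

# Rows the before_create hook would have produced for question_tag_relations:
relation_rows = originals.flat_map do |q|
  q.tag_ids.map { |tag_id| { question_id: q.id, tag_id: tag_id } }
end

relation_rows.each { |row| puts row.inspect }
# These rows could then be passed to something like
# QuestionTagRelation.bulk_insert(values: relation_rows)
```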
The current implementation of update_duplicates updates all columns when there is a conflict during the insert phase. This is not always desirable; a classic case is created_at. If an insert fails and we want an update to happen instead, we would typically want created_at to keep its existing value. An option to control this behaviour would be great.
Firstly, I really like your project! Really helpful 👍
I would suggest moving the code in the BulkInsert module into ActiveRecord::Base by using the inherited callback. I've coded a POC here: inherited_callback. I decided to create a bulk_insert method per model, but it's possible to define the method only once in ActiveRecord::Base:
module ActiveRecord
class Base
def self.inherited(child_class)
super
return if methods.include?(:insert_bulk)
...
end
end
end
So you get the same flexibility and behaviour with fewer constraints.
What do you think about this design? Maybe I missed something? :)
When trying to insert 60k records in one go, the total number of rows in the database ends up as 60k * 7.
ActiveRecord::Base.establish_connection(adapter: 'sqlserver', host: "xxx", username: "xxx", password: "xxx", database: "xxx", azure: true, port: 1433, timeout: 5000)
class Report < ActiveRecord::Base
self.primary_key = 'id'
end
my_array = [] #count of 100000 records
Report.bulk_insert(:account_owner_id) do |worker|
my_array.each do |arr|
worker.add account_owner_id: arr[0]
end
end
This issue occurs when I run the insert from a cron/scheduler. It also occurs when the target table already contains a lot of data.
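If the scheduler fires overlapping or repeated runs, the same payload can be inserted several times, which would explain a multiplied row count. Deduplicating the payload before inserting (ideally together with a unique index and ignore: true) keeps reruns from multiplying rows; a pure-Ruby sketch of the dedupe step, with hypothetical data:

```ruby
# Hypothetical payload: one account_owner_id per row, with repeats left over
# from overlapping runs. Deduplicate before handing rows to the worker.
my_array = [[1], [2], [1], [3], [2]]
unique_rows = my_array.uniq { |arr| arr[0] }
puts unique_rows.inspect  # only one row per account_owner_id remains
```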
I might be wrong, but at bulk_insert/lib/bulk_insert/worker.rb:67 you seem to be requiring Rails instead of ActiveRecord, and that causes the following when saving a worker:
*** NameError Exception: uninitialized constant BulkInsert::Worker::Rails
I discovered it while using the gem in a project where I only use ActiveRecord. It can be bypassed by defining a Rails module that responds to version inside the project.
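The bypass described above can be sketched as a small shim. This assumes the gem only ever calls Rails.version in the failing code path:

```ruby
# Shim sketch: define a minimal Rails module so BulkInsert's reference to
# Rails resolves in a non-Rails project. Assumes only Rails.version is used.
unless defined?(Rails)
  module Rails
    def self.version
      '0.0.0' # any parseable version string; we are not actually running Rails
    end
  end
end

puts Rails.version
```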
Upgrading from v1.6.0 to v1.7.0, I'm seeing the following failure when calling save! on a bulk worker (using the mysql2 v0.4.5 database connector against a MySQL 5.7.23 server):
D, [2018-10-11T09:17:07.739444 #72035] DEBUG -- : 2018-10-11 09:17:07.734938 D [72035:70195788170560] (0.727ms) ActiveRecord::Base -- SQL -- {:sql=>"INSERT INTO `recent_files` (`repo_id`,`md5`,`size`,`created_at`,`updated_at`) VALUES (1,'ae3e83e2fab3a7d8683d8eefabd1e74d',3,'2018-10-11 16:17:07','2018-10-11 16:17:07')"}
E, [2018-10-11T09:17:07.740832 #72035] ERROR -- : 2018-10-11 09:17:07.735079 E [72035:70195788170560 bulk_insert_thread.rb:29] BulkInsertThread -- Exception: NoMethodError: undefined method `fields' for nil:NilClass
.../gems/activerecord-4.2.10/lib/active_record/connection_adapters/mysql2_adapter.rb:222:in `exec_query'
.../gems/bulk_insert-1.7.0/lib/bulk_insert/worker.rb:92:in `execute_query'
.../gems/bulk_insert-1.7.0/lib/bulk_insert/worker.rb:82:in `save!'
.../lib/modules/bulk_insert_thread.rb:27:in `block in initialize'
When not using *_id for a relation when building the attribute hash, .bulk_insert runs without errors, but no records are inserted.
Proposal:
Rails 5 is almost here !
http://weblog.rubyonrails.org/2016/2/27/Rails-5-0-beta3/
Any plans to support it?
In case of a JSON column, bulk_insert will incorrectly insert the default value: it inserts "{}" (i.e. a string) instead of {} (i.e. an object).
Here's a failing test:
begin
require "bundler/inline"
rescue LoadError => e
$stderr.puts "Bundler version 1.10 or later is required. Please update your Bundler"
raise e
end
gemfile(true) do
source "https://rubygems.org"
gem "rails"
gem "sqlite3"
gem "bulk_insert"
end
require "active_record"
require "minitest/autorun"
require "logger"
# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Schema.define do
create_table :profiles, force: true do |t|
t.string :name
t.json :json_data, default: {}, null: false
end
end
class Profile < ActiveRecord::Base
end
class BugTest < Minitest::Test
def test_bulk_insert_json_default
worker = Profile.bulk_insert
worker.add(name: "Foo", json_data: {})
worker.add(name: "Bar")
worker.save!
profile1 = Profile.find_by!(name: "Foo")
profile2 = Profile.find_by!(name: "Bar")
assert_equal profile1.json_data, {}
assert_equal profile2.json_data, {}
end
end
The reason is the following code:
bulk_insert/lib/bulk_insert/worker.rb
Lines 51 to 64 in ab5db08
In case a value does not exist, the column default is used. The column default in Rails is expressed as a string (i.e. in the example above, Profile.columns_hash["json_data"].default == "{}"). This is problematic for JSON columns, as both "{}" and {} are valid JSON (one a string, the other an object).
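The string/object distinction is easy to see with Ruby's json stdlib:

```ruby
require 'json'

# The stored column default is the string "{}"; serializing it again yields
# a JSON *string*, while serializing an empty hash yields a JSON *object*.
puts JSON.generate("{}")  # "\"{}\""  -- a JSON string
puts JSON.generate({})    # "{}"      -- a JSON object

# Both parse as valid JSON, but to different Ruby values:
p JSON.parse('"{}"')  # a String
p JSON.parse('{}')    # a Hash
```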
Column default values like '' or false aren't used when a value is omitted for that column. For example, if I have a table foo with a column bar that has a default value of '', and I don't provide a value for bar in my call to bulk_insert, the newly inserted row will have a bar value of NULL, not '' as I expect.
The issue seems to be the use of #present? to determine whether a column has a default value:
bulk_insert/lib/bulk_insert/worker.rb
Line 50 in 3ded6b0
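The failure mode is easy to reproduce with a stand-in for ActiveSupport's #present? (assumed semantics: false for nil, empty strings, and false):

```ruby
# Stand-in for ActiveSupport's #present? (assumption: equivalent to !blank?,
# where '' and false are blank). With it, defaults of '' or false are
# never substituted and the column falls through to NULL.
def present?(value)
  value.respond_to?(:empty?) ? !value.empty? : !!value
end

defaults = { status: 'active', note: '', admin: false }

defaults.each do |column, default|
  # Mirrors the worker's check: only a "present" default gets used.
  used = present?(default) ? default : nil
  puts "#{column}: #{used.inspect}"
end
# A nil check (default.nil?) would keep '' and false as defaults.
```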
Rails 6.x brings native support for bulk inserts and upserts (insert_all / upsert_all).
This library now only makes sense for Rails < 6.x.
I saw a case where, when I try to add rows with a given ID, the ID is not used; the code shows it is being removed. Why?
It seems like counter_cache columns aren't updated after a bulk insert. What's the best way to remedy this? It would be bad to have to run reset_counters on every record that's affected.
Some DBs support RETURNING in INSERT.
Postgres example: http://www.postgresql.org/docs/9.1/static/sql-insert.html
Currently, when using ignore: true, the gem emits INSERT IGNORE INTO, which is invalid PG syntax. Since PG 9.5 there is working syntax: INSERT ... ON CONFLICT DO NOTHING.
As I just learned, create callbacks do not run when using this gem. Is there any solution for this issue?
I am using bulk insert on a Rails table. My id column is named PRODUCT_ID, and created_at and updated_at are named RECORD_CREATE_DATE and RECORD_UPDATE_DATE respectively.
The bulk insert leaves these values as NULL, and I get an error because these fields have a not-null constraint. How do I solve this?
I can see it's supported for MySQL, but there's no DO NOTHING for Postgres. Any reason for that?
It's supported since 9.5: https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.5#INSERT_..._ON_CONFLICT_DO_NOTHING.2FUPDATE_.28.22UPSERT.22.29
Make sure that Ruby 2.7 builds can run.
Currently there is a minor issue with the Bundler installation on Travis:
The command "eval bundle install --jobs=3 --retry=3 --deployment " failed. Retrying, 2 of 3.
Traceback (most recent call last):
4: from /home/travis/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `<main>'
3: from /home/travis/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `eval'
2: from /home/travis/.rvm/gems/ruby-2.7.1/bin/bundle:23:in `<main>'
1: from /home/travis/.rvm/rubies/ruby-2.7.1/lib/ruby/2.7.0/rubygems.rb:294:in `activate_bin_path'
/home/travis/.rvm/rubies/ruby-2.7.1/lib/ruby/2.7.0/rubygems.rb:275:in `find_spec_for_exe': Could not find 'bundler' (1.17.2) required by your /home/travis/build/jamis/bulk_insert/Gemfile.lock. (Gem::GemNotFoundException)
To update to the latest version installed on your system, run `bundle update --bundler`.
To install the missing version, run `gem install bundler:1.17.2`
https://travis-ci.com/github/jamis/bulk_insert/jobs/337238175
Hello there!
I just updated to the latest version, v1.7.0, because I'm trying to use the Return Primary Keys feature. So I wrote:
while my_condition
packs << [some_value, another_value, Time.now, Time.now]
# decrement my_condition
end
destination_columns = [:column1, :column2, :created_at, :updated_at]
worker = Pack.bulk_insert(*destination_columns, return_primary_keys: true) { |worker|
packs.each { |pack|
worker.add pack
}
# That's how I use bulk_insert everywhere
}
I'm doing that because I need to associate these packs with another model. Sometimes I need to write pack.my_model_associated << [ids], or something like that. I don't know if that's possible without the Return Primary Keys feature. I was thinking about something like:
worker.result_sets.each { |pack|
pack.my_model_associated << [1, 2, 3, 4]
pack.save!
}
I know worker.result_sets is an ActiveRecord::Result and does not have an each method. But like I said, it's just an idea; I was going to code something like that.
Anyway, I get the error NoMethodError: undefined method 'result_sets' for nil:NilClass, because my worker is nil.
I cloned the repository and modified line 22 in lib/bulk_insert.rb to return worker instead of nil even when block_given? is true, and the error stopped happening. But I didn't really understand why, and I wasn't confident, so I did not submit a pull request (I'm not an expert Ruby developer). I just know worker.result_sets was no longer nil, although I still couldn't do what I wanted (bulk-insert the packs and relate them all to my_model_associated quickly).
Could you help me?
Thanks.
Ruby 2.4 has recently reached its EOL.
All previous versions can be safely considered unsupported.
I am using version 1.8.1, ruby 2.3.8, rails 4.2.5
I am trying to use non-block mode, passing an array of attribute hashes. My model is named Gap:
worker = Gap.bulk_insert(return_primary_keys: true)
however, bulk_insert returns nil (block mode also returns nil).
Looking at the code, it seems a worker is only returned when bulk_insert is called without values or a block:
def bulk_insert(*columns, values: nil, set_size:500, ignore: false, update_duplicates: false, return_primary_keys: false)
columns = default_bulk_columns if columns.empty?
worker = BulkInsert::Worker.new(connection, table_name, primary_key, columns, set_size, ignore, update_duplicates, return_primary_keys)
if values.present?
transaction do
worker.add_all(values)
worker.save!
end
nil
elsif block_given?
transaction do
yield worker
worker.save!
end
nil
else
worker
end
end
How would I access the worker object in order to look at result_sets?
Thanks, fw
Would it be possible to extend the ignore: true option to the PostgreSQL adapter as well? When I added it to my Rails application, I got the following error:
ActiveRecord::StatementInvalid: PG::SyntaxError: ERROR: syntax error at or near "IGNORE"
LINE 1: INSERT IGNORE INTO "table_name" ("date","i...
As a suggestion/benchmark, below is the way it was implemented in the activerecord-import gem:
# postgresql_adapter.rb
# ...
def post_sql_statements( table_name, options ) # :nodoc:
sql = []
if supports_on_duplicate_key_update?
# Options :recursive and :on_duplicate_key_ignore are mutually exclusive
if (options[:ignore] || options[:on_duplicate_key_ignore]) && !options[:on_duplicate_key_update] && !options[:recursive]
sql << sql_for_on_duplicate_key_ignore( table_name, options[:on_duplicate_key_ignore] )
end
elsif options[:on_duplicate_key_ignore] && !options[:on_duplicate_key_update]
logger.warn "Ignoring on_duplicate_key_ignore because it is not supported by the database."
end
sql += super(table_name, options)
unless options[:no_returning] || options[:primary_key].blank?
primary_key = Array(options[:primary_key])
sql << " RETURNING \"#{primary_key.join('", "')}\""
end
sql
end
def sql_for_on_duplicate_key_ignore( table_name, *args ) # :nodoc:
arg = args.first
conflict_target = sql_for_conflict_target( arg ) if arg.is_a?( Hash )
" ON CONFLICT #{conflict_target}DO NOTHING"
end
I haven't found any CONTRIBUTING guideline in your project. Thank you.
I have created a test project here: https://github.com/grdw/bulk_insert_bug
When cloning that project and running ruby main.rb, the following bug appears:
Traceback (most recent call last):
23: from main.rb:18:in `<main>'
22: from /Users/gerard/.ruby/gems/bulk_insert-1.8.1/lib/bulk_insert.rb:18:in `bulk_insert'
/Users/gerard/.ruby/gems/bulk_insert-1.8.1/lib/bulk_insert/worker.rb:112:in `block (2 levels) in compose_insert_query':
undefined method `type_cast_from_column' for #<ActiveRecord::ConnectionAdapters::Mysql2Adapter:0x00007ffc9488e788> (NoMethodError)
It seems something has changed from ActiveRecord 6.0.3.4 to 6.1.0.
I think the library structure would benefit from abstracting database-specific behaviour into some sort of statement adapters. Currently it relies heavily on if branches, which are not easy to maintain or modify.
It might also be good to consider keyword arguments; it's easy to mix up the order of a long argument list.
Sample code and structure below.
module BulkInsert
module StatementAdapters
class GenericAdapter
def insert_ignore_statement
end
def on_conflict_ignore_statement
end
def on_conflict_update_statement
end
def primary_key_return_statement
end
end
end
end
require 'bulk_insert/statement_adapters/generic_adapter'
require 'bulk_insert/statement_adapters/mysql_adapter'
require 'bulk_insert/statement_adapters/postgresql_adapter'
require 'bulk_insert/statement_adapters/sqlite_adapter'
module StatementAdapter
def self.for(connection)
case connection.adapter_name
when /^mysql/i
MySQLAdapter.new
when /\APost(?:greSQL|GIS)/i
PostgreSQLAdapter.new
when /\ASQLite/i
SQLiteAdapter.new
else
GenericAdapter.new
end
end
end
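A standalone sketch of how the proposed dispatch could be exercised without a real database connection (the empty adapter classes and FakeConnection are stand-ins, and the for method repeats the dispatch above so the snippet runs on its own):

```ruby
# Stand-in adapter classes and a fake connection exposing adapter_name,
# mirroring ActiveRecord's connection API, so the dispatch runs standalone.
class GenericAdapter; end
class MySQLAdapter      < GenericAdapter; end
class PostgreSQLAdapter < GenericAdapter; end
class SQLiteAdapter     < GenericAdapter; end

module StatementAdapter
  def self.for(connection)
    case connection.adapter_name
    when /^mysql/i               then MySQLAdapter.new
    when /\APost(?:greSQL|GIS)/i then PostgreSQLAdapter.new
    when /\ASQLite/i             then SQLiteAdapter.new
    else                              GenericAdapter.new
    end
  end
end

FakeConnection = Struct.new(:adapter_name)

adapter = StatementAdapter.for(FakeConnection.new('Mysql2'))
puts adapter.class  # MySQLAdapter
```

An unknown adapter name (say, 'OracleEnhanced') falls through to GenericAdapter, so adding a new database only means adding one subclass and one case branch.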