jamis / bulk_insert
Efficient bulk inserts with ActiveRecord
License: MIT License
rails: 6.0
ruby: 2.7.2
bulk_insert: 1.8.1
I used:
WrittenPadQuestion.bulk_insert(.....) do |worker|
.....
.....
.....
end
app/models/questions.rb:
belongs_to :original_question, optional: true
has_many :question_tag_relations, dependent: :destroy
has_many :tags, through: :question_tag_relations
before_create :setup_tags
private
def setup_tags
self.tag_ids = original_question.tag_ids
end
The before_create callback is not firing. Is there any solution?
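Because bulk_insert composes raw multi-row INSERT statements, model callbacks such as before_create never run. One possible workaround is to do the callback's work in plain Ruby before handing rows to the worker; a minimal sketch with hypothetical data (the Original struct and names below are illustrative, not the gem's API):

```ruby
# bulk_insert writes raw multi-row INSERTs, so before_create never fires.
# Workaround sketch: compute what setup_tags would have done in plain Ruby
# first, then bulk-insert the derived rows. All names here are hypothetical.
Original = Struct.new(:id, :tag_ids)
originals = [Original.new(1, [10, 11]), Original.new(2, [12])]

# Rows the before_create hook would have produced for question_tag_relations:
relation_rows = originals.flat_map do |q|
  q.tag_ids.map { |tag_id| { question_id: q.id, tag_id: tag_id } }
end

relation_rows.each { |row| puts row.inspect }
# These rows could then be passed to something like
# QuestionTagRelation.bulk_insert(values: relation_rows)
```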
The current implementation of update_duplicates updates all columns when there is a conflict during the insert phase. This is not always desirable; a classic case is created_at. If an insert fails and we want an update to happen instead, we would typically want created_at to keep its existing value. An option to control this behaviour would be great.
Firstly, I really like your project! Really helpful 👍
I would suggest moving the code in the BulkInsert module into ActiveRecord::Base by using the inherited callback. I've coded a POC here: inherited_callback. I decided to create a bulk_insert method per model, but it's possible to define the method only once in ActiveRecord::Base:
module ActiveRecord
class Base
def self.inherited(child_class)
super
return if methods.include?(:insert_bulk)
...
end
end
end
So you get the same flexibility and behaviour with fewer constraints.
What do you think about this design? Maybe I missed something? :)
When trying to insert 60k records in one go, the total number of rows in the database ends up as 60k * 7.
ActiveRecord::Base.establish_connection(adapter: 'sqlserver', host: "xxx", username: "xxx", password: "xxx", database: "xxx", azure: true, port: 1433, timeout: 5000)
class Report < ActiveRecord::Base
self.primary_key = 'id'
end
my_array = [] #count of 100000 records
Report.bulk_insert(:account_owner_id) do |worker|
my_array.each do |arr|
worker.add account_owner_id: arr[0]
end
end
This issue occurs when I run the insert from a cron/scheduler. It also occurs when the target table already contains a lot of data.
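If the scheduler fires overlapping or repeated runs, the same payload can be inserted several times, which would explain a multiplied row count. Deduplicating the payload before inserting (ideally together with a unique index and ignore: true) keeps reruns from multiplying rows; a pure-Ruby sketch of the dedupe step, with hypothetical data:

```ruby
# Hypothetical payload: one account_owner_id per row, with repeats left over
# from overlapping runs. Deduplicate before handing rows to the worker.
my_array = [[1], [2], [1], [3], [2]]
unique_rows = my_array.uniq { |arr| arr[0] }
puts unique_rows.inspect  # only one row per account_owner_id remains
```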
I might be wrong, but at bulk_insert/lib/bulk_insert/worker.rb:67 you seem to be requiring Rails instead of ActiveRecord, and that causes the following when saving a worker:
*** NameError Exception: uninitialized constant BulkInsert::Worker::Rails
I discovered it while using the gem in a project where I only use ActiveRecord. It can be bypassed by defining a Rails module that responds to version inside the project.
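The bypass described above can be sketched as a small shim. This assumes the gem only ever calls Rails.version in the failing code path:

```ruby
# Shim sketch: define a minimal Rails module so BulkInsert's reference to
# Rails resolves in a non-Rails project. Assumes only Rails.version is used.
unless defined?(Rails)
  module Rails
    def self.version
      '0.0.0' # any parseable version string; we are not actually running Rails
    end
  end
end

puts Rails.version
```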
Upgrading from v1.6.0 to v1.7.0, I'm seeing the following failure when calling save! on a bulk worker (using the mysql2 v0.4.5 database connector against a MySQL 5.7.23 server):
D, [2018-10-11T09:17:07.739444 #72035] DEBUG -- : 2018-10-11 09:17:07.734938 D [72035:70195788170560] (0.727ms) ActiveRecord::Base -- SQL -- {:sql=>"INSERT INTO `recent_files` (`repo_id`,`md5`,`size`,`created_at`,`updated_at`) VALUES (1,'ae3e83e2fab3a7d8683d8eefabd1e74d',3,'2018-10-11 16:17:07','2018-10-11 16:17:07')"}
E, [2018-10-11T09:17:07.740832 #72035] ERROR -- : 2018-10-11 09:17:07.735079 E [72035:70195788170560 bulk_insert_thread.rb:29] BulkInsertThread -- Exception: NoMethodError: undefined method `fields' for nil:NilClass
.../gems/activerecord-4.2.10/lib/active_record/connection_adapters/mysql2_adapter.rb:222:in `exec_query'
.../gems/bulk_insert-1.7.0/lib/bulk_insert/worker.rb:92:in `execute_query'
.../gems/bulk_insert-1.7.0/lib/bulk_insert/worker.rb:82:in `save!'
.../lib/modules/bulk_insert_thread.rb:27:in `block in initialize'
When not using *_id for a relation when building the attribute hash, .bulk_insert runs without errors, but no records are inserted.
Proposal:
Rails 5 is almost here !
http://weblog.rubyonrails.org/2016/2/27/Rails-5-0-beta3/
Any plans to support it?
In case of a JSON column, bulk_insert will incorrectly insert the default value: it inserts "{}" (i.e. a string) instead of {} (i.e. an object).
Here's a failing test:
begin
require "bundler/inline"
rescue LoadError => e
$stderr.puts "Bundler version 1.10 or later is required. Please update your Bundler"
raise e
end
gemfile(true) do
source "https://rubygems.org"
gem "rails"
gem "sqlite3"
gem "bulk_insert"
end
require "active_record"
require "minitest/autorun"
require "logger"
# This connection will do for database-independent bug reports.
ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")
ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Schema.define do
create_table :profiles, force: true do |t|
t.string :name
t.json :json_data, default: {}, null: false
end
end
class Profile < ActiveRecord::Base
end
class BugTest < Minitest::Test
def test_bulk_insert_json_default
worker = Profile.bulk_insert
worker.add(name: "Foo", json_data: {})
worker.add(name: "Bar")
worker.save!
profile1 = Profile.find_by!(name: "Foo")
profile2 = Profile.find_by!(name: "Bar")
assert_equal profile1.json_data, {}
assert_equal profile2.json_data, {}
end
end
The reason is the following code:
bulk_insert/lib/bulk_insert/worker.rb
Lines 51 to 64 in ab5db08
In case a value does not exist, the column default is used. The column default in Rails is expressed as a string (i.e. in the example above, Profile.columns_hash["json_data"].default == "{}"). This is problematic for JSON columns, as both "{}" and {} are valid JSON (one a string, the other an object).
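The string/object distinction is easy to see with Ruby's json stdlib:

```ruby
require 'json'

# The stored column default is the string "{}"; serializing it again yields
# a JSON *string*, while serializing an empty hash yields a JSON *object*.
puts JSON.generate("{}")  # "\"{}\""  -- a JSON string
puts JSON.generate({})    # "{}"      -- a JSON object

# Both parse as valid JSON, but to different Ruby values:
p JSON.parse('"{}"')  # a String
p JSON.parse('{}')    # a Hash
```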
Column default values like '' or false aren't used when a value is omitted for that column. For example, if I have a table foo with a column bar that has a default value of '', and I don't provide a value for bar in my call to bulk_insert, the newly inserted row will have a bar value of NULL, not '' as I expect.
The issue seems to be the use of #present? to determine whether a column has a default value:
bulk_insert/lib/bulk_insert/worker.rb
Line 50 in 3ded6b0
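The failure mode is easy to reproduce with a stand-in for ActiveSupport's #present? (assumed semantics: false for nil, empty strings, and false):

```ruby
# Stand-in for ActiveSupport's #present? (assumption: equivalent to !blank?,
# where '' and false are blank). With it, defaults of '' or false are
# never substituted and the column falls through to NULL.
def present?(value)
  value.respond_to?(:empty?) ? !value.empty? : !!value
end

defaults = { status: 'active', note: '', admin: false }

defaults.each do |column, default|
  # Mirrors the worker's check: only a "present" default gets used.
  used = present?(default) ? default : nil
  puts "#{column}: #{used.inspect}"
end
# A nil check (default.nil?) would keep '' and false as defaults.
```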
Rails 6.x brings native support for bulk inserts and upserts (insert_all / upsert_all).
This library now only makes sense for Rails < 6.x.
I saw a case where, when I try to add rows with a given ID, the ID is not used; the code shows it is being removed. Why?
It seems like counter_cache columns aren't updated after a bulk insert. What's the best way to remedy this? It would be bad to have to run reset_counters on every record that's affected.
Some DBs support RETURNING in INSERT.
Postgres example: http://www.postgresql.org/docs/9.1/static/sql-insert.html
Currently, when using ignore: true, the gem emits INSERT IGNORE INTO, which is invalid PG syntax. Since PG 9.5 there is working syntax: INSERT ... ON CONFLICT DO NOTHING.
As I just learned, create callbacks do not run when using this gem. Is there any solution for this issue?
I am using bulk insert on a Rails table. My id column is named PRODUCT_ID, and created_at and updated_at are named RECORD_CREATE_DATE and RECORD_UPDATE_DATE respectively.
The bulk insert leaves these values as NULL, and I get an error because these fields have a not-null constraint. How do I solve this?
I can see it's supported for MySQL, but there's no DO NOTHING for Postgres. Any reason for that?
It's supported since 9.5: https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.5#INSERT_..._ON_CONFLICT_DO_NOTHING.2FUPDATE_.28.22UPSERT.22.29
Make sure that Ruby 2.7 builds can run.
Currently there is a minor issue with the Bundler installation on Travis:
The command "eval bundle install --jobs=3 --retry=3 --deployment " failed. Retrying, 2 of 3.
Traceback (most recent call last):
4: from /home/travis/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `<main>'
3: from /home/travis/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `eval'
2: from /home/travis/.rvm/gems/ruby-2.7.1/bin/bundle:23:in `<main>'
1: from /home/travis/.rvm/rubies/ruby-2.7.1/lib/ruby/2.7.0/rubygems.rb:294:in `activate_bin_path'
/home/travis/.rvm/rubies/ruby-2.7.1/lib/ruby/2.7.0/rubygems.rb:275:in `find_spec_for_exe': Could not find 'bundler' (1.17.2) required by your /home/travis/build/jamis/bulk_insert/Gemfile.lock. (Gem::GemNotFoundException)
To update to the latest version installed on your system, run `bundle update --bundler`.
To install the missing version, run `gem install bundler:1.17.2`
https://travis-ci.com/github/jamis/bulk_insert/jobs/337238175
Hello there!
I just updated to the latest version, v1.7.0, because I'm trying to use the Return Primary Keys feature. So I wrote:
while my_condition
packs << [some_value, another_value, Time.now, Time.now]
# decrement my_condition
end
destination_columns = [:column1, :column2, :created_at, :updated_at]
worker = Pack.bulk_insert(*destination_columns, return_primary_keys: true) { |worker|
packs.each { |pack|
worker.add pack
}
# That's how I use bulk_insert everywhere
}
I'm doing that because I need to associate these packs with another model. Sometimes I need to write pack.my_model_associated << [ids], or something like that. I don't know if that's possible without the Return Primary Keys feature. I was thinking about something like:
worker.result_sets.each { |pack|
pack.my_model_associated << [1, 2, 3, 4]
pack.save!
}
I know worker.result_sets is an ActiveRecord::Result and does not have an each method. But like I said, it's just an idea; I was going to code something like that.
Anyway, I get the error NoMethodError: undefined method 'result_sets' for nil:NilClass, because my worker is nil.
I cloned the repository and modified line 22 in lib/bulk_insert.rb to return worker instead of nil even when block_given? is true, and the error stopped happening. But I didn't really understand why, and I wasn't confident, so I did not submit a pull request (I'm not an expert Ruby developer). I just know worker.result_sets was no longer nil, although I still couldn't do what I wanted (bulk-insert the packs and relate them all to my_model_associated quickly).
Could you help me?
Thanks.
Ruby 2.4 has recently reached its EOL.
All previous versions can be safely considered unsupported.
I am using version 1.8.1, ruby 2.3.8, rails 4.2.5
I am trying to use non-block mode, passing an array of attribute hashes. My model is named Gap:
worker = Gap.bulk_insert(return_primary_keys: true)
however, bulk_insert returns nil (block mode also returns nil).
Looking at the code, it seems a worker is only returned when bulk_insert is called without values or a block:
def bulk_insert(*columns, values: nil, set_size:500, ignore: false, update_duplicates: false, return_primary_keys: false)
columns = default_bulk_columns if columns.empty?
worker = BulkInsert::Worker.new(connection, table_name, primary_key, columns, set_size, ignore, update_duplicates, return_primary_keys)
if values.present?
transaction do
worker.add_all(values)
worker.save!
end
nil
elsif block_given?
transaction do
yield worker
worker.save!
end
nil
else
worker
end
end
How would I access the worker object in order to look at result_sets?
Thanks, fw
Would it be possible to extend the ignore: true option to the PostgreSQL adapter as well? When I added it to my Rails application, I got the following error:
ActiveRecord::StatementInvalid: PG::SyntaxError: ERROR: syntax error at or near "IGNORE"
LINE 1: INSERT IGNORE INTO "table_name" ("date","i...
As a suggestion/benchmark, below is the way it was implemented in the activerecord-import gem:
# postgresql_adapter.rb
# ...
def post_sql_statements( table_name, options ) # :nodoc:
sql = []
if supports_on_duplicate_key_update?
# Options :recursive and :on_duplicate_key_ignore are mutually exclusive
if (options[:ignore] || options[:on_duplicate_key_ignore]) && !options[:on_duplicate_key_update] && !options[:recursive]
sql << sql_for_on_duplicate_key_ignore( table_name, options[:on_duplicate_key_ignore] )
end
elsif options[:on_duplicate_key_ignore] && !options[:on_duplicate_key_update]
logger.warn "Ignoring on_duplicate_key_ignore because it is not supported by the database."
end
sql += super(table_name, options)
unless options[:no_returning] || options[:primary_key].blank?
primary_key = Array(options[:primary_key])
sql << " RETURNING \"#{primary_key.join('", "')}\""
end
sql
end
def sql_for_on_duplicate_key_ignore( table_name, *args ) # :nodoc:
arg = args.first
conflict_target = sql_for_conflict_target( arg ) if arg.is_a?( Hash )
" ON CONFLICT #{conflict_target}DO NOTHING"
end
I haven't found any CONTRIBUTING guideline in your project. Thank you.
I have created a test project here: https://github.com/grdw/bulk_insert_bug
When cloning that project and running ruby main.rb, the following bug appears:
Traceback (most recent call last):
23: from main.rb:18:in `<main>'
22: from /Users/gerard/.ruby/gems/bulk_insert-1.8.1/lib/bulk_insert.rb:18:in `bulk_insert'
/Users/gerard/.ruby/gems/bulk_insert-1.8.1/lib/bulk_insert/worker.rb:112:in `block (2 levels) in compose_insert_query':
undefined method `type_cast_from_column' for #<ActiveRecord::ConnectionAdapters::Mysql2Adapter:0x00007ffc9488e788> (NoMethodError)
It seems something has changed from ActiveRecord 6.0.3.4 to 6.1.0.
I think the library structure would benefit from abstracting database-specific behaviour into some sort of statement adapters. Currently it relies heavily on if branches, which are not easy to maintain or modify.
It might also be good to consider keyword arguments; it's easy to mix up the order of a long argument list.
Sample code and structure below.
module BulkInsert
module StatementAdapters
class GenericAdapter
def insert_ignore_statement
end
def on_conflict_ignore_statement
end
def on_conflict_update_statement
end
def primary_key_return_statement
end
end
end
end
require 'bulk_insert/statement_adapters/generic_adapter'
require 'bulk_insert/statement_adapters/mysql_adapter'
require 'bulk_insert/statement_adapters/postgresql_adapter'
require 'bulk_insert/statement_adapters/sqlite_adapter'
module StatementAdapter
def self.for(connection)
case connection.adapter_name
when /^mysql/i
MySQLAdapter.new
when /\APost(?:greSQL|GIS)/i
PostgreSQLAdapter.new
when /\ASQLite/i
SQLiteAdapter.new
else
GenericAdapter.new
end
end
end
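A standalone sketch of how the proposed dispatch could be exercised without a real database connection (the empty adapter classes and FakeConnection are stand-ins, and the for method repeats the dispatch above so the snippet runs on its own):

```ruby
# Stand-in adapter classes and a fake connection exposing adapter_name,
# mirroring ActiveRecord's connection API, so the dispatch runs standalone.
class GenericAdapter; end
class MySQLAdapter      < GenericAdapter; end
class PostgreSQLAdapter < GenericAdapter; end
class SQLiteAdapter     < GenericAdapter; end

module StatementAdapter
  def self.for(connection)
    case connection.adapter_name
    when /^mysql/i               then MySQLAdapter.new
    when /\APost(?:greSQL|GIS)/i then PostgreSQLAdapter.new
    when /\ASQLite/i             then SQLiteAdapter.new
    else                              GenericAdapter.new
    end
  end
end

FakeConnection = Struct.new(:adapter_name)

adapter = StatementAdapter.for(FakeConnection.new('Mysql2'))
puts adapter.class  # MySQLAdapter
```

An unknown adapter name (say, 'OracleEnhanced') falls through to GenericAdapter, so adding a new database only means adding one subclass and one case branch.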