Comments (14)

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

@kou
Yes, I would like to.
I reproduced this problem with a CSV file created with a spreadsheet tool and an IDE.
So, I will write a script that produces such a CSV.

from csv.

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

@kou
I was able to reproduce this bug locally.

kou avatar kou commented on August 23, 2024

Thanks for your report.
It may be a bug in the internal chunk-based stream parser:

csv/lib/csv/parser.rb

Lines 299 to 308 in 22e62bc

chunk = input.gets(@row_separator, @chunk_size)
if chunk
  raise InvalidEncoding unless chunk.valid_encoding?
  # trace(__method__, :chunk, chunk)
  @scanner = StringScanner.new(chunk)
  if input.respond_to?(:eof?) and input.eof?
    @inputs.shift
    @last_scanner = @inputs.empty?
  end
  true
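
To make the chunked read in this excerpt easier to picture, here is a minimal sketch of how `gets(separator, limit)` behaves on a `StringIO`. The tiny limit of 3 bytes and the sample data are assumptions chosen for illustration; they are not the parser's real `@chunk_size`, and this sketch does not claim to show the cause of the bug. It only shows that a chunk can end in the middle of a "\r\n" row separator.

```ruby
require "stringio"

# Hypothetical sample data with CRLF row endings and no trailing newline.
data = "aa\r\nbb\r\ncc"
input = StringIO.new(data)

# Read chunks the same way the excerpt does: up to the separator,
# but never more than `limit` bytes at a time.
limit = 3
chunks = []
while (chunk = input.gets("\r\n", limit))
  chunks << chunk
end

# Each chunk is at most `limit` bytes, so a chunk boundary can fall
# between "\r" and "\n".
chunks.each { |c| p c }
split_separator = chunks.any? { |c| c.end_with?("\r") }
puts "separator split across chunks: #{split_separator}"
```

Joining the chunks gives back the original data, so a consumer of these chunks has to handle separators that span two reads.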

@abcdefg-1234567 Great! Do you want to work on fixing this problem?
As the first step, could you share a script that produces a CSV that reproduces this problem?

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

@kou
I came up with the following code that creates a csv.

CSV.open('test.csv', 'w') do |csv|
  2500.times do
    csv << ['AAAA1234567890']
  end
end

However, I do not know how to produce the following conditions in code:

  • file with CRLF endings
  • file with no EOL/trailing newline

I have confirmed that the bug is reproduced when the conditions above are set manually using the IDE.
Could you please give me any ideas?

GabrielNagy avatar GabrielNagy commented on August 23, 2024

hey @abcdefg-1234567 you can use the following:

File.open('test.csv', 'w') do |f|
  2499.times do
    f.print("AAAA1234567890\r\n")
  end
  f.print("AAAA1234567890")
end
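
An alternative sketch that stays with the CSV writer: `CSV.generate` accepts a `row_sep` option, and because it terminates every row with that separator, the final one can be removed afterwards. The output path under the temporary directory is an assumption for illustration; any writable path would do.

```ruby
require "csv"
require "tmpdir"

# Generate 2500 rows, each terminated with CRLF.
data = CSV.generate(row_sep: "\r\n") do |csv|
  2500.times do
    csv << ["AAAA1234567890"]
  end
end

# Drop the final row separator so the file has no trailing newline.
data = data.delete_suffix("\r\n")

# Write in binary mode so "\r\n" is stored unchanged on every platform.
path = File.join(Dir.tmpdir, "test.csv")
File.binwrite(path, data)
```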

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

@GabrielNagy
Thank you!

kou avatar kou commented on August 23, 2024

OK. As the next step, let's reduce the size of the reproducing CSV as much as possible to make debugging easier.
If that's difficult, we can start debugging with the current CSV.
#279 (comment) may help you.

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

I have confirmed that the bug does not reproduce if the CSV has fewer than 2048 rows.

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

I have confirmed the following.

The result of "value = parse_column_value" (line 1030 of parser.rb) when @lineno=2048 is "AAAA1234567890AAAAA1234567890".

I am also wondering if changes are needed around the adjust_last_keep method.
@kou
Could you please explain the role of this method?

kou avatar kou commented on August 23, 2024

Sure.

#adjust_last_keep was introduced for fixing https://bugs.ruby-lang.org/issues/18245 .

InputsScanner acts as logically one StringScanner with multiple inputs. (StringScanner can't work with multiple strings.)

CSV::Parser may want to push back data it has already read. For example, if skip_lines is specified, CSV::Parser reads a line from its scanner (CSV::Parser::Scanner or CSV::Parser::InputsScanner) to check whether the line should be skipped. If the line isn't a skip target, CSV::Parser pushes it back and parses it as a CSV line. keep_start/keep_drop/keep_back/keep_end exist for this purpose.

adjust_last_keep is related to these keep_* methods. InputsScanner processes multiple inputs, so the target data (for example, one line for skip_lines) may be spread across multiple inputs. For example, "# a", "bc" and "\n" form one line but arrive as 3 inputs. adjust_last_keep is for this situation: if we need to concatenate data from multiple inputs, adjust_last_keep does it.
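
The idea described above can be pictured with a toy scanner. This is only a sketch under simplifying assumptions, not the csv gem's InputsScanner: the class name `ToyInputsScanner` and the `read_line` method are hypothetical, and the real implementation works on StringScanner objects rather than one growing buffer. It shows a line assembled from three inputs and a line being read, pushed back, and read again.

```ruby
# Toy illustration only; not the csv gem's implementation.
class ToyInputsScanner
  def initialize(chunks)
    @chunks = chunks.dup # inputs that have not been read yet
    @buffer = +""        # all data pulled in so far
    @pos = 0             # current read position in @buffer
    @keeps = []          # stack of remembered positions
  end

  # Remember the current position so it can be restored later.
  def keep_start
    @keeps.push(@pos)
  end

  # Rewind to the last remembered position (push back the read data).
  def keep_back
    @pos = @keeps.pop
  end

  # Forget the last remembered position without rewinding.
  def keep_drop
    @keeps.pop
  end

  # Read up to and including separator, pulling in further inputs when
  # the line spans several of them. Returns nil when nothing is left.
  def read_line(separator)
    loop do
      index = @buffer.index(separator, @pos)
      if index
        stop = index + separator.size
        line = @buffer[@pos...stop]
        @pos = stop
        return line
      end
      break if @chunks.empty?
      @buffer << @chunks.shift
    end
    return nil if @pos >= @buffer.size
    line = @buffer[@pos..]
    @pos = @buffer.size
    line
  end
end

scanner = ToyInputsScanner.new(["# a", "bc", "\n", "x,y\n"])

# First line: read it to check whether it is a comment, then discard it.
scanner.keep_start
comment = scanner.read_line("\n")
scanner.keep_drop

# Second line: read it for the check, push it back, then read it again.
scanner.keep_start
peeked = scanner.read_line("\n")
scanner.keep_back
reparsed = scanner.read_line("\n")

p comment
p peeked
p reparsed
```

Concatenating everything into one buffer sidesteps the bookkeeping that adjust_last_keep has to do in the real scanner, which is why the toy stays this small.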

I hope that this explanation helps you.

abcdefg-1234567 avatar abcdefg-1234567 commented on August 23, 2024

Thank you for your detailed explanation!
I will refer to this and continue the investigation.

kou avatar kou commented on August 23, 2024

Including the line number in each line's contents will be helpful:

File.open('/tmp/test.csv', 'w') do |f|
  lines = 2500.times.collect do |i|
    "A%013d" % i
  end
  f.print(lines.join("\r\n"))
end

Output with the test file:

...
A0000000002497
A0000000002498
A0000000002499A0000000002499

It seems that the last line was used twice.

kou avatar kou commented on August 23, 2024

I could reproduce this with the following script:

ENV["CSV_PARSER_SCANNER_TEST"] = "yes"

require "csv"

csv = CSV.new("a\r\nb", row_sep: "\r\n", strip: true, skip_lines: /\A *\z/)
csv.each do |row|
  pp row
end

Output:

["a"]
["bb"]
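
For checking a larger file, a sketch like the following compares every parsed row with the value that was written. The parser options are assumed to match the minimal reproduction; the options used in the original report are not stated in this thread. Whether any mismatch is reported depends on the installed csv version.

```ruby
require "csv"
require "tempfile"

# Same shape as the numbered test file: CRLF between rows and no
# trailing newline after the last row.
expected = 2500.times.collect { |i| "A%013d" % i }

actual = Tempfile.create(["test", ".csv"]) do |f|
  f.binmode
  f.print(expected.join("\r\n"))
  f.flush
  # Options assumed from the minimal reproduction.
  rows = CSV.read(f.path, row_sep: "\r\n", strip: true, skip_lines: /\A *\z/)
  rows.collect(&:first)
end

# Collect [value, index] pairs for rows that differ from what was written.
mismatches = expected.each_with_index.reject { |value, i| actual[i] == value }

mismatches.each do |value, i|
  puts "row #{i + 1}: expected #{value.inspect}, got #{actual[i].inspect}"
end
puts "no mismatches" if mismatches.empty?
```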
