clojure-csv's Introduction

Clojure-CSV

Clojure-CSV is a small library for reading and writing CSV files. The main features:

  • Both common line terminators are accepted.
  • Quoting and escaping inside CSV fields are handled correctly (specifically commas and double-quote characters).
  • Unescaped newlines embedded in CSV fields are supported when parsing.
  • Reading is lazy.
  • More permissive than RFC 4180, although there are some optional strictness checks. (Send me any bugs you find, or any correctness checks you think should be performed.)

This library aims to be as permissive as possible with respect to deviation from the standard, as long as the intention is clear. The only correctness checks made are those on the actual (minimal) CSV structure. For example, some people think it should be an error when lines in the CSV have a different number of fields -- you should check this yourself. However, it is not possible, after parsing, to tell if the input ended before the closing quote of a field; if you care, it can be signaled to you.
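A field-count check like the one mentioned above is easy to run on the parsed output. The following is a minimal sketch that works only on the sequence of vectors that parse-csv returns; the helper name ragged-rows is hypothetical and not part of the library.

```clojure
(defn ragged-rows
  "Returns [index row] pairs for every row whose field count differs
  from that of the first row. Hypothetical helper, not part of clojure-csv."
  [rows]
  (let [expected (count (first rows))]
    (keep-indexed (fn [i row]
                    (when (not= expected (count row))
                      [i row]))
                  rows)))

(ragged-rows [["a" "b"] ["c" "d" "e"] ["f" "g"]])
;; => ([1 ["c" "d" "e"]])
```

An empty result means every row has the same number of fields as the first one.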

The API has changed in the 2.0 series; see below for details.

Recent Updates

  • Updated library to 2.0.2, with a bug fix for malformed input by attil-io.

  • Updated library to 2.0.1, which adds the :force-quote option to write-csv. Big thanks to Barrie McGuire for the contribution.

  • Updated library to 2.0.0; essentially identical to 2.0.0-alpha2.

  • Updated library to 2.0.0-alpha2.

  • Rewritten parser for additional speed increases.

  • Benchmarks to help monitor and improve performance.

  • Updated the library to 2.0.0-alpha1.

  • Major update: Massive speed improvements, end-of-line string is configurable for parsing, improved handling of empty files, input to parse-csv is now a string or Reader, and a new API based on keyword args instead of rebinding vars.

Previously...

  • Updated library to 1.3.2.
  • Added support for changing the character used to start and end quoted fields in reading and writing.
  • Updated library to 1.3.1.
  • Fixed the quoting behavior on write, to properly quote any field with a CR. Thanks to Matt Lehman for this fix.
  • Updated library to 1.3.0.
  • Now has support for Clojure 1.3.
  • Some speed improvements to take advantage of Clojure 1.3. Nearly twice as fast in my tests.
  • Updated library to 1.2.4.
  • Added the char-seq multimethod, which provides a variety of implementations for easily creating the char seqs that parse-csv uses on input from various similar objects. Big thanks to Slawek Gwizdowski for this contribution.
  • Includes a bug fix for a problem where a non-comma delimiter was causing incorrect quoting on write.
  • Included a bug fix to make the presence of a double-quote in an unquoted field parse better in non-strict mode. Specifically, if a CSV field is not quoted but has " characters, they are read as " with no further processing. Does not start quoting.
  • Reorganized namespaces to fit better with my perception of Clojure standards. Specifically, the main namespace is now clojure-csv.core.
  • Significantly faster on parsing. There should be additional speed improvements possible when Clojure 1.2 is released.
  • Support for more error checking with *strict* var.
  • Numerous bug fixes.

Obtaining

If you are using Leiningen, you can simply add

[clojure-csv/clojure-csv "2.0.2"]

to your project.clj and download it from Clojars with

lein deps

Use

The clojure-csv.core namespace exposes two functions to the user:

parse-csv

Takes a CSV as a string or Reader and returns a lazy sequence of vectors of strings; each vector corresponds to a row, and each string is one field from that row. Because the sequence is lazy, if you read from a file or some other resource, make sure it remains open until the sequence has been consumed.

Takes the following keyword arguments to change parsing behavior:

:delimiter

The character that separates the cells within a row.

Default value: \,

:end-of-line

A string containing the end-of-line character for reading CSV files. If this setting is nil then \n and \r\n are both accepted.

Default value: nil

:quote-char

A character that is used to begin and end a quoted cell.

Default value: \"

:strict

If this option is true, the parser will throw an exception on parse errors that are recoverable but not to spec or otherwise nonsensical.

Default value: false
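As an illustration of the options above, here is a short sketch of typical parse-csv calls. It assumes the library is on the classpath; the read-rows helper and its path argument are hypothetical and only show one way to handle the laziness caveat when reading from a file.

```clojure
(require '[clojure-csv.core :refer [parse-csv]]
         '[clojure.java.io :as io])

;; Default settings: comma-delimited, \n or \r\n line endings.
(parse-csv "a,b\nc,d")
;; => (["a" "b"] ["c" "d"])

;; Semicolon-delimited input with the optional strictness checks enabled.
(parse-csv "a;b\nc;d" :delimiter \; :strict true)

;; Hypothetical helper: parse a file, forcing the lazy sequence with
;; doall so that every row is read before with-open closes the Reader.
(defn read-rows [path]
  (with-open [rdr (io/reader path)]
    (doall (parse-csv rdr))))
```

For files too large to hold in memory, the rows would instead be consumed inside the with-open body rather than returned from it.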

write-csv

Takes a sequence of sequences of strings, basically a table of strings, and renders that table into a string in CSV format. You can easily call this function repeatedly row-by-row and concatenate the results yourself.

Takes the following keyword arguments to change the written file:

:delimiter

The character that separates the cells within a row.

Default value: \,

:end-of-line

A string containing the end-of-line character for writing CSV files.

Default value: \n

:quote-char

A character that is used to begin and end a quoted cell.

Default value: \"

:force-quote

If this option is true, the output will have every field quoted, whether this is needed or not. This can apparently be helpful for interoperating with Excel.

Default value: false
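A brief sketch of write-csv with the options above, assuming the library is on the classpath. The first result is the one reported in the issue list further down; the other calls are shown without output.

```clojure
(require '[clojure-csv.core :refer [write-csv]])

;; A table with one row and one field.
(write-csv [["foo bar"]])
;; => "foo bar\n"

;; Quote every field, use a semicolon delimiter and CRLF line endings.
(write-csv [["a" "b"] ["c" "d"]]
           :delimiter \;
           :end-of-line "\r\n"
           :force-quote true)

;; Row-by-row rendering: each call renders a one-row table,
;; and the resulting strings are concatenated by the caller.
(apply str (map #(write-csv [%]) [["a" "b"] ["c" "d"]]))
```

The row-by-row form is useful when rows are produced incrementally and written to a stream as they arrive.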

Changes from API 1.0

Clojure-CSV was originally written for Clojure 1.0, before many of the modern features we now enjoy in Clojure, like keyword args, an IO library and fast primitive math. The 2.0 series freshens up the API to more modern Clojure API style, language capabilities, and coding conventions. The JARs for the 1.0 series will remain available indefinitely (probably a long, long time), so if you can't handle an API change, you can continue to use it as you always have.

Here's a summary of the changes:

  • Options are now set through keyword args to parse-csv and write-csv. The dynamic vars are removed.
    • Rationale: Dynamic vars are a little annoying to rebind. This can tempt you to imprudently set them for too wide a swath of code. Reusing the same vars for both reading and writing meant that the vars had to have the same meaning in each context, or else two vars had to be introduced to accommodate the differences. Keyword args are clear, fast, explicit, and local.
  • Parsing logic is now based on Java readers instead of Clojure char seqs.
    • Rationale: Largely performance. Clojure's char seqs are not particularly fast and throw off a lot of garbage. It's not clear that working entirely with pure Clojure data structures was providing much value to anyone. When you're doing IO, Readers are close at hand in Java, and now the basis for Clojure's IO libs.
  • An empty file now parses as a file with no rows.
    • Rationale: The CSV standard actually doesn't say anything about an input that is an empty file. Clojure-CSV 1.0 would return a single row with an empty string in it. The logic was that a CSV file row is everything between the start of a line and the end of the line, where an EOF is a line terminator. This would mean an empty file is a single row that has an empty field. An alternative, and equally valid view is that if a file has nothing in it, there is no row to be had. A file that is a single row with an empty field can still be expressed in this viewpoint as a file that contains only a line terminator. The same cannot be said of the 1.0 view of things: there was no way to represent a file with no rows. In any case, I went and looked at many other CSV parsing libraries for other languages, and they universally took the view that an empty CSV file has no rows, so now Clojure-CSV does as well.
  • The end-of-line option can now be set during parsing. If end-of-line is set to something other than nil, parse-csv will treat \n and \r\n as any other character and only use the string given in end-of-line as the newline.
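The changes listed above can be sketched side by side. This assumes the 2.0 library is on the classpath; the 1.x form is shown only as a comment, using the *delimiter* var that appears in older bug reports.

```clojure
(require '[clojure-csv.core :refer [parse-csv]])

;; 1.x style, no longer available in 2.0: rebind a dynamic var.
;; (binding [*delimiter* \|]
;;   (parse-csv "a|b"))

;; 2.0 style: pass the option as a keyword argument.
(parse-csv "a|b" :delimiter \|)

;; An empty input now parses as no rows at all.
(empty? (parse-csv ""))
;; => true

;; With :end-of-line set, only that string ends a row. Here a bare
;; carriage return is used, as in return-only files.
(parse-csv "a,b\rc,d" :end-of-line "\r")
```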

Bugs

Please let me know of any problems you are having.

Contributors

License

Eclipse Public License

clojure-csv's People

Contributors

davidsantiago, i0cus, mlehman

clojure-csv's Issues

More relaxed `quote-char` parsing with `strict = false`

Hi David,

I've recently faced issues with Google Docs exported sheet, since (of course) it does not escape quotes. The issue happens if field starts with quoted value, but then continues unquoted. Minimal reproduction case would be:

Lewis;Huey;"Hip To Be Square" by Huey Lewis and the News

It chokes on the \space right after the second quote. I'd expect it to throw an exception with strict = true, but with strict = false I'd like it to recover into:

["Lewis", "Huey", "\"Hip To Be Square\" by Huey Lewis and the News"]

I did manage to prototype a fix within parse-csv-line to achieve this, it would however affect quoting related change done in 2.0.2 -- as it modifies default behavior (with strict = false). Strict mode still fails as expected. Related 2.0.2 quoting tests are currently a regression and would have to be moved to strictness tests section.

Would you be interested?

Parsing/casting function support?

It would be nice to be able to parse/cast data as it comes in from a csv file. For instance, convert numerical columns by specifying something like :cast-fns {2 #(Integer/parseInt %)}. Would be even extra awesome to have "file sniffing" support, as R does, to look ahead and guess types in columns. As with #26, happy to contribute this or create my own project with these additional features.
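Until something like this exists in the library, per-column casting can be done on the parsed rows. A minimal sketch, assuming a map from column index to conversion function; cast-row is a hypothetical helper name and :cast-fns is not an existing parse-csv option.

```clojure
(defn cast-row
  "Applies cast-fns, a map of column index -> function, to one parsed row.
  Columns without an entry are left as strings. Hypothetical helper."
  [cast-fns row]
  (vec (map-indexed (fn [i field]
                      (if-let [f (get cast-fns i)]
                        (f field)
                        field))
                    row)))

(map (partial cast-row {2 #(Long/parseLong %)})
     [["Aldershot" "Leo Docherty" "12"]])
;; => (["Aldershot" "Leo Docherty" 12])
```

Because map is lazy, the conversion composes with the lazy sequence returned by parse-csv without reading the whole input up front.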

optional deactivation of quoting recognition

Pseudo-CSV files that do not quote strings are not rare. I have to work with them.

My experiments with disabling quoting by setting quote-char to nil led to an exception.

Would it be possible to allow such an empty quote-char, or to provide a boolean flag like "quoted-string?"?

nil cells cause NPE

Given a sequence that contains nil entries, needs-quote? throws NullPointerException:

java.lang.NullPointerException: null
core.clj:228 clojure-csv.core/needs-quote?
core.clj:244 clojure-csv.core/quote-and-escape
core.clj:255 clojure-csv.core/quote-and-escape-row[fn]

Dependency change from clojure.contrib to clojure.core >=1.2

I see you're depending on clojure.contrib.str-utils2, but after a quick look it came to me that you are actually only making use of two functions: contains? and join. If you're willing to mark your library as "clojure 1.2 and above" you could easily drop that contrib dependency and use clojure.string/join and (.contains ^String haystack ^String needle).

This is actually more of a NIH nitpick than anything. ;-)

Version bump?

Hi David,

I've been using the 2.0 alphas with Clojure 1.4.0 for a few months now without issues. Are there any blockers that stop you from just pushing this to a full version (instead of alpha)? Anything I could help with?

Cheers!

New Line Behavior

I found a possible issue with the parse-csv function. line-seq behaves the same way as Java's line reading; is there a reason why parse-csv does not behave the same way?

 (clojure-csv.core/parse-csv "test1,test2\rtest3,test4")
 ;>>(["test1" "test2\rtest3" "test4"])

This does not behave the same as:

(line-seq (java.io.BufferedReader. (java.io.StringReader. "test1,test2\rtest3,test4")))
;>> ("test1,test2" "test3,test4")

The reason is that line-seq uses BufferedReader's readLine, which considers a line to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a line feed. parse-csv, by contrast, only treats \n and \r\n as line terminators.

http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()

Numbers aren't recognised as numbers

Given this input data row:

Aldershot;Leo Docherty;12;Hampshire;76205;26955;15477;3637;1796;1090;0;0;0

(from here)

I get this output:

["Aldershot" "Leo Docherty" "12" "Hampshire" "76205" "26955" "15477" "3637" "1796" "1090" "0" "0" "0"]

In other words, numbers are being recognised only as strings.

I wrote a wee function which sorts this:

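(require '[clojure.edn :as edn])

(defn numbers-as-numbers
  "Return a list like the sequence `l`, but with all those elements
  which are string representations of numbers replaced with numbers."
  [l]
  (map (fn [x]
         (if (string? x)
           (try
             ;; edn/read-string only reads data; unlike
             ;; clojure.core/read-string it never evaluates code.
             (let [n (edn/read-string x)]
               (if (number? n) n x))
             ;; Fields the reader cannot parse stay as strings.
             (catch Exception _ x))
           x))
       l))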

I'll try to do a pull request later today with this added in.

Commas inside quotes

Does this parser ignore commas inside quotes when splitting fields, as I hope it does?

Since you said "Quoting and escaping inside CSV fields are handled correctly (specifically commas and double-quote characters)," I'm assuming the answer is Yes.

Commas inside quotes sometimes not ignored

Issue #36 is closed, saying that commas inside quotes are ignored, but in the below example that is not always the case.

Only difference between the three lines below is no space, a space after the first, or a space after the second comma.

(->> "1367-1369,\"[Bailliages of Arras, Avesnes, Aubigny and Quiéry]\",[Artois],"
    (csv/parse-csv))

(->> "1367-1369, \"[Bailliages of Arras, Avesnes, Aubigny and Quiéry]\",[Artois],"
    (csv/parse-csv))

(->> "1367-1369,\"[Bailliages of Arras, Avesnes, Aubigny and Quiéry]\" ,[Artois],"
    (csv/parse-csv))

The first line produces the expected results.

(["1367-1369" "[Bailliages of Arras, Avesnes, Aubigny and Quiéry]" "[Artois]" ""])

But the other two lines seem to have issues: either some overzealous splitting that didn't respect the double-quoted string, or a whole entry being swallowed.

(["1367-1369" " \"[Bailliages of Arras" " Avesnes" " Aubigny and Quiéry]\"" "[Artois]" ""])

(["1367-1369" " " "[Artois]" ""])

reflection warnings on clojars

Hi,
The current jar posted to clojars still has the warn-on-reflection flag set, which can be quite annoying because we are using your library in a large project. I noticed that the latest version in git has this commented out, so I just wanted to request that you please upload a new version to clojars.

Thanks for the library!

-Jeff

License info not in project.clj

The project.clj is missing the :license key which makes its way into the published pom on clojars and can be used to programmatically extract license info.

JVM crashes when using `:delimiter \;`

> (parse-csv "ola;hahdf" :delimeter \;)
Exception in thread "Swank REPL Thread" java.lang.IllegalMonitorStateException
        at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:155)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
        at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:460)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:449)
        at swank.util.concurrent.mbox$receive.invoke(mbox.clj:28)
        at swank.core$eval_from_control.invoke(core.clj:108)
        at swank.core$eval_loop.invoke(core.clj:114)
        at swank.core$spawn_repl_thread$fn__803$fn__804.invoke(core.clj:343)
        at clojure.lang.AFn.applyToHelper(AFn.java:159)
        at clojure.lang.AFn.applyTo(AFn.java:151)
        at clojure.core$apply.invoke(core.clj:617)
        at swank.core$spawn_repl_thread$fn__803.doInvoke(core.clj:340)
        at clojure.lang.RestFn.invoke(RestFn.java:397)
        at clojure.lang.AFn.run(AFn.java:24)
        at java.lang.Thread.run(Thread.java:724)

Design discussion: input-streams

There's a limitation of clojure-csv that stems partly from the choice to use the Reader abstraction, but which may be surmountable without throwing the baby out with the bathwater.

The motivating problem is that sometimes it is necessary to deal with data at the byte level before it goes to the csv processor. Java provides the Reader abstraction for dealing with character data at a character level and InputStream for dealing with character data at the byte level. InputStreamReader bridges the two so that in theory you can do what you need to do with an InputStream and then act as if it is a Reader.

However, InputStreamReader does not implement mark(), which clojure-csv does make use of. I would like to raise the possibility of a change to clojure-csv so that it does not need to use the mark() method of Readers. This would open clojure-csv to use with InputStreams as well as other Java abstractions, such as piped input, that also do not implement mark().

There are a number of reasons why users of clojure-csv might need to deal with data at the byte level prior to it being processed by clojure-csv. My actual case is probably one of the more straightforward ones: I need to skip() the reader to a specific position. In many cases it is sufficient to use a Reader and skip characters instead of bytes, however many characters do not map 1-to-1 with bytes, and therefore it is important that I can skip based on actual bytes.

Parse errors in values with double quotes

Hi,

I noticed this bug in 1.2.0 and 1.2.1 versions. Below is the actual scenario:

Sample row (with semicolon delimiter):
120030;BLACK COD FILET MET VEL "MSC";KG;0;1;Nee;23-03-09;1070;BTWLAAG;STUK;1,06;33,1;VERS;Ja;Nee;1000

Expected:
["120030" "BLACK COD FILET MET VEL "MSC"" "KG" "0" "1" "Nee" "23-03-09" "1070" "BTWLAAG" "STUK" "1,06" "33,1" "VERS" "Ja" "Nee" "1000"]

Actual:
["120030" "BLACK COD FILET MET VEL "SC"KG" "0" "1" "Nee" "23-03-09" "1070" "BTWLAAG" "STUK" "1,06" "33,1" "VERS" "Ja" "Nee" "1000"]

Regards,
Shantanu

return-only files

I was trying to parse a CSV file exported by Mac Excel 2011; it seems to be terminated only by the Return character, with no Newline. Would there be any way of making clojure-csv configurable for different end-of-line sequences? I guess this is not standard CSV, but if Excel is making it it must be a pretty common use-case.

Clojure 1.3 compatibility

Since clojure-csv depends on contrib 1.2.0, it won't run on Clojure 1.3.0 (because contrib 1.2.0 won't run on Clojure 1.3.0).

I just went thru the process with CongoMongo, making that compatible with Clojure 1.3.0 so I'd be willing to help do the same for clojure-csv (which I'd need to do in order to use clojure-csv since we're wedded to Clojure 1.3.0 at work :)

Doublequote char is not configurable

My data source uses the pipe char | as delimiter and never has newlines in the middle of a value, so the newline is the unique identifier for EOL. Also the pipe never shows up inside values, which means that doublequotes could potentially be handled correctly. Here is an example file for testing:

abc|def|g"hi|jkl"|"mno|pqr|stu|vwx|

I save this into a file and test it:

(binding [*delimiter* \|]
  (parse-csv (slurp "c:/Users/ath/test.txt")))
==> (["abc" "def" "g\"hi" "jkl\"" "mno|pqr|stu|vwx|"])

When the leading doublequote in front of mno is removed the parsing is working as expected:

(["abc" "def" "g\"hi" "jkl\"" "mno" "pqr" "stu" "vwx" ""])

One option would be a way to tell the parser which char is supposed to serve as the quotation char. This is currently fixed to the doublequote. In my particular case this would be an awkward solution, though, as I would have to bind the quotechar to some random char that is not likely to occur in the values. Here I use the +:

(binding [*quotechar* \+] ...)

Another way would be to signal that no quotation char is used at all.

(binding [*quotation?* false] ...)

Would be great if you could add both of those features to clojure-csv.

Header row consumption?

Some tools (such as python csv module) support consuming headers, so that rows can be returned as dictionaries. Is this something you'd consider supporting? I'd be happy to either add this, or create my own project if you want to keep core functionality simple.
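For reference, header consumption can be layered on top of the parsed rows in a few lines. This is a minimal sketch that treats the first row as the header; rows->maps is a hypothetical helper, not part of clojure-csv.

```clojure
(defn rows->maps
  "Treats the first row as a header and returns the remaining rows as
  maps keyed by the header names, converted to keywords.
  Hypothetical helper, not part of clojure-csv."
  [[header & rows]]
  (let [ks (map keyword header)]
    (map #(zipmap ks %) rows)))

(rows->maps [["name" "age"] ["Ann" "31"] ["Bob" "27"]])
;; => ({:name "Ann", :age "31"} {:name "Bob", :age "27"})
```

Rows shorter than the header simply produce maps with fewer keys, since zipmap stops at the shorter of its two arguments.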

Problem with quoted double-quotes in field

In file foo.csv, I have a single double-quoted field with quoted double-quotes:
"A"B"C"

Running
(csv/parse-csv (slurp "foo.csv"))
gives me:
(["B\"C""])

Is this right?

Thanks for any help.
Earl

write-csv quoting of string containing quotes

Is it expected that writing a value such as

(write-csv [[ "foo bar"]]) 

yields

"foo bar\n"

but writing a value containing a quoted portion such as

(write-csv [[ "\"foo\" bar"]])

yields

"\"\"\"foo\"\" bar\"\n"

? Having that whole string quoted with an extra set of double quotes is causing us to have an issue.

initial insignificant white space

If there is a line of data with whitespace preceding a quoted field, e.g.

  "2009-05-15 17:45:00","foo","bar"

This misparses the field and throws an exception:

java.lang.IllegalArgumentException: Invalid format: " "2009-05-15 17:45:00""

(strict parsing is disabled).

Quoting on comma instead of *delimiter*

Regarding line 148 of clojure-csv.core (inside the needs-quote? function): the need for quoting is determined using the comma, the default CSV delimiter, but at the same time you allow rebinding *delimiter*. So in the end I am using a different delimiter and my fields containing "," are still getting quoted, which is unexpected and breaks viewing the resulting file in Excel, which in turn tripped me up badly yesterday.

Would you consider changing this?
