
cesil's People

Contributors

kevin-montrose, siliconrob


cesil's Issues

Feature Suggestions

Hi Kevin, I loved your work on Jil, so I'm quite delighted to find that you are working on a high-performance CSV parser right when I was looking for one!

Testing it out, I ran into a few limitations for my (albeit very edge-case) usage. (I need to parse a PB of access logs 🚀)

  1. Provide an option for discarding unmapped columns. The CSV file I am parsing has two columns at the end that are just repeats of previous data, but I had to map them anyway, since the serializer would otherwise throw an exception.

  2. This is very edge-case, but I would love to see some very low-level extensibility points, for example a way to get access to the raw ReadOnlySpan<byte> of a given field.

  3. A way to just execute a function/delegate/callback for each field right after it has been processed into the correct type, without the expectation of an output or return type.
    For example, if column 3 of my CSV file is an int, I'd love to be able to provide a custom action that aggregates the values, without Cesil expecting me to return a specific type that represents each row.

  4. Again, very edge case, but you could consider adding an optional string interning/reuse feature that calculates the cardinality of string fields based on the first N records and, if the cardinality is below a certain threshold, re-uses an already allocated string. I'm currently using an implementation with a Dictionary<int, string> keyed on the GetHashCode() of the string (sketched below); this worked well and moved a lot of Gen 2 collections to Gen 1, as well as reducing the minimum required memory. On Monday I will experiment with hashing the raw bytes, to avoid the string allocation overhead to begin with.
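A minimal sketch of the interning cache described in suggestion 4. The type and method names are illustrative, not part of Cesil, and unlike the description above this version also compares contents so that a hash collision cannot return the wrong string:

	using System;
	using System.Collections.Generic;

	// Hypothetical helper: re-uses already allocated strings for low-cardinality
	// columns instead of allocating a new string per field.
	public sealed class LowCardinalityStringCache
	{
		private readonly Dictionary<int, string> cache = new Dictionary<int, string>();
		private readonly int maxEntries;

		public LowCardinalityStringCache(int maxEntries = 1024) => this.maxEntries = maxEntries;

		public string GetOrAdd(ReadOnlySpan<char> chars)
		{
			var hash = string.GetHashCode(chars);

			// Re-use the cached instance when both the hash and the contents match.
			if (cache.TryGetValue(hash, out var existing) && chars.SequenceEqual(existing))
			{
				return existing;
			}

			var str = new string(chars);
			if (cache.Count < maxEntries)
			{
				cache[hash] = str;
			}

			return str;
		}
	}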

What should Cesil be named?

This was posed in Overthinking CSV With Cesil: Open Source Update

Cesil's current name can be confused with another .NET OSS project, Cecil.

This is less than ideal.

Since Cesil is a pretty young project, a rename probably makes sense. So, what should this library (currently known as Cesil) be named?

Ideally the name will be:

  • Short
  • Unique
  • A play on the function of the library

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Rename `master` branch

It just needs doing: switch to main, and probably also default PRs against vNext?

Making a note as a reminder.

Is there anything missing from IReader(Async) and IWriter(Async)?

This was posed in Overthinking CSV With Cesil: A “Modern” Interface and repeated in Overthinking CSV With Cesil: Reading Known Types.

Basically, is there anything missing from IReader, IWriter, or their async equivalents?

CesilUtils also exposes helpers for most of those methods, so failings in it may also imply failings in the interfaces.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Consider defaulting RowEnding to Detect

I was experimenting with Cesil after reading the performance blog post you wrote recently. It took me a while (longer than I'd like to admit) to realize that the reason EnumerateAll was not producing any results was that the file I was trying to read used '\n' for newlines, while Options defaults to CarriageReturnLineFeed. For the sanity of future users, you might consider changing the default to Detect instead.

The dataset I was using was the JHU COVID-19 dataset, and I was trying to bind it to the following class:

	public class CovidRecord
	{
		public int UID { get; set; }
		public string iso2 { get; set; }
		public string iso3 { get; set; }
		public int? code3 { get; set; }
		public float? FIPS { get; set; }
		public string Admin2 { get; set; }
		public string Province_State { get; set; }
		public string Country_Region { get; set; }
		public float? Lat { get; set; }
		public float? Long_ { get; set; }
		public string Combined_Key { get; set; }
	}
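For reference, a hedged sketch of reading that file with row-ending detection turned on. It assumes the OptionsBuilder and Configuration APIs (Options.CreateBuilder, WithRowEnding, Configuration.For, CreateReader, EnumerateAll) behave as Cesil's docs describe; exact method names may differ between versions, and the file path is purely illustrative:

	using System;
	using System.IO;
	using Cesil;

	class ReadCovidRecords
	{
		static void Main()
		{
			// Detect '\n', '\r', or '\r\n' instead of assuming the CarriageReturnLineFeed default.
			var options = Options.CreateBuilder(Options.Default)
				.WithRowEnding(RowEnding.Detect)
				.ToOptions();

			var config = Configuration.For<CovidRecord>(options);

			// Illustrative path to the JHU dataset mentioned above.
			using var file = new StreamReader("covid19_confirmed_US.csv");
			using var csv = config.CreateReader(file);

			foreach (var record in csv.EnumerateAll())
			{
				Console.WriteLine($"{record.Combined_Key}: {record.Lat}, {record.Long_}");
			}
		}
	}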

I have also authored a CSV library for .NET: Sylvan.Data.Csv. My focus was primarily performance, and I've put together some benchmarks for a handful of the CSV libraries I could find. Sylvan exposes CSV data only as a DbDataReader and doesn't provide any object binding capabilities, so comparing it to Cesil wouldn't be exactly apples to apples, since, as far as I can tell, Cesil only allows binding to objects.

What additional testing does Cesil need?

This was posed in Overthinking CSV With Cesil: Testing And Code Coverage in 2020

Cesil has an extensive test suite and good code coverage metrics. What other testing, if any, should be implemented for Cesil?

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Avoid copying by default when taking the range of a dynamic row

When you take a range (ie. someRow[1..3]) of a dynamic row, the underlying data is copied into a new Memory<char>. This is only necessary if the original row is disposed while the subset is still in use, and even then the copy could be optimized away in the common case where Options.DynamicRowDisposal is OnReaderDispose.

This will probably require introducing either a new kind of dynamic object (like RangedDynamicRow or something), or introducing a notion of a "mode" to DynamicRow.

Are there any useful dynamic operations around reading that are missing from Cesil?

This was posed in Overthinking CSV With Cesil: Reading Dynamic Types.

Basically, are there any operations around reading dynamic rows that Cesil is missing? This covers missing methods from IReader<dynamic>/IAsyncReader<dynamic>, and missing *Dynamic* methods from CesilUtils.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Determine if "impossible" cases in ReflectionExtensionMethods are testable

Cesil has extension methods in ReflectionExtensionMethods for use in reflection-y code; they DRY things up and make sure nulls are caught early.

There are four places where a null check is necessary, according to nullable annotations in the BCL, that are currently untested. If these can be tested, they should be; if not, the TODOs should be replaced with comments explaining why the checks are impossible to hit, and null forgiveness operators (ie. !) used instead of the ifs.
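An illustrative (non-Cesil) example of the replacement pattern described above, where a BCL-mandated null check that can never fire is swapped for the null-forgiving operator plus an explanatory comment:

	using System;
	using System.Reflection;

	static class ReflectionHelpersSketch
	{
		// Hypothetical helper, not Cesil code. The BCL annotates MemberInfo.DeclaringType
		// as nullable (it is null only for global, module-level members), so nullable
		// analysis demands a check even when the calling code never sees such members.
		internal static Type DeclaringTypeNonNull(this MemberInfo member)
		{
			// DeclaringType is only null for module-level members, which never reach this code.
			return member.DeclaringType!;
		}
	}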

Remove allocations from NameLookup creation

NameLookup is a struct used to speed up looking up dynamic members. In use it is allocation-free, and most construction code places data in the client-provided MemoryPool<char>, but there are three places that directly allocate on the heap. These should be replaced either with allocation-free code (which appears quite tricky) or with code that only uses memory obtained from a MemoryPool<char>.

This code is performance critical; there are three benchmarks to verify that improvements in allocations do not cause regressions in runtime.
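For context, a minimal sketch (not NameLookup's actual code) of the general replacement being asked for: taking memory from the client-provided MemoryPool<char> rather than allocating arrays on the heap.

	using System;
	using System.Buffers;

	static class PooledCopySketch
	{
		// Instead of `new char[source.Length]`, rent at least source.Length chars from
		// the pool; the caller owns the returned IMemoryOwner<char> and must Dispose it
		// to hand the memory back.
		internal static IMemoryOwner<char> CopyToPooled(ReadOnlySpan<char> source, MemoryPool<char> pool)
		{
			var owner = pool.Rent(source.Length);
			source.CopyTo(owner.Memory.Span);
			return owner;
		}
	}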

Do Cesil's Options provide everything needed in a CSV library?

This was posed in Overthinking CSV With Cesil: CSV Isn’t A Thing.

Essentially, are the configurations exposed via Options adequate?

This was restated as:

Are there any missing Format-specific options Cesil should have?

in Overthinking CSV With Cesil: “Maximum” Flexibility

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Does Cesil give adequate control over allocations?

This was posed in Overthinking CSV With Cesil: “Maximum” Flexibility.

Essentially: does the combination of the MemoryPool<char> and various buffer size hints on Options, plus the configurable steps of (de)serialization wrapped in (De)SerializableMembers (and associated types like InstanceProviders, Setters, Formatters, etc.), provide sufficient control over when and how allocations happen?

Put another way, is there any place where using Cesil currently requires allocations whose timing or manner a consumer cannot influence or control?

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Do the conversions provided by the DefaultTypeDescriber for dynamic rows and cells cover all common use cases?

This was posed in Overthinking CSV With Cesil: Reading Dynamic Types.

Basically, are there any conversions from dynamic rows or cells to concrete types that should be supported but that the DefaultTypeDescriber does not support?

Similarly, but slightly differently, are there any kinds of conversions that would be useful but cannot be expressed using ITypeDescriber?

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Check conventions with Roslyn analyzers

There are a couple of conventions in Cesil that should be checked by analyzers, rather than either a) not checked at all or b) checked in tests with janky reflection.

They are (at time of writing):

  • AsyncTestHelper.IsCompletedSuccessfully() is used instead of (Value)Task(<T>).IsCompletedSuccessfully
  • AwaitHelper.ConfigureCancellableAwait() is called for all awaitables prior to them being await'ed
  • BindingFlagsConstants is used instead of BindingFlags
  • Every use of the null forgiving operators (ie. suffix-!) is annotated with an explanation
  • Members on Throw are called instead of a throw statement or throw expression
  • Members on Types are used instead of a typeof() expression
  • Non-public types don't have public members

There may be others to add in the future, but this is a good start.

It's possible to provide an analyzer as a project, so the path forward here is creating a Cesil.Analyzers project and referencing it from the main Cesil project.
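As an illustration, a rough sketch of what one such rule could look like in a Cesil.Analyzers project, here flagging typeof() expressions in favor of the members on Types. The diagnostic ID, messages, and exemption behavior are placeholders, not decisions the project has made:

	using System.Collections.Immutable;
	using Microsoft.CodeAnalysis;
	using Microsoft.CodeAnalysis.CSharp;
	using Microsoft.CodeAnalysis.Diagnostics;

	[DiagnosticAnalyzer(LanguageNames.CSharp)]
	public sealed class UseTypesInsteadOfTypeofAnalyzer : DiagnosticAnalyzer
	{
		// Placeholder ID, title, and message; a real Cesil.Analyzers rule would pick its own.
		private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
			id: "CESIL0001",
			title: "Use Types members instead of typeof()",
			messageFormat: "Use a member on Types instead of a typeof() expression",
			category: "Conventions",
			defaultSeverity: DiagnosticSeverity.Warning,
			isEnabledByDefault: true);

		public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);

		public override void Initialize(AnalysisContext context)
		{
			context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
			context.EnableConcurrentExecution();

			// Flag every typeof(...) expression; a real rule would exempt Types itself
			// and any other intentional uses.
			context.RegisterSyntaxNodeAction(
				ctx => ctx.ReportDiagnostic(Diagnostic.Create(Rule, ctx.Node.GetLocation())),
				SyntaxKind.TypeOfExpression);
		}
	}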

How should Cesil treat nullable reference types in client code?

This was posed in Overthinking CSV With Cesil: Adopting Nullable Reference Types

Basically, when Cesil is setting a member on a type, it currently ignores any nullability annotations that member might have. This means that Cesil can violate nullable annotations (ie. it can assign null to a string, despite the string lacking a ?). A small illustration follows the list of options below.

However, not all (in fact, likely very little) C# code has opted into nullable reference types, which means that if Cesil were to respect nullable annotations, there would be cases where Cesil fails where a client's own code would succeed.

Concretely, I see three options for what Cesil could do:

  1. Ignore nullable annotations, clients should perform their own null checks
  2. Enforce nullable annotations as part of the DefaultTypeDescriber; if a client needs to disable it, they can provide their own ITypeDescriber
  3. Provide an Option to enable/disable nullable reference type enforcement
    a. This will move the logic out of ITypeDescribers, so client control via that method will no longer be available
    If this is the route to take, what should the value for Options.(Dynamic)Default be?
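A small illustration of the mismatch (the row type is hypothetical): under option 1, Cesil can hand null to the setter below even though the annotation says that should never happen, and no compiler diagnostic will point it out.

	#nullable enable
	public class Person
	{
		// Declared non-nullable; nothing in the annotation stops a deserializer that
		// ignores nullability (option 1 above) from assigning null here at runtime.
		public string Name { get; set; } = string.Empty;

		public int? Age { get; set; }
	}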

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Alternatives to IEnumerable<DynamicCellValue> for ITypeDescriber.GetCellsForDynamicRow

This was posed in Overthinking CSV With Cesil: Writing Dynamic Types.

ITypeDescriber.GetCellsForDynamicRow(in WriteContext, object) is invoked once per row when writing dynamic types. While a client could (in theory) reuse the allocated IEnumerable<T> across multiple calls, this is difficult (bordering on impossible) in the general case, because it is impossible to know when the returned value is no longer in use. As a consequence, the natural way to implement GetCellsForDynamicRow() requires at least one heap allocation per call (in fact, this is what DefaultTypeDescriber does).

Is there an alternative method (or methods), return type, etc. that would allow for allocation-free implementations of GetCellsForDynamicRow's functionality?
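One hypothetical shape for such an alternative, purely for illustration (this is not an API Cesil exposes): the describer writes cells into caller-provided storage and returns a count, so the caller decides where that storage comes from and when it is reused.

	using System;
	using Cesil;

	// Hypothetical sketch only; WriteContext and DynamicCellValue are the existing Cesil types.
	public interface ITypeDescriberNoAllocSketch
	{
		// Fills `cells` with the cells for `row` and returns how many were written;
		// the caller supplies (and may reuse) the backing storage across rows.
		int GetCellsForDynamicRow(in WriteContext context, object row, Span<DynamicCellValue> cells);
	}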

Are there any reasonable .NET type schemes that Cesil cannot read or write?

This was posed in Overthinking CSV With Cesil: “Maximum” Flexibility.

Cesil splits reading and writing rows into a number of logical steps and lets clients provide their own implementations of those steps (the InstanceProviders, Setters, Formatters, and related types wrapped up in (De)SerializableMembers).

Each step receives a context which provides the operation being performed, the location in the file being read, and a per-reader/writer client-provided object, in addition to the raw data necessary to perform the step (object reference, string data, etc.).

The question is: are there any reasonable types or patterns that Cesil cannot support with this scheme?

"Reasonable" is a low bar here, if it's in actual in use code being read or written (even if manually) to a CSV I'd consider it a "reasonable" type no matter how otherwise weird.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.
