
cesil's People

Contributors

kevin-montrose, siliconrob


cesil's Issues

Feature Suggestions

Hi Kevin, I loved your work on Jil, so I'm quite delighted to find that you are working on a high-performance CSV parser right when I was looking for one!

Testing it out, I ran into a few limitations for my (albeit very edge-case) usage. (I need to parse a PB of access logs 🚀)

  1. Provide an option for discarding unmapped columns. The CSV file I am parsing has two columns at the end that are just repeats of previous data, but I had to map them anyway, since the serializer would otherwise throw an exception.

  2. This is very edge-case, but I would love to see some very low-level extensibility points, for example a way to get access to the raw ReadOnlySpan<byte> of a given field.

  3. A way to just execute a function/delegate/callback for each field right after it has been processed into the correct type, without the expectation of an output or return type.
    For example, if column 3 of my CSV file is an int, I'd love to be able to provide a custom action that aggregates the values, without Cesil expecting me to return a specific type that represents each row.

  4. Again, very edge case, but you could consider adding an optional string interning/reuse feature that calculates the cardinality of string fields based on the first N records and, if the cardinality is below a certain threshold, re-uses an already allocated string. I'm currently using an implementation with a Dictionary<int, string> keyed on the GetHashCode() of the string (sketched below); this worked well and moved a lot of Gen 2 collections to Gen 1, as well as reducing the minimum required memory. On Monday I will experiment with hashing the raw bytes, to avoid the string allocation overhead to begin with.
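A minimal sketch of the interning cache described in suggestion 4. The type and method names are illustrative, not part of Cesil, and unlike the description above this version also compares contents so that a hash collision cannot return the wrong string:

	using System;
	using System.Collections.Generic;

	// Hypothetical helper: re-uses already allocated strings for low-cardinality
	// columns instead of allocating a new string per field.
	public sealed class LowCardinalityStringCache
	{
		private readonly Dictionary<int, string> cache = new Dictionary<int, string>();
		private readonly int maxEntries;

		public LowCardinalityStringCache(int maxEntries = 1024) => this.maxEntries = maxEntries;

		public string GetOrAdd(ReadOnlySpan<char> chars)
		{
			var hash = string.GetHashCode(chars);

			// Re-use the cached instance when both the hash and the contents match.
			if (cache.TryGetValue(hash, out var existing) && chars.SequenceEqual(existing))
			{
				return existing;
			}

			var str = new string(chars);
			if (cache.Count < maxEntries)
			{
				cache[hash] = str;
			}

			return str;
		}
	}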

What should Cesil be named?

This was posed in Overthinking CSV With Cesil: Open Source Update

Cesil's current name can be confused with another .NET OSS project, Cecil.

This is less than ideal.

Since Cesil is a pretty young project, a rename probably makes sense. So, what should this library (currently known as Cesil) be named?

Ideally the name will be:

  • Short
  • Unique
  • A play on the function of the library

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Rename `master` branch

It just needs doing: switch to main, and probably also default PRs against vNext?

Making a note as a reminder.

Is there anything missing from IReader(Async) and IWriter(Async)?

This was posed in Overthinking CSV With Cesil: A “Modern” Interface and repeated in Overthinking CSV With Cesil: Reading Known Types.

Basically, is there anything missing from IReader, IWriter, or their async equivalents?

CesilUtils also exposes helpers for most of those methods, so failings in it may also imply failings in the interfaces.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Consider defaulting RowEnding to Detect

I was experimenting with Cesil after reading the performance blog post you wrote recently. It took me a while (longer than I'd like to admit) to realize that the reason EnumerateAll was not producing any results was that the file I was trying to read used '\n' for newlines, while Options defaults to CarriageReturnLineFeed. For the sanity of future users, you might consider changing the default to Detect instead.

The dataset I was using was the JHU COVID-19 dataset, and I was trying to bind it to the following class:

	public class CovidRecord
	{
		public int UID { get; set; }
		public string iso2 { get; set; }
		public string iso3 { get; set; }
		public int? code3 { get; set; }
		public float? FIPS { get; set; }
		public string Admin2 { get; set; }
		public string Province_State { get; set; }
		public string Country_Region { get; set; }
		public float? Lat { get; set; }
		public float? Long_ { get; set; }
		public string Combined_Key { get; set; }
	}
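For reference, a hedged sketch of reading that file with row-ending detection turned on. It assumes the OptionsBuilder and Configuration APIs (Options.CreateBuilder, WithRowEnding, Configuration.For, CreateReader, EnumerateAll) behave as Cesil's docs describe; exact method names may differ between versions, and the file path is purely illustrative:

	using System;
	using System.IO;
	using Cesil;

	class ReadCovidRecords
	{
		static void Main()
		{
			// Detect '\n', '\r', or '\r\n' instead of assuming the CarriageReturnLineFeed default.
			var options = Options.CreateBuilder(Options.Default)
				.WithRowEnding(RowEnding.Detect)
				.ToOptions();

			var config = Configuration.For<CovidRecord>(options);

			// Illustrative path to the JHU dataset mentioned above.
			using var file = new StreamReader("covid19_confirmed_US.csv");
			using var csv = config.CreateReader(file);

			foreach (var record in csv.EnumerateAll())
			{
				Console.WriteLine($"{record.Combined_Key}: {record.Lat}, {record.Long_}");
			}
		}
	}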

I have also authored a CSV library for .NET: Sylvan.Data.Csv. My focus was primarily performance, and I've put together some benchmarks for a handful of the CSV libraries I could find. Sylvan exposes CSV data only as a DbDataReader and doesn't provide any object binding capabilities, so comparing it to Cesil wouldn't be exactly apples to apples, since, as far as I can tell, Cesil only allows binding to objects.

What additional testing does Cesil need?

This was posed in Overthinking CSV With Cesil: Testing And Code Coverage in 2020

Cesil has an extensive test suite and good code coverage metrics. What other testing, if any, should be implemented for Cesil?

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Avoid copying by default when taking the range of a dynamic row

When you take a range (ie. someRow[1..3]) of a dynamic row, the underlying data is copied into a new Memory<char>. This is only necessary if the original row is disposed while the subset is still in use, and even then the copy could be optimized away in the common case where Options.DynamicRowDisposal is OnReaderDispose.

This will probably require introducing either a new kind of dynamic object (like RangedDynamicRow or something), or introducing a notion of a "mode" to DynamicRow.

Are there any useful dynamic operations around reading that are missing from Cesil?

This was posed in Overthinking CSV With Cesil: Reading Dynamic Types.

Basically, are there any operations around reading dynamic rows that Cesil is missing? This covers missing methods from IReader<dynamic>/IAsyncReader<dynamic>, and missing *Dynamic* methods from CesilUtils.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Determine if "impossible" cases in ReflectionExtensionMethods are testable

Cesil has extension methods in ReflectionExtensionMethods for use in reflection-y code; they DRY things up and make sure nulls are caught early.

There are four places where a null check is necessary, according to nullable annotations in the BCL, that are currently untested. If these can be tested, they should be; if not, the TODOs should be replaced with comments explaining why the checks are impossible to hit, and null forgiveness operators (ie. !) used instead of the ifs.
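An illustrative (non-Cesil) example of the replacement pattern described above, where a BCL-mandated null check that can never fire is swapped for the null-forgiving operator plus an explanatory comment:

	using System;
	using System.Reflection;

	static class ReflectionHelpersSketch
	{
		// Hypothetical helper, not Cesil code. The BCL annotates MemberInfo.DeclaringType
		// as nullable (it is null only for global, module-level members), so nullable
		// analysis demands a check even when the calling code never sees such members.
		internal static Type DeclaringTypeNonNull(this MemberInfo member)
		{
			// DeclaringType is only null for module-level members, which never reach this code.
			return member.DeclaringType!;
		}
	}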

Remove allocations from NameLookup creation

NameLookup is a struct used to speed up looking up dynamic members. In use it is allocation-free, and most construction code places data in the client-provided MemoryPool<char>, but there are three places that directly allocate on the heap. These should be replaced either with allocation-free code (which appears quite tricky) or with code that only uses memory obtained from a MemoryPool<char>.

This code is performance critical; there are three benchmarks to verify that improvements in allocations do not cause regressions in runtime.
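For context, a minimal sketch (not NameLookup's actual code) of the general replacement being asked for: taking memory from the client-provided MemoryPool<char> rather than allocating arrays on the heap.

	using System;
	using System.Buffers;

	static class PooledCopySketch
	{
		// Instead of `new char[source.Length]`, rent at least source.Length chars from
		// the pool; the caller owns the returned IMemoryOwner<char> and must Dispose it
		// to hand the memory back.
		internal static IMemoryOwner<char> CopyToPooled(ReadOnlySpan<char> source, MemoryPool<char> pool)
		{
			var owner = pool.Rent(source.Length);
			source.CopyTo(owner.Memory.Span);
			return owner;
		}
	}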

Do Cesil's Options provide everything needed in a CSV library?

This was posed in Overthinking CSV With Cesil: CSV Isn’t A Thing.

Essentially, are the configurations exposed via Options adequate?

This was restated as:

Are there any missing Format-specific options Cesil should have?

in Overthinking CSV With Cesil: “Maximum” Flexibility

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Does Cesil give adequate control over allocations?

This was posed in Overthinking CSV With Cesil: “Maximum” Flexibility.

Essentially: does the combination of the MemoryPool<char> and various buffer size hints on Options, plus the configurable steps of (de)serialization wrapped in (De)SerializableMembers (and associated types like InstanceProviders, Setters, Formatters, etc.), provide sufficient control over when and how allocations happen?

Put another way, is there any place where using Cesil currently requires allocations whose timing or manner a consumer cannot influence or control?

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Do the conversions provided by the DefaultTypeDescriber for dynamic rows and cells cover all common use cases?

This was posed in Overthinking CSV With Cesil: Reading Dynamic Types.

Basically, are there any conversions from dynamic rows or cells to concrete types that should be supported but that the DefaultTypeDescriber does not support?

Similarly, but slightly differently, are there any kinds of conversions that would be useful but cannot be expressed using ITypeDescriber?

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Check conventions with Roslyn analyzers

There are a couple of conventions in Cesil that should be checked by analyzers, rather than either a) not checked at all or b) checked in tests with janky reflection.

They are (at time of writing):

  • AsyncTestHelper.IsCompletedSuccessfully() is used instead of (Value)Task(<T>).IsCompletedSuccessfully
  • AwaitHelper.ConfigureCancellableAwait() is called for all awaitables prior to them being await'ed
  • BindingFlagsConstants is used instead of BindingFlags
  • Every use of the null forgiving operators (ie. suffix-!) is annotated with an explanation
  • Members on Throw are called instead of a throw statement or throw expression
  • Members on Types are used instead of a typeof() expression
  • Non-public types don't have public members

There may be others to add in the future, but this is a good start.

It's possible to provide an analyzer as a project, so the path forward here is creating a Cesil.Analyzers project and referencing it from the main Cesil project.
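As an illustration, a rough sketch of what one such rule could look like in a Cesil.Analyzers project, here flagging typeof() expressions in favor of the members on Types. The diagnostic ID, messages, and exemption behavior are placeholders, not decisions the project has made:

	using System.Collections.Immutable;
	using Microsoft.CodeAnalysis;
	using Microsoft.CodeAnalysis.CSharp;
	using Microsoft.CodeAnalysis.Diagnostics;

	[DiagnosticAnalyzer(LanguageNames.CSharp)]
	public sealed class UseTypesInsteadOfTypeofAnalyzer : DiagnosticAnalyzer
	{
		// Placeholder ID, title, and message; a real Cesil.Analyzers rule would pick its own.
		private static readonly DiagnosticDescriptor Rule = new DiagnosticDescriptor(
			id: "CESIL0001",
			title: "Use Types members instead of typeof()",
			messageFormat: "Use a member on Types instead of a typeof() expression",
			category: "Conventions",
			defaultSeverity: DiagnosticSeverity.Warning,
			isEnabledByDefault: true);

		public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics => ImmutableArray.Create(Rule);

		public override void Initialize(AnalysisContext context)
		{
			context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
			context.EnableConcurrentExecution();

			// Flag every typeof(...) expression; a real rule would exempt Types itself
			// and any other intentional uses.
			context.RegisterSyntaxNodeAction(
				ctx => ctx.ReportDiagnostic(Diagnostic.Create(Rule, ctx.Node.GetLocation())),
				SyntaxKind.TypeOfExpression);
		}
	}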

How should Cesil treat nullable reference types in client code?

This was posed in Overthinking CSV With Cesil: Adopting Nullable Reference Types

Basically, when Cesil is setting a member on a type, it currently ignores any nullability annotations that member might have. This means that Cesil can violate nullable annotations (ie. it can assign null to a string, despite the string lacking a ?). A small illustration follows the list of options below.

However, not all (in fact, likely very little) C# code has opted into nullable reference types, which means that if Cesil were to respect nullable annotations, there would be cases where Cesil fails where a client's own code would succeed.

Concretely, I see three options for what Cesil could do:

  1. Ignore nullable annotations, clients should perform their own null checks
  2. Enforce nullable annotations as part of the DefaultTypeDescriber; if a client needs to disable it, they can provide their own ITypeDescriber
  3. Provide an Option to enable/disable nullable reference type enforcement
    a. This will move the logic out of ITypeDescribers, so client control via that method will no longer be available
    If this is the route to take, what should the value for Options.(Dynamic)Default be?
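A small illustration of the mismatch (the row type is hypothetical): under option 1, Cesil can hand null to the setter below even though the annotation says that should never happen, and no compiler diagnostic will point it out.

	#nullable enable
	public class Person
	{
		// Declared non-nullable; nothing in the annotation stops a deserializer that
		// ignores nullability (option 1 above) from assigning null here at runtime.
		public string Name { get; set; } = string.Empty;

		public int? Age { get; set; }
	}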

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Alternatives to IEnumerable<DynamicCellValue> for ITypeDescriber.GetCellsForDynamicRow

This was posed in Overthinking CSV With Cesil: Writing Dynamic Types.

ITypeDescriber.GetCellsForDynamicRow(in WriteContext, object) is invoked once per row when writing dynamic types. While a client could (in theory) reuse the allocated IEnumerable<T> across multiple calls, this is difficult (bordering on impossible) in the general case, because it is impossible to know when the returned value is no longer in use. As a consequence, the natural way to implement GetCellsForDynamicRow() requires at least one heap allocation per call (in fact, this is what DefaultTypeDescriber does).

Is there an alternative method (or methods), return type, etc. that would allow for allocation-free implementations of GetCellsForDynamicRow's functionality?
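One hypothetical shape for such an alternative, purely for illustration (this is not an API Cesil exposes): the describer writes cells into caller-provided storage and returns a count, so the caller decides where that storage comes from and when it is reused.

	using System;
	using Cesil;

	// Hypothetical sketch only; WriteContext and DynamicCellValue are the existing Cesil types.
	public interface ITypeDescriberNoAllocSketch
	{
		// Fills `cells` with the cells for `row` and returns how many were written;
		// the caller supplies (and may reuse) the backing storage across rows.
		int GetCellsForDynamicRow(in WriteContext context, object row, Span<DynamicCellValue> cells);
	}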

Are there any reasonable .NET type schemes that Cesil cannot read or write?

This was posed in Overthinking CSV With Cesil: “Maximum” Flexibility.

Cesil splits reading and writing rows into a number of logical steps and lets clients provide their own implementations of those steps (the InstanceProviders, Setters, Formatters, and related types wrapped up in (De)SerializableMembers).

Each step receives a context which provides the operation being performed, the location in the file being read, and a per-reader/writer client-provided object, in addition to the raw data necessary to perform the step (object reference, string data, etc.).

The question is: are there any reasonable types or patterns that Cesil cannot support with this scheme?

"Reasonable" is a low bar here, if it's in actual in use code being read or written (even if manually) to a CSV I'd consider it a "reasonable" type no matter how otherwise weird.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.
