
sylvan's People

Contributors

0xced, aodennison, benru, burnchar, deining, dependabot[bot], jeroen-mostert, leandromoh, markpflug, peto268, viceroypenguin


sylvan's Issues

Cannot parse a CR delimited file

When attempting to read a CSV file where lines are terminated only by CR characters, the parser reads all data into a single row and fails to parse properly.

This type of file can be produced on a classic Mac OS system.
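A minimal repro sketch (the data is illustrative): rows separated only by CR come back as a single record.

using System;
using System.IO;
using Sylvan.Data.Csv;

// Two data rows separated only by CR (classic Mac OS line endings).
var data = "a,b,c\rx,y,z";

using var csv = CsvDataReader.Create(new StringReader(data), new CsvDataReaderOptions { HasHeaders = false });
while (csv.Read())
{
    // Expected: two rows of three fields each; observed: one row containing everything.
    Console.WriteLine(csv.RowFieldCount);
}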

duplicate header handling

Duplicate headers cause a duplicate-key exception during initialization. It should be possible to work around this issue by providing a custom schema that remaps the duplicated columns to new names. The header initialization code should use the mapped column name, not the original column name, when constructing the header map, to avoid the key collision.
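A sketch of the suggested workaround, assuming the header map is fixed to honor the provider's names (the ICsvSchemaProvider shape with GetColumn(name, ordinal) appears in other issues below; the class names here are illustrative):

using System.Data.Common;
using Sylvan.Data.Csv;

// Hypothetical provider that renames every column by its ordinal,
// so "Amount,Amount" becomes "Amount_0,Amount_1" and the key collision disappears.
class DedupColumn : DbColumn
{
    public DedupColumn(string name, int ordinal)
    {
        ColumnName = name;
        ColumnOrdinal = ordinal;
    }
}

class DedupSchema : ICsvSchemaProvider
{
    public DbColumn? GetColumn(string? name, int ordinal)
    {
        return new DedupColumn($"{name}_{ordinal}", ordinal);
    }
}

// Usage: new CsvDataReaderOptions { Schema = new DedupSchema() }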

CsvReader (using CSV Data Binder) does not appear to respect DateOnlyFormat or Culture options

No matter what I do, I can't get a CsvReader instance to parse a DateOnly in the format "dd/MM/yyyy".

Here's a minimal example to reproduce the issue which can be pasted into LINQPad. I'm using Sylvan.Data 0.2.8 and Sylvan.Data.Csv 1.2.4:

using System.Globalization;
using Sylvan.Data;
using Sylvan.Data.Csv;

async Task Main()
{
	var data = """
"Settlement Date","Date","Symbol","Sedol","ISIN","Quantity","Price","Description","Reference","Debit","Credit","Running Balance"
"24/12/2022","24/12/2022","TEST","TESTV0","IE00TESTV02","100","£100.00","TEST","TEST04","£500.00","","£10.00"
""";

	var opts = new CsvDataReaderOptions
	{
		Culture = new CultureInfo("en-GB"),
		DateOnlyFormat = "dd/MM/yyyy",
	};

	using var csv = await CsvDataReader.CreateAsync(new StringReader(data), opts);
	var records = csv.GetRecords<Record>().ToList();
}

public class Record
{
	public DateOnly SettlementDate { get; set; }
	public DateOnly Date { get; set; }
	public string Symbol { get; set; }
	public string Sedol { get; set; }
	public string ISIN { get; set; }
	public int? Quantity { get; set; }
	public string Price { get; set; }
	public string Description { get; set; }
	public string Reference { get; set; }
	public string Debit { get; set; }
	public string Credit { get; set; }
	public string RunningBalance { get; set; }
}

GetDate (.NET 6) doesn't respect DateFormat options

Let me first just say I've only been using this library for a little while but I love it. It's great to see a really well thought out CSV library.

Anyway, I have a CSV file with dates in Australian format (dd/MM/yyyy). I tried parsing the file with:

var options = new CsvDataReaderOptions { DateFormat = "dd/MM/yyyy" };

Followed by:

csv.GetDate(2)

But it throws the following exception:

String '30/06/2022' was not recognized as a valid DateOnly.

To work around the issue, I created a simple extension method:

    public static DateOnly GetDateOnly(this CsvDataReader reader, int ordinal)
    {
        var dateTime = reader.GetDateTime(ordinal);
        return DateOnly.FromDateTime(dateTime);
    }

This will do just fine for now, but I thought I'd report it in case you'd like to know.

dBASE memo formats

Apparently there are memo formats other than the one used by VFP9. I don't have the ability to create them, though, so I have no way of validating a correct implementation. I've never seen a shapefile include a memo field, so this is very low priority.

strongly typed parsing is not working

I am trying to use the code below, but it is not working:

var dataTable = new DataTable();

var schema = new CsvSchema(Sylvan.Data.Schema.Parse("Id:int,Name,Quantity:int,SalePrice:decimal"));
using var csv = CsvDataReader.Create("SalesData.csv", new CsvDataReaderOptions { Schema = schema});
dataTable.Load(csv);

I reviewed the source code and found the Sylvan.Data.Schema.Parse method, but it is not accessible.

Demo: there is no content in the CSV file

        [HttpGet]
        public IActionResult Export2()
        {
            #region Directly using CSV string to generate file successfully
            {
                using var tw = System.IO.File.CreateText("output.csv");
                var data1 = "A,B,C\n1,2,3\n4,5,6\n";
                {
                    var r = CsvDataReader.Create(new StringReader(data1));
                    var csvWriter0 = CsvDataWriter.Create(tw);
                    csvWriter0.Write(r);
                }

                {
                    var r = CsvDataReader.Create(new StringReader(data1));
                    var csvWriter0 = CsvDataWriter.Create(tw, new CsvDataWriterOptions { WriteHeaders = false });
                    csvWriter0.Write(r);
                }
            }
            #endregion
            
            var records = new[]
                {
                    new {
                        Id = 1,
                        Name = "Alpha",
                        Date = new DateTime(2021, 1, 1),
                    },
                    new {
                        Id = 2,
                        Name = "Beta",
                        Date = new DateTime(2022, 1, 1),
                    },
                    new {
                        Id = 3,
                        Name = "Gamma, Iota, Omega",
                        Date = new DateTime(2023, 1, 1),
                    }
                };

            List<A> list1 = new List<A>()
            {
                new A(){ MyProperty1 =1,MyProperty2=1},
                new A(){ MyProperty1 =2,MyProperty2=3},
                new A(){ MyProperty1 =3,MyProperty2=3},
            };

            var recordReader = list1.AsDataReader();

            var csvWriter = CsvDataWriter.Create("demo.csv");
            var h1 = csvWriter.Write(recordReader);
            // There is a problem here: the generated file has no content.

            return Ok(h1);
        }
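A likely cause, offered as an assumption: none of the writers above are disposed, and CsvDataWriter buffers output until it is flushed or disposed. A minimal sketch of the fix:

// Wrapping the writer in a using ensures its buffer is flushed to demo.csv
// before the action returns.
using (var recordReader = list1.AsDataReader())
using (var csvWriter = CsvDataWriter.Create("demo.csv"))
{
    var h1 = csvWriter.Write(recordReader);
}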

quoted field corner case

The logic to handle quotes is slightly flawed for improperly quoted fields.

A quoted field with trailing data:
"this is quoted" improperly,

would be read as:
this is quoted" improperl

I think it would be slightly better (less surprising) if it was read as:
this is quoted improperly

Right now the code tracks quoting with a bool; change this to an enum: no-quotes, clean-quotes, bad-quotes. In the clean-quotes case, we can do the easy/fast approach of trimming. For bad quotes, a more complex approach would be needed; a sketch of the states follows.
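A minimal sketch of the proposed tri-state (the names are illustrative, not from the codebase):

// Hypothetical replacement for the current bool that tracks quoting.
enum QuoteState
{
    NoQuotes,     // field contains no quotes
    CleanQuotes,  // properly quoted: fast path, just trim the surrounding quotes
    BadQuotes     // quoted with stray characters: needs the slower repair path
}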

Low priority, because let's face it, that ain't CSV.

Last character of quoted cell is lost if CSV ends with this cell

The last character of a cell with a quoted string seems to be lost if the CSV content ends with the closing quote character of this cell.

Assuming this CSV content

id;value
1;"cell"

is stored in a string: "id;value\r\n1;\"cell\"". Using a Sylvan.Data.Csv.CsvDataReader will read only the value cel for the last cell.

If I append a newline to the string (no matter if it is \r, \n, or \r\n), the last cell is read correctly.

I have prepared some code on dotnetfiddle and attached the file.
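A self-contained repro sketch based on the description above (semicolon delimiter, no trailing newline):

using System;
using System.IO;
using Sylvan.Data.Csv;

// No trailing newline after the closing quote.
var data = "id;value\r\n1;\"cell\"";

using var csv = CsvDataReader.Create(new StringReader(data), new CsvDataReaderOptions { Delimiter = ';' });
csv.Read();
// Expected "cell"; the reported behavior yields "cel".
Console.WriteLine(csv.GetString(1));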

Write List<T> to CSV with CsvDataWriter

Hello,
thank you for this amazing, lightning-fast CSV library.
I use it to read millions of rows from CSV with no trouble.

The problem now is that I have to write a CSV file from a List (for example: List<Person> peopleList).
What's the correct approach to write to a CSV file using your built-in CsvDataWriter?

Thank you.
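One possible answer, sketched from the AsDataReader extension (Sylvan.Data) that appears in other issues on this page; treat the details as an assumption:

using Sylvan.Data;
using Sylvan.Data.Csv;

// Expose the list as a DbDataReader, then feed it to the CSV writer.
using var reader = peopleList.AsDataReader();
using var writer = CsvDataWriter.Create("people.csv");
writer.Write(reader);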

Possibility to cast as null, with error?

Does the CSV library have the same solution as the Excel one, to parse errors as null (GetErrorAsNull = true)?

I have a situation where a CSV is downloaded from an application; probably someone typed text containing the delimiter, which shifted the columns and turned a date value into a decimal. This means that when I apply a direct schema cast (for example: Column Name:date?), I get an error: the value isn't null, it's a decimal, and it cannot be parsed. This causes the exception: {"Input string was not in a correct format."}

How to Append records to an existing CSV file

Hello,
here I am again with another question, hoping to also help other people with the same requirements.

CsvDataWriter has only Write (or WriteAsync); it seems to create a new file or overwrite an existing file of the same name.

SCENARIO:
A database table hosting 1,000,000 records.
Read a batch of 10,000 records from the db, then write to file;
read the next batch of 10,000 records from the db, then write to file...
...and so on.

Can you improve the library by adding an Append(string file, CsvDataWriterOptions options) or AppendAsync method to add new records to an existing file?

Thank you
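Until then, a workaround sketch: CsvDataWriter.Create also accepts a TextWriter (as shown in the demo issue above), so the file can be opened in append mode up front, with headers suppressed for every batch after the first:

using System.IO;
using Sylvan.Data.Csv;

// batchReader is assumed: a DbDataReader over the next 10,000 rows.
using var tw = File.AppendText("output.csv");
using var writer = CsvDataWriter.Create(tw, new CsvDataWriterOptions { WriteHeaders = false });
writer.Write(batchReader);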

Exception messages could be made more helpful?

Had an issue recently parsing a CSV. Suppose the CSV looks like this:

1/1/1970,1,2

No header, just data rows

And I had a class like this:

class X
{
    public string NotUsed { get; set; }
    public DateTime BlahDate { get; set; }
    public int SomeInt { get; set; }
    public int OtherInt { get; set; }
}

And I was using Sylvan like this:

var opts = new CsvDataReaderOptions
{
    Schema = new CsvSchema(Schema.Parse("BlahDate:DateTime{d/M/yyyy},SomeInt:int,OtherInt:int"))
};
using var csv = CsvDataReader.Create(new StringReader(s), opts);
return csv.GetRecords<X>().ToList();

It blows up, but the exception helper shows us very little: just a cryptic "Sylvan.Data.UnboundMemberException: Exception_WasThrown", with the useful info buried in the detail property grid, found only after expanding things.

If the message could say something like "The provided schema doesn't cover all of the bindable members in the bound class. For more details on which properties/fields of the bound class are not referenced in the schema, see the UnboundColumns/UnboundProperties of the exception." it would be grand.

(PS: I think there might be a way to turn off this check, I can't remember. Maybe mentioning that in the exception message would be good too, like "Option 'CheckSchemaMatchesBoundClass' is true and ...")

Nullable string from an empty string

Hoping I'm just doing something incorrectly and you can point out, "oh you need to do X", but here goes.

Following along the documentation creating a TypedCsvSchema, I have a small bit of code that does the following:

  • Read the schema from my SQL db table.
  • Map that schema into my new TypedCsvSchema; all of the string columns are nullable.
  • Read my CSV files, which have empty strings "" for those columns (example: "A","XYZ123",,"Mar-21",""); the last field here should be a null value in the database.
  • Bulk copy the data into the table. I'm still getting empty strings, not nulls.

In my reader I have done

var options = new CsvDataReaderOptions
{
    Schema = CsvSchema.Nullable,
    BufferSize = 128000,
    HasHeaders = true,
    HeaderComparer = StringComparer.OrdinalIgnoreCase,
    Culture = CultureInfo.InvariantCulture,
    CsvStyle = CsvStyle.Standard
};

and I have also set my own schema, in which all fields are nullable.

Both work the same way, unfortunately.

Handling a CSV with the header not on the first line

I have a csv file of this format. The header starts with the line CODE,MTH, etc.
Is it possible to say, start reading from line 3 (or something similar)?

period,date,month_in__year_ends
40,122010,12
CODE,MTH,RFB,OUTGO,MOUTGO,IF,INC,SOUTGO,TOT_COMM,TOT_EXP
1,12,9062.86099263,9062.86099263,0,1.00000000,0,3,0,0
1,12,9062.86099263,9062.86099263,0,1.00000000,0,0,2,0
1,12,9062.86099263,9062.86099263,0,1.00000000,0,0,0,1

__OPRJKD
###  dfs.........
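A workaround sketch: advance a StreamReader past the preamble rows before handing it to CsvDataReader, on the assumption that the reader starts parsing from the TextReader's current position:

using System.IO;
using Sylvan.Data.Csv;

using var reader = new StreamReader("data.csv");
// Skip the two preamble lines so the CODE,MTH,... row becomes the header.
reader.ReadLine();
reader.ReadLine();

using var csv = CsvDataReader.Create(reader);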

Processing files with large numbers of errors is rather slow

I've a file like this. It has 50 columns and is half hourly energy readings from a smart meter:

ID,Date,0: 30,01:00,01:30,02:00,02:30,03:00,03:30,04:00,04:30,05:00,05:30,06:00,06:30,07:00,07:30,08:00,08:30,09:00,09:30,10:00,10:30,11:00,11:30,12:00,12:30,13:00,13:30,14:00,14:30,15:00,15:30,16:00,16:30,17:00,17:30,18:00,18:30,19:00,19:30,20:00,20:30,21:00,21:30,22:00,22:30,23:00,23:30,00:00
Hello  ,2000-01-01,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48
Badline,2000-01-01, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 

About 20% of the lines in the file are damaged, with no values in any cell of the row at all. I'm finding that processing this file is incredibly slow, I think because an exception is thrown for every value in error, and there are 5000 errors in the file. It takes approximately 3 seconds per bad line to chew through the problem cells and move on, so reading it in takes in excess of 5 minutes for a 600 line file.

I believe it was mentioned in the other thread, where we discussed this experimental "give the reader a delegate to call upon an error" feature, that there was a risk of significant performance issues.

Do I have any options for speeding it up? Can I, upon the first error encountered on a line, do something like telling the data reader to move on to the next line?


Doesn't distinguish between empty string and null

When reading a file written by CsvDataWriter:
if Schema = CsvSchema.Nullable, all empty strings are null;
if Schema is the default, all empty strings are empty strings.
There needs to be a way to differentiate between the two.

The fix I did locally was in CsvWriter.cs => WriteValueOptimistic: if the length of the value is 0, return NeedsQuoting.
This will create a difference between empty string and null when reading with Schema = CsvSchema.Nullable.

CsvRecordTooLargeException for just a commentary row

Files with just one commentary row end up with a strange CsvRecordTooLargeException
(thrown in CsvDataReader+Async.cs, line 177).

Is it a bug, perhaps?

As a starting point, I tried adding the following test method to CsvDataReaderTests.cs to speed up debugging:
[Fact]
public async Task JustCommentEmptyNoHeaderAsync()
{
    var csv = await CsvDataReader.CreateAsync(
        new StringReader("#Hey just a comment like why there is no data (no header)"),
        new CsvDataReaderOptions { HasHeaders = false });

    Assert.Equal(0, csv.FieldCount);
    Assert.False(await csv.ReadAsync());
}

(I also guess it depends on the HasHeaders property.)

GetRecords throws an exception when using a record type

Hi.

This is the example:
csvReader.GetRecords<FileImportItem>()

public record class FileImportItem { public long Id { get; set; } ... }

Throws UnboundMemberException.

It doesn't when it's a normal class or a "record struct". I can work around it with structs, but I wanted to report it here so you can fix it if you want.

DataBinder: FormatException column details

Hi,

Thank you for a great library.

I am using the DataBinder to map a csv file to a parse result POCO.

Most fields are strings, but some DateTimes and others are decimals.

If a FormatException is thrown during binding, there doesn't seem to be a way of determining which specific column contains the offending data. Is there a way to determine this?

Thank you.

Type or namespace TypedCsvSchema() could not be found.

This is probably a noob mistake, but I just can't use 'TypedCsvSchema()'. I think I'm just missing a using statement, but maybe it's something else?

Code:

var schema = new TypedCsvSchema()
    .Add(typeof(int))
    .Add(typeof(string))
    .Add(typeof(double), true)
    .Add(typeof(DateTime))
    .Add(typeof(DateTime), true);

var options = new CsvDataReaderOptions
{
	Schema = schema
};

var dataTable = new DataTable();

dataTable.Load(CsvDataReader.Create("data.csv", options));

Error CS0246: The type or namespace name 'TypedCsvSchema' could not be found (are you missing a using directive or an assembly reference?)

Using statements:

using Sylvan.Data.Csv; 
using Sylvan.Data; 
using System.Data.Common; 

Debug symbols/source for NuGet package?

I'm running into a NullReferenceException calling CsvDataWriter.Write with a custom reader implementation. This is undoubtedly, ultimately an error in my implementation (it's very, very naive), but the NRE occurs inside Write, which has no symbols or source available when added as a NuGet package. This makes troubleshooting the problem, um, challenging. There is no depth to the call stack; it's just crashing in .Write().

Using a .NET 7 app running code in a .NET Standard 2.1 lib, if it makes any difference. I know NuGet packages can be made debuggable, but I have no practical experience doing so myself. The alternative would of course just be getting the repo and linking against source I build myself, which I'll do if this should be too difficult.

Using CsvDataReader in PowerShell

I think your CsvDataReader is going to save my life ... If I can get it to work.

My goal is to use PowerShell to bulk copy huge CSVs into SQL Server. I'm having trouble initializing the CsvDataReader object in PowerShell. Would you be able to provide the correct PowerShell syntax for this in the New-Object cmdlet? Here's what I have -

$filePath = 'test.csv'
Add-Type -Path 'Sylvan.Data.Csv.dll'
$reader = New-Object Sylvan.Data.Csv.CsvDataReader($filePath)

It returns the following error -

$reader = New-Object Sylvan.Data.Csv.CsvDataReader($filePath)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A constructor was not found. Cannot find an appropriate constructor for type Sylvan.Data.Csv.CsvDataReader.

Thanks.

Truncated output of CsvDataWriter

When I was creating a small app for converting large datasets from one CSV format to another, with sizes of about 10 million lines on the input and 500K rows on the output, I noticed that the CSV output created by CsvDataWriter was always, deterministically(!), truncated by about 100-1000 characters. Simply put, each file was missing from half a line to a few lines.

I tried using proper usings and await syntax with the CsvDataWriter, and tried to find any "Flush" API I might have missed, but the output was always the same, always trimmed at the same position.

Please check the CsvDataWriter implementation; it looks like it's not accounting for the required buffer properly and is truncating the output early, with the result of data corruption.

Quoting Strings in Sylvan.Data.Csv

Hello,

I was wondering if there was a method to quote all strings when using CsvDataWriter, regardless of whether it contains the delimiter or not.

Thank you for your hard work on this amazing library 👍

Fast XML parser

Hi,

Do you have any solutions for parsing XML fast? Like, really fast. 😄

Thank you!

Sylvan CSV - ResultSetMode on "non-standard" csv files

Hello,
This is not particularly an issue with your code, but rather with the particular file format that I am using, which doesn't play nicely.

The file format is effectively multiple CSV tables in a single file, and the number of headings varies. A basic example is below:

"TABLE","TABLENAME1"
"HEADINGROW","HEADING_1","HEADING_2","HEADING_3","HEADING_4"
"DATAROW","DATA_1.1","DATA_1.2","DATA_1.3","DATA_1.4"
"DATAROW","DATA_2.1","DATA_2.2","DATA_2.3","DATA_2.4"

"TABLE","TABLENAME2"
"HEADINGROW","HEADING_X","HEADING_Y"
"DATAROW","DATA_X.1","DATA_Y.2"
"DATAROW","DATA_X.1","DATA_Y.2"

If I set ResultSetMode to SingleResult, then when I call GetValues and assign it to an array I only get the first two columns with values; the other column values are null. If I use a for loop over RowFieldCount, calling GetString(), I can retrieve all values. Using LINQPad with the linked gist I get the output below:

====== First Row ======
-- Row Field Count For Loop Start
TABLE
TABLENAME1
-- Row Field Count For Loop End
-- Field Count For Loop Start
TABLE
TABLENAME1
-- Field Count For Loop End
====== Second Row ======
-- Row Field Count For Loop Start
HEADINGROW
HEADING_1
HEADING_2
HEADING_3
HEADING_4
-- Row Field Count For Loop End
-- Field Count For Loop Start
HEADINGROW
HEADING_1
-- Field Count For Loop End

Conversely, if I set it to MultiResult, once it goes from row 1 with two columns to row 2 with more columns, I can't retrieve any values in a for loop using RowFieldCount, and I get only blank values using FieldCount, as shown in the LINQPad output below (with MultiResult set).

====== First Row ======
-- Row Field Count For Loop Start
TABLE
TABLENAME1
-- Row Field Count For Loop End
-- Field Count For Loop Start
TABLE
TABLENAME1
-- Field Count For Loop End
====== Second Row ======
-- Row Field Count For Loop Start
-- Row Field Count For Loop End
-- Field Count For Loop Start


-- Field Count For Loop End

Hopefully the above makes sense. And finally getting to my point: am I doing something wrong in the above that I could improve? Or could some adjustment be made to accommodate such a file format?

Thanks

DataTable.Load fails if file ends with multiple empty rows

When using a TypedCsvSchema to load a CSV file directly into a DataTable, an exception is thrown when the CSV ends with multiple empty lines.

TypedCsvColumn.cs

class TypedCsvColumn : DbColumn
{
    public TypedCsvColumn(Type type, bool allowNull, string name)
    {
        DataType = type;
        AllowDBNull = allowNull;
        ColumnName = name;
    }
}

TypedCsvSchema.cs

class TypedCsvSchema : ICsvSchemaProvider
{
    List<TypedCsvColumn> columns;

    public TypedCsvSchema()
    {
        columns = new List<TypedCsvColumn>();
    }

    public TypedCsvSchema Add(Type type, bool isNullable, string name)
    {
        columns.Add(new TypedCsvColumn(type, isNullable, name));
        return this;
    }

    public TypedCsvSchema Add<T>(string name, bool isNullable = false)
    {
        var type = typeof(T);
        bool allowNull = isNullable;
        var baseType = Nullable.GetUnderlyingType(type);
        if (baseType != null)
        {
            type = baseType;
            allowNull = true;
        }

        return Add(type, allowNull, name);
    }

    DbColumn? ICsvSchemaProvider.GetColumn(string? name, int ordinal)
    {
        return ordinal < columns.Count ? columns[ordinal] : null;
    }
}

test method body

var data = new DataTable();

var schema = new TypedCsvSchema()
    .Add<int>("id")
    .Add<string>("firstname")
    .Add<string>("lastname") ;

var options = new CsvDataReaderOptions {
    Schema = schema
};

await using (var csv = await CsvDataReader.CreateAsync("/tmp/testfile.csv",options)) {
    data.Load(csv);
}

This generates an exception, as the empty line cannot be parsed.

System.FormatException: Input string was not in a correct format.
   at System.Number.ThrowOverflowOrFormatException(ParsingStatus status, TypeCode type)
   at System.Number.ParseInt32(ReadOnlySpan`1 value, NumberStyles styles, NumberFormatInfo info)
   at System.Int32.Parse(ReadOnlySpan`1 s, NumberStyles style, IFormatProvider provider)
   at Sylvan.Data.Csv.CsvDataReader.GetInt32(Int32 ordinal)
   at Sylvan.Data.Csv.Int32Accessor.GetValue(CsvDataReader reader, Int32 ordinal)
   at Sylvan.Data.Csv.FieldAccessor`1.GetValueAsObject(CsvDataReader reader, Int32 ordinal)
   at Sylvan.Data.Csv.CsvDataReader.GetValue(Int32 ordinal)
   at Sylvan.Data.Csv.CsvDataReader.GetValues(Object[] values)
   at System.Data.ProviderBase.DataReaderContainer.CommonLanguageSubsetDataReader.GetValues(Object[] values)
   at System.Data.ProviderBase.SchemaMapping.LoadDataRow()
   at System.Data.Common.DataAdapter.FillLoadDataRow(SchemaMapping mapping)
   at System.Data.Common.DataAdapter.FillFromReader(DataSet dataset, DataTable datatable, String srcTable, DataReaderContainer dataReader, Int32 startRecord, Int32 maxRecords, DataColumn parentChapterColumn, Object parentChapterValue)
   at System.Data.Common.DataAdapter.Fill(DataTable[] dataTables, IDataReader dataReader, Int32 startRecord, Int32 maxRecords)
   at System.Data.DataTable.Load(IDataReader reader, LoadOption loadOption, FillErrorEventHandler errorHandler)
   at System.Data.DataTable.Load(IDataReader reader)

When reading in a while loop with ReadAsync, this can be prevented by checking whether RowFieldCount == FieldCount.
Is there a way to tell the reader to skip empty rows when it is used by DataTable.Load()?
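One workaround sketch, assuming the Where extension over a data reader (Sylvan.Data) that appears in a later issue on this page: filter out the short rows before handing the reader to DataTable.Load:

using Sylvan.Data;

// Skip rows that don't carry the full field count (e.g. trailing empty lines).
var filtered = csv.Where(r => csv.RowFieldCount == csv.FieldCount);
data.Load(filtered);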

SqlBulkCopy error when SQL table has ID column

I'm having trouble using SqlBulkCopy; I'm getting an error saying
"The given value of type String from the data source cannot be converted to type datetime of the specified target column."

But if I remove the ID column from the SQL table, it works...
I have tried specifying the columns in SQL to get a schema that excludes ID, but that does not work either.

The CSV file does not have an ID column.
Is there a setting I'm missing somewhere?
@MarkPflug Any ideas?
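Not Sylvan-specific, but SqlBulkCopy's standard ColumnMappings may help: once any explicit mapping is added, unmapped destination columns such as the identity ID are left for the server to fill. A sketch (connection, table, and column names are illustrative):

using Microsoft.Data.SqlClient;

// connection is an open SqlConnection; csv is the CsvDataReader.
using var bulk = new SqlBulkCopy(connection) { DestinationTableName = "dbo.MyTable" };
// Map CSV columns to table columns by name; ID is deliberately not mapped.
bulk.ColumnMappings.Add("Date", "Date");
bulk.ColumnMappings.Add("Amount", "Amount");
bulk.WriteToServer(csv);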

Problems with UTF-8 delimiter splitting.

I was trying Sylvan CSV, and I was really struggling with it until I found a peculiar anomaly.

iIncome_Source_Sub_Source_Mapping_ID,iIncome_Source_ID,sIncome_Sub_Source,sIncome_Sub_Source_2,iIncome_Sub_Source_ID,blnActive
567774,11014,(주)다올소프트,기타녹음,NULL,1
394394,11014,(주)달콤소프트,유선전송,NULL,1
601057,11014,(주)대교,기타녹음,NULL,1

The first row should have NULL at GetValue(5). It seems to be splitting inside the Unicode field, returning (주)다 as one column instead of (주)다올소프트, with 소프트 as the next column; it is splitting on 올.

What is interesting, is if I remove the last two rows in the CSV, it works and does not exhibit this behavior …

using var s = File.OpenRead("Income_Source_Sub_Source_Mapping.csv");
using var reader = new StreamReader(s, Encoding.UTF8);
var options = new CsvDataReaderOptions
{
    HasHeaders = true,
    Delimiter = ','
};

using var csv = CsvDataReader.Create(reader, options);

while (csv.Read())
{
    if (csv.GetString(0) == "567774")
    {
        var x = csv.GetString(5);

        Debug.Assert(x == "1");
    }
}

What I am curious about is: why should subsequent data affect how a previous row is decoded?

Does this reader process row by row, or the whole stream first?

Income_Source_Sub_Source_Mapping.zip

CsvReader Unable to Import Scientific Notation Value to SQL Decimal

I have a value in the csv file that's in scientific notation (1.8492262445E9). It gets mapped to a SQL Server field of datatype: decimal(18,0). I'm getting this error -

$SqlBulkCopy.WriteToServer($CSVReader)
Error
-----------------------------------------------------------------------------------------------------------
Exception calling "WriteToServer" with "1" argument(s): "Input string was not in a correct format."

If I change the SQL datatype to float, it loads successfully, but the value is 1849226244.5. I know this is the correct value, but I really need it rounded up to 1849226245, which is what decimal(18,0) will do. I've verified this behavior with the SQL statement -

Select Convert(float, '1.8492262445E9') As [Float], Convert(decimal(18,0), Convert(float, '1.8492262445E9')) As [Decimal]

Float                  Decimal
---------------------- ---------------------------------------
1849226244.5           1849226245

Testing it more in SQL, I get an error converting from a string to decimal(18,0) -

Select Convert(decimal(18,0), '1.8492262445E9')

Decimal
---------------------------------------
Msg 8114, Level 16, State 5, Line 40
Error converting data type varchar to numeric.

But if converted from a number to decimal(18,0) it works as expected -

Select Convert(decimal(18,0), 1.8492262445E9) As [Decimal]

Decimal
---------------------------------------
1849226245

I'm setting the Schema property in the CsvDataReaderOptions to the SQL table schema, and that's been working great. Unfortunately, I'm not seeing anything in CsvDataReaderOptions to control the behavior above and eliminate the error.

Is this a bug, or just a current limitation of the CsvReader, or am I just missing something?

Thanks.

CsvDataReader missing GetRecords<T>()

There is probably something missing on my part, but I can't seem to find the issue.

I've imported the Sylvan.Data.Csv package into a Blazor Server application at the latest version (1.2.6-b0002). While following the data binding example, I'm trying to parse a document into a model, but the reader appears to be missing the GetRecords() method; I do have access to the GetRawRecord and GetRawRecordSpan methods. I'm also using both the Sylvan.Data and Sylvan.Data.Csv namespaces, but the first one isn't being used.

This is my method:

var opts = new CsvDataReaderOptions() { Schema = CsvSchema.Nullable };
using var reader = CsvDataReader.Create(path, opts);
var tags = reader.GetRecords<Tag>();

I'm sorry if this is an obvious fix but I can't seem to jump over this wall.

Non-descript exception thrown for Sylvan.Data.Csv

A specific line in a file continues to throw this exception, and I cannot glean any more data or reason as to why it's being thrown.

I unfortunately cannot share the CSV line as it contains confidential information, but I was hoping someone might have an idea as to how I can debug further?

Exception of type 'Sylvan.Data.Csv.CsvFormatException' was thrown

at Sylvan.Data.Csv.CsvDataReader.ReadField(Int32 fieldIdx)
at Sylvan.Data.Csv.CsvDataReader.d__6.MoveNext()
at Sylvan.Data.Csv.CsvDataReader.d__8.MoveNext()
at SomeFileReader.d__3.MoveNext() in


XBase reader options

Currently, the only option is to provide a specific encoding. There are probably a variety of other useful options to expose. Here are a few to start with.

  • Read deleted rows. This would likely include adding an IsDeleted property to the data reader too.

  • Memo handling options:

    • Skip memo file: would throw when accessing memo fields.
    • Ignore memo columns: memo columns wouldn't be exposed from the reader.
    • Other?

Incorrect parsing of row (row field count is too low)

Package references:

  <ItemGroup>
    <PackageReference Include="Sylvan.Common" Version="0.2.1" />
    <PackageReference Include="Sylvan.Data.Csv" Version="1.1.6" />
  </ItemGroup>

Repro code:

using System;
using System.Buffers;
using System.IO;
using Sylvan;
using Sylvan.Data.Csv;

namespace bad_csv
{
    class Program
    {
        static void Main(string[] args)
        {
            var content = @",,archimedes.library,archimedes.library/1.0.245,Archimedes.Library,1.0.245,2020-11-27T19:35:06.0046046Z,2020-11-27T19:33:25.687Z,Available,false,false,false,false,false,[],""[""""netstandard2.0""""]"",""[""""netstandard2.0""""]"",""[""""netstandard2.0""""]"",""[""""netstandard2.0""""]""
,,eto.serialization.xaml,eto.serialization.xaml/2.5.8,Eto.Serialization.Xaml,2.5.8,2020-11-27T19:34:44.6917011Z,2020-11-27T19:32:26.743Z,Available,false,false,true,false,false,[],""[""""any"""",""""netstandard1.0"""",""""netstandard2.0""""]"",""[""""netstandard1.0"""",""""netstandard2.0""""]"",""[""""any"""",""""netstandard1.0"""",""""netstandard2.0""""]"",""[""""any"""",""""netstandard1.0"""",""""netstandard2.0""""]""
";

            var stringPool = new StringPool(128);
            var pool = ArrayPool<char>.Shared;
            var bufferSize = 32 * 1024;
            var buffer = pool.Rent(bufferSize);
            using var csvReader = CsvDataReader.Create(new StringReader(content), new CsvDataReaderOptions
            {
                HasHeaders = false,
                Buffer = buffer,
                BufferSize = buffer.Length,
                StringFactory = stringPool.GetString,
            });

            Console.WriteLine("Raw CSV:");
            Console.WriteLine(content);

            while (csvReader.Read())
            {
                Console.WriteLine("Row:");
                for (var i = 0; i < csvReader.RowFieldCount; i++)
                {
                    Console.WriteLine("  " + csvReader.GetString(i));
                }
            }
        }
    }
}

Output:

Raw CSV:
,,archimedes.library,archimedes.library/1.0.245,Archimedes.Library,1.0.245,2020-11-27T19:35:06.0046046Z,2020-11-27T19:33:25.687Z,Available,false,false,false,false,false,[],"[""netstandard2.0""]","[""netstandard2.0""]","[""netstandard2.0""]","[""netstandard2.0""]"
,,eto.serialization.xaml,eto.serialization.xaml/2.5.8,Eto.Serialization.Xaml,2.5.8,2020-11-27T19:34:44.6917011Z,2020-11-27T19:32:26.743Z,Available,false,false,true,false,false,[],"[""any"",""netstandard1.0"",""netstandard2.0""]","[""netstandard1.0"",""netstandard2.0""]","[""any"",""netstandard1.0"",""netstandard2.0""]","[""any"",""netstandard1.0"",""netstandard2.0""]"

Row:


  archimedes.library
  archimedes.library/1.0.245
  Archimedes.Library
  1.0.245
Row:


  eto.serialization.xaml
  eto.serialization.xaml/2.5.8
  Eto.Serialization.Xaml

Excel parse: (screenshot showing the expected parse)

Incorrect data on amounts

Hello Again Mark,

During my tests I noticed that the report downloads an incorrect amount. This is because the '.' is removed from the amount, so numbers in the format 123.54 appear as 12345, while 1234 stays 1234 as it should.

I tested it before my additional formatting and after.

File: = On Email

Column in CSV = "Invoice Amount"

Excample = "C6"

Code Details:

method ThrowBadLines(r) => tries to parse as specific data, returns false on exception.

function JoinHeaders() => "DueDate,1,OriginalInvoiceAmount,OriginalInvoiceCurrency,PaymentDate,BaswareStatus,SupplierNumber,CustomerNumber,InvoiceNumber,9,10,11,12,13,14,15,16,17,18"

function NeededHeaders() => (the needed subset of headers; shown in a screenshot)

Code:

        private List<T>? SylvanReaderCSV<T>()
        {
            try
            {
                var schema = new SylCsv.CsvSchema(Schema.Parse(Setup.JoinHeaders()));
                var options = new SylCsv.CsvDataReaderOptions{Schema = schema};
                using (SylCsv.CsvDataReader csvDataReader = SylCsv.CsvDataReader.Create(Setup.FilePath, options)) //options
                {
                    var dataReader = csvDataReader.Select(Setup.NeededHeaders());

                    //throw bad lines from list, we do not want incorrect data.
                    var checkedData = dataReader.Where(r => ThrowBadLines(r));

                    var binderOptions = new DataBinderOptions
                    {
                        InferColumnTypeFromMember = false,
                        BindingMode = DataBindingMode.Any
                    };
                    DataBinder.Create<T>(checkedData, binderOptions);

                    List<T> extracts = checkedData.GetList<T>().ToList();

                    return extracts;
                }
            }
            catch (Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }

Class:

    public class ExcelInvoice
    {
        #region [Data from files]
        public string? SupplierNumber { get; set; }
        public string? CustomerNumber { get; set; }
        public string? InvoiceNumber { get; set; }
        public DateTime? DueDate { get; set; }
        public double? OriginalInvoiceAmount { get; set; }
        public string? OriginalInvoiceCurrency { get; set; }
        public double? CurrentBalanceInvoiceAmount { get; set; }
        public string? CurrentBalanceInvoiceCurrency { get; set; }
        public DateTime? PaymentDate { get; set; }
        public string? BaswareStatus { get; set; } 
        public string? Reference { get; set; }
        public string? SupplierCountry { get; set; }
        public string? SupplierRegion { get; set; }
        public string? CustomerCountry { get; set; }
        public string? CustomerRegion { get; set; }
        public string? InvoiceType { get; set; }
        public string? WHTSmallDifference { get; set; }
        public decimal? LocalAmount { get; set; }
        public int? OutstandingDays { get; set; }
        #endregion

        #region [Taken with SQL]
        public double? NOKInvoiceAmount { get; set; }
        public double? NOKCurrentBalanceAmount { get; set; }
        public double? ExRate { get; set; }
        #endregion

        #region [Filled by GetInvoiceHash()]
        public int ID { get; set; }
        public string? HashKey { get; set; } 
        public string? FileHash { get; set; } 
        public bool? IsActive { get; set; } 
        public int? InsertID { get; set; } 
        #endregion

        #region [AP & AR Side only]
        public int BaswareMatchID { get; set; }
        public decimal? BaswareOriginalInvoiceAmount { get; set; }
        public decimal? BaswareOriginalInvoiceCurrency { get; set; }
        public decimal? BaswareNOKInvoiceAmount { get; set; }
        public decimal? OriginalDifference { get; set; }
        public decimal? NOKDifference { get; set; }
        public int MatchedID { get; set; }
        #endregion

        #region [AR Side only]
        public decimal? WHTStatus { get; set; }
        #endregion

    }

Read in batches

Apologies in advance if this is documented somewhere and I, obviously, failed to locate the hint.

I am in the process of writing a PowerShell module leveraging your fast libraries. I am looking for a method to Read/ReadAsync in batches in order to process large (> 1 GB) CSV files without overtaxing the system. I have seen this (https://stackoverflow.com/questions/66570865/memorystream-data-getting-truncated) and the "RangeDataReader" would work. Is this still the advised approach? Which library contains the derived class?

Many thanks in advance

Doc error

Here we see the variable dataReader used on the last line of the sample code fragment, yet it is never declared and it is not obvious what it is.

Make CsvDataReader.GetFieldSpan public.

It would be great to be able to call CsvDataReader.GetFieldSpan directly to access the ReadOnlySpan<char>. My current use case is creating a hash for a single column; this way I could convert the ReadOnlySpan<char> to a ReadOnlySpan<byte> and pass that to my hash function.

Another use case would be manually parsing custom types.
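A sketch of the hashing use case, assuming GetFieldSpan were public (MemoryMarshal.AsBytes reinterprets the chars as bytes without copying):

using System;
using System.Runtime.InteropServices;
using System.Security.Cryptography;

// csv and ordinal are assumed context; GetFieldSpan is the hypothetical public API.
ReadOnlySpan<char> field = csv.GetFieldSpan(ordinal);
ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(field);
byte[] hash = SHA256.HashData(bytes);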

Add code coverage

I'd like to have some code coverage numbers before shipping a 1.0 release. Unit tests are a bit anemic at this point, so there are probably some bugs that I've yet to discover.

Ordering of columns

The example below is a stylised version of the case where one gets CSV files from different sources with similar data that needs to be extracted to the same schema; the various sources, though, have (1) different header names and (2) different column order.
I have found that Sylvan deals with (1) very easily, but I have not found a way to do (2) easily. Am I missing something?
Thanks.
PS: I guess I could work around it by defining a Record class and calling GetRecords() - but ...

using Sylvan.Data;
using Sylvan.Data.Csv;
using System.Data;

//Note: the end schemas here are the same, the source ones are not
CsvSchema schema1 = new CsvSchema(Schema.Parse("Name,Beauty:int,Charm:double"));
CsvSchema schema2 = new CsvSchema(Schema.Parse("Nome>Name,Bellezza>Beauty:int,Fascino>Charm:double"));
CsvSchema schema3 = new CsvSchema(Schema.Parse("blame>Name,duty>Beauty:int,chasm>Charm:double"));

string data1 =
@"Name,Beauty,Charm
MickeyMouse,2,2
DonaldDuck,1,3.5
Batman,-1,1.1";
string data2 =
@"Fascino,Bellezza,Nome
3.1,1,Topolino
2.3,2,Paperino
5,1,Paperinik";
string data3 =
@"duty,blame,chasm
1,Topo,0.4
2,Pape,0.5
1,nik,0.7";

var reader1 = CsvDataReader.Create(new StringReader(data1), new CsvDataReaderOptions { Schema = schema1});
var reader2 = CsvDataReader.Create(new StringReader(data2), new CsvDataReaderOptions { Schema = schema2});
var reader3 = CsvDataReader.Create(new StringReader(data3), new CsvDataReaderOptions { Schema = schema3});

DataTable dataTable1 = new DataTable();
dataTable1.Load(reader1);

DataTable dataTable2 = new DataTable();
dataTable2.Load(reader2);

DataTable dataTable3 = new DataTable();
dataTable3.Load(reader3);

//breakpoint here and view the 3 tables: the columnNames are the same (as expected),
//but the order is not:
//each table has the columns ordered as in the source "files", 
// not as defined in the schema
;
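One workaround, not Sylvan-specific: after loading, reorder each DataTable to a canonical order with the standard DataColumn.SetOrdinal:

// Reorder a loaded table to the schema's column order.
string[] order = { "Name", "Beauty", "Charm" };
for (int i = 0; i < order.Length; i++)
{
    dataTable2.Columns[order[i]].SetOrdinal(i);
}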

Write headers when no data?

Is there any way with CsvDataWriter to always write headers, even when the data table contains no rows? Alternatively, is there a mechanism to write CSV without using data tables or objects, so I could 'pretend' that the header is a row?

Unable to write IDataReader to CSV

Out of interest, why does

public long Write(DbDataReader reader)

Require a DbDataReader?

Why don’t you use IDataReader?

Then you can support a much wider range of data sources.

I have an IDataReader that I wanted to save to CSV, and I can't, because you require a much more complex contract, for which there seems to be no reason.

Any chance you can change the contract to support IDataReader?
