Handling multiline text column #422
Unfortunately, the header cannot be parsed when it spans several lines. If you know for sure how many lines your header takes, and if the position of each column is always the same, you can refer to columns by their index instead of their name, and then ask to ignore the first X lines of your csv file. The mapping would look like the following:

stream.CrossApplyTextFile("parse file", o => o
    .UseMap(m => new
    {
        Value1 = m.ToColumn(0),
        Value2 = m.ToDateColumn(1, "ddMMyyyy"),
        Value3 = m.ToNumberColumn<decimal?>(2, ".")
    })
    // Skip the parsing of the header. The global parsing won't need it as no column relies on its name
    .IgnoreFirstLines(2)
);

Let me know if it works (it should!).
Sorry for the confusion. The multiline issue is not in the header but in the column values. For example, my csv header looks like this:

Code|Name|Description

Here the 'Description' column value is multiline. Also, we cannot predict how many lines each column value will take.
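For illustration (made-up values; the " text qualifier mentioned in the issue is assumed), such a row could look like this:

Code|Name|Description
A001|Widget|"First line of the description
second line of the same description"

An RFC 4180 parser treats everything between the opening and closing quotes, line breaks included, as a single 'Description' value.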
Sorry for the confusion. At the moment, native support of multiline values doesn't exist. As a matter of fact, the parsing algorithm would need to be drastically changed.
Note to self: maybe use this class https://github.com/JoshClose/CsvHelper/blob/master/src/CsvHelper/CsvParser.cs as a replacement for the csv parsing.
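For reference, a minimal sketch of reading such a pipe-separated file with CsvHelper, which handles quoted multiline values out of the box (the file name and column name are illustrative, not taken from this issue):

using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    Delimiter = "|",
    Mode = CsvMode.RFC4180, // quoted fields may contain line breaks
};
using (var reader = new StreamReader("input.csv"))
using (var csv = new CsvReader(reader, config))
{
    csv.Read();
    csv.ReadHeader();
    while (csv.Read())
    {
        // GetField returns the complete value even when it spans several physical lines
        var description = csv.GetField("Description");
    }
}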
Hi, thanks for the suggestion. I have tested the code with a large table (6 columns, numeric & guid) having 4.3 million records. I have deployed the code on a Linux server. It took around 13 hours to complete the loading from the csv (inside a zip archive) to the sql database. Is this the expected time for such a large table, or can I improve the performance by optimising? My code looks like below:

where CrossApplyWithCsvHelper is the new extension method I have written with CsvHelper.
It is definitely not the expected time. Pushing millions of rows directly to sql server is a matter of minutes. I will have a closer look at your code and come back to you.
Improvements I can think of:

Submit rows as they are parsed

Here, you are loading and parsing the entire file in memory before submitting its rows to the output stream. You should submit rows one by one as they are parsed:

public static IStream<TOut> CrossApplyWithCsvHelper<TOut>(this IStream<IFileValue> stream, string name, string delimitter = "|", bool noParallelisation = false)
{
    return stream.CrossApply<IFileValue, TOut>("parse file", (fileValue, dependencyResolver, cancellationToken, push) =>
    {
        var config = new CsvConfiguration(CultureInfo.InvariantCulture)
        {
            Mode = CsvMode.RFC4180,
            Delimiter = delimitter,
            LineBreakInQuotedFieldIsBadData = false,
        };
        config.BadDataFound = null;
        config.MissingFieldFound = null;
        // Read the file content as a stream instead of buffering the whole file in memory first
        using (var reader = new StreamReader(fileValue.GetContent()))
        using (var csv = new CsvReader(reader, config))
        {
            csv.Read();
            csv.ReadHeader();
            while (csv.Read() && !cancellationToken.IsCancellationRequested)
            {
                // Push each record to the output stream as soon as it is parsed
                push(csv.GetRecord<TOut>());
            }
        }
    });
}

Use bulk load

At the moment, the SqlServer extensions for dotnet don't leverage the power of bulk load. Therefore (still at the moment), rows are saved one by one.

Indexes

13h to save a couple of million rows is absolutely not acceptable. It is also possible that the existing table you are inserting into doesn't have the proper indexes, so the database has to make an index scan or even a table scan to find the row that may need to be updated. The index you need to create is on the key that is used to find the corresponding row; if none is mentioned, the primary key(s) will be used.

Use insert only

Use insert only if you don't need to make an upsert. This will blindly do an insert into instead of a merge that needs to seek the potential existing row.
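For context, a minimal sketch of plugging the extension above into a pipeline (MyRow is a hypothetical record type whose properties match the CSV columns, and fileStream stands for the IStream<IFileValue> produced earlier in the pipeline, e.g. after unzipping):

// Parse the csv rows as they are streamed out of the file
var rows = fileStream.CrossApplyWithCsvHelper<MyRow>("parse rows");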
Thanks for the suggestion. Can you give me an example of the insert only option? I couldn't find such an option in the save operator.
Oops, sorry, indeed: at the moment this option only exists in the EFCore extensions. Did you give the EFCore extensions a try? Because 13h to upsert millions of rows seems REALLY a lot!
No. Moving to EFCore was a little difficult for me, since I would need to rewrite code in several places. I tried to make a bulk insert with the current implementation: I added an extension method for SqlBulkInsert and the performance improved a lot. But I have another challenge with memory usage. Memory usage is almost 100% on a 40GB RAM machine. I suspect the memory is not released even after one table load is completed, and it keeps growing for each table until all the tables are loaded. My code looks like below:

and the SqlBulkInsert function looks like this,

where BulkLoadProcessor is an implementation of IDoProcessor.

Please let me know if you can find any problem that causes the memory leak.
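Not the actual BulkLoadProcessor mentioned above (that code is not shown here), but a minimal sketch of a batched bulk insert with Microsoft.Data.SqlClient; the idea is to flush and clear a fixed-size batch so memory stays bounded instead of accumulating every row of every table (connection string, table name and batch size are placeholders):

using System.Data;
using Microsoft.Data.SqlClient;

public static class BulkInsertSketch
{
    // Writes one batch to the destination table and releases its rows afterwards
    public static void FlushBatch(string connectionString, string tableName, DataTable batch)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = tableName;
            bulk.EnableStreaming = true; // avoid buffering all rows on the client side
            bulk.BatchSize = 10_000;     // send rows in chunks rather than one huge payload
            bulk.WriteToServer(batch);
        }
        batch.Clear(); // drop the row references so they can be garbage collected before the next batch
    }
}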
How can we handle it if a pipe-separated csv has a text column containing multiline text? The text delimiter is ".
With the given example in the documentation, it throws the error "could not deserialize value" because of the new line in the text column.