Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't read the entire log file into memory #13

Open
Shazwazza opened this issue Mar 18, 2016 · 11 comments
Open

Don't read the entire log file into memory #13

Shazwazza opened this issue Mar 18, 2016 · 11 comments

Comments

@Shazwazza
Copy link
Contributor

Reading files can be done MUCH more effectively, by no means should this be loading the entire log file into memory and then parse each line (= more strings) and then regex parse each line. This will consume far more memory and CPU than necessary.

For example this:
https://github.com/DanDiplo/UmbracoTraceLogViewer/blob/master/Diplo.TraceLogViewer/Services/LogDataService.cs#L53

You can instead open a file stream and read each line one at a time until the end of the file, this way you are only loading in one line at a time and performing regex on a single line.

@DanDiplo
Copy link
Owner

@Shazwazza Sure, that makes sense. In mitigation, when I first started the project all the example log files I had where quite small, so never really had any issue with memory.

Having said that, it is going to be hard to determine how to read just one entry per line. Because of the way the Umbraco log is formatted (without any obvious delimiters) you can't just say that there is one entry per line, since a single log entry can encompass many newlines. For example, take the following example log entry fragment:

System.NullReferenceException: Object reference not set to an instance of an object.
   at ASP._Page_Views_Partials_Website_SomeFile_cshtml.Execute() in c:\Projects\MyWebsite\Views\Partials\JobBoard\SomeFile.cshtml:line 9
   at System.Web.WebPages.WebPageBase.ExecutePageHierarchy()
   at System.Web.Mvc.WebViewPage.ExecutePageHierarchy()
   at System.Web.WebPages.WebPageBase.ExecutePageHierarchy(WebPageContext pageContext, TextWriter writer, WebPageRenderingBase startPage)
   at System.Web.Mvc.RazorView.RenderView(ViewContext viewContext, TextWriter writer, Object instance)
   at System.Web.Mvc.BuildManagerCompiledView.Render(ViewContext viewContext, TextWriter writer)
   at Umbraco.Core.Profiling.ProfilingView.Render(ViewContext viewContext, TextWriter writer)
   at System.Web.Mvc.HtmlHelper.RenderPartialInternal(String partialViewName, ViewDataDictionary viewData, Object model, TextWriter writer, ViewEngineCollection viewEngineCollection)
   at System.Web.Mvc.Html.PartialExtensions.Partial(HtmlHelper htmlHelper, String partialViewName, Object model, ViewDataDictionary viewData)
   at System.Web.Mvc.Html.PartialExtensions.Partial(HtmlHelper htmlHelper, String partialViewName, Object model)
   at ASP._Page_Views_Partials_JobBoard_JobBoardList_cshtml.<>c__DisplayClass1.<RenderJobs>b__0(TextWriter __razor_helper_writer) in....

As you can see, a single entry can span multiple newlines, which is the reason I had to use complex regex to parse the log in the first place. Can you see a way that you could read just one entry at a time from a stream, as I'm not sure I can?

@Shazwazza
Copy link
Contributor Author

Right, so you can continue to read line by line until you reach the beginning of a log entry delimiter that you are looking for. You would need to do this already to determine when the next log entry starts. When you detect that a new log entry starts, you know the previously combined string is a single log entry, you can then proceed to run your regex parsing against that, add it to the output, then start the new string from the beginning of the log input and continue.

It's not just memory you need to worry about here, it's also CPU. Running this much regex on a larger string will result in more memory and lots of CPU used by Regex. Performing your parsing on smaller strings will be much more efficient. Also be sure to use StringBuilder for a lot of this, you don't want to be allocating a ton of strings by doing any normal concatenation.

@DanDiplo
Copy link
Owner

OK, that makes sense and should be doable. I might have a little free time next week, so will try and give this a look and let you know how it goes.

@DanDiplo
Copy link
Owner

OK, I did some testing today and added in a new method that read the log file in using a TextReader that read one line at a time. Each line read in was added to a StringBuilder until it detected a new long entry was found. The contents of the StringBuilder were then passed as usual to the regex that parses the line into a LogDataItem model and this then build up a collection of LogDataItems to send back to the view as JSON.

I then created a large log file with 59,000 lines (~ 62MB) and then did some tests to compare this to the old FileReadAll() method. The test comprised of a using a Stopwatch to time how long it would take to perform a few iterations of this file. The results were surprising - in every test I did the FileReadAll method was almost twice as fast as reading the file in line by line. I tried the same thing with a smaller file and many more iterations, and the results were the same - the ReadAll method was twice as fast.

I'm not sure about memory usage as this is harder for me to measure. But I'm sure you are right in that reading line by line would use less. But if this method is twice as slow I'm not really sure any saving in memory would be worth it. I'm sure most log files people have are way under 1MB in size, and I don't think holding the contents of that size file in memory is over the top - after all a typical Umbraco site uses between 200MB - 500MB (and I've seen some large sites go over 1GB!).

Any thoughts?

@Shazwazza
Copy link
Contributor Author

Hey Dan, thanks for running some profiles. I'm sure there's something odd going on that is causing the performance bottleneck. Happy to have a look soon(ish) to see if i can spot anything. Are you able to point me to where/how your performance profiling code is ?

@DanDiplo
Copy link
Owner

Hi @Shazwazza - thanks, I appreciate any input you have as you are far more experienced than me at this kind of thing! I've committed the experimental changes into Github. Note the changes are only accessed by the Unit tests which is what I used for profiling.

Basically, the new method that reads using a TextReader is here:

https://github.com/DanDiplo/UmbracoTraceLogViewer/blob/master/Diplo.TraceLogViewer/Services/LogDataService.cs#L87

I then have some NUnit tests. One of the tests opens a large log file then parses it 3 times using the original method and 3 times using the stream method and then outputs to the test stream the timings. This is here:

https://github.com/DanDiplo/UmbracoTraceLogViewer/blob/master/Diplo.TraceLogViewer.Tests/Umbraco72xTests.cs#L85

If you are not familiar with NUnit then I use NUnit Test Adapter VS extension Once installed go to Test > Windows > Test Explorer in VS. Right click the test called "Log Profiling Tests" and choose "Run Selected Test". It takes about 30 secs. Then in Test Explorer you can click the "Output" link to view the timings. Yeah, it's a bit crude, but something I just hacked together to test the waters. I'm sure you know something better :)

image

image

@Shazwazza
Copy link
Contributor Author

Have submitted a PR ... #14 Hope it makes sense!

@Shazwazza
Copy link
Contributor Author

The next step is to deal with paging properly. IIRC there's paging in the UI correct? It would be heaps more efficient to do that correctly instead of reading the entire log file and then paging over it. You could page over it by knowing the line to seek to, then start parsing a page and exit the loop/close the file when you've read enough lines for a page.

@DanDiplo
Copy link
Owner

That's fantastic @Shazwazza ! I'll review the PR tomorrow when I have more time, but all sounds good.

Regarding paging: Yeah, the paging is all performed client-side (so it basically reads all entries and then breaks them up client side). It uses http://ng-table.com/ for this. As you state, this isn't the most efficient when there are a large number of lines. Adding server side paging would be more efficient.

However, the search (via NgTable) relies on the fact that all entries are returned to the Angular view. If the pagination was made server side then the search would also need to operate that way, too I can see that being quite tricky!

@Shazwazza
Copy link
Contributor Author

Yikes :( Ok I guess this is another issue we'll need to deal with. Essentially this would mean that you could really bring a browser to it's knees with this kind of data. For example, the test log you are using in those tests is 60mb or so. This would mean that the request to get this log data would return 60mb worth of data, this would take a substantial amount of time and then this data would be stored in the browsers memory.

You could still do all of this server side and still do it with a smaller amount of memory ... though it would chew a bit of CPU because you'd have to perform a search + filter on every line read but since it's paged data, you'd the amount of lines read would be limited (so long as the search yields some results). This would still be more effective for larger files than sending 100% of the data to the client though.

@DanDiplo
Copy link
Owner

Yeah, in some ways it is horribly inefficient. Having said that, I've never actually seen real log files ever get anywhere near that size. Where I work we have over 30 Umbraco 7 sites in production and haven't encountered any that have given any issues. I even tried loading that 60MB file and it took a few seconds to load, but other than that it performed well (albeit I have a powerful home PC with i7 and 16GB RAM!). Chrome memory usage with that giant file was about the same as Facebook, Twitter and Gmail open at once (not sure what that says... ;-))

There are also advantages in putting the load on the client - most devs have powerful machines and it removes the burden from the server and minimises HTTP requests. Once the data has loaded it tends to be more responsive handled client side. So there are pros and cons.

I guess it is down to whether users will regularly encounter really large log files? Are there circumstances where this has happened to you?

I also wonder whether if big files are going to be common whether the file format needs to be easier to parse (ie. proper delimiters)? Maybe even go back to storing the data in the DB layer (async) if a lot of server-side filtering is required?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants