Don't read the entire log file into memory #13
Comments
@Shazwazza Sure, that makes sense. In mitigation, when I first started the project all the example log files I had were quite small, so I never really had any issue with memory. Having said that, it is going to be hard to determine how to read just one entry at a time. Because of the way the Umbraco log is formatted (without any obvious delimiters) you can't just say that there is one entry per line, since a single log entry can span many lines. For example, take the following example log entry fragment:
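(Illustrative, made-up entry; the exact Umbraco trace-log layout varies between versions.)

```
2016-02-09 14:03:27,412 [P4920/D2/T15] ERROR MySite.Web.Controllers.HomeController - Unhandled exception in action
System.NullReferenceException: Object reference not set to an instance of an object.
   at MySite.Web.Controllers.HomeController.Index()
   at MySite.Web.Events.ContentEvents.OnSaved(Object sender, EventArgs e)
```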
As you can see, a single entry can span multiple lines, which is the reason I had to use complex regex to parse the log in the first place. Can you see a way that you could read just one entry at a time from a stream? I'm not sure I can.
Right, so you can continue to read line by line until you reach the delimiter that marks the beginning of the next log entry. You would need to do this already to determine when the next log entry starts. When you detect that a new log entry starts, you know the previously combined string is a single log entry; you can then run your regex parsing against that, add it to the output, start a new string from that point in the log, and continue. It's not just memory you need to worry about here, it's also CPU. Running this much regex against one large string will use more memory and a lot of CPU inside Regex; performing your parsing on smaller strings will be much more efficient. Also be sure to use a StringBuilder for a lot of this - you don't want to be allocating a ton of strings by doing normal concatenation.
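A minimal sketch of that approach, assuming each new entry starts with a `yyyy-MM-dd HH:mm:ss,fff` timestamp (the regex and the `LogEntryReader`/`ReadEntries` names are placeholders, not the project's actual code):

```csharp
// Sketch only: stream the log and split it into entries by detecting the
// timestamp that begins each new entry. The regex and the type/method names
// are placeholders, not the project's actual API.
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class LogEntryReader
{
    // Assumes every new entry starts with a "yyyy-MM-dd HH:mm:ss,fff" stamp.
    private static readonly Regex EntryStart =
        new Regex(@"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}", RegexOptions.Compiled);

    public static IEnumerable<string> ReadEntries(string path)
    {
        using (var reader = new StreamReader(path))
        {
            var current = new StringBuilder();
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // A new timestamp means the buffered text is one complete entry.
                if (EntryStart.IsMatch(line) && current.Length > 0)
                {
                    yield return current.ToString();
                    current.Clear();
                }
                current.AppendLine(line);
            }
            if (current.Length > 0)
                yield return current.ToString();
        }
    }
}
```

The existing entry-parsing regex can then be run against each much smaller string as it is yielded, rather than against the whole file at once.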
OK, that makes sense and should be doable. I might have a little free time next week, so I will try to give this a look and let you know how it goes.
OK, I did some testing today and added a new method that reads the log file in using a TextReader. I then created a large log file with 59,000 lines (~62MB) and did some tests to compare this to the old method. I'm not sure about memory usage, as this is harder for me to measure, but I'm sure you are right that reading line by line would use less. However, if this method is twice as slow I'm not really sure any saving in memory would be worth it. I'm sure most log files people have are way under 1MB in size, and I don't think holding the contents of a file that size in memory is over the top - after all, a typical Umbraco site uses between 200MB and 500MB (and I've seen some large sites go over 1GB!). Any thoughts?
Hey Dan, thanks for running some profiles. I'm sure there's something odd going on that is causing the performance bottleneck. Happy to have a look soon(ish) to see if I can spot anything. Are you able to point me to where your performance profiling code is and how it runs?
Hi @Shazwazza - thanks, I appreciate any input you have, as you are far more experienced than me at this kind of thing! I've committed the experimental changes to GitHub. Note the changes are only accessed by the unit tests, which is what I used for profiling. Basically, the new method that reads using a TextReader is here: I then have some NUnit tests. One of the tests opens a large log file, parses it 3 times using the original method and 3 times using the stream method, and then outputs the timings to the test stream. This is here:

If you are not familiar with NUnit: I use the NUnit Test Adapter VS extension. Once installed, go to Test > Windows > Test Explorer in VS. Right-click the test called "Log Profiling Tests" and choose "Run Selected Test". It takes about 30 secs. Then in Test Explorer you can click the "Output" link to view the timings. Yeah, it's a bit crude, but something I just hacked together to test the waters. I'm sure you know something better :)
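For context, a profiling test like the one described might look roughly like this (the commented-out calls are placeholders for the original parser and the TextReader-based one; only the timing/output plumbing is shown):

```csharp
// Rough shape of a Stopwatch-based comparison test; the service calls are
// hypothetical placeholders, not the project's actual method names.
using System;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
public class LogProfilingTests
{
    private const string LargeLogPath = @"TestLogs\UmbracoTraceLog.large.txt";

    [Test]
    public void Compare_Original_And_Stream_Parsing()
    {
        for (int i = 0; i < 3; i++)
        {
            var sw = Stopwatch.StartNew();
            // var all = new LogDataService().GetLogData(LargeLogPath);          // hypothetical original method
            sw.Stop();
            Console.WriteLine("Original parse #{0}: {1} ms", i + 1, sw.ElapsedMilliseconds);

            sw = Stopwatch.StartNew();
            // var streamed = LogEntryReader.ReadEntries(LargeLogPath).ToList(); // hypothetical stream method
            sw.Stop();
            Console.WriteLine("Stream parse #{0}: {1} ms", i + 1, sw.ElapsedMilliseconds);
        }
    }
}
```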
Have submitted a PR ... #14. Hope it makes sense!
The next step is to deal with paging properly. IIRC there's paging in the UI, correct? It would be heaps more efficient to do that correctly instead of reading the entire log file and then paging over it. You could page over it by knowing the line to seek to, then start parsing a page and exit the loop/close the file when you've read enough lines for a page.
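For example, building on the hypothetical streaming reader sketched earlier, a page could be produced lazily so the file is only read until the requested page is full:

```csharp
// Sketch: page server-side over the streamed entries (page is 1-based;
// ReadEntries is the hypothetical streaming reader from the earlier sketch).
using System.Collections.Generic;
using System.Linq;

public static class LogPaging
{
    public static IList<string> GetPage(string path, int page, int pageSize)
    {
        return LogEntryReader.ReadEntries(path)
            .Skip((page - 1) * pageSize)  // discard earlier entries without keeping them
            .Take(pageSize)               // stop enumerating once the page is full
            .ToList();
    }
}
```

Skip still reads past the earlier entries on disk, but nothing before the page is held in memory, and enumeration stops as soon as Take has collected a full page.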
That's fantastic @Shazwazza! I'll review the PR tomorrow when I have more time, but it all sounds good. Regarding paging: yeah, the paging is all performed client-side (so it basically reads all entries and then breaks them up client-side). It uses http://ng-table.com/ for this. As you state, this isn't the most efficient when there are a large number of lines; adding server-side paging would be more efficient. However, the search (via NgTable) relies on the fact that all entries are returned to the Angular view. If the pagination was made server-side then the search would also need to operate that way, too. I can see that being quite tricky!
Yikes :( OK, I guess this is another issue we'll need to deal with. Essentially this means that you could really bring a browser to its knees with this kind of data. For example, the test log you are using in those tests is 60MB or so. This means that the request to get the log data would return 60MB worth of data, which would take a substantial amount of time, and then all of that data would be stored in the browser's memory. You could still do all of this server-side and still do it with a smaller amount of memory ... though it would chew a bit of CPU, because you'd have to perform a search + filter on every line read. But since it's paged data, the number of lines read would be limited (so long as the search yields some results). This would still be more effective for larger files than sending 100% of the data to the client.
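A sketch of that idea, again assuming the hypothetical streaming reader from earlier: the search filter runs per entry as the file is read, and reading still stops once a page of matches has been collected:

```csharp
// Sketch: server-side search + paging in one pass; the filter is applied to
// each entry as it is streamed, so enumeration still ends once a page of
// matches exists. Names are placeholders as before.
using System;
using System.Collections.Generic;
using System.Linq;

public static class LogSearch
{
    public static IList<string> SearchPage(string path, string term, int page, int pageSize)
    {
        return LogEntryReader.ReadEntries(path)
            .Where(e => e.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            .Skip((page - 1) * pageSize)
            .Take(pageSize)
            .ToList();
    }
}
```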
Yeah, in some ways it is horribly inefficient. Having said that, I've never actually seen real log files get anywhere near that size. Where I work we have over 30 Umbraco 7 sites in production and haven't encountered any that have caused problems. I even tried loading that 60MB file and it took a few seconds to load, but other than that it performed well (albeit on a powerful home PC with an i7 and 16GB RAM!). Chrome memory usage with that giant file was about the same as having Facebook, Twitter and Gmail open at once (not sure what that says... ;-)). There are also advantages in putting the load on the client: most devs have powerful machines, it removes the burden from the server, and it minimises HTTP requests. Once the data has loaded, it tends to be more responsive handled client-side. So there are pros and cons. I guess it comes down to whether users will regularly encounter really large log files - are there circumstances where this has happened to you? I also wonder, if big files are going to be common, whether the file format needs to be easier to parse (i.e. proper delimiters)? Maybe even go back to storing the data in the DB layer (async) if a lot of server-side filtering is required?
Reading files can be done MUCH more effectively; by no means should this load the entire log file into memory, split it into lines (= more strings) and then regex-parse each line. That consumes far more memory and CPU than necessary.
For example this:
https://github.com/DanDiplo/UmbracoTraceLogViewer/blob/master/Diplo.TraceLogViewer/Services/LogDataService.cs#L53
You can instead open a file stream and read each line one at a time until the end of the file; this way you are only loading one line at a time and performing regex on a single line.
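As a minimal sketch of the difference (the method names and bodies below are placeholders, not the project's code):

```csharp
// Minimal sketch of the difference; the parsing itself is elided because the
// point is only how the file is read.
using System.IO;

public static class LogReadingSketch
{
    // Eager: the whole file becomes one big string before any parsing starts.
    public static string ReadEverything(string path)
    {
        return File.ReadAllText(path);
    }

    // Streaming: File.ReadLines enumerates lazily, so only the current line
    // is held in memory while you parse it.
    public static int CountLines(string path)
    {
        int count = 0;
        foreach (var line in File.ReadLines(path))
        {
            count++; // parse/inspect the single line here instead of the whole file
        }
        return count;
    }
}
```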