Best ways to process massive log files with DCGs #1573
-
Hi, I'm writing an experimental parser for very large log files to pull out salient facts using DCGs. What is the right way for achieving this that is space and time efficient? As a trivial example, if I wanted to determine how many lines were present in a log file (either with \r\n or \n) a naive implementation might look like this:
Then I can call the following to get an answer from C.
I have found with the latest scryer on master, for a 70 meg log file (which is not considered large) this can grow the heap to many gigs and it will take quite some time before completing (but it does). Given the recursive formulation above, I can understand why. I noticed with a similar program for SWI, I get "ERROR: Out of global stack" very quickly. However if I put in a cut (!) after newline to indicate I am not interested in any other solutions (fine for this use-case), SWI completes quickly and with constant space, I assume because it has optimised this. I don't see the same effect with scryer however. I was hoping with DCGs there would be a nice way to express this and still realise good space/time efficiency. In my real-life example, the log format I am parsing is not trivial and each entry may involve several lines. Is there an alternative way to handling this sort of parsing problem that will be more space/time efficient? Or are cuts with "optimisation" the only way? |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 4 replies
-
Whenever you want to process large files start with small ones. So small, that their content fits conveniently into a single line. For this, just use
So your non-terminal In your definition you used the non-terminal And did I say all numbers that are smaller? But zero is missing!
Why? Let's reformulate this to ask: What does the content of files look like that have zero lines?
So we find only this one, and by using the goal length we ensure that all possible lists are considered:
So this version now just finds exactly one answer for a given content. And, it also enumerates all possible contents. But the Meanwhile you might be tempted to introduce a cut, but this will at the same time also disable general questions about all possible outcomes. That is, you would make this program incorrect for certain uses. Many programmers prefer this way. Some even use To better understand the problem in
This is all fine except for the
All this complex code just to avoid leftover choicepoints in those two cases above! Will this solve your problem? Only when there will be garbage collection in Scryer. Another more ad hoc way to this problem is to use call_semidet/1. (There is still one leftover choicepoint for |
Beta Was this translation helpful? Give feedback.
Whenever you want to process large files start with small ones. So small, that their content fits conveniently into a single line. For this, just use
phrase/2
alone:So your non-terminal
log_file//2
is ambiguous. Not only does it determine the number of newlines in a string but also all numbers smaller than this number.In your definition you used the non-terminal
... //0
which means any sequence, starting with the empty sequence and including also sequences with newlines. You need to reformulate your query such that you get exactly one solution a…