files app: large number of files in directory issues #3801
Comments
i did some more measurements and am not sure what to think of them. the size of the json is not the real issue: e.g. 10k files result, even with this very verbose json, in only a 4.5MB download (it takes 10s to create and download the json). a bigger issue is that e.g. a firefox tab showing this then takes 700MB of ram and is slow to render. everything seems nice and linear except for my real problem: the ruby dashboard process on the ondemand server grows from 150MB idle to 180MB. still nothing to worry about, but there is no way to release this. if i then also open another folder with 30k files, the dashboard process jumps to 330MB, without releasing the memory. larger folders mean more memory usage, added on top and never released. adding files to a folder that was already opened increases the memory in proportion to the number of extra files (e.g. adding 1k files to a 30k-file folder). renaming a folder that was opened before triggers an increase as well, but not as much as opening it from scratch. @johrstrom wouldn't it make more sense to only return e.g. the first 1-2k files in the json, and then show some warning (and maybe provide some url arg |
Yea seems like we need to paginate very large directories. |
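A minimal sketch of the "return only the first 1-2k files plus a warning" idea, assuming a hypothetical helper `truncated_listing` and a hypothetical `limit` parameter (neither exists in OnDemand). It stops reading the directory once the limit is reached and reports whether anything was left out, so the client can show a warning:

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: collect at most `limit` entry names from `dir`
# and flag whether the directory contained more than that.
def truncated_listing(dir, limit: 2000)
  names = []
  truncated = false
  Dir.each_child(dir) do |name|
    if names.size >= limit
      # At least one more entry exists beyond the limit; stop reading.
      truncated = true
      break
    end
    names << name
  end
  { files: names.sort, truncated: truncated, limit: limit }
end
```

Note that `Dir.each_child` yields entries in filesystem order, so "the first N" is an arbitrary subset rather than the first N in sorted order; a sorted first page would still require reading every name.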
more debugging revealed the memory issue: generating the large files json causes ruby to create a lot of objects (i guess 50 to 80 per file). these are recycled by the ruby GC but not released from memory. calling |
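One way to check the objects-per-file estimate is to read the cumulative allocation counter from `GC.stat` before and after building the JSON. The sketch below is only an illustration: `fake_entry` is a made-up stand-in for the per-file hash, not the structure OnDemand actually builds, so the number it prints will differ from the 50 to 80 guessed above.

```ruby
require "json"

# Return how many Ruby objects were allocated while the block ran.
# :total_allocated_objects is a cumulative counter kept by the VM.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Made-up per-file structure, for illustration only.
def fake_entry(i)
  { name: "file_#{i}.txt", size: i, url: "/example/prefix/file_#{i}.txt" }
end

count = 1_000
allocated = allocations_during do
  entries = Array.new(count) { |i| fake_entry(i) }
  JSON.generate(entries)
end
puts "objects allocated per file: #{(allocated.to_f / count).round(1)}"
```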
@johrstrom reading a bit more on pagination in datatables, i assume there is no "easy" fix to this. from what i see
to work around the memory issue on the server side, i might have a solution, but it also needs some extra code: if we construct a single large text blob in e.g. csv format, and send that as the files data (instead of a list of dicts), ruby doesn't have to make a lot of separate objects. the csv format will also be more compact than the current json, so that is also a nice benefit. strangely enough you can't read csv text in datatables; but converting the csv to json in the browser is not that hard (searching online gave some not very long javascript examples). |
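A rough sketch of the single-text-blob idea, assuming a hypothetical `listing_blob` helper and a made-up column layout (size, mtime, name). It appends each record to one growing string instead of building a hash per file; whether this lowers memory use in the real dashboard would still have to be measured.

```ruby
require "tmpdir"

# Hypothetical helper: build one text blob with a line per directory entry.
# Assumed column layout (not from OnDemand): size,mtime_epoch,name
def listing_blob(dir)
  blob = +""                       # one mutable string for the whole listing
  Dir.each_child(dir) do |name|
    st = File.lstat(File.join(dir, name))
    blob << st.size.to_s << "," << st.mtime.to_i.to_s << "," << name << "\n"
  end
  blob
end
```

This does no quoting or escaping at all, so a file name containing a newline would corrupt the records; that case needs separate handling.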
@stdweird I'm following your debugging with interest, thanks for looking into this!
I think there's a risk of painting ourselves into a corner here with file names containing all sorts of characters. Since Linux file names can contain any bytes except the NUL byte ( I wanted to raise this concern, to be sure that you've considered it. For the record, files with invalid UTF-8 bytes in their names are currently being discarded by OOD on the server side. See PR #2626 so these are not shipped to the client side. |
@CSC-swesters i have not considered anything ;) to generate the csv format we should indeed be aware of unicode stuff, test it, and be careful. if we put the filename in the last column the splitting should be more robust (IF we can get rid of the urls that also have the name in them). to be clear, to avoid the overload of objects being created, we probably need to generate the string ourselves, not use some ruby gem for it. |
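The reason a last-column file name makes splitting more robust can be shown with a split that has a column limit: everything after the final expected separator stays in the name, commas included. `parse_record` and the three-column layout are assumptions for illustration, and the browser would do the equivalent in javascript.

```ruby
# Hypothetical parser for one record with the file name in the last column.
# A positive limit makes split stop early, so separators inside the name
# are left untouched.
def parse_record(line, columns: 3)
  # delete_suffix removes only the record terminator; chomp would also
  # strip a "\r" that belongs to the file name.
  line.delete_suffix("\n").split(",", columns)
end

parse_record("1024,1700000000,report,final,v2.csv\n")
# => ["1024", "1700000000", "report,final,v2.csv"]
```

A newline inside a file name still breaks line-based records, so this only addresses the column separator.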
At some point we'd likely rewrite this to use turbo_streams instead of json responses.
Isn't that the difference between |
@CSC-swesters wrt the utf8 issues, this might be the problem that the pun code runs with |
@johrstrom here is some GC.stat data. so it looks like 70 pages per file. the
so a lot of pages are being used and never released (or only at a very slow rate). and GC does not clean based on age, but based on iteration. the decrease rate is too slow anyway to trigger a meaningful decrease in RSS with series of if you wonder why, well every string is an object (so at least one page), so all values of the
there is no anyway, my analysis is probably not 100% correct ;), but for the purpose of building the json, looping over the files from the |
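For anyone wanting to collect similar numbers, here is a small sketch that snapshots a few `GC.stat` counters before allocating many strings, while they are alive, and after a forced GC. `heap_snapshot` is a hypothetical helper; the three keys it reads are standard `GC.stat` keys. The workload is synthetic and not the files JSON.

```ruby
# Hypothetical helper: pick a few heap counters out of GC.stat.
def heap_snapshot
  s = GC.stat
  {
    pages: s[:heap_allocated_pages], # heap pages currently allocated
    live:  s[:heap_live_slots],      # slots holding live objects
    free:  s[:heap_free_slots]       # slots available for reuse
  }
end

before = heap_snapshot
junk = Array.new(200_000) { |i| "string #{i}" } # synthetic workload
during = heap_snapshot
junk = nil
GC.start                                        # force a full collection
after = heap_snapshot

puts "before: #{before}"
puts "during: #{during}"
puts "after:  #{after}"
```

If the page count after the GC stays close to the "during" value while the free slots grow, that matches the behaviour described above: the slots are recycled inside the process, but the pages are not handed back.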
That's not really going to help. If you have bytes which cannot be decoded as valid text, changing the Slurm doesn't enforce UTF-8 in its job names, so the analogy to Linux file names is a good one, but the solution in #3644 is not the correct one. This here is the correct solution, which allows you to get some readable text representation out of a byte sequence with non-valid UTF-8 bytes in it. |
@CSC-swesters ah, i misread the "invalid utf-8" part. do these files have a valid encoding (at all), or are these just corrupt filenames? |
No, but they don't need to. Linux handles file names as bytes, so applications should do that too.
No, they're not corrupt, that's the thing I wanted to point out in my original comment 🙂 Just because we cannot read them, it doesn't mean that it isn't a valid file name for Linux, so it needs to be supported or at least not cause crashes and problems. |
we are deviating from the original issue, but if you don't use correct encoding, you will get garbage. so when ood converts the stat binary data to string or to json, it should do that correctly; and send the correct encoding with the json so the browser can also do something with it. i guess something goes wrong somewhere in that chain. maybe you can send the binary stat data straight away to client in some other protocol, but the original encoding needs to be known. |
Yes, sorry for that.
Either OOD would need to drop invalid UTF-8 entries on the server side, or we could base64 encode any problematic file names if we really want to send them to the client. This quickly becomes a user experience problem in that users won't know there's a file with an unprintable name if we simply drop it. And base64 is not very user-friendly, even if it's technically usable. There's a separate issue open for discussing the user interface-side of "special" files (perhaps ones with unprintable file names could be added there?) over in #3026 |
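A sketch of the base64 option, assuming a hypothetical `encode_name` helper and a made-up `encoding` field that tells the client how to read the name. Valid UTF-8 names pass through unchanged; anything else is shipped as base64 of the raw bytes, so no entry is silently dropped.

```ruby
# Hypothetical helper: make a file name safe to embed in JSON.
# `raw` is the name as bytes, exactly as the filesystem returned it.
def encode_name(raw)
  name = raw.dup.force_encoding(Encoding::UTF_8)
  if name.valid_encoding?
    { name: name, encoding: "utf-8" }
  else
    # pack("m0") is strict base64 (no line breaks) without needing a require.
    { name: [raw].pack("m0"), encoding: "base64" }
  end
end
```

The client would then need its own rule for displaying base64 entries, which is the user-experience question raised above.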
in python you can set a fallback mechanism for string encoding, but this is better handled by separate code so it's clear that the reported name is not really the correct one. |
Good find, it sounds like this could help with memory management 👍 I suppose the |
@CSC-swesters i am not an expert (i read here and there that chaining should "just" pass things around), but indeed the sort requires the whole sequence (although i am not sure it's relevant for the json output, as it re-sorts in the webclient i think). should be easy to try. |
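To illustrate the point about chaining versus sorting, here is a small sketch using Ruby's lazy enumerators. `each_entry_lazy` is a hypothetical helper, not OnDemand code: the chain only stats the entries that are actually consumed, while `sort_by` has to pull the whole sequence into an array first.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: a lazy chain over directory entries. Nothing is
# read or stat'ed until a consumer asks for elements.
def each_entry_lazy(dir)
  Dir.each_child(dir).lazy.map do |name|
    st = File.lstat(File.join(dir, name))
    { name: name, size: st.size }
  end
end

Dir.mktmpdir do |dir|
  %w[c b a].each { |n| FileUtils.touch(File.join(dir, n)) }

  first_two = each_entry_lazy(dir).first(2)            # stats only two entries
  sorted = each_entry_lazy(dir).sort_by { |e| e[:name] } # materializes all of them

  puts first_two.size
  puts sorted.map { |e| e[:name] }.inspect
end
```

So if the client re-sorts anyway, dropping the server-side sort is what would let the entries stream through one at a time.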
Using ondemand 3.1.7, we have 2 kinds of issues when using the files app:
The memory issue seems to be a "feature" of jbuilder caching the json structures somehow. The slowness and amount of memory result (imho) from a way too verbose json being generated.
In particular, for every file, 3 url strings are generated, but it would be better if they were generated on the browser end. Each url is the concatenation of a prefix (the directory part of the url) and the file name; the download url might have some fixed suffix as well.
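A sketch of what the lighter payload could look like, with the url prefixes sent once per directory and the per-file url rebuilt from prefix plus name. Everything here is hypothetical: the `/example/...` prefixes, the `?download=1` suffix and the helper names are placeholders, not OnDemand's real routes. In practice the rebuilding would happen in javascript in the browser; Ruby is used here only to show the arithmetic.

```ruby
require "json"
require "erb"

# Hypothetical payload: prefixes once per directory, then bare file names.
def light_listing(dir_path, names)
  {
    prefixes: {
      show:     "/example/show#{dir_path}/",     # placeholder route
      download: "/example/download#{dir_path}/", # placeholder route
      edit:     "/example/edit#{dir_path}/"      # placeholder route
    },
    download_suffix: "?download=1",              # placeholder suffix
    names: names
  }
end

# Rebuild one url from the shared prefix and the percent-encoded name.
def download_url(listing, name)
  listing[:prefixes][:download] + ERB::Util.url_encode(name) + listing[:download_suffix]
end

listing = light_listing("/home/user", ["a b.txt", "c.txt"])
puts JSON.generate(listing)
puts download_url(listing, "a b.txt")
```

The file name now appears once per entry instead of once plus three more times inside the urls.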
I think i can modify the jbuilder code to create a lighter json; but i am stuck with the javascript/templating stuff that happens on the browser side. In particular, in `_file_action_menu.html.erb`, the file data is somehow passed as `data`, but i don't know where that comes from and/or how it is evaluated (or generated): can we access the `files` javascript variables inside the `{{data.something}}` templates; or can we manipulate whatever `data` is coming from before the templating happens? (why is it not pure javascript, at least that might be more consistent and easier to read ;)

anyway, help welcome