
Resource busy error with new async runtime #903

Closed
malteneuss opened this issue Nov 14, 2021 · 16 comments · Fixed by #945

@malteneuss (Contributor)

After upgrading to the newest Hakyll version 4.15.1.0 with the async runtime, I get the following error:

[ERROR] .../store/8802798117733530795 for Hakyll.Core.Resource.Provider.MetadataCache/.../siteConfig.yaml/metadata: Store.set: resource busy (file is locked)

The file in question is a YAML file containing site-wide data such as the webpage title, and it is loaded from every page. No siteConfig.yaml.metadata file exists.

P.S. I can work around this issue by falling back to a single thread with the usual Cabal settings:

executable site
    ...
    ghc-options: -threaded -rtsopts -with-rtsopts=-N1
@Minoru (Collaborator) commented Nov 15, 2021

Hi! Is the source code for your site public? It'd be nice to get a reproducible example of some sort.

@Minoru added the bug label on Nov 15, 2021
@malteneuss (Contributor, Author) commented Nov 15, 2021

Hi Minoru, I can't share the full source because it contains some closed-source code, but I was able to reduce it to a minimal Stack project: https://github.com/malteneuss/hakyll-async-error
The relevant files are

-- Main.hs
import Hakyll
import Data.Aeson

main :: IO ()
main = do
  hakyll $ do
    match "content/siteConfig.yaml" $ compile (getResourceString :: Compiler (Item String))

    match "content/post/**" $ do
      route idRoute
      compile $ do
        siteConfig <- load "content/siteConfig.yaml" :: Compiler (Item String)
        siteConfigJson <- getMetadata (itemIdentifier siteConfig)
        -- decode Json and process...
        makeItem ("" :: String)

and siteConfig.yaml, which is set up like an empty .md file with only the front matter in it (technically it's invalid YAML and should be renamed to .md, but that doesn't change anything):

---
title: Some title
---

The post Markdown files are empty for this example, and the number of threads is >1, e.g.:

ghc-options: -threaded -rtsopts -with-rtsopts=-N4

The error occurs when running stack run site -- watch for the first time (or after stack run site -- clean) and when there is more than one post.

@Minoru (Collaborator) commented Nov 16, 2021

Excellent! Thank you very much, this should be enough for me to reproduce this. I should probably have time for this in a few days.

@Minoru (Collaborator) commented Nov 20, 2021

Thanks for an easily reproducible example! I just copied contents/post/1.md a few more times and ran the following loop:

$ while true ; do stack run -- clean && stack run -- build +RTS -N3 || break ; done

This usually fails on the second iteration :)

So, getMetadata calls into Hakyll.Core.Compiler.Internal.compilerGetMetadata, which calls Hakyll.Core.Provider.MetadataCache.resourceMetadata, which checks the "store" (a cache); if the metadata is not there, it reads it from the file itself (e.g. contents/post/1.md) and calls Hakyll.Core.Store.set to put it into the cache. Here's the code:

set :: (Binary a, Typeable a) => Store -> [String] -> a -> IO ()
set store identifier value = withStore store "set" (\key path -> do
    encodeFile path value
    cacheInsert store key value
  ) identifier

The problem here is encodeFile, which writes the metadata into a file in the cache. It would be silly for multiple threads to write into the same file simultaneously, so GHC puts a lock on it; when multiple threads attempt to write, one of them gets the "resource busy (file is locked)" error. So far, so good.
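The lock can be demonstrated without Hakyll at all. Here is a minimal sketch (demoLock is a made-up name): while one handle to a file is open for writing, a second encodeFile on the same path fails with exactly this error, because GHC's runtime enforces a single-writer lock per file.

```haskell
import Data.Binary (encodeFile)
import System.IO (IOMode (WriteMode), hClose, openFile)
import System.IO.Error (isAlreadyInUseError, tryIOError)

-- One handle holds GHC's writer lock; a second writer on the same
-- path then fails with "resource busy (file is locked)".
demoLock :: FilePath -> IO Bool
demoLock path = do
  h <- openFile path WriteMode                   -- first writer takes the lock
  r <- tryIOError (encodeFile path (42 :: Int))  -- second writer collides
  hClose h
  return $ case r of
    Left e  -> isAlreadyInUseError e
    Right _ -> False
```

In the real bug the two writers are separate RTS threads rather than one thread holding a handle open, but the failing syscall path is the same.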

We could protect all I/O in Store with an rwlock, but that's quite coarse, and would probably serialize the entire program. A better approach would be to make those writes atomic; we could use the usual "write to temp file and rename" trick, but apparently Windows can't rename files atomically. We also can't treat the Store as write-only, e.g. ignore these errors if a file already exists.
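For reference, the temp-file trick mentioned above looks roughly like this (atomicWriteFile is a hypothetical helper; a real one would pick a unique temp name and consider fsync):

```haskell
import qualified Data.ByteString.Lazy as BL
import System.Directory (renameFile)
import System.FilePath ((<.>))

-- Write to a sibling temp file, then rename over the destination.
-- rename(2) within one filesystem is atomic on POSIX, so readers see
-- either the old contents or the new, never a partial write. (As noted
-- above, Windows does not give the same guarantee.)
atomicWriteFile :: FilePath -> BL.ByteString -> IO ()
atomicWriteFile path contents = do
  let tmp = path <.> "tmp"   -- a real helper would use a unique name
  BL.writeFile tmp contents
  renameFile tmp path
```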

Any ideas?

/cc @LaurentRDC

@LaurentRDC (Collaborator)

I'll think about it, but knee-jerk reaction: how do we feel about keeping the store in-memory and only write once at the end of the run?

@jtojnar (Contributor) commented Nov 21, 2021

Unfortunately, the workaround from OP does not work for me. I also tried passing the flag via CLI flags (site rebuild +RTS -N1 with ghc-options: -threaded -rtsopts) but can still reproduce the issue – and really, -N1 should be the default value if I am reading the docs correctly.

For now, I am disabling -threaded.

@LaurentRDC (Collaborator) commented Nov 21, 2021

I ran @Minoru's script but cannot reproduce the bug on Windows. I tried:

  • threaded with N=3 and N=8;
  • not threaded;

I looped for several minutes without issues.

Edit: Oops, I needed to create multiple posts. Can reproduce now.

@Minoru (Collaborator) commented Nov 21, 2021

@LaurentRDC Not sure how I feel about that option. The cache only stores metadata and snapshots; I wonder how large this can get?

@gwern, I know you have a large Hakyll site — can you please tell us how large is your _cache directory?

Also, the cache could be disabled. I searched GitHub for "inMemoryCache language:haskell", found 20 Hakyll sites, one of them disabled the cache, another one did it conditionally, the other 18 enabled the cache. The current default is "enabled". So it's viable to drop the setting, but I'd prefer to avoid it.

I ran @Minoru 's script but cannot reproduce the bug on Windows. I tried:

Did you also make sure to create additional posts in contents/posts/? How many?

I'm on Linux by the way. I'd be surprised if this bug is platform-dependent!

@jtojnar Looking at the code you've shared on IRC yesterday (https://github.com/jtojnar/hakyll/blob/72982e1a64f1b5a5a3da9621aad919425124d3ec/data/example/site.hs#L78-L84), I don't yet see why -N1 wouldn't help in your case.

@LaurentRDC (Collaborator) commented Nov 21, 2021

I can also reproduce @jtojnar 's issue, even without -threaded. Seems like this isn't strictly related to the async runtime.

I'm wondering whether this is a problem with lazy IO; the binary package doesn't offer strict deserialization. I'm looking into it now.

@gwern (Contributor) commented Nov 21, 2021

@gwern, I know you have a large Hakyll site — can you please tell us how large is your _cache directory?

Because the compile time is about an hour now, and because of the symlinking, it's a bit difficult to say what size it is. (Should that include all the symlinked static files/documents?) I can say the fully built and remotely synced directory is 53GB, inclusive of all files. I'm currently on Ubuntu Linux with hakyll-4.13.4.1, so I would not have observed any new issues in 4.15.x.

@LaurentRDC (Collaborator)

I've tried to enforce strictness as much as possible using BangPatterns, with no luck.

I'm wondering if this is due to the very small posts that are being processed. What is the timescale for file locks to lock/unlock? This reminds me of #783, where tests were running so fast that the filesystem couldn't keep up. The solution then was to add a retry mechanism.

With this in mind, I've added a retry to withStore and the test passes, both with and without -threaded.

from:

-- Hakyll\Core\Store.hs
withStore :: Store -> String -> (String -> FilePath -> IO a) -> [String] -> IO a
withStore store loc run identifier = modifyIOError handle $ run key path
  where
    key = hash identifier
    path = storeDirectory store </> key
    handle e = e `ioeSetFileName` (path ++ " for " ++ intercalate "/" identifier)
                 `ioeSetLocation` ("Store." ++ loc)

to:

-- Hakyll\Core\Store.hs
import           Hakyll.Core.Util.File (retryWithDelay)

withStore :: Store -> String -> (String -> FilePath -> IO a) -> [String] -> IO a
withStore store loc run identifier = modifyIOError handle $ retryWithDelay 5 $ run key path
  where
    key = hash identifier
    path = storeDirectory store </> key
    handle e = e `ioeSetFileName` (path ++ " for " ++ intercalate "/" identifier)
                 `ioeSetLocation` ("Store." ++ loc)

which adds a delay of 0.1 ms when an exception is thrown, and retries at most 5 times.

This isn't great, but I don't understand what else could cause this problem.
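For context, a retry helper of this sort can be sketched as follows (the real retryWithDelay lives in Hakyll.Core.Util.File; this standalone version is only an approximation of the behavior described above):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (threadDelay)
import Control.Exception (IOException, throwIO, try)

-- Run an IO action, retrying up to n more times on IO errors,
-- sleeping 0.1 ms between attempts.
retryWithDelay :: Int -> IO a -> IO a
retryWithDelay n action = do
  r <- try action
  case r of
    Right x -> return x
    Left (e :: IOException)
      | n <= 0    -> throwIO e            -- retries exhausted: rethrow
      | otherwise -> do
          threadDelay 100                 -- 100 µs = 0.1 ms
          retryWithDelay (n - 1) action
```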

@malteneuss (Contributor, Author) commented Nov 21, 2021

I'm wondering if this due to the very small posts that are being processed.

My actual posts files are all 10-20 KB so quite small, yes.

With this in mind, I've added a retry to withStore and the test passes, both with and without -threaded.

This looks like a good quick fix for now.

By the way, I added a second post to the error project (it somehow wasn't committed before).

@Minoru (Collaborator) commented Nov 22, 2021

@gwern I think 53G is for _site, but what I'm interested in is _cache (or whatever name you specified via Configuration.storeDirectory).

@LaurentRDC

What is the timescale for file locks to lock/unlock?

I didn't look at the code, but I expect that it's just a mutex in RTS. If that's the case, it would take from hundreds of nanoseconds to small digits of microseconds. I'm pretty sure this would be dwarfed by opening and closing the file (even if we wrote nothing to it), because that would involve a switch into the kernel.

retryWithDelay is an interesting idea, but the number of retries caps the number of RTS threads we can have. Imagine a machine with N cores, N RTS threads, and N tasks that all try to access the same metadata file. One of the tasks will get lucky and succeed; the other N-1 tasks will retry. They would do it at the same time, so one of them will succeed and N-2 will retry. It's clear that if we allow X retries, then N can't be more than X+1, otherwise there would be "leftover" tasks which would conflict and cause an error. Ways to improve: 1) only retry on "resource busy" errors, bubble up the rest; 2) exponential backoff; 3) do not limit the number of retries.
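Those three improvements could be combined in one sketch (retryBusy is a hypothetical name, and the 0.1 ms starting delay is an arbitrary choice):

```haskell
import Control.Concurrent (threadDelay)
import System.IO.Error (isAlreadyInUseError, tryIOError)

-- Retry only on "resource busy" errors, with exponential backoff
-- and no cap on the number of retries.
retryBusy :: IO a -> IO a
retryBusy = go 100  -- start at 0.1 ms
  where
    go delay action = do
      r <- tryIOError action
      case r of
        Right x -> return x
        Left e
          | isAlreadyInUseError e -> do
              threadDelay delay
              go (delay * 2) action  -- double the delay each attempt
          | otherwise -> ioError e   -- bubble up everything else
```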

All these retries are wasteful, so I tried a slightly different approach:

diff --git a/lib/Hakyll/Core/Store.hs b/lib/Hakyll/Core/Store.hs
index da16c6f..0c07e36 100644
--- a/lib/Hakyll/Core/Store.hs
+++ b/lib/Hakyll/Core/Store.hs
@@ -17,17 +17,19 @@ module Hakyll.Core.Store
 
 --------------------------------------------------------------------------------
 import qualified Data.Hashable        as DH
-import           Data.Binary          (Binary, decode, encodeFile)
+import           Data.Binary          (Binary, decode, encode)
 import qualified Data.ByteString.Lazy as BL
 import qualified Data.Cache.LRU.IO    as Lru
 import           Data.List            (intercalate)
 import           Data.Maybe           (isJust)
 import           Data.Typeable        (TypeRep, Typeable, cast, typeOf)
+import           Control.Monad        (when)
 import           System.Directory     (createDirectoryIfMissing, doesFileExist, removeFile)
 import           System.FilePath      ((</>))
 import           System.IO            (IOMode (..), hClose, openFile)
 import           System.IO.Error      (catchIOError, ioeSetFileName,
-                                       ioeSetLocation, modifyIOError)
+                                       ioeSetLocation, modifyIOError,
+                                       tryIOError, isAlreadyInUseError)
 
 
 --------------------------------------------------------------------------------
@@ -127,33 +129,48 @@ cacheDelete (Store _ (Just lru)) key = do
 
 --------------------------------------------------------------------------------
 -- | Store an item
-set :: (Binary a, Typeable a) => Store -> [String] -> a -> IO ()
-set store identifier value = withStore store "set" (\key path -> do
-    encodeFile path value
-    cacheInsert store key value
-  ) identifier
+set :: forall a. (Binary a, Typeable a) => Store -> [String] -> a -> IO ()
+set store identifier value =
+    withStore store "set" (\key path -> do
+        let encoded_value = encode value
+        e1 <- tryIOError (trySet key path encoded_value)
+        case e1 of
+            Left e2 | isAlreadyInUseError e2 -> do
+                cached :: Result a <- cacheLookup store key
+                let encoded_cached = fmap encode $ toMaybe cached
+                when (encoded_cached /= Just encoded_value) $
+                    trySet key path encoded_value
+            Left e2 -> ioError e2
+            Right result -> return result
+    ) identifier
+  where
+    trySet key path encoded_value = do
+        cacheInsert store key value
+        BL.writeFile path encoded_value
 
 
 --------------------------------------------------------------------------------
 -- | Load an item
 get :: (Binary a, Typeable a) => Store -> [String] -> IO (Result a)
-get store = withStore store "get" $ \key path -> do
-    -- First check the in-memory map
-    ref <- cacheLookup store key
-    case ref of
-        -- Not found in the map, try the filesystem
-        NotFound -> do
-            exists <- doesFileExist path
-            if not exists
-                -- Not found in the filesystem either
-                then return NotFound
-                -- Found in the filesystem
-                else do
-                    v <- decodeClose path
-                    cacheInsert store key v
-                    return $ Found v
-        -- Found in the in-memory map (or wrong type), just return
-        s -> return s
+get store idenfifier =
+    retryIfAlreadyInUse $ withStore store "get" (\key path -> do
+        -- First check the in-memory map
+        ref <- cacheLookup store key
+        case ref of
+            -- Not found in the map, try the filesystem
+            NotFound -> do
+                exists <- doesFileExist path
+                if not exists
+                    -- Not found in the filesystem either
+                    then return NotFound
+                    -- Found in the filesystem
+                    else do
+                        v <- decodeClose path
+                        cacheInsert store key v
+                        return $ Found v
+            -- Found in the in-memory map (or wrong type), just return
+            s -> return s
+        ) idenfifier
   where
     -- 'decodeFile' from Data.Binary which closes the file ASAP
     decodeClose path = do
@@ -162,6 +179,14 @@ get store = withStore store "get" $ \key path -> do
         BL.length lbs `seq` hClose h
         return $ decode lbs
 
+    retryIfAlreadyInUse :: IO a -> IO a
+    retryIfAlreadyInUse action = do
+        e1 <- tryIOError action
+        case e1 of
+            Left e2 | isAlreadyInUseError e2 -> action
+            Left e2 -> ioError e2
+            Right result -> return result
+
 
 --------------------------------------------------------------------------------
 -- | Strict function

Note that I put the new entry into LRU before I write it to disk. Anyone who gets a "resource busy" error retries starting with LRU. If they intended to write the same value as is already in LRU, then they consider the job done; otherwise they try to write one more time, and fail then.

This fixes the code from OP, but would fail if there were two cache entries that cause conflicts. That's the same problem as in your solution; we have to allow an infinite number of retries.

I also tried @jtojnar's code, and I do get errors, but never with Store.set:

[ERROR] _cache/-2047741568731054182 for Hakyll.Core.Resource.Provider.MetadataCache/posts/metadata.yaml/metadata: Store.get: resource busy (file is locked)

and

  [ERROR] Data.Binary.Get.runGet at position 0: not enough bytes
    CallStack (from HasCallStack):
      error, called at libraries/binary/src/Data/Binary/Get.hs:351:5 in binary-0.8.8.0:Data.Binary.Get

I haven't dug into them yet, but the core problem is the same: Store doesn't support concurrent access.

I'll try rwlock sometime this week, and report back.
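(For reference, a coarse version of the locking idea mentioned earlier in the thread, serializing every Store write through a single lock, could be as small as this. All names are hypothetical, and this is a plain mutex rather than a true rwlock, so concurrent reads would be serialized too:)

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- One lock per Store; every write section runs under it, so concurrent
-- writers queue up instead of colliding on the per-file handle lock.
newtype StoreLock = StoreLock (MVar ())

newStoreLock :: IO StoreLock
newStoreLock = StoreLock <$> newMVar ()

withStoreLock :: StoreLock -> IO a -> IO a
withStoreLock (StoreLock m) action = withMVar m (\_ -> action)
```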

@gwern (Contributor) commented Nov 25, 2021

@gwern, I know you have a large Hakyll site — can you please tell us how large is your _cache directory?

601MB.

@rqy2002 commented Feb 28, 2023

I have this bug again. Using Hakyll 4.15.1.1, I get

[ERROR] _cache\1358274680180442700 for Hakyll.Core.Resource.Provider.MetadataCache/posts\2022-10-30-test.md/metadata: Store.set: resource busy (file is locked)

every time I change the content of 2022-10-30-test.md and run stack run -- build.
But the bug doesn't appear when I rebuild all pages.
It doesn't depend on whether I use -threaded or not; I get the same bug in both situations.

This bug occurs when I want to produce two versions of a file, one page and one raw, as in the tutorial.

This is part of my site.hs:

  match "posts/*" $ version "raw" $ do
    route idRoute
    compile getResourceBody

  match "posts/*.md" $ do
    route $ setExtension "html"
    compile $ pandocCompiler'
      >>= loadAndApplyTemplate "templates/post.html" postCtx
      >>= loadAndApplyTemplate "templates/default.html" postCtx
      -- >>= relativizeUrls

(where pandocCompiler' is pandoc with some configuration).
If I remove the raw version, then the bug doesn't occur.

@Minoru (Collaborator) commented Apr 27, 2023

This is fixed in 4.16.0.0.
