[BUG] Memory growth when using PyGWalker with Streamlit #618

Open
ChrnyaevEK opened this issue Sep 13, 2024 · 3 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), P1 (will be fixed in next version)

Comments

ChrnyaevEK commented Sep 13, 2024

Describe the bug
I observe RAM growth when using PyGWalker with the Streamlit framework. RAM usage grows constantly on page reload (on every app run). When using Streamlit without PyGWalker, RAM usage remains constant (flat, does not grow). It seems like memory is never released; this was observed indirectly (we tracked growth locally, see the reproduction below, but we also observe the same issue in an Azure web app, where RAM usage never declines).

To Reproduce
We tracked down the issue with an isolated Streamlit app using PyGWalker and memory_profiler (run with python -m streamlit run app.py):

# app.py
import numpy as np
np.random.seed(seed=1)
import pandas as pd
from memory_profiler import profile
from pygwalker.api.streamlit import StreamlitRenderer

@profile
def app():
    # Create random dataframe
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    render = StreamlitRenderer(df)
    render.explorer()
app()

Observed output for a few consecutive reloads from the browser (press R to rerun):

Line #    Mem usage    Increment  Occurrences   Line Contents
    13    302.6 MiB     23.3 MiB           1       render.explorer()
    13    315.4 MiB     23.3 MiB           1       render.explorer()
    13    325.8 MiB     23.3 MiB           1       render.explorer()
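To locate where the growth accumulates, a tracemalloc comparison across reruns could serve as a cross-check independent of memory_profiler. The snippet below is only a sketch I have not run (standard-library tracemalloc; the top-5 statistics output is illustrative):

# Hypothetical cross-check: compare tracemalloc snapshots across reruns to see
# which call sites accumulate allocations.
import tracemalloc

import numpy as np
import pandas as pd
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

# Start tracing once per process and keep a per-session baseline snapshot.
if not tracemalloc.is_tracing():
    tracemalloc.start()
if "baseline_snapshot" not in st.session_state:
    st.session_state["baseline_snapshot"] = tracemalloc.take_snapshot()

def app():
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    StreamlitRenderer(df).explorer()

app()

# Show the five call sites that grew most since the baseline; rerun the page a
# few times and watch whether the same lines keep growing.
current = tracemalloc.take_snapshot()
for stat in current.compare_to(st.session_state["baseline_snapshot"], "lineno")[:5]:
    st.text(str(stat))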

Expected behavior
RAM usage should remain at a constant level between app reruns.

Screenshots
On the screenshot you can observe user activity peaks (causing CPU usage) and growing RAM usage (memory working set).
Metrics from Azure

On this screenshot the memory profile of the debug app is displayed.
Debug app memory profile

Versions
streamlit 1.38.0
pygwalker 0.4.9.3
memory_profiler (latest)
python 3.9.10
browser: chrome 128.0.6613.138 (Official Build) (64-bit)
Tested locally on Windows 11

Thanks for the support!

@ChrnyaevEK ChrnyaevEK added the bug Something isn't working label Sep 13, 2024
@longxiaofei longxiaofei self-assigned this Sep 14, 2024
ChrnyaevEK (Author)

Update

It seems like I may have misinterpreted my observations. I continued to track the production app and did some more testing, and the results point away from PyGWalker as the cause I originally suspected (potentially towards the Azure web app or other issues in our production code). I will do local tests with memory_profiler to see how it behaves over time, to rule out this observation as well.

I'm sorry for the disturbance; I will continue debugging as new evidence comes in.

Production app observations

A health endpoint has been added to our production version, and we now observe strange memory behaviour even without opening the PyGWalker explorer (PyGWalker was still imported as a package). The health check opens an empty Streamlit page every 5 minutes, and over the last 24 hours RAM usage has been gradually growing (on the image you can observe used memory getting close to 500 MB without spikes, at a constant increase rate correlated with the health calls).

RAM usage in production

Sample app deployment

I also tested a sample app deployment on Azure to exclude Azure resource virtualization issues, but the results did not confirm the original hypothesis.

Without PyGWalker

Sample app without PyGWalker on Azure

# app.py
import numpy as np
np.random.seed(seed=1)
import pandas as pd
import streamlit as st

def app():
    # Create random dataframe
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
    st.table(df)
app()

With PyGWalker

The sample app with PyGWalker was also deployed to Azure (it has been running for a few hours now). However, it behaves as expected and releases memory when objects are destroyed, which makes me think that the problem with our production version lies somewhere else.

Sample app with PyGWalker on Azure

import numpy as np
np.random.seed(seed=1)
import pandas as pd
from pygwalker.api.streamlit import StreamlitRenderer

def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    render = StreamlitRenderer(df)
    render.explorer()
app()

longxiaofei (Contributor)

Hi @ChrnyaevEK, thanks for your feedback.

Using the latest pygwalker version and caching the StreamlitRenderer may avoid the memory growth.

from pygwalker.api.streamlit import StreamlitRenderer
import pandas as pd
import streamlit as st

@st.cache_resource
def get_pyg_renderer() -> "StreamlitRenderer":
    df = pd.read_csv("xxx")
    return StreamlitRenderer(df)

renderer = get_pyg_renderer()

renderer.explorer()
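If the cache cannot be keyed meaningfully, one possible mitigation (just a sketch, not an official pygwalker recommendation) is to bound the cache with ttl / max_entries and clear it explicitly via Streamlit's built-in .clear() on the cached function:

# Sketch of a bounded cache; the "Release cached renderers" button is illustrative only.
from pygwalker.api.streamlit import StreamlitRenderer
import pandas as pd
import streamlit as st

@st.cache_resource(ttl=3600, max_entries=5)  # evict entries after 1 hour or beyond 5 entries
def get_pyg_renderer(path: str) -> "StreamlitRenderer":
    df = pd.read_csv(path)
    return StreamlitRenderer(df)

renderer = get_pyg_renderer("xxx")  # "xxx" is the placeholder path from the snippet above

# Cached functions expose .clear(), which drops every cached entry on demand.
if st.button("Release cached renderers"):
    get_pyg_renderer.clear()

renderer.explorer()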

There are several reasons why pygwalker memory grows:

  1. StreamlitRenderer(df) parses the dataframe and infers the data types.
  2. render.explorer() renders the UI using an HTML iframe (version 0.4.9.8 uses a Streamlit custom component to render the pygwalker UI, which optimizes this part of the memory overhead).
  3. For data-calculation communication, the calculated data needs to complete HTTP communication through a customized tornado endpoint (this will also be optimized in future versions).

In the coming period, pygwalker will optimize the user experience of the Streamlit component. Thank you again for your feedback.

@longxiaofei longxiaofei added the good first issue Good for newcomers label Sep 18, 2024
ChrnyaevEK (Author)

Hi @longxiaofei! Thanks for your attention.

Caching

I'm afraid that caching is not an option in this case; our data changes with every request, so the cached function would have to look more like this:

@st.cache_resource
def get_pyg_renderer(key: str) -> "StreamlitRenderer":
    df = pd.read_csv(key)
    ...

which is basically equivalent to no cache at all. ttl and max_entries will not help either.

I did, however, test this approach, and I'm still facing the same strange behavior.

import numpy as np
import pandas as pd

import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

@st.cache_resource(max_entries=3, ttl=20)
def get_render(key: int):
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )

    return StreamlitRenderer(df)

def app():
    render = get_render(np.random.randint(1, 100))
    render.explorer()

app()

Running this app locally (Windows, as described in the first message, with pygwalker 0.4.9.3, as that is our production version) results in constantly growing memory (it seems to occasionally release an insignificant amount of memory, but it does not return to the initial values).
RAM used by the Python process running the Streamlit server with the cached pygwalker renderer
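A variant I have not benchmarked (just a sketch using st.session_state, which is released when the user session ends) would keep one renderer per session instead of in the process-wide cache_resource store:

# Hypothetical per-session variant: keep one renderer per user session in
# st.session_state instead of the process-wide cache_resource store.
import numpy as np
import pandas as pd
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

def get_render() -> StreamlitRenderer:
    if "pyg_renderer" not in st.session_state:
        df = pd.DataFrame(
            np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
        )
        st.session_state["pyg_renderer"] = StreamlitRenderer(df)
    return st.session_state["pyg_renderer"]

def app():
    get_render().explorer()

app()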

Other local tests

I also tested a few other code snippets locally to confirm that memory will eventually be released, but it seems that it is not.

Bare Streamlit

Code

import numpy as np
import pandas as pd
import streamlit as st

def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    st.dataframe(df)
app()

Debug sequence

streamlit server start (python -m streamlit run ...) - 12:25 (memory increase due to initial object initialization)
restart (R) - 12:27 (memory increased)
restart (R) - 12:28 (memory increased)
restart (R) - 12:29 (memory increased)
restart (R) - 12:30 (memory increased)
restart (R) - 12:31 (memory did not react)
page close - 12:32 (memory decreased, but not to initial level)
stop - 12:58 (before stopping, a few slight memory decreases were observed without any external trigger)
Total test time: ~30min

Graph

See attached PDF
debug.pdf

Streamlit with PyGWalker

Code

import numpy as np
import pandas as pd
from pygwalker.api.streamlit import StreamlitRenderer

def app():
    df = pd.DataFrame(
        np.random.randint(0, 1000, size=(100000, 4)), columns=list("ABCD")
    )
    render = StreamlitRenderer(df)
    render.explorer()
app()

Debug sequence

start - 13:09
restart - 13:11 (significant memory increase)
restart - 13:12 (memory increase)
restart - 13:13 (memory increase)
restart - 13:14 (memory increase)
restart - 13:15 (memory increase)
page close - 13:16 (memory decrease, not to initial values)
stop - 13:40 (no memory decrease observed)

Graph

See attached PDF
debug.pdf, same as above

Conclusions up to the moment

Apps with and without PyGWalker both hold on to memory. PyGWalker allocates memory on every rerun; bare Streamlit seems to eventually saturate (it may simply not allocate a noticeable amount of memory).

There is no issue opening multiple Streamlit apps without PyGWalker, but as soon as PyGWalker is used we run out of memory (even with the cache). This seems to be confirmed both locally and on Azure.

I still suspect some issue with PyGWalker on Streamlit (maybe PyGWalker just misuses Streamlit's caching mechanisms). Can you please check for steady memory growth when running a minimal PyGWalker app locally?
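One quick check I have not run yet (just a sketch based on the standard gc module, nothing specific to PyGWalker internals) would be to count live StreamlitRenderer instances after each rerun; if the count keeps climbing, old renderers are never garbage-collected:

import gc

import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

# Force a collection so only genuinely reachable objects are counted, then
# count live StreamlitRenderer instances; rerun the page and watch the number.
gc.collect()
live = sum(1 for obj in gc.get_objects() if isinstance(obj, StreamlitRenderer))
st.text(f"Live StreamlitRenderer instances: {live}")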

Thanks!

@longxiaofei longxiaofei added the P1 will be fixed in next version label Sep 29, 2024