The following code should reproduce the memory problem.
In the server, two lmrs are used to read data.
The first read succeeds. When the first lmr is dropped, jemalloc's dalloc is triggered (if the lmr is fairly large), and EXTENT_TOKEN_MAP removes the raw_mr item.
However, when the second lmr is created, the lookup_raw_mr function in mr_allocator.rs returns an error. There are actually three error situations in this function, and I have seen all of them (still wondering why...).
Some details about my system settings:
- I tried running this with sudo; it still failed.
- I have set my user ulimit to unlimited.
- I use softiwarp on Ubuntu 20.04, but I think the bug is only related to mr_allocator.
use async_rdma::{LocalMrWriteAccess, RdmaBuilder};
use portpicker::pick_unused_port;
use std::{
    alloc::Layout,
    io::{self, Write},
    net::{Ipv4Addr, SocketAddrV4},
    time::Duration,
};

const SIZE: usize = 44444444;

async fn client(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().connect(addr).await?;
    let data = vec![0u8; SIZE];
    // first send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;
    // second send
    let layout = Layout::from_size_align(SIZE, 1).unwrap();
    let mut lmr = rdma.alloc_local_mr(layout)?;
    lmr.as_mut_slice().write(&data)?;
    rdma.send_local_mr(lmr).await?;
    // wait for server to read, otherwise this client will early exit
    tokio::time::sleep(Duration::from_secs(5)).await;
    Ok(())
}

#[tokio::main]
async fn server(addr: SocketAddrV4) -> io::Result<()> {
    let rdma = RdmaBuilder::default().listen(addr).await?;
    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }
    // lmr will drop here
    {
        let layout = Layout::from_size_align(SIZE, 1).unwrap();
        println!("layout: {:?}", layout);
        // the memory bug occurs here
        let mut lmr = rdma.alloc_local_mr(layout)?;
        println!("lmr: {:?}", lmr);
        let rmr = rdma.receive_remote_mr().await?;
        rdma.read(&mut lmr, &rmr).await?;
        println!("rdma read\n-------------");
    }
    Ok(())
}

#[tokio::main]
async fn main() {
    let addr = SocketAddrV4::new(Ipv4Addr::new(127, 0, 0, 1), pick_unused_port().unwrap());
    std::thread::spawn(move || server(addr));
    tokio::time::sleep(Duration::from_secs(1)).await;
    client(addr)
        .await
        .map_err(|err| println!("{}", err))
        .unwrap();
}
Hi @fishiu, thanks for your feedback.
The three errors you mentioned may all be related to jemalloc's retain option.
You can try setting retain to false to fix the OOM or "can not find raw mr" errors, as described in #110. Your test passes for me that way.
Actually, it should be false by default on Linux, but according to the feedback, that does not seem to be the case.
The arena_id assertion failure may occur because the client and server share the same address space during testing: lazy_static gives both of them the same EXTENT_TOKEN_MAP, and combined with jemalloc's retain, the lookup reads the address from the wrong arena.
Further analysis is required, and the default settings and error messages for this case need to be improved.
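To make the suspected failure mode easier to discuss, here is a minimal toy model of it. It is only a sketch and not the actual async-rdma code: `RawMr`, `LookupError`, `ExtentMap`, its methods, and `simulate_two_allocs` are all hypothetical names, and the key assumption (that with retain enabled an address range can be handed out again without the alloc hook re-registering it) is a guess at the mechanism, not a confirmed diagnosis.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a registered memory region (not the async-rdma type).
#[derive(Debug, Clone, PartialEq)]
struct RawMr {
    addr: usize,
    len: usize,
}

/// Hypothetical lookup failures; the real `lookup_raw_mr` may report different ones.
#[derive(Debug, PartialEq)]
enum LookupError {
    NotFound,
    OutOfRange,
}

/// Toy model of a shared map from extent start address to its registration.
#[derive(Default)]
struct ExtentMap {
    inner: BTreeMap<usize, RawMr>,
}

impl ExtentMap {
    /// Called when the allocator asks for a fresh extent: register it.
    fn on_extent_alloc(&mut self, addr: usize, len: usize) {
        self.inner.insert(addr, RawMr { addr, len });
    }

    /// Called when the allocator gives an extent back: forget the registration.
    fn on_extent_dalloc(&mut self, addr: usize) {
        self.inner.remove(&addr);
    }

    /// Find the registration covering `[addr, addr + len)`.
    fn lookup(&self, addr: usize, len: usize) -> Result<RawMr, LookupError> {
        // Closest extent that starts at or before `addr`.
        let (_, mr) = self
            .inner
            .range(..=addr)
            .next_back()
            .ok_or(LookupError::NotFound)?;
        if addr + len > mr.addr + mr.len {
            return Err(LookupError::OutOfRange);
        }
        Ok(mr.clone())
    }
}

/// Replays the two server-side allocations from the report against the toy map.
fn simulate_two_allocs(
    size: usize,
) -> (Result<RawMr, LookupError>, Result<RawMr, LookupError>) {
    let mut map = ExtentMap::default();
    let base = 0x1000_0000usize; // arbitrary address, for illustration only

    // First lmr: a fresh extent is registered, so the lookup succeeds.
    map.on_extent_alloc(base, size);
    let first = map.lookup(base, size);

    // First lmr is dropped: the dalloc hook removes the entry.
    map.on_extent_dalloc(base);

    // Assumption: the same range is reused without the alloc hook running,
    // so nothing re-registers it before the second lookup.
    let second = map.lookup(base, size);
    (first, second)
}
```

In this model, calling `simulate_two_allocs(44444444)` yields a successful first lookup and a `NotFound` for the second. If the real cause is similar, disabling retain would help because every reuse of the range would go through a fresh allocation and registration.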