search.json

[{"title":"Brief Dive into Macro-o1 and o1-like LLMs","date":"2024-11-29T05:30:38.000Z","url":"/LLM/macro-o1.html","tags":[["LLM","/tags/LLM/"],["MCTS","/tags/MCTS/"]],"categories":[["LLM","/categories/LLM/"]],"content":"Macro-o1 7B is the first o1-like open source LLM. In this post, we will dive into its training paradigm and its inference implementation. Code link:  Paper link:  Refer to caption Training Paradigm Macro-o1 adopts full-param SFT using existing CoT datasets by OpenAI and synthetic dataset generated by open source LLM. Compared with previous SFT, a CoT dataset provides a linear path for LLM to follow, while LLM can also explore other action nodes for exploration. This may lead to a balance between exploration and exploitation, significantly reduces the distribution shift caused by bias within dataset. MCTS Revisit Selection: Start from root R and select successive child nodes until a leaf node L is reached. The root is the current game state and a leaf is any node that has a potential child from which no simulation (playout) has yet been initiated. The section below says more about a way of biasing choice of child nodes that lets the game tree expand towards the most promising moves, which is the essence of Monte Carlo tree search. Expansion: Unless L ends the game decisively (e.g. win/loss/draw) for either player, create one (or more) child nodes and choose node C from one of them. Child nodes are any valid moves from the game position defined by L. Simulation: Complete one random playout from node C. This step is sometimes also called playout or rollout. A playout may be as simple as choosing uniform random moves until the game is decided (for example in chess, the game is won, lost, or drawn). Backpropagation: Use the result of the playout to update information in the nodes on the path from C to R. Training Procedures In the training procedure, \"action\" is formulated as LLM outputs, or to say, the fixed-length tokens . 
Then the value of each state is obtained by applying the softmax function to its log probability and the log probabilities of the top 5 alternative tokens. Here is the log probability of the token generated by the LLM, and denotes the top predicted tokens at step . Thus the reward of a state is obtained as: Dataset Preparation Besides the Open-O1 dataset, they also built a synthetic dataset using MCTS. An example of CoT data is provided in CoT_demo.json . It's generated by Qwen2.5-7B-Instruct. Inference Macro-o1 has a vllm implementation based on Qwen2Model. Compared with the original Qwen structure, it adds a generate_response function on top of the model's forward pass instead of directly using model.generate(). The following is the huggingface implementation, which is more detailed than the vllm one. This generation process exposes the token logits needed for the MCTS procedure above. In Macro's implementation, &lt;Thought&gt; and &lt;\\Thought&gt; are not special tokens!"},{"title":"Kyle's Market Micro-architecture Model","date":"2024-10-11T10:59:55.000Z","url":"/game-theory/kyle-market-model.html","tags":[["Game-Theory","/tags/Game-Theory/"],["Finance","/tags/Finance/"]],"categories":[["game-theory","/categories/game-theory/"]],"content":"The trading behavior in a market is often backed by asymmetric information, making the market an asymmetric game. Albert S. Kyle proposed a micro-structure formulation in 1985, stating that such a market consists of three players: an insider, a random noise trader, and a market maker. His successors also proved that an equilibrium exists under additional conditions. In this post, we will discuss Kyle's model formulated as an extensive-form game and an optimal-transport-based setting of the equilibrium. Basic Formulation The initial model proposed by Kyle describes three kinds of traders in the market: Insider (): Who knows the true value of the asset and chooses strategies accordingly. 
Noise Trader (): Who trades randomly as a Poisson process or Brownian motion. Market Maker (Y): Who observes the market and forms its own belief about the asset based on the summed order flow of the insider and the noise trader. Since the trading behavior is relatively noisy, one of the key contributions of Kyle's model is constructing an information-centric approach to analyze the market movement. The market can be formulated as a game tree with nodes formulated as Each round of trading can be broken down into two steps. Firstly, new information about the true value of the asset is revealed to the insider. Secondly, both the insider and the noise trader simultaneously trade a discrete quantity of shares on each price level. Information Flow The insider holds private information about the true value as a distribution. There are fundamental information states and the true value is a mapping The flow of information at time is defined as a set-valued function of . Belief formulation &amp; Pricing System The transitions between nodes are determined by the pricing system of the market maker and the strategy of the insider. The direct successors of a node in the game tree are given by Equilibrium Setting A Kyle equilibrium is a pair satisfying: Given , the strategy is optimal. Given , the pricing system is rational. Optimal Transport Solution There are a few works examining nonlinear strategies 𝑋 and the uniqueness of linear strategies in Kyle (1985). Single-Period Kyle Model Cho and El Karoui (2000) find a nonlinear strategy for the single-period Kyle model if they use a Bernoulli distribution for the noise term. For continuous noise (i.e. non-atomic distributions), they also characterize the existence of a unique (linear) equilibrium. Boulatov, Kyle, and Livdan (2012) show the linear strategy is unique for the original single-period Kyle model setup. Boulatov and Bernhardt (2015) also examine a single-period case and show that the linear strategy is unique and robust while nonlinear strategies are not robust. 
Thus the linear strategy is the equilibrium. Multi-Period Kyle Model Foster and Viswanathan (1993) show that for multi-period Kyle models, the linear strategy is a unique equilibrium for beliefs in the class of elliptical distributions (e.g. the Gaussian distribution used by Kyle). Continuous-time Kyle Model Back (1992) shows that in the continuous-time Kyle model, there may be nonlinear strategies. The strategies 𝑋 are, however, smooth and monotone in the total order size. As an interesting aside, Back and Baruch (2004) study conditions where the continuous-time Kyle model converges to the same equilibrium as the Glosten and Milgrom (1985) model. Stochastic Models of Market Microstructure"},{"title":"From MPC to Diffusion Policy","date":"2024-10-05T13:16:21.000Z","url":"/EBM/mpc-diffusion-policy.html","tags":[["robotics","/tags/robotics/"],["diffusion","/tags/diffusion/"],["EBM","/tags/EBM/"]],"categories":[["EBM","/categories/EBM/"]],"content":"In this post, we will try to connect Energy-Based Model with classical optimal control frameworks like Model-Predictive Control from the perspective of Lagrangian optimization. This is one part of the series about energy-based learning and optimal control. A recommended reading order is: Notes on \"The Energy-Based Learning Model\" by Yann LeCun, 2021 Learning Data Distribution Via Gradient Estimation [From MPC to Energy-Based Policy] How Would Diffusion Model Help Robot Imitation Causality hidden in EBM Review of EKF and MPC Consider a state-space model with process noise and measurement noise as: With a state and observation . 
To perform optimal control on such a system, we first predict and update our observation with an Extended Kalman Filter: Prediction Step Priori Estimation: Jacobian of the State Transition Function: Error Covariance Prediction: Update Step Jacobian of the Measurement Function: Kalman Gain: Posterior Estimation: Error Covariance Update: With the estimated state we can perform Model-Based Control with the following condition: $$ \\begin{aligned} \\min_{u} \\; J &amp;= \\sum_{i=0}^{N-1} \\ell(x[k+i], u[k+i]) + \\ell_f(x[k+N]) \\\\ \\ell(x, u) &amp;= (x - x_{\\text{ref}})^{\\top} Q (x - x_{\\text{ref}}) + u^{\\top} R u \\\\ \\text{s.t.} \\quad &amp; x[k+i+1] = f(x[k+i], u[k+i]), \\quad i = 0, \\dots, N-1 \\\\ &amp; x[k] = \\hat{x}[k] \\\\ &amp; x_{\\min} \\le x[k+i] \\le x_{\\max}, \\quad \\forall i \\\\ &amp; u_{\\min} \\le u[k+i] \\le u_{\\max}, \\quad \\forall i \\end{aligned} $$ u: Control input sequence. N: Prediction horizon. ℓ: Stage cost function. ℓ_f: Terminal cost function. x_min, x_max: State constraints. u_min, u_max: Control constraints. Introducing Energy Based Model We can replace the state transition function in the formulated state space model with EBM as: Supervised Training Since is often intractable, we will use score matching to learn the EBM in the following content. Here we have the score function as: Dataset: A set of transitions with \"independently and identically distributed (i.i.d.)\" assumption. The training objective is to minimize the difference between data landscape and model landscape , and the objective function is defined as follows, where loss is commonly defined as MSELoss: However, we cannot access the full data distribution. According to (Hyvärinen et al., 2005)1 we may use the following procedures. 
(More in Appendix A of original paper) $$ \\begin{aligned} J()&amp;={p{data}}[| s_(x) |^2 - 2 s_(x)^s_{}(x) + | s_{}(x) |^2] \\ J'()&amp;= {p{}} - {p{}} \\ \\end{aligned} $$ By integrating by parts, we move the derivative from to : $$ s_(x)^{x} p{}(x) dx = -p_{}(x) {x} s(x) dx \\ {x} s(x) = {x} ( -{x} E_(x) ) = -{x} {x} E_(x) = -{x} E(x)\\ {p{}} = -{p{}} \\begin{aligned} J() &amp;= {p{}} + {p{}} \\ J_k() &amp;= ( {x[k]}^2 E(x[k]) ) + | {x[k]} E(x[k]) |^2 \\end{aligned} $$ denotes the trace of Hessian matrix (or Jacobian) of score function w.r.t. . If you haven't seen such formulation in diffusion models and feel strange: The training objective is to learn the distribution of , which is a known Gaussian distribution since the noise level is provided. This is also why q-sampling requires . Optimization Based Inferencing Langevin Dynamics can produce samples from a probability density using only the score function . Given a fixed step size , and an initial value with being a prior distribution, the Langevin method recursively computes the following where . The distribution of equals when and , or else a Metropolis-Hastings update is needed to correct the error. Hyvärinen, A., &amp; Dayan, P. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4). ↩ "},{"title":"Notes on 'The Energy-Based Learning Model' by Yann LeCun, 2021","date":"2024-10-01T08:26:45.000Z","url":"/EBM/lecun-ebm-2021.html","tags":[["EBM","/tags/EBM/"],["self-supervised-learning","/tags/self-supervised-learning/"]],"categories":[["EBM","/categories/EBM/"]],"content":"In this note we connect the energy concepts including Lagrangian Dynamics and Gibbs Formula to measuring the quality of prediction. Then we go over the inferencing and training in multi-modal scenarios. This is one part of the series about energy-based learning and optimal control. 
A recommended reading order is: Notes on \"The Energy-Based Learning Model\" by Yann LeCun, 2021 Learning Data Distribution Via Gradient Estimation From MPC to Energy-Based Policy How Would Diffusion Model Help Robot Imitation Causality hidden in EBM Recording Link: Yann LeCun | May 18, 2021 | The Energy-Based Learning Model Further Readings: A Tutorial on Energy-Based Learning Reformulation of Back-propagation as Lagrangian Optimization Instead of forces, Lagrangian mechanics uses energy as a unified parameter. A Lagrangian is a function which summarizes the dynamics of the entire system. The non-relativistic Lagrangian for a system of particles in the absence of an electromagnetic field is given by where denotes total kinetic energy of the system and denotes the total potential energy, reflecting the energy of interaction between the particles. The optimization target is to minimize Lagrange's Equations for a time varying system with number of constraints being . Particles are labeled as and have positions and velocity . For each constraint equation , there's a Lagrange multiplier Gibbs Formula [TODO] Gibbs energy was developed in the 1870’s by Josiah Willard Gibbs. He originally termed this energy as the “available energy” in a system. His paper “Graphical Methods in the Thermodynamics of Fluids” published in 1873 outlined how his equation could predict the behavior of systems when they are combined. This quantity is the energy associated with a chemical reaction that can be used to do work, and is the sum of its enthalpy and the product of the temperature and the entropy of the system. Further reading: The Markov blankets of life: autonomy, active inference and the free energy principle Statement in DL Loss s.t. , Lagrangian for optimization under constraints Optimality conditions In back propagation, the Lagrange multiplier is the gradient Self Supervised Learning Learning hierarchical representations Learning predictive models Uncertainty/multi-modality? 
Energy Based Model Using a divergence measure as the metric cannot handle futures with multiple possibilities (e.g. multiple solutions in path planning). Under such a metric, the average of all possibilities may come out as the optimal result, and overfitting may also occur on datasets with too few samples. We can replace the divergence measure with an energy function, which measures the \"incompatibility\" between and . The compatible solution can be inferred using gradient descent, heuristic search, etc. Now the target becomes \"finding an output satisfying the constraints\". An unconditional EBM measures the compatibility between components of . For a conditional EBM , the energy is low near the provided data points and higher everywhere else. (PS. This optimization happens in the inference process.) As a visualization: Probabilistic models are a special case of EBM. Energies are like un-normalized negative log probabilities 1 . If we want to turn energy functions into distributions, we can use the Gibbs-Boltzmann distribution, which adopts the maximum entropy approach and the Gibbs formula. is a positive constant. However, the normalization constant in the denominator is often intractable. One possible solution is to learn the log-likelihood of , and the distribution is changed into EBM for Multi-Modal Scenarios Joint Embedding For the above example, the energy function is trained with the similarity of and . There may exist multiple unseen that have the same , and we may quantify those unseen by projecting them into the invariant subspace of latent . Latent Generative EBM Ideally the latent variable represents the independent explanatory factors of variation of the prediction. But since it is unobservable, the information capacity of the latent variable must be minimized. We may also see it as a \"bias\", or a placeholder for uncertainties. The inference can be formulated as: However, the latent variable is not observed and cannot be measured in a supervised manner. 
So we will have to minimize its effect. is a free energy model constrained by the \"temperature\" term, as previously shown in the Gibbs-Boltzmann distribution. In practice, can be a variance schedule (e.g. DDPM). If we try to understand this in a causal approach, we can treat as an unobserved confounder, which causes a biased estimation of the effect of on . This operator means an intervention (or call it adjustment) of . The intervention cannot \"back-propagate\" to the confounder, thus the result is changed. The \"adjustment formula\" provides an unbiased solution by taking all into consideration. However, the domain of is often intractable and we need to regularize it so it's small enough to neglect. On the contrary, we can also examine the robustness of with sensitivity analysis. This connects to the concept of \"quantify the uncertainty of unseen by projecting them into the invariant subspace of latent \" mentioned above. For those who may be interested: Sensitivity analysis of such EBM: For simplicity, we directly write for . If we treat the above generative EBM as a causal graph, assuming a linear relationship among variables (non-linear cases will be derived in the following posts), we will have: We can obtain the confounding bias by adjusting , leading to different , which gives the confounding bias: The Average Total Effect (ATE) of on is By assuming the distribution of , we can derive the uncertainty of by \"propagating\". We can use the Fisher Information, which is defined to be the variance of the score function, to further state the uncertainty. This will come in the next post that provides a more detailed formulation of score function and EBM. Training of EBM Shape so that: is strictly smaller than for all different from . Keep smooth. (Max-likelihood probabilistic methods break this) More Existing approaches: Contrastive-based where is a negative sample. Regularized / Architectural methods: Minimize the volume of low-energy regions. e.g. 
Limit the capacity of latent: Appendix Why Max-Likelihood Sucks in Contrastive Method Makes the energy landscape into a valley. Further readings of learning data distributions: Generative Modeling by Estimating Gradients of the Data Distribution "},{"title":"How Would Diffusion Model Help Robot Imitation","date":"2024-09-01T15:56:13.000Z","url":"/robotics/diffusion-robot-imitation.html","tags":[["robotics","/tags/robotics/"],["diffusion","/tags/diffusion/"],["EBM","/tags/EBM/"]],"categories":[["robotics","/categories/robotics/"]],"content":"In short, using a diffusion process instead of directly applying an MSE loss on trajectories lets the policy learn a wider variety of trajectory solutions from imitation data, rather than only the trajectory provided by the dataset(s), which is beneficial for small-set imitation learning. In this blog post, we will discuss how and why the diffusion process can achieve such a trajectory-agnostic result. This is one part of the series about energy-based learning and optimal control. A recommended reading order is: Notes on \"The Energy-Based Learning Model\" by Yann LeCun, 2021 Learning Data Distribution Via Gradient Estimation From MPC to Energy-Based Policy How Would Diffusion Model Help Robot Imitation Causality hidden in EBM These trajectories, learned by diffusion policy for the Push-T task, show that the diffusion loss makes it possible to learn multiple solutions from a single piece of imitation data. Diffusion Process Revisiting the diffusion process of DDPM. We can start from the target functions of DDPM. The forward process, where noise is added to samples, is formulated as where , schedules the noise level. This can be inferred from based on its Markov chain property. The optimization target of the network is formulated as predicting the noise added in the forward process at different noise levels. For example, , where denotes noise generated with noise level . Pseudo-code of DDPM training and sampling. 
And if we want to add conditions to the model, we can simply feed them to the noise predictor through cross attention. Paradigm: Learning Distribution by Estimating Gradients It’s easy for researchers to come up with the idea that the conditional observation-action distribution can also be learned by estimating gradients of the data distribution, which is a common paradigm of RL and IL. Implicit Policy Learning Score-Based Generative Modeling through Stochastic Differential Equations states that score matching reverses a \"variance exploding\" SDE where more and more noise is added to data, while diffusion models reverse a \"variance preserving\" SDE that interpolates between the data and a fixed-variance Gaussian. Stochastic Sampling and Initialization Intuitively, multi-modality in action generation for diffusion policy arises from two sources: an underlying stochastic sampling procedure and a stochastic initialization. Trajectory Generation with Diffusion Models People have been dreaming of using video or scene generation models as “world models” to guide robot planning. However, most of the information in a video frame, even one captured in a real scenario, is highly redundant for the task at hand. Because a unified representation of states and actions is needed across different training sets, especially visual training sets like Internet videos, multiple works adopt trajectories as an elegant time-dependent modality; generating them with diffusion is also low-hanging fruit. Flow as the Cross-domain Manipulation Interface "},{"title":"Notes on ScoreGrad","date":"2024-08-07T23:22:01.000Z","url":"/diffusion/scoregrad.html","tags":[["diffusion","/tags/diffusion/"],["time series","/tags/time-series/"]],"categories":[["diffusion","/categories/diffusion/"]],"content":"ScoreGrad is an EBM that uses iterative conditional SDE sampling via diffusion to perform multi-variate probabilistic time series prediction. 
Basically, it uses an RNN/LSTM/GRU to encode the past time series as a condition and samples a probability distribution over the predicted time series based on it. Paper link:  Code release:  Architecture Conditioner As a conditional generation task, it uses an RNN/LSTM and extracts the last hidden state as the feature. For a time Score Function"},{"title":"[Reading]PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects","date":"2023-10-01T21:18:03.000Z","url":"/NeRF/PARIS.html","tags":[["NeRF","/tags/NeRF/"],["Articulation Reconstruction","/tags/Articulation-Reconstruction/"]],"categories":[["NeRF","/categories/NeRF/"]],"content":"ABSTRACT: This paper introduces a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models and optimizes motion parameters jointly without requiring any 3D supervision, motion, or semantic annotation. The training process is similar to the original NeRF, but extends the ray marching and volumetric rendering procedure to compose the two fields. [Arxiv] [Github] [Project Page] Problem Statement The problem of articulated object reconstruction in this paper can be summarized as: Given start state and end state and corresponding multi-view RGB images and camera parameters. The first problem is to decouple the object into static and movable parts. Here the paper assumes that an object has only one static and one movable part. The second problem is to estimate the articulated motion . A revolute joint is parametrized as a pivot point and a rotation as quaternion , . A prismatic joint is modeled as a joint axis as unit vector and a translation distance . The training process adopts one of them according to the prior information. If no such prior info is given, the motion is modeled by . Method This paper separates the parts by registering the input state to a canonical state . The components that agree with the transformation are extracted as the moving part, and the rest as the static part. 
Structure Static and moving part are jointly learnt during training and they are built separately on networks with the same structure that built upon InstantNGP. Their relationship is modeled explicitly as the transformation function as described in Problem Statement. The fields are represented as: $$ { \\begin{aligned} :&amp; S(x_t,d_t)= S(x_t),c^S(x_t,d_t)\\ :&amp; S(x_{t},d_{t})= S(x_{t}),cS(x_{t^},d_{t^*})\\ \\end{aligned} . $$ Here 𝕩 is a point sampled along a ray at state with direction. 𝕕. 𝕩 is the density value of the point x, and 𝕩𝕕 is the RGB color predicted from the point x from a view direction . Training The adapted training pipeline is similar to NeRF and the ray marching and volumetric rendering procedure to compose the two fields is extended."},{"title":"【Reading】Ditto-Building Digital Twins of Articulated Objects from Interaction","date":"2023-06-13T16:36:17.000Z","url":"/Interactive-Perception/ditto.html","tags":[["NeRF","/tags/NeRF/"],["Articulation Reconstruction","/tags/Articulation-Reconstruction/"],["Interactive-Perception","/tags/Interactive-Perception/"]],"categories":[["Interactive-Perception","/categories/Interactive-Perception/"]],"content":"This paper propose a way to form articulation model of articulated objects by encoding the features and find the correspondence of static and mobile part via visual observation before and after the interaction. image-20230616163010066 Workflow-In Brief Two Stream Encoder Given point cloud observations before and after interaction: Encode them with PointNet++ Encoder : , . . is the number of the sub-sampled points, and is the dimension of the sub-sampled point features. Fuse the features with attention layer: , , . The fused feature is decoded by two PointNet++ decoder , , and get , . are point features aligned with Feature encoding based on ConvONet. image-20230616183619689 is projected into 2D feature planes and is projected into voxel grids as in the ConvONets. 
The points that fall into the same pixel cell or voxel cell are aggregated together via max pooling. Training image-20230616185216164 image-20230616185327505 Revolute joint: image-20230616185351385 image-20230616185433598 "},{"title":"【Reading】Ditto in the House-Building Articulation Models of Indoor Scenes through Interactive Perception","date":"2023-06-08T19:41:29.000Z","url":"/Interactive-Perception/ditto-in-the-house.html","tags":[["NeRF","/tags/NeRF/"],["robotics","/tags/robotics/"],["Articulation Reconstruction","/tags/Articulation-Reconstruction/"],["Interactive-Perception","/tags/Interactive-Perception/"]],"categories":[["Interactive-Perception","/categories/Interactive-Perception/"]],"content":"The paper proposed a way of modeling interactive objects in a large-scale 3D space by making affordance predictions and inferring the articulation properties from the visual observations before and after the self-driven interaction. Workflow-In brief Components Overview Given Initial scene observation , each point is a 6D vector . Get an affordance map and samples peak locations as interaction hotspots (The locations where the robot can successfully manipulate the articulated objects, and infer the articulation model) from observation . For each interaction, robot applies force to the hotspot to produce potential articulated motions. Sample point cloud , center on the interaction hotspot before and after the interaction. Record contact location . An articulation inference network segments the point cloud into static and mobile parts based on . Estimate articulation parameters. [1] Prismatic joint: , where is the translation axis, is the joint state, which is the relative translation distance. Revolute joint: , where is the revolute axis, is the pivot point on the revolute axis, is the joint state, which is the relative rotation angle. Map the estimated articulation model of each object from the global frame. 
The set of these articulation models constitutes the scene-level articulation model . Modules-In detail Affordance Prediction Estimation based on PointNet++[2] Architecture of PointNet++ The prediction problem is modeled as a binary classification problem. The observation point cloud is fed into PointNet++ to get a point-wise affordance map. Exploration driven training: We first uniformly sample locations over the surface of both articulated and non-articulated parts, then have the robot interact with them. If the robot successfully moves any articulated part of the objects, we label the corresponding location as a positive affordance, and as negative otherwise. To simulate gripper-based interactions, we perform a collision check at each location to ensure enough space for placing a gripper. The virtual robot interacts with the object as in reality (pull, push or rotate). For each successful interaction, we collect the robot’s egocentric observations before and after interaction and the object’s articulation model as training supervision. The data distribution is imbalanced due to the large proportion of negative data. To mitigate the imbalance problem, we optimize the network with the combination of the cross-entropy loss and the dice loss. Non-maximum suppression (NMS) for peak selection: Select the point with the maximum score and add it to the preserving set. Suppress its neighbors within a certain distance threshold. Repeat this process until all points are added to the preserving set or suppressed. Articulation Inference Ditto: Building Digital Twins of Articulated Objects [3] The occupancy decoder is discarded. Refinement The estimated articulation model could have a higher accuracy if the observations covered significant articulation motions and a complete view of the object’s interior. However, these observations may be partially occluded due to ineffective actions. e.g. 
we find that articulation estimation of a fully opened revolute joint, like , is more accurate in terms of angle error than one with an ajar joint. Accordingly, we develop an iterative procedure of interacting with partially opened joints and refining the articulated predictions. we exploit the potential motion information from the previous articulation model and extract the object-level affordance. We refine the affordance prediction by selecting a pair of locations and actions to produce the most significant articulation motion. We set the force direction as the moment of the axis. Given the joint axis and part segment, we select the point in the predicted mobile part farthest from the joint axis as our next interaction hotspot. References [1] Li, X., Wang, H., Yi, L., Guibas, L. J., Abbott, A. L., &amp; Song, S. (2020). Category-level articulated object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3706-3715).  [2] Qi, C. R., Yi, L., Su, H., &amp; Guibas, L. J. (2017). Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30.  pointnet arxiv.org/pdf/1612.00593.pdf [3]"},{"title":"【Reading】LATITUDE:Robotic Global Localization with Truncated Dynamic Low-pass Filter in City-scale NeRF","date":"2023-04-19T19:22:28.000Z","url":"/NeRF/LATITUDE.html","tags":[["NeRF","/tags/NeRF/"],["robotics","/tags/robotics/"],["UAV","/tags/UAV/"],["localization","/tags/localization/"],["optimization","/tags/optimization/"]],"categories":[["NeRF","/categories/NeRF/"]],"content":"This paper proposes a two-stage localization mechanism in city-scale NeRF. Abstract Neural Radiance Fields (NeRFs) have made great success in representing complex 3D scenes with high-resolution details and efficient memory. Nevertheless, current NeRF-based pose estimators have no initial pose prediction and are prone to local optima during optimization. 
In this paper, we present LATITUDE: Global Localization with Truncated Dynamic Low-pass Filter, which introduces a two-stage localization mechanism in city-scale NeRF. In place recognition stage, we train a regressor through images generated from trained NeRFs, which provides an initial value for global localization. In pose optimization stage, we minimize the residual between the observed image and rendered image by directly optimizing the pose on the tangent plane. To avoid falling into local optimum, we introduce a Truncated Dynamic Low-pass Filter (TDLF) for coarse-to-fine pose registration. We evaluate our method on both synthetic and real-world data and show its potential applications for high-precision navigation in large scale city scenes. System Design Place Recognition Original poses, accompanied by additional poses around the original ones are sampled. The pose vector is passed through the trained and fixed Mega-NeRF with shuffled appearance embeddings. Initial poses of the inputted images are predicted by a pose regressor network. Pose Optimization The initial poses are passed through positional encoding filter The pose vector is passed through the trained and fixed Mega-NeRF and generates a rendered image. Calculate the photometric error of the rendered image and the observed image and back propagate to get a more accurate pose with the TDLF. Implementation Place Recognition Data Augmentation: A technique in machine learning used to reduce overfitting when training a machine learning model by training models on several slightly-modified copies of existing data. First uniformly sample several positions in a horizontal rectangle area around each position around original poses . Then add random perturbations on each axis drawn evenly in , where is the max amplitude of perturbation to form sampled poses . They are used to generate the rendered observations by inputting the poses to Mega-NeRF. 
To avoid memory explosion, we generate the poses using the method above and use Mega-NeRF to render images during specific epochs of pose regression training. Additionally, Mega-NeRF’s appearance embeddings are selected by randomly interpolating those of the training set, which can be considered as a data augmentation technique to improve the robustness of the APR model under different lighting conditions. Pose Regressor: Absolute pose regressor (APR) networks are trained to estimate the pose of the camera given a captured image. Architecture: Built on top of VGG16’s light network structure, we use 4 fully connected layers to learn pose information from image sequences. Input: Observed image (resolution ), rendered observations Output: Corresponding estimated poses , . Loss Function: (In general, the model should place more trust in real-world data and learn more from it.) Pose Optimization MAP Estimation Problem[A] Formulation: Here denotes place recognition; denotes the trained Mega-NeRF. We optimize the posterior by minimizing the photometric error between and the image rendered by . Optimization on Tangent Plane: We optimize the pose on the tangent plane to ensure a smoother convergence. [1] TODO I know nothing about :( Explanations &amp; References [1] Adamkiewicz, M., Chen, T., Caccavale, A., Gardner, R., Culbertson, P., Bohg, J., &amp; Schwager, M. (2022). Vision-only robot navigation in a neural radiance world. IEEE Robotics and Automation Letters, 7(2), 4606-4613.  Turki, H., Ramanan, D., &amp; Satyanarayanan, M. (2022). Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12922-12931).  Yen-Chen, L., Florence, P., Barron, J. T., Rodriguez, A., Isola, P., &amp; Lin, T. Y. (2021, September). inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1323-1330). IEEE.  
[A] Maximum A Posteriori (MAP) Estimation: Maximum a posteriori (MAP) estimation is a method of statistical inference that uses Bayes' theorem to find the most likely estimate of a parameter given some observed data. Maximum a posteriori estimation - Wikipedia "},{"title":"Reading: \"NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis\"","date":"2023-04-17T19:21:56.000Z","url":"/NeRF/NeRF-startup.html","tags":[["NeRF","/tags/NeRF/"],["papers","/tags/papers/"],["Computer-Vision","/tags/Computer-Vision/"],["Deep-Learning","/tags/Deep-Learning/"]],"categories":[["NeRF","/categories/NeRF/"]],"content":"This is a summary of the paper \"NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis\". Keywords: scene representation, view synthesis, image-based rendering, volume rendering, 3D deep learning A brief understanding: How to train a network for NeRF Training a neural network for NeRF (Neural Radiance Fields) involves several steps, including data preparation, network architecture design, training, and evaluation. Data preparation: The first step is to prepare the data that will be used to train the neural network. This typically involves capturing a set of 3D scans of the object or environment being represented, and labeling the data with the corresponding colors that should be associated with each point in the 3D space. Network architecture design: The next step is to design the architecture of the neural network that will be used to represent the object or environment. This typically involves defining the number and types of layers in the network, as well as the size and shape of the network. Training: Once the network architecture has been designed, the next step is to train the network using the prepared data. This involves feeding the data into the network and adjusting the weights of the network over multiple iterations, or epochs, to optimize the performance of the network. 
Evaluation: After the network has been trained, it is typically evaluated on a separate set of data to measure its performance and ensure that it is generating accurate results. This can involve comparing the output of the network to the ground truth data, as well as using visualization techniques to compare the rendered images produced by the network to actual photographs of the object or environment. Overall, the process of training a neural network for NeRF involves a combination of data preparation, network architecture design, training, and evaluation to produce a highly accurate and efficient 3D representation of an object or environment. By Vicuna-13b Contribution An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks. A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy to allocate the MLP's capacity towards space with visible scene content. A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content. An overview of our neural radiance field scene representation and differentiable rendering procedure. Here g.t. represents the \"ground truth\", which means the real scene. Overview of the Rendering Process March camera rays through the scene to generate a sampled set of 3D points. Use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities. Use classical volume rendering techniques to accumulate those colors and densities into a 2D image. Because this process is differentiable, we can use gradient descent to optimize the model by minimizing the error between each observed image and the corresponding view rendered from our representation. 
Neural Radiance Field Scene Representation This is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network \\(F_\\Theta: (\\mathbf{x}, \\mathbf{d}) \\to (\\mathbf{c}, \\sigma)\\), whose input is a single continuous 5D coordinate \\((x, y, z, \\theta, \\phi)\\) and whose output is the volume density and view-dependent emitted radiance at that spatial location. \\(\\mathbf{x} = (x, y, z)\\): 3D location \\(\\mathbf{d} = (\\theta, \\phi)\\): 2D viewing direction \\(\\mathbf{c} = (r, g, b)\\): Emitted color \\(\\sigma\\): Volume density From Object to Scene: Volume Rendering with Radiance Fields Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space. We render the color of any ray passing through the scene using principles from classical volume rendering[1]. The volume density \\(\\sigma(\\mathbf{x})\\) can be interpreted as the differential probability[2] of a ray terminating at a particle at location \\(\\mathbf{x}\\). The expected color \\(C(\\mathbf{r})\\) of camera ray \\(\\mathbf{r}(t)\\) with near bound \\(t_n\\) and far bound \\(t_f\\) is \\(C(\\mathbf{r}) = \\int_{t_n}^{t_f} T(t)\\,\\sigma(\\mathbf{r}(t))\\,\\mathbf{c}(\\mathbf{r}(t), \\mathbf{d})\\,dt\\), where \\(T(t) = \\exp\\left(-\\int_{t_n}^{t} \\sigma(\\mathbf{r}(s))\\,ds\\right)\\). \\(\\mathbf{r}(t) = \\mathbf{o} + t\\mathbf{d}\\): Camera ray, where \\(\\mathbf{o}\\) is the position of the camera, \\(\\mathbf{r}(t)\\) is the position of the point in the 3D space being rendered, and \\(\\mathbf{d}\\) is the direction of the camera ray. \\(T(t)\\): Denotes the accumulated transmittance along the ray from \\(t_n\\) to \\(t\\), i.e., the probability that the ray travels from \\(t_n\\) to \\(t\\) without hitting any other particle. Example: From 3D object to hemisphere of viewing directions In (a) and (b), we show the appearance of two fixed 3D points from two different camera positions: one on the side of the ship (orange insets) and one on the surface of the water (blue insets). Our method predicts the changing appearance of these two 3D points with respect to the direction of observation \\(\\mathbf{d}\\), and in (c) we show how this behavior generalizes continuously across the whole hemisphere of viewing directions (this hemisphere can be viewed as the plot of \\(\\mathbf{c}(\\mathbf{d})\\), where \\(\\mathbf{d}\\) is the unit vector in the spherical coordinate frame and \\(\\mathbf{c}\\) shows the color). 
Discrete Sampling Rendering a view from our continuous neural radiance field requires estimating this integral for a camera ray traced through each pixel of the desired virtual camera. Deterministic quadrature[3] would limit the resolution of the representation, because the MLP would only ever be queried at a fixed discrete set of locations. Instead, we use stratified sampling to numerically estimate this continuous integral: we partition \\([t_n, t_f]\\) into \\(N\\) evenly-spaced bins and then draw one sample uniformly at random from within each bin: \\(t_i \\sim \\mathcal{U}\\left[t_n + \\frac{i-1}{N}(t_f - t_n),\\ t_n + \\frac{i}{N}(t_f - t_n)\\right]\\). From Scene to Object: Estimation of \\(C(\\mathbf{r})\\) \\(\\hat{C}(\\mathbf{r}) = \\sum_{i=1}^{N} T_i\\,(1 - \\exp(-\\sigma_i \\delta_i))\\,\\mathbf{c}_i\\), where \\(T_i = \\exp\\left(-\\sum_{j=1}^{i-1} \\sigma_j \\delta_j\\right)\\). \\(\\delta_i = t_{i+1} - t_i\\): The distance between adjacent samples This function for calculating \\(\\hat{C}(\\mathbf{r})\\) from the set of \\((\\mathbf{c}_i, \\sigma_i)\\) values is trivially differentiable and reduces to traditional alpha compositing[4] with alpha values \\(\\alpha_i = 1 - \\exp(-\\sigma_i \\delta_i)\\). Implementation details Network Architecture First 8 layers (ReLU): Input: 3D coordinate \\(\\mathbf{x}\\) processed by \\(\\gamma\\). Output: \\(\\sigma\\); 256-dimensional feature vector. Last layer: Input: 256-dimensional feature vector; Cartesian viewing direction unit vector \\(\\mathbf{d}\\) processed by \\(\\gamma\\). Output: View-dependent RGB color. Details of variables are in Improving Scenes of High Frequency. Network Architecture Training Datasets: Captured RGB images of the scene, the corresponding camera poses and intrinsic parameters, and scene bounds (we use ground truth camera poses, intrinsics, and bounds for synthetic data, and use the COLMAP structure-from-motion package to estimate these parameters for real data). Iteration: Randomly sample a batch of camera rays from the set of all pixels in the dataset, then follow the hierarchical sampling. Loss: The total squared error between the rendered and true pixel colors for both the coarse and fine renderings. In our experiments, we use a batch size of 4096 rays, each sampled at \\(N_c = 64\\) coordinates in the coarse volume and \\(N_f = 128\\) additional coordinates in the fine volume. We use the Adam optimizer with a learning rate that begins at \\(5 \\times 10^{-4}\\) and decays exponentially to \\(5 \\times 10^{-5}\\) over the course of optimization (other Adam hyper-parameters are left at default values of \\(\\beta_1 = 0.9\\), \\(\\beta_2 = 0.999\\), and \\(\\epsilon = 10^{-7}\\)). 
The optimization for a single scene typically takes around 100-300k iterations to converge on a single NVIDIA V100 GPU (about 1-2 days). Notable Tricks Enhancements brought by the tricks Improving Scenes of High Frequency Deep networks are biased towards learning lower frequency functions. The paper applies this finding in the context of neural scene representations, and shows that reformulating \\(F_\\Theta\\) as a composition of two functions \\(F_\\Theta = F'_\\Theta \\circ \\gamma\\), where \\(\\gamma\\) is fixed (not learned), significantly improves performance. \\(\\gamma\\) maps each input value from \\(\\mathbb{R}\\) to \\(\\mathbb{R}^{2L}\\): \\(\\gamma(p) = \\left(\\sin(2^0 \\pi p), \\cos(2^0 \\pi p), \\ldots, \\sin(2^{L-1} \\pi p), \\cos(2^{L-1} \\pi p)\\right)\\). This function is applied separately to each of the three coordinate values in \\(\\mathbf{x}\\) (which are normalized to lie in \\([-1, 1]\\)) and to the three components of the Cartesian viewing direction unit vector \\(\\mathbf{d}\\) (which by construction lie in \\([-1, 1]\\)). In the experiments, we set \\(L = 10\\) for \\(\\gamma(\\mathbf{x})\\) and \\(L = 4\\) for \\(\\gamma(\\mathbf{d})\\). Reducing the Cost with Hierarchical Sampling Our rendering strategy of densely evaluating the neural radiance field network at \\(N\\) query points along each camera ray is inefficient: free space and occluded regions that do not contribute to the rendered image are still sampled repeatedly. Instead of just using a single network to represent the scene, we simultaneously optimize two networks: one \"coarse\" and one \"fine\". The coarse Network Rewrite the alpha composited color as a weighted sum of all sampled colors along the ray: \\(\\hat{C}_c(\\mathbf{r}) = \\sum_{i=1}^{N_c} w_i \\mathbf{c}_i\\), with \\(w_i = T_i\\,(1 - \\exp(-\\sigma_i \\delta_i))\\). \\(N_c\\): The number of sampling points for the coarse network. The fine Network Normalizing the weights as \\(\\hat{w}_i = w_i / \\sum_{j=1}^{N_c} w_j\\) produces a piecewise-constant PDF along the ray. Then sample \\(N_f\\) locations from this distribution using inverse transform sampling[5]. Then we evaluate the fine network's color \\(\\hat{C}_f(\\mathbf{r})\\) using all \\(N_c + N_f\\) samples. Conclusion TODO Explanations [1] Volume rendering is a technique used in computer graphics and computer vision to visualize 3D data sets as 2D images. It works by slicing the 3D data set into a series of thin layers, and then rendering each layer as a 2D image from a specific viewpoint. These 2D images are then composited together to form the final volume rendering. 
[2] If a distribution (here in 3D space) has a density, that means that for (almost) any volume in that space, you can assign a probability to it by integrating the density over that volume (here \"density\" means probability per unit volume, very similar to, say, the concentration of salt in a solution). [3] Deterministic quadrature is a mathematical method used to estimate the definite integral of a function. The basic idea is to divide the area under the curve into smaller areas at fixed, predetermined points, and approximate the definite integral by summing those smaller areas. There are several types of deterministic quadrature methods, including the trapezoidal rule, Simpson's rule, and Gaussian quadrature. [4] Alpha compositing is a technique used in computer graphics and image processing to combine two or more images or video frames by blending them together using an alpha channel. The alpha channel is a mask that defines the transparency or opacity of each pixel in the image. Alpha compositing is used to create composites, where the resulting image is a combination of the original images, with the transparency or opacity of each image controlled by the alpha channel. The alpha channel can be used to create effects such as blending, fading, and layering. Alpha compositing - Wikipedia [5] Inverse transform sampling is a method for drawing random samples from a probability distribution given its cumulative distribution function (CDF). The basic idea is to draw a number uniformly at random from the interval [0, 1] and return the value at which the CDF reaches that number, i.e. to apply the inverse CDF to the uniform sample. For a piecewise-constant PDF such as the one above, the CDF is piecewise linear, so the inverse can be computed by locating the bin and interpolating within it. 
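Inverse transform sampling - Wikipedia"},{"title":"[Notes] The Logic of Project Management","date":"2022-08-22T00:52:57.000Z","url":"/Project-Management/pro-mgt-lecs.html","tags":[["项目管理","/tags/%E9%A1%B9%E7%9B%AE%E7%AE%A1%E7%90%86/"]],"categories":[["Project-Management","/categories/Project-Management/"]],"content":"These notes record a series of management lectures from Tsinghua University. What Is a Project Projects and operations: a team's two main tasks Operation: ongoing, repetitive work, such as tax filing, expense reimbursement, and bookkeeping Project: phased, one-off work A project is a temporary endeavor undertaken to create a unique product, service, or result (PMBOK 6th) Characteristics of a Project Organizational Project Management Project Life Cycles and Their Application The main project life cycles are: Predictive Iterative Incremental Agile Predictive The outcome of the project is already known before the project starts The result is well defined and the development process is mature Example The classic waterfall development model Every step has to be thorough and complete Requirements can be raised freely during the requirements analysis phase, but cannot be changed once solution design begins (the direction of the project is irreversible) Iterative and Incremental Iterative: The project is a single whole that is upgraded version by version Incremental: The project is split into several parts that are delivered one by one, and each delivery is in a finished state The customer can see the results of the work early Adaptive Development (Agile Development) Product Backlog: the pool of customer requirements Sprint: a development cycle Adaptive development proceeds continuously and at a steady rhythm. Overall Comparison Controlling Project Progress Requirements keep changing while time and material budgets fall short: launch first, iterate later? \\(\\implies\\) difficult operation and maintenance A reasonable division into project phases should be defined instead Project Phases Example 1. Engineering construction Set a gate at each phase according to discipline and inspect promptly; the next phase can begin only when the conditions are met Example 2. Product design Main flow: requirements analysis → prototype design → product development → acceptance and delivery Control project quality; cut losses promptly when the project drifts from its original intent or becomes hard to correct The main criterion is whether the project's stated goals are achieved. Division of Labor in a Project Project-oriented management Characterized by a matrix structure: there are specialist departments divided by function, as well as project teams formed by drawing people from those departments The vertical division is by specialty (functional departments); the horizontal division is by project team Project Stakeholders Salience Model (blue marks the attributes a stakeholder possesses) Reaching consensus among project stakeholders References 【公开课】清华大学：项目管理的逻辑（全6讲） (Open course, Tsinghua University: The Logic of Project Management, 6 lectures)"}]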