This project applies Reinforcement Learning (RL) to optimize trade execution strategies for minimizing transaction costs when selling a fixed quantity of shares in a day. This documentation outlines the steps and methodologies used, the algorithms implemented, and the experimental results compared against standard benchmarks (TWAP and VWAP).
- Project Objectives
- Data Analysis
- Project Assumptions
- Methodology
- Benchmark Comparison
- Experimentation & Results
- Future Work
- Contributors
The main goal is to develop an intelligent RL-based trade execution strategy that minimizes transaction costs over a single trading day, given historical trade and quote data. We explore a variety of RL algorithms to create a trading agent that can outperform TWAP and VWAP benchmarks.
We start by analyzing the provided high-frequency trading data, containing bid and ask prices, bid and ask sizes, and OHLC data. Key points observed:
- Each trading day contains minute-level data with up to 390 rows.
- The data has structured bids and asks, allowing us to calculate transaction costs associated with slippage and market impact.
- TWAP and VWAP benchmarks provide cost baselines against which our RL strategies are compared.
To simplify the modeling process, we made the following assumptions:
- No Fractional Trading: All trade orders are in whole numbers, with no fractional shares allowed.
- Mandatory Sale Completion: The model must fully execute the sale of 1000 shares by the end of each trading day. This constraint ensures the model’s strategy aligns with end-of-day liquidity requirements.
- Market Order Execution: Every trade is executed as a market order, with the bid price as the execution price, regardless of quantity. This approach prioritizes immediate execution over limit order considerations, impacting slippage and market impact.
These assumptions influence the model’s strategies and cost-minimization goals. They ensure realistic but simplified trading conditions, focusing on optimizing transaction cost rather than more complex factors like order book dynamics or time-weighted sales.
We began by studying relevant research, including papers on trade execution with reinforcement learning and well-established benchmarks. We adapted methodologies to fit our corporate objective of minimizing costs and developed experiments based on these research insights.
To systematically approach the solution, we implemented and tested multiple RL models:
- Q-Learning with DQN - Used for its simplicity, DQN was first implemented to validate the setup and understand baseline agent performance.
- Custom DQN Model - Built a custom DQN model to accommodate specific requirements in our data structure.
- PPO Implementation - Shifted to the PPO algorithm, exploring its suitability for continuous action spaces and stable learning.
For deployment, I used the Custom DQN model. I experimented with the other two models in a Python notebook.
Each model implementation follows a similar workflow:
- Environment Setup - Used custom Gym environments with action spaces allowing for partial or no trade executions within each time step.
- Reward Calculation - Defined the reward as negative transaction cost, emphasizing cost minimization as the primary objective.
- Exploration vs. Exploitation - Guided the model to avoid excessive selling and spaced out trading behavior by experimenting with reward shaping and epsilon decay in DQN and entropy coefficient tuning in PPO.
For evaluation, we compare each RL agent's transaction costs against:
- TWAP (Time-Weighted Average Price) - Divides total shares equally over time.
- VWAP (Volume-Weighted Average Price) - Proportionally sells shares based on trading volume at each interval.
Each experiment documents total transaction costs across multiple trading days to compare the effectiveness of each RL model against TWAP and VWAP.
Model | Avg Transaction Cost | TWAP Cost | VWAP Cost | Comparison |
---|---|---|---|---|
DQN | [1.13] | [0.0023] | [0.00029] | + TWAP/ + VWAP |
Custom DQN | [0.034] | [0.0023] | [0.00029] | + TWAP/ + VWAP |
PPO | [0.003] | [0.0023] | [0.00029] | ~ TWAP/ + VWAP |
- '+' increase cost
- '~' Almost same cost
Note: Findings suggest that PPO technique is the best among all, probably because the use of continuos space of action. But for the purpose of the demonstration and considering the time constraint for this taks I have deployed DQN model on the cloud. But we can surely replace PPO Algorithm in the deploy directory.
Each model's performance is summarized, highlighting improvements or areas where they fall short relative to benchmarks.
-
Benchmark Assumptions: Standard benchmarks like TWAP and VWAP operate under conditions that allow for:
- Fractional Shares: Both TWAP and VWAP can sell fractional shares at each time interval, which reduces transaction costs but does not reflect real-world trading limitations.
- No Sale Completion Requirement: These benchmarks do not enforce the completion of 1000-share sales by day’s end, allowing for a lower transaction cost without liquidity constraints.
-
Real-World Applicability: In contrast, our RL models are built with realistic constraints:
- Whole-Share Trades Only: Our RL agent executes whole-share trades, aligning with the discrete nature of real-world transactions.
- Mandatory Sale Completion: The model is required to complete 1000-share sales by day’s end, simulating end-of-day liquidity requirements.
Given these considerations, while the RL model may not outperform the benchmarks in terms of transaction cost, its strategy is better suited for practical, real-world trading conditions where such restrictions are standard.
Overall, this result underlines the real-world value of our approach, even if benchmark comparisons appear less favorable under theoretical assumptions.
To get started you can check the readme file in train and deploy directories respectively.
Future directions for this project include:
- Futher testing: - The findings from this task needs further testing to ensure the robustness of the algorithms.
- Enhanced Model Architectures - Explore multi-agent setups, ensemble models, or deep Q-networks with prioritized experience replay.
- Advanced Reward Shaping - Further refine reward signals to encourage more optimal behavior and stable learning.
- Real-World Deployment Considerations - Test scalability and adapt models to real-world trading environments with live data integration.
- [Shyamal Gandhi] - [email protected]
- Reinforcement Learning in Limit Order Markets — This is the primary research paper you need to replicate. This will serve as the foundational model for your task, although you will need to adjust the objective to minimize transaction costs rather than maximize wealth.
- AWS SageMaker Documentation:
- AWS SageMaker Developer Guide — This documentation will help you understand how to deploy your model on SageMaker, including setting up real-time inference endpoints for serving the trade optimization model.
- SageMaker Real-Time Inference — Detailed guide on deploying machine learning models using real-time inference in SageMaker, which is essential for making your model accessible via an API.
- PPO implementation
- Stable Baselines3 SAC Documentation — A practical library for implementing SAC in Python, using PyTorch.
- Time-Weighted Average Price (TWAP) and Volume-Weighted Average Price (VWAP):
- Python Libraries for Reinforcement Learning and Trading:
- Pandas for Financial Data — Documentation for Pandas, which is essential for handling financial datasets and performing time-series operations on market data.
- Trading Benchmark: