This project focuses on analyzing over 35 million HTTP requests from a Poland-based e-commerce platform to uncover patterns in user behavior and system performance. By leveraging Python and visualizing data through a Streamlit-based UI, the project converts raw server logs into actionable insights.
- Source: Web server logs from a Poland-based e-commerce site.
- Period: December 1, 2019, to May 31, 2020 (183 days).
- Size: 35,157,691 HTTP requests stored in CSV format (~9.1 GB raw, ~10.7 GB processed).
- Features:
- IpId: Anonymized IP with country identifier.
- UserId: Differentiates between regular and admin users.
- TimeStamp: Converted from Windows FileTime to a datetime format.
- HttpMethod: Types of HTTP requests (GET, POST, etc.).
- Uri: Partially redacted resource identifiers.
- HttpVersion, ResponseCode, Bytes, Referrer, UserAgent.
- Datetime Conversion: Timestamps transformed into human-readable datetime format.
- Feature Engineering:
- Extracted
CountryCode
fromIpId
. - Parsed
UserAgent
to extractBrowser
,OS
, andDevice_Type
. - Categorized
Uri
intoURI_Type
andReferrer
intoReferrer_Type
.
- Extracted
- Data Cleaning: Removed irrelevant
UserId
column, handled missing values. - Storage: Final dataset saved in CSV format with enriched features.
- Examined distribution of HTTP methods (GET, POST).
- Analyzed response codes for user experience and security insights.
- Identified key traffic metrics: 519.92 GB total data transfer, HTTP 200 as the most common response.
- Identified peak activity hours (18:00–20:00), with a notable spike at 20:00 (14,402 requests).
- Major traffic from Poland (PL), followed by the US, Netherlands, and Germany.
- Logarithmic scaling used to highlight disparities between major and minor contributors.
- Chrome and Firefox were the dominant browsers.
- Significant traffic from Windows (53%) and Linux (40%), indicating a tech-savvy user base.
- Sessions defined as periods of activity with thresholds (2–120 minutes, 30-minute inactivity cutoff).
- Explored engagement patterns with URI types, referrers, and geographic data.
- Graphs and charts illustrate trends in traffic, user agents, referrers, and more.
- Key insights are displayed in a user-friendly Streamlit interface.
- Clone the repository:
git clone https://github.com/projects506/Big-Data-Analysis-Tool-for-Web-Server-Logs.git
- Install dependencies:
pip install -r requirements.txt
- Run the Streamlit app:
streamlit run app.py
pandas
,numpy
,matplotlib
,seaborn
,streamlit
, and others.
- Deep insights into user behavior, session dynamics, and system performance.
- Actionable recommendations for website optimization, security, and marketing strategies.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.