- Run `bash build.sh`. This will:
  - Remove previous Docker images
  - Remove old volumes
  - Build a new image
  - Run the docker-compose service, which runs the testing setup
- The relevant output DataFrames will be created in the `./data` folder
- There is a `spark-master` container with two `spark-worker` containers, plus a `spark-submit` container that submits the job.
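A minimal sketch of how the submitted job would reach the master, assuming the compose service name `spark-master` and Spark's default master port 7077 (the app name here is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("weblog-sessionization")     # hypothetical app name
    .master("spark://spark-master:7077")  # compose service name + default master port
    .getOrCreate()
)
```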
- Not done here, but the docker-compose setup can be scaled to any number of worker instances if static IPs are assigned carefully in the compose file.
- I treat each `ip:port` pair as a single user (users behind NAT share the same IP but use different port numbers).
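A minimal PySpark sketch of this assumption; the `ip`, `port`, and `ts` column names and the sample rows are hypothetical, not taken from the actual job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("user-id-sketch").getOrCreate()

# Hypothetical parsed log rows: (ip, port, epoch-second timestamp).
df = spark.createDataFrame(
    [
        ("10.0.0.1", 51234, 1000),
        ("10.0.0.1", 51299, 2000),
        ("10.0.0.2", 40001, 2100),
    ],
    ["ip", "port", "ts"],
)

# One user = one ip:port pair; NAT'd users share an IP but not a port.
df = df.withColumn(
    "user_id", F.concat_ws(":", F.col("ip"), F.col("port").cast("string"))
)
```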
- First, create a new column stating whether the current and previous HTTP request times of a particular user belong to the same session.
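Continuing the sketch above, one way to build that column: a lag window fetches each user's previous request time, and the flag is flipped into a session-boundary marker (1 = new session starts here) so the cumulative sum in the next step works directly. The 15-minute inactivity threshold is an assumption, not necessarily the value the job uses:

```python
from pyspark.sql import Window, functions as F

SESSION_GAP_S = 15 * 60  # assumed inactivity threshold, in seconds

# Previous request time of the same user, in request-time order.
w = Window.partitionBy("user_id").orderBy("ts")
df = df.withColumn("prev_ts", F.lag("ts").over(w))

# 1 when this request starts a new session (no previous request, or the
# gap to it exceeds the threshold); 0 when it continues the same session.
df = df.withColumn(
    "new_session",
    F.when(
        F.col("prev_ts").isNull()
        | ((F.col("ts") - F.col("prev_ts")) > SESSION_GAP_S),
        F.lit(1),
    ).otherwise(F.lit(0)),
)
```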
- Then take a cumulative sum; this automatically assigns a different number to rows belonging to different sessions.
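Continuing the sketch, the running total of session starts per user is exactly that session number, so all rows of one session share the same `session_id`:

```python
df = df.withColumn(
    "session_id",
    F.sum("new_session").over(
        Window.partitionBy("user_id")
        .orderBy("ts")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    ),
)
```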
- Then partition the data by session and run the various queries.
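As an illustration, one such per-session query (the actual queries in the job may differ):

```python
# Illustrative aggregation: duration and request count per user session.
session_stats = df.groupBy("user_id", "session_id").agg(
    (F.max("ts") - F.min("ts")).alias("duration_s"),
    F.count("*").alias("n_requests"),
)
```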
- Lastly, save the rough output to the `data` directory.
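A sketch of the final write; the CSV format, header option, and subdirectory name are assumptions:

```python
# Write under ./data, presumably the folder mounted by docker-compose.
(
    session_stats.write.mode("overwrite")
    .option("header", True)
    .csv("data/session_stats")
)
```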
- The job runs, but there are still a few problems with standalone mode:
  - A few errors appear, such as an rsync error because the Spark host name cannot be found, but they do not affect the output.
  - Some warnings about Spark jars not being found appear, but they do not affect the output either.
  - When running in Docker, Spark standalone cluster mode fails to relay tasks to the worker nodes, while client mode can connect to both workers and complete the job successfully.