Build a web crawler that generates a site map
- Java 8
- sbt
- npm
sbt "run https://www.thoughtworks.com"
sbt test
sbt assembly
java -jar ./target/scala-2.12/crawler.jar https://www.thoughtworks.com
- Execute the project
- Install local-web-server with
npm install -g local-web-server
- Run local server with
ws
- Navigate to
http://localhost:8000
Crawling is inherently concurrent, so the crawler is written in a functional style on top of an actor system implemented with Akka.
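The actor approach can be sketched roughly as follows. This is a minimal illustration with Akka classic actors, not the project's actual code: the message types `Crawl` and `Crawled`, the `Worker` and `Coordinator` actors, and the `fetchLinks` function are all hypothetical names, and `fetchLinks` stands in for the real HTTP download and link extraction.

```scala
import akka.actor.{Actor, Props}
import scala.concurrent.Promise

// Hypothetical messages; the project's real protocol may differ.
final case class Crawl(url: String)
final case class Crawled(url: String, links: Set[String])

// Fetches a single page, reports its links to the sender, then stops.
// Assumption: fetchLinks does not throw; a real worker would need supervision.
class Worker(fetchLinks: String => Set[String]) extends Actor {
  def receive: Receive = {
    case Crawl(url) =>
      sender() ! Crawled(url, fetchLinks(url))
      context.stop(self)
  }
}

object Worker {
  def props(fetchLinks: String => Set[String]): Props = Props(new Worker(fetchLinks))
}

// Owns all crawl state. An actor handles one message at a time,
// so the immutable sets and maps below need no locking.
class Coordinator(fetchLinks: String => Set[String],
                  done: Promise[Map[String, Set[String]]]) extends Actor {

  def receive: Receive = crawling(pending = Set.empty, pages = Map.empty)

  private def crawling(pending: Set[String], pages: Map[String, Set[String]]): Receive = {
    // Start crawling a URL that has not been seen yet.
    case Crawl(url) if !pending(url) && !pages.contains(url) =>
      context.actorOf(Worker.props(fetchLinks)) ! Crawl(url)
      context.become(crawling(pending + url, pages))

    // Record a finished page and schedule every link not seen before.
    case Crawled(url, links) =>
      val updated = pages + (url -> links)
      val stillPending = pending - url
      val fresh = links.filterNot(l => updated.contains(l) || stillPending(l))
      fresh.foreach(l => context.actorOf(Worker.props(fetchLinks)) ! Crawl(l))
      val nextPending = stillPending ++ fresh
      if (nextPending.isEmpty) {
        done.success(updated) // site map complete: page -> outgoing links
        context.stop(self)
      } else context.become(crawling(nextPending, updated))
  }
}

object Coordinator {
  def props(fetchLinks: String => Set[String],
            done: Promise[Map[String, Set[String]]]): Props =
    Props(new Coordinator(fetchLinks, done))
}
```

To try it, create an `ActorSystem`, start a `Coordinator` with a `Promise`, send it `Crawl(rootUrl)`, and wait on the promise's future. State is passed through `context.become` rather than mutable fields, which keeps the coordinator in the functional style mentioned above.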
- URL normalization has not been taken into account, which means that URLs such as http://www.thoughtworks.com, https://www.thoughtworks.com and https://www.thoughtworks.com:80 will be treated as different URLs
- It is OK to filter out all URLs that contain a '#' or ':'
- Input parameters are not validated
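For reference, the first two points could be handled along these lines. This is only a sketch built on `java.net.URI`; `UrlNormalizer`, `normalize` and `shouldSkip` are hypothetical names that do not exist in the project. It lowercases the scheme and host, drops an explicit default port and the fragment, and uses `/` for an empty path. It does not merge `http` with `https`, and it keeps a non-default port such as `:80` on an `https` URL, so deciding whether the three URLs above are "the same" would still need an extra policy. The filter assumes that the '#' or ':' check applies to the part of the link after a leading `http://` or `https://`, which the README does not spell out.

```scala
import java.net.URI
import scala.util.Try

object UrlNormalizer {
  // An explicit port equal to the scheme's default is redundant and is dropped.
  private val defaultPorts = Map("http" -> 80, "https" -> 443)

  // Returns a canonical string for absolute URLs, None for anything
  // that does not parse or lacks a scheme or host.
  def normalize(raw: String): Option[String] =
    Try(new URI(raw.trim).normalize()).toOption.flatMap { uri =>
      for {
        scheme <- Option(uri.getScheme).map(_.toLowerCase)
        host   <- Option(uri.getHost).map(_.toLowerCase)
      } yield {
        val isDefault = defaultPorts.get(scheme).contains(uri.getPort)
        val port  = if (uri.getPort == -1 || isDefault) "" else s":${uri.getPort}"
        val path  = Option(uri.getRawPath).filter(_.nonEmpty).getOrElse("/")
        val query = Option(uri.getRawQuery).map("?" + _).getOrElse("")
        s"$scheme://$host$port$path$query" // the fragment is intentionally dropped
      }
    }

  // Assumption: a leading http:// or https:// is ignored before looking for '#' or ':'.
  def shouldSkip(link: String): Boolean = {
    val rest = link.replaceFirst("(?i)^https?://", "")
    rest.contains('#') || rest.contains(':')
  }
}
```

With this filter, links such as `mailto:` addresses and in-page anchors are skipped, but so is any URL with an explicit port.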