An API server that implements a web crawler. It uses Apache Tika to extract text from crawled documents (HTML, PDF, etc.), stores the extracted text in Redis as JSON, and indexes it with RediSearch.
- Implements a simple web crawler (cheerio-based)
- Uses an Apache Tika server for MIME-type detection and text extraction
- Uses RedisJSON for document storage and RediSearch for indexing
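
As a rough illustration of the crawling step, the sketch below shows one way a crawler could keep links on the requested host and reduce them to host-plus-path identifiers like those in the search results. The function name `inScopeDocIds` and the normalization rules are assumptions made for illustration, not code from this repo; in a cheerio-based crawler the `hrefs` would typically come from the anchors of a parsed page.

```javascript
// Hypothetical helper: turn raw href values into in-scope document IDs.
// "fqdn" mirrors the field sent in the crawl request; the rest is assumed.
function inScopeDocIds(hrefs, pageUrl, fqdn) {
  const ids = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolve relative links against the page
    } catch {
      continue; // skip malformed hrefs
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    if (url.hostname !== fqdn) continue; // stay on the requested host
    const path = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
    ids.add(url.hostname + path); // query string and fragment are ignored
  }
  return [...ids];
}
```

Using a `Set` removes duplicates that differ only by fragment or trailing slash, so each page would be queued once.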
- Docker
- Node.js
- npm
- Apache Tika
- Redis with the RediSearch and RedisJSON modules
- Clone this repo.
- Go to the doc-crawler folder.
cd doc-crawler
- Install Node.js requirements
npm install
- Build and start docker containers
docker compose up
- Run the tests
npm run test
#app status
curl -X GET http://localhost:8000
{"status":"app running"}
#start a crawl task
curl -X POST http://localhost:8000/crawl \
-H 'Content-Type: application/json' \
-d '{"fqdn":"developer.redis.com"}'
{"taskID":"ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273"}
#check status on a crawl task
curl -X GET http://localhost:8000/status/tasks/ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273
{"status":"active"}
curl -X GET http://localhost:8000/status/tasks/ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273
{"indexed":159,"errors":0,"time":56.81,"status":"complete"}
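
The start-then-poll flow above could be scripted from Node.js roughly as follows. This is a minimal sketch that assumes Node.js 18+ for the global `fetch` and that any status other than `"active"` means the task has finished. The function name `crawlAndWait`, the polling interval, and the poll limit are hypothetical choices, not part of this API.

```javascript
// Sketch: start a crawl, then poll its status until it is no longer "active".
// fetchFn and sleepFn are injectable so the flow can be tried without a server.
async function crawlAndWait(fqdn, opts = {}) {
  const {
    baseUrl = "http://localhost:8000",
    fetchFn = globalThis.fetch, // global fetch, Node.js 18+
    intervalMs = 2000,          // assumed polling interval
    maxPolls = 300,             // assumed upper bound on status checks
    sleepFn = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = opts;

  const startRes = await fetchFn(`${baseUrl}/crawl`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ fqdn }),
  });
  if (!startRes.ok) throw new Error(`crawl request failed: HTTP ${startRes.status}`);
  const { taskID } = await startRes.json();

  for (let i = 0; i < maxPolls; i++) {
    const res = await fetchFn(`${baseUrl}/status/tasks/${taskID}`);
    if (!res.ok) throw new Error(`status request failed: HTTP ${res.status}`);
    const status = await res.json();
    if (status.status !== "active") return { taskID, ...status };
    await sleepFn(intervalMs);
  }
  throw new Error(`task ${taskID} still active after ${maxPolls} polls`);
}
```

A call such as `crawlAndWait("developer.redis.com")` would resolve with the final status object plus the task ID once the crawl is done.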
#document search
curl -X PUT http://localhost:8000/search \
-H 'Content-Type: application/json' \
-d '{"term":"Node.js"}'
{"docs":["developer.redis.com/develop/node","developer.redis.com/develop/node/node-crash-course","developer.redis.com/develop/java/redis-and-spring-course/lesson_8","developer.redis.com/develop/node/nodecrashcourse/runningtheapplication","developer.redis.com/develop/node/nodecrashcourse/welcome","developer.redis.com/develop/node/nodecrashcourse/coursewrapup","developer.redis.com/develop/node/nodecrashcourse/redisbloom","developer.redis.com/develop/node/nodecrashcourse/sessionstorage","developer.redis.com/develop/node/nodecrashcourse/checkinswithstreams","developer.redis.com/develop/node/nodecrashcourse/redisearch"]}
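
The same search request can be issued from Node.js. The sketch below mirrors the curl example (a `PUT` to `/search` with a JSON body containing `term`) and assumes Node.js 18+ for the global `fetch`. The function name `searchDocs` and its options are hypothetical.

```javascript
// Sketch: send a search term and return the list of matching document IDs.
async function searchDocs(term, opts = {}) {
  const {
    baseUrl = "http://localhost:8000",
    fetchFn = globalThis.fetch, // injectable for testing without a server
  } = opts;

  const res = await fetchFn(`${baseUrl}/search`, {
    method: "PUT", // same method as the curl example
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ term }),
  });
  if (!res.ok) throw new Error(`search failed: HTTP ${res.status}`);

  const { docs = [] } = await res.json(); // fall back to an empty list
  return docs;
}
```

For example, `searchDocs("Node.js").then(console.log)` would print the array found under `docs` in the response.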