Example of a Web Crawler with Redis Indexing

Summary

An API server that implements a web crawler. Apache Tika extracts the text from crawled documents (HTML, PDF, etc.). The extracted text is then stored in Redis as JSON and indexed with RediSearch.
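The storage and indexing step described above comes down to two Redis commands, sketched here as plain argument arrays. This is a minimal illustration, not the repository's actual code: the index name `docIdx`, the key prefix `doc:`, and the JSON field `text` are assumptions made for the example.

```javascript
// Hypothetical names: the real project may use a different index name,
// key prefix, and JSON field layout.
const INDEX_NAME = 'docIdx';
const KEY_PREFIX = 'doc:';

// Build the FT.CREATE command that indexes the `text` field of every
// JSON document stored under KEY_PREFIX.
function buildCreateIndexCommand() {
  return [
    'FT.CREATE', INDEX_NAME,
    'ON', 'JSON',
    'PREFIX', '1', KEY_PREFIX,
    'SCHEMA', '$.text', 'AS', 'text', 'TEXT',
  ];
}

// Build the JSON.SET command that stores the extracted text of one
// crawled document, keyed by its URL.
function buildStoreCommand(url, text) {
  return ['JSON.SET', KEY_PREFIX + url, '$', JSON.stringify({ text })];
}

console.log(buildCreateIndexCommand().join(' '));
// prints: FT.CREATE docIdx ON JSON PREFIX 1 doc: SCHEMA $.text AS text TEXT
```

Arrays like these could be passed to a Redis client's raw-command method, for example `client.sendCommand(args)` in node-redis. Building them as data keeps the sketch runnable without a Redis connection.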

Architecture

High Level

Detailed

Application Flow

Features

Implements a simple web crawler (cheerio-based)
Uses an Apache Tika server for MIME type detection and text extraction
Uses RedisJSON for document storage and RediSearch for indexing
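One step any crawler of this kind needs is link filtering: resolving each `href` against the page URL and keeping only links on the host being crawled. The sketch below illustrates that step under stated assumptions. In a cheerio-based crawler the `href` values would come from the parsed HTML; here they are passed in as a plain array so the example depends only on Node's built-in `URL` class. The function name and the filtering rules are hypothetical, not taken from the repository.

```javascript
// Hypothetical helper: resolve raw href values found on a page and keep
// only same-host http(s) links, deduplicated and without fragments.
function sameHostLinks(pageUrl, hrefs, fqdn) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative links against the page
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    if (url.hostname !== fqdn) continue; // stay on the target host
    url.hash = ''; // "#section" points at the same document
    seen.add(url.href);
  }
  return [...seen];
}

const links = sameHostLinks(
  'https://developer.redis.com/develop/',
  ['node', '/develop/java#top', 'https://example.com/x', 'mailto:a@b.c', 'node'],
  'developer.redis.com'
);
console.log(links);
```

Deduplicating with a `Set` keeps the crawl queue from revisiting a page that is linked several times from the same document.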

Prerequisites

Docker
Node.js
npm
Apache Tika
Redis with the RediSearch and RedisJSON modules

Installation

Clone this repo.
Go to the doc-crawler folder.

cd doc-crawler

Install the Node.js dependencies

npm install

Build and start the Docker containers

docker compose up

Usage

Test Client

npm run test

CURL

#app status
curl -X GET http://localhost:8000

{"status":"app running"}

#start a crawl task
curl -X POST http://localhost:8000/crawl \
-H 'Content-Type: application/json' \
-d '{"fqdn":"developer.redis.com"}'

{"taskID":"ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273"}

#check status on a crawl task
curl -X GET http://localhost:8000/status/tasks/ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273

{"status":"active"}

#check status again once the crawl has finished
curl -X GET http://localhost:8000/status/tasks/ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273

{"indexed":159,"errors":0,"time":56.81,"status":"complete"}

#document search
curl -X PUT http://localhost:8000/search \
-H 'Content-Type: application/json' \
-d '{"term":"Node.js"}'

{"docs":["developer.redis.com/develop/node","developer.redis.com/develop/node/node-crash-course","developer.redis.com/develop/java/redis-and-spring-course/lesson_8","developer.redis.com/develop/node/nodecrashcourse/runningtheapplication","developer.redis.com/develop/node/nodecrashcourse/welcome","developer.redis.com/develop/node/nodecrashcourse/coursewrapup","developer.redis.com/develop/node/nodecrashcourse/redisbloom","developer.redis.com/develop/node/nodecrashcourse/sessionstorage","developer.redis.com/develop/node/nodecrashcourse/checkinswithstreams","developer.redis.com/develop/node/nodecrashcourse/redisearch"]}
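The crawl and status calls above can also be driven from Node.js. The following is a minimal client sketch, assuming the endpoints and response shapes shown in the curl examples. The function name `crawlAndWait`, the polling interval, and the rule that any status other than "active" ends the wait are assumptions for illustration. `fetchImpl` is injectable so the logic can be exercised without a running server.

```javascript
// Hypothetical client helper for the /crawl and /status/tasks endpoints.
// By default it uses the global fetch available in Node.js 18+.
async function crawlAndWait(baseUrl, fqdn, { fetchImpl = fetch, intervalMs = 1000 } = {}) {
  // Start a crawl task for the given host.
  const startRes = await fetchImpl(`${baseUrl}/crawl`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ fqdn }),
  });
  const { taskID } = await startRes.json();

  // Poll the task status until it is no longer "active".
  // Assumption: every non-"active" status is terminal.
  for (;;) {
    const statusRes = await fetchImpl(`${baseUrl}/status/tasks/${taskID}`);
    const status = await statusRes.json();
    if (status.status !== 'active') return { taskID, ...status };
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With the containers running, `crawlAndWait('http://localhost:8000', 'developer.redis.com')` would resolve to the final status object together with its `taskID`.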
