Hi there, I'm Mrugank 👋
- Overview
- Applications
- Running big data cluster
- Access docker containers
- Access hadoop ecosystem
- Containers running in the cluster
- Access HDFS
- Access Postgres DB
- Access Kadmin
- Run hive queries
- Run PySpark
- Run sqoop
- Run airflow
- Create Kafka Topic
- Create Kafka Producer
- Create Kafka Consumer
- Run flume agents
- Run queries in cassandra
- Available Jars
- Configure hadoop
- Configure hue
- Configure zeppelin
- Configure airflow
- Configure sqoop
- Configure ssh server
- Build Custom Docker Images
- Manage resources
- Dependencies
Many projects already exist for deploying a Spark or Hadoop cluster, but they tend to be either ineffective or so resource-intensive that they freeze the host machine. This repository was created to simplify the deployment of these clusters on a local machine using Docker containers.
This project starts a Docker cluster that gives access to the following frameworks/technologies.
Framework/Technology | Version |
---|---|
Hadoop | 3.2.1 |
Mini-Conda | 4.12.0 |
Python | 3.9 |
Spark | 3.2.2 (built with Scala 2.12) |
Airflow | 2.3.3 |
Zeppelin | 0.10.1 |
Hive | 2.3.2 |
Hue | 4.6.0 |
Sqoop | 1.4.7 |
Confluent Kafka | 5.4.0-ce |
Confluent Schema Registry | 5.4.0-ce |
Confluent Control Center | 5.4.0 |
Kadmin | 0.9.3 |
Flume | 1.11.0 |
Postgres | 15.1 |
pgAdmin4 | 6.18 |
Cassandra | 4.1.0 |
To start a cluster, run the appropriate command below from the project directory.
The basic cluster gives access to Hadoop, PySpark, Airflow, Flume and Zeppelin. It's a great place to start if you want to try out some of these tools without having to install them on your computer.
On Linux:
sudo docker-compose -f basic-hadoop-docker-compose.yaml up
On Windows or macOS:
docker-compose -f basic-hadoop-docker-compose.yaml up
Note: You need to install WSL 2 on windows to run this cluster.
This file starts the basic cluster along with Hive, Sqoop, Hue, Postgres, pgAdmin & Cassandra. This is great for building batch-processing pipelines.
On Linux:
sudo docker-compose -f hive-sqoop-postgres-cassandra-docker-compose.yaml up
On Windows or macOS:
docker-compose -f hive-sqoop-postgres-cassandra-docker-compose.yaml up
Note: You need to install WSL 2 on windows to run this cluster.
This file starts the basic cluster along with ZooKeeper, a Kafka broker & the Schema Registry. This is great for building streaming pipelines.
On Linux:
sudo docker-compose -f kafka-docker-compose.yaml up
On Windows or macOS:
docker-compose -f kafka-docker-compose.yaml up
Note: You need to install WSL 2 on windows to run this cluster.
This file starts only a Kafka cluster. This is great for getting started with Kafka and the Schema Registry.
On Linux:
sudo docker-compose -f kafka-zookeper.yaml up
On Windows or macOS:
docker-compose -f kafka-zookeper.yaml up
Note: You need to install WSL 2 on windows to run this cluster.
This file starts all the frameworks/technologies mentioned here.
On Linux:
sudo docker-compose -f all-docker-compose.yaml up
On Windows or macOS:
docker-compose -f all-docker-compose.yaml up
Note: You need to install WSL 2 on windows to run this cluster.
You can access a container by running this command:
sudo docker exec -it <container-name> /bin/bash
If this does not work, replace /bin/bash with /bin/sh.
The following technologies/frameworks are exposed to the host. You can access them using the URLs below.
Application | Url |
---|---|
Namenode UI | http://localhost:9870/dfshealth.html#tab-overview |
Namenode (IPC port) | http://localhost:9000 |
History server | http://localhost:8188/applicationhistory |
Datanode | http://localhost:9864/ |
Nodemanager | http://localhost:8042/node |
Resource manager | http://localhost:8088/ |
Hue | http://localhost:8888 |
Spark Master UI | http://localhost:8080 |
Spark Slave UI | http://localhost:8081 |
Spark Driver UI | http://localhost:4040 (accessible only after a driver is started) |
Zeppelin UI | http://localhost:8082 |
Airflow UI | http://localhost:3000 |
pgAdmin UI | http://localhost:5000 |
Zookeeper | http://localhost:2181 |
kafka broker | http://localhost:9092 |
Schema Registry | http://localhost:8083 |
Kadmin UI | http://localhost:8084/kadmin/ |
Kafka Control Center | http://localhost:9021 |
Below are containers running in the cluster.
Container Name | Installed Frameworks/Tech | Description |
---|---|---|
namenode | HDFS Namenode, Spark Master Node, Spark Slave Node, Zeppelin, Airflow, Flume & Python | It acts as the namenode & also hosts the other listed frameworks/tech. |
datanode | HDFS Datanode, Python | It acts as a datanode & python is installed. |
resourcemanager | YARN, Python | It acts as the YARN resource manager & Python is installed. The default YARN scheduler is set to CapacityScheduler. |
nodemanager | Nodemanager, Python | It acts as a nodemanager & python is installed. |
historyserver | History Server | Tracking of past jobs is delegated to the History Server. |
hive-server | Hive Server 2, Sqoop | Hive Server 2 runs on this container. You can also create & execute Sqoop jobs from this container. |
hive-metastore | Hive metastore | The Hive metastore service runs here. It is used to store Hive-related metadata. |
hive-metastore-postgresql | Postgres | Used by Hive metastore to store metadata. |
hue | Hue Server | The Hue server runs here. |
huedb | Postgres | Used by Hue server to store metadata. |
external_postgres_db | Postgres | It's a relational DB where you can store your data. |
external_pgadmin | pgAdmin | You can use this to access Postgres server running in external_postgres_db container. |
cassandra | Cassandra | It's a non-relational column-oriented DB where you can store your data. |
zookeeper | Confluent Zookeeper | This container runs Zookeeper, which keeps track of the Kafka broker. |
kafka-broker | Confluent Kafka Broker | This container runs a confluent kafka broker. |
schema-registry | Confluent Schema Registry | This container runs confluent schema registry which you can use to store schemas. |
kadmin | Kadmin | This container runs kadmin. You can use this application to produce & consume messages. It supports AVRO format data. |
control-center | Confluent Control Center | This container runs confluent control center. Control center can be used to consume messages, look at broker status & create schemas in schema-registry. |
You can access HDFS from Hue or Namenode UI.
Note: Hue allows you to access file content on HDFS, however Namenode's UI does not support this.
You can also access HDFS from inside the containers running Hadoop, such as namenode.
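For example, once inside the namenode container you can run standard HDFS shell commands (the paths below are only illustrative):
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put /opt/flume/conf/flume.conf /user/demo/
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/flume.conf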
You can access the Postgres server using pgAdmin and run SQL queries from there.
Postgres credentials:
- user: external
- password: external
- db: external
pgAdmin login:
- email: [email protected]
- password: external
To access this, look at the Access hadoop ecosystem section. You can create producers and consumers on Kadmin. It will ask you for a Kafka Host and Kafka Schema Registry Url before creating a producer or consumer. Set the Kafka Host to kafka-broker:29092 and the Kafka Schema Registry Url to http://schema-registry:8083.
You can run Hive queries using Hue. A Thrift server runs on top of HiveServer2, and Hue has been set up to connect to that Thrift server, which executes your Hive queries.
Spark is installed in namenode. You need to access the namenode container & run the spark-submit command to run Spark jobs.
In this cluster you can run Spark in three modes:
Mode | Command |
---|---|
Local | spark-submit [python-script-path] |
YARN | spark-submit --master yarn --deploy-mode cluster --archives [file_path] --queue default [python-script-path] |
Stand Alone | spark-submit --master spark://namenode:7077 --deploy-mode cluster [python-script-path] |
In this mode, YARN takes care of scheduling the Spark job. If you're using any external packages, such as pandas, you must create a compressed file that contains all of those packages.
Run the command below to create this compressed file:
conda pack -f -o conda_env.tar.gz
The above command creates conda_env.tar.gz in the current directory. You can then pass the absolute path of the file to the --archives flag.
NOTE: You may omit the --archives flag if you're not using any external packages.
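As a quick illustration, below is a minimal PySpark script (a hypothetical example.py, not part of this repository) that could be submitted with any of the modes above:
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the master URL is supplied by spark-submit
spark = SparkSession.builder.appName("word-count-example").getOrCreate()

# Build a tiny in-memory DataFrame and run a simple aggregation
df = spark.createDataFrame([("spark",), ("hadoop",), ("spark",)], ["word"])
df.groupBy("word").count().show()

spark.stop()
In local mode you could then submit it with spark-submit /path/to/example.py.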
For development purposes you can use Zeppelin to run Spark jobs. Select the Spark interpreter while creating a new note in Zeppelin. Don't forget to add %spark.pyspark at the start of the block/cell, and keep your Spark-related code in a single block/cell.
Note: Zeppelin is configured to run Spark jobs in local mode.
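For example, a single block/cell might look like this (the DataFrame contents are only illustrative; Zeppelin's Spark interpreter provides the spark session automatically):
%spark.pyspark
# All Spark-related code stays in this one block/cell
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()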
You can schedule Spark jobs from Airflow. You need the Spark provider to schedule Spark jobs, and it has already been installed. To run Spark jobs on the Spark standalone cluster, set Host to namenode & Port to 7077 while creating a Spark connection in Airflow.
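A minimal DAG using the Spark provider might look like the sketch below; the DAG id, script path and the spark_default connection id are assumptions, not something shipped with this repository:
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical DAG that submits a PySpark script to the standalone cluster.
# "spark_default" is assumed to be the Airflow connection created with
# Host=namenode and Port=7077 as described above.
with DAG(
    dag_id="spark_job_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        conn_id="spark_default",
        application="/root/airflow/dags/example.py",  # hypothetical script path
    )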
Sqoop is installed in the hive-server container. Below is a sample command to import data using Sqoop.
sqoop import --connect jdbc:postgresql://external_postgres_db/external --username external --password external --table <your-table-name> --target-dir <dir-in-hdfs> --m 1
You can set the number of mappers as per your needs; since this is a development environment, I suggest using 1 mapper.
Sqoop is configured to store the DB password when a Sqoop job is created. It stores all the metadata in $HOME/.sqoop.
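For example, you can save the import above as a named Sqoop job and run it later (the job name my_import_job is hypothetical):
sqoop job --create my_import_job -- import --connect jdbc:postgresql://external_postgres_db/external --username external --password external --table <your-table-name> --target-dir <dir-in-hdfs> --m 1
sqoop job --exec my_import_job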
The hive-server container houses an SSH server.
The SSHOperator in Airflow can be used to log in to the hive-server and run a Sqoop job. To log in to hive-server from Airflow, set Host to hive-server, Username to root & Password to the container's password while creating an SSH connection in Airflow.
Below is the command to set the root password inside the container.
passwd root
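Below is a sketch of a DAG that uses the SSHOperator to run a Sqoop command on the hive-server container; the DAG id, connection id and Sqoop job name are hypothetical:
from datetime import datetime
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# Hypothetical DAG that logs into the hive-server container over SSH and
# executes a previously created Sqoop job. "hive_server_ssh" is assumed to be
# the Airflow SSH connection described above (Host=hive-server, Username=root).
with DAG(
    dag_id="sqoop_job_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_sqoop_job = SSHOperator(
        task_id="run_sqoop_job",
        ssh_conn_id="hive_server_ssh",
        command="sqoop job --exec my_import_job",  # hypothetical saved job
    )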
In order to schedule jobs in Airflow, DAGs must be created.
Your DAG scripts must be kept in the /dags directory inside the project directory. Your scripts will automatically sync with the container, and you will be able to see your DAGs in Airflow's user interface.
By default, the SequentialExecutor is used to schedule jobs. This executor can only execute one task at a time. If you want to schedule multiple jobs in parallel, you will have to configure Airflow to use another executor, such as the LocalExecutor.
The SequentialExecutor uses SQLite to store metadata. If you want to use another executor such as the LocalExecutor, you might need to configure Airflow to use a different DB such as Postgres. You can make use of the Postgres server that comes with this cluster.
Below are the default credentials for the Airflow UI.
name | value |
---|---|
username | admin |
password | admin |
firstname | admin |
lastname | admin |
role | Admin |
[email protected] |
You can create a Kafka topic either from the CLI or using the Kafka Control Center. The maximum replication factor you can set is 1, as this cluster runs only one Kafka broker.
To create a Kafka topic via the CLI, access the kafka-broker container and run the following command.
kafka-topics --bootstrap-server kafka-broker:29092 --create --topic <topic-name> --partitions <int> --replication-factor <int>
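For example (the topic name, partition count and replication factor are only illustrative):
kafka-topics --bootstrap-server kafka-broker:29092 --create --topic test-topic --partitions 3 --replication-factor 1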
This is a UI provided by Confluent. To access it, look at the Access hadoop ecosystem section.
You can create a Kafka producer either from the CLI or using Kadmin.
To create a simple Kafka producer via the CLI, access the kafka-broker container and run the following command.
kafka-console-producer --broker-list kafka-broker:29092 --topic <topic-name> --producer-property <key>=<value>
Note: The above command creates a producer that sends data without a key.
To create an Avro producer via the CLI, access the schema-registry container and run the following command.
kafka-avro-console-producer \
--broker-list kafka-broker:29092 \
--topic <topic-name> \
--property schema.registry.url=http://schema-registry:8083 \
--property value.schema='<avro-schema>'
Note: This producer will not create a schema in the schema registry; it directly uses the schema you provide in the above command. The above command accepts the schema for the message value.
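For illustration, here is the same command with a hypothetical topic and schema filled in, followed by one JSON message typed at the producer prompt:
kafka-avro-console-producer \
--broker-list kafka-broker:29092 \
--topic users \
--property schema.registry.url=http://schema-registry:8083 \
--property value.schema='{"type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]}'
{"name": "alice", "age": 30}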
You can create a simple producer or avro producer using Kadmin.
You can create a Kafka consumer either from the CLI or using Kadmin.
To create a simple Kafka consumer via the CLI, access the kafka-broker container and run the following command.
kafka-console-consumer --bootstrap-server kafka-broker:29092 --topic <topic-name> --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true
Note: The above command creates a consumer which reads from the beginning.
To create an Avro consumer via the CLI, access the schema-registry container and run the following command.
kafka-avro-console-consumer \
--bootstrap-server kafka-broker:29092 \
--topic <topic-name> \
--from-beginning \
--property schema.registry.url=http://schema-registry:8083
Note: The above command creates a consumer which reads from the beginning.
You can create a simple consumer or avro consumer using Kadmin.
Flume is installed in namenode. To configure Flume agents, edit the flume_config/flume.conf file in the project directory. To start a Flume agent, access namenode & run the command below.
flume-ng agent --conf conf --conf-file /opt/flume/conf/flume.conf --name <agent-name> -Dflume.root.logger=INFO,console
The above command runs Flume in the console. You can also create a Flume service & manage it as a service instead.
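As a reference, a minimal agent definition in flume_config/flume.conf might look like the sketch below; the agent name a1 and the netcat source/logger sink are purely illustrative, not the repository's shipped config:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
You would then start this agent by passing --name a1 to the flume-ng command above.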
Access the cassandra container & run the command below to start executing CQL queries.
cqlsh
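Once inside cqlsh you can run CQL statements. The keyspace and table below are only illustrative:
CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE IF NOT EXISTS users (id int PRIMARY KEY, name text);
INSERT INTO users (id, name) VALUES (1, 'alice');
SELECT * FROM users;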
Here is a list of the extra jar files available for use with Spark. These jar files are located in /opt/spark/jars/ in the namenode container.
Group Id | Artifact Id | File Name | Version |
---|---|---|---|
com.sun.jersey | jersey-client | jersey-client-1.19.4.jar | 1.19.4 |
com.sun.jersey | jersey-bundle | jersey-bundle-1.19.4.jar | 1.19.4 |
com.datastax.spark | spark-cassandra-connector-assembly_2.12 | spark-cassandra-connector-assembly_2.12-3.2.0.jar | 3.2.0 |
mysql | mysql-connector-java | mysql-connector-java-5.1.49.jar | 5.1.49 |
org.apache.spark | spark-sql-kafka-0-10_2.12 | spark-sql-kafka-0-10_2.12-3.2.1.jar | 3.2.1 |
org.apache.kafka | kafka-clients | kafka-clients-3.2.2.jar | 3.2.2 |
org.apache.spark | spark-avro_2.12 | spark-avro_2.12-3.2.2.jar | 3.2.2 |
org.apache.spark | spark-token-provider-kafka-0-10_2.12 | spark-token-provider-kafka-0-10_2.12-3.2.2.jar | 3.2.2 |
org.apache.commons | commons-pool2 | commons-pool2-2.11.1.jar | 2.11.1 |
org.apache.avro | avro-mapred | avro-mapred-1.11.1.jar | 1.11.1 |
org.postgresql | postgresql | postgresql-42.5.0.jar | 42.5.0 |
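For example, the spark-sql-kafka-0-10 and kafka-clients jars above let PySpark read from the Kafka broker in this cluster. Below is a rough sketch (the topic name is hypothetical) that reads a topic as a batch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-read-example").getOrCreate()

# Read a batch of records from a Kafka topic using the bundled
# spark-sql-kafka-0-10 connector; "test-topic" is a hypothetical topic.
df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:29092")
    .option("subscribe", "test-topic")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka keys/values arrive as binary, so cast them to strings before showing.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()

spark.stop()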
You can configure resource allocation for each container by modifying hadoop.env in the project directory.
- YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb: This determines the maximum memory allocation for the YARN scheduler. By default it is set to 8GB. You may increase it if you have more resources to spare.
- YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores: This determines the maximum number of cores allocated to the YARN scheduler. By default it is set to 4 cores. You may increase it if you have more resources to spare.
- YARN_CONF_yarn_nodemanager_resource_memory___mb: This determines the maximum memory allocation for the nodemanager. By default it is set to 16GB. I suggest not decreasing it; otherwise, map-reduce jobs might get stuck due to unavailability of resources.
- YARN_CONF_yarn_nodemanager_resource_cpu___vcores: This determines the maximum number of cores allocated to the nodemanager. By default it is set to 8 cores. I suggest not decreasing it.
- MAPRED_CONF_mapreduce_map_memory_mb: This determines the maximum memory allocation for a mapper. By default it is set to 4GB. You may increase it if you have more resources to spare.
- MAPRED_CONF_mapreduce_reduce_memory_mb: This determines the maximum memory allocation for a reducer. By default it is set to 8GB. You may increase it if you have more resources to spare.
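For reference, these settings live in hadoop.env as plain KEY=VALUE lines. The snippet below is a sketch that assumes the memory values are expressed in MB:
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
MAPRED_CONF_mapreduce_map_memory_mb=4096
MAPRED_CONF_mapreduce_reduce_memory_mb=8192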
You can configure the Hue server by modifying hue-overrides.ini in the project directory.
Hue is configured to use the Postgres instance running in the huedb container. Below is the default config.
engine=postgresql_psycopg2
host=huedb
port=5432
user=hue
password=hue
name=hue
Hue connects to the hive-server container using port 10000. Below is the config for the same.
[beeswax]
# Host where HiveServer2 is running.
# If Kerberos security is enabled, use fully-qualified domain name (FQDN).
hive_server_host=hive-server
# Port where HiveServer2 Thrift server runs on.
hive_server_port=10000
thrift_version=7
You can configure Zeppelin by modifying the configs/zeppelin-env.sh & configs/zeppelin-site.xml files.
Below are the variables set in zeppelin-env.sh, which are used by Zeppelin.
Variable | Value |
---|---|
JAVA_HOME | /usr/lib/jvm/java-8-openjdk-amd64/ |
SPARK_MASTER | spark://namenode:7077 |
ZEPPELIN_PORT | 8082 |
SPARK_HOME | /opt/spark |
HADOOP_CONF_DIR | /etc/hadoop |
PYSPARK_PYTHON | /root/anaconda/bin/python3.9 |
PYTHONPATH | /root/anaconda/bin/python3.9 |
Java properties can be configured from zeppelin-site.xml. I've modified its default config to expose the server externally. Below is the modified config.
<property>
<name>zeppelin.server.addr</name>
<value>0.0.0.0</value>
<description>Server binding address</description>
</property>
Follow the steps below:
- Click on profile in the top right corner of the screen.
- Click on Interpreter.
- Scroll down until you see spark & click on edit.
You can configure Airflow by modifying configs/namenode_airflow.cfg in the project directory.
Below are the variables that I've configured. You can leave the other variables as they are.
Variable | Value |
---|---|
dags_folder | /root/airflow/dags |
sql_alchemy_conn | sqlite:////root/airflow/airflow.db |
executor | SequentialExecutor |
You can use other executors such as the LocalExecutor. To use this executor you need to create a DB instance on a DB server; you can use the external_postgres_db container for this. Once the DB is created, set sql_alchemy_conn to postgresql+psycopg2://<user>:<password>@<IP-or-container-name>:5432/<db-name> & executor to LocalExecutor.
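For example, assuming you created a database named airflow_db on the external_postgres_db server (the database name here is hypothetical), the two settings would look like this:
sql_alchemy_conn = postgresql+psycopg2://external:external@external_postgres_db:5432/airflow_db
executor = LocalExecutor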
After configuring the variables, you need to initialize a new DB instance & restart the Airflow server & scheduler.
Log in to namenode (see the Access docker containers section) & run the command below to initialize the DB.
airflow db init && \
airflow users create \
--username admin \
--password admin \
--firstname admin \
--lastname admin \
--role Admin \
--email [email protected]
Run the below command to restart airflow server & scheduler.
rm -rf ~/airflow/airflow-scheduler.pid && \
rm -rf ~/airflow/airflow-webserver-monitor.pid && \
rm -rf ~/airflow/airflow-webserver.pid && \
airflow webserver -p 3000 -D --workers 1 && \
airflow scheduler -D
You don't need to run this again the next time you start your cluster.
You can configure Sqoop by modifying configs/hive_server/sqoop-site.xml in the project directory.
I've modified its default config so that it can save the DB server password in the metastore. Below is the modified config.
<property>
<name>sqoop.metastore.client.record.password</name>
<value>true</value>
<description>If true, allow saved passwords in the metastore.
</description>
</property>
An SSH server runs on the hive-server container. You can configure it by modifying configs/hive_server/sshd_config.conf in the project directory. I've configured it to allow password authentication & root login.
You can customize the images used in these Docker Compose files. The docker_image_conf directory contains the Dockerfiles used to build the images.
I recommend using a workstation with at least 4 hyper-threaded cores & 8GB RAM to run this cluster. Additionally, I suggest not running other programs at the same time, since doing so could lead to resource shortages or conflicts on certain ports.
- If you aren't using Hive, I advise starting the basic cluster.
- You can start the cluster without the external_pgadmin container if you do not need to execute SQL queries on Postgres.
- If you don't need Cassandra, launch the cluster without it.
- When you run your PySpark code in Zeppelin, it starts a Spark driver, which consumes a lot of resources. I recommend running spark.stop() & sc.stop() in the next block in Zeppelin; this will stop the Spark driver.
- If it still hogs resources, you can log in to namenode & run zeppelin-daemon.sh stop. This will stop Zeppelin.
- Docker:
  - Install Docker on Ubuntu: https://docs.docker.com/engine/install/ubuntu/
  - Install Docker on Windows: https://docs.docker.com/desktop/install/windows-install/