
# hadoop-on-k8s

Kubernetes manifest files for building Hadoop clusters

This repository hosts the Kubernetes manifest files to deploy multi-node Hadoop clusters on Kubernetes. This project uses the Docker images built with the eiswar/hadoop-on-docker project.

## Steps to set up the Hadoop cluster on Kubernetes

1. Create the Hadoop Docker image using the eiswar/hadoop-on-docker project.
2. Check out this repo or download https://raw.githubusercontent.com/eiswar/hadoop-on-k8s/master/hadoop.yml
3. Update the image name in the manifest file (a sketch of the relevant field appears at the end of this section).
4. Set up storage (see "Steps to set up storage" below).
5. Execute the following command to create the Hadoop cluster.

```sh
kubectl create -f https://raw.githubusercontent.com/eiswar/hadoop-on-k8s/master/hadoop.yml
```

or

```sh
kubectl create -f hadoop.yml
```
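For step 3, the image is referenced in the container spec of the master Deployment and the worker DaemonSet inside hadoop.yml. A minimal sketch of the field to change, assuming a placeholder registry and tag (the actual container names are defined in hadoop.yml):

```yaml
# Sketch only: point the container image at the Hadoop image built in step 1.
# The registry and tag are placeholders; the container name comes from hadoop.yml.
spec:
  template:
    spec:
      containers:
        - name: hadoop-master
          image: "<your-registry>/hadoop:latest"
```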

## Steps to set up storage

1. The NameNode deployment uses a PersistentVolumeClaim. Make sure that PersistentVolumes are configured in the Kubernetes cluster; the PVC for NameNode data will be created automatically (a sample PersistentVolume is sketched below).
2. The DataNode deployment uses a hostPath volume. The DataNode and NodeManager are deployed as a Kubernetes DaemonSet, so they run on every node. Every node should therefore have a mount point at /hdfs. If a node has multiple disks, they can be combined into an LVM volume group and the logical volume mounted at /hdfs.
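If the cluster has no dynamic provisioner, a statically provisioned PersistentVolume along these lines could satisfy the NameNode PVC. The capacity, access mode, and path below are assumptions; match them to the PVC spec in hadoop.yml:

```yaml
# Sketch only: a hostPath-backed PersistentVolume the NameNode PVC could bind to.
# Capacity, access mode, and path are assumptions; align them with the PVC in hadoop.yml.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hadoop-namenode-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /hdfs/namenode
```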

## How it works

1. The single Kubernetes manifest file https://raw.githubusercontent.com/eiswar/hadoop-on-k8s/master/hadoop.yml does the following:
   - Creates a service account in Kubernetes for Hadoop operations.
   - Creates a namespace in Kubernetes for Hadoop operations.
   - Creates a Role, ClusterRole, RoleBindings, etc. to grant privileges to the Hadoop service account.
   - Creates a PVC for NameNode data and mounts it in the hadoop-master pod.
   - Creates a Service that points the hostname hadoop-master.${namespace}.svc.cluster.local to the hadoop-master pods (see the Service sketch after this list).
   - Creates a Deployment for running hadoop-master pods.
   - Creates a DaemonSet for running hadoop-worker pods on all the nodes in the cluster (see the DaemonSet sketch after this list).
   - The NameNode data is saved in the PVC, so when the hadoop-master pod fails, a new one is created and the same PVC is mounted to the new pod.
   - The DataNode data is saved in the /hdfs directory on each node.
   - The DataNode DaemonSet uses the host network, so the pods keep the same IP addresses across restarts.
2. Once the hadoop-master pod is created, it automatically creates a Kubernetes Secret for storing and retrieving the authorized keys that let the master connect to the worker pods with passwordless SSH.
3. The hadoop-worker pods pull the authorized keys generated by the master and set up passwordless SSH authentication automatically.
4. Since the host network is used for the Hadoop worker pods, all the worker pods get static hostnames and IP addresses.
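A rough illustration of the Service described above; the namespace, selector labels, and ports are assumptions rather than values taken from hadoop.yml:

```yaml
# Sketch only: a headless Service giving the hadoop-master pod a stable DNS name
# (hadoop-master.<namespace>.svc.cluster.local). Namespace, labels, and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: hadoop-master
  namespace: hadoop
spec:
  clusterIP: None            # headless: DNS resolves straight to the pod IP
  selector:
    app: hadoop-master
  ports:
    - name: namenode-rpc
      port: 9000             # assumed NameNode RPC port; check hadoop.yml / core-site.xml
    - name: ssh
      port: 22
```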
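And a minimal sketch of the worker DaemonSet pattern, combining the host network and the /hdfs hostPath mount. The image name, namespace, and labels are assumptions; hadoop.yml contains the real spec:

```yaml
# Sketch only: the hadoop-worker DaemonSet pattern (host network + /hdfs hostPath).
# Image, namespace, and labels are assumptions; refer to hadoop.yml for the actual spec.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hadoop-worker
  namespace: hadoop
spec:
  selector:
    matchLabels:
      app: hadoop-worker
  template:
    metadata:
      labels:
        app: hadoop-worker
    spec:
      hostNetwork: true                # pods reuse the node's hostname and IP
      containers:
        - name: hadoop-worker
          image: "<your-registry>/hadoop:latest"   # image built from eiswar/hadoop-on-docker
          volumeMounts:
            - name: hdfs-data
              mountPath: /hdfs         # DataNode blocks land on the node's /hdfs mount
      volumes:
        - name: hdfs-data
          hostPath:
            path: /hdfs
            type: Directory
```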

## To do

1. Enable NameNode High Availability
2. Enable HDFS federation