
Demystifying Docker and Orchestration using Kubernetes

Miracle Ogunlade

Docker has no doubt earned all the hype. It is every development team's dream come true: like being able to eat loads of pie without getting fat. To leverage Docker efficiently, you need to understand what it is, how it fits in, and how to get the best out of it, which is what I plan to do here, of course after eating my pie.

Let’s get started, shall we?

Precursor to Containerization

Application deployment used to be painful, with widespread ramifications: time spent installing dependencies, compiling libraries, and chasing applications that did not behave the same way they did in the development environment. It often required a dedicated team of engineers called operations, and any bug or error detected along the way meant restarting the deployment. This was made a little easier with infrastructure as code, a paradigm I love.

Virtual machines created a way for applications to run in isolation, each having its own set of software resources, system binaries, and even an entire virtual hardware stack, including CPUs, memory, hard drives, and other devices mapped to the real hardware on the physical machine. Multiple virtual machines can run on a host machine, managed by a hypervisor, providing a way to test and run applications on systems they were not intended for.


The Rise of Containerization

Containerization provides a way to leverage the isolation of virtual machines without the overhead. This method is implemented by engines like Docker, Solaris Zones, BSD Jails, and LXC.

Containerization is a virtualization method where applications run isolated from other applications and the host OS. These isolated applications are called containers. 

Containers do not run separate operating systems but instead share the host OS kernel, which makes them lightweight; this is the major difference compared with virtual machines. All the operating-system-level architecture is shared across containers, with the bins and libs being the only parts created from scratch. Containerization works by bundling together all the configuration files, libraries, and dependencies an application requires, ensuring consistent behavior across environments.

Distributed applications and microservices architectures have benefited greatly from containerization because each container operates independently, preventing tight interdependencies and encouraging continuous integration and delivery.


Why Docker?

Maybe because we just love the Docker whale :). Docker essentially popularized containerization. It creates containers on the host operating system using Linux kernel features such as namespaces (which provide process isolation) and control groups, together with its networking and volume capabilities.

So you may ask, what's the hype about Docker? Well, to answer that, Docker is popular for the following reasons:
  • Open-Source: Docker is open source, so you need not break the bank.
  • Ease of use: Docker takes advantage of containers to quickly build and test portable applications. Applications can be packaged in the development environment, and the same package can then be deployed/shipped.
  • Speed: Since Docker containers do not have a dedicated OS but rather run on the host kernel, they are very lightweight and fast. A Docker container can be created in seconds.
  • Docker Image Registry: This is a way to efficiently store and share Docker applications (images). Docker Hub is the most popular image registry. It has tens of thousands of open-source images created by the community.
  • Modularity and Scalability: Docker makes it easy to break an application's functionality into individual containers and scale them as needed.

Understanding Docker

To fully understand Docker, the following concepts are important. Let's go through the fundamental parts of Docker piece by piece:

  • Docker Engine: This is the lightweight core of Docker and the layer on which Docker runs. It manages containers, images, builds, and so on. It is generally made up of the Docker daemon and the Docker client, which communicates with the daemon to execute commands.
  • Docker Client: This is responsible for communicating user instructions to the Docker daemon, which in turn executes them. The Docker client isn't necessarily required to run on the host machine; it can run on a different machine and communicate with the Docker daemon that's running on the host machine.
  • Docker Daemon: This is responsible for actually executing the commands sent by the Docker client, like building and running containers. The Docker daemon runs on the host machine, and the user never communicates with it directly.
  • Docker Image: Docker images are read-only files comprised of multiple layers, used to execute code in a Docker container. Images are built from sets of build instructions and consist of the packaged application and its dependencies.
  • Dockerfile: The Dockerfile is where the set of instructions required to build an image is written; a Docker image is built from a Dockerfile. Each instruction in the Dockerfile adds a new layer to the image. Layers are at the center of Docker's lightweight yet powerful structure.
  • Volumes: Volumes store a container's data and are initialized during container creation. They make it possible to persist and share a container's data: even when a container is destroyed, updated, or rebuilt, its data volumes remain untouched. A volume can also be updated independently of the container, and volumes can be shared and reused among multiple containers (a short example follows this list).
  • Union File Systems: This stands at the core of building a Docker image. A union file system can be thought of as a stackable file system: files and directories of separate file systems (known as branches) are transparently overlaid to form a single file system. The contents of directories that have the same path within the overlaid branches are seen as a single merged directory, avoiding the need to create separate copies of each layer. When a layer needs to be modified, Docker creates and modifies a local copy, leaving the original unchanged. Layered systems offer two main benefits:
  • Duplication-free: Layers help to avoid duplicating a complete set of files every time you use an image to create and run a new container, making instantiation of docker containers very fast and cheap.
  • Layer segregation: Making a change is much faster as docker only has to propagate new updates to the layer that was changed.
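
As a quick sketch of the volume concept mentioned above (the volume name and paths are hypothetical), a named volume can be created once and then shared between containers:

# Create a named volume
docker volume create shared-data

# Write a file into the volume from one container
docker run --rm -v shared-data:/data alpine sh -c "echo hello > /data/greeting.txt"

# Read the same file from a second container
docker run --rm -v shared-data:/data alpine cat /data/greeting.txt

Both containers see the same /data directory, and the data outlives the containers themselves.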


Building a Docker Image

To build a Docker image, a set of instructions needs to be provided. These instructions are defined in a Dockerfile. Below are the most common instructions in a Dockerfile.

  • FROM: This instruction initializes a new build stage and sets the base image for subsequent instructions. The image can be any valid image. This command pulls the specified image, and the results of executing the next set of instructions are layered over it.
  • RUN: This executes any command in a new layer on top of the current image and commits the result. The resulting image is then used for the next step in the Dockerfile.
  • ADD: The ADD instruction copies new files, directories, or remote file URLs and adds them to the filesystem of the image.
  • COPY: This copies new files or directories and adds them to the filesystem of the container at the specified path.
  • ENTRYPOINT: This specifies the command that gets executed when the container is run.

Once a valid set of instructions has been provided in the Dockerfile, the image can be built using the command docker build [dockerfile src]. The image can then be run using docker run [image name].
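
For instance, a minimal Dockerfile for a hypothetical Node.js application might look like the following sketch (the file names, image tag, and entry script are assumptions for illustration, not the project built later in this post):

# Use the official Node.js base image
FROM node:10

# Create and switch to the working directory inside the image
WORKDIR /usr/src/app

# Copy the dependency manifests and install dependencies first so this layer can be cached
COPY package*.json ./
RUN npm install

# Copy the rest of the application source code
COPY . .

# Command executed when the container starts
ENTRYPOINT [ "node", "index.js" ]

With this Dockerfile in the current directory, docker build -t my-app . builds the image and docker run my-app starts a container from it.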




Container Orchestration

Container orchestration is the management of the lifecycles of containers, and it is most applicable to large-scale projects. Container orchestration can be used for:

  • Provisioning and deployment of containers
  • Coordinating redundancy and availability of containers
  • Scaling or removing containers for effective load balancing
  • Allocation of resources between containers
  • Exposing services running in a container to the outside world
  • Load balancing and service discovery between containers
  • Health monitoring of containers and host
  • Configuration of an application in relation to the containers running it


Kubernetes

Kubernetes is an orchestration tool originally designed by Google. It has gained popularity among DevOps practitioners because it delivers a self-service Platform-as-a-Service (PaaS) that abstracts the hardware layer away from development teams. Due to its flexibility, many cloud providers offer managed Kubernetes services, such as Azure Kubernetes Service (AKS), Google Kubernetes Engine, and Amazon Elastic Container Service for Kubernetes (Amazon EKS).

The main components of Kubernetes include:

  • Cluster: A cluster is a set of nodes with at least one master node and several worker nodes that can be virtual or physical machines.
  • Kubernetes master: The master is responsible for managing the scheduling and deployment of application instances across nodes. The full set of services the master node runs is known as the control plane. The Kubernetes API server is responsible for communication between master and other nodes.
  • Kubelet: Each Kubernetes node runs an agent process called a kubelet responsible for managing the state of the node such as starting, stopping, and maintaining application containers based on instructions from the master through the Kubernetes API server.
  • Pods: The basic scheduling unit, which consists of one or more containers guaranteed to be co-located on the host machine and able to share resources. Each pod is assigned a unique IP address within the cluster, allowing the application to use ports without conflict. The desired state of the containers in a pod is specified in a podspec passed to the kubelet through the API server.
  • Deployments, replicas, and ReplicaSets: A deployment is a YAML object that defines the pods and the number of container instances, called replicas, for each pod. The number of replicas to run in the cluster is specified via a ReplicaSet, which is part of the deployment object, so that if a node running a pod dies, the ReplicaSet ensures another pod is scheduled on an available node. A minimal deployment manifest is sketched below.
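
As a rough illustration, a deployment manifest for a hypothetical web application might look like this (the names, labels, image, port, and replica count are all assumptions for illustration):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical deployment name
spec:
  replicas: 3                   # desired number of pod replicas, maintained by the ReplicaSet
  selector:
    matchLabels:
      app: my-app
  template:                     # the pod template (podspec) used for every replica
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0.0   # hypothetical container image
          ports:
            - containerPort: 3000

Applying a manifest like this with kubectl apply -f deployment.yaml asks the control plane to keep three replicas of the pod running, rescheduling them onto healthy nodes if one fails.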


Example:

We are going to create a simple scraping application. The application searches Google Images using a user-supplied query string, then scrapes the image links and saves them to a CSV file.


const cheerio = require('cheerio');
const fs = require('fs');
const minimist = require('minimist');
const chrome = require('selenium-webdriver/chrome');
const { Builder, By, Key } = require('selenium-webdriver');

const argv = minimist(process.argv.slice(2))

const screen = {
   width: 640,
   height: 480
 };

const { q, file } = argv;

const browser = new Builder()
 .forBrowser('chrome')
 .setChromeOptions(
  new chrome.Options()
   .headless()
   .windowSize(screen)
   .addArguments('--no-sandbox')
   .addArguments('--disable-dev-shm-usage')
 )
 .build();

 const saveImageLink = (image) => {
   if (typeof image === 'string') fs.writeFileSync(file || './image.csv', `${image}\n`, {flag: 'a+'})
 }

const getImagesSrc = (body) => {
 const $ = cheerio.load(body);

 return $('div a img').map((_i, el) => $(el).attr('src')).filter((_i, el) => !!el)
};

const functionScrap = async () => {
 await browser.get('https://www.google.com/imghp?hl=en');
 const searchField = browser.findElement(By.css('input[name="q"]'));

 await searchField.sendKeys(`${q} ${Key.ENTER}`);

 const body = await (await browser.findElement(By.tagName('body'))).getAttribute('innerHTML');
 const images = getImagesSrc(body);
  images.map((_i, img) => saveImageLink(img))
}

functionScrap();

The above code snippet is the complete code sample for the scraper, written in JavaScript. Node.js is not required on the local machine, since the application will run in a Docker container, but for tinkering you might want Node.js and NPM installed. I will now break the snippet down into sections and explain what each one does.

const cheerio = require('cheerio');
const fs = require('fs');
const minimist = require('minimist');
const chrome = require('selenium-webdriver/chrome');
const { Builder, By, Key } = require('selenium-webdriver');

This loads all the required libraries. Cheerio parses markup and provides an API for traversing and manipulating the resulting data structure; it does not interpret the result the way a web browser does. minimist is a CLI argument parser. Selenium is a browser automation library: most often used for testing web applications, it can be used for any task that requires automating interaction with a browser. selenium-webdriver is the official WebDriver JavaScript bindings from the Selenium project.

const argv = minimist(process.argv.slice(2))

const screen = {
   width: 640,
   height: 480
 };

const { q, file } = argv;

The first line initializes minimist to capture the command-line arguments supplied to the application, which are subsequently read from the assigned variable (the q search query and the optional file output path). Next, the screen size of the browser is specified, although this is not strictly necessary.

const browser = new Builder()
 .forBrowser('chrome')
 .setChromeOptions(
  new chrome.Options()
   .headless()
   .windowSize(screen)
   .addArguments('--no-sandbox')
   .addArguments('--disable-dev-shm-usage')
 )
 .build();

The browser is initialized and built based on the options supplied through the chained methods. In this case, the browser is Chrome. Further options are supplied, such as headless mode, which means the browser does not open a GUI but runs as a background process instead.

const getImagesSrc = (body) => {
 const $ = cheerio.load(body);

 return $('div a img').map((_i, el) => $(el).attr('src')).filter((_i, el) => !!el)
};

The getImagesSrc function is responsible for getting the src attribute from all the image tags. The function uses cheerio to read this attribute, much like using jQuery to retrieve an attribute of an HTML element.
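
For completeness, the saveImageLink helper from the full listing above is what writes each link to disk:

const saveImageLink = (image) => {
  // Append the link to the CSV file passed via --file, or ./image.csv by default
  if (typeof image === 'string') fs.writeFileSync(file || './image.csv', `${image}\n`, {flag: 'a+'})
}

The a+ flag opens the file in append mode and creates it if it does not yet exist, so every scraped link ends up on its own line of the CSV file.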

const functionScrap = async () => {
 await browser.get('https://www.google.com/imghp?hl=en');
 const searchField = browser.findElement(By.css('input[name="q"]'));

 await searchField.sendKeys(`${q} ${Key.ENTER}`);

 const body = await (await browser.findElement(By.tagName('body'))).getAttribute('innerHTML');
 const images = getImagesSrc(body);
  images.map((_i, img) => saveImageLink(img))
}

functionScrap();

This is the core of the application, responsible for triggering the scrape. The function visits the Google Images page using Selenium's get method and types in the search parameter supplied from the CLI. Once the search is complete, getImagesSrc is called on the page body, after which the returned links are saved to a CSV file.

node_modules/
image.csv

This is added to the .dockerignore file as we will be reinstalling all dependencies on each image build.

FROM node:10

RUN apt-get update
RUN apt-get install -y apt-utils
RUN apt-get -y install -y unzip xvfb libxi6 libgconf-2-4 --fix-missing
RUN apt-get -y install default-jdk --fix-missing
RUN apt-get -y install wget

WORKDIR /home/scrap
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
RUN dpkg -i google-chrome-stable_current_amd64.deb; exit 0
RUN apt-get -f -y install
RUN google-chrome --version
RUN wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip
RUN mv chromedriver /usr/bin/chromedriver
RUN chown root:root /usr/bin/chromedriver
RUN chmod +x /usr/bin/chromedriver

RUN wget https://selenium-release.storage.googleapis.com/3.141/selenium-server-standalone-3.141.59.jar

RUN apt-get -y install curl
RUN curl -sL https://deb.nodesource.com/setup_10.x
RUN apt-get -y install software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa

RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add -
RUN apt-get -y install yarn

ADD . ./
RUN Xvfb :1 -screen 5 1024x768x8 &
RUN export DISPLAY=:1.5
RUN yarn

ENTRYPOINT [ "node", "scrap.js" ]

The above code snippet is the Dockerfile required to build the Docker image; it specifies the necessary build steps/layers. I will break down what each major step does.

FROM node:10

This tells Docker to use the node base image, which is open source and available on Docker Hub. The node image has Node.js and NPM installed by default. The required version number (10) is specified after the colon.

RUN apt-get update
RUN apt-get install -y apt-utils
RUN apt-get -y install -y unzip xvfb libxi6 libgconf-2-4 --fix-missing
RUN apt-get -y install default-jdk --fix-missing
RUN apt-get -y install wget

Other dependencies required for Selenium to run are installed in this step. The apt package manager is used because the node base image is Debian-based.

WORKDIR /home/scrap
RUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
RUN dpkg -i google-chrome-stable_current_amd64.deb; exit 0
RUN apt-get -f -y install
RUN google-chrome --version
RUN wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip
RUN mv chromedriver /usr/bin/chromedriver
RUN chown root:root /usr/bin/chromedriver
RUN chmod +x /usr/bin/chromedriver

RUN wget https://selenium-release.storage.googleapis.com/3.141/selenium-server-standalone-3.141.59.jar

RUN apt-get -y install curl
RUN curl -sL https://deb.nodesource.com/setup_10.x
RUN apt-get -y install software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa

RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add -

The WORKDIR instruction is synonymous with running mkdir dirname && cd dirname on a Linux machine: it creates the directory and switches into it. The next steps install ChromeDriver and set it up by making it executable and available in the bin directory.

ADD . ./
RUN Xvfb :1 -screen 5 1024x768x8 &
RUN export DISPLAY=:1.5
RUN yarn

The ADD instruction copies our source code from the build context into the current working directory of the Docker image. Also, although not necessary for the scope of this application, I decided to add Xvfb. Xvfb (X virtual framebuffer) is an in-memory display server for UNIX-based operating systems. It makes it possible to run GUI applications on headless servers (e.g., browser tests on a CI server) while also being able to take screenshots. Finally, the source code's dependencies are installed by running the yarn command.

ENTRYPOINT [ "node", "scrap.js" ]

This specifies the command to be executed when the container is run. Any CLI arguments supplied to docker run after the image name are passed along to this command, which is how the search query and output path reach the script.
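
Putting it all together, the image could be built and the scraper run roughly as follows (the scraper image tag and the output directory are assumptions for illustration; the bind mount simply makes the resulting CSV file visible on the host):

# Build the image from the Dockerfile in the current directory
docker build -t scraper .

# Run the scraper; everything after the image name is passed to "node scrap.js"
docker run -v "$(pwd)/output:/home/scrap/output" scraper --q "docker whale" --file ./output/image.csv

Here --q is the search query and --file is the output path parsed by minimist inside the container.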

The concept behind this application can be applied to task automation, such as automating end-to-end tests.


