Over the weekend I participated in a hack-day style workshop on containerized data science apps. It was pretty fun, and I got to use docker semi-seriously for the first time. I created two containers, a dockerized recommender app and a dockerized mongodb instance, and linked the two.
Docker is an open source tool that has made using Linux containers (LXC) really, really simple. A Linux container differs from a virtual machine in that it shares the kernel of the host OS (which may itself be a virtual machine) instead of booting its own, yet it still gives your application an isolated environment to run in. This has the neat effect of segregating applications by function.
Docker consists of the docker daemon that runs on your machine, a docker client that talks to it (on a Mac, boot2docker supplies the small Linux VM that hosts the daemon) and a docker registry. Containers are created from images, and multiple containers can be spawned from a single image. An image is built from a Dockerfile, which is a simple text file. The daemon keeps track of the images and containers on your machine. Dockerfiles can be versioned on GitHub, and the images built from them can be pushed to your personal repository on a docker registry. You can in fact also pull images created by other people and build on top of those. Docker provides users with “base” images, which are nothing but bare Ubuntu or Red Hat installs. But people have gone wild with this and created really cool images of their own, and even neat hacks around the docker concept of working with apps. It is pretty impressive.
So, coming back to my dockerized recommender app: I created two different images starting from existing base images, one for the recommender app and another for mongodb.
Dockerfile for the mongodb image (from the docker user guide):
FROM ubuntu:latest
MAINTAINER Manas

# Installation:
# Import MongoDB public GPG key AND create a MongoDB list file
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
RUN echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | tee /etc/apt/sources.list.d/10gen.list

# Update apt-get sources AND install MongoDB
RUN apt-get update && apt-get install -y mongodb-org

# Create the MongoDB data directory
RUN mkdir -p /data/db

# Expose port #27017 from the container to the host, #28017 to receive http requests
EXPOSE 27017
EXPOSE 28017

# Set /usr/bin/mongod as the dockerized entry-point application
CMD /usr/bin/mongod --rest --dbpath=/data/db
Next I build the image from the Dockerfile and run a container with mongodb as its contained application.
docker build -t manas/doc-mongo .
docker run -d -p 27017:27017 -p 28017:28017 --name mongo-cont001 manas/doc-mongo
This is all well and good. Docker gives you a container id for the container you ran. With the -d flag the container runs detached, in the background. It's always better to name a container when running it, else docker will give it a random name. You can actually enter the container like so:
docker exec -it mongo-cont001 bash
and check whether mongod is actually running by starting the mongodb client with the mongo command.
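If you would rather check from the host without entering the container, a plain TCP probe of the two published ports is enough. This is only a minimal sketch in Python 3: the helper name port_is_open is mine, and 127.0.0.1 is an assumption (with boot2docker you would use the address it reports instead).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 27017 is mongod itself, 28017 is its http interface; both were
    # published with -p in the docker run command above.
    # 127.0.0.1 is an assumption, adjust it to wherever docker is reachable.
    for port in (27017, 28017):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print("port %d is %s" % (port, state))
```

An open port only tells you that something is listening, so the mongo client check inside the container is still the more convincing test.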
Next I wrote a Dockerfile for the recommender app, which I will write in Python. The app is really an IPython notebook that contains the recommender code. I decided to use the GraphLab implicit recommender based on item similarity metrics. A bare Ubuntu base image has next to nothing installed, so I need to install Python and all the required packages. I did this in the following Dockerfile:
FROM wiseio/datascience-base

# Get graphlab-create
# ubuntu 14.04 does not have the libgomp1 lib which is needed for running graphlab
# (-y answers apt's confirmation prompt, which would otherwise abort a non-interactive build)
RUN apt-get update && apt-get install -y libgomp1
RUN mkdir -p ~/.graphlab
# printf instead of echo -e: RUN executes under /bin/sh, whose echo may not understand -e
RUN printf '[Product]\nproduct_key=XXXX\n' > ~/.graphlab/config
RUN pip install -U graphlab-create==1.2.1

# Get data science packages
RUN pip install scipy numpy scikit-learn scikit-image pyzmq nose readline pandas matplotlib seaborn dateutil ipython-notebook plotly

# Get more packages
RUN pip install tornado jinja2 networkx pymongo anyjson simplejson statsmodels imgurpython

# Add current files to /workspace and set entry point.
ADD . /workspace
WORKDIR /workspace
ADD notebook.sh /notebook.sh
RUN chmod a+x /notebook.sh
EXPOSE 8888
CMD ["/notebook.sh"]
So there are many things going on here. First of all, notice that I am building my image on top of the wiseio/datascience-base image that the wise.io guys have created. It comes with standard installations of wget, curl, unzip, anaconda, etc. (you can view that image). It also comes with the data science packages, but I have included them here for illustration. The new thing is the install of the graphlab-create package. For this you will need to get a license key from the graphlab.com website. I am also including the imgurpython package to get data from the imgur API. Finally, the container exposes port 8888, the IPython notebook's default port, which is later bound to a port on the host so that the notebook can be accessed from the browser.
The docker container for this recommender app is built and run as follows:
docker build -t manas/datasci-graphlab .
docker run -d -p 80:8888 -e "PASSWORD=xxx" \
    -v $PWD/workspace:/workspace \
    -v $PWD/workspace/data/:/workspace/data \
    --name recod-001 \
    --link mongo-cont001:mongo-cont001 \
    manas/datasci-graphlab
The “docker build” command will build an image called manas/datasci-graphlab. Beware: if your base image contains a lot of dependencies, it may take a long time to download and install them the first time. The good part is that the docker daemon caches the result of every step that completes successfully. So even if your build fails at some step, the next build after you fix the Dockerfile reuses the cached steps and picks up from where it left off.
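To make the caching behaviour concrete, here is a toy model in Python 3. It is not docker's actual implementation, only a sketch under a simplifying assumption: a step is reused as long as it and every instruction before it are identical to the previous build. (For ADD, docker also looks at the contents of the added files, which this sketch ignores.) The function name steps_to_rerun is made up for the illustration.

```python
def steps_to_rerun(previous, current):
    """Return the instructions in `current` that a rebuild would execute.

    Toy model: a step comes from the cache only while every instruction up
    to and including it matches the previous build, line for line.
    """
    reused = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        reused += 1
    return current[reused:]


# Steps that completed successfully in the last build.
previous_build = [
    "FROM wiseio/datascience-base",
    "RUN mkdir -p ~/.graphlab",
    "RUN pip install -U graphlab-create==1.2.1",
]
# Same Dockerfile with one more instruction appended.
current_build = previous_build + ["RUN pip install pymongo"]

print(steps_to_rerun(previous_build, current_build))
```

Only the appended pip install would run again here. This is also why it pays to put the slow, rarely changing steps near the top of a Dockerfile.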
The “docker run” command does a bunch of things here. “-d” runs the container detached, in the background. “-p” binds local port 80 to container port 8888. “-e” sets the environment variable PASSWORD inside the container. “-v” is the volume mount option: it mounts the local $PWD/workspace directory at /workspace in the container, and the second “-v” does the same for $PWD/workspace/data at /workspace/data. “--name” obviously sets the name of the container to recod-001. “--link” is the most interesting option here: it links the recod-001 container, when it starts, to the already running mongo-cont001. For this to work, mongo-cont001 should be running. (You can check this by running the docker ps command.) The linked container is specified as name:alias, where the alias is the name under which recod-001 will see it. I simply reused the container name as the alias, which is why mongo-cont001 appears twice. For more on aliases refer to the docker guide. Finally, manas/datasci-graphlab is the image to create the container from.
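Inside recod-001 the link shows up in two ways, as far as I understand the docker guide: the alias becomes a hostname in /etc/hosts, and docker injects environment variables derived from the alias and the exposed ports. The sketch below (Python 3) builds a mongodb connection string from them. The variable naming pattern (alias upper-cased, hyphens turned into underscores, followed by _PORT_27017_TCP_ADDR and _PORT_27017_TCP_PORT) is my reading of the docker linking convention, so treat it as an assumption and verify with the env command inside your container. The function name is hypothetical.

```python
import os


def mongo_uri_from_link(alias="mongo-cont001", port=27017, env=None):
    """Build a mongodb:// URI for a container linked under `alias`.

    Assumes docker's link environment variable convention; falls back to
    the alias as hostname (the /etc/hosts entry) if the variables are absent.
    """
    env = os.environ if env is None else env
    prefix = alias.upper().replace("-", "_")     # mongo-cont001 -> MONGO_CONT001
    key = "%s_PORT_%d_TCP" % (prefix, port)      # MONGO_CONT001_PORT_27017_TCP
    host = env.get(key + "_ADDR", alias)
    linked_port = env.get(key + "_PORT", str(port))
    return "mongodb://%s:%s/" % (host, linked_port)


# Illustrative values only; the real address is assigned by docker.
fake_env = {
    "MONGO_CONT001_PORT_27017_TCP_ADDR": "172.17.0.2",
    "MONGO_CONT001_PORT_27017_TCP_PORT": "27017",
}
print(mongo_uri_from_link(env=fake_env))
```

In the notebook, the resulting string can be handed to pymongo.MongoClient, which accepts a connection URI.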
Now if you point your browser to 127.0.0.1, the IPython notebook should come up. If it does not, try the IP given by the command boot2docker ip.
In the next part I will show how to write a simple recommender app in an IPython notebook that collects data from an external API, computes models and persists recommendations for users in mongodb. As an additional twist, it is also possible to expose the recommendations through a REST API served by a tornado web server running in a different container, but we will leave that for later. For now we will make the recommendations available through mongo's own REST API. That is why we exposed port 28017, mongo's default HTTP interface port, earlier and started mongod with the “--rest” option.
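As a small preview, this is roughly how a query URL for that REST interface can be put together in Python 3. The URL layout (database and collection in the path, filter_<field>=<value> and limit as query parameters) follows my recollection of mongo's simple REST interface docs, so double check it against your mongo version. The database name recsys, the collection name recommendations and the field user_id are placeholders I made up.

```python
from urllib.parse import urlencode


def rest_query_url(host, database, collection, filters=None, limit=None, port=28017):
    """Build a query URL for mongod's simple REST interface (enabled by --rest)."""
    # Each filter becomes a filter_<field>=<value> query parameter.
    params = [("filter_" + field, value)
              for field, value in sorted((filters or {}).items())]
    if limit is not None:
        params.append(("limit", limit))
    url = "http://%s:%d/%s/%s/" % (host, port, database, collection)
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical database, collection and field names.
print(rest_query_url("127.0.0.1", "recsys", "recommendations",
                     filters={"user_id": 42}, limit=-10))
```

The negative limit mirrors the example in the mongo docs. Opening the printed URL in a browser, or fetching it with curl, should return the matching documents as JSON once there is data in the collection.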