This guide describes my setup for data science and data analysis projects. It is comprehensive, and not all of the packages and software are required for basic work, so you do not need to follow it exactly. I have a MacBook Air running Yosemite, and everything in this guide is written with that in mind.
1. Software and packages
Programming languages – Python, R
$ $(which python) --version    # prints the Python version if one exists; the Yosemite default is 2.7.6
$ brew install python
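The same check can be made from inside Python. The sketch below is only an illustration (the helper name `describe_interpreter` is made up for this example). It reports which interpreter is actually running, which helps tell the system Python apart from the Homebrew one.

```python
import platform
import sys


def describe_interpreter():
    """Return (version, path) for the interpreter running this script."""
    # platform.python_version() always has the form 'major.minor.patchlevel'
    return platform.python_version(), sys.executable


if __name__ == "__main__":
    version, path = describe_interpreter()
    print("Python %s at %s" % (version, path))
```

On a default Homebrew setup on an Intel Mac the reported path usually starts with /usr/local, while the system Python lives at /usr/bin/python. Treat that as a rule of thumb rather than a guarantee.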
Now we will install the packages needed for data science. If you installed Python with Homebrew, the Python package manager pip was installed along with it. pip is itself a package; it looks packages up in the Python Package Index (PyPI), then downloads and installs them for you.
Using pip, packages are installed with
pip install <package-name>
Below is a list of packages you should install (pip list shows the packages that are already installed). Installing all of them is recommended but not required.
Of these packages, the most important are scikit-learn, pandas, matplotlib, scipy, numpy and statsmodels, which are the core data science packages. I also find myself using graphlab-create a lot, simply because it offers a very quick way to prototype recommenders and classifiers using standard algorithms. To install it you need a registration key (available from https://dato.com/products/create/quick-start-guide.html). Once you have the license key, follow the instructions there and save it to the appropriate directory. Then install the package:
sudo pip install graphlab-create==1.2.1
You can check whether a package is properly installed by opening the Python prompt and importing it.
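The import check can also be scripted so that all packages are tested in one go. This is a minimal sketch, not a definitive tool: the helper name `check_packages` and the `PACKAGES` list are made up for this example, and the list only covers the core packages named above.

```python
import importlib

# Import names of the core packages. Note that the import name can differ
# from the pip name: scikit-learn is imported as "sklearn".
PACKAGES = ["sklearn", "pandas", "matplotlib", "scipy", "numpy", "statsmodels"]


def check_packages(names):
    """Map each import name to its version string, or None if the import fails."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None
        else:
            # Most packages expose __version__; fall back to "unknown" if not
            results[name] = getattr(module, "__version__", "unknown")
    return results


if __name__ == "__main__":
    for name, version in sorted(check_packages(PACKAGES).items()):
        print("%-12s %s" % (name, version if version else "NOT INSTALLED"))
```

Any package reported as NOT INSTALLED can then be installed with pip as described earlier.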
Editor – Sublime Text 2: http://www.sublimetext.com/2
Databases – MySQL and MongoDB
To install MySQL, first make sure it is not already installed, then run:
$ brew update
$ brew doctor
$ brew update
$ brew install mysql    # install mysql
$ mysql.server start    # start the server to check that it works
$ brew install mongodb
$ brew install mongodb --with-openssl    # alternative to the line above, for openssl support
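To confirm that the database servers are accepting connections, a small port check can help. The sketch below assumes both servers run locally on their default ports (3306 for MySQL, 27017 for MongoDB) and that the MongoDB server has been started, which is not covered above. The names `is_listening` and `DEFAULT_PORTS` are made up for this example.

```python
import socket

# Default ports; adjust these if your servers are configured differently
DEFAULT_PORTS = {"mysql": 3306, "mongodb": 27017}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


if __name__ == "__main__":
    for name, port in sorted(DEFAULT_PORTS.items()):
        state = "up" if is_listening(port) else "not reachable"
        print("%s (port %d): %s" % (name, port, state))
```

This only shows that some process is listening on the port. It does not prove that the process is MySQL or MongoDB, so use it as a quick sanity check.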
Database clients – SQuirreL SQL and Robomongo
SQuirreL SQL client – http://squirrel-sql.sourceforge.net/#installation
SQuirreL SQL is a generic SQL client. It should support most SQL servers, such as MySQL or MS SQL Server. Once you have downloaded the client, get and install the JDBC driver for MS SQL from here: http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Once SQuirreL SQL is installed, click the Drivers tab on the main screen. Look for Microsoft MSSQL Server JDBC Driver in the list and add the downloaded file as the new driver.
Robomongo is a pretty good client for MongoDB. You can find it here: http://robomongo.org/download.html
Using IPython notebook
IPython notebook is the web interface to the interactive Python shell. Start the notebook as follows:
$ cd /../desired/dir
$ ipython notebook
This will start the IPython notebook server on the default port, 8888. You can go to http://localhost:8888 and see IPython running, then start a notebook from your desired folder. For more options, run ipython --help
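If the browser page does not load, it can help to check from a script whether the notebook server is answering at all. The following is a rough sketch that assumes the default address http://localhost:8888 mentioned above. The function name `notebook_is_up` is made up for this example.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def notebook_is_up(url="http://localhost:8888", timeout=2.0):
    """Return True if an HTTP server answers at url without an error status."""
    try:
        response = urlopen(url, timeout=timeout)
    except IOError:
        # Covers refused connections, timeouts and HTTP error statuses
        return False
    try:
        return 200 <= response.getcode() < 400
    finally:
        response.close()


if __name__ == "__main__":
    print("notebook server reachable: %s" % notebook_is_up())
```

A True result only means that some HTTP server answered on that port. If you started the notebook on a different port, pass the matching URL instead.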
Now you are ready to write simple recommenders and classifiers.