Programming Workflow for Data Analysis

About IPython

IPython provides the interactive Python kernel that we use in Jupyter notebook. We can also use IPython outside of Jupyter notebook, through its command-line interface in the terminal. This is very convenient for quick modifications, exploration, experimentation, and even running Python scripts!

To use IPython’s command-line interface, just type ipython in your terminal. This should work if you already have Jupyter notebook installed. Just as we always did in the Jupyter notebook, let’s import our packages and load a dataset.

Even though we are in the terminal, we can still view datasets the same way with head.
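As a minimal sketch, the first few inputs of such a session might look like the following. The rows and column names are made up for illustration, and the data is read from an in-memory string so the example runs anywhere; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# A few made-up rows standing in for a real CSV file (assumption:
# these column names are illustrative, not the actual dataset's).
csv_text = """age,education,income
39,Bachelors,<=50K
52,HS-grad,>50K
31,Masters,>50K
"""

# With a file on disk this would be pd.read_csv("your_file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# In IPython, typing df.head() alone displays the result;
# print() is only needed when running this as a script.
print(df.head())
```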

And although you’re in IPython, you can still use command-line commands! Any shell command can be run by prefixing it with an exclamation mark (!). So we can do things like checking our directory for other files and renaming them.
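The same two steps, listing a directory and renaming a file, can also be done in plain Python, which is handy once the commands move into a script. This is only a sketch: it works in a temporary directory, and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Create a small file to play with (hypothetical name and contents).
    (folder / "census.csv").write_text("age,income\n39,<=50K\n")

    # Check the directory for files.
    before = sorted(p.name for p in folder.iterdir())

    # Rename the file.
    (folder / "census.csv").rename(folder / "census_income.csv")
    after = sorted(p.name for p in folder.iterdir())

print(before)
print(after)
```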

Using IPython in your terminal can be very convenient for quick changes to your files, for example changing the column names in a dataset before sending off a CSV file.
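A quick column rename of this kind could look like the sketch below. The data and the old and new column names are made up; to write a file instead of a string, pass a file name as the first argument to to_csv.

```python
import pandas as pd

# Made-up data with column names we want to tidy up.
df = pd.DataFrame({"Age": [39, 52], "Education Level": ["Bachelors", "HS-grad"]})

# Map old names to new ones; columns not listed keep their names.
df = df.rename(columns={"Age": "age", "Education Level": "education_level"})

# With no path given, to_csv returns the CSV text as a string.
# index=False leaves out the row index column.
csv_text = df.to_csv(index=False)
print(csv_text)
```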

Visualizations work the same way here, except that there’s no %matplotlib inline, since that’s specific to Jupyter notebook. To have our visualizations show, we need to call plt.show().

Entering that will open the plot in a separate window.
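Here is a small sketch of plotting from the terminal, using made-up ages. Nothing appears on screen until plt.show() is called; on a machine with a display, that call opens the plot window and waits until you close it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages, just to have something to plot.
df = pd.DataFrame({"age": [25, 38, 28, 44, 18, 34, 29, 63]})

# Series.hist draws the histogram and returns the axes it drew on.
ax = df["age"].hist(bins=4)
ax.set_title("Age distribution")

# Outside Jupyter, the figure is only displayed once show() is called.
plt.show()
```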

One thing I do in IPython all the time is test or experiment with different functions, algorithms, or just Python. Sometimes, I even do a quick check here to remember how a function works before reading documentation. (Although reading documentation is a VERY good idea!)

This was just a quick overview to expose you to a different tool you can use to practice your new Python for data analysis skills. If you’d like to learn more, make sure to check out the documentation linked above!

Scripting Your Analysis

Being able to write and run scripts is invaluable for programming tasks and projects.

You can write your code in a text editor and then run the file in your terminal. Here’s a simple example: printing the column names from the census income dataset. If you save your code as a .py file, you can run it from the terminal with the python command followed by the file name. Make sure you are in the same directory you saved this file in!
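A script along these lines might look like the sketch below. It is not the original script: the rows and column names are made up, and the data is read from a string so the example runs without any file. A real script would pass the dataset’s file name to pd.read_csv.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the dataset file (assumption:
# the column names here are illustrative).
CSV_TEXT = """age,education,income
39,Bachelors,<=50K
52,HS-grad,>50K
"""

# In a real script: df = pd.read_csv("your_file.csv")
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Print each column name on its own line.
for column in df.columns:
    print(column)
```

Saved under a name such as print_columns.py (a hypothetical file name), it would be run with python print_columns.py.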

Ideally, you’d group your analysis into functions and run them in your main function. This helps you organize your code and generalize if possible.

The script below creates a double histogram of ages for people with lower and higher incomes.
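The original script is not reproduced here, so what follows is only a sketch of what such a script could look like. The plot_hist signature, the column names (age, income), the income labels, and the handful of rows are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_hist(low, high, label_low, label_high, title):
    """Overlay two histograms on one set of axes and return the axes."""
    fig, ax = plt.subplots()
    # alpha makes the bars see-through so both groups stay visible.
    ax.hist(low, alpha=0.5, label=label_low)
    ax.hist(high, alpha=0.5, label=label_high)
    ax.set_title(title)
    ax.set_xlabel("age")
    ax.legend()
    return ax


def main():
    # Made-up rows; a real script would load the data with pd.read_csv.
    df = pd.DataFrame({
        "age": [22, 25, 31, 38, 45, 52, 29, 41, 60, 35],
        "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K",
                   ">50K", "<=50K", ">50K", ">50K", "<=50K"],
    })

    # Query logic: split the ages by income group.
    low = df.query("income == '<=50K'")["age"]
    high = df.query("income == '>50K'")["age"]

    plot_hist(low, high, "<=50K", ">50K", "Age by income")
    plt.show()


if __name__ == "__main__":
    main()
```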

You can imagine how plot_hist could be reused if we had more double histograms. This function could even be more generalized. If we were creating a script for many visualizations for this dataset, the query logic we have in the main function should probably move to a different function. If this project got really big, we could even separate our code into different files or modules to make it even more organized.

These were just simple examples to expose you to a different workflow. Writing and running scripts from your terminal is a very flexible and powerful way to program. It is better suited as a development environment than Jupyter Notebook, which still works and is very useful but is more suited to things like reports.

I strongly recommend getting familiar with a good text editor and using the terminal if you aren’t already. Then, you could do things like automating scripts to pull data from a database every morning to deliver daily insights! Even if you don’t do anything fancy like that, it will still be very valuable to be familiar with a good text editor and terminal.
