Data analysis with Python: An introduction to Pandas and Jupyter Notebooks

Kevin Beswick

Nushrat Khan

Bret Davidson

Agenda

Lecture	30 mins
Technical Orientation	10 mins
Hands-on exercise part 1	45 mins
Break	10 mins
Lecture	15 mins
Hands-on exercise part 2	60 mins
Q/A	10 mins

Introductions

Tell us your name, and where you're from

Lab Setup

Visit github.com/NCSU-Libraries/pandas-code4lib-2017
Click on the "launch binder" badge at the top of the README.

Prerequisites

Basic Python familiarity

What is Pandas?

A Python package providing fast, flexible and expressive data structures
Designed to make working with relational or labeled data both easy and intuitive
Fills a gap in data analysis and modeling, tasks that were previously harder to do in Python

Why should I want to use Pandas?

Enables you to carry out an entire data analysis workflow in Python
Integrates with other Python libraries
No need to learn another language, such as R

Capability

Data import/export
Data cleaning
Data analysis
Data visualization
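
As a rough illustration of how these capabilities fit together, the sketch below imports a small CSV, cleans a missing value, derives a column, and exports the result. The column names and the in-memory CSV text are hypothetical stand-ins for a real file, and the missing value was blanked on purpose to give the cleaning step something to do. Visualization is left out here because it needs a plotting library such as matplotlib.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
# The last row deliberately lacks a machine count.
raw_csv = io.StringIO(
    "department,employees,machines\n"
    "IT,20,25\n"
    "Special Collections,13,10\n"
    "Building Services,2,\n"
)

# Import: read_csv accepts a file path or any file-like object.
df = pd.read_csv(raw_csv)

# Cleaning: replace the missing machine count with 0 and store integers.
df["machines"] = df["machines"].fillna(0).astype(int)

# Analysis: derive a new column and compute a simple total.
df["machines_per_employee"] = df["machines"] / df["employees"]
total_employees = df["employees"].sum()

# Export: serialize the cleaned table back to CSV text
# (pass a file path to to_csv to write a file instead).
cleaned_csv = df.to_csv(index=False)

print(total_employees)
print(cleaned_csv)
```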

Other Popular Statistical Software

Open source software: R
Proprietary software: Mathematica, MATLAB, SAS, SPSS

Comparison between Pandas and R

Syntax is often similar
Python has packages for data analysis, but R has a larger ecosystem of statistical packages
Python can be more efficient in non-statistical areas, such as text analysis
Both are useful and complement each other

Data Structures in Pandas

Series and DataFrames

Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

Example of Series

Data on different occupations

0	Librarian
1	Software Developer
2	Engineer
3	IT Support Specialist
4	Research Data Coordinator
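
One way the occupations Series above could be built in code is sketched below. The variable name `occupations` is chosen here for illustration. Because no index is supplied, pandas assigns the default integer labels 0 through 4.

```python
import pandas as pd

# Build a Series from a plain Python list; the index defaults to 0..4.
occupations = pd.Series(
    [
        "Librarian",
        "Software Developer",
        "Engineer",
        "IT Support Specialist",
        "Research Data Coordinator",
    ]
)

print(occupations)

# Look up a single value by its index label.
print(occupations.loc[1])  # Software Developer

# The index is an object in its own right.
print(list(occupations.index))  # [0, 1, 2, 3, 4]
```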

DataFrame

A DataFrame is a tabular data structure composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. It can be thought of as a group of Series objects that share an index (the row labels), with each Series forming one named column.

Example of DataFrame

Data about different departments within a library

	Department Name	Number of Employees	Number of Machines
0	Digital Library Initiatives (DLI)	12	14
1	IT	20	25
2	Special Collections	13	10
3	Building Services	2	1
4	Research Data Management	2	3
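
The table above could be built from a dictionary that maps each column name to its list of values, as in the sketch below. The variable names are chosen here for illustration. Selecting a single column returns a Series that carries the same row labels as the DataFrame.

```python
import pandas as pd

# Each dictionary key becomes a column; rows are labeled 0..4 by default.
departments = pd.DataFrame(
    {
        "Department Name": [
            "Digital Library Initiatives (DLI)",
            "IT",
            "Special Collections",
            "Building Services",
            "Research Data Management",
        ],
        "Number of Employees": [12, 20, 13, 2, 2],
        "Number of Machines": [14, 25, 10, 1, 3],
    }
)

# Selecting one column returns a Series.
employees = departments["Number of Employees"]
print(type(employees).__name__)  # Series

# That Series shares the DataFrame's row index.
print(employees.index.equals(departments.index))  # True

# Look up a single cell by row label and column name.
print(departments.loc[1, "Department Name"])  # IT
```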

About Jupyter Notebook

What is a Jupyter Notebook?

Jupyter Notebooks are interactive documents that can contain both code and rich text elements.
Renders Python and Markdown by default.
This literate programming approach makes it easier to share code, and is popular in scientific computing and data science.

Brief History

Originally called IPython notebooks
Developed in 2011 by a team of researchers led by Fernando Pérez from UC Berkeley and Brian Granger from California Polytechnic State University.
Later renamed Jupyter Notebook to reflect its support for other programming languages, such as Julia and R.

Why use a Jupyter Notebook?

Free and open source - widely used
Ease of sharing
Literate programming document - ability to incorporate rich text with code
Reproducibility

Installation instructions

Jupyter can be installed easily on your local machine using Anaconda.
Experienced Python users can use pip (`pip install jupyter`)
Detailed instructions can be found in the Jupyter Documentation

Cells in a Jupyter Notebook

Each section of a Jupyter notebook is called a cell.
A cell can typically contain either Python code or Markdown.
Cells make it easy to execute a specific portion of your code and to alternate between code and documentation.

Overview of first hands-on section

In the first section we will talk about different data types, how to import data, and some basic functions to explore your data.

Pandas integrates with many Python libraries for related tasks. For example, in this workshop we will use matplotlib and ggplot for visualization.
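
As a preview of the kind of import and exploration steps covered in the first section, here is a minimal sketch. The CSV text, the column names, and the variable `books` are hypothetical and are not the workshop dataset. In a notebook you would normally pass a file path to `read_csv`.

```python
import io

import pandas as pd

# Hypothetical data held in memory; replace with a file path in practice.
csv_text = io.StringIO(
    "title,year,pages\n"
    "Book A,2001,310\n"
    "Book B,2010,198\n"
    "Book C,2015,452\n"
)

books = pd.read_csv(csv_text)

print(books.shape)       # (number of rows, number of columns)
print(books.dtypes)      # data type inferred for each column
print(books.head(2))     # first two rows
print(books.describe())  # summary statistics for the numeric columns
```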

Lab Setup

Visit github.com/NCSU-Libraries/pandas-code4lib-2017
Click on the "launch binder" badge at the top of the README.

Alternative Lab Setup

Visit www.continuum.io/downloads and download the appropriate Anaconda package for your OS.
Visit github.com/NCSU-Libraries/pandas-code4lib-2017 and clone or download the repo.
Using your favorite shell, run `jupyter notebook` from within the repo directory.