Data analysis with Python: An introduction to Pandas and Jupyter Notebooks

Kevin Beswick

Nushrat Khan

Bret Davidson

Agenda

Lecture 30 mins
Technical Orientation 10 mins
Hands-on exercise part 1 45 mins
Break 10 mins
Lecture 15 mins
Hands-on exercise part 2 60 mins
Q/A 10 mins

Introductions

Tell us your name, and where you're from

Lab Setup

  1. Visit github.com/NCSU-Libraries/pandas-code4lib-2017
  2. Click on the "launch binder" badge at the top of the README.

Prerequisites

  • Basic Python familiarity

What is Pandas?

  • A Python package providing fast, flexible and expressive data structures
  • Designed to make working with relational or labeled data both easy and intuitive
  • Fills a gap in data analysis and modeling, tasks that were previously harder to do in Python

Why should I want to use Pandas?

  • Enables you to carry out an entire data analysis workflow in Python
  • Integrates with other Python libraries
  • No need to learn another language, such as R

Capability

  • Data import/export
  • Data cleaning
  • Data analysis
  • Data visualization
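
The capabilities above can be sketched in a few lines. This is a minimal illustration, not the workshop's own code: the CSV text, the column names, and the missing value are made up so the example runs without an external file. Visualization is left out here because it needs a plotting library such as matplotlib.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example needs no external file.
# The last row has a missing employee count on purpose.
raw_csv = """department,employees
IT,20
Special Collections,13
Building Services,
"""

# Import: read the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(raw_csv))

# Cleaning: drop rows where the employee count is missing.
clean = df.dropna(subset=["employees"])

# Analysis: total employees across the remaining departments.
total_employees = clean["employees"].sum()

# Export: write the cleaned table back out as CSV text.
exported = clean.to_csv(index=False)

print(total_employees)
print(exported)
```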

Other Popular Statistical Software

  • Open source software: R
  • Proprietary software: Mathematica, MATLAB, SAS, SPSS

Comparison between Pandas and R

  • Syntax is often similar
  • Python has packages for data analysis, but R has a larger ecosystem
  • Python can be more efficient in non-statistical areas, such as text analysis
  • Both are useful and complement each other

Data Structures in Pandas

Series and DataFrames

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

Example of Series

Data on different occupations

0 Librarian
1 Software Developer
2 Engineer
3 IT Support Specialist
4 Research Data Coordinator
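
A minimal sketch of how the Series above could be built. The variable name `occupations` is our own choice; when no index is given, pandas assigns the default integer labels 0 to 4 shown in the example.

```python
import pandas as pd

# Build a Series from a plain Python list; pandas adds the 0..4 index labels.
occupations = pd.Series([
    "Librarian",
    "Software Developer",
    "Engineer",
    "IT Support Specialist",
    "Research Data Coordinator",
])

print(occupations)

# Look up a value by its index label.
first = occupations.loc[0]
print(first)
```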

DataFrame

A DataFrame is a tabular data structure composed of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. It can be thought of as a group of Series objects, one per named column, that share an index (the row labels).

Example of DataFrame

Data about different departments within a library

   Department Name                    Number of Employees  Number of Machines
0  Digital Library Initiatives (DLI)                   12                  14
1  IT                                                  20                  25
2  Special Collections                                 13                  10
3  Building Services                                    2                   1
4  Research Data Management                             2                   3
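
A minimal sketch of how the table above could be built, assuming we construct it from a dictionary of columns. The variable names are our own. Selecting one column returns a Series that shares the DataFrame's index.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column's values.
departments = pd.DataFrame({
    "Department Name": [
        "Digital Library Initiatives (DLI)",
        "IT",
        "Special Collections",
        "Building Services",
        "Research Data Management",
    ],
    "Number of Employees": [12, 20, 13, 2, 2],
    "Number of Machines": [14, 25, 10, 1, 3],
})

print(departments)

# A single column is a Series with the same row labels as the DataFrame.
employees = departments["Number of Employees"]
print(type(employees))
```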

About Jupyter Notebook

What is a Jupyter Notebook?

  • Jupyter Notebooks are interactive documents that can contain both code and rich text elements.
  • Runs Python code and renders Markdown by default.
  • This literate programming approach makes it easier to share code, and is popular in scientific computing and data science.

Brief History

  • Originally called IPython notebooks
  • Developed in 2011 by a team of researchers led by Fernando Pérez from UC Berkeley and Brian Granger from California Polytechnic State University.
  • Later renamed Jupyter Notebook to reflect its support for other programming languages, such as Julia and R.

Why use a Jupyter Notebook?

  • Free and open source - widely used
  • Ease of sharing
  • Literate programming document - ability to incorporate rich text with code
  • Reproducibility

Installation instructions

  • Jupyter can be installed easily on your local machine using Anaconda.
  • Experienced Python users can use pip (pip install jupyter)
  • Detailed instructions can be found in the Jupyter Documentation

Cells in a Jupyter Notebook

  • Each section of a Jupyter notebook is called a cell.
  • A cell can typically contain either Python code or Markdown.
  • Cells make it easy to execute a specific portion of your code and to alternate between code and documentation.

Overview of first hands-on section

In the first section we will talk about different data types, how to import data, and some basic functions to explore your data.
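
As a rough preview of the kind of exploration involved, here is a minimal sketch. The CSV text and column names are made up for illustration and are not the workshop's dataset; the notebook exercises use their own files.

```python
import io

import pandas as pd

# Hypothetical data, inlined so the example runs without an external file.
csv_text = """name,employees,machines
IT,20,25
Special Collections,13,10
Building Services,2,1
"""

# Import the data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Data type of each column.
print(df.dtypes)

# First two rows.
print(df.head(2))

# Number of rows and columns.
print(df.shape)

# Summary statistics for the numeric columns.
summary = df.describe()
print(summary)
```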

Pandas integrates with many Python libraries for related tasks. For example, in this workshop we will use matplotlib and ggplot for visualization.

Lab Setup

  1. Visit github.com/NCSU-Libraries/pandas-code4lib-2017
  2. Click on the "launch binder" badge at the top of the README.

Alternative Lab Setup

  1. Visit www.continuum.io/downloads and download the appropriate Anaconda package for your OS.
  2. Visit github.com/NCSU-Libraries/pandas-code4lib-2017 and clone or download the repo.
  3. Using your favorite shell, run `jupyter notebook` from within the repo directory.

Next session:

Overview of Technical Orientation