Knowledge is the Only Good

Effective Python: Importing Data into Pandas

programming
Python
effective python
Author

Stephen J. Mildenhall

Published

2022-03-11

Problems arise when computers interact with the real world. You step out of a controlled environment into one with bizarre, uncontrolled user inputs. Even structured (read: tabular) real-world data can cause problems when it is not strictly structured. In this post, I discuss importing a simple dataset into pandas. I’ve chosen something awkward and representative of data in the wild, but very short, so we can see exactly what is going on.

Data

Here is the data. There are only four rows.

Date    Activity    Steps   Distance    Duration    Calories
Mar 4, 12:52    Walk    6,030   3.42 miles  51:14   424 cals
Mar 3, 13:34    Walk    5,833   3.39 miles  49:33   415 cals
Mar 2, 14:51    Walk    5,936   3.44 miles  50:41   413 cals
Mar 2, 13:03    Workout    337   N/A     21:47   92 cals

The data includes

  1. A header row
  2. A date and time in column 1
  3. Steps as a number, but formatted with a comma separator
  4. Distance as a number, but including the miles suffix, and N/A in one row
  5. A time duration in minutes and seconds
  6. Another number including a cals suffix
  7. Columns separated by two to four spaces
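To see why these features matter, here is a minimal sketch of a naive import. It assumes the table has been pasted into a string (the variable name `raw` is just illustrative) and that a run of two or more spaces always marks a column boundary, which holds for this sample but may not hold for other pasted tables.

```python
import io

import pandas as pd

# The sample table, pasted as text. Single spaces occur inside fields
# ("Mar 4, 12:52", "3.42 miles"), so only runs of 2+ spaces split columns.
raw = """\
Date    Activity    Steps   Distance    Duration    Calories
Mar 4, 12:52    Walk    6,030   3.42 miles  51:14   424 cals
Mar 3, 13:34    Walk    5,833   3.39 miles  49:33   415 cals
Mar 2, 14:51    Walk    5,936   3.44 miles  50:41   413 cals
Mar 2, 13:03    Workout    337   N/A     21:47   92 cals
"""

# A regular-expression separator needs the slower Python parsing engine.
df = pd.read_csv(io.StringIO(raw), sep=r"\s{2,}", engine="python")

# Inspect what pandas inferred for each column.
print(df.dtypes)

# Check whether any column was recognised as numeric.
numeric_columns = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
print(numeric_columns)
```

With this sample, the comma in Steps and the suffixes on Distance and Calories should stop every column from being inferred as numeric, and the date and duration should come back as plain text. That is the problem the methods below need to solve.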

This type of data could result from cutting and pasting a table from a document, PDF, or web page.

Let’s look at several different ways to convert this data into a pandas dataframe. At the end, we’ll discuss the pros and cons of each method.

Method 1: Manual
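A manual conversion could look something like the following sketch, which is an illustration rather than a definitive implementation. It splits each line on runs of two or more spaces and converts every field by hand. The helper name `parse_row` is hypothetical, and the year 2022 is an assumption: the data itself carries no year.

```python
import re
from datetime import datetime, timedelta

import pandas as pd

raw = """\
Date    Activity    Steps   Distance    Duration    Calories
Mar 4, 12:52    Walk    6,030   3.42 miles  51:14   424 cals
Mar 3, 13:34    Walk    5,833   3.39 miles  49:33   415 cals
Mar 2, 14:51    Walk    5,936   3.44 miles  50:41   413 cals
Mar 2, 13:03    Workout    337   N/A     21:47   92 cals
"""


def parse_row(line, year=2022):
    """Convert one data line into a dict of typed values.

    The year is an assumption; the source rows give only month and day.
    """
    date, activity, steps, distance, duration, calories = re.split(
        r"\s{2,}", line.strip()
    )
    minutes, seconds = duration.split(":")
    return {
        "Date": datetime.strptime(f"{year} {date}", "%Y %b %d, %H:%M"),
        "Activity": activity,
        # Drop the thousands separator before converting.
        "Steps": int(steps.replace(",", "")),
        # Keep the number before the "miles" suffix; N/A becomes NaN.
        "Distance": float("nan") if distance == "N/A" else float(distance.split()[0]),
        "Duration": timedelta(minutes=int(minutes), seconds=int(seconds)),
        # Keep the number before the "cals" suffix.
        "Calories": int(calories.split()[0]),
    }


lines = raw.strip().splitlines()
# Skip the header row; the dict keys supply the column names.
df = pd.DataFrame([parse_row(line) for line in lines[1:]])
print(df.dtypes)
```

Writing the conversions out this way makes each assumption about the data visible, at the cost of code that is tied to this exact layout.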

Stephen J. Mildenhall. License: CC BY-SA 2.0.

 
