Effective Python: Importing Data into Pandas
Problems arise when computers interact with the real world. You step out of a controlled environment into one with bizarre, uncontrolled user inputs. Even structured (read: tabular) real-world data can cause problems when it is not strictly structured. In this post, I discuss importing a simple dataset into pandas. I’ve chosen something awkward, representative of data in the wild, but very short, so we can really see what is going on.
Data
Here is the data. There are only four rows.
Date  Activity  Steps  Distance  Duration  Calories
Mar 4, 12:52  Walk    6,030  3.42 miles  51:14  424 cals
Mar 3, 13:34  Walk    5,833  3.39 miles  49:33  415 cals
Mar 2, 14:51  Walk    5,936  3.44 miles  50:41  413 cals
Mar 2, 13:03  Workout  337    N/A    21:47  92 cals
The data includes:
- A header row
- A date and time in column 1
- Steps as a number, but formatted with a comma separator
- Distance as a number, but including the miles suffix
- A time duration in minutes and seconds
- Another number including a cals suffix
- Columns separated by two to four spaces
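To make the list above concrete, here is a minimal sketch in plain Python that takes one row apart by hand. The exact spacing in `row` is an assumption (the post only says two to four spaces), and the variable names are illustrative. It shows why the separator matters: splitting on any whitespace also breaks up the fields that contain a single space.

```python
import re
from datetime import timedelta

# One row of the data; the exact column spacing here is assumed.
row = "Mar 3, 13:34  Walk    5,833  3.39 miles  49:33  415 cals"

# Splitting on any whitespace tears apart "Mar 3, 13:34", "3.39 miles", "415 cals".
naive = row.split()

# Splitting on runs of two or more whitespace characters keeps each field whole.
fields = re.split(r"\s{2,}", row)
date_text, activity, steps_text, distance_text, duration_text, calories_text = fields

# Steps: drop the thousands separator before converting.
steps = int(steps_text.replace(",", ""))

# Distance and calories: strip the unit suffix before converting.
distance = float(distance_text.replace(" miles", ""))
calories = int(calories_text.replace(" cals", ""))

# Duration: minutes and seconds, stored as a timedelta.
minutes, seconds = (int(part) for part in duration_text.split(":"))
duration = timedelta(minutes=minutes, seconds=seconds)

print(len(naive), len(fields))
print(steps, distance, duration, calories)
```

Two things are left alone here on purpose. The date has no year, so `date_text` stays a string, and the Workout row has `N/A` for distance, which `float` would reject, so it needs separate handling.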
This type of data could result from cutting and pasting a table from a document, PDF, or web page.
Let’s look at several different ways to convert this data into a pandas dataframe. At the end we’ll discuss pros and cons of each method.
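As a common starting point, the sketch below loads the text into a dataframe without converting anything, so every column stays as text. It is a baseline under stated assumptions, not necessarily one of the methods compared in this post: the sample text is retyped from the table above with the column spacing assumed, and the names `raw` and `df` are illustrative.

```python
import io

import pandas as pd

# The sample data retyped as a string; the exact column spacing is assumed.
raw = """\
Date  Activity  Steps  Distance  Duration  Calories
Mar 4, 12:52  Walk    6,030  3.42 miles  51:14  424 cals
Mar 3, 13:34  Walk    5,833  3.39 miles  49:33  415 cals
Mar 2, 14:51  Walk    5,936  3.44 miles  50:41  413 cals
Mar 2, 13:03  Workout  337    N/A    21:47  92 cals
"""

# Split columns on runs of two or more whitespace characters.
# A regular-expression separator needs the Python parsing engine.
# dtype=str keeps every value as text, so no conversion happens yet.
df = pd.read_csv(io.StringIO(raw), sep=r"\s{2,}", engine="python", dtype=str)

print(df.shape)
print(df.loc[1, "Steps"])
print(df["Distance"].isna().tolist())
```

Even with `dtype=str`, pandas treats `N/A` as a missing value by default, so the Workout row's distance arrives as a missing value and not as the string `N/A`. The comma in Steps and the `miles` and `cals` suffixes are still there and have to be cleaned up in a later step.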