Effective Python: Dates
Problems arise when computers interact with the real world. You step out of a controlled environment into one with user inputs, paper jams, invalid filenames, and dates. Dates are on the front-line of man vs. nature. Are there 365 days in a year? Not quite. 365.25? 365.2424? Closer, but no1. When is noon? Daylight savings time? Which time zone?
Data Scientists work with dates and, if you’re like me, they often cause irritation and frustration. Here are some tips and tools I use to make working with dates easier. They are not a comprehensive treatment; just enough information to understand the problem and get the job done. And, at this point, let’s stipulate that any inevitable jokes about date’s double meaning are satisfied!
What is a date?
What do we mean by the date January 1, 2022? There are two interpretations. It can mean
- A specific day-long time or
- A single point-in-time.
Events that occur on January 1 use the first. The second is meant when we say that a contract is effective on January 1: it comes into effect at a specific point during the day, usually at midnight (the start of the day) or an instant after midnight to avoid quibbles about which day contains midnight. Keeping these two interpretations clear is critical to dealing with dates.
Next, we come to a critical question:
What is the expiration date of a one-year contract effective January 1, 2022?
Is it December 31, 2022, or January 1, 2023? The expiration date means the point-in-time at which the contract ceases to apply—the second interpretation. A one-year contract must cover December 31. It expires at the end of December 31, i.e., at the very first instant (midnight) of January 1, 2023. Stick to this convention, and you win 50% of your battles with dates.
In the rest of this post, date means point-in-time. If no time is given, it is understood to be the start of the day. Sometimes we are given a date and time, such as 11:30 am on January 1. Again, we interpret this as a point in time. It means 11:30:00.000 even if the seconds are not specified. If we are merely given 11 o’clock, we mean 11:00 exactly. We’ll call both a date and a date-and-time just a date.
Side-bar As a mathematician, I have to point out the similarity between dates and the old saw “Are 0.9999… and 1 the same number?” For us: is the end of December 31 the same moment as the start of January 1? We deem that point-in-time to occur on January 1, but it’s the same point in time as the last moment on December 31. [And the math answer: also yes, because \(0.9999\dots=0.9 \times (1 + 0.1 + 0.001 +\cdots) = 0.9 \times (1 - 0.1)^{-1} =1\) by the formula for the sum of a geometric series.]
The ISO date standard
Depending on where you live, 03/02/2022 can mean February 3 or March 2. Neither interpretation is ideal because neither sorts in the correct order. The ISO Date Standard2 solves this problem by writing dates YYYY-MM-DD. Everyone should use it!
What is a date to a computer?
Now we know what we mean by date, we can ask what a computer means. The happy answer is: we don’t care! Provided the computer handles certain things for us (which it always does), the details of its internal book-keeping are irrelevant. It’s the same with numbers: you don’t care how the computer stores a float. We require that the computer:
- Allows us to create dates in some reasonable way by specifying the year, month, day, hour, minute, second, fractions of a second,
- Displays dates for us in human-readable form,
- Computes the difference between two dates, creating a time delta, and
- Performs basic arithmetic with dates and time deltas.
Let’s assume the computer creates a date using =date(Y, M, D, h, m, s, fs)
, where all entries except the last (fractions of a second) are integers. Only YMD
are required; if hms.fs
are missing, they are 0 (midnight, again).
We can create arbitrary representations of dates as real numbers with this functionality. For example, if \(d\) is a date, then \[ r_1 = \frac{d - \mathsf{date}(1900, 1, 1)}{\mathsf{date}(1900, 1, 2) - \mathsf{date}(1900, 1, 1)} \] represents \(d\) as the number of days since January 1, 1900. By changing the denominator, we readily obtain a representation in seconds (remember missing arguments are zero) \[ r_2 = \frac{d - \mathsf{date}(1900, 1, 1)}{\mathsf{date}(1900, 1, 1, 0, 0, 1) - \mathsf{date}(1900, 1, 1)}. \] That’ll be a big number. We could pick a more recent reference time, like January 1, 1970 \[ r_3 = \frac{d - \mathsf{date}(1970, 1, 1)}{\mathsf{date}(1970, 1, 1, 0, 0, 1) - \mathsf{date}(1970, 1, 1)}. \]
As I write this, on February 4, 2022, the three representations are
- \(r_1 = 44594.593127378765\) says since 1900-01-01
- \(r_2 = 3852972862.741849 \approx 3,852,972,862.7\) seconds since 1900-01-01, and
- \(r_1 = 1643984129.762957 \approx 1,643,984,129.8\) seconds since 1970-01-01.
UNIX/Linux systems use the 1970 base and work in seconds so that you may see “dates” as numbers around 1.65 billion (seconds since 1970). Other systems, catering to the financial market’s needs, work in nanoseconds, giving numbers like 1,643,984,129,762,957,056. Excel claims to use days since 1900-01-01. Right now, typing =NOW()*1
into a spreadsheet gives 44,596.59684, which is two days more than \(r_1\)3. Two things cause the difference:
- Excel, following Lotus 1-2-34, assumes that 1900 was a leap year. It wasn’t. One ahead for Excel.
- Excel thinks of 1900-01-01 as time \(t=1\) (try it:
=DATE(1900,1,1)*1
gives 1), not \(t=0\) required by our point-in-time interpretation. (This is analogous to saying a newborn baby is age 1, not age 0. It is in its first year of life and achieves age 1 at the end of the year.) Two ahead for Excel.
The Julian day5 system takes November 24, 4714 BC as the base and counts in days.
In all these systems, the different units are easily interchangeable. It is the interpretation of the numbers that is critical. Remember, point-in-time. Working in units of one day, the picture looks like this:
Using the same diagram but in units of years, year 1 of a contract spans from \(t=0\) to \(t=1\), year 2 from \(t=1\) to \(t=2\), and so forth.
Epoch converter provides a tool to convert different date representations.
Dates in Python
Dates are not a built-in Python data type. Date functionality is provided by the datetime
standard library and as well as by some pandas
and NumPy
functions. Dates can be created and manipulated much as you’d expect.
from datetime import datetime
datetime.now()# datetime.datetime(2022, 2, 4, 14, 48, 53, 133941)
- datetime(1970, 1, 1)) / (datetime(1970,1,1, 0, 0, 1) - datetime(1970, 1, 1))
(datetime.now() # 1643986180.155806
The odd looking import reflects that the function now
is a method of the class datetime
defined in the datetime
library.
Printing dates directly is arcane, requiring the function strftime
derived from C (string from time). A set of hard-to-remember coding strings, described in the documentation, controls the output
"%a, %d %b %Y %H:%M:%S")
datetime.now().strftime(# 'Fri, 04 Feb 2022 14:53:26'
The good news is that pandas
mostly insulates you from these codes.
Pandas and dates
Pandas integrates well with the datetime
library, and provides human-formatted output by default
import pandas as pd
'x': datetime.now()}, index=[1]) pd.DataFrame({
renders like so in Jupyter Lab
x | |
---|---|
1 | 2022-02-04 14:56:51.417745 |
It is essential to be able to read dates into a DataFrame and convert them into Python date types. Good news again: most of the time, it is easy. The read_csv
function allows you to specify columns (or collections of columns, if the date data is separated) to parse as dates. Here is an example. (StringIO
is used to create a file-stream-like object to give to read_csv
.)
data = '''idx,eff_date,poss_date
A,10-01-2020,1.6435e+18
B,12-21-2020,1.6437e+18
C,04-15-2021,1.6438e+18
D,02-03-2022,1.6439e+18'''
from io import StringIO
sio = StringIO(data)
df = pd.read_csv(sio, parse_dates=['eff_date'])
df
idx | eff_date | poss_date | |
---|---|---|---|
0 | A | 2020-10-01 | 1.6435e+18 |
1 | B | 2020-12-21 | 1.6437e+18 |
2 | C | 2021-04-15 | 1.6438e+18 |
3 | D | 2022-02-03 | 1.6439e+18 |
Calling df.dtypes
reveals that eff_date
is stored as datetime64[ns]
: nanoseconds since 1970—so it is date-aware. Without specifying parse_dates=['eff_date']
it is imported as an object
. The last column we suspect is a date. We can use pd.to_datetime
to convert it:
'new_date'] = pd.to_datetime(df.poss_date) df[
The first row now appears
idx | eff_date | poss_date | new_date | |
---|---|---|---|---|
0 | A | 2020-10-01 | 1.6435e+18 | 2022-01-29 23:46:40 |
If we know that the last column represents a date time stamp in UTC6 we can convert into a local (US/Eastern for me) time using
'new_date2'] = pd.to_datetime(df.poss_date, utc=True).dt.tz_convert('US/Eastern') df[
(the middle .dt
exposes date functions) to yield
idx | eff_date | poss_date | new_date | new_date2 | |
---|---|---|---|---|---|
0 | A | 2020-10-01 | 1.6435e+18 | 2022-01-29 23:46:40 | 2022-01-29 18:46:40-05:00 |
The help on read_csv
and to_datetime
explains many other options. But for me, to_datetime
and the timezone conversion usually gets the job done.
Footnotes
The Julian calendar (Julius Caesar) assumed 365.25 days per year: 3 regular years followed by a leap year. In 1582, Pope Gregory XIII introduced the Gregorian calendar, with 365.2425 days per year (divisible by 100 is not a leap year—1800, 1900—except divisible by 400 is—2000). The actual solar year has 365.24219 days. Because the Julian year was too “long”, by about 1 day per 100 years, fixed solar events such as the winter solstice move earlier in the year. The Gregorian calendar included a ten day catch-up, 4 October 1582 was followed by 15 October.↩︎
Multiplying 44,596.59684 by 3600 to convert to seconds gives 3,853,146,026, which is 173,163 seconds more than \(r_2=\) 3,852,972,863, almost exactly two days more.↩︎
See the Microsoft article: Excel incorrectly assumes that the year 1900 is a leap year.↩︎
One other date representation you might meet is the Julian day: the number of whole days since noon Monday, January 1, 4713 BC in the proleptic (OED, adj. projected backwards) Julian calendar or, equivalently, November 24, 4714 BC, in the Gregorian calendar. This date marked the start of three multi-year cycles, and it precedes any dates in recorded history. SQLite uses Julian days.
Python
datetime
provides the functiondatetime.toordinal
to counts days since Gregorian January 1 of the first year, i.e., 0001-01-01. Currently,datetime.now().toordinal()=738190
. Adding 1,721,424.5 gives Julian day, 2,459,614.5.↩︎
Comments
To comment on this post please go over the Reddit thread.