Effective Python: Files
If you write a program of any length, you will likely have to manipulate files and filenames. Common file-related tasks include:
- Create (write) a new file,
- Read the contents of an existing file,
- Read file attributes, such as size or modified date,
- Determine the drive, directory, or file type from a file path,
- Rename a file or change its extension,
- Determine if a file exists, and
- List or search the contents of a directory.
Files have a physical existence that makes them fussy, especially allowing for differences across operating systems. Windows-style c:\dir_name\filename.suf
name vs. Linux //mnt/c/dir_name/filename.suf
is a case-in-point. Python has a great utility class called Path
that simplifies many of these operations. Path
belongs to pathlib
, a standard Python library.
This post shows how to accomplish several everyday file-related tasks using Path
. File tasks can act on
- Strings that can be interpreted as filenames,
- Files, or
- Directories.
Let’s examine each in turn.
1. Working with Filenames
To get started, let’s load and create a Path
from pathlib import Path
= Path('temp\\file.aux') p
We have created a Path
object representing the name of a file. On Windows p
is WindowsPath('temp/file.aux')
and on Linux PosixPath('temp/file.aux')
, but these two can both be treated like Path
objects in most situations. You can use \\
or /
as a separator.
At this stage, it does not matter whether the file temp/file.aux
exists or not. Regardless, several operations on p
are possible:
- The filename:
p.name
returns the stringfile.aux
. - The suffix (extension, type):
p.suffix
returns the string.aux
. Note that the period is included in the suffix. - The parent (folder, directory):
p.parent
returns anotherPath
object to the directorytemp
. - All the parts:
p.parts
returns the tuple('temp', 'file.aux')
.
Exercise: run these four commands on q = Path('c:\\temp\\subfolder\\file.aux')
.
Path
allows filenames to be manipulated intuitively.
q = p.with_name('myfile.aux') # creates Path('temp/myfile.aux')
r = p.with_suffix('.bak') # creates Path('temp/file.bak')
s = Path('c:\\users\\steve') / p # creates Path('c:/users/steve/temp/file.aux')
The last method is particularly useful. We can also combine Path
objects and strings
= Path('c:\\users\\steve') / 'folder1/folder2/new_file.xlsx'
t
t# WindowsPath('c:/users/steve/folder1/folder2/new_file.xlsx')
Files are often relative to your home directory. Path
includes the Path.home()
function as a shortcut
= Path.home() / 'folder1/folder2/new_file.xlsx'
t
t# WindowsPath('c:/users/steve/folder1/folder2/new_file.xlsx')
Be careful not to make the mistake of including an extra /
= Path.home() / '/folder1/folder2/new_file.xlsx'
t
t# WindowsPath('C:/folder1/folder2/new_file.xlsx')
Python won’t complain, but you don’t get what you want.
So far Path
is just manipulating names. The file does not have to exist.
2. Working with Files
The function p.exists()
returns True
if there is a file temp\\file.aux
, and p.resolve()
gives its full name. The handy function touch
creates an empty file (Windows right-click, New File). We can use it to illustrate.
= Path('some_new_file.wxyz')
p
p.exists()# False
=False)
p.touch(exist_ok
p.exists()# True
p.resolve()# WindowsPath('c:/users/steve/.../some_new_file.wxyz')
The function stat
accesses file attributes:
p.stat()# os.stat_result(st_mode=33206, st_ino=21955048184252520, st_dev=2829893387, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1643904374, st_mtime=1643904374, st_ctime=1643904374)
I don’t want to get into operating system details and dealing with dates (see [ref to dates] (/blog?id=xxx)). Instead, here is a function to humanize the results1, notably converting the dates.
import stat
import pandas as pd
def human_stat(p):
"""
Return human-readable stat information on Path p.
"""
if p.exists() is False:
print(f'{p} does not exist.')
return
= p.resolve()
p = p.stat()
s return pd.DataFrame(dict(zip(
'name', 'stem', 'suffix', 'drive', 'parent',
('create_date', 'modify_date', 'access_date',
'size', 'mode', 'links'),
str(p.parent),
[p.name, p.stem, p.suffix, p.drive, ='s'),
pd.to_datetime(s.st_ctime, unit='s'),
pd.to_datetime(s.st_mtime, unit='s'),
pd.to_datetime(s.st_atime, unit
s.st_size, stat.filemode(s.st_mode), s.st_nlink])),=pd.Index([s.st_ino], name='ino')) index
The stat
library interprets the read, write, and execute permissions encoded in st_mode
2. Running human_stat
on p
yields
21955048184252520 | |
---|---|
name | some_new_file.wxyz |
stem | some_new_file |
suffix | .wxyz |
drive | C: |
parent | C:\… |
create_date | 2022-02-03 16:06:14.263677184 |
modify_date | 2022-02-03 16:06:14.263677184 |
access_date | 2022-02-03 16:06:14.263677184 |
size | 0 |
mode | -rw-rw-rw- |
links | 1 |
The function with_name
creates a new Path
object but does not rename any files. If you want to do that use rename
:
= Path('newfile.txt')
p
p.exists()# False
p.touch()
p.exists()# True
'another.txt').exists()
Path(# False
= p.rename('another.txt')
q
q.exists()# True
p.exists()# False
If you want to copy a file, you need to use the shutil
(shell utilities) library. Copying a file involves some tricky choices (so you want to copy attributes?) that I don’t want to discuss. However, if you want to duplicate a file, you can use link_to
, which creates another name for an existing file.
= Path('duplicate_another.txt')
r
r.exists()# False
q.link_to(r)
r.exist()# True
Running human_stat
on q
and r
reveals they have the same identifier (reference the same bits on the hard drive), and each returns links==2
. Run r.unlink()
to delete a link.
Linking allows you to solve an perennial file organization problem (for consultants): I want presentations sorted by client, but I would also like all my presentations together. Creating a link allows you to do this without consuming extra disk space. And, because links share the same data on disk, a change to either file is automatically reflected in the other. I wish I had known this twenty years ago.
Path
supports reading and writing to files in two ways. It can open a file, like the open
function, for subsequent read-write:
with q.open('w', encoding='utf-8') as f:
'text written to file another.txt')
f.write(
with q.open('r', encoding='utf-8') as f:
print(f.read())
# 'text written to file another.txt'
print(q.stat().st_size)
# 32
Or, it can read or write directly:
'newfile.txt').write_text('text written to a new file')
Path(print(Path('newfile.txt').read_text())
# 'text written to a new file'
There are read_bytes
and write_bytes
functions for binary data.
3. Working with Directories
Finally, there are some special functions for directories. p.is_dir()
returns True
if p
refers to a directory. As we’ve already mentioned, Path.home()
returns your home
directory. Path.cwd()
returns the current working directory. mkdir
creates a path and is particularly useful if you are trying to write to a file but don’t know if its parent directory exists. It even creates intermediate directories.
= Path.cwd() / 'b/c/d/newfile.txt'
p open('w')
p.# FileNotFoundError
=True, exist_ok=True) p.parent.mkdir(parents
The call to mkdir
creates ~/b
, ~/b/c
, and ~/b/c/d
if they do not exist, but it doesn’t complain if they do. Now, the call p.open('w')
succeeds.
There are two handy functions to access files in a directory: iterdir
iterates over all files, and glob
finds files matching a pattern. To print the name and size of each file in the home directory:
for f in Path.home().iterdir():
if f.is_file():
print(f'{f.stat().st_size:9d}\t{f.name}')
To print the name and last accessed time of all Markdown files:
for f in Path.cwd().glob('*.md'):
if f.is_file():
print(f"{pd.to_datetime(f.stat().st_atime, unit='s'):%Y-%m-%d %H:%M:%S}\t{f.name}")
See my post on [dates] (/blog?id=4f205412b12e4ca84a8f2444fb782aac) for more about formatting dates.
The glob
function can recursively search all subdirectories using p.glob('**/*.md')
. You can also use wildcards to filter the results. The call p.glob('*.xls?')
matches all Excel suffices, xlsx, xlsm, xlsa and so forth. Patterns and wildcards can be combined: p.glob('Client_*.xls?')
matches all Excel filenames beginning Client_
.
Conclusion
I find Path
does everything I want with files using a simple, consistent interface. If you work with files, it is worth investing the time to understand its capabilities.
Notes and References
Footnotes
human_stat
usespandas
to provide nicely formatted output. It returns a row rather than a column so that each variable has the correct data type, which is useful when applied to several files. The table above is the transpose of the returned row. See my post [Introspection with Pandas] (/blog?id=b722b27b32f8855c39569a5c35719907) for more.↩︎-rw-rw-rw-
expands into three groupsrw-
of read, write, and execute permissions for the owner, the owner’s group, and all others. The first character is ad
for a directory.↩︎