Skip to content

gh-57911: Sanitize symlink targets in tarfile on win32#138309

Merged
encukou merged 6 commits into
python:mainfrom
wiomoc:fix-tar-extract-symlink-win32-path
Sep 5, 2025
Merged

gh-57911: Sanitize symlink targets in tarfile on win32#138309
encukou merged 6 commits into
python:mainfrom
wiomoc:fix-tar-extract-symlink-win32-path

Conversation

@wiomoc

@wiomoc wiomoc commented Aug 31, 2025

Copy link
Copy Markdown
Contributor

@wiomoc wiomoc force-pushed the fix-tar-extract-symlink-win32-path branch from 5956b5d to 2ccc462 Compare August 31, 2025 23:08
@wiomoc wiomoc marked this pull request as ready for review August 31, 2025 23:34
@wiomoc wiomoc requested a review from ethanfurman as a code owner August 31, 2025 23:34
@encukou

encukou commented Sep 1, 2025

Copy link
Copy Markdown
Member

Should this be done for hardlinks as well?

I'd prefer saying “rewrite” rather than “sanitize”, as this is not fixing unsafe input.

We should probably skip this for leading // -- that would turn symlink targets into UNC paths, where I'm not sure of the security implications, and anyway it's not something you'd find in a portable UNIX-based tarball.

@wiomoc

wiomoc commented Sep 1, 2025

Copy link
Copy Markdown
Contributor Author

Should this be done for hardlinks as well?

This problem doesn't seam to occur when creating hardlinks:

>>> os.mkdir("tmp")
>>> os.mkdir("tmp\\child")
>>> open("tmp\\child\\test.txt", "w").write("hello world")
11
>>> os.link("tmp/child/test.txt", "tmp/testlink.txt")
>>> open("tmp\\testlink.txt").read()
'hello world'

We should probably skip this for leading // -- that would turn symlink targets into UNC paths, where I'm not sure of the security implications, and anyway it's not something you'd find in a portable UNIX-based tarball.

On the one hand, I see the risk of security implications, but I also note that in pathlib, slashes are also replaced in UNC paths.

@encukou encukou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK; re-reading the duscussion I see experts suggesting to always replace, so let's go with that.

Comment thread Lib/test/test_tarfile.py
@wiomoc wiomoc requested a review from encukou September 2, 2025 20:43
@encukou encukou added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 3, 2025
@bedevere-bot

Copy link
Copy Markdown

🤖 New build scheduled with the buildbot fleet by @encukou for commit 942b6e3 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F138309%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 3, 2025
Comment on lines +1 to +2
When extracting tar files on Windows Posix flavoured path separators in symlink
targets will be replaced by backward-slashes to prevent corrupted links.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When extracting tar files on Windows Posix flavoured path separators in symlink
targets will be replaced by backward-slashes to prevent corrupted links.
When extracting tar files on Windows, slashes in symlink
targets will be replaced by backslashes to prevent corrupted links.

@encukou

encukou commented Sep 3, 2025

Copy link
Copy Markdown
Member

Looks good! Could you also add an entry to Doc/whatsnew/3.15.rst?

@wiomoc wiomoc requested a review from AA-Turner as a code owner September 3, 2025 15:39
@encukou encukou merged commit c1a9c23 into python:main Sep 5, 2025
45 checks passed
@encukou

encukou commented Sep 5, 2025

Copy link
Copy Markdown
Member

Thank you!

@picnixz

picnixz commented Sep 7, 2025

Copy link
Copy Markdown
Member

I think this broke some buildbots: https://buildbot.python.org/#/builders/914. It wasn't detected until #138276 (which was 20h ago, but this change is 2 days ago and the previous buildbot run was 3 days ago)

@AA-Turner

Copy link
Copy Markdown
Member

@picnixz if you open a revert PR as draft, you could use !buildbot to check this worker & see if it passes again?

A

@picnixz

picnixz commented Sep 7, 2025

Copy link
Copy Markdown
Member

I'm not on a dev session but I'll do it tomorrow if no one beats me to it.

@wiomoc

wiomoc commented Sep 7, 2025

Copy link
Copy Markdown
Contributor Author

The test only fails, when running on a Windows system with disabled symlink support - therefore it didn't occure on GH actions.
I have a possible fix in #138626

lkollar pushed a commit to lkollar/cpython that referenced this pull request Sep 9, 2025
@dnicolodi

Copy link
Copy Markdown
Contributor

Shouldn't the inverse normalization of path separator characters be done in tarfile.TarFile.add()? Otherwise, the symlink targets written in the tar archive are wrong when extracted on another platform (and I believe not conforming to the standard). See #151669.

@encukou

encukou commented Jun 19, 2026

Copy link
Copy Markdown
Member

I believe not conforming to the standard

Which standard are you referring to?

Shouldn't the inverse normalization of path separator characters be done in tarfile.TarFile.add()?

Perhaps, but, I'm still fixing issues from the last time I changed long-standing behaviour in tarfile.
The change needs far stronger justification than I can see currently.

@dnicolodi

Copy link
Copy Markdown
Contributor

Which standard are you referring to?

The tar format definition. I haven't checked the format definition (and I don't even know whether it exists), but all paths in a tar file are usually stored with / as the path separator (and this is what the tarfile module implements, with one exception). The current behavior of tarfile.TarFile.add() is to use native path separators for symlink targets. On Windows, this result in tar files that produce wrong symbolic links when extracted on POSIX platforms.

Perhaps, but, I'm still fixing issues from the last time I changed long-standing behaviour in tarfile.

Think to this as more fallout from these changes. The normalization has always been missing, but it broke my application only since this PR has been merged (and Windows support for symlinks is more widespread: I would not have noticed it if not for the fact that GitHub Actions Windows runners have support for symlink enabled).

The change needs far stronger justification than I can see currently.

tar archives produced by code like

os.symlink("../baz", "foo")
tar = tarfile.open("sample.tar", "w")
tar.add("foo")
tar.close()

are currently invalid when created on Windows. In other words, they produce the wrong filesystem objects when extracted on a platform that uses / as path separator.

Maybe code like this can be fixed specifying a filter function that fixes the path separators on Windows. However, the exact same argument could have been made for the change in this PR. On the other hand, I don't see a way to fix tar archives produced by shutil.make_archive().

I think that having the normalization only on extract is the worst possible outcome as it introduces unnecessary asymmetry: users of the tarfile module need special treatment of symlinks on Windows when creating archives, but not when extracting them.

@encukou

encukou commented Jun 19, 2026

Copy link
Copy Markdown
Member

Which standard are you referring to?
The tar format definition. I haven't checked the format definition (and I don't even know whether it exists)

Could you check, then? An actual reference to the standard would make this an easy sell -- that would mean that relevant people already thought about this.
Talking about imaginary standards, on the other hand, make me think I need to consult a tarfile expert on this -- or become one (i.e. find all the caveats and consequences myself).

@bonzini

bonzini commented Jun 19, 2026

Copy link
Copy Markdown

The tar standard is a POSIX standard. POSIX does not make backslashes special in any way; they are a valid character in a file name. Therefore, POSIX treats a link name of a\b as a symbolic link to the file a\b in the current directory.

For example, https://man.freebsd.org/cgi/man.cgi?query=tar&apropos=0&sektion=5&manpath=FreeBSD+7.0-RELEASE&arch=default&format=html has

       linkpath
	       The full path of the linked-to file.  Note that this is encoded
	       in UTF8 and can thus include non-ASCII characters.

and https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_271 defines

3.170 Filename

A sequence of bytes consisting of 1 to {NAME_MAX} bytes used to name a file. The bytes composing the name shall not contain the <NUL> or <slash> characters. In the context of a pathname, each filename shall be followed by a <slash> or a <NUL> character; elsewhere, a filename followed by a <NUL> character forms a string (but not necessarily a character string). The filenames dot and dot-dot have special meaning. A filename is sometimes referred to as a "pathname component". See also Pathname.

3.271 Pathname

A string that is used to identify a file. In the context of POSIX.1-2017, a pathname may be limited to {PATH_MAX} bytes, including the terminating null byte. It has optional beginning <slash> characters, followed by zero or more filenames separated by <slash> characters. A pathname can optionally contain one or more trailing <slash> characters. Multiple successive <slash> characters are considered to be the same as one <slash>, except for the case of exactly two leading <slash> characters.

@bonzini

bonzini commented Jun 19, 2026

Copy link
Copy Markdown

Found something in the POSIX standard as well (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_06), where directory entry type 2 "represents a symbolic link. The contents of the symbolic link shall be stored in the linkname field."

Within tar files, the contents of the symbolic link are best described in the linkpath extension record, which is defined as follows: "The pathname of a link being created to another file, of any type, previously archived. This record shall override the linkname field in the following ustar header block(s)."

In other words, the target of a symbolic link in a tar file should indeed be a pathname, and as such separated by forward slashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants