Explain why we get "latin1-encoded unicode strings" for paths

刘昶 wrote to the ftputil mailing list with the following observation:

Test environment:
Server: FileZilla Server 0.9.50
Client OS: Win7

import ftputil
# List the names in the login directory.
with ftputil.FTPHost("localhost", user="honglei", passwd="111111") as ftp_host:
    names = ftp_host.listdir(ftp_host.curdir)

I find that:
   names[-1] == u'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt',
it looks like a UTF-8-encoded filename rather than a unicode string.

schwa (unverified) 7 years ago

Thanks a lot for bringing this up.

Technically, this is a unicode string, but you're right in that you can see a UTF-8 encoding here.

When you use listdir in ftputil, it uses the standard library's ftplib to retrieve a directory listing. On Python 3, ftplib returns unicode strings. However, the socket only delivers bytes and ftplib doesn't know which encoding the server used, so it arbitrarily assumes latin1. Since latin1 maps every possible byte value to a character, decoding can never raise an exception.

ftputil passes on the strings returned by ftplib unchanged, so ftputil in turn gives you those "latin1-encoded" unicode strings: each character stands for one byte that the server sent.
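
To illustrate the mechanism, here is a minimal sketch that only simulates what happens to the bytes on the wire. It doesn't call ftplib or ftputil, and the variable names are made up for this example. It assumes Python 3 and a server that sends the file name from this ticket encoded as UTF-8.

```python
# Bytes as a UTF-8 server would send them for the file name in this ticket.
raw = "这是中文.txt".encode("utf-8")

# Decoding as latin1 maps each byte to the code point with the same value,
# so the result has one character per byte and decoding never fails.
as_seen = raw.decode("latin1")

# latin1 accepts every possible byte value, and the round trip is lossless.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin1").encode("latin1")
```

Here as_seen is the string reported above, and round_trip equals all_bytes, so no information is lost by the latin1 assumption.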

Since ftputil uses a unified API for Python 2 and 3, it applies the same unicode handling when run on Python 2.

If you know that the original encoding coming from the FTP server was UTF-8, you can recover the intended unicode string by reversing the latin1 decoding and then decoding the resulting bytes as UTF-8:

>>> s = u'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt'
>>> s.encode("latin1")
b'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt'
>>> s.encode("latin1").decode("utf8")
'这是中文.txt'

I guess this is the name you expected.

In the general case, i.e. if you don't know the server's encoding, you can still recover the original byte string by encoding the name with latin1.
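
As a rough sketch of how this could look in client code, the following helper always recovers the raw bytes and additionally tries a list of candidate encodings. Note that recover_name and its candidates parameter are hypothetical and not part of ftputil; the sketch assumes Python 3 and that the name came from ftplib with the latin1 assumption described above.

```python
def recover_name(name, candidates=("utf-8",)):
    """Hypothetical helper, not part of ftputil.

    Return a tuple (raw_bytes, text). text is the name decoded with the
    first candidate encoding that works, or None if none of them does.
    """
    # Undo the latin1 decoding to get the bytes the server actually sent.
    # This assumes every character in name is below U+0100, which holds
    # for strings that were produced by decoding bytes as latin1.
    raw = name.encode("latin1")
    for encoding in candidates:
        try:
            return raw, raw.decode(encoding)
        except UnicodeDecodeError:
            # Not valid in this encoding, try the next candidate.
            continue
    return raw, None


raw, text = recover_name(u'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt')
```

A successful decode doesn't prove that the guessed encoding is the right one, so the raw bytes are returned in any case.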

I plan to extend the ftputil documentation to clarify what's going on here.

schwa (unverified) 6 years ago

I fixed and expanded the documentation section "Directory and file names" in 21d9df0d26acf8a35c8950e86de37a438e1ae25c.
