When running the tests with tox for Python 3.9, I get
______________ TestOther.test_listdir_with_non_ascii_byte_string _______________
test/test_real_ftp.py:895: in test_listdir_with_non_ascii_byte_string
names = host.listdir(path)
ftputil/host.py:906: in listdir
items = self._stat._listdir(path)
ftputil/stat.py:825: in _listdir
return self.__call_with_parser_retry(self._real_listdir, path)
ftputil/stat.py:801: in __call_with_parser_retry
result = method(*args, **kwargs)
ftputil/stat.py:681: in _real_listdir
raise ftputil.error.PermanentError(
E ftputil.error.PermanentError: 550 /äbc: no such directory or wrong directory parser used
E Debugging info: ftputil 4.0.0, Python 3.9.0 (linux)
_____________ TestOther.test_listdir_with_non_ascii_unicode_string _____________
test/test_real_ftp.py:908: in test_listdir_with_non_ascii_unicode_string
names = host.listdir(path)
ftputil/host.py:906: in listdir
items = self._stat._listdir(path)
ftputil/stat.py:825: in _listdir
return self.__call_with_parser_retry(self._real_listdir, path)
ftputil/stat.py:801: in __call_with_parser_retry
result = method(*args, **kwargs)
ftputil/stat.py:681: in _real_listdir
raise ftputil.error.PermanentError(
E ftputil.error.PermanentError: 550 /äbc: no such directory or wrong directory parser used
E Debugging info: ftputil 4.0.0, Python 3.9.0 (linux)
______________ TestOther.test_path_with_non_latin1_unicode_string ______________
test/test_real_ftp.py:922: in test_path_with_non_latin1_unicode_string
self.host.mkdir(path)
E Failed: DID NOT RAISE <class 'UnicodeEncodeError'>
=========================== short test summary info ============================
FAILED test/test_real_ftp.py::TestOther::test_listdir_with_non_ascii_byte_string
FAILED test/test_real_ftp.py::TestOther::test_listdir_with_non_ascii_unicode_string
FAILED test/test_real_ftp.py::TestOther::test_path_with_non_latin1_unicode_string
================== 3 failed, 182 passed in 185.52s (0:03:05) ===================
ERROR: InvocationError for command /home/build/ftputil/.tox/py39/bin/python -m pytest test (exited with code 1)
___________________________________ summary ____________________________________
py36: commands succeeded
py37: commands succeeded
py38: commands succeeded
ERROR: py39: commands failed
I assume this is related to https://docs.python.org/3/whatsnew/3.9.html#changes-in-the-python-api :
The encoding parameter has been added to the classes ftplib.FTP and ftplib.FTP_TLS as a keyword-only parameter, and the default encoding is changed from Latin-1 to UTF-8 to follow RFC 2640.
As I'm quite serious about Semantic versioning, which ftputil officially follows since ftputil version 4.0.0., I don't want to break backward compatibility in the ftputil 4.x series.
This is my plan:
For the default session factory, instead of using
ftplib.FTP
directly, use something likeclass DefaultFTP(ftplib.FTP): encoding = "latin-1"
This change unfortunately has the effect that for Python 3.9, paths with non-latin-1 characters will be incompatible between ftplib.py and ftputil. On the other hand, setting the default encoding to "UTF-8" would be incompatible between ftplib.py and ftputil for Python 3.6, 3.7 and 3.8, which ftputil supports, too. Using a default of "UTF-8" or using the implicit default for the Python version would also break backward compatibility between ftputil 4.0.0 and the next version, which I don't want.
Users who want the Python 3.9 default behavior can pass its
ftplib.FTP
orftplib.FTP_TLS
as thesession_factory
argument toFTPHost
.When Python 3.8 reaches end of life, we can release a new ftputil 5.0.0, which uses UTF-8 as default encoding like Python 3.9's ftplib.py. Of course, we could reach version 5.0.0 sooner, but in my opinion it would be silly to switch to ftputil 5.0.0 to just change the session factory default encoding.
- In
ftputil.session.session_factory
, add a keyword argumentencoding
, with a default ofNone
. The semantics will be similar touse_passive_mode
:None
means don't passencoding
to thebase_class
constructor, i. e. use thebase_class
default. Otherwise the encoding must be a unicode string which will be passed to thebase_class
constructor.This change makes it relatively easy to adapt the wanted default encoding if client code already uses a factory made with
ftputil.session.session_factory
.
- Document the changes, with a link to this ticket to explain the background.
Version number change
Semver.org explains that the API should (primarily) be defined by the software documentation. Now, the ftputil documentation states the following:
- Paths created with ftputil should be the same as paths created with ftplib when using the same (unicode) string as input.
- Unicode string arguments will be encoded to latin-1 when sent to the FTP server.
- The subsection on session factories says that
ftplib.FTP
is the default factory.Now that Python 3.9's
ftplib.FTP
uses UTF-8 encoded paths by default, it's impossible to fulfill all of these three statements; we always have some contradiction. We also can't "ignore" Python 3.9 in ftputil 4.0.0 because ftputil 4.0.0 was released with the tag "Python :: 3", not with individual Python versions. So even if the adoption of Python 3.9 is low, clients may already use ftputil 4.0.0 with Python 3.9. (Although they won't notice any problems as long as they stick to ASCII-only paths.)So, taking semver into account, this unfortunately is a backward-incompatible change and would require a new major version number, i. e. ftputil 5.0.0.
Planned code changes and new behavior
Contrary to what I wrote in my first comment, I think it would cause more harm than good to use latin-1 as default encoding also for Python 3.9. I guess, at least if explaining/knowing the encoding change in Python 3.9's ftplib module, most developers would assume that ftputil also uses 3.9 ftplib's path encoding.
Therefore, I'm considering the following code/behavior changes:
- Change
ftputil.session.session_factory
to support a new argument as described in the previous comment. But name the argumentpath_encoding
instead ofencoding
to make it clear that it has nothing to do with file contents.ftplib.FTP
will stay the default session factory. Hence in Python 3.8 and earlier, ftputil will use latin-1 path encoding and in Python 3.9 ftputil will use UTF-8 encoding. This is the behavior with ftputil 4.0.0, as long as you pass unicode strings (str
) to the methods that take a path.- At the moment, i. e. in version 4.0.0, ftputil uses an encoding
ftputil.tool.LOSSLESS_ENCODING
, which is set to latin-1. ftputil does not useftplib.FTP.encoding
(which is latin-1 in Python 3.8 and earlier) because theencoding
attribute was never documented.LOSSLESS_ENCODING
is used inftputil.tool.as_str
, which is used to convertbytes
paths tostr
paths to be passed to ftplib. In other words, there are two different places which rely on the encoding, in ftplib and in thebytes
tostr
decoding in ftputil. To (hopefully) make sure that both places use the same encoding, I plan the following internal changes:
- In
FTPHost._make_session
, inspect theencoding
attribute of the created session instance, then remember the encoding in theFTPHost
object.- If the created session doesn't have a
session
attribute, fall back to UTF-8 and log a warning. This leaves a small chance where the actual encoding used in the session instance is different from the encoding that ftputil assumes forbytes
tostr
path conversions. If that happens, users need to wrap their session factory in one that sets theencoding
attribute.- Pass this encoding to
tool.as_str
instead of using the hardcodedLOSSLESS
encoding.Variations
- Add a
path_encoding
argument with defaultNone
to theFTPHost
constructor instead of checking the session'sencoding
attribute. This would be a bit cleaner. However, I don't think hardly anyone will use a session factory that doesn't inherit fromftplib.FTP
and hence is missing theencoding
attribute.
Deleted milestone 4.1.0 since this ticket will most likely cause a new ftputil version 5.0.0.
The implemented behavior now is:
ftputil.session.session_factory
got a new keyword argumentencoding
(notpath_encoding
as initially suggested). If not specified, this uses the encoding from thebase_class
argument. If specified, use this encoding for paths. This works with session base classes that inherit fromftplib.FTP
andftplib.FTP_TLS
, both with Python 3.8 (and earlier) and Python 3.9. About the argument name:path_encoding
would be clearer, but it may be confusing as the term used inftplib
is justencoding
.When a session instance is created, ftputil will take the path encoding from the session's
encoding
attribute. Again, this will work forftplib.FTP
andftplib.FTP_TLS
in Python 3.8 (and earlier) and Python 3.9. If a session instance doesn't have theencoding
attribute, ftputil will fall back to the default "latin-1" encoding. However, since I'm not sure whether this is a reasonable behavior, I may want to change it without releasing another backward-incompatible version. Therefore, the documentation states that in the case of a missingencoding
attribute the behavior is undefined. Note that the encoding isn't used if you pass paths asstr
.
If the session instances from the factory don't have an
encoding
attribute, other behaviors could be:
- Raise an exception once we get a session instance (i. e. in
_make_session
) withoutencoding
attribute.- Use a different default encoding. I think UTF-8 would not be a good idea because in other places the default encoding is latin-1 and having yet another encoding would be very confusing. However, I can imagine using ASCII as the default encoding if the
encoding
attribute is missing. With this behavior, we can at least still process ASCII paths. Also, if nobytes
paths are used, the encoding isn't used (by ftputil) anyway. Then again, maybe it's better to raise an exception than working most of the time (as long as we have only ASCII paths), but fail at some later point in production because we encounter a non-ASCIIbytes
path.
Replying to schwa:
- Use a different default encoding. I think UTF-8 would not be a good idea because in other places the default encoding is latin-1 and having yet another encoding would be very confusing. However, I can imagine using ASCII as the default encoding if the
encoding
attribute is missing. With this behavior, we can at least still process ASCII paths. Also, if nobytes
paths are used, the encoding isn't used (by ftputil) anyway. Then again, maybe it's better to raise an exception than working most of the time (as long as we have only ASCII paths), but fail at some later point in production because we encounter a non-ASCIIbytes
path.Changed behavior: If a session instance doesn't have an
encoding
attribute, fail early by raising aNoEncodingError
. Originally this was calledMissingEncodingError
, butNoEncodingError
is shorter and hopefully as clear.