Running the code
host = ftputil.FTPHost(hostname, user, password) host.synchronize_times() host.close()
triggers an exception
Traceback (most recent call last): File "g:\test\test.py", line 13, in <module> host.synchronize_times() File "g:\test\ftputil\ftputil.py", line 896, in __exit__ self.close() File "g:\test\ftputil\ftputil.py", line 253, in close self.stat_cache.clear() File "g:\test\ftputil\ftp_stat_cache.py", line 117, in clear self.resize(0) File "g:\test\ftputil\ftp_stat_cache.py", line 100, in resize self._cache.size = new_size File "g:\test\ftputil\lrucache.py", line 186, in __setattr__ del self.__dict[lru.key] KeyError: '/htdocs/new/_ftputil_sync_'
with Python 2.6.1. This happens on Windows, but not under Ubuntu Intrepid.
(originally reported by Peter Stirling)
I can reproduce the bug with Windows XP 3 and both Python 2.5.4 and 2.6.1. I can also confirm that it doesn't happen on Ubuntu 8.10 (Intrepid).
Peter Stirling wrote:
After further inspection it appears that the windows python2.6 install doesn't include a dll of the compiled heapq module
/usr/lib/python2.5/lib-dynload/_heapq.so, on ubuntu python2.5
However when I tried to test the pure python version it decided to work, so there must be a bad dll somewhere.
To test this I made a copy of the pure python heapq.py as heapq26.py, and commented out the code that replaces pure-python with the c versions (line 313). I then altered ftputil/lrucache.py(line 43) to import from this new module, and when I run the test program I no longer get the exception.
Running lrucache's tests also seems(*) to reveal the bug.
$ python lrucache.py <<class '__main__.LRUCache'> (0 elements)> <<class '__main__.LRUCache'> (25 elements)> <<class '__main__.LRUCache'> (24 elements)> <<class '__main__.LRUCache'> (10 elements)> <<class '__main__.LRUCache'> (10 elements)> 10 40 41 42 43 44 45 47 48 49 46 <<class '__main__.LRUCache'> (10 elements)> 1239353317.44 40 41 42 43 44 45 47 48 49 46
S:\ftputil>python lrucache.py <<class '__main__.LRUCache'> (0 elements)> <<class '__main__.LRUCache'> (25 elements)> <<class '__main__.LRUCache'> (24 elements)> Traceback (most recent call last): File "lrucache.py", line 219, in <module> cache.size = 10 File "lrucache.py", line 195, in __setattr__ del self.__dict[lru.key] KeyError: 46
(*) We don't know for sure if it's the same bug but I think it's highly likely.
Peter, I can't reproduce your workaround with the modified heapq module. When I run the tests in the modified lrucache module (i. e. no ftputil involved), on Windows I still get the traceback for the lrucache tests. Also the traceback for the original ftputil usage, on which this issue is based, remains.
For comparison with your patches I attached mine. (Please ignore the debug code further down in lrucache.py.) What have we done differently?
I set the bug priority to critical because the problem might - in other contexts - apply to a lot of users of ftputil under Windows.
I made an extra branch, issue32, to investigate the bug. There, I combined the test code into a single module which contains code to reproduce the bug. Just run the script to see if the bug occurs. I also added the outputs of this script when run under Windows XP and Ubuntu Linux.
My results so far:
- When overriding
time.time()in the test code with
time.time = time.clockat the start of the script the output on Windows and Linux is the same, i. e. the bug no longer shows.
- When using
time.time()but prepended with
time_)--but without the following
- When using the normal
time.time()without the sleep delay but with just a print statement (not even the
print tcurrently in the code), the bug does not occur.
So there seem to be issues with the way the "time managements" in the two environments work. Maybe this is also related to my Windows investigations happening in a virtual machine (VirtualBox 2.2). This might also explain why the bug hasn't been reported earlier as most Windows users assumedly will run a native Windows and so won't come across the bug (that is, if it really is limited to the virtual environment).
Peter, does the Windows Vista you mentioned in another mail run in a virtual machine? If yes, can you run the test script in a native Windows, i. e. outside any virtual machine, and does the bug happen there?
By the way, the Windows installation is a Windows XP SP3 (officially patched as of 2009-05-01 but without Windows Explorer 8). The Ubuntu Linux is version 8.10 with a kernel 2.6.27-11. The native OS runs on an AMD64 platform with a single processor/single core and without hardware virtualization support (this processor doesn't have any).
As one can check with the attached program time_test.py, time.time measures more precise under Linux, time.clock under Windows. The output of the program under Linux is:
$ python time_test.py time [1241286942.7352209, 1241286942.7352271, 1241286942.73523, 1241286942.7352331, 1241286942.7352359, 1241286942.735239, 1241286942.7352419, 1241286942.735245, 1241286942.7352469, 1241286942.73525] [6.198883056640625e-06, 2.86102294921875e-06, 3.0994415283203125e-06, 2.86102294921875e-06, 3.0994415283203125e-06, 2.86102294921875e-06, 3.0994415283203125e-06, 1.9073486328125e-06, 3.0994415283203125e-06] 2.90870666504e-05 $ python time_test.py clock [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0
while on Windows it's the other way around:
>python time_test.py time [1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 0.0 >python time_test.py clock [1.0336509249080539e-05, 1.6482541775560862e-05, 2.1511113842681123e-05, 2.6539685909801387e-05, 3.1568257976921645e-05, 3.7434925388561956e-05, 4.2184132340842205e-05, 4.7212704407962462e-05, 5.2241276475082727e-05, 5.7549213657043007e-05] [6.1460325264803225e-06, 5.0285720671202609e-06, 5.0285720671202643e-06, 5.0285720671202575e-06, 5.8666674116403117e-06, 4.7492069522802485e-06, 5.0285720671202575e-06, 5.0285720671202643e-06, 5.3079371819602801e-06] 4.7212704408e-05
time.timeirrespective of the platform, on Windows it assigns exactly the same timestamp to multiple cache nodes. This seems to confuse the
lrucache module. I'll have to investigate this in more detail to find out what's going on here. :-)
While I was trying to figure out where the (non-existant) bug was in the heapq module, I learned quite a lot about heaps. A heap is a special kind of binary tree where the keys of each node are greater than that of its parent (this means that you can remove the lowest value in the whole tree by removing the root of the tree, which is a low cost operation).
However for this to work you need a stable sort key, which is where the unix-windows fight causes the problem.
From the python docs:
On Unix, return the current processor time as a floating point number expressed in seconds. The precision, and in fact the very definition of the meaning of “processor time”, depends on that of the C function of the same name, but in any case, this is the function to use for benchmarking Python or timing algorithms.
On Windows, this function returns wall-clock seconds elapsed since the first call to this function, as a floating point number, based on the Win32 function
QueryPerformanceCounter(). The resolution is typically better than one microsecond.
Return the time as a floating point number expressed in seconds since the epoch, in UTC. Note that even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.
Windows is one platform where you only get 1s precision from
time.time()(there are others).
This uses a synthetic key for all comparisons made within the heapq module, and with it the code runs identically on my windows box and ubuntu.
The patch could be simplified if you didn't want to retain the
LRUCache.mtime(key)interface (returns the modification time of the cached item with that key)
thanks for your patch! I've also thought about using just a counter. :-) Thanks to Python having long integers, there won't be any problems with integer overflow or "wrap-around" situations, and we won't have to deal with platform dependencies.
Though this approach should fix the problem, I wonder why
lrucachegets so confused. Granted, if the time value provides no stable sort order, I'd expect
lrucacheto perhaps not delete the actually least recently used item from the cache. However, that doesn't explain the traceback for the cache's size change where it seems the code takes a node from the heap list
self.__heapbut no longer finds the node's key* in the dictionary
self.__dict. I'd like to know why this happens before I declare the bug fixed.
mtime, I need this for the "maximum age" functionality described in the documentation.
- Here, I mean the node's
keyattribute, not the "sort key" (which is supposed to be the last-access time) for the heap.
It seems the problem is in
LRUCache.__delitem__. Running with a time function which assigns the same timestamp on every invocation, I find ...
node = self.__dict[key]
extracts the correct node by its key, in the case of the test program that's 16.
deletes this node via its key.
deletes a node by searching through the
__heaplist and comparing the value it finds there with the node. However, since this comparison is done by using
removemethod removes the very first item in the list, which not happens to be the intended 16 but 15.
So the cache object ends up with a heap list that doesn't match the dictionary.
As far as I can see, the resizing in
__setattr__seems to work correctly.
So increasing the comparison key with every new node, as we both thought of, should reliably fix the bug.
The fixed version I sent to the original author of lrucache consists of these changesets.