This feed contains pages about Python.
Python Evolution: From Script To Program
Posted Sat 21 Jun 2008 11:18:12 PM EDTThe Evolution of a Python Programmer is funny, but it only covers one aspect of programming. Many times I will see code that is fine from a CS point of view, but absolutely horrible when it comes to program structure and module organization.
You often see people saying things like "Hello World in python is just 'print "Hello World"'", and that is true. It is very easy to get started writing python, but if you don't structure your modules correctly, you will be in a world of pain later on. It is something that can be hard to explain, since the results in the short term are the same, and it may not be clear at first why one way of doing things is better than the other.
Instead of Hello World, let's take the example of a program to get stock quotes. The actual implementation here is not relevant, pretend it contacts a web service or database or something.
A common case is the "python script". I HATE python scripts. "script" almost always ends up being a single file with no entry points, no main function, and mixes IO with logic.
s = raw_input("symbol:")
if s == 'MSFT':
print 'price=', 28.23
elif s == 'GOOG':
print 'price=', 546.43
The first step in fixing this is to define an actual function. Now you can import the module and run get_price().
def get_price(): s = raw_input("symbol:") if s == 'MSFT': print 'price=', 28.23 elif s == 'GOOG': print 'price=', 546.43
The (hopefully) obvious problem with this is that the IO is mixed in with the logic. What if you wanted to get the stock price for 1000 stocks and output a nice summary? This next version is slightly better, here the input is a proper parameter, but you still have no control over the output. You could get your 1000 quotes, but you would have no way to report on the output. Again, this should be obvious, but I come across code that does this way too often.
def get_price(s): if s == 'MSFT': print 'price=', 28.23 elif s == 'GOOG': print 'price=', 546.43 ### if __name__ == "__main__": s = raw_input("symbol:") get_price(s)
The first respectable version adds a main() function that
handles the input and output. The main function should also get the
stock from the command line arguments, rather than interactively. I
think you tend to see things like this more often from windows
users, who like to double click on things rather than run them from
a shell. You could probably write a whole book on this subject
though 
def get_price(s): if s == 'MSFT': return 28.23 elif s == 'GOOG': return 546.43 ### def main(): s = raw_input("symbol:") print 'price=', get_price(s) if __name__ == "__main__": main()
The final steps are to make a proper python package out of this module, but I'll save that for a later post.
how my dupe finding program works
Posted Thu 21 Feb 2008 11:41:03 PM ESTfinding duplicate files
This post is about my duplicate finding program available under Programs. The program is a little bare, and needs a nicer API, but the method it uses is the most efficient one that I am aware of.
There are a couple of different ways you can find duplicate files:
Compute the hash of all the files, and look for duplicates
This method works well if the files on disk are mostly static, and files are added infrequently. In this case you can compute the hashes once, and keep it around for later scans. However, if you are only running the scan once, this method is not ideal since it requires you to read the full contents of every file
Compute the hash of files with the same size
This is the method that I think fdupes still uses. It first builds a candidate list of files that are the same size, and computes the checksum of each. This method works well if most of the files that are the same size are really duplicates, but otherwise triggers too much unneeded IO.
Compare all files with the same size in parallel
This is the method that my program uses. Like fdupes, I first built up a candidate list of files with the same size. Instead of hashing the files, it simply reads each file at the same time, comparing block by block. This is just like what the cmp(1) program does, but for multiple files at the same time. The benefit of this over calculating the files hash, is that as soon as the files differ, you can stop reading.
Implementation
There are a couple of things you need to keep in mind to implement this method.
Don't open too many files.
You have to be careful not to try and open too many files at once. If the user has 5,000 files that all have the same size, the program shouldn't try and open all 5,000 at once. My program uses a simple helper class to handle opening and closing files. The default blocksize in my program would probably waste a bit of memory in this case, but that is easily changed.
Correctly handle diverging sets.
Imagine the filesystem contains 4 files of the same size, 'a', 'b','c', and 'd', where a==c, and b==d. While reading through the files, it will become clear that a!=b, a==c, and a!=d. It is important that at this step the program continues searching using (a,c) and (b,d) as possible duplicates. This is implemented using recursion, the sets (a,c) and (b,d) are fed back into the duplicate finding function.
Example run, compared to fdupes.
Here is dupes.py running against fdupes on a modestly sized directory. Notice how dupes.py only needs to read 600K(not counting metadata).
According to iofileb.d from the dtrace toolkit, dupes.py reads 10M of data (which I think includes python), and fdupes reads 517M. This alone explains the 20x speedup seen in dupes.py
justin@pip:~$ du -hs $DIR 15G $DIR justin@pip:~$ time python code/dupes.py $DIR 2896 total files 35 size collisions, max of length 5 bytes read 647168 real 0m1.224s user 0m0.234s sys 0m0.494s justin@pip:~$ time fdupes -r $DIR real 0m41.694s user 0m13.612s sys 0m7.491s justin@pip:~$ time python code/dupes.py $DIR 2896 total files 35 size collisions, max of length 5 bytes read 647168 real 0m3.662s user 0m0.256s sys 0m0.568s justin@pip:~$ time fdupes -r $DIR real 0m55.473s user 0m11.383s sys 0m6.433s
regex with named groups
Posted Wed 20 Feb 2008 11:42:21 AM ESTAs I mentioned in a comment at Some more tweaks to my Python script, there are a lot of ways you can use the re module. If you need to match multiple expressions against each line, you can build up a single regular expression that includes all the patterns, and used named groups to tell them apart.
import re #if you were matching many of these it would be a good idea #to make a function that simply fills in '%s>(?P<%s>[^<]+)<' cpattern = 'total_credit>(?P<credit>[^<]+)<' opattern = 'os_name>(?P<os>[^<]+)<' pattern = '(%s)|(%s)' % (cpattern, opattern) search = re.compile(pattern).search lines = [ 'blah blah blah total_credit>10< blah blah', 'hkfhsd klfjhs dfkljsdfsl fds', 'hkashflksd os_name>win< hhkjhdflksj d', 'hkfhsd klfjhs dfkljsdfsl fds', 'blah blah blah total_credit>20< blah blah', ] for line in lines: r = search(line) if r: print r.groupdict()
Running this gives
{'credit': '10', 'os': None}
{'credit': None, 'os': 'win'}
{'credit': '20', 'os': None}
In this case you could even generalize the regular expression further, like so:
pattern = '\s(?P<key>[^\s>]+)>(?P<value>[^<]+)<'
Running that (probably less than optimal) regular expression over the input gives
{'key': 'total_credit', 'value': '10'}
{'key': 'os_name', 'value': 'win'}
{'key': 'total_credit', 'value': '20'}
dynamic ikiwiki pages
Posted Fri 15 Feb 2008 08:57:58 PM ESTThe static pages that ikiwiki generates are great, but I want to have some dynamic content here as well.
If this works, this page should include the servers uptime.
21:34:00 up 15 days, 6:24, 0 users, load average: 0.00, 0.00, 0.00yay 
So how does that work?
first configure nginx as follows
server {
listen 80;
server_name bouncybouncy.net *.bouncybouncy.net web;
location / {
root /home/justin/bbdotnet/static/;
index index.html index.htm;
ssi on;
}
location /dyn {
# All POST requests go to pylons directly
include /usr/local/nginx/conf/proxy.conf;
proxy_redirect default;
if ($request_method = POST) {
proxy_pass http://127.0.0.1:5000;
break;
}
default_type text/html;
set $memcached_key "$uri";
memcached_pass localhost:11211;
proxy_intercept_errors on;
# If no info would be found in memcache or memecache would be dead, go to real dynamic location
error_page 404 502 = @dynamic_request;
}
location @dynamic_request{
# This means, that we can't get to this location from outside - only by internal redirect
internal;
include /usr/local/nginx/conf/proxy.conf;
proxy_redirect default;
proxy_pass http://127.0.0.1:5000;
}
}
Pylons is setup to run on port 5000 as usual, nothing fancy there.
Then anywhere we want some dynamic content we can simply do
<!--# include virtual="/dyn/demo/uptime" -->
For now, you have to disable the htmlscrubber plugin for this to work. There is probably a better solution. I think this would simply involve a plugin that could run after htmlscrubber to insert the include, then you would only need to have something like [[include virtual="/dyn/demo/uptime"]] in your pages.
If you did not mind requring javscript, you could use HInclude instead of SSI.
To keep things running fast, we enable to caching on the pylons controller. using a modified version of the beakercache decorator. The following lines are inserted at the end of the createfunc method, which causes the page result to be cached in memcache as well as in beaker.
url = pylons.request.path_info if pylons.request.params: url += "?" + pylons.request.environ['QUERY_STRING'] mc = memcache.Client(['localhost']) mc.set(url, result, cache_expire)
The only remaining problem I see is a small race condition. If
the cache expires, and 20 concurrent requests all come in for the
page, most of them will end up hitting python instead of waiting
for the memcache key to appear. This might actually work better
using varnish or apache2 with mod_disk_cache, but the
last time I tried I could not get varnish to work at all, and
apache2 (I think) still does not support PURGE.