This feed contains pages about Python.
json vs thrift and protocol buffers round 2
Posted Mon Mar 2 21:54:57 2009Following up to the previous posts, A few comments out on the internet mentioned that my first tests werent very fair to thrift and protocol buffers because they were mostly serializing strings. I gutted the test code and re-wrote the IDL files to use this structure:
message DnsRecord {
required fixed32 sip = 1;
required fixed32 dip = 2;
required uint32 sport = 3;
required uint32 dport = 4;
}
Nothing fancy, basically the standard ipv4 4-tuple.
I also replaced the random record generation with this:
def get_random_records(num=10000): data = [] for x in xrange(num): data.append({ 'sip': 192*255**3+168*255**2+255+random.randrange(0,255), 'dip': random.randrange(1,255**4), 'sport': random.randrange(1024,2048), 'dport': random.choice([21,22,25,80,110,443]) }) return data
This will generate 10000 records with:
- a random source IP on the 192.168.1.0/24 network
- a completely random destination IP
- a source port between 1024 and 2048
- a destination port chosen from six common ports.
The raw size of this data using fixed length ints would be 10000*(4+4+4+4) = 160,000 bytes. The variable length encoding that protocol buffers does should be able to save some space when storing the smaller port numbers.
Running the test code produces the following output:
10000 total records (0.280s) get_thrift (0.060s) get_pb (0.950s) ser_thrift (0.560s) 370009 bytes ser_pb (4.850s) 171650 bytes ser_json (0.080s) 680680 bytes ser_cjson (0.120s) 680680 bytes ser_yaml (17.330s) 610680 bytes ser_thrift_compressed (0.620s) 111326 bytes ser_pb_compressed (3.980s) 98571 bytes ser_json_compressed (0.110s) 124919 bytes ser_cjson_compressed (0.120s) 124919 bytes ser_yaml_compressed (17.160s) 121065 bytes serde_thrift (2.130s) serde_pb (7.550s) serde_json (0.130s) serde_cjson (0.110s) serde_yaml (56.740s)
These results show that protocol buffers and thrift do indeed excel at serializing numeric values. The pre-compressed output from protocol buffers is considerably smaller than the other serialization methods, with thrift ending up somewhere in the middle. In fact, the protocol buffers output is barely larger than the original data would be in compact binary form. Since JSON and YAML serialize numbers to strings, their output ends up being 4 times bigger.
However, once you add in compression, all this fancy extra work to save space only slightly improves on JSON. The speed and simplicity of the JSON+zlib approach can not be ignored...
The protocol buffers speed issues are still there, but I'm sure that over time things will improve. If the C extension for simplejson can speed up serialization by an order of magnitude, I have no doubt that similar improvements can be made to protocol buffers and thrift.
If you want to run these tests for yourself, the code is available from sertest2.tgz
Some other things to try would be to set the default dport to 80, and see how that effects serialization size and speed.
more on json vs thrift and protocol buffers
Posted Sun Mar 1 13:19:58 2009Following up to the previous post, a couple of people pointed out to me that the cjson library is faster than simplejson, and that the latest simplejson has a small C extension.
Re running the tests with simplejson 2.0.9 and cjson yields the following results:
5000 total records (0.730s) get_thrift (0.040s) get_pb (0.620s) ser_thrift (0.550s) 555125 bytes ser_pb (2.980s) 415125 bytes ser_json (0.030s) 718455 bytes ser_cjson (0.040s) 718455 bytes ser_yaml (12.770s) 623455 bytes ser_thrift_compressed (0.630s) 287621 bytes ser_pb_compressed (3.020s) 284441 bytes ser_json_compressed (0.090s) 293073 bytes ser_cjson_compressed (0.080s) 293073 bytes ser_yaml_compressed (13.260s) 291106 bytes serde_thrift (1.460s) serde_pb (5.250s) serde_json (0.070s) serde_cjson (0.060s) serde_yaml (44.110s)
There doesn't seem to be much doubt about it, if you need to serialize basic python data structures and don't need the extra features of thrift or protocol buffers, it is hard to beat JSON.
The test code at sertest.tgz has also been updated to use time.clock instead of time.time.
thrift and protocol buffers
Posted Sat Feb 28 11:53:08 2009I've been experimenting with thrift and protocol buffers recently. For the most part when I need to serialize something I've been using JSON or compressed JSON. Thrift and protocol buffers have a couple of advantages, and are also supposedly faster and produce smaller output.
The test I've been using is a simple list of hashes, nothing too complicated. here is the protocol buffers file. The thrift file is pretty much the same thing.
package passive_dns;
message DnsRecord {
required string key = 1;
required string value = 2;
required string first = 3;
required string last = 4;
optional string type = 5 [default = "A"];
optional int32 ttl = 6 [default = 86400];
}
message DnsResponse {
repeated DnsRecord records = 1;
}
The optional and default values are one of the benefits of both serialization libraries. A record that matches the default value does not need to be included in the serialized output.
I wrote up a simple test program to compare thrift, protocol buffers, json, and compressed json for size and speed. The results, at least for the type of data I use, are very interesting:
5000 total records (0.745s) get_thrift (0.044s) get_pb (0.608s) ser_thrift (0.474s) 554953 bytes ser_pb (3.087s) 414862 bytes ser_json (0.273s) 718191 bytes ser_yaml (13.121s) 623191 bytes ser_thrift_compressed (0.545s) 287617 bytes ser_pb_compressed (3.150s) 284297 bytes ser_json_compressed (0.326s) 292904 bytes ser_yaml_compressed (13.665s) 290993 bytes serde_thrift (1.289s) serde_pb (5.411s) serde_json (1.474s) serde_yaml (45.637s)
EDIT: Updated to include yaml results
The get_* functions are the times needed to covert the python data structure into the classes that the library needs.
The ser_* functions are the times needed to get and serialize the python data structure to a string.
The ser_*_compressed functions are the times needed to get, serialize, and compress the python data structure.
The serde_* functions are the times needed to get, serialize, and de-serialize the python data structure to and from a string.
The results show that serializing to compressed JSON is both smaller and faster than thrift, and serializing+de-serializing is only slightly slower. If I converted the python data to be (header, rows) like a csv file, rather than a flat list of dicts, the json output would be smaller, and likely faster to serialize.
The totally unexpected result was that protocol buffers clocked in at over 4 times slower than thrift. I find it hard to believe that protocol buffers could be that slow, so I will have to run some more tests to make sure that I am using the library correctly.
If you want to run my tests for yourself, the code is available from sertest.tgz
Python Evolution: From Script To Program
Posted Sat Jun 21 23:18:12 2008The Evolution of a Python Programmer is funny, but it only covers one aspect of programming. Many times I will see code that is fine from a CS point of view, but absolutely horrible when it comes to program structure and module organization.
You often see people saying things like "Hello World in python is just 'print "Hello World"'", and that is true. It is very easy to get started writing python, but if you don't structure your modules correctly, you will be in a world of pain later on. It is something that can be hard to explain, since the results in the short term are the same, and it may not be clear at first why one way of doing things is better than the other.
Instead of Hello World, let's take the example of a program to get stock quotes. The actual implementation here is not relevant, pretend it contacts a web service or database or something.
A common case is the "python script". I HATE python scripts. "script" almost always ends up being a single file with no entry points, no main function, and mixes IO with logic.
s = raw_input("symbol:")
if s == 'MSFT':
print 'price=', 28.23
elif s == 'GOOG':
print 'price=', 546.43
The first step in fixing this is to define an actual function. Now you can import the module and run get_price().
def get_price(): s = raw_input("symbol:") if s == 'MSFT': print 'price=', 28.23 elif s == 'GOOG': print 'price=', 546.43
The (hopefully) obvious problem with this is that the IO is mixed in with the logic. What if you wanted to get the stock price for 1000 stocks and output a nice summary? This next version is slightly better, here the input is a proper parameter, but you still have no control over the output. You could get your 1000 quotes, but you would have no way to report on the output. Again, this should be obvious, but I come across code that does this way too often.
def get_price(s): if s == 'MSFT': print 'price=', 28.23 elif s == 'GOOG': print 'price=', 546.43 ### if __name__ == "__main__": s = raw_input("symbol:") get_price(s)
The first respectable version adds a main() function that
handles the input and output. The main function should also get the
stock from the command line arguments, rather than interactively. I
think you tend to see things like this more often from windows
users, who like to double click on things rather than run them from
a shell. You could probably write a whole book on this subject
though 
def get_price(s): if s == 'MSFT': return 28.23 elif s == 'GOOG': return 546.43 ### def main(): s = raw_input("symbol:") print 'price=', get_price(s) if __name__ == "__main__": main()
The final steps are to make a proper python package out of this module, but I'll save that for a later post.
how my dupe finding program works
Posted Thu Feb 21 23:41:03 2008finding duplicate files
This post is about my duplicate finding program available under Programs. The program is a little bare, and needs a nicer API, but the method it uses is the most efficient one that I am aware of.
There are a couple of different ways you can find duplicate files:
Compute the hash of all the files, and look for duplicates
This method works well if the files on disk are mostly static, and files are added infrequently. In this case you can compute the hashes once, and keep it around for later scans. However, if you are only running the scan once, this method is not ideal since it requires you to read the full contents of every file
Compute the hash of files with the same size
This is the method that I think fdupes still uses. It first builds a candidate list of files that are the same size, and computes the checksum of each. This method works well if most of the files that are the same size are really duplicates, but otherwise triggers too much unneeded IO.
Compare all files with the same size in parallel
This is the method that my program uses. Like fdupes, I first built up a candidate list of files with the same size. Instead of hashing the files, it simply reads each file at the same time, comparing block by block. This is just like what the cmp(1) program does, but for multiple files at the same time. The benefit of this over calculating the files hash, is that as soon as the files differ, you can stop reading.
Implementation
There are a couple of things you need to keep in mind to implement this method.
Don't open too many files.
You have to be careful not to try and open too many files at once. If the user has 5,000 files that all have the same size, the program shouldn't try and open all 5,000 at once. My program uses a simple helper class to handle opening and closing files. The default blocksize in my program would probably waste a bit of memory in this case, but that is easily changed.
Correctly handle diverging sets.
Imagine the filesystem contains 4 files of the same size, 'a', 'b','c', and 'd', where a==c, and b==d. While reading through the files, it will become clear that a!=b, a==c, and a!=d. It is important that at this step the program continues searching using (a,c) and (b,d) as possible duplicates. This is implemented using recursion, the sets (a,c) and (b,d) are fed back into the duplicate finding function.
Example run, compared to fdupes.
Here is dupes.py running against fdupes on a modestly sized directory. Notice how dupes.py only needs to read 600K(not counting metadata).
According to iofileb.d from the dtrace toolkit, dupes.py reads 10M of data (which I think includes python), and fdupes reads 517M. This alone explains the 20x speedup seen in dupes.py
justin@pip:~$ du -hs $DIR 15G $DIR justin@pip:~$ time python code/dupes.py $DIR 2896 total files 35 size collisions, max of length 5 bytes read 647168 real 0m1.224s user 0m0.234s sys 0m0.494s justin@pip:~$ time fdupes -r $DIR real 0m41.694s user 0m13.612s sys 0m7.491s justin@pip:~$ time python code/dupes.py $DIR 2896 total files 35 size collisions, max of length 5 bytes read 647168 real 0m3.662s user 0m0.256s sys 0m0.568s justin@pip:~$ time fdupes -r $DIR real 0m55.473s user 0m11.383s sys 0m6.433s
regex with named groups
Posted Wed Feb 20 11:42:21 2008As I mentioned in a comment at Some more tweaks to my Python script, there are a lot of ways you can use the re module. If you need to match multiple expressions against each line, you can build up a single regular expression that includes all the patterns, and used named groups to tell them apart.
import re #if you were matching many of these it would be a good idea #to make a function that simply fills in '%s>(?P<%s>[^<]+)<' cpattern = 'total_credit>(?P<credit>[^<]+)<' opattern = 'os_name>(?P<os>[^<]+)<' pattern = '(%s)|(%s)' % (cpattern, opattern) search = re.compile(pattern).search lines = [ 'blah blah blah total_credit>10< blah blah', 'hkfhsd klfjhs dfkljsdfsl fds', 'hkashflksd os_name>win< hhkjhdflksj d', 'hkfhsd klfjhs dfkljsdfsl fds', 'blah blah blah total_credit>20< blah blah', ] for line in lines: r = search(line) if r: print r.groupdict()
Running this gives
{'credit': '10', 'os': None}
{'credit': None, 'os': 'win'}
{'credit': '20', 'os': None}
In this case you could even generalize the regular expression further, like so:
pattern = '\s(?P<key>[^\s>]+)>(?P<value>[^<]+)<'
Running that (probably less than optimal) regular expression over the input gives
{'key': 'total_credit', 'value': '10'}
{'key': 'os_name', 'value': 'win'}
{'key': 'total_credit', 'value': '20'}
dynamic ikiwiki pages
Posted Fri Feb 15 20:57:58 2008The static pages that ikiwiki generates are great, but I want to have some dynamic content here as well.
If this works, this page should include the servers uptime.
16:50:56 up 42 days, 9:48, 0 users, load average: 0.00, 0.00, 0.00yay 
So how does that work?
first configure nginx as follows
server {
listen 80;
server_name bouncybouncy.net *.bouncybouncy.net web;
location / {
root /home/justin/bbdotnet/static/;
index index.html index.htm;
ssi on;
}
location /dyn {
# All POST requests go to pylons directly
include /usr/local/nginx/conf/proxy.conf;
proxy_redirect default;
if ($request_method = POST) {
proxy_pass http://127.0.0.1:5000;
break;
}
default_type text/html;
set $memcached_key "$uri";
memcached_pass localhost:11211;
proxy_intercept_errors on;
# If no info would be found in memcache or memecache would be dead, go to real dynamic location
error_page 404 502 = @dynamic_request;
}
location @dynamic_request{
# This means, that we can't get to this location from outside - only by internal redirect
internal;
include /usr/local/nginx/conf/proxy.conf;
proxy_redirect default;
proxy_pass http://127.0.0.1:5000;
}
}
Pylons is setup to run on port 5000 as usual, nothing fancy there.
Then anywhere we want some dynamic content we can simply do
<!--# include virtual="/dyn/demo/uptime" -->
For now, you have to disable the htmlscrubber plugin for this to work. There is probably a better solution. I think this would simply involve a plugin that could run after htmlscrubber to insert the include, then you would only need to have something like [[include virtual="/dyn/demo/uptime"]] in your pages.
If you did not mind requring javscript, you could use HInclude instead of SSI.
To keep things running fast, we enable to caching on the pylons controller. using a modified version of the beakercache decorator. The following lines are inserted at the end of the createfunc method, which causes the page result to be cached in memcache as well as in beaker.
url = pylons.request.path_info if pylons.request.params: url += "?" + pylons.request.environ['QUERY_STRING'] mc = memcache.Client(['localhost']) mc.set(url, result, cache_expire)
The only remaining problem I see is a small race condition. If
the cache expires, and 20 concurrent requests all come in for the
page, most of them will end up hitting python instead of waiting
for the memcache key to appear. This might actually work better
using varnish or apache2 with mod_disk_cache, but the
last time I tried I could not get varnish to work at all, and
apache2 (I think) still does not support PURGE.