Aaron N. Tubbs

It is surprisingly tricky in modern languages to figure out how to retrieve data from the web from behind a proxy that requires authentication. I’ve found it especially confusing in languages that have half a dozen different standard libraries for retrieving URL-based or HTTP-based data. Two great examples of this are Java and Python.

Python was today’s experiment. Python, in its standard library, has both httplib and urllib. The former can be extended to do proxy auth by manually hacking up a number of headers. This isn’t that big of a deal, and it’s what I’ve done with LWP in the past, but it’s a crude solution. On the other hand, there is what appeared to be the far more intuitive urllib, except its proxy support is spotty, and the syntax to use it wasn’t readily apparent. Luckily, if you use the specialized FancyURLopener inside urllib, it’s possible to use a proxy, with the following caveat: there’s nothing built in to handle proxy authentication. Better yet, the failure behavior is to just return an empty document, as opposed to tossing an exception to say that something isn’t right.
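For what it’s worth, the header hacking I’m talking about amounts to sending a Proxy-Authorization header along with the request. A minimal sketch of building one (the credentials are placeholders; you’d attach the result to an httplib connection with putheader):

```python
import base64

# placeholder credentials; in practice these come from your environment
credentials = 'username:password'

# basic proxy auth is just the base64 of "user:pass", prefixed with "Basic ",
# sent as a Proxy-Authorization header on the request to the proxy
token = base64.b64encode(credentials.encode('ascii')).decode('ascii')
header = 'Basic ' + token

print(header)
```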

That’s cool though, because after a little more perspiration, I came across urllib2, which seems to set out to solve all of the problems that urllib cannot. Of course, to do this, it takes what is a relatively clean and elegant library and transforms it into a monster. This is first apparent in that switching from import urllib to import urllib2 quadrupled the load time of a small script. It’s also apparent in that we go from two lines of code to retrieve a URL to:

import urllib2

# username, password, server, and port are placeholders
proxy = urllib2.ProxyHandler({'http': 'http://username:password@server:port/'})
authentication = urllib2.HTTPBasicAuthHandler()
url_opener = urllib2.build_opener(proxy, authentication, urllib2.HTTPHandler)
data = url_opener.open(url)

And of course, we haven’t even added exception handling code yet. Now, looking at the architecture of urllib2, it is very flexible, and elegant in its own way — build up your transports and layers as you go, and roll your own retriever that does exactly what you need, nothing more, nothing less. This object/message/event approach to doing things pops up in a lot of languages. Python suddenly seems to be a strange mixture of doing things the hacker way, the easy way, and the flexible way. To this end, the old “unlike perl, there is only one way to do it” argument for python seems to be complete bunk. Maybe web retrieval is just an exception to the rule. At the end of the day, we have something that is elegant, if a bit complicated, and it seems to be the right way to go about retrieving data from behind an authenticating proxy in python.
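For completeness, here’s a sketch of what that exception handling might look like wrapped around the opener; the proxy host, port, and credentials are placeholders, and the import shim is mine (under Python 3 the same classes moved to urllib.request):

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # the same classes under Python 3

# placeholder proxy and credentials
proxy = urllib2.ProxyHandler({'http': 'http://username:password@proxy.example.com:8080/'})
authentication = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, authentication, urllib2.HTTPHandler)
urllib2.install_opener(opener)

data = None
try:
    data = urllib2.urlopen('http://www.google.com/', timeout=5).read()
except urllib2.HTTPError as e:
    # HTTPError subclasses URLError, so it has to be caught first
    print('server (or proxy) returned an error: %s' % e.code)
except urllib2.URLError as e:
    print('failed to reach the server: %s' % e.reason)
```

Note that a failure finally raises an exception instead of silently handing back an empty document, which is exactly the behavior urllib denied us.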

Now that we’ve accomplished the simple task of getting out of the proxy and retrieving the google homepage, say we want to do something slightly more complicated, like make a SOAP call on a website over SSL, digest authenticated, over a proxy, with proxy authentication. Where do we even begin? Well, let’s see, we look at the standard SOAP modules for python. Right, there don’t seem to be any. So then we have to choose between 4Suite SOAP, SOAPy, SOAP.py, soaplib, ORchard, and PySOAP. No worries, I’m sure there are good reasons that there are half a dozen SOAP modules.

I’ll try one at random. Most have been abandoned, frozen, deprecated, or replaced. Curiously, I couldn’t find one that actually said “use me, I’m the de-facto standard, I’m implemented the python way.” I picked what seemed to be the most popular choice in SOAPpy, but then it gets muddy again. It actually has some built-in code to try to work behind an HTTP proxy, using httplib. It also has some stuff built in for pyglobus support. Whatever. Further investigation yields that the proxy implementation does not yet handle authorization at the proxy … but it wouldn’t be hard to hack up the module to implement this. But then I get to thinking.

I’m told that python is a beautiful elegant object-oriented world. If that were true, this simple extension of what I’ve already accomplished should be just that — simple. I would pass some sort of generic HTTP transport object to my SOAP object (something like our urllib2 solution above), which would then have everything it needed to go off and do its work; it should not have to worry about implementing any HTTP headers not directly related to SOAP. It does not make sense that a SOAP library should have to implement its own HTTP proxy handling, and the danger shows here: this implementation doesn’t even completely support everything that working behind a proxy might require. Somebody implementing this sort of thing should not have to think about these things! This isn’t development in the object-oriented paradigm, this is monolithic procedural crap that has been coaxed into classes and methods.
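To make that concrete, the sort of thing I have in mind is a SOAP caller that takes a pre-built opener as its transport. This class and all of its names are hypothetical — a sketch of the separation I’m arguing for, not any real library’s API (the import shim covers Python 3, where the same classes live in urllib.request):

```python
try:
    import urllib2                    # Python 2
except ImportError:
    import urllib.request as urllib2  # the same classes under Python 3

class SoapCaller(object):
    """Hypothetical SOAP caller: the transport is injected, never built here."""

    def __init__(self, opener):
        # the opener already knows about proxies, auth, and SSL -- not our problem
        self.opener = opener

    def call(self, url, action, envelope):
        # the only HTTP details a SOAP layer should own are SOAP's own headers
        request = urllib2.Request(url, envelope.encode('utf-8'),
                                  {'Content-Type': 'text/xml; charset="utf-8"',
                                   'SOAPAction': action})
        return self.opener.open(request).read()

# wiring it up with the proxy-aware opener from earlier would look like:
#   caller = SoapCaller(url_opener)
#   response = caller.call('http://example.com/service', 'SomeAction', envelope)
```

The point is that the SOAP layer never touches a proxy setting; swap in a different opener and nothing about the SOAP code changes.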

Now, there is probably a python library that does this all correctly, and I admit this is probably all still brought on by my inherent dislike of indent-based scoping, but seriously, why make these things so difficult? Isn’t the point of python to be easy, object-oriented, powerful, and straightforward? I am not trying to suggest that other languages necessarily do a better job of this — I would run up against the same sort of crap in Java, Perl, C++ … the only place where I have a little hope that things will work the right way out of the box is within the .NET CLR. Even better, it would probably be able to do something clever, like just use my Internet Explorer settings, rather than having to build up all of that extra information in a redundant application config. I haven’t actually validated that this is the case. I just feel like once we’ve stepped outside of the getting-to-know-python tutorial, suddenly we’ve entered a dark world where everything is not quite as it seems.