Breaking Code

June 29, 2010

Using Google Search from your Python code

Filed under: Tools, Web applications — Mario Vilas @ 6:31 pm

Hi everyone. Today I’ll be showing you a quick script I wrote to make Google searches from Python. There are previous projects that do the same thing (actually, they do it better), namely Googolplex by Sebastian Wain and xgoogle by Peteris Krumins, but unfortunately they no longer work. Maybe the simplicity of this script will keep it working a little longer… 🙂

The interface is extremely simple: the module exports a single function called search().

        # Get the first 20 hits for: "Breaking Code" WordPress blog
        from googlesearch import search
        for url in search('"Breaking Code" WordPress blog', stop=20):
            print url

You can control which of the Google Search pages to use, which language to search in, how many results per page, which page to start searching from and when to stop, and how long to wait between queries. However, the only mandatory argument is the query string; everything else has a default value.

        # Get the first 20 hits for "Mariposa botnet" in Google Spain
        from googlesearch import search
        for url in search('Mariposa botnet', tld='es', lang='es', stop=20):
            print url

A word of caution, though: if you wait too little between requests or make too many of them, Google may block your IP for a while. This is especially annoying when you’re behind a corporate proxy – I won’t be held responsible when your coworkers suddenly develop an urge to kill you! 😀
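
To play it safe, you can raise the pause between requests. Here’s a hedged sketch combining the optional arguments described above: tld, lang and stop appear in the examples, while num, start and pause are my guesses at the remaining parameter names, so double check them against the source before relying on them.

        # A hedged example combining the optional arguments described above.
        # tld, lang and stop appear in the examples; num, start and pause are
        # assumptions about the remaining parameter names.
        from googlesearch import search
        results = search('Mariposa botnet',  # query string (the only mandatory argument)
                         tld='es',           # use google.es instead of google.com
                         lang='es',          # search for pages in Spanish
                         num=10,             # ten results per page
                         start=0,            # begin with the first result
                         stop=50,            # give up after 50 results
                         pause=2.0)          # wait 2 seconds between HTTP requests
        for url in results:
            print url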

EDIT (Jan 2017): Wow, this little piece of code has grown a lot since its creation. It’s now an installable package and has had contributions from many people. Thanks everyone! 🙂

Source code

Get the source code from GitHub:

January 14, 2010

Having fun with URL shorteners, part 2: parasitic storage

Filed under: Tools, Web applications — Mario Vilas @ 5:50 am

In my previous post I briefly mentioned a rather creative use for URL shorteners: storing the contents of arbitrary files (a.k.a. “parasitic storage”). There was previous work in this area, namely a tool called TinyDisk that implemented a rudimentary filesystem on top of the TinyURL service. The bad news is, the site that hosted the tool appears to be down, so I couldn’t get to play with it. 😦


After a deep, thorough investigation (read: 30 seconds of googling) I came to the conclusion that no source code sample was available to do this. Naturally, the solution was to roll my own! 😀 It’s a crude proof of concept, but it works. My first idea was to break up the file to be uploaded into small chunks that would fit the maximum length allowed for a URL, then ask TinyURL to shorten each of them, and keep the shortened URLs to be able to download the file later.

I started to test the tolerance of the TinyURL API and found three interesting things. One, I wouldn’t even need to disguise the data as valid URLs, because TinyURL makes no validation of any kind – all I had to do was make sure the data was encoded in a way that wouldn’t break the HTTP requests. I chose hex encoding, but I’m sure there are more efficient encodings that would do the job nicely. Two, I could send obscenely large URLs. Sending 256 KB of POST data worked like a charm, so I chose that as the chunk size. The next power of two (512 KB) caused the TinyURL server to drop the connection without a reply, but I haven’t determined whether this was really a limitation of their server or of my Squid proxy – in any case I thought it was a good idea to leave it at that. Three, after a while of sending repeated requests the TinyURL service would stop responding for a bit; I guess this is a protection against spammers, quite understandable given that this API is anonymous and doesn’t require an API key like many other services do. The solution is to wait a little while when this happens before making any more HTTP requests.
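
To make the upload flow concrete, here is a minimal sketch of what the chunking and uploading could look like. It assumes TinyURL’s public api-create.php endpoint; the original proof of concept may well differ in the details.

    # Minimal upload sketch (assumes the api-create.php endpoint).
    import urllib
    import urllib2

    # Hex encoding doubles the size, keeping each POST near the 256 KB
    # that worked reliably in the tests above.
    CHUNK_SIZE = 128 * 1024

    def upload(filename):
        """Hex encode a file in chunks and store each chunk at TinyURL.
        Returns the list of short URL codes needed to download it back."""
        codes = []
        data = open(filename, 'rb').read()
        for offset in xrange(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE].encode('hex')
            post = urllib.urlencode({'url': chunk})
            short_url = urllib2.urlopen(
                'http://tinyurl.com/api-create.php', post).read().strip()
            codes.append(short_url.split('/')[-1])  # keep only the code
        return codes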

Now the problem is how to get the data back. A simple GET request won’t work behind proxies, because TinyURL blindly returns our data in a Location: header, and predictably my Squid refused to parse it when I tried. The solution I found was the preview feature. Changing a http://tinyurl.com/xxx link into http://preview.tinyurl.com/xxx returns a webpage that contains the target URL – then we can parse the HTML to get the data back. It’s actually easier this way, because urllib2 tries to follow redirections by default and that would have called for some hacks on our part. Downloading the preview page is painless, but it also introduces more overhead (a 256 KB block produces a preview page of approximately 1 MB).
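
A matching download sketch could look something like this. The regular expression is an assumption – the exact markup of the preview page would have to be checked by hand – but the idea is simply to pull the hex data out of the HTML and decode it.

    # Minimal download sketch (the regex for the preview page is an assumption).
    import re
    import urllib2

    def download(codes, filename):
        """Rebuild a file from the short URL codes returned by upload()."""
        output = open(filename, 'wb')
        for code in codes:
            html = urllib2.urlopen('http://preview.tinyurl.com/' + code).read()
            # The preview page embeds our hex data as the "destination URL".
            match = re.search(r'redirecturl[^>]*>([0-9a-fA-F]+)<', html)
            output.write(match.group(1).decode('hex'))
        output.close()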

That’s pretty much it… when we upload a file we break it into 256 KB blocks and send them to the TinyURL servers, then we store each returned short URL in a text file. Since all shortened URLs begin the same way (http://tinyurl.com/) we only need to store the code after the slash. Then we download each block by asking for the preview page of each URL and extracting the data from the HTML page.


January 11, 2010

Having fun with URL shorteners

Filed under: Tools, Web applications — Mario Vilas @ 2:14 am

I’ve taken some interest in URL shorteners recently. URL shorteners are the latest fad in web services – everybody wants to have their own, even if it’s not yet entirely clear how to profit from one, at least not for us mere mortals. They were born out of Twitter users’ need to compress their microblogging posts (some links are actually longer than 160 characters, believe it or not) and they’ve spread everywhere on the Internet. Major web sites like Facebook, Google and Youtube are into it, and (some say) it’s one step closer to destroying the Web as we know it.

Of course, this can also be used for evil purposes. 🙂

So I made myself a Python module to toy with these things. It can be used as a command line tool as well as a module to be imported into projects of your own, and it has no non-standard dependencies. You can download it from here:

For example, this is how you convert a long URL into a short one (the shortest possible choice is automatically selected):

    $ ./shorturl.py <long URL>

And the reverse, converting that short URL back into the long one:

    $ ./shorturl.py -l <short URL>
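
Using it as a module is just as easy. Here’s a small sketch using the longurl() and is_short_url() functions that also appear in the brute force script at the end of this post (the example short URL is made up):

    # Expanding a short URL from Python (the example URL is made up).
    from shorturl import longurl, is_short_url

    url = 'http://tinyurl.com/example'
    if is_short_url(url):
        print longurl(url)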

You can also choose a specific URL shortener service – let’s try Twitter’s new favorite:

    $ ./shorturl.py -u <service> <long URL>

Now, the fun stuff 🙂 what would happen if we tried to shorten, say, a javascript: link? We can find out with the -t switch (output edited for brevity):

    $ python shorturl.py -t "javascript:alert('Pwn3d');"
    RuntimeError: No data returned by URL shortener API
    RuntimeError: <h3>URL has wrong format! Forgot 'http://'?</h3>
            Short [1]:
            Long  [0]: http://javascript:alert('Pwn3d');
            Short [1]:
            Long  [0]: javascript:alert('Pwn3d');
            Short [1]:
            Long  [0]: http://xrl.usjavascript:alert('Pwn3d');

Most services either report an error condition or simply don’t return anything at all. Two services return a garbage response. But to my surprise, TinyURL, the second largest URL shortener service, is vulnerable to this! True, this is probably not news to many of you, and blocking this kind of link doesn’t provide that much of an advantage (clicking on a link with an unknown target is pretty much game over anyway), but I still fail to see the reason to support it in the first place. I can’t think of any legitimate reason to shorten a javascript link.

Example of TinyURL's poor handling of javascript links
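
For what it’s worth, rejecting these links is nearly a one-liner. Here’s a hedged sketch of the kind of scheme check a shortener could apply before accepting a URL (the list of allowed schemes is my own choice):

    # Hedged sketch of server side validation (the scheme whitelist is my own).
    import urlparse

    def is_safe_to_shorten(url):
        """Accept only plain web links, rejecting javascript: and friends."""
        scheme = urlparse.urlsplit(url)[0].lower()
        return scheme in ('http', 'https', 'ftp')

    assert not is_safe_to_shorten("javascript:alert('Pwn3d');")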

As a matter of fact, TinyURL seems to accept anything you send it, whether it looks like a valid URL or not. This has led some people to build a rogue filesystem called TinyDisk, world readable, yet very hard (if not impossible) to trace back to its creator. The download link seems to be down but the Slashdot article is still there. I also found this article explaining how such a filesystem would work.

Back on the matter, the way people react to URL shorteners makes them perfect for an attacker to hide the true target of his/her exploits, or to bypass spam filters, as has been done in the wild for quite some time now. There are some countermeasures. Many URL shorteners (including TinyURL) have a preview feature to let you examine the URL target before going there (for example, the preview page for our sample link shows the javascript code we used above). The Echofon plugin for Firefox is a Twitter client that makes heavy use of URL shorteners, and among other things it automatically expands short URLs when the mouse hovers over them. There is also a service offering URL expansion, both from the web and as a Firefox plugin.

But… how well do they behave when you shorten the URL multiple times? First, let’s try shortening a link with TinyURL three successive times:

    $ ./shorturl.py -c 3 -u <service> <long URL>

The preview feature only seems to work for the first redirection (see for yourself). Now let’s try shortening our example URL three times, using random providers:

    $ ./shorturl.py -v -c 3 <long URL>

I found the Echofon plugin fails to expand this short URL. However, the expansion service works like a charm – it even told me how many redirections I used! Click here to see it for yourself.

Example of Echofon expanding a short URL

Just for fun, how about an infinite recursion? First, we create a link to a shortened URL that doesn’t yet exist.

    $ python shorturl.py -v -c 5 <nonexistent short URL>

Now let’s go to the TinyURL webpage and create the missing link manually. Voila! We have an infinite redirection. The expansion service seems to break now (see for yourself) but this is of no consequence – the link doesn’t really take you anywhere. Other apps like wget get stuck in an infinite loop, however. This is probably of no real use to an attacker but I thought it was fun to try it out. 🙂
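
Avoiding the loop is not hard, by the way. Here’s a hedged sketch of how a client could follow redirections one hop at a time and bail out when it sees a URL twice (the hop limit and the use of HEAD requests are my own choices):

    # Hedged sketch of loop safe URL expansion (hop limit chosen arbitrarily).
    import httplib
    import urlparse

    def expand_safely(url, limit=10):
        """Follow redirections one hop at a time, bailing out on loops."""
        seen = set()
        while url not in seen and limit > 0:
            seen.add(url)
            limit -= 1
            parts = urlparse.urlsplit(url)
            conn = httplib.HTTPConnection(parts.netloc)
            conn.request('HEAD', parts.path or '/')
            response = conn.getresponse()
            target = response.getheader('location')
            conn.close()
            if not target:
                break  # no Location header, we reached the final URL
            url = target
        return url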

As a final note: while writing this I came across this article describing yet another risk inherent to URL shorteners: information disclosure. A dedicated attacker could dump randomly chosen shortened URLs to retrieve information like corporate intranet links, user IDs and even passwords. I liked the idea, so I whipped up a small script to test it out:

    from random import randint
    from shorturl import longurl, is_short_url

    characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    while 1:
        # Build a random six character token like the ones TinyURL hands out.
        token = ''.join([ characters[ randint(0, len(characters) - 1) ] for i in xrange(6) ])
        try:
            # The short URL format here is an assumption, the original was lost.
            url = longurl('http://tinyurl.com/%s' % token)
        except Exception:
            print token
            continue
        if not is_short_url(url):
            print "%s => %s" % (token, url)
But the results of brute-forcing didn’t seem so great. Then again, one could also wonder how random these short URLs really are… maybe there’s a way to predict them, or at least reduce the number of URLs one has to try. One could also try a dictionary-based approach for services that let users pick custom short URLs. Update: Those TinyURL links weren’t so random after all 😉 here’s an interesting project on URL scraping.

Read Part 2: parasitic storage


Source code

You can get the source code at GitHub.
