Breaking Code

June 29, 2010

Quickpost: Using Google Search from your Python code

Filed under: Tools, Web applications — Mario Vilas @ 6:31 pm

Hi everyone. Today I’ll be showing you a quick script I wrote to make Google searches from Python. There are previous projects that do the same thing (actually, they do it better), namely Googolplex by Sebastian Wain and xgoogle by Peteris Krumins, but unfortunately they’re no longer working. Maybe the lack of complexity of this script will keep it working a little longer… :)

The interface is extremely simple: the module exports only one function, called search().

        # Get the first 20 hits for: "Breaking Code" WordPress blog
        from google import search
        for url in search('"Breaking Code" WordPress blog', stop=20):
            print(url)

You can control which Google Search domain to use, which language to search in, how many results to request per page, which result to start from and when to stop, and how long to wait between queries. However, the only mandatory argument is the query string; everything else has a default value.

        # Get the first 20 hits for "Mariposa botnet" in Google Spain
        from google import search
        for url in search('Mariposa botnet', tld='es', lang='es', stop=20):
            print(url)

A word of caution, though: if you wait too little between requests or make too many of them, Google may block your IP for a while. This is especially annoying when you’re behind a corporate proxy; I won’t be held responsible if your coworkers suddenly develop an urge to kill you! :D
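
If you do need to make many queries, the gentlest approach is to request fewer, larger pages (num and start) and to wait longer between requests (pause). A quick sketch along those lines, using the same search() function:

        # Fetch results 10 to 50, twenty per page, pausing 10 seconds between
        # requests to lower the chance of getting blocked.
        from google import search
        for url in search('"Breaking Code" WordPress blog', num=20, start=10,
                          stop=50, pause=10.0):
            print(url)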

Below are the download links (source code and Windows installers) and the source code for you to read online. Enjoy! :)

Changelog

  • Version 1.0 (initial release).
  • Version 1.01 (fixed the IOError exception bug).
  • Version 1.02 (fixed the missing href bug reported by Rahul Sasi and the duplicate results bug reported by Slawek).
  • Version 1.03 (extracts the hidden links from the results page, thanks ubershmekel!).
  • Version 1.04 (added support for BeautifulSoup 4, thanks alxndr!).
  • Version 1.05 (added compatibility with Python 3.x, a better command line parser, and some improvements by machalekj).
  • Version 1.06 (added an option to only grab the relevant results, instead of all possible links from each result page, as requested by user Nicky and others; see the example below).
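
For example, here is a minimal sketch of the only_standard option added in 1.06 (the parameter belongs to the search() function listed further down):

        # Only yield the "standard" results from each page, skipping the extra
        # links (related searches, image results, and so on).
        from google import search
        for url in search('"Breaking Code" WordPress blog', stop=20, only_standard=True):
            print(url)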

Download

Source code: google-1.06.tar.gz

Windows 32-bit installer: google-1.06.win32.msi

Windows 64-bit installer: google-1.06.win-amd64.msi

Documentation: google-1.06-doc.zip

Source code

Get the source code from GitHub: https://github.com/MarioVilas/google

    #!/usr/bin/env python

    # Python bindings to the Google search engine
    # Copyright (c) 2009-2014, Mario Vilas
    # All rights reserved.
    #
    # Redistribution and use in source and binary forms, with or without
    # modification, are permitted provided that the following conditions are met:
    #
    #     * Redistributions of source code must retain the above copyright notice,
    #       this list of conditions and the following disclaimer.
    #     * Redistributions in binary form must reproduce the above copyright
    #       notice,this list of conditions and the following disclaimer in the
    #       documentation and/or other materials provided with the distribution.
    #     * Neither the name of the copyright holder nor the names of its
    #       contributors may be used to endorse or promote products derived from
    #       this software without specific prior written permission.
    #
    # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
    # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
    # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
    # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
    # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
    # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    # POSSIBILITY OF SUCH DAMAGE.

    __all__ = ['search']

    import os
    import sys
    import time

    if sys.version_info[0] > 2:
        from http.cookiejar import LWPCookieJar
        from urllib.request import Request, urlopen
        from urllib.parse import quote_plus, urlparse, parse_qs
    else:
        from cookielib import LWPCookieJar
        from urllib import quote_plus
        from urllib2 import Request, urlopen
        from urlparse import urlparse, parse_qs

    # Lazy import of BeautifulSoup.
    BeautifulSoup = None

    # URL templates to make Google searches.
    url_home          = "http://www.google.%(tld)s/"
    url_search        = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&btnG=Google+Search&inurl=https"
    url_next_page     = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&start=%(start)d&inurl=https"
    url_search_num    = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&num=%(num)d&btnG=Google+Search&inurl=https"
    url_next_page_num = "http://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&num=%(num)d&start=%(start)d&inurl=https"

    # Cookie jar. Stored at the user's home folder.
    home_folder = os.getenv('HOME')
    if not home_folder:
        home_folder = os.getenv('USERHOME')
        if not home_folder:
            home_folder = '.'   # Use the current folder on error.
    cookie_jar = LWPCookieJar(os.path.join(home_folder, '.google-cookie'))
    try:
        cookie_jar.load()
    except Exception:
        pass

    # Request the given URL and return the response page, using the cookie jar.
    def get_page(url):
        """
        Request the given URL and return the response page, using the cookie jar.

        @type  url: str
        @param url: URL to retrieve.

        @rtype:  str
        @return: Web page retrieved for the given URL.

        @raise IOError: An exception is raised on error.
        @raise urllib2.URLError: An exception is raised on error.
        @raise urllib2.HTTPError: An exception is raised on error.
        """
        request = Request(url)
        request.add_header('User-Agent',
                           'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
        cookie_jar.add_cookie_header(request)
        response = urlopen(request)
        cookie_jar.extract_cookies(response, request)
        html = response.read()
        response.close()
        cookie_jar.save()
        return html

    # Filter links found in the Google result pages HTML code.
    # Returns None if the link doesn't yield a valid result.
    def filter_result(link):
        try:

            # Valid results are absolute URLs not pointing to a Google domain
            # like images.google.com or googleusercontent.com
            o = urlparse(link, 'http')
            if o.netloc and 'google' not in o.netloc:
                return link

            # Decode hidden URLs.
            if link.startswith('/url?'):
                link = parse_qs(o.query)['q'][0]

                # Valid results are absolute URLs not pointing to a Google domain
                # like images.google.com or googleusercontent.com
                o = urlparse(link, 'http')
                if o.netloc and 'google' not in o.netloc:
                    return link

        # Otherwise, or on error, return None.
        except Exception:
            pass
        return None

    # Returns a generator that yields URLs.
    def search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0,
               only_standard=False):
        """
        Search the given query string using Google.

        @type  query: str
        @param query: Query string. Must NOT be url-encoded.

        @type  tld: str
        @param tld: Top level domain.

        @type  lang: str
        @param lang: Language.

        @type  num: int
        @param num: Number of results per page.

        @type  start: int
        @param start: First result to retrieve.

        @type  stop: int
        @param stop: Last result to retrieve.
            Use C{None} to keep searching forever.

        @type  pause: float
        @param pause: Lapse to wait between HTTP requests.
            A lapse too long will make the search slow, but a lapse too short may
            cause Google to block your IP. Your mileage may vary!

        @type  only_standard: bool
        @param only_standard: If C{True}, only returns the standard results from
            each page. If C{False}, it returns every possible link from each page,
            except for those that point back to Google itself. Defaults to C{False}
            for backwards compatibility with older versions of this module.

        @rtype:  generator
        @return: Generator (iterator) that yields found URLs. If the C{stop}
            parameter is C{None} the iterator will loop forever.
        """

        # Lazy import of BeautifulSoup.
        # Try to use BeautifulSoup 4 if available, fall back to 3 otherwise.
        global BeautifulSoup
        if BeautifulSoup is None:
            try:
                from bs4 import BeautifulSoup
            except ImportError:
                from BeautifulSoup import BeautifulSoup

        # Set of hashes for the results found.
        # This is used to avoid repeated results.
        hashes = set()

        # Prepare the search string.
        query = quote_plus(query)

        # Grab the cookie from the home page.
        get_page(url_home % vars())

        # Prepare the URL of the first request.
        if start:
            if num == 10:
                url = url_next_page % vars()
            else:
                url = url_next_page_num % vars()
        else:
            if num == 10:
                url = url_search % vars()
            else:
                url = url_search_num % vars()

        # Loop until we reach the maximum result, if any (otherwise, loop forever).
        while not stop or start < stop:

            # Sleep between requests.
            time.sleep(pause)

            # Request the Google Search results page.
            html = get_page(url)

            # Parse the response and process every anchored URL.
            soup = BeautifulSoup(html)
            anchors = soup.find(id='search').findAll('a')
            for a in anchors:

                # Leave only the "standard" results if requested.
                # Otherwise grab all possible links.
                if only_standard and (
                            not a.parent or a.parent.name.lower() != "h3"):
                    continue

                # Get the URL from the anchor tag.
                try:
                    link = a['href']
                except KeyError:
                    continue

                # Filter invalid links and links pointing to Google itself.
                link = filter_result(link)
                if not link:
                    continue

                # Discard repeated results.
                h = hash(link)
                if h in hashes:
                    continue
                hashes.add(h)

                # Yield the result.
                yield link

            # End if there are no more results.
            if not soup.find(id='nav'):
                break

            # Prepare the URL for the next request.
            start += num
            if num == 10:
                url = url_next_page % vars()
            else:
                url = url_next_page_num % vars()

    # When run as a script...
    if __name__ == "__main__":

        from optparse import OptionParser, IndentedHelpFormatter

        class BannerHelpFormatter(IndentedHelpFormatter):
            "Just a small tweak to optparse to be able to print a banner."
            def __init__(self, banner, *argv, **argd):
                self.banner = banner
                IndentedHelpFormatter.__init__(self, *argv, **argd)
            def format_usage(self, usage):
                msg = IndentedHelpFormatter.format_usage(self, usage)
                return '%s\n%s' % (self.banner, msg)

        # Parse the command line arguments.
        formatter = BannerHelpFormatter(
            "Python script to use the Google search engine\n"
            "By Mario Vilas (mvilas at gmail dot com)\n"
            "https://github.com/MarioVilas/google\n"
        )
        parser = OptionParser(formatter=formatter)
        parser.set_usage("%prog [options] query")
        parser.add_option("--tld", metavar="TLD", type="string", default="com",
                          help="top level domain to use [default: com]")
        parser.add_option("--lang", metavar="LANGUAGE", type="string", default="en",
                          help="produce results in the given language [default: en]")
        parser.add_option("--num", metavar="NUMBER", type="int", default=10,
                          help="number of results per page [default: 10]")
        parser.add_option("--start", metavar="NUMBER", type="int", default=0,
                          help="first result to retrieve [default: 0]")
        parser.add_option("--stop", metavar="NUMBER", type="int", default=0,
                          help="last result to retrieve [default: unlimited]")
        parser.add_option("--pause", metavar="SECONDS", type="float", default=2.0,
                          help="pause between HTTP requests [default: 2.0]")
        parser.add_option("--all", dest="only_standard",
                          action="store_false", default=True,
                          help="grab all possible links from result pages")
        (options, args) = parser.parse_args()
        query = ' '.join(args)
        if not query:
            parser.print_help()
            sys.exit(2)
        params = [(k,v) for (k,v) in options.__dict__.items() if not k.startswith('_')]
        params = dict(params)

        # Run the query.
        for url in search(query, **params):
            print(url)

64 Comments »

  1. Hey, that’s pretty cool – I used to do the same with html5lib, BeautifulSoup and mechanize but you’d better remove the code quickly – this violates Google’s TOS.

    Comment by cryzed — June 30, 2010 @ 6:45 am

  2. Hey, this is really cool! I kept getting an IOError when trying to instantiate cookiejar – .google-cookie does not exist. Any ideas as to why it wouldn’t be creating the cookie? For now, I just caught the exception and passed, which is probably dangerous.

    Secondly, how can I modify this to get the number of results for a specific query? I’m not sure how to parse the html accordingly, because I can’t quite just print out the variable ‘html’ to see what it looks like.

    Thanks

    Comment by Bob — July 5, 2010 @ 6:30 am

  3. Ah sorry for the spam. I actually figured out both my questions, haha. Using soup.prettify(), I outputted the html and was able to look at BeautifulSoup documentation and parse accordingly.

    I removed the IOException catching, and it turns out that the .google-cookie was created, just not on the first pass of the program (or at least this is what I think happened?).

    Comment by Bob — July 5, 2010 @ 7:07 am

  4. @cryzed: Thanks for the comment! I don’t think posting the code violates the TOS, but using it may. I’m not sure exactly which is covered by the TOS and which isn’t, really. :(

    @Bob: Not spam at all! :)

    I think maybe different versions of the cookiejar module throw different exceptions. The first run is supposed to raise an exception since the file doesn’t exist yet, but it should be a cookiejar.LoadError when calling load(), not an IOError when instantiating the object.

    Catching the exception and running it once was the right call, but I should think of a more elegant solution…

    Comment by Mario Vilas — July 5, 2010 @ 8:36 am

  5. Good one!
    Only one point, the xgoogle project is still working at the moment.

    Regards

    Comment by Emilio — August 6, 2010 @ 10:05 pm

  6. @Emilio: Thanks! Good to know xgoogle is working now, at the time I wrote this it had the “0 results” bug too.

    Comment by Mario Vilas — August 9, 2010 @ 5:03 am

  7. xgoogle is still giving me “0 results”. Is this fixed or is something else going on…

    Comment by Dave — August 23, 2010 @ 3:02 pm

  8. Hi,

    look at the last comments in the post of catonmat. There is a minor fix. Currently I am using a functional version after this fix.

    Regards.

    Comment by Emilio — August 23, 2010 @ 3:36 pm

  9. Well done,

    I just removed YouTube links by replacing:
    if o.netloc and ('google' not in o.netloc)
    with
    if o.netloc and ('google' not in o.netloc) and ('youtube' not in o.netloc):

    Thanks for this code!
    I hope it will work for a long time.

    Comment by NaN — August 27, 2010 @ 11:19 pm

  10. Mario

    There is a problem with getting results from google search using your script.

    If Google does not show local business results, your script works great, but if it shows them (map and links) I get duplicate results like:
    (these are urls by map)

    http://www.example1.pl

    http://www.example1.pl

    http://www.example2.pl

    http://www.example2.pl


    and results 1 to 10 on the first page do not show.

    Is this clear? I can send you a screenshot with the incorrect URLs.

    greetings

    Comment by Slawek — October 1, 2010 @ 9:36 pm

  11. [...] a quick and dirty fix IMHO the best python google API written by Mario Vilas from breakingcode [...]

    Pingback by Using Google Search from your Python code [fixed] « Ulisses Castro Security Labs — November 21, 2010 @ 3:32 am

  12. [...] Google has undergone a lot of changes since 2001 and Googolplex and other  libraries like xgoogle are now part of Internet history. A similar new library  is available at Mario Vilas Google Search Python blog post as Quickpost: Using Google Search from your Python code. [...]

    Pingback by Google Search NoAPI « Data Big Bang Blog — January 20, 2011 @ 8:03 pm

  13. [...] Google, information gathering, LinkedIn, open source, python, recon, search, tool, web Breaking Code This entry was posted in Breaking Code and tagged code, from, Google, Python, Quickpost, Search, [...]

    Pingback by Quickpost: Using Google Search from your Python code | Linux-backtrack.com — January 24, 2011 @ 9:13 am

  14. It works fine. Thanks..

    Comment by urkera — August 14, 2011 @ 2:57 am

  15. Very good. Only working Google searcher that I’ve found. Using the fix it works very well.

    Comment by Juliano Costa Machado — October 24, 2011 @ 2:00 am

  16. Found this update that works even if you get the funky fake url’s google is using (/?url=http://thereal.url.here) http://codepad.org/WLte6a0U

    Comment by ubershmekel — February 13, 2012 @ 8:23 pm

  17. Nice! I’ll patch it right away, thanks! :)

    Comment by Mario Vilas — February 13, 2012 @ 8:48 pm

  18. I only get 10 results per page no matter if I set a higher number for num.

    anyone find a fix for this?

    Comment by dan — March 21, 2012 @ 7:51 am

  19. OK, after some more googling… finally found the fix for being limited to 10 results:

    add "as_qdr=all" to the URL

    Comment by dan — March 21, 2012 @ 8:38 am

  20. @dan: Interesting! That limit didn’t exist when I originally wrote the script, it must be a new parameter to the URL. I’ll fix the script right away, thanks!

    Comment by Mario Vilas — March 21, 2012 @ 11:26 am

  21. @dan: Odd, I can’t seem to reproduce your results. Adding as_qdr=all seems to have no effect (good or bad), and setting a higher number for “num” is working fine for me in the first place…

    I’ll take note of this just in case, but for now I won’t be modifying the script since I can’t reproduce the problem.

    Comment by Mario Vilas — March 21, 2012 @ 11:36 am

  22. Hmm, yes, that is odd.

    I believe what is causing it for me is Google’s new “Google Instant” feature, and adding as_qdr=all seems to shut it off.

    Would be nice to figure out why it is happening for me.

    Well, I did a couple of tests… and what TLD are you using, like .com or something else? It seems the 10 results limit is only happening to me with .com, and not with other TLDs (I tested .ca and .co.uk, and both give me the full number of results without adding as_qdr=all).

    Comment by dan — March 21, 2012 @ 5:16 pm

  23. I think this is happening or not depending on the query you make. Reading up on this: http://jwebnet.net/advancedgooglesearch.html it says the as_qdr parameter controls how old are the results it gives you.

    This is going to become a more complex problem IMHO. I need to read the whole document and see what can be implemented in the script.

    Comment by Mario Vilas — March 22, 2012 @ 6:07 pm

  24. It’s highly unfortunate Google doesn’t have an actual API for searching (their Custom Search API is intentionally crippled), unlike Bing and Yahoo. I can’t use Bing or Yahoo though because Google consistently gives me the best, the most, and the most accurate results for things I’m trying to retrieve, so I’ve been using hacks like this instead.

    This is really good though; thank you for it. It’d be great if you keep maintaining it for a while.

    Comment by Anorov — March 23, 2012 @ 11:32 am

  25. I’m getting the following error

    >> python google.py '2012 movies'

    Traceback (most recent call last):
    File "google.py", line 211, in
    for url in search(query, stop=20):
    File "google.py", line 175, in search
    soup = BeautifulSoup.BeautifulSoup(html)
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
    File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
    File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
    File "/usr/lib/python2.6/HTMLParser.py", line 266, in parse_starttag
    % (rawdata[k:endpos][:20],))
    File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: junk characters in start tag: u'{t:119}); class=gbzt', at line 1, column 32127

    Comment by John — May 17, 2012 @ 12:28 am

  26. I think that may be a problem with the version of BeautifulSoup. At least I can’t seem to reproduce it, using the exact same search query.

    Comment by Mario Vilas — May 17, 2012 @ 12:34 am

  27. Thanks for your help. What version of BeautifulSoup would you recommend? I’m running it on Ubuntu.

    Comment by John — May 17, 2012 @ 6:00 am

  28. You’re welcome! I’m using version 3.2.0 on Windows, but I see in the webpage that the latest 3.x version is 3.2.1, so I’d go with that. Version 4.x probably won’t work. http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz

    Comment by Mario Vilas — May 17, 2012 @ 10:57 am

  29. Hi Mario, thanks for your great package.
    I get the very same error as John.
    What is most interesting, it runs great on Windows, but throws an error on Linux.
    – Windows 7, Python 2.6.6, BeautifulSoup 3.2.1
    – BackTrack 5 R2, Python 2.6.5, BeautifulSoup 3.2.1
    I have managed to make it work by commenting out the lines below in HTMLParser.py, but that is not the best solution IMHO :(

    """
    offset = offset + len(self.__starttag_text)
    self.error("junk characters in start tag: %r"
    % (rawdata[k:endpos][:20],))
    """

    Comment by Dejan — May 18, 2012 @ 1:37 pm

  30. UPDATE:
    It seems that my BackTrack was using BeautifulSoup 3.1.0.1 from /usr/shared/pyshared instead of 3.2.1 from /usr/local/lib/python2.6/dist-packages/BeautifulSoup.py
    It works now :)

    Comment by Dejan — May 18, 2012 @ 2:18 pm

  31. [...] though.) The code is also available on GitHub, together with a small third-party library (via Mario Vilas) that’s used to access Google search results. Here’s the [...]

    Pingback by How to answer a question: a simple system | DDI — June 13, 2012 @ 6:30 pm

  32. Project started in 2010 and still working like a charm in July 2012… AWESOME :)

    Comment by topo — July 2, 2012 @ 6:22 pm

  33. Thanks! :D

    Comment by Mario Vilas — July 3, 2012 @ 1:25 pm

  34. It works correctly for me when run as a standalone Python script. But when I run this script from Django, it encounters "HTTP Error 503: Service Unavailable".
    Is there any possible solution?

    File "google.py" in get_page
    82. response = urllib2.urlopen(request)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in urlopen
    126. return _opener.open(url, data, timeout)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in open
    406. response = meth(req, response)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in http_response
    519. 'http', request, response, code, msg, hdrs)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in error
    438. result = self._call_chain(*args)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in _call_chain
    378. result = func(*args)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in http_error_302
    625. return self.parent.open(new, timeout=req.timeout)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in open
    406. response = meth(req, response)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in http_response
    519. 'http', request, response, code, msg, hdrs)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in error
    444. return self._call_chain(*args)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in _call_chain
    378. result = func(*args)
    File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py" in http_error_default
    527. raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)

    Exception Type: HTTPError at
    Exception Value: HTTP Error 503: Service Unavailable

    Comment by papalagi — July 7, 2012 @ 1:05 pm

  35. @papalagi: I haven’t ever used Django, but is it possible that it does some global change that affects the behavior of urllib2? It could also be something in your environment, from the traceback it seems there was a 302 (URL redirection) before the 503 error. Try sniffing the network traffic, maybe whatever is going wrong will stand out there.

    Comment by Mario Vilas — July 7, 2012 @ 5:57 pm

  36. Hi Mario, first of all thanks for your class.

    I have a little problem with the following search:

    for url in search('gato', num=100, start=0, stop=700):

    The first page to visit is http://www.google.com/search?hl=en&q=gato&num=100&btnG=Google+Search

    In the browser it contains 100 or more results, but with your class I get very few records (about 16 from the first page).

    Do you know why?

    Thanks

    Comment by Pedro — July 24, 2012 @ 9:37 pm

  37. Hi Mario, thanks for the awesome tool. It was working fine but now I’m getting the following error:

    Traceback (most recent call last):
    File "g_search.py", line 216, in
    for url in search(query, stop=11):
    File "g_search.py", line 162, in search
    get_page(url_home % vars())
    File "g_search.py", line 84, in get_page
    response = urllib2.urlopen(request)
    File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
    File "/usr/lib/python2.7/urllib2.py", line 391, in open
    response = self._open(req, data)
    File "/usr/lib/python2.7/urllib2.py", line 409, in _open
    '_open', req)
    File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
    File "/usr/lib/python2.7/urllib2.py", line 1185, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "/usr/lib/python2.7/urllib2.py", line 1160, in do_open
    raise URLError(err)
    urllib2.URLError:

    Comment by Johny — July 24, 2012 @ 11:05 pm

  38. Hello Pedro! I’m guessing the difference is in the real-time search API. Currently the Google Search page uses this new API, which is far better (among other things because it works over Ajax instead of a full page reload, and has fewer restrictions than the old API). This script uses the old API, which is still maintained for backwards compatibility with older browsers; this is on purpose, to make sure it keeps working with no changes over a longer time, but it also means the search results won’t be exactly the same.

    Sebastian Wain (the author of Googolplex) has recently written a new Google Search script that works on the Ajax API instead; if you need results closer to the web page’s, you’ll want to try that one out :)

    http://blog.databigbang.com/google-search-no-api/

    Cheers!
    -Mario

    Comment by Mario Vilas — July 25, 2012 @ 9:29 am

  39. Hi Johny! That traceback seems incomplete (there’s no error message for the URLError exception), but that kind of error usually means some connectivity problem. Make sure you can reach google.com normally from that machine and the Python interpreter isn’t prevented by SELinux from creating sockets and connecting to the outside (some free hosting providers do that to prevent misuse). Good luck! :)

    Comment by Mario Vilas — July 25, 2012 @ 9:36 am

  40. Thanks Mario. I’ll try to debug as you said. I also think that it may be some connectivity problem.

    Comment by Johny — July 27, 2012 @ 8:13 pm

  41. Hi, awesome binding! I’ve been searching a long time for something that works. I created this using your binding: http://lnemec.tk/search/

    Comment by Lukas Nemec — September 10, 2012 @ 12:12 pm

  42. Thank you for this amazing piece of code.

    Comment by xdr — April 23, 2013 @ 2:47 am

  43. Thank you for your comment! :)

    Comment by Mario Vilas — April 23, 2013 @ 10:10 am

  44. HI Mario!
    First of all – thanks for this cool library!

    I tried to use a proxy to avoid getting my home IP blocked by Google. After around 100 search requests I was blocked, and even when I disabled the proxy it turned out that my home IP was also blocked! I suspect this is because of cookies. Am I right?
    How can I use a proxy with your library and avoid Google’s penalties? Can I disable cookies or tweak something else?

    Comment by Olexiy Logvinov (@OlexiyL) — May 23, 2013 @ 11:33 am

  45. Thanks! :)

    Honestly I’ve never tried circumventing the IP block, but I’d suggest deleting the cookie only after you know you’ve been banned. The script stores the cookie in a file called “.google-cookie” in the home directory of the current user; you can try deleting that file. From Python code that imports google.py instead of running it, you could try this: import google; google.cookie_jar.clear()
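
    Something along these lines should do it (just a rough sketch; the file name is taken from the cookie_jar object the module creates):

    import os
    import google

    # Forget the cookies currently held in memory...
    google.cookie_jar.clear()

    # ...and remove the saved cookie file, so a fresh one is created next time.
    if os.path.exists(google.cookie_jar.filename):
        os.remove(google.cookie_jar.filename)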

    Let me know if it works!

    Comment by Mario Vilas — May 23, 2013 @ 1:11 pm

  46. Hello Mario,

    thanks a lot for this great tool! I’m very interested in getting the estimated search result number from Google. Do you plan to implement this function, and if not, can you point me in a direction how this can be done by modifying your great module? Thanks again, best regards!

    Comment by zwieback86 — July 9, 2013 @ 11:27 am

  47. Me again, I already got it done by myself. It was really easy, didn’t expect that. Greets!

    Comment by zwieback86 — July 9, 2013 @ 8:54 pm

  48. @zwieback86: Hehehe, you coded it faster than I got around to looking at the blog comments! ;)

    Comment by Mario Vilas — July 10, 2013 @ 10:17 am

  49. I am extremely thankful for this tool and appreciate you taking the time to code it. I was using a Windows version of a Google scraper previously, but since I don’t like to use Windows and was unfamiliar with the language it was coded in, this became the perfect solution. One thing that I miss about the Windows version though is that I could specify a text file of proxies for it to use, and it was able to do lots of searches without Google being able to stop it. I am not sure how to incorporate this into the code myself; do you have any suggestions?

    Comment by Linco — August 2, 2013 @ 7:30 pm

  50. Sounds like a great feature to add! Since this script uses urllib2 to make the HTTP queries, you’d have to set the proxy like this: http://stackoverflow.com/questions/1450132/proxy-with-urllib2
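
    Roughly like this, for example (only a sketch, assuming Python 2; the proxy address 127.0.0.1:8080 is a made-up placeholder):

    import urllib2
    import google

    # Install a global opener that routes urllib2 requests through the proxy;
    # google.py uses urllib2's urlopen(), so it picks this up automatically.
    proxy = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})
    urllib2.install_opener(urllib2.build_opener(proxy))

    for url in google.search('"Breaking Code" WordPress blog', stop=10):
        print(url)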

    Comment by Mario Vilas — August 2, 2013 @ 8:25 pm

  51. Always get 503 Service Unavailable in all my 3 VPS and my local.
    :(

    Comment by Zeray Rice — October 24, 2013 @ 4:37 pm

  52. Seems to be working on my machine… Can you tell me anything else, like what country are you trying from, or what search query are you using?

    Comment by Mario Vilas — October 24, 2013 @ 4:43 pm

  53. It doesn’t work; I get a 403 Forbidden error.

    Comment by asdasda — December 31, 2013 @ 11:35 am

  54. No idea why the 1.03 link is broken, I can see it in the directory index and I didn’t change anything. Sourceforge hosting sucks I guess. :(

    Try this instead for version 1.03: https://github.com/MarioVilas/google/tree/1ff92d2ec01f9103800a0ec94c1454979817de1a

    The link to the latest version seems to be working for me. But if it doesn’t anymore just go to Github instead, it’s much more reliable.

    Comment by Mario Vilas — December 31, 2013 @ 5:44 pm

  55. Cool tool, thanks! I noticed that it grabs a lot of urls off the page in addition to the standard 10 big blue search results. For anyone who wants to eliminate all the extra urls, I added

    if not a.parent.name == "h3":
        continue

    immediately after the line that reads

    for a in anchors:

    Comment by Nicky — April 13, 2014 @ 8:24 pm

  56. Thanks! I just committed a change to add that as an option. :)

    Comment by Mario Vilas — April 14, 2014 @ 2:11 pm

  57. Hi, I am wondering if we could get results in a specific time range. For example, if we want the results from 01/01/2009 to 01/01/2010, can we pass this as a parameter to your function? I think it should be possible. Could you please clarify?

    Comment by David — April 23, 2014 @ 10:18 pm

  58. Hi David. Does the Google Search legacy API support this? I don’t recall such option… but if there is, let me know how to use it and I’ll add it to the script. :)

    Comment by Mario Vilas — April 23, 2014 @ 11:57 pm

  59. Hi, I think there could be some way to add a specific time range; I tried, but I do not know why it always failed.
    You could pass one more parameter in the URL like this:

    https://www.google.com/#lr=lang_en&num=100&q=adidas&start=100&tbm=blg&tbs=cdr:1,cd_min:1/1/2009,cd_max:2/1/2009,lr:lang_1en

    so "tbs=cdr:1,cd_min:1/1/2009,cd_max:2/1/2009" means to search for results between 1/1/2009 and 2/1/2009.
    The thing is, whenever I request this URL in my program, it always returns the original search results. By original results I mean the plain search query without the time range. However, when I open it in the browser, it returns the desired results.
    Besides, what puzzles me is that if I save the HTML of the search results, it turns out there is no information in the HTML file. The only way to get the links of the query is to save the web archive. So I am really lost about it. Could you please have a look at it so we can discuss it?
    Thanks!

    Comment by David — May 3, 2014 @ 6:23 pm

  60. Also, you could see this post about the parameters of Google search URLs. I tried for about one week to figure out how to get the results of a time range search in my program, but it always fetches the original search results instead of the time range search. Hope you can find a way! Thanks!

    Comment by David — May 3, 2014 @ 6:25 pm

  61. It works differently in the browser because at some point Google switched to a new API using Ajax, so all you’ll see in the HTML is the JavaScript code to access the API, not the search results. This script uses the legacy API that was in place before the switch, so while it’s more convenient (no Ajax, the HTML is parseable), some options that were added later to the search engine may not work.

    Comment by Mario Vilas — May 3, 2014 @ 6:36 pm

  62. If so, how could we solve this problem in order to get time range search results?

    Comment by David — May 3, 2014 @ 8:07 pm

