Google App Engine has added support for serving images directly out of its Picasa infrastructure. Because "image serving service" is unwieldy to say, I'm going to use PIS (Picasa Image Serving) for shorthand.
Benefits
This image service has lots of great benefits:
- Streamed reads - animated GIFs will start playing before the whole image is downloaded
- Resize & crop images via url parameters - no CPU or extra storage costs for thumbnails!
- Served from edge servers
- Cost-effective way to host/serve large images
- Supports https
- Spread across multiple domains
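As a sketch of that resize/crop feature: a PIS url takes a size suffix such as =s100 (scale the longest side to 100px) or =s100-c (scale and center-crop to a square). The helper below is illustrative, not part of the demo app; the suffix syntax is the documented serving-url parameter format.

```python
def thumb_url(serving_url, size, crop=False):
    """Build a resized (optionally square-cropped) variant of a PIS url."""
    suffix = "=s%d" % size
    if crop:
        suffix += "-c"  # -c asks the server to center-crop to a square
    return serving_url + suffix

# e.g. //lh3.ggpht.com/ABC123 -> //lh3.ggpht.com/ABC123=s100-c
```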
You can download a working example source.
Overview of how it works
- Create a new Google AppEngine instance with a unique name.
- Edit the demo project and upload it to Google App Engine.
- Bulk upload all your images.
- Download all the PIS urls and use them in your site.
Set up Google App Engine
You'll need to install enough of the Google App Engine SDK to use the code deploy tools. There are lots of how-to documents out there. This app uses Python, so you'll need to install Python 2.6.x. There are newer releases of Python, but GAE isn't compatible with them yet.
When you go to create your Google App Engine instance, you'll need to change some default settings before you click Create Application.
You want to turn off the High Replication feature. Copying data out of a Google App Engine instance isn't supported for that flavor of datastore. Also, High Replication is more expensive, and since you should be embedding PIS image urls somewhere else, it's really just overkill.
Now edit the app.yaml file and change the application name from pis-demo to your new application name.
file:app.yaml
application: pis-demo
version: 3
runtime: python
You can make other changes to the application in this file. By default, uploading images is restricted to admins. If you want anybody to be able to upload and view images, you'll need to change permissions:
builtins:
- deferred: on
- remote_api: on
- datastore_admin: on
...
- url: /uploadmgr
  static_files: uploadmgr.html
  upload: uploadmgr.html
  login: admin
- url: /getuploadurl
  script: main.py
  login: admin
- url: /upload
  script: main.py
  login: admin
- url: /del
  script: main.py
  login: admin
- url: /deleteall
  script: main.py
  login: admin
Upload the application
Upload the whole project (instructions for this step can be found in numerous how-to's).
Upload images
The upload is pretty fast if you have a good internet connection. You should do them in batches. A feature I'd like to add soon is batch naming, to make deleting sets of images easier.
View images
Note: the pagination uses He3's PagedQuery. It is fairly efficient; however, it does depend on memcache (which could be purged quickly if your app isn't heavily used). This means the viewing operation can be too expensive for broad, public consumption. Right now the viewer page is NOT restricted to admins like upload is. The delete commands found here are restricted, though. There are two ways to reference the uploaded images: an appspot url, and a PIS url which typically starts with lh3.ggpht.com (where the lh# changes).
//lh3.ggpht.com/6EP5Tzu8Ov2Ke_0ViE3t0hnbUCKozXp65JNvw8LVvemN32AbXU7r9pBL9RJiXGN5s3Zbt3SEP3W53HTV
You'll notice the urls don't start with http: or https:; this is intentional. You can use //domain/ style urls so that if images are embedded in secure (https) pages, the urls will use the same scheme without modification.
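The conversion is just string trimming. A minimal sketch (protocol_relative is a hypothetical helper name, not from the demo source):

```python
def protocol_relative(url):
    """Strip http:/https: so the url inherits the embedding page's scheme."""
    low = url.lower()
    if low.startswith('http://'):
        return url[5:]   # drop 'http:', keep the leading //
    if low.startswith('https://'):
        return url[6:]   # drop 'https:'
    return url           # already scheme-relative
```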
You can use the //pis-demo.appspot.com/i/7011/fate-of-sid.png style urls. This approach will look up the ggpht.com url in the datastore and do a redirect; however, this generates CPU and database overhead costs. Also, there are load-speed advantages to spreading your images across multiple domains like lh1.ggpht.com, lh2.ggpht.com, etc. However, if you need to restrict access or track usage, then the redirector is the way to go.
Download image urls
This can be done multiple ways, but the idea is that you are uploading a lot of images, so generating a long table on the server isn't a good idea. Here are three ways to download the ImageUrlMap data.
I prefer Option 2 for developers since it creates a local backup of data and keeps your dev environment up to date. Option 3 is good if you're just looking to host images.
Keep in mind that blobstore (actual images) are NOT downloaded. What you get is a list of urls serving images that you can reference in other code.
Option 1:
Copy tables between Google App Engine instances directly.
Good if you have a ton of data.
- Edit appengine_config.py and add back in the lines for remoteapi_CUSTOM_ENVIRONMENT_AUTHENTICATION.
- Add in your GAE domain name <yourapp>.appspot.com
- NOTE: you need to grant permissions to the app *getting* the data
- Deploy app to GAE (copy up files)
- Use the Datastore Admin button "Copy to Another App"
Option 2:
Download SQL copy of the database, then upload to local running development server or different GAE instance.
Makes a local backup of data and keeps your GAE dev system primed.
- Verify app.yaml has:
remote_api: on
- Download the data to a local sql file:
appcfg.py download_data --application=<yourapp> --kind=ImageUrlMap --url=http://<yourapp>.appspot.com/_ah/remote_api --filename=imageurlmap.sql
- Make sure the Google App Engine launcher is running your app locally. NOTE: the url's port might be different!
- Upload to dev server
appcfg.py upload_data --application=dev~<yourapp> --kind=ImageUrlMap --url=http://localhost:8080/_ah/remote_api --filename=imageurlmap.sql
- Verify the target instance's app.yaml has:
remote_api: on
- Upload to a different App Engine instance. NOTE: unlike the dev-server command, there is no dev~ prefix this time!
appcfg.py upload_data --application=<yourapp> --kind=ImageUrlMap --url=http://<yourapp>.appspot.com/_ah/remote_api --filename=imageurlmap.sql
Option 3:
Download all your urls as a local csv file.
- You can use the bulkloader.yaml adaptor file in tools and execute this command.
appcfg.py download_data --config_file=bulkloader.yaml --kind=ImageUrlMap --url=http://<yourapp>.appspot.com/_ah/remote_api --filename=imageurlmap.csv
- You can import this csv file into whatever format you like.
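For example, the exported rows can be loaded with the csv module. A sketch, assuming the file has a header row containing filename and imgserving_url columns (check the header in your own export before relying on these names):

```python
import csv

def load_url_map(csv_lines):
    """Map filename -> serving url from exported csv rows.

    csv_lines is any iterable of text lines (an open file works);
    the first line must be the header row.
    """
    return {row['filename']: row['imgserving_url']
            for row in csv.DictReader(csv_lines)}
```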
Code details
While I've been coding for over 20 years, I haven't had a ton of experience with Python. I'm sure some non-pythonisms have made their way into the source, so if you have any suggestions on how I can fix things, email me. The logic to upload images is confusing. Here's a simplified version:

# this is for use with blueimp jquery uploader
class UploadUrlJsonHandler(webapp.RequestHandler):
    def get(self):
        batchidstr = self.request.get('batchid')[:20]
        if not batchidstr:
            batchidstr = '0'
        upload_url = blobstore.create_upload_url(
            "/upload/%s/" % batchidstr,
            max_bytes_total=MAX_FILE_UPLOAD_SIZE)
        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write('"%s"' % upload_url)
This is initially confusing until you realize blob uploading is a different service. Basically, the service hands you a url that does all the complexity of uploading data, and when you ask for that url you pass in a "redirect" url of your own to be called once the upload is complete. That completion url is yours, but it's stored internally by the service.
batchidstr is a number generated by the server that groups all uploads that happen from the same page. This feature might be handy because large numbers of images can get unwieldy. The batchid is read from the url and embedded in the upload-complete url.
It's interesting to note that the url returned by create_upload_url() does not contain the url you used to build it. I suspect the server stores the url and returns a type of cookie.
class GetBatchId(webapp.RequestHandler):
    def get(self):
        # Used by upload page to tag batches of uploads. We want them
        # generally sequential; however, there can be gaps in the batchids.
        # We use Google DB to grab and consume a row id as a marker.
        (start_id, end_id) = db.allocate_ids(
            db.Key.from_path('ImageUrlMap', 1), 1)
        self.response.out.write('{"batchid":%d}' % start_id)
class UploadFileHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self, batchidstr):
        """Called by blobstore once the streaming upload completes."""
        import urllib
        import urlparse
        from google.appengine.api import images
        from model import ImageUrlMap

        blob_info = self.get_uploads('files[]')[0]
        # this call can take several seconds
        imgserving_url = images.get_serving_url(blob_info.key())
        # remove the scheme part from the url so http/https is transparent
        # e.g. http://domain/path becomes //domain/path
        # (which just uses the referring page's scheme)
        if imgserving_url.lower()[0:7] == 'http://':
            imgserving_url = imgserving_url[5:]
        elif imgserving_url.lower()[0:8] == 'https://':  # shouldn't happen
            imgserving_url = imgserving_url[6:]
        imgmap = ImageUrlMap(blobstoreinfo=blob_info.key(),
                             imgserving_url=imgserving_url,
                             filename=blob_info.filename,
                             filesize=blob_info.size,
                             batchid=int(batchidstr))
        imgmap.put()
        baseuri = urlparse.urlparse(self.request.url)
        deleteuri = ("//%s/del/%d/%s" %
                     (baseuri.netloc, imgmap.key().id(),
                      urllib.quote(imgmap.filename)))
        # thumbnail dimensions are controlled by url param
        thumbnailuri = imgserving_url + THUMBNAIL_MAX_DIM_PARAM
        # We cannot return an actual document here. We *MUST* do a redirect.
        # We want to pass data from this function to the redirect url.
        # We could just pass the key from imgmap and re-read, but better
        # to just encode everything here. Risk is that the url will be > 2048
        # chars and not work on IE.
        # This is the json data blueimp expects
        json_response = (
            ('[{"name":"%s","id":%d,"size":%d,"url":"%s",' +
             '"batchid":%s,' +
             '"thumbnail_url":"%s","delete_url":"%s","delete_type":"POST"}]') %
            (blob_info.filename, imgmap.key().id(), blob_info.size,
             imgserving_url, batchidstr, thumbnailuri, deleteuri))
        # hex-of-base64 makes the json safe to embed in a url path
        urlparam = json_response.encode('base64').encode('hex')
        # blobstore's upload logic requires a redirect
        self.redirect('/blueimp/%s' % urlparam)


class BlueImpUploadDoneHandler(webapp.RequestHandler):
    def get(self, encodedparam):
        json_response = encodedparam.decode('hex').decode('base64')
        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(json_response)
THUMBNAIL_MAX_DIM_PARAM = "=s100"
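That encode('base64').encode('hex') trick is just a way to make arbitrary json safe to embed in a url path. The equivalent written with the base64 and binascii modules (which also works on modern Python, where str has no 'base64' codec) looks like this:

```python
import base64
import binascii

def pack_for_url(text):
    """Encode text so it is url-path-safe: base64, then hex."""
    return binascii.hexlify(base64.b64encode(text.encode())).decode()

def unpack_from_url(param):
    """Reverse of pack_for_url: hex-decode, then base64-decode."""
    return base64.b64decode(binascii.unhexlify(param)).decode()
```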
class GetDataDump(webapp.RequestHandler):
    # stupid web tricks. We want to generate a different filename for the csv
    # file, so we create it here and redirect to /csvfile/filename to do the
    # actual download.
    def get(self):
        batchidfilter = self.request.get('batchid', default_value='')[:16]
        filenamefilter = self.request.get('filename', default_value='')[:255]
        filename = 'export_'
        if batchidfilter:
            filename += '_batch_' + batchidfilter
        if filenamefilter:
            filename += '_filename_' + filenamefilter
        url = ('/csvfile/%s.csv?%s' % (filename, self.request.query_string))
        self.redirect(url)


class SaveTabCSVData(webapp.RequestHandler):
    def get(self, csv_filename):
        import csv
        import cStringIO
        import math
        # You really should use the data export feature.
        # See tools/downloading_data.txt
        batchidfilter = self.request.get('batchid', default_value='')[:16]
        filenamefilter = self.request.get('filename', default_value='')[:255]
        q = ImageUrlMap.all()
        if batchidfilter:
            q.filter("batchid =", int(batchidfilter))
        if filenamefilter:
            q.filter("filename =", filenamefilter)
        output = cStringIO.StringIO()
        csv_writer = csv.writer(output)
        csv_writer.writerow(['datecreated', 'id', 'filename', 'filesize',
                             'batchid', 'imgserving_url'])
        numloops = int(math.ceil(MAX_ROWS_TO_DOWNLOAD / 1000.0))
        for loops in xrange(0, numloops):
            # Capped at MAX_ROWS_TO_DOWNLOAD rows, fetched 1000 at a time.
            # You can increase it, but this can cost too much cpu --
            # use the download tools instead!
            r = q.fetch(limit=1000)
            if not r:
                break
            for row in r:
                csv_writer.writerow(
                    [row.datecreated, row.key().id(), row.filename,
                     row.filesize, row.batchid, row.imgserving_url])
            # more results: continue from the cursor
            q.with_cursor(start_cursor=q.cursor())
        self.response.headers['Content-Type'] = 'text/csv'
        self.response.out.write(output.getvalue())
class DeepDeleteHandler(webapp.RequestHandler):
    def post(self, mapid, filename):
        imgmap = ImageUrlMap.GetbyIdAndFilename(mapid,
                                                urllib.unquote(filename))
        if not imgmap:
            logging.info("Image delete failed: no record matching id %s "
                         "and filename '%s'" % (mapid, filename))
            self.error(404)
            return
        imgmap.DeepDelete()  # kinda wrong that one has to load to delete
        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(mapid)
# If you call blobstoreinfo.key(), you'll make a 2nd call to the database
# to load the blobstore record.
# Use this call to get the key without loading the data.
def GetBlobstoreKey(self):
    return ImageUrlMap.blobstoreinfo.get_value_for_datastore(self)