AppEngine 1.4.3 released: new file API, concurrent requests & more (googleappengine.blogspot.com)
76 points by yoda_sl on March 30, 2011 | hide | past | favorite | 27 comments


The testbed API is exciting -- testing (python) App Engine apps has always been thorny business.

It appears that support for integration testing, however, is not part of this release.

How do you integration test your (python) app engine apps?

Tools like nosegae + webtest give the _theoretical_ promise of integration testing by driving a WSGI app instance. Unfortunately, the _practice_ for any interesting App Engine app (especially those that call use_library for Django) is totally different. Basically, integration testing this way is completely broken: you end up in import error hell. This appears to be at least in part related to dev_appserver's (mis)use of Python's `imp` module, as described here: https://gist.github.com/883676#file_readme.md


That's a nice Gist. Not sure if you should send that to the NoseGAE folks or what. The dev_appserver follows PEP302 (http://www.python.org/dev/peps/pep-0302/) and should properly set attributes on sub-modules (i.e., assert 'subpackage' in dir(sys.modules['package']) ). The dev_appserver does not set those attributes explicitly, but the loader driving the PEP302 hook should.


Interesting, thanks!

I'm afraid I'm having trouble parsing your statement on dev_appserver's behavior. Could you clarify it? It sounds like you're saying that dev_appserver both does and doesn't set submodule attributes on parent module instances. Where _should_ NoseGAE be doing this?

As far as I can tell, no code in the App Engine python SDK ever does this. In today's SDK, dev_appserver.py line 2256 adds the submodule to the sys.modules dictionary, but shouldn't there also be a line of code along the lines of setattr(sys.modules[parent_package_name], sub_package_name, submodule)?
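For concreteness, here's a toy sketch of the binding in question (fake module names, not the SDK's actual code), showing why `from package import sub` can break when a loader populates sys.modules but never sets the attribute on the parent:

```python
import sys
import types

# Simulate what a PEP 302 loader does: register both modules by
# their full dotted names in sys.modules.
parent = types.ModuleType('fakepkg')
child = types.ModuleType('fakepkg.sub')
sys.modules['fakepkg'] = parent
sys.modules['fakepkg.sub'] = child

# At this point the submodule is importable by full name, but it is
# NOT reachable as an attribute of the parent:
assert not hasattr(parent, 'sub')

# The missing step -- what the hypothetical fix in dev_appserver
# would amount to:
setattr(sys.modules['fakepkg'], 'sub', child)

assert 'sub' in dir(sys.modules['fakepkg'])
```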

Thanks for your help...


IIRC (and it's been a few years), the hook goes on sys.meta_path and uses the HardenedModulesHook to load modules; the Python runtime then binds the variable names and attributes and whatnot before the import statement returns control to application code.


The Python runtime does not appear to do this in the specific case of submodule attributes. Right now, the open Python bug on this indicates that it is considered a documentation bug... but my hunch is that it's an actual bug. I'll try manually inserting the setattr() line into my local copy of the SDK and see how integration testing goes then.

What do you use for GAE python integration testing?


This comment helped me get around the import errors when using nosegae + webtest:

http://code.google.com/p/nose-gae/issues/detail?id=45#c1

Totally a PITA, but once it is working, having a set of smoke tests for all of our URLs is invaluable.


The File API mapping to the Blobstore really simplifies blob support. The Blob datatype in the Datastore has a 1 MB limit, so it was kind of cumbersome to juggle both the Blobstore and Datastore APIs to store large objects.

BTW, the Java+Play!+GAE+Objectify stack really makes webapp development fun and fast.


I'm using Java+Play!+GAE+Twig for Sparkmuse and I'm very happy with the stack overall. The new Blobstore API will help with handling image uploads, which is something Play! on GAE fails at horribly (mostly GAE's fault).


GAE Java only handled one request at a time till now?


One request per server instance. The change allows multiple threads to run in each instance, so now your code has to be threadsafe.


I was wondering about this in October; I guess the Vosao CMS needs to stop storing state in static fields now:

http://stackoverflow.com/questions/4028787/is-it-thread-safe...

Thilo added an answer a few hours ago, pointing out the new threadsafe mode for GAE. That's why I like Stack Overflow.


Your code only needs to be threadsafe if you wish to enable the concurrent flag; otherwise your code will happily run as it did before.

I just thought it's important to note that they're not forcing you to push a code update unless you want to handle concurrent requests with a single instance.
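For Java apps, the opt-in is (if I recall the docs correctly) the `threadsafe` element in appengine-web.xml -- a sketch, with a placeholder app id:

```xml
<!-- appengine-web.xml: opting in to concurrent requests -->
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>your-app-id</application>
  <version>1</version>
  <!-- set to true only once your code is actually threadsafe -->
  <threadsafe>true</threadsafe>
</appengine-web-app>
```

Leave it false (or absent) and each instance keeps serving one request at a time, as before.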


Wow. I'm not a GAE user, so I'm not embarrassed that I didn't know this. However, I'm surprised that such a restriction hasn't come up in all the blog entries I've read about GAE over the years.


It's not a huge restriction since GAE will happily spin up dozens of instances without any intervention if your load requires it. Getting multiple concurrent requests into each instance will make things like local memory caching somewhat more useful though.


I don't really recall seeing this spelled out clearly in the documentation before.


That's pretty standard for Servlet containers.


Not to diminish the value of this release, but are there any improvements to consistency and availability of the data store?


Hasn't this mostly been addressed by the High Replication Datastore?


I suspect the answer is "yes" for most users.

My understanding of how it all fits together, please correct if wrong: the default datastore is (strongly!) consistent and partition tolerant but sacrifices availability. The new HRD gives you availability and partition tolerance at the cost, potentially, of consistency. You can get an intermediate state by wrapping writes in taskqueue tasks, which I do on the standard datastore for writes that can wait but must not fail.
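The "can wait but must not fail" pattern can be sketched with a toy queue (not the real taskqueue API, which retries failed tasks for you server-side): the write is enqueued rather than performed inline, and a worker retries it until it sticks.

```python
import collections

queue = collections.deque()

def enqueue_write(fn):
    # Instead of writing synchronously (and failing the request on a
    # datastore hiccup), queue the write for later execution.
    queue.append(fn)

def run_worker(max_attempts=5):
    # Stand-in for the task queue: retry each task on transient errors.
    while queue:
        task = queue.popleft()
        for _ in range(max_attempts):
            try:
                task()
                break
            except IOError:
                continue  # a real task queue would back off and retry

attempts = []
def flaky_write():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError('datastore timeout')  # simulated transient failure

enqueue_write(flaky_write)
run_worker()
assert len(attempts) == 3  # the write eventually succeeded
```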


To clarify: the HRD sacrifices some consistency, but the sacrifice is limited. It still maintains consistency within an "entity group", using the Paxos algorithm for distributed consensus. Consistency is sacrificed only for queries that span entity groups. Retrievals of individual records, and queries within an entity group, are strongly consistent even in HRD. The primary tradeoffs for HRD really are cost (because you're maintaining more replicas of your data) and write latency (because you're updating multiple locations). For those not familiar with App Engine terminology, each database is partitioned, by the developer, into shards called Entity Groups -- see http://code.google.com/appengine/docs/python/datastore/entit... for details.

Some additional details on the HRD tradeoffs are at http://code.google.com/appengine/docs/python/datastore/hr/. The technical underpinnings are described in http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf (warning: this is not a light read).

[Full disclosure -- I used to work at Google and was involved in some of this work.]


How is consistency sacrificed with HRD? Do you mean that, with HRD, you could potentially see the effects of some write W to entity group A, write some value dependent on that write to entity group B, then immediately look at entity group A and see a pre-W value?


No, the scenario you describe could not occur -- it would involve a consistency violation within entity group A. Once you see the effects of write W to entity group A, all operations on entity group A (across all servers) are guaranteed to see W.

The reduced consistency guarantees in HRD involve indexes, because indexes span entity groups. An example: suppose that you set name=foo in record A1, which is part of entity group A. If you then retrieve record A1, that's an operation on entity group A, so you're guaranteed to see name=foo. But if you perform a global query for all records with name == foo, that's an operation on an index, which is outside the entity group and so is not mediated by the entity group's Paxos log. Therefore, your query might not return A1. The index is "eventually" consistent -- it's guaranteed that, eventually, all indexes will be updated. But AFAIK there's no guaranteed upper bound on "eventually". In practice, it should usually be very quick, but only usually.
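That name=foo scenario can be modeled with a toy in-memory store (not the real datastore API) where the global index is updated asynchronously, which is exactly why the get-by-key sees the write while the global query can miss it:

```python
records = {}      # authoritative storage, keyed by entity key
name_index = {}   # global index: name -> set of keys, updated lazily

def put(key, name):
    # The write to the record itself is synchronous and strongly
    # consistent, like a write within an entity group.
    records[key] = {'name': name}
    # NOTE: the index update is deliberately deferred, mimicking
    # HRD's asynchronous cross-entity-group index updates.

def apply_index(key):
    name_index.setdefault(records[key]['name'], set()).add(key)

put('A1', 'foo')
assert records['A1']['name'] == 'foo'             # get by key: sees the write
assert 'A1' not in name_index.get('foo', set())   # global query: may miss it
apply_index('A1')                                  # "eventually"...
assert 'A1' in name_index['foo']                   # ...the index catches up
```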


Thanks. I didn't understand the relationship between entity groups/partitions and consistency in the HRD.


snewman's comments have addressed most of this (hey Steve!). I'll just add that HRD has more predictable read and write latency, which most developers like a lot, but its writes are also slower on average, since they incur at least one Paxos round across datacenters.


I was really hoping they would release full-text search for the datastore.

Guess I'll have to wait for that and use the improvised solution (http://billkatz.com/2009/6/Simple-Full-Text-Search-for-App-E...).


One area that raises my interest is the File API. I wonder whether, with 'file' support, it will be possible to get Lucene deployed onto GAE.


Still no simple backup/restore feature. All I ask is for them to implement a dashboard-based backup & restore tool!



