[GAE] Experience using the mapreduce lib
This mapreduce lib is built on top of the Task Queue API. Only the map functionality is implemented so far. The library offers a powerful framework for splitting one job into tasks: given a reader and a map function, it creates several tasks and executes them concurrently. It is extremely useful and fast for schema changes or database migrations; all you need to do is write a "correct" map function. Besides that, it also takes care of quota issues: you can avoid running out of quota in one job by setting the processing_rate parameter.
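For reference, a job is typically declared in mapreduce.yaml. The snippet below is a sketch (the job name, handler module, entity kind, and rate value are placeholders), showing where processing_rate fits:

```yaml
mapreduce:
- name: Schema migration
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: mymodule.process      # your map function
    params:
    - name: entity_kind
      default: models.MyModel      # kind to iterate over
    - name: processing_rate
      default: 100                 # cap on entities/sec, to protect quota
```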
If you want to take advantage of the mapreduce lib, I suggest you start from these two references.
In my experience, two factors are critical:
- making your map function idempotent
- caching the data
If you are not familiar with idempotence, please refer to the session PDF; it contains explanations and sample code.
Why is idempotence so important to the mapreduce lib? Because of the 30-second request time constraint. When you run a long task, a DeadlineExceededError may be raised at any time. By design, when an error happens, the lib reschedules the task and starts over from that entity, which means the same entity may be processed several times. So it is important to make sure your map function is idempotent.
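To make this concrete, here is a small illustration (my own sketch, not from the lib; `Entity` and the field names are made up): assigning an absolute value survives a retry, while incrementing does not.

```python
# Idempotent vs. non-idempotent map functions (illustrative sketch).
# `Entity` stands in for a datastore model.

class Entity:
    def __init__(self):
        self.schema_version = 1
        self.visit_count = 0

def idempotent_map(entity):
    # Assigns an absolute value: if the task is rescheduled after a
    # DeadlineExceededError and this entity is processed again, the
    # result is unchanged.
    entity.schema_version = 2

def non_idempotent_map(entity):
    # Increments: a retry processes the entity twice and the count
    # ends up wrong.
    entity.visit_count += 1

e = Entity()
idempotent_map(e)
idempotent_map(e)      # simulated retry
print(e.schema_version)  # 2 either way

f = Entity()
non_idempotent_map(f)
non_idempotent_map(f)  # simulated retry
print(f.visit_count)   # 2, though only one visit should have been counted
```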
It is very likely you will get bad performance the first time you use the mapreduce lib. In my case, my map function needed to query entities of other kinds, and it is neither possible nor reasonable to do a batch query inside the map function.
Let's put the batch query issue aside. How are you going to store the query results if you can prefetch them somewhere else? Memcache is the straightforward choice: it is very convenient to store the query results there and fetch them in the map function. But if you use it carelessly, you will find that fetching from memcache becomes the new bottleneck. In practice, a memcache.get() call is still an RPC call and costs about as much as a simple query, and you still need to call it N times, i.e., N RPC calls.
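To see why, here is a simulation (a counting stand-in, not GAE's actual memcache client) of the round trips that a per-entity get() pattern makes:

```python
# Counts how many "RPC" calls a naive per-entity memcache lookup makes.
class FakeMemcache:
    def __init__(self, data):
        self._data = data
        self.rpc_calls = 0

    def get(self, key):
        self.rpc_calls += 1  # on GAE, each get() is one RPC round trip
        return self._data.get(key)

cache = FakeMemcache({"user:%d" % i: i for i in range(1000)})

# Map phase over 1000 entities, one get() each:
for i in range(1000):
    cache.get("user:%d" % i)

print(cache.rpc_calls)  # 1000 -- as many round trips as entities
```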
My solution for this is "prefetch + memcache + global variables". Before launching a mapreduce job, prefetch all the necessary data and store it in memcache. Once the prefetch is done, launch the job. The lib loads the handler from the module you specified; when the module gets loaded, restore the data from memcache into module-level global variables, and use those global variables in your map function.
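A minimal sketch of the pattern (a plain dict stands in for App Engine's memcache, and the function and key names are my own):

```python
# prefetch + memcache + module globals, simulated with a plain dict.
# On GAE you would use google.appengine.api.memcache instead of `memcache`.

memcache = {}

_LOOKUP = None  # module-level cache, filled once per instance

def prefetch():
    """Run before launching the job: batch-fetch once, store in memcache."""
    data = {"user:%d" % i: i * 10 for i in range(5)}  # stand-in batch query
    memcache["lookup_table"] = data

def _restore():
    """Run when the handler module loads: memcache -> global variable."""
    global _LOOKUP
    if _LOOKUP is None:
        _LOOKUP = memcache.get("lookup_table") or {}

def process(entity_key):
    """Map function: reads from local memory, no per-entity RPC."""
    _restore()
    return _LOOKUP.get(entity_key)

prefetch()                 # 1. prefetch and memcache the data
print(process("user:3"))   # 2. map function hits the local copy -> 30
```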
By doing so, my processing rate rose from 0.35 entity/sec to 2x entity/sec. The boost is significant.
Here is the thinking behind the solution.
I try to avoid one-by-one and redundant queries, so I use a prefetch function to get the data in a batch. Then I need a place to store the data that can be accessed from anywhere; memcache serves this need. I also need the map function to use the data efficiently, and accessing data directly from local memory is the fastest way. So I restore the data into module global variables when the module is loaded.
Here are some things you need to be aware of.
- GAE will run your function on several instances, and instances are isolated from each other.
- Local memory is limited: about 180 MB (I saw this in the logs).
- You can cache your scripts/variables by following the instructions in App Caching.