I've been working on a synchronization engine for an iPhone application called Clinks, and wanted to share some notes to help others down the road. This is a hefty topic… heck, even the title had to be somewhat unwieldy. With that, let's roll up our sleeves and dive in…
Goals
First off, you're probably reading this because you recognize that native iPhone clients need to go above and beyond a polished web experience. In particular, native apps need to function regardless of the state of an internet connection. This means making sure that sufficient data is cached locally, and that any local changes get propagated to the server/cloud when an internet connection is re-established.
Second, the synchronization engine you end up building will depend heavily on the goals of your app and on the schema of your data. I've seen basic algorithms for sync engines that attempt to sync *everything* from the client up to the server, and *everything* on the server back down to the client, but in my experience that isn't a solution to any real-world problem, just a starting point.
By way of example, Clinks is a social wine journal that lets wine enthusiasts keep track of the wines they try and like, and share that information with their friends. The underlying schema contains, in high-level terms, tables for person, product, and experience, where an experience captures one person's tasting and rating of a product. Users can add experience records to their journal, which may or may not create new product records (they may add an experience for a product that already exists, for example). For our sync scenario, we want to push all changes to the experience table, but only push new product records. We also want to pull down friends' experience records as they change.
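To make that concrete, here's a rough sketch of those three tables as Swift value types. The field names are purely illustrative, not the actual Clinks schema:

import Foundation

// A rough sketch of the three tables as Swift value types.
struct Person {
    let guid: String         // globally unique ID (see the GUIDs section below)
    var name: String
}

struct Product {
    let guid: String
    var name: String         // e.g. the wine's label
}

struct Experience {
    let guid: String
    let personGUID: String   // who tasted it
    let productGUID: String  // what they tasted
    var rating: Int
    var notes: String
    var tastedAt: Date
}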
We've kept things simple: we sync at the record level, and collisions in the history tables are handled with simple FIFO logic. Your needs may vary, and will directly affect the complexity of your sync algorithms.
History
One way to implement record-level sync is with a history table: a separate table in your schema that tracks every change made to records in the other tables. Whenever a user creates, updates, or deletes a record, your code also inserts a new record into the history table capturing all of that information: person, table, record, action, date.
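As a sketch, a history record might carry fields along these lines (names and types are illustrative; GUIDs are covered below):

import Foundation

// One row in the history table: who changed what, where, and when.
enum HistoryAction: String {
    case create, update, delete
}

struct HistoryItem {
    let guid: String           // identifies this history record itself
    let personGUID: String     // who made the change
    let tableName: String      // which table was touched, e.g. "experience"
    let recordGUID: String     // which record in that table changed
    let action: HistoryAction  // create, update, or delete
    let date: Date             // when the change happened
}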
Note that each client has a history table, and the cloud/server has a master history table. With that, the sync engine on the client basically reconciles the local history table with the history table on the server on a periodic or on-demand basis. The reconciliation can be optimized by keeping track of the last history record sync'ed, and then on subsequent syncs only processing history records after the previous last history record.
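For example, on iOS that last-synced marker could live in UserDefaults (a sketch; the key name is made up, and a row in a local settings table works equally well):

import Foundation

// Remember the date of the last history record we reconciled, so the next
// sync only has to look at newer entries.
func lastSyncDate(forKey key: String) -> Date? {
    return UserDefaults.standard.object(forKey: key) as? Date
}

func recordSyncDate(_ date: Date, forKey key: String) {
    UserDefaults.standard.set(date, forKey: key)
}

// Usage: only process history newer than this marker (first sync processes everything).
let lastClientHistoryDate = lastSyncDate(forKey: "lastClientHistoryDate") ?? Date.distantPast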
GUIDs
Before we proceed, we must address the issue of primary keys. All databases support a primary key field for uniquely identifying a record, which is crucial for edit operations as well as relationships between tables. Unfortunately, almost all databases assume management of primary key (PK) fields by default, and don't have any concept of a "distributed" PK space.
To illustrate, consider two users starting with fresh databases on their iPhones. When each user creates a record, each database will assign a PK of 1 to the records. Now when those users try to sync to the cloud/server, there would be two records with the same PK value (disregarding typical checks for uniqueness, etc.), making a mess of downstream searches, edits, and sync.
The simple solution is to add an additional "GUID" field to all tables that are to be synchronized, and to generate globally unique identifiers for each new record. Many operating systems and languages have APIs to do just that, generating sufficiently long and random unique ID strings that are statistically unlikely to ever collide.
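On iOS, for instance, Foundation's UUID type does exactly this:

import Foundation

// Generate a globally unique identifier for a new record,
// e.g. "E621E1F8-C36C-495A-93FC-0C247A3E6E5F".
let guid = UUID().uuidString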
In practice what this means is that your sync engine will deal with GUIDs, while your database engines will continue to deal with PKs. One significant change you'll need to make to your server is to implement fetch based on GUID rather than by PK. For Ruby on Rails, we simply overrode the GET resource to use the GUID field, then left all the other verbs, like PUT, POST, and DELETE using the PKs.
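On the client side the same idea applies: every fetch goes out keyed by GUID. Here's a rough sketch in Swift against a placeholder endpoint — the host, route, and response handling are assumptions for illustration, not the actual Clinks API:

import Foundation

// Ask the server for a record by its GUID rather than its numeric primary key.
func fetchExperience(guid: String, completion: @escaping (Data?) -> Void) {
    guard let url = URL(string: "https://example.com/experiences/\(guid).json") else {
        completion(nil)
        return
    }
    URLSession.shared.dataTask(with: url) { data, _, _ in
        completion(data)
    }.resume()
}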
Algorithm
Here's some pseudo code tying all this together:
// push
get lastClientHistoryDate
get clientDeltaList from client History where (date > lastClientHistoryDate)
for (clientHistoryItem in clientDeltaList)
    if (clientHistoryItem.guid isn't on server)
        switch (clientHistoryItem.action)
            case create:
                get clientItem from client where (guid = clientHistoryItem.guid)
                new remoteItem
                remoteItem.guid = clientItem.guid
                remoteItem.data = clientItem.data
                post remoteItem
            case update:
                get clientItem from client where (guid = clientHistoryItem.guid)
                get remoteItem from server where (guid = clientHistoryItem.guid)
                remoteItem.data = clientItem.data
                put remoteItem
            case delete:
                get remoteItem from server where (guid = clientHistoryItem.guid)
                delete remoteItem
    add clientHistoryItem to server History table
    set lastClientHistoryDate = clientHistoryItem.date
// pull
get lastServerHistoryDate
get serverDeltaList from server History where (date > lastServerHistoryDate)
for (serverHistoryItem in serverDeltaList)
    if (serverHistoryItem.guid isn't on client)
        switch (serverHistoryItem.action)
            case create:
                get remoteItem from server where (guid = serverHistoryItem.guid)
                new clientItem
                clientItem.guid = remoteItem.guid
                clientItem.data = remoteItem.data
                save clientItem
            case update:
                get remoteItem from server where (guid = serverHistoryItem.guid)
                get clientItem from client where (guid = serverHistoryItem.guid)
                clientItem.data = remoteItem.data
                save clientItem
            case delete:
                get clientItem from client where (guid = serverHistoryItem.guid)
                delete clientItem
    add serverHistoryItem to client History table
    set lastServerHistoryDate = serverHistoryItem.date
Note that you must fetch the record from the other database by GUID in order to get its PK for subsequent DB operations.
Note also that the algorithm appears to push changes up, then process those same changes coming back down. While this is true, the algorithm is actually designed to support *multiple* clients for a user, all syncing to the cloud/server, so there could be unprocessed history items on the server that the client hasn't yet fetched.
And finally, note that this algorithm doesn't speak to relational schemas. You'll need to map the relationships between record GUIDs to the record PKs on each database accordingly.
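As a rough sketch of that mapping (the names are placeholders, and a dictionary stands in for what would really be a query against your local database):

// Illustrative only: translate a GUID coming off the wire into the local
// integer primary key that your foreign-key columns actually store.
var productPKByGUID: [String: Int] = [:]

func localProductPK(forGUID guid: String) -> Int? {
    return productPKByGUID[guid]
}

// When saving a pulled experience, resolve its product relationship locally, e.g.:
// let productPK = localProductPK(forGUID: pulledExperienceProductGUID)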
Dates
One final thought on date fields. We use dates as a way to optimize the sync engine, allowing us to only process history records that are newer than the last sync date. It's important, therefore, to declare your date fields the same way on both client and server, or to otherwise normalize the date values when comparing them.
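One common approach (a sketch, not the only option) is to pass dates around as ISO 8601 strings in UTC and convert at the edges:

import Foundation

// Exchange and store sync dates as ISO 8601 strings in UTC, so client and
// server always compare the same representation.
let formatter = ISO8601DateFormatter()
formatter.timeZone = TimeZone(identifier: "UTC")

let wireDate = formatter.string(from: Date())    // e.g. "2011-05-04T18:25:43Z"
let parsedDate = formatter.date(from: wireDate)  // back to a Date for comparisons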