I’ve been developing a PHP/MySQL web application that will be accessed by multiple users. These users will be both viewing and editing records in the database. Obviously, any situation in which multiple users may be performing operations on the same record puts the integrity of the data at risk.
In the case of my application, there is a very real possibility (a certainty, actually) that two or more people will open the same record at the same time, make changes, and attempt to save these changes. This common concurrent execution problem is known as the “lost update”.
User A opens record “A”.
User B opens record “A”.
User A saves changes to record “A”.
User B saves changes to record “A”.
In this example, User A’s changes are lost – replaced by User B’s changes.
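To make it concrete, here is a sketch of the kind of naive save handler that allows this (PHP with PDO; the `records` table and `body` column are illustrative names, not necessarily what a real app would use):

```php
<?php
// Naive save handler: writes the submitted data unconditionally.
// Whoever saves last wins; any changes saved in between are lost.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

$stmt = $pdo->prepare('UPDATE records SET body = :body WHERE id = :id');
$stmt->execute([
    ':body' => $_POST['body'], // based on a possibly stale copy of the record
    ':id'   => $_POST['id'],
]);
```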
I started to look for a method of preventing this sort of data loss. At first, I wanted to lock the record using a check-in/check-out system: User A would check out record “A” and then have exclusive write access to it until it was checked back in as part of the saving process. There were a number of problems with this method, foremost among them that User A might decide not to make any changes and so never save the record, leaving it in a checked-out state until further administrative action was taken to unlock it.
For a while, I tried to come up with some ingenious way around this, which usually boiled down to automatically unlocking the record after a period of time. But this is not a satisfactory solution. For one thing, a user may have a legitimate reason to keep the record checked out for a longer period. For another, User B shouldn’t have to wait for a time-out to occur if User A is no longer in the record.
So what I eventually came up with is a method of checking, each time a save is initiated, whether the record has been changed since the user opened it. My particular way of doing this involves comparing timestamps, but other ways exist. (This general approach is often called optimistic concurrency control.)
Here’s how I’m implementing my solution to the lost update concurrency issue:
User A creates record “A” and saves it at 9:00 AM. A “last-saved timestamp” of 9:00 AM is generated and saved to the record.
User A opens record “A” at 10:00 AM. An “opened timestamp” of 10:00 AM is generated and written to a hidden (or readonly) input field on the HTML page.
User B opens record “A” at 10:00 AM. An “opened timestamp” of 10:00 AM is generated and written to a hidden (or readonly) input field on the HTML page.
At 10:30 AM, User A attempts to save the record. The “last-saved timestamp” is retrieved from the record. The “opened timestamp” of 10:00 AM is compared to the “last-saved timestamp” of 9:00 AM. Because the record has not been changed since it was opened, the record is saved. A new “last-saved timestamp” of 10:30 AM is generated and saved to the record.
At 11:00 AM, User B attempts to save the record. The “last-saved timestamp” is retrieved from the record. The “opened timestamp” of 10:00 AM is compared to the “last-saved timestamp” of 10:30 AM (User A’s timestamp). Because the record has been changed since it was opened by User B, User B is not allowed to save the record.
User B will have to re-open record “A”, consider the effect that User A’s changes may have, and then make any desired changes.
Unless I’m missing something, this ensures that the data from the earlier save will not be overwritten by the later save. To keep things consistent, I’m using PHP to generate all of my timestamps from the server clock, as JavaScript time is based on the user’s system clock and is therefore wholly unreliable.
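A minimal sketch of the save-side check, assuming PHP with PDO, a `records` table with a `last_saved` column, and a hidden `opened` field written into the edit form when the record was loaded (all illustrative names):

```php
<?php
// save.php: compare the "opened timestamp" carried in the hidden form
// field against the record's current "last-saved timestamp".
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

$stmt = $pdo->prepare('SELECT last_saved FROM records WHERE id = ?');
$stmt->execute([$_POST['id']]);
$lastSaved = $stmt->fetchColumn();

if (strtotime($lastSaved) > strtotime($_POST['opened'])) {
    // The record was saved by someone else after this user opened it.
    exit('This record has been changed since you opened it. Please re-open it.');
}

// No intervening save: write the changes along with a new last-saved
// timestamp generated from the server clock.
$stmt = $pdo->prepare('UPDATE records SET body = ?, last_saved = NOW() WHERE id = ?');
$stmt->execute([$_POST['body'], $_POST['id']]);
```

(As the comments below point out, this read-compare-write sequence is itself open to a race; the fix, shown later in the thread, is to fold the comparison into the UPDATE statement itself.)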
The main drawback that I see is extra work for User B, who now has to review the record as saved by User A before deciding what changes to make. But this is necessary anyway, as changes made between when User B opened the record and attempted to save it may influence User B’s update.
The strange thing is that I haven’t seen this offered as a solution on any of the pages I found while Googling for solutions to the access control, lost update and other concurrency-related data loss problems. Lots of people acknowledge the problem and potential for data loss, but few offer solutions on the application level – preferring to rely on a database engine’s ability to lock rows.
Very good explanation, but I did find another source of information describing the same idea:
http://www.cs.uta.fi/wo/slides/lecture%205.ppt
That’s a very interesting PowerPoint presentation, thanks for linking it. And I’m happy to see that we both arrived at the same solution. I hope it serves you well.
Pretty cool solution. To add to that: it works great as long as collisions are spaced more widely than the granularity of the MySQL timestamp, which is one second; i.e., no two processes do a read/write on the same record within one second.
To get around this, you can store the time as a BIGINT, which can hold milliseconds. Unless you have some serious hardware, I don’t think collision granularity would get any finer than milliseconds.
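For example, in PHP (the `last_saved_ms` column name is illustrative):

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

// Millisecond-resolution timestamp for a BIGINT column, instead of a
// one-second-resolution MySQL TIMESTAMP.
$millis = (int) round(microtime(true) * 1000);

$stmt = $pdo->prepare('UPDATE records SET body = ?, last_saved_ms = ? WHERE id = ?');
$stmt->execute([$_POST['body'], $millis, $_POST['id']]);
```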
Also, instead of re-reading the timestamp before B saves, B should try to UPDATE the row WHERE “opened timestamp” = “last modified timestamp”. Otherwise the CPU could decide to switch threads right after you’ve done the second read, and the entire read/write of A could happen in that window.
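That is, something like the following (a sketch; table and column names are illustrative):

```sql
-- Check and write in one indivisible statement: the row is only updated
-- if nobody has saved it since this user opened it.
UPDATE records
SET    body = ?, last_saved = NOW()
WHERE  id = ? AND last_saved = ?  -- the "opened timestamp" from the form
```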
Thanks for the write-up… I’ve been wrestling with the same design issue for some time now, and your explanation is the best I’ve seen yet as a possible solution.
I prefer to use a hash of the original record. When user A retrieves the record for editing, I generate an MD5 hash of the record data that is included as a hidden field in the form. When the user tries to write the record, I can determine whether the record was changed by comparing the MD5 hash. That eliminates the need to manage the timestamps.
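A sketch of that approach, assuming PDO and illustrative table and column names:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

// When building the edit form: fingerprint the row as retrieved.
$stmt = $pdo->prepare('SELECT title, body FROM records WHERE id = ?');
$stmt->execute([$_GET['id']]);
$row  = $stmt->fetch(PDO::FETCH_ASSOC);
$hash = md5(serialize($row)); // emitted as a hidden form field

// When saving: re-read the current row and compare fingerprints.
$stmt->execute([$_POST['id']]);
$current = $stmt->fetch(PDO::FETCH_ASSOC);

if (md5(serialize($current)) !== $_POST['hash']) {
    exit('This record has been changed since you opened it. Please re-open it.');
}
// ...otherwise proceed with the UPDATE...
```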
A hash is a great idea, and in one implementation it could eliminate the need for each record to carry timestamp data. However, as pointed out in another comment, modifying the record with an UPDATE whose WHERE clause looks for the timestamp checks that the record can be modified at the same instant it is changed.
I’m not at all familiar with the risks involved with thread switching, but it seems to me that there still exists the potential for data loss in the time it takes the server to calculate the MD5 sum and then send the data to the database.
I’ll certainly consider using it the next time this issue comes up, though.
Why not rely on MySQL’s InnoDB concurrency model? Is there any requirement that makes InnoDB impractical in this case? Did I miss something?
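For reference, relying on InnoDB would mean something like row locks taken inside a transaction. A sketch (illustrative names; note that the lock only lives for the duration of the transaction, so it cannot by itself span the two separate page requests of an open-edit-save cycle):

```php
<?php
// Pessimistic locking with InnoDB (the table must use the InnoDB engine).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');
$pdo->beginTransaction();

// FOR UPDATE takes a row lock that other transactions must wait for.
$stmt = $pdo->prepare('SELECT body FROM records WHERE id = ? FOR UPDATE');
$stmt->execute([$_POST['id']]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

// ...modify the data within this same request...
$stmt = $pdo->prepare('UPDATE records SET body = ? WHERE id = ?');
$stmt->execute([$_POST['body'], $_POST['id']]);

$pdo->commit();
```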
Regards,
Jon di Leos
Would you all recommend the same solution if every user’s data was their own. I am building an app which people can use off of my site, but they each have their own data. Do I replicate the database or follow the examples here, or do something different.. I would appreciate your wisdom..
I would think that if only one user can access any given record, then there would be less opportunity for problems to arise. But there is still a possibility that data could be lost. For example, a user could access the same record from two separate browsers (log in from home, leave that page up, and then log in later from work) and cause data loss in the same way as two different users accessing a single record.
In my opinion, a responsible developer will work to eliminate any possibility of data loss, however unlikely.
**Important – Read Tony’s comment above!**
The server can switch between users A and B at any time, and there is a problem with the ‘straightforward’ sequence described at the top, depending on how you implement the read-modify-write sequence. Going back to the description of the solution, there is a fatal flaw if the following sequence of events happens when the two users try to write the same record (almost) simultaneously:
At 10:30 AM, User A attempts to save the record. The “last-saved timestamp” is retrieved from the record. — Server BREAKS HERE and switches attention to User B —
At 10:30 AM, User B attempts to save the record. The “last-saved timestamp” is retrieved from the record. The “opened timestamp” of 10:00 AM is compared to the “last-saved timestamp” of 9:00 AM. Because the record has not been changed since it was opened, the record is saved. A new “last-saved timestamp” of 10:30 AM is generated and saved to the record.
— Server BREAKS HERE and returns to User A —
The “opened timestamp” of 10:00 AM is compared to the “last-saved timestamp” of 9:00 AM (oops – now it’s the wrong value, but this was the value read above at the start…). Because the record has not been changed since it was opened, the record is saved (again). A new “last-saved timestamp” of 10:30 AM is generated and saved to the record.
The problem has not been solved, because the ‘read’ of the new timestamp, followed by the check on it, followed by the ‘write’ of the new timestamp, can be interrupted. If one ‘read’ interrupts another, both reads end up with the same old value, and both threads will ‘write’ their new record.
Solution? See Tony’s comment – this is IMPORTANT! The operation to read, check and write the timestamp (for both A and B) must be one indivisible transaction, i.e. use:
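Something along these lines, sketched in PHP with PDO (the `records` table, `last_saved` column, and `opened` form field are illustrative names):

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

// Read, check and write in a single atomic statement: the UPDATE only
// matches the row if its timestamp is still the one the form was opened with.
$stmt = $pdo->prepare(
    'UPDATE records
     SET    body = ?, last_saved = NOW()
     WHERE  id = ? AND last_saved = ?'
);
$stmt->execute([$_POST['body'], $_POST['id'], $_POST['opened']]);

if ($stmt->rowCount() === 0) {
    // No row matched: someone else saved first, and nothing was written.
    exit('This record has been changed since you opened it. Please re-open it.');
}
```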
You need to check the result of this operation (how many rows were affected…?) to handle the case where someone else has updated it. In this case your update did not work…
Great article. The problem of multiple applications colliding in this environment is usually small and a locking mechanism is overkill. The idea of using the timestamp is a really elegant and simple solution. Of course, use it the way Tony and Jason have described to reduce the chances of errors. Note that this does not totally eliminate problems but significantly reduces the chance. An example of where an error would still occur:
User A opens record “A”.
User B opens record “A”.
User A saves changes to record “A”.
User B saves changes to record “A”.
with all of this taking place within the same second.
For most applications this is a very unlikely event. Another option would be to add a version number. This number would be incremented on every update and included in the WHERE clause, which guarantees that each version of the record is only updated once. Note that Transact-SQL (a.k.a. MSSQL) takes this approach with a global number referred to as either timestamp or rowversion.
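For example (a sketch; the `version` column is an illustrative name):

```sql
-- Increment the version on every update; the WHERE clause ensures the row
-- is only updated by the user holding the version it was opened with.
UPDATE records
SET    body = ?, version = version + 1
WHERE  id = ? AND version = ?
```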
One of the best things about this implementation is that it can be used only where needed and retrofitted to already designed and coded websites when a timestamp is used. Any update will modify the timestamp, so no extra code needs to be written, and the timestamp column can be added later. So, if one is worried about data integrity on a certain heavily used page, the check can be added to that page with no bad side effects.
You can also, if desired, make the system more robust by adding a merge function that looks at what has been changed and works out how to merge the two sets of changes together. To implement this, though, you would need the original copy of the record, the updated record you are trying to write, and the current record in the database. So merge functionality is probably only worth implementing for records with high collision rates.
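A field-level sketch of such a merge, in PHP (names and structure are illustrative):

```php
<?php
// Three-way merge: keep the current (database) value of each field unless
// this user actually changed it from the original copy they were shown.
function mergeRecord(array $original, array $mine, array $current): array
{
    $merged = $current;
    foreach ($mine as $field => $value) {
        if ($value !== $original[$field]) {
            // This user edited the field; take their value. If the other
            // user edited the same field too, that is a true conflict and
            // would need to be flagged for manual resolution.
            $merged[$field] = $value;
        }
    }
    return $merged;
}
```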
A correction to the sequence of events above: record “A” would have had to be inserted or updated within that same second as well. Note that the last save by User B does not have to happen in that second.