The problem should be located in the way the setDevice() handles the attributes; probably the SQL statements executed in the whole operation are not protected into a transaction so the atomicity of the setDevice action is not enforced.
When it fails with "ORA-00001: unique constraint (SPRT.SPRT_EC_DEVICE_ATTR_UK) violated" error, it is always in step 4, executing the "devMgr.set(dev);" line.
For the InformSG process, the error implies that something else has done one of two things between the time it retrieves the DeviceDTO and the time is sets it:
1) The other process has added an attribute that the InformSG process also adds (e.g. via parameter mappings).
2) The other process has updated an attribute by removing then re-adding it (as opposed to simply updating the value).
#1 is a straight race condition, and can only be completely avoided if the attributes inserted by the device registration process (parameter mappings, primarily) and the other process can be made mutually exclusive. If both processes need to operate on the same attribute, perhaps the second one can be made to only update, not add it. Two potential product changes could be made to mitigate this: a) comparing attr_def_guids, not guids, of the attributes (this would solve the case when the other process commits before the saveAttributes method is invoked). b) when an insert fails, try an update instead. Note that either solution simply hides the fact that two processes have added the same attribute, and one process will still overwrite the value set by the other.
#2 can be caused by code that updates the value of an attribute by removing the attribute from the DTO and re-adding it. This removes the attribute GUID from the DTO, causing a DELETE of the old GUID and an INSERT with a new GUID, instead of an UPDATE of the original GUID. Since saveAttribute tests for existence of an attribute based on GUID, it will assume that another process has deleted it and attempt to re-add it. This can occur for an attribute that is not touched by the InformSG process. (a potential product change is to not attempt to re-add attributes that were not updated in the DTO. We currently skip the UPDATE statement in this case, but not the re-INSERT - I think this is a good idea for us to do.)
The best thing to do in this case is refactor any code that does a "remove/add" of an attribute into an "update". I see you have snippets of code for doing all three, but is there a chance you are doing a "remove/add" unnecessarily?
Another way to reduce the chances of this occurring is to minimize the time between getting a deviceDTO and setting it. If there is any I/O or "waiting" in between, this can increase the changes of another process changing the same device in the meantime.
We would only consider database locks as a last resort to solve any of these issues.
Again, even if locking and serialization were put in place, it could solve the exceptions and allow both processes to complete their updates, but it would not solve the fact that two processes are trying to update the attributes.. with no clear predictability as to which process would "win".
It's not entirely clear to me which one of these scenarios is occurring, if either, but I just wanted to share my insight right way.
Step 3 should be generally intended in this terms:
//Remove attributes (if not needed anymore)
//Update existing attribute values (if required so by a change)
//Add new attributes (if required so)
There is just one case, about PortMapping attribute, in which we update the values for existing attributes even if the value is not really changed...
But it is just one case and those errors we find are not confined to this case.
About caching, the encore.properties file is a little short of informations about which objects are cached when each of the two settings is enabled, can you explain better?
So that we are not spending time guessing, and if you find it an acceptable path forward, I suggest that we deliver a small HotFix that would log the attributes that are resulting in the constraint violation, and perhaps even log a message each time an attribute is changed, so that we could identify the attribute(s) causing the problem, and hopefully identify the "other" thread conflicting with it (assuming there is one!) If you are open to this idea, we can start on this late next week (we are currently down to the final days of our 4.3 project, so bad timing for this week). If you would rather exhaust other avenues of investigation first, that's fine too. We'll put some more thought into it and come up with some additional theories to pursue.
Regarding the caching, it is typically used for specific high-volume queries of records that have a good read:write ratio. It isn't necessarily correct to say that a specific object is cached (e.g. a Policy) since there may be many queries that retrieve a policy record or a list of policy records. However, the majority of caches are used to cache lookups of an object when queried by the primary key. For instance, we have a self-populating cache called "com.supportsoft.servicegateway.policy.dto.PolicyDTO" for the query "SELECT * FROM SPRT_SG_POLICY WHERE GUID = ?", since that is a common way we load the policy object from the database. Some other caches are used to cache lists of results, or the result of queries that look up an object based on many criteria.
Almost all of the caching present in the application is toggled by the CACHING_ENABLED property in encore.properties. Disabling this will certainly increase the query load on the database. All I have on-hand is a list of the cache names, not the queries that back them (that would require some digging to put them into an expanded SQL form), but I suspect we would only need to focus in on a few of these if we think caching is part of the problem. Here is the list of cache names:
In addition, there are two caches controlled by the DEVICE_CACHING_ENABLED property:
These both cache the Device object, based on lookup by primary key, or by Unique ID String. Until 4.3, enabling this was not recommended. Since device records can be quite numerous and are updated very frequently, the impact on memory and the multicast expiry messages was not well understood. It was also not proven that device objects were actually being expired from the caches when they needed to be.
We addressed these concerns in 4.3, and will be recommending it be enabled after an upgrade to 4.3. We also added a dozen or more caches in 4.3, reducing the number of queries during steady-state inform processing by 25-40%.
Hope this sheds some light on your question.
It seems it is possible that one of the caches involved in this process can get out of sync with the database, as follows:
1) Thread #1 is in the process of updating the attributes for a device. It runs an insert query to add an attribute, but it is not immediately committed. It also invalidates the cache that hold the current list of attribute record guids.
2) Thread #2 then reads the attributes for the same device. Since the cache does not have the list of attribute guids for the device, the database is queried (returning the "old" data - since Thread #1 has not committed) and the cache is repopulated!
3) Thread #1 then completes all subsequent queries and commits the transaction.
This leaves the cache out-of-sync with the database. The next write operation will retrieve the stale data from the cache, determine from it that an attribute must be added, and attempt an insert statement that results in the UK violation.
If this is indeed what is happening, the immediate workaround is to disable this cache. This can be done by adding the following to ehcache.xml on each of your application servers, and restarting the application:
This should be placed after the existing tag, and before the tag.
The longer term solution will probably see us take advantage of the JTA support in ehcache 2.x. This will require a change to use XA transactions for the database as well (which is not necessarily a low-risk change).
In the meantime, let me know if this workaround has a positive effect.
In the last 36 hours we had no errors