Article Original Creation Date: 2012-03-28
This should be placed after the existing tag, and before the tag.
The longer term solution will probably see us take advantage of the JTA support in ehcache 2.x. This will require a change to use XA transactions for the database as well (which is not necessarily a low-risk change).
In the meantime, let me know if this workaround has a positive effect.
Problem
sometimes we find this error (see attached file) during call to setDevice(), this leaves the device record in a non-consistent status due to some attributes missing or incomplete.
The problem should be located in the way the setDevice() handles the attributes; probably the SQL statements executed in the whole operation are not protected into a transaction so the atomicity of the setDevice action is not enforced.
The problem should be located in the way the setDevice() handles the attributes; probably the SQL statements executed in the whole operation are not protected into a transaction so the atomicity of the setDevice action is not enforced.
This is one way we are reproducing this issue, not elegant but seems working: you need to setup two cycles that simultaneously update the same device (setDevice) from two different managed servers...
Environment
Solaris 10
Oracle 10
Weblogic 9.2.3
Cause
I uploaded a schematic example of how we process updates on a device from our application. (see attached txt file)
When it fails with "ORA-00001: unique constraint (SPRT.SPRT_EC_DEVICE_ATTR_UK) violated" error, it is always in step 4, executing the "devMgr.set(dev);" line.
When it fails with "ORA-00001: unique constraint (SPRT.SPRT_EC_DEVICE_ATTR_UK) violated" error, it is always in step 4, executing the "devMgr.set(dev);" line.
It uses the GUID of the SPRT_EC_DEVICE_ATTR record to decide if an attribute in a DTO is "new" or not (thus requiring an INSERT). The INSERT statement is what is resulting in the constraint violation.
Engineering update:
Engineering update:
I will not rule out the possibility of a bug somewhere that could result in this error when only a single process in active, but I will first discuss the potential issues with the current design when multiple processes are manipulating the same device.
For the InformSG process, the error implies that something else has done one of two things between the time it retrieves the DeviceDTO and the time is sets it:
1) The other process has added an attribute that the InformSG process also adds (e.g. via parameter mappings).
2) The other process has updated an attribute by removing then re-adding it (as opposed to simply updating the value).
#1 is a straight race condition, and can only be completely avoided if the attributes inserted by the device registration process (parameter mappings, primarily) and the other process can be made mutually exclusive. If both processes need to operate on the same attribute, perhaps the second one can be made to only update, not add it. Two potential product changes could be made to mitigate this: a) comparing attr_def_guids, not guids, of the attributes (this would solve the case when the other process commits before the saveAttributes method is invoked). b) when an insert fails, try an update instead. Note that either solution simply hides the fact that two processes have added the same attribute, and one process will still overwrite the value set by the other.
#2 can be caused by code that updates the value of an attribute by removing the attribute from the DTO and re-adding it. This removes the attribute GUID from the DTO, causing a DELETE of the old GUID and an INSERT with a new GUID, instead of an UPDATE of the original GUID. Since saveAttribute tests for existence of an attribute based on GUID, it will assume that another process has deleted it and attempt to re-add it. This can occur for an attribute that is not touched by the InformSG process. (a potential product change is to not attempt to re-add attributes that were not updated in the DTO. We currently skip the UPDATE statement in this case, but not the re-INSERT - I think this is a good idea for us to do.)
The best thing to do in this case is refactor any code that does a "remove/add" of an attribute into an "update". I see you have snippets of code for doing all three, but is there a chance you are doing a "remove/add" unnecessarily?
Another way to reduce the chances of this occurring is to minimize the time between getting a deviceDTO and setting it. If there is any I/O or "waiting" in between, this can increase the changes of another process changing the same device in the meantime.
We would only consider database locks as a last resort to solve any of these issues.
Again, even if locking and serialization were put in place, it could solve the exceptions and allow both processes to complete their updates, but it would not solve the fact that two processes are trying to update the attributes.. with no clear predictability as to which process would "win".
It's not entirely clear to me which one of these scenarios is occurring, if either, but I just wanted to share my insight right way.
For the InformSG process, the error implies that something else has done one of two things between the time it retrieves the DeviceDTO and the time is sets it:
1) The other process has added an attribute that the InformSG process also adds (e.g. via parameter mappings).
2) The other process has updated an attribute by removing then re-adding it (as opposed to simply updating the value).
#1 is a straight race condition, and can only be completely avoided if the attributes inserted by the device registration process (parameter mappings, primarily) and the other process can be made mutually exclusive. If both processes need to operate on the same attribute, perhaps the second one can be made to only update, not add it. Two potential product changes could be made to mitigate this: a) comparing attr_def_guids, not guids, of the attributes (this would solve the case when the other process commits before the saveAttributes method is invoked). b) when an insert fails, try an update instead. Note that either solution simply hides the fact that two processes have added the same attribute, and one process will still overwrite the value set by the other.
#2 can be caused by code that updates the value of an attribute by removing the attribute from the DTO and re-adding it. This removes the attribute GUID from the DTO, causing a DELETE of the old GUID and an INSERT with a new GUID, instead of an UPDATE of the original GUID. Since saveAttribute tests for existence of an attribute based on GUID, it will assume that another process has deleted it and attempt to re-add it. This can occur for an attribute that is not touched by the InformSG process. (a potential product change is to not attempt to re-add attributes that were not updated in the DTO. We currently skip the UPDATE statement in this case, but not the re-INSERT - I think this is a good idea for us to do.)
The best thing to do in this case is refactor any code that does a "remove/add" of an attribute into an "update". I see you have snippets of code for doing all three, but is there a chance you are doing a "remove/add" unnecessarily?
Another way to reduce the chances of this occurring is to minimize the time between getting a deviceDTO and setting it. If there is any I/O or "waiting" in between, this can increase the changes of another process changing the same device in the meantime.
We would only consider database locks as a last resort to solve any of these issues.
Again, even if locking and serialization were put in place, it could solve the exceptions and allow both processes to complete their updates, but it would not solve the fact that two processes are trying to update the attributes.. with no clear predictability as to which process would "win".
It's not entirely clear to me which one of these scenarios is occurring, if either, but I just wanted to share my insight right way.
Answer from Customer:
Chris, we've reviewed the code, basically the attributes are deleted only if not needed anymore, so we don't fall in the ' unnecessarily "remove/add"' scenario.
Step 3 should be generally intended in this terms:
//Remove attributes (if not needed anymore)
//Update existing attribute values (if required so by a change)
//Add new attributes (if required so)
There is just one case, about PortMapping attribute, in which we update the values for existing attributes even if the value is not really changed...
But it is just one case and those errors we find are not confined to this case.
About caching, the encore.properties file is a little short of informations about which objects are cached when each of the two settings is enabled, can you explain better?
Step 3 should be generally intended in this terms:
//Remove attributes (if not needed anymore)
//Update existing attribute values (if required so by a change)
//Add new attributes (if required so)
There is just one case, about PortMapping attribute, in which we update the values for existing attributes even if the value is not really changed...
But it is just one case and those errors we find are not confined to this case.
About caching, the encore.properties file is a little short of informations about which objects are cached when each of the two settings is enabled, can you explain better?
Answer from Engineering:
Then it remains a puzzle.
So that we are not spending time guessing, and if you find it an acceptable path forward, I suggest that we deliver a small HotFix that would log the attributes that are resulting in the constraint violation, and perhaps even log a message each time an attribute is changed, so that we could identify the attribute(s) causing the problem, and hopefully identify the "other" thread conflicting with it (assuming there is one!) If you are open to this idea, we can start on this late next week (we are currently down to the final days of our 4.3 project, so bad timing for this week). If you would rather exhaust other avenues of investigation first, that's fine too. We'll put some more thought into it and come up with some additional theories to pursue.
Regarding the caching, it is typically used for specific high-volume queries of records that have a good read:write ratio. It isn't necessarily correct to say that a specific object is cached (e.g. a Policy) since there may be many queries that retrieve a policy record or a list of policy records. However, the majority of caches are used to cache lookups of an object when queried by the primary key. For instance, we have a self-populating cache called "com.supportsoft.servicegateway.policy.dto.PolicyDTO" for the query "SELECT * FROM SPRT_SG_POLICY WHERE GUID = ?", since that is a common way we load the policy object from the database. Some other caches are used to cache lists of results, or the result of queries that look up an object based on many criteria.
Almost all of the caching present in the application is toggled by the CACHING_ENABLED property in encore.properties. Disabling this will certainly increase the query load on the database. All I have on-hand is a list of the cache names, not the queries that back them (that would require some digging to put them into an expanded SQL form), but I suspect we would only need to focus in on a few of these if we think caching is part of the problem. Here is the list of cache names:
arraylist_of.com.supportsoft.encore.inventory.dto.FirmwareProductClassDTO
arraylist_of.com.supportsoft.encore.system.dto.AdministrativeDomainDTO
arraylist_of.com.supportsoft.encore.system.dto.KeyPerformanceIndicatorDTO
arraylist_of.com.supportsoft.encore.system.dto.PrefGroupDTO
arraylist_of.com.supportsoft.encore.system.dto.PreferenceOptionDTO
arraylist_of.com.supportsoft.servicegateway.cwmp.dto.CwmpParamValueDTO
com.supportsoft.encore.inventory.dto.CapabilityDTO
com.supportsoft.encore.inventory.dto.FirmwareDTO
com.supportsoft.encore.inventory.dto.HardwareDTO
com.supportsoft.encore.inventory.info.DeviceContextInfo
com.supportsoft.encore.system.SystemKeysCache
com.supportsoft.encore.system.attributemanager.AdminDomainAttributeGuidCache
com.supportsoft.encore.system.attributemanager.DeviceAttributeGuidCache
com.supportsoft.encore.system.attributemanager.IsRegisteredCache
com.supportsoft.encore.system.dto.AttribDefinitionDTO
com.supportsoft.encore.system.dto.AttribGroupDTO
com.supportsoft.encore.system.dto.AttribTypeDTO
com.supportsoft.encore.system.dto.GenericServerDTO
com.supportsoft.encore.system.dto.GlobalPreferenceDTO
com.supportsoft.encore.system.dto.KeyPerformanceIndicatorDTO
com.supportsoft.encore.system.dto.ObjectTypeDTO
com.supportsoft.encore.system.dto.RealmDTO
com.supportsoft.encore.system.dto.UserPreferenceDTO
com.supportsoft.servicegateway.configuration.dto.ConditionDTO
com.supportsoft.servicegateway.configuration.dto.ServiceConfigDTO
com.supportsoft.servicegateway.configuration.dto.ServiceTechConfigDTO
com.supportsoft.servicegateway.configuration.dto.SoftwareModuleDTO
com.supportsoft.servicegateway.policy.HybridPolicyGuidCache
com.supportsoft.servicegateway.policy.dto.ActionConfigStubDTO
com.supportsoft.servicegateway.policy.dto.ActionDTO
com.supportsoft.servicegateway.policy.dto.PolicyActionDTO
com.supportsoft.servicegateway.policy
So that we are not spending time guessing, and if you find it an acceptable path forward, I suggest that we deliver a small HotFix that would log the attributes that are resulting in the constraint violation, and perhaps even log a message each time an attribute is changed, so that we could identify the attribute(s) causing the problem, and hopefully identify the "other" thread conflicting with it (assuming there is one!) If you are open to this idea, we can start on this late next week (we are currently down to the final days of our 4.3 project, so bad timing for this week). If you would rather exhaust other avenues of investigation first, that's fine too. We'll put some more thought into it and come up with some additional theories to pursue.
Regarding the caching, it is typically used for specific high-volume queries of records that have a good read:write ratio. It isn't necessarily correct to say that a specific object is cached (e.g. a Policy) since there may be many queries that retrieve a policy record or a list of policy records. However, the majority of caches are used to cache lookups of an object when queried by the primary key. For instance, we have a self-populating cache called "com.supportsoft.servicegateway.policy.dto.PolicyDTO" for the query "SELECT * FROM SPRT_SG_POLICY WHERE GUID = ?", since that is a common way we load the policy object from the database. Some other caches are used to cache lists of results, or the result of queries that look up an object based on many criteria.
Almost all of the caching present in the application is toggled by the CACHING_ENABLED property in encore.properties. Disabling this will certainly increase the query load on the database. All I have on-hand is a list of the cache names, not the queries that back them (that would require some digging to put them into an expanded SQL form), but I suspect we would only need to focus in on a few of these if we think caching is part of the problem. Here is the list of cache names:
arraylist_of.com.supportsoft.encore.inventory.dto.FirmwareProductClassDTO
arraylist_of.com.supportsoft.encore.system.dto.AdministrativeDomainDTO
arraylist_of.com.supportsoft.encore.system.dto.KeyPerformanceIndicatorDTO
arraylist_of.com.supportsoft.encore.system.dto.PrefGroupDTO
arraylist_of.com.supportsoft.encore.system.dto.PreferenceOptionDTO
arraylist_of.com.supportsoft.servicegateway.cwmp.dto.CwmpParamValueDTO
com.supportsoft.encore.inventory.dto.CapabilityDTO
com.supportsoft.encore.inventory.dto.FirmwareDTO
com.supportsoft.encore.inventory.dto.HardwareDTO
com.supportsoft.encore.inventory.info.DeviceContextInfo
com.supportsoft.encore.system.SystemKeysCache
com.supportsoft.encore.system.attributemanager.AdminDomainAttributeGuidCache
com.supportsoft.encore.system.attributemanager.DeviceAttributeGuidCache
com.supportsoft.encore.system.attributemanager.IsRegisteredCache
com.supportsoft.encore.system.dto.AttribDefinitionDTO
com.supportsoft.encore.system.dto.AttribGroupDTO
com.supportsoft.encore.system.dto.AttribTypeDTO
com.supportsoft.encore.system.dto.GenericServerDTO
com.supportsoft.encore.system.dto.GlobalPreferenceDTO
com.supportsoft.encore.system.dto.KeyPerformanceIndicatorDTO
com.supportsoft.encore.system.dto.ObjectTypeDTO
com.supportsoft.encore.system.dto.RealmDTO
com.supportsoft.encore.system.dto.UserPreferenceDTO
com.supportsoft.servicegateway.configuration.dto.ConditionDTO
com.supportsoft.servicegateway.configuration.dto.ServiceConfigDTO
com.supportsoft.servicegateway.configuration.dto.ServiceTechConfigDTO
com.supportsoft.servicegateway.configuration.dto.SoftwareModuleDTO
com.supportsoft.servicegateway.policy.HybridPolicyGuidCache
com.supportsoft.servicegateway.policy.dto.ActionConfigStubDTO
com.supportsoft.servicegateway.policy.dto.ActionDTO
com.supportsoft.servicegateway.policy.dto.PolicyActionDTO
com.supportsoft.servicegateway.policy
com.supportsoft.servicegateway.policy.dto.PolicyDTO
com.supportsoft.servicegateway.policy.policydevicemanager.PolicyDeviceManagerCache
sprt.CompiledConditionCache
sprt.FirmwareFieldsCache
sprt.HardwareFieldsCache
sprt.HwFwAssociationFieldsCache
In addition, there are two caches controlled by the DEVICE_CACHING_ENABLED property:
sprt.DeviceDTOCacheById
sprt.DeviceDTOCacheByUniqueIdString
These both cache the Device object, based on lookup by primary key, or by Unique ID String. Until 4.3, enabling this was not recommended. Since device records can be quite numerous and are updated very frequently, the impact on memory and the multicast expiry messages was not well understood. It was also not proven that device objects were actually being expired from the caches when they needed to be.
We addressed these concerns in 4.3, and will be recommending it be enabled after an upgrade to 4.3. We also added a dozen or more caches in 4.3, reducing the number of queries during steady-state inform processing by 25-40%.
Hope this sheds some light on your question.
com.supportsoft.servicegateway.policy.policydevicemanager.PolicyDeviceManagerCache
sprt.CompiledConditionCache
sprt.FirmwareFieldsCache
sprt.HardwareFieldsCache
sprt.HwFwAssociationFieldsCache
In addition, there are two caches controlled by the DEVICE_CACHING_ENABLED property:
sprt.DeviceDTOCacheById
sprt.DeviceDTOCacheByUniqueIdString
These both cache the Device object, based on lookup by primary key, or by Unique ID String. Until 4.3, enabling this was not recommended. Since device records can be quite numerous and are updated very frequently, the impact on memory and the multicast expiry messages was not well understood. It was also not proven that device objects were actually being expired from the caches when they needed to be.
We addressed these concerns in 4.3, and will be recommending it be enabled after an upgrade to 4.3. We also added a dozen or more caches in 4.3, reducing the number of queries during steady-state inform processing by 25-40%.
Hope this sheds some light on your question.
Next update from Engineering:
See under Resolution:
Solution
We came up with another theory and a possible workaround.
It seems it is possible that one of the caches involved in this process can get out of sync with the database, as follows:
1) Thread #1 is in the process of updating the attributes for a device. It runs an insert query to add an attribute, but it is not immediately committed. It also invalidates the cache that hold the current list of attribute record guids.
2) Thread #2 then reads the attributes for the same device. Since the cache does not have the list of attribute guids for the device, the database is queried (returning the "old" data - since Thread #1 has not committed) and the cache is repopulated!
3) Thread #1 then completes all subsequent queries and commits the transaction.
This leaves the cache out-of-sync with the database. The next write operation will retrieve the stale data from the cache, determine from it that an attribute must be added, and attempt an insert statement that results in the UK violation.
If this is indeed what is happening, the immediate workaround is to disable this cache. This can be done by adding the following to ehcache.xml on each of your application servers, and restarting the application:
maxElementsInMemory="0"
overflowToDisk="false"
/>
It seems it is possible that one of the caches involved in this process can get out of sync with the database, as follows:
1) Thread #1 is in the process of updating the attributes for a device. It runs an insert query to add an attribute, but it is not immediately committed. It also invalidates the cache that hold the current list of attribute record guids.
2) Thread #2 then reads the attributes for the same device. Since the cache does not have the list of attribute guids for the device, the database is queried (returning the "old" data - since Thread #1 has not committed) and the cache is repopulated!
3) Thread #1 then completes all subsequent queries and commits the transaction.
This leaves the cache out-of-sync with the database. The next write operation will retrieve the stale data from the cache, determine from it that an attribute must be added, and attempt an insert statement that results in the UK violation.
If this is indeed what is happening, the immediate workaround is to disable this cache. This can be done by adding the following to ehcache.xml on each of your application servers, and restarting the application:
overflowToDisk="false"
/>
This should be placed after the existing tag, and before the tag.
The longer term solution will probably see us take advantage of the JTA support in ehcache 2.x. This will require a change to use XA transactions for the database as well (which is not necessarily a low-risk change).
In the meantime, let me know if this workaround has a positive effect.
Answer from Customer:
we applied the workaround and it is working.
In the last 36 hours we had no errors
In the last 36 hours we had no errors