Deadlocks may cause connection pool corruption when in a distributed transaction

Frédéric Delaporte July 4, 2017 at 5:18 PM (edited)
Some additional contrived testing finally reproduced the pool corruption with .NET Framework 4.6.1. But that involved removing some locks that prevent races and adding some Thread.Sleep calls inside NHibernate code to get a race to occur, and even with that, the corruption does not always occur.
Since those locks are recent development artifacts, the pool corruption may still occur with the NHibernate 4.0 series, but it should be very rare. Still, that should be fixed with NHibernate v5. (At least it will provide an option for disabling a feature that causes the connection to be released from transaction completion.)
If someone can confirm having seen the pool corruption with NHibernate v4.0 or higher, and whether it was very rare or not, that would help determine whether my conclusions here are right.
Update: getting it to fail was so contrived that I missed a scope disposal. That was not a pool corruption at all, but a scope leak. So no, pool corruption still seems not to be reproduced with .NET Framework 4.0 or higher. If anyone has witnessed the opposite, please let us know.

Frédéric Delaporte July 2, 2017 at 8:51 PM
Test case added in an NHibernate PR. I found no way to get a pool corruption, but instead it triggers an abnormal slowness in scope disposal.
So I have adjusted the test to fail on that, then added some changes to avoid, in all possible cases, closing the connection from the transaction completion event. It will be closed instead on the next session usage requiring a connection, or when the session is closed. It will still be closed from the transaction completion event if the session was disposed inside the scope without having disabled the new option UseConnectionOnSystemTransactionPrepare. (Added in the same PR, for NH-2928.)

Frédéric Delaporte July 2, 2017 at 2:39 PM
Added a new version of the test case: NH3023_Fixed.zip.
All binaries removed: NuGet will restore them.
The test now runs both distributed and non-distributed. Both fail with NHibernate 3.2.0.4000, but only the distributed one fails with NHibernate 3.4.1.4000. I have not tried to narrow down which changes caused the non-distributed case to no longer fail.
The test never fails with .NET Framework 4.0, even though I attempted it with more load than the (now removed) comment was suggesting. But it at least suggests that closing the connection from the transaction completion event can have adverse effects, even when synchronizing threads or with no threading involved at all. To be considered in NH-4011.
I have made some cleanup and simplifications. I am not removing the initial test case (which has bugs of its own), so that whoever wishes can check it too.
I think I will add this test to the NHibernate sources too, although it no longer fails. It will be there in case some later framework change causes the trouble to come back.

Frédéric Delaporte July 2, 2017 at 12:58 PM (edited)
Upgrading the target .NET Framework of the test project from 3.5 to 4.0 causes the issue to disappear. It looks to me like there was some weakness in System.Transactions, which has been fixed in Framework 4.0.
The ticket description says this was reproduced with .NET Framework 4.0, but I suspect that only the test case's own bug (causing subsequent sessions to always fail, whatever the pool state) was reproduced with it, not the actual pool corruption.
I even ran this test in parallel with the 5,000+ NHibernate tests, with a profiler tracing all locks and transaction events: no trouble. By the way, with Framework 4, the deadlock is no longer "hidden" until scope disposal when distributed (if not further hidden by another exception). Microsoft has definitely fixed something there.
In my tests, trouble occurred even without threading involved, when the transaction was not distributed. It seems that disposing the connection from the transaction completed event was not supported in all cases by .NET Framework 3.5, even when occurring on the same thread.
Setting ConnectionReleaseMode.OnClose avoids this trouble when using a framework older than 4.0, provided the session is closed after the scope. (Use configuration.SetProperty(Cfg.Environment.ReleaseConnections, "on_close"); or, in .cfg, set the property connection.release_mode to on_close.)
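As a minimal sketch of that workaround (not taken from the attached test case): the class and method names below are hypothetical, the configuration is assumed to come from a hibernate.cfg.xml holding the connection settings and mappings, and the unit of work is a placeholder. It only illustrates setting the release mode and disposing the session after the scope.

```csharp
using System.Transactions;
using NHibernate;
using NHibernate.Cfg;

// Hypothetical helper class, for illustration only.
public static class ReleaseModeSketch
{
    public static ISessionFactory BuildFactory()
    {
        // Assumption: hibernate.cfg.xml provides the connection settings and mappings.
        var configuration = new Configuration().Configure();

        // Keep the connection until the session is closed, instead of
        // releasing it when the transaction completes.
        configuration.SetProperty(
            NHibernate.Cfg.Environment.ReleaseConnections, "on_close");

        return configuration.BuildSessionFactory();
    }

    public static void DoWork(ISessionFactory factory)
    {
        // The session is the outer "using", so it is disposed after the scope.
        using (ISession session = factory.OpenSession())
        {
            using (var scope = new TransactionScope())
            {
                // Hypothetical unit of work would go here.
                session.Flush();
                scope.Complete();
            }
        }
    }
}
```

The nesting order is the point here: with on_close, the connection is only released when the session closes, so the session has to outlive the scope for the release to happen outside transaction completion.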
Unless someone can reproduce the issue with an up-to-date .Net Framework, this issue is going to be closed as obsolete.

Frédéric Delaporte June 30, 2017 at 2:36 PM (edited)
I am currently reproducing an issue after having removed all ISession.BeginTransaction calls, having removed the distributed scope from the subsequent session tests, and even without any distributed scope at all, and even without closing the connection in the transaction completed event. (I have re-enabled the pool in the connection string.)
The triggering factor seems to be the enlistment of a volatile resource in the scope that gets deadlocked. Having it distributed just obfuscates the trouble by moving the transaction coordination responsibility to SQL Server, which notifies MSDTC of the deadlock, which raises the failure only at scope disposal, if the scope was completed.
When not distributed, the deadlock is thrown as "normal" from the deadlocked operation, but the connection still gets corrupted in the pool and causes the next try to fail. (I have changed the code to try at least ten times when we receive the "normal" deadlock error.) And this happens even when avoiding the connection closing from the transaction completion event by disposing the session only after the scope (I forgot to use ConnectionReleaseMode.OnClose, so it was still closed from the completed event). (For the trouble to still occur, this requires manually enlisting the session in the transaction before testing the deadlock: session.GetSessionImplementation().Factory.TransactionFactory.EnlistInDistributedTransactionIfNeeded(session.GetSessionImplementation())
This enlistment is triggered by most session operations, including its opening. But if we open the session before the scope, the opening cannot enlist. So we need to do it manually after having opened the scope and before running the deadlock code, which does not involve the session.)
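The ordering described in that parenthesis can be sketched as follows. This is an illustration only, not the attached test case: the class name is hypothetical, the deadlock-provoking code is a placeholder comment, and the enlistment call is the one quoted above, so its member names are those of the NHibernate version under discussion and may differ in other versions.

```csharp
using System.Transactions;
using NHibernate;

// Hypothetical helper class, for illustration only.
public static class ManualEnlistmentSketch
{
    public static void Run(ISessionFactory factory)
    {
        // Opened before the scope exists, so the opening cannot enlist.
        using (ISession session = factory.OpenSession())
        {
            using (var scope = new TransactionScope())
            {
                // Enlist the session manually in the ambient transaction,
                // using the call quoted in the comment above.
                var impl = session.GetSessionImplementation();
                impl.Factory.TransactionFactory
                    .EnlistInDistributedTransactionIfNeeded(impl);

                // Placeholder: the code provoking the deadlock would run
                // here; it does not use the session.

                scope.Complete();
            }
            // The session is disposed only after the scope has been disposed.
        }
    }
}
```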
So it does not look to me like closing the connection from the transaction completion event is responsible for this trouble. I still have to strip more fat from this test case to find the actual minimal conditions for triggering it.

Description
This issue is detailed in the following Stack Overflow question: http://stackoverflow.com/questions/8581956/deadlocks-causing-server-failed-to-resume-the-transaction-with-nhibernate-and
When a deadlock occurs on SQL Server, occasionally the exception will not be raised by the CLR. This has only happened in our environment when the SQL Server is under extreme (i.e. unnatural) load. It appears to be caused by the AdoNetWithDistributedTransactionsFactory closing the connection on transaction completion. There seems to be a race condition wherein the transaction rollback event may be raised before the deadlock error propagates across the connection. When this happens, the connection will be returned to the pool in an unusable state. Subsequent (brand-new) sessions will be unable to make a connection to the database. The error message varies depending on the situation. Usually it is one of the following:
The deadlock exception that should have been raised originally
The server failed to resume the transaction.
New request is not allowed to start because it should come with valid transaction descriptor.
The transaction has aborted.
Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed
As we've been able to reproduce similar behavior without using NHibernate, I believe at heart this is a bug in ADO.NET or maybe SQL Server. However I hope that NHibernate could be modified to avoid it. No patch because I don't have enough understanding of AdoNetWithDistributedTransactionsFactory.
This has been reproduced with the following variations:
NHibernate 2.1.2, 3.1, 3.2
SQL Server 2005, 2008, Express 2008
.NET Framework 3.5, 4.0