Closed
Bug 704984
Opened 13 years ago
Closed 13 years ago
Thunderbird Daily/Earlybird (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search
Categories
(Thunderbird :: Address Book, defect)
Tracking
(thunderbird11+ fixed)
RESOLVED
DUPLICATE
of bug 712363
People
(Reporter: ohghoo8n.mzl.al, Assigned: mayhemer)
References
Details
(Keywords: crash, dogfood, hang, Whiteboard: [has stack trace])
Crash Data
Attachments
(4 files, 3 obsolete files)
When I run Thunderbird Daily 11.0a1 on Windows 7 SP1 32-bit on the latest of 22 and 23 November, I open the Address Book and try to run a query against the university LDAP server (over LDAPS on port 636). The Address Book is greyed out after further attempts at user interaction. The main Thunderbird window subsequently does the same, and the windows associated with the Thunderbird process are registered as "Not Responding" according to Windows Task Manager. Generated a debug log with a short script I generated (see for logging parameters: https://gist.github.com/1388881). Trying to get the build tools in place since I do not see a crash report on any of multiple attempts.
Not sure if it is an indentical situation, but updated the See Also field because this is the closest I can get to the issue I face in the Bug Tracker. Please update with what I can do for better debugging information and techniques.
Just created a clean profile and configured just the LDAP server. After first test search, I was still able to recreate the issue. Crashed immediately.
Comment 3•13 years ago
|
||
exactly how is your ldap configured in thunderbird?
and just be clear, you are experiencing a hang, not a crash (thunderbird terminates), correct?
(over to ludo, et al)
Severity: normal → critical
Keywords: hang
Comment 4•13 years ago
|
||
Can we get a crash ID ? (see http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does the crash reporter not show up ?
(In reply to Ludovic Hirlimann [:Usul] from comment #4)
> Can we get a crash ID ? (see
> http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> the crash reporter not show up ?
The crash never shows up. To be clear, this is not an installed version. I unzip and run it from a directory with other alpha and beta builds in my user profile. Regardless, I have manually had to force close the application, or opt to close it once the Windows detects the application hang.
Log Name: Application
Source: Application Hang
Date: 11/24/2011 2:03:17 AM
Event ID: 1002
Task Category: (101)
Level: Error
Keywords: Classic
User: N/A
Computer: COMPNAME.ad-domain.domain.tld
Description:
The program thunderbird.exe version 11.0.0.4344 stopped interacting with Windows and was closed. To see if more information about the problem is available, check the problem history in the Action Center control panel.
Process ID: c4c
Start Time: 01ccaa33d4e7a672
Termination Time: 36
Application Path: C:\Users\ajs67\QA\thunderbird-10.0a1.en-US.win32\thunderbird\thunderbird.exe
Report Id: 3ff5b8ca-1627-11e1-8e6e-90f55a0cdfe3
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Application Hang" />
<EventID Qualifiers="0">1002</EventID>
<Level>2</Level>
<Task>101</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2011-11-23T23:03:17.000000000Z" />
<EventRecordID>13863</EventRecordID>
<Channel>Application</Channel>
<Computer>COMPNAME.ad-domain.domain.tld</Computer>
<Security />
</System>
<EventData>
<Data>thunderbird.exe</Data>
<Data>11.0.0.4344</Data>
<Data>c4c</Data>
<Data>01ccaa33d4e7a672</Data>
<Data>36</Data>
<Data>C:\Users\Username\QA\thunderbird-10.0a1.en-US.win32\thunderbird\thunderbird.exe</Data>
<Data>3ff5b8ca-1627-11e1-8e6e-90f55a0cdfe3</Data>
<Binary>55006E006B006E006F0077006E0000000000</Binary>
</EventData>
</Event>
(In reply to Wayne Mery (:wsmwk) from comment #3)
> exactly how is your ldap configured in thunderbird?
What do you mean? My client:
Name: Georgetown Directory
Hostname: directory.georgetown.edu
Base DN: dc=georgetown,dc=edu
Port number: 636
Bind DN: id=username,ou=people,dc=georgetown,dc=edu
[x] Use secure connection (SSL)
Don't return more than 300 results.
Scope: [-] One Level [x] Subtree
Search filter: (objectclass=*)
Login method: Simple
As for the server, I think it is an aging iPlanet SDK implementation. How much detail do you want? Just to be clear, the same configuration on the client works against the university server normally. I can try without LDAPS/LDAP-SSL on the same config for testing if you want.
> and just be clear, you are experiencing a hang, not a crash (thunderbird
> terminates), correct?
>
> (over to ludo, et al)
Comment 7•13 years ago
|
||
(In reply to alharaka from comment #5)
> (In reply to Ludovic Hirlimann [:Usul] from comment #4)
> > Can we get a crash ID ? (see
> > http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> > the crash reporter not show up ?
>
> The crash never shows up. To be clear, this is not an installed version. I
> unzip and run it from a directory with other alpha and beta builds in my
> user profile. Regardless, I have manually had to force close the
> application, or opt to close it once the Windows detects the application
> hang.
We need a stack trace to be able to diagnose what is going on. Could you follow the instructions at https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg (replace firefox , by thunderbird) and attach the stack trace here ?
(In reply to Ludovic Hirlimann [:Usul] from comment #7)
> (In reply to alharaka from comment #5)
> > (In reply to Ludovic Hirlimann [:Usul] from comment #4)
> > > Can we get a crash ID ? (see
> > > http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> > > the crash reporter not show up ?
> >
> > The crash never shows up. To be clear, this is not an installed version. I
> > unzip and run it from a directory with other alpha and beta builds in my
> > user profile. Regardless, I have manually had to force close the
> > application, or opt to close it once the Windows detects the application
> > hang.
>
> We need a stack trace to be able to diagnose what is going on. Could you
> follow the instructions at
> https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg
> (replace firefox , by thunderbird) and attach the stack trace here ?
Working on it now. Sorry for the delay.
I am having an issue(In reply to alharaka from comment #8)
> (In reply to Ludovic Hirlimann [:Usul] from comment #7)
> > (In reply to alharaka from comment #5)
> > > (In reply to Ludovic Hirlimann [:Usul] from comment #4)
> > > > Can we get a crash ID ? (see
> > > > http://support.mozillamessaging.com/en-US/kb/mozilla-crash-reporter) or does
> > > > the crash reporter not show up ?
> > >
> > > The crash never shows up. To be clear, this is not an installed version. I
> > > unzip and run it from a directory with other alpha and beta builds in my
> > > user profile. Regardless, I have manually had to force close the
> > > application, or opt to close it once the Windows detects the application
> > > hang.
> >
> > We need a stack trace to be able to diagnose what is going on. Could you
> > follow the instructions at
> > https://developer.mozilla.org/en/How_to_get_a_stacktrace_with_WinDbg
> > (replace firefox , by thunderbird) and attach the stack trace here ?
>
> Working on it now. Sorry for the delay.
I am having an issue. When I follow the instructions, I get up a "Debugee is running" message and the Thunderbird interface becomes inaccessible after I go back to WinDbg, having tried the LDAP query that causes this issue. When I try to go back to the Thunderbird interface, either the main window or the address book window, I can not get it again. I can see thumbnails when I highlight over them, but clicking ot Alt+Tab operations just open the WinDbg interface again. It is very unusual.
I read the wiki page. I see this message, which the wiki suggests a workaround for ntdll!LdrpDoDebuggerBreak+0x2c or whatever.
> ModLoad: 736a0000 7373b000 C:\Windows\WinSxS
> \x86_microsoft.vc80.crt_1fc8b3b9a1e18e3b_8.0.50727.6195_none_d09154e044272b9a
> \MSVCR80.dll
> (918.14e4): Break instruction exception - code 80000003 (first chance)
> eax=00000000 ebx=00000000 ecx=001df838 edx=770370b4 esi=fffffffe edi=00000000
> eip=770904f6 esp=001df854 ebp=001df880 iopl=0 nv up ei pl zr na pe nc
> cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
> ntdll!LdrpDoDebuggerBreak+0x2c:
> 770904f6 cc int 3
This is before I run the debugger (Go). This is from initial startup of WinDbg. When I try the wiki suggestion to set different breakpoints at these addresses. Then, running the process in WinDbg, it crashes immediately. It is still hanging while I write this message, and I recreated the LDAP issue about ~15 minutes ago. Any pointers?
Reporter | ||
Comment 10•13 years ago
|
||
Forced close the process, and uploaded the log I described in https://bugzilla.mozilla.org/show_bug.cgi?id=704984#c9.
Reporter | ||
Comment 11•13 years ago
|
||
Updated•13 years ago
|
Whiteboard: [has stack trace]
Reporter | ||
Comment 12•13 years ago
|
||
Anything else I can do to help you guys with this? Just checking it.
Updated•13 years ago
|
tracking-thunderbird11:
--- → +
Comment 13•13 years ago
|
||
Just repro'd this on Daily on Linux - the following showed up just before the hang in the terminal:
###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /media/Projects/mozilla/thunderbird/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•13 years ago
|
Keywords: regressionwindow-wanted
Comment 14•13 years ago
|
||
(In reply to alharaka from comment #12)
> Anything else I can do to help you guys with this? Just checking it.
yes! expanding on mconley's edit, you can determine the regression range, i.e find the first nightly build which crashes. ftp://ftp.mozilla.org/pub/thunderbird/nightly/
Comment 15•13 years ago
|
||
grr, wrong button
Comment 16•13 years ago
|
||
Mike/alharaka: are you guys sure your password is stored correctly? I could reproduce this with an out-of-date password for my ldap server, but without, I can't.
Comment 17•13 years ago
|
||
Based on the investigation I was able to get done before my debugger crashed, the UI thread (thread 0) is holding the UI event loop lock and closing the LDAPS socket (actually happening deep in the destructor for the UI event object). This is trying to grab the NSS library's SSL3 Handshake Lock.
In the mean time, thread 4 is negotiating an SSL connection. It grabs the SSL3 Handshake Lock in ssl3_RestartHandshakeAfterAuthCertificate() and then calls down to nsNSSSocketInfo::GetPreviousCert() which tries to send an event back over to the main thread - this is blocking trying to get the SyncRunnableBase monitor in SyncRunnableBase::DispatchToMainThreadAndWait() - note that it's blocking on trying to get its own monitor *before* dispatching the event to the main thread, not on waiting for the main thread to reply that it has finished running the event. This looks very strange. As far as I can see that Runnable is created right before the call, so nobody else should be able to have grabbed its lock.
I'll try to repro under the debugger again.
Comment 18•13 years ago
|
||
OK, I've reproduced this under the debugger and at least one point is wrong in my previous note - the DispatchToMainThreadAndWait() call *is* waiting for the main thread to respond.
So it looks like the deadlock is:
background thread grabs SSL3 handshake lock, dispatches an event to the UI thread, and waits for the UI thread to respond
UI thread calls the destructor on a Runnable it's executing, which blocks grabbing the SSL3 handshake lock
More to come...
Comment 19•13 years ago
|
||
Changing platform from Win7 to All, since I'm getting it on Mac OS X. From the original reporter's traces, it looks like he's hitting the Windows "assertions throw into the debugger" trap. On OS X, the assertion failure gets printed but the program keeps executing (and hangs soon after).
Here's the tiny bit of info printed by the ldap:5 debug trace and the assertion I'm seeing:
2011-12-14 20:39:57.075409 UTC - 1886665920[100319260]: nsLDAPOperation::SimpleBind(): called; bindName = 'mail=ireid@mozilla.com,o=com,dc=mozilla';
2011-12-14 20:39:57.312133 UTC - 1886665920[100319260]: ###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
2011-12-14 20:39:57.349016 UTC - 1886665920[100319260]: unbinding
OS: Windows 7 → All
Comment 20•13 years ago
|
||
(In reply to Irving Reid (:irving) from comment #18)
> OK, I've reproduced this under the debugger and at least one point is wrong
> in my previous note - the DispatchToMainThreadAndWait() call *is* waiting
> for the main thread to respond.
>
> So it looks like the deadlock is:
>
> background thread grabs SSL3 handshake lock, dispatches an event to the UI
> thread, and waits for the UI thread to respond
>
> UI thread calls the destructor on a Runnable it's executing, which blocks
> grabbing the SSL3 handshake lock
>
> More to come...
Have confirmed in the debugger that the background thread and the UI thread are both working on the *same* SSL socket.
I'm not quite sure yet, but I don't think the UI thread is working on exactly the same Runnable that the background thread posted. True to form, XCode crashed again as I was trying to figure that out...
Comment 21•13 years ago
|
||
In my (Windows) build, LDAPS calls through to PSM which calls DispatchToMainThreadAndWait ... on the main thread. Oops.
Comment 22•13 years ago
|
||
Is this because of the psm/ssl thread reworking to stop using xpcom proxies? I forget if those patches landed on the trunk.
I see similar hangs with SSL IMAP on the trunk as well.
Comment 23•13 years ago
|
||
Irving seems to be correct about the deadlock.
I think the bug is that the socket is being closed on the main thread. Instead, AFAICT, all sockets must be closed only on the socket transport thread. Honza, Kai, Patrick: please confirm or refute.
Compare nsHTTPConnection::~nsHTTPConnection(). There is no Close() call there.
PSM's NSS socket layer is not thread safe (see bug 427948 for example). AFAICT, this has always been the case, even before the XPCOM-removal patches. Thus, based on my limited knowledge of comm-central-specific code, it seems like there has always been a race condition here.
I have made the assumption in the XPCOM-removal patch that only the socket transport thread is allowed to perform operations on sockets that are managed by the socket transport thread.
Comment 24•13 years ago
|
||
(In reply to neil@parkwaycc.co.uk from comment #21)
> In my (Windows) build, LDAPS calls through to PSM which calls
> DispatchToMainThreadAndWait ... on the main thread. Oops.
In my XPCOM removal patches, I was very careful to ensure that DispatchToMainThreadAndWait is never called by any method that is expected/allowed to be called on the main thread. I specifically did not implement any logic like
if (NS_IsMainThread() {
runnable->Run();
} else {
runnable->DispatchToMainThreadAndWait();
}
specifically to catch these kinds of bugs.
So, either:
(1) My assumptions about which threads are allowed to do what operations is wrong, or
(2) The LDAPS code that is operating on the file descriptors is doing so on the wrong thread.
Comment 25•13 years ago
|
||
The hang I'm seeing is calling nsSocketTransport::IsAlive from the UI thread. Is that not allowed anymore?
![]() |
Assignee | |
Comment 26•13 years ago
|
||
Confirming Brian's words: sockets must be completely managed only on the socket thread. This is result of both XPCOM proxy removal and SSL thread removal patches that we now see this bug in LDAP implementation.
This is how LDAP creates the SSL socket, this happens on the main thread:
xul.dll!nsSSLIOLayerAddToSocket(int family=2, const char * host=0x0481deb8, int port=636, const char * proxyHost=0x00000000, int proxyPort=0, PRFileDesc * fd=0x086f6160, nsISupports * * info=0x00296fac, bool forSTARTTLS=true, bool anonymousLoad=false) Line 3914 C++
xul.dll!nsTLSSocketProvider::AddToSocket(int family=2, const char * host=0x0481deb8, int port=636, const char * proxyHost=0x00000000, int proxyPort=0, unsigned int flags=0, PRFileDesc * aSocket=0x086f6160, nsISupports * * securityInfo=0x00296fac) Line 96 + 0x30 bytes C++
xul.dll!nsLDAPSSLConnect(const char * hostlist=0x04819738, int defport=636, int timeout=-1, unsigned long options=0, lextiof_session_private * sessionarg=0x04826138, lextiof_socket_private * * socketargp=0x08c16e0c) Line 217 + 0x47 bytes C++
nsldap32v60.dll!nsldapi_connect_to_host(ldap * ld=0x08bfa9f8, sockbuf * sb=0x08c16ca8, const char * hostlist=0x04819738, int defport=636, int secure=1, char * * krbinstancep=0x049a8e10) Line 510 + 0x35 bytes C
nsldap32v60.dll!nsldapi_new_connection(ldap * ld=0x08bfa9f8, ldap_server * * srvlistp=0x00297490, int use_ldsb=1, int connect=1, int bind=0) Line 651 + 0x33 bytes C
nsldap32v60.dll!nsldapi_open_ldap_defconn(ldap * ld=0x08bfa9f8) Line 716 + 0x13 bytes C
nsldap32v60.dll!nsldapi_send_server_request(ldap * ld=0x08bfa9f8, berelement * ber=0x08dd2e60, int msgid=1, ldapreq * parentreq=0x00000000, ldap_server * srvlist=0x00000000, ldap_conn * lc=0x00000000, char * bindreqdn=0x07421eb8, int bind=0) Line 195 + 0x9 bytes C
nsldap32v60.dll!nsldapi_send_initial_request(ldap * ld=0x08bfa9f8, int msgid=1, unsigned long msgtype=96, char * dn=0x07421eb8, berelement * ber=0x08dd2e60) Line 148 + 0x2b bytes C
nsldap32v60.dll!simple_bind_nolock(ldap * ld=0x08bfa9f8, const char * dn=0x07421eb8, const char * passwd=0x04823f40, int unlock_permitted=1) Line 153 + 0x17 bytes C
nsldap32v60.dll!ldap_simple_bind(ldap * ld=0x08bfa9f8, const char * dn=0x07421eb8, const char * passwd=0x04823f40) Line 82 + 0x13 bytes C
xul.dll!nsLDAPOperation::SimpleBind(const nsACString_internal & passwd={...}) Line 322 + 0x30 bytes C++
xul.dll!nsAbLDAPListenerBase::OnLDAPInit(nsILDAPConnection * aConn=0x04a04de8, unsigned int aStatus=0) Line 302 + 0x35 bytes C++
xul.dll!nsLDAPConnection::OnLookupComplete(nsICancelable * aRequest=0x049a8d94, nsIDNSRecord * aRecord=0x08eb6710, unsigned int aStatus=0) Line 663 C++
The ldap code is not simple and I really don't understand well how it works and how it even poll sockets.
Adding some folks that might know more.
Comment 27•13 years ago
|
||
(In reply to David :Bienvenu from comment #25)
> The hang I'm seeing is calling nsSocketTransport::IsAlive from the UI
> thread. Is that not allowed anymore?
It seems like in this instance, my assumption is wrong. nsISocketTransport's documentation says: "* NOTE: This is a free-threaded interface, meaning that the methods on this interface may be called from any thread."
The implementation of nsSocketTransport::IsAlive calls PR_Recv(fd, &c, 1, PR_MSG_PEEK, 0);
I believe there are two issues:
(1) The SSL code now assumes that PR_Recv is being called on the socket transport thread. I guess this worked previously (before bug 674147) because the I/O operations were never being done on the calling thread, but were instead proxied to the SSL thread.
(2) This PR_Recv() is driving the SSL handshake forward, perhaps in a way that the NPN code does not expect. I see that nsHttpConnection is already aware that IsAlive() drives the handshake forward, based on a comment in nsHttpConnection::IsAlive. However, does nsHttpConnection correctly take into account the fact that IsAlive() could have been called from outside of Necko, on another thread?
It seems like in mozilla-central and comm-central, there are very few callers to IsAlive().
Still investigating...
Comment 28•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #27)
> (2) This PR_Recv() is driving the SSL handshake forward, perhaps in a way
> that the NPN code does not expect. I see that nsHttpConnection is already
> aware that IsAlive() drives the handshake forward, based on a comment in
> nsHttpConnection::IsAlive. However, does nsHttpConnection correctly take
> into account the fact that IsAlive() could have been called from outside of
> Necko, on another thread?
I don't see a problem for NPN - it has to be called on the http/spdy socket and nothing outside of necko has access to that. That http socket is allocated by necko I don't see how code outside of necko could operate on it or have access to it. (ldap code acting on a different socket won't impact this socket/npn/psm state).. The comment is referring to an isalive call that necko's nshttpconnectionmgr can make (also on the socket thread).
![]() |
Assignee | |
Comment 29•13 years ago
|
||
(In reply to David :Bienvenu from comment #25)
> The hang I'm seeing is calling nsSocketTransport::IsAlive from the UI
> thread. Is that not allowed anymore?
I don't see IsAlive call in https://bugzilla.mozilla.org/attachment.cgi?id=581743. Does your comment belong to a different bug?
LDAP is not using nsSocketTransport. It's only using the NSPR socket layer directly.
Comment 30•13 years ago
|
||
IMAP is using socket transport - see comment 22, and http://mxr.mozilla.org/comm-central/source/mailnews/imap/src/nsImapProtocol.cpp#1231
![]() |
Assignee | |
Comment 31•13 years ago
|
||
Interesting. nsSocketTransport::IsAlive has always been implemented as PR_Read(1, MSG_PEEK). It's interesting to allow it be called on any thread. There can easily be IMO race between read on the socket thread and e.g the main thread.
We might start thinking of making IsAlive just a passive check, not active (check state variables only). IMO it is better to claim the socket is alive and let consumers figure the state out by call of read or write on the socket streams.
Could also make bug 631801 patch simpler.
Looks like there is more then one issue:
- LDAP implementation accesses NSPR sockets on the main thread (fix it by moving to the socket thread, if possible)
- IMAP is using socket transport and accesses some methods on the main (non-networking) thread (fix it by making the socket transport implementation thread safe)
Comment 32•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #31)
> Interesting. nsSocketTransport::IsAlive has always been implemented as
> PR_Read(1, MSG_PEEK). It's interesting to allow it be called on any thread.
> There can easily be IMO race between read on the socket thread and e.g the
> main thread.
>
> We might start thinking of making IsAlive just a passive check, not active
> (check state variables only). IMO it is better to claim the socket is alive
> and let consumers figure the state out by call of read or write on the
> socket streams.
I filed bug 711786 for this.
> - LDAP implementation accesses NSPR sockets on the main thread (fix it by
> moving to the socket thread, if possible)
That is this bug.
> - IMAP is using socket transport and accesses some methods on the main
> (non-networking) thread (fix it by making the socket transport
> implementation thread safe)
IMO, it is better to change this IMAP code to work on the socket transport thread. I filed I filed bug 711787 for this. But, it may also be possible to fix it by making IsAlive thread safe.
Comment 33•13 years ago
|
||
Wen we ensure that nsLDAPConnection::Close() (or at least ldap_unbind) is always called on the socket transport thread? At least, we would have to change the LDAP code so that it always closes the LDAP connection before the destructor is called, so that the Close() in nsLDAPConnection::~nsLDAPConnection() can be removed.
That is the approach I would like to try first. However, I do not know how much other code is expecting to be able to close connections off the socket transport thread, or what other I/O-related threading issues there may be in Mail/News.
![]() |
Assignee | |
Comment 34•13 years ago
|
||
This is an attempt to fix the immediate crash with close (try: http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=79bf9e6f566c)
Here https://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/honzab.moz@firemni.cz-79bf9e6f566c/ will soon be modified Thunderbird Daily build for try.
Anyone able to reproduce the issue, please install the build for your platform from the above URL (if not working, try later, it will work!) and try to reproduce the crash.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Reporter | ||
Comment 35•13 years ago
|
||
So far, tried the Windows build and was able to reproduce the crash.
https://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/honzab.moz@firemni.cz-79bf9e6f566c/try-comm-central-win32/thunderbird-11.0a1.en-US.win32.zip
What was it you wanted exactly? It is a little unclear to me from your message, sorry.
Comment 36•13 years ago
|
||
Honza:
So I'm having a few issues with this build on Windows 7:
1) On TB open, if I immediately open a compose window, and enter fragments of addresses into the To field, I get "LDAP Server Search problem" displayed in an autocomplete dropdown. Clicking on that item reveals:
"Error code 0x80070057: Unknown error
Please contact your System Administrator."
2) I'm able to open the address book, and do a search on an LDAP address book, and have it return results. Doing so seems to clear the issue in (1).
No hangs though, which is good.
-Mike
Comment 37•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #31)
> We might start thinking of making IsAlive just a passive check, not active
> (check state variables only).
checking for a fin queued in the kernel (without having to worry about buffering any goodput) is the major reason for IsAlive() usage, so that would severely harm its utility imo.
![]() |
Assignee | |
Comment 38•13 years ago
|
||
(In reply to alharaka from comment #35)
> What was it you wanted exactly? It is a little unclear to me from your
> message, sorry.
Thanks for checking on this. I forgot to add that if you experience another crash, to please report it and add a link here to that report. It might a very different issue, or the patch just moves the bug to a different place of the code.
(In reply to Mike Conley (:mconley) from comment #36)
> Honza:
>
> So I'm having a few issues with this build on Windows 7:
>
> 1) On TB open, if I immediately open a compose window, and enter fragments
> of addresses into the To field, I get "LDAP Server Search problem" displayed
> in an autocomplete dropdown. Clicking on that item reveals:
>
> "Error code 0x80070057: Unknown error
>
> Please contact your System Administrator."
As I've no test environment unfortunately, I cannot tell you what that is :( Are you 100% sure this is caused by this patch? Doesn't the current Daily build do the same?
>
> 2) I'm able to open the address book, and do a search on an LDAP address
> book, and have it return results. Doing so seems to clear the issue in (1).
>
> No hangs though, which is good.
This fix is about preventing a crash, not a hang. Hang we know about is in IMAP, bug 711787.
Comment 39•13 years ago
|
||
With the patch, I no longer see a deadlock on OS X, but I'm still seeing an assertion failing just before the deadlock would have happened. I'm not 100% sure, but the observer service assertion might be new as well.
2011-12-19 16:25:46.013227 UTC - 1886665920[100319260]: nsLDAPOperation::SimpleBind(): called; bindName = 'mail=ireid@mozilla.com,o=com,dc=mozilla';
2011-12-19 16:25:46.308532 UTC - 1886665920[100319260]: ###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
2011-12-19 16:25:46.344750 UTC - 411570176[1003194a0]: ###!!! ASSERTION: Using observer service off the main thread!: 'Error', file /Users/ireid/tbird/comm-central/mozilla/xpcom/ds/nsObserverService.cpp, line 143
2011-12-19 16:25:46.344771 UTC - 411570176[1003194a0]: unbinding
2011-12-19 16:25:46.345153 UTC - 411570176[1003194a0]: unbound
![]() |
Assignee | |
Comment 40•13 years ago
|
||
(In reply to Irving Reid (:irving) from comment #39)
> With the patch, I no longer see a deadlock on OS X, but I'm still seeing an
> assertion failing just before the deadlock would have happened. I'm not 100%
> sure, but the observer service assertion might be new as well.
Irving, so with an unmodified Daily build, you hang or you crash?
The "Using observer service off the main thread!" is new, caused by my patch, I have to fix that too.
Comment 41•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #40)
> (In reply to Irving Reid (:irving) from comment #39)
> > With the patch, I no longer see a deadlock on OS X, but I'm still seeing an
> > assertion failing just before the deadlock would have happened. I'm not 100%
> > sure, but the observer service assertion might be new as well.
>
> Irving, so with an unmodified Daily build, you hang or you crash?
>
> The "Using observer service off the main thread!" is new, caused by my
> patch, I have to fix that too.
On Mac OS X I hang, see comments 16,17 on this bug. Based on comments 5 and 9, I think the original submitter was also seeing either a hang, or a crash caused by the breakpoint instruction we execute after assertion failures on Windows debug builds.
![]() |
Assignee | |
Comment 42•13 years ago
|
||
So, this is an updated patch, the assertion about observer service should be gone, the assertion about LDAP operation failure should remain (that is expected), deadlocks and crashes should be gone.
Anyone interested (so far thanks for all the help!) please check this try build again (remove the previous one or install this new one over the old one): https://ftp.mozilla.org/pub/mozilla.org/thunderbird/try-builds/honzab.moz@firemni.cz-e281fd419c4a/
Attachment #582731 -
Attachment is obsolete: true
![]() |
Assignee | |
Comment 43•13 years ago
|
||
Comment on attachment 582878 [details] [diff] [review]
attempt 2
(In reply to Honza Bambas (:mayhemer) from comment #42)
> Created attachment 582878 [details] [diff] [review]
> attempt 2
Pressed enter too soon...
I managed to have an environment reproducing the deadlock (seems 100% reliable STR).
This patch DOESN'T fix it. So please forget about this try build now...
Attachment #582878 -
Attachment is obsolete: true
![]() |
Assignee | |
Comment 44•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #43)
> This patch DOESN'T fix it. So please forget about this try build now...
And for Patrick and David: this patch is also a wrong patch file, the fix was actually in LDAP code and was about posting unbind to the socket thread.
Comment 45•13 years ago
|
||
Have done a bunch of debugging to try and isolate the OnLDAPInit() assertion failure
2011-12-19 16:25:46.308532 UTC - 1886665920[100319260]: ###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
The underlying failure is happening in the SSL handshake, which is all taking place on the main thread as far as I can tell. Looking back, I guess this is what Honza saw in comment 26.
Do we have any options other than reworking Thunderbird to move the LDAP calls in address autocomplete (if not everywhere else) off the main thread?
![]() |
Assignee | |
Comment 46•13 years ago
|
||
So,
I have a patch that makes our PSM runnables more generic and thread independent. All seems to work well for PSM and Firefox and fixes the deadlock.
However, removing deadlocks revealed another issue: SSL thread removal patch somehow broke support for blocking sockets. I believe the fix should be fairly simple hopefully, investigating right now.
![]() |
Assignee | |
Comment 47•13 years ago
|
||
More patches will follow.
Cause of why we return WOULD_BLOCK even for a blocking socket is a pending cert verification. LDAP lib cannot handle WOULD_BLOCK.
My original assumption about a simple fix is now gone, this won't be that simple to fix :(
Solutions:
- let the socket be blocked during cert verification (PSM changes)
- or, make LDAP handle blocking sockets (LDAP changes)
Attachment #582975 -
Flags: feedback?(bsmith)
![]() |
Assignee | |
Comment 48•13 years ago
|
||
LDAP has some support for non-blocking IO. But it is disabled for windows and mac: [1], [2].
Suggestions:
- unrealistic in one day: allow async api for all platforms
- unrealistic in one day: make cert verification blocking on blocking sockets (would deadlock on OCSP requests)
- move binding to socket thread (sounds realistic but needs good testing)
- simply disable LDAPS and announce LDAPS is broken and will be fixed ASAP
[1] http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&root=/cvsroot&subdir=mozilla/directory/c-sdk/ldap/include&command=DIFF_FRAMESET&file=portable.h&rev2=5.0&rev1=1.2
[2] http://hg.mozilla.org/projects/ldap-sdks/annotate/2da94bd1028c/c-sdk/ldap/include/portable.h#l168
![]() |
Assignee | |
Comment 49•13 years ago
|
||
Comment on attachment 582975 [details] [diff] [review]
fixes deadlock but leaves LDAPS non-functional
I would also rename RunOnTargetThread to RanOnMainThread to be more distinctive.
Attachment #582975 -
Attachment description: the PSM runnable patch → fixes deadlock but leaves LDAPS non-functional
Comment 50•13 years ago
|
||
No longer deadlocks on Mac OS X with the latest patch. Address autocomplete over LDAPS still fails. Looking up addresses by searching in the addressbook does return results, but it still logs some assertion failures:
###!!! ASSERTION: nsAbLDAPMessageBase::OnLDAPInit(): failed to perform bind operation: 'Error', file /Users/ireid/tbird/comm-central/mailnews/addrbook/src/nsAbLDAPListenerBase.cpp, line 305
###!!! ASSERTION: nsNSSSocketInfo::GetPreviousCert called on the main thread: 'Error', file /Users/ireid/tbird/comm-central/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 944
###!!! ASSERTION: nsNSSSocketInfo::GetPreviousCert called on the main thread: 'Error', file /Users/ireid/tbird/comm-central/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 944
![]() |
Assignee | |
Comment 51•13 years ago
|
||
Irving, which of the two try builds did you test with? From comment 34 or comment 42? (Comment 42 is not working when connection with the server succeeded, I didn't test the patch properly before submit.)
Comment 52•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #51)
> Irving, which of the two try builds did you test with? From comment 34 or
> comment 42? (Comment 42 is not working when connection with the server
> succeeded, I didn't test the patch properly before submit.)
I was running my own dev build patched with https://bug704984.bugzilla.mozilla.org/attachment.cgi?id=582975 (comment 47)
![]() |
Assignee | |
Comment 53•13 years ago
|
||
(In reply to Irving Reid (:irving) from comment #52)
> I was running my own dev build patched with
> https://bug704984.bugzilla.mozilla.org/attachment.cgi?id=582975 (comment 47)
Awesome! Thanks.
Re:
###!!! ASSERTION: nsNSSSocketInfo::GetPreviousCert called on the main thread: 'Error', file /Users/ireid/tbird/comm-central/mozilla/security/manager/ssl/src/nsNSSIOLayer.cpp, line 944
These are still to be fixed...
![]() |
Assignee | |
Comment 54•13 years ago
|
||
Full patch that fixes the issue (deadlock and non-working LDAPS).
- PSMRunnable (SyncRunnableBase) allowed to be dispatched also from the main thread (calls its operation method directly)
- SSLServerCertVerificationJob remembers the thread it has to dispatch the result to (= the thread where the socket is being called on)
- NS_IsMainThread() special handling in PK11PasswordPrompt no longer necessary
- nsNSSSocketInfo::GetPreviousCert can now be called on the main thread too
- when send operation on a blocking socket returns would-block then if we are waiting for cert verify, block by looping the event queue and then retry the send again
The true fix should be made in NSS (not for today): when the socket is set blocking, we must never return would-block since it will be handled as a fatal error by the calling layer. No idea how to achieve this This presumption might be wrong for the ssl socket, though..
Attachment #582975 -
Attachment is obsolete: true
Attachment #582975 -
Flags: feedback?(bsmith)
Attachment #583184 -
Flags: review?(bsmith)
![]() |
Assignee | |
Comment 55•13 years ago
|
||
Irving, please check this patch and let me know. All should work now.
![]() |
Assignee | |
Comment 56•13 years ago
|
||
Comment on attachment 583184 [details] [diff] [review]
[ONLY FOR REFERENCE] v1, PSM fix
https://tbpl.mozilla.org/?tree=Try&rev=3778130a1cd8
Comment 57•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #55)
> Irving, please check this patch and let me know. All should work now.
On Mac OS X, attachment 583184 [details] [diff] [review] fixes both address autocomplete in the compose window and search in the address book. No error or warning messages.
Updated•13 years ago
|
Comment 58•13 years ago
|
||
I think that making SyncRunnableBase::RunOnTargetThread special-case the main thread will not prevent this deadlock, at least in all cases.
The LDAP code manages of queue of LDAP operations, and those operations are executed on the LDAP connections background thread:
http://mxr.mozilla.org/comm-central/source/ldap/xpcom/src/nsLDAPConnection.cpp#366
Some of these LDAP operations perform I/O, right? Assuming so, then there will still be a deadlock where PR_Close is called on the main thread when an LDAP operation is executing on the background thread.
It seems to me that, at least, Close() needs to be modified so that it shuts down the background thread (finishes and/or cancels all LDAP operations) before closing the connection. Currently, the background thread is shut down after the connection is closed (note: ldap_unbind eventually calls PR_Close):
http://mxr.mozilla.org/comm-central/source/ldap/xpcom/src/nsLDAPConnection.cpp#216
Also, isn't there a race between AddPendingOperation and Close? If we Close() and shutdown the background thread, then AddPendingOperation would just re-create it. Maybe it is implicitly assumed that AddPendingOperation and the destructor for nsLDAPConnection will both only be called on the main thread? If so, that would be a good thing to document and check in both (using NS_IsMainThread() and NS_ERROR).
Anyway, it is very important that the main thread waiting for LDAP operations, or the same kind of deadlock will occur, even with Honza's patch.
![]() |
Assignee | |
Comment 59•13 years ago
|
||
(In reply to Brian Smith (:bsmith) from comment #58)
> I think that making SyncRunnableBase::RunOnTargetThread special-case the
> main thread will not prevent this deadlock, at least in all cases.
Can you provide a POC ?
> Some of these LDAP operations perform I/O, right? Assuming so, then there
> will still be a deadlock where PR_Close is called on the main thread when an
> LDAP operation is executing on the background thread.
I had a patch for it, but I've later threw it away. That patch wasn't it self enough to fix.
Also, only unbind call of the LDAP sdk should be posted to a background thread.
Close is called only from two places:
- destructor
- called from OnLookupComplete throwing the last ref to |self| on the main thread when sending the initial request fails
- when the object is released by JS GC
- I don't think the destructor is called sooner then all background ops finish
- "profile-change-net-teardown", here could potentially be a problem
As I can see now, I didn't enclose the deadlock analyzes:
Without any patches:
- main thread is using a blocking PSM socket to establish an SSL connection trying to send the primary request
- the operation fails because we return WOULD_BLOCK from PSMSend (first flaw)
- main thread throws the connection away, calls close (this happens in case of a failure to send the initial request)
- however, the cert verification background thread finished its work and is posting the SSLServerCertVerificationResult to the network thread, which is here not involved at all (seconds flaw)
- on the network thread, SSL_RestartHandshakeAfterAuthCertificate calls nsNSSSocketInfo::GetPreviousCert that waits for the main thread
=> ssl mutex is held on the socket thread and cannot be entered on the main when closing the connection thread -> deadlock
I've also seen deadlock when main thread was calling ClientAuthDataRunnable::DispatchToMainThreadAndWait on the main thread.
> Anyway, it is very important that the main thread waiting for LDAP
> operations, or the same kind of deadlock will occur, even with Honza's patch.
I don't understand this sentence.
(In reply to Honza Bambas (:mayhemer) from comment #54)
> The true fix should be made in NSS (not for today): when the socket is set
> blocking, we must never return would-block since it will be handled as a
> fatal error by the calling layer. No idea how to achieve this This
> presumption might be wrong for the ssl socket, though..
Adding some NSS people to have more detailed statement how NSS has to behave with blocking sockets.
Comment 60•13 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #54)
>
> The true fix should be made in NSS (not for today): when the socket is set
> blocking, we must never return would-block since it will be handled as a
> fatal error by the calling layer.
Perhaps we should add a new error code to indicate that PSMSend is taking
a break to allow the calling layer to verify the server certificate?
But we will still need to update the calling layer to know that this new
error code is not a fatal error.
![]() |
Assignee | |
Comment 61•13 years ago
|
||
In a long term, we should rather fix LDAP code to work on either a background thread since the beginning (i.e. connect and send the initial request) or use non-blocking sockets. The former seems simpler.
We also need to introduce tests that involve:
- LDAP secure server
- the server cert has a valid OCSP AIA
- optionally, OCSP server for the LDAP server cert CA validation
Comment 62•13 years ago
|
||
Please test the patches in bug 706955 and bug 712363.
I have sent this to comm-central's tryserver:
http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=100a88b06f8d
Comment 63•13 years ago
|
||
Let's try that again:
http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTry&rev=d9ebf52a9060
Comment 64•13 years ago
|
||
Patches failed to apply on the try server, so the build didn't go through. I also get hunks failing when I try to manually apply the patches to my own working copy.
Comment 65•13 years ago
|
||
Comment on attachment 583184 [details] [diff] [review]
[ONLY FOR REFERENCE] v1, PSM fix
Canceling review request since we have another patch in bug 712363 that seems to solve this.
Attachment #583184 -
Flags: review?(bsmith)
Updated•13 years ago
|
Keywords: regressionwindow-wanted
Summary: Thunderbird Daily (11.0a1 20111122-20111122) Crashes in LDAP Address Book Search → Thunderbird Daily (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search
Comment 67•13 years ago
|
||
I hit this once yesterday and once today, using up-to-date Earlybird. I used kill -11 to interrupt the hang and generate a crash report -- the crash reports are:
bp-700fd806-deee-4bdc-974a-b63f72120106
bp-e9bcec56-bd1f-4051-abcb-7dec62120105
Mozilla/5.0 (X11; Linux x86_64; rv:11.0a2) Gecko/20120106 Thunderbird/11.0a2
Updated•13 years ago
|
Summary: Thunderbird Daily (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search → Thunderbird Daily/Earlybird (11.0a1 20111122-20111122) Crashes or hangs in LDAP Address Book Search
Comment 68•13 years ago
|
||
One unfortunate aspect of this bug that merits mentioning -- at least on my Linux system, the hang happens while Thunderbird has locked-in focus, so it locks up my whole windowing system. AFAICT, the only way to regain control is by getting to a Ctrl+Alt+F1 TTY or connecting via SSH and running "killall thunderbird-bin". (For the majority of users who don't know about those options, this is effectively a full-system lockup that requires a hard reboot to fix. :-/ )
(This seems to happen to me every time I type in a "To" field in a compose window now :-/ Just bit me 3 times in a row. Adding "dogfood" keyword.)
Keywords: dogfood
Comment 69•13 years ago
|
||
The (not terribly satisfying) workaround is to go into Preferences / Composition / Addressing and un-check the address autocompletion from directory servers, if you're using any LDAPS address books. Hopefully we'll get a fix landed soon and you'll be able to turn it back on.
Might be worth filing a separate bug for the fact that TB uses locked-in focus while it's doing address autocomplete. Doesn't make a lot of sense that you can't switch to another application at that time.
Comment 71•13 years ago
|
||
I opened bug 718131 with this issue, it was not only ldap address book searches. Daily was hanging even when I was auto-completing from Collected Addresses.
Comment 72•13 years ago
|
||
(In reply to Guillermo Huerta [:guillermo] from comment #71)
> I opened bug 718131 with this issue, it was not only ldap address book
> searches. Daily was hanging even when I was auto-completing from Collected
> Addresses.
The bug didn't specify about the collected addresses. If you've turned LDAP off completely (check in both global prefs & account specific prefs) and it is still hanging, please file a bug on just the collected addresses issue.
attempting to download an offline copy of the address book also hangs the app
Just tested shredded/daily and the problem is no longer there.
![]() |
Assignee | |
Updated•13 years ago
|
Attachment #583184 -
Attachment description: v1, PSM fix → [ONLY FOR REFERENCE] v1, PSM fix
![]() |
Assignee | |
Comment 75•13 years ago
|
||
Marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Comment 76•13 years ago
|
||
From a tracking perspective, we shipped this in TB 11 with a relbranch in mozilla-*. All is fine, and the fix is completely in 12 & onwards. Updating flags.
status-thunderbird11:
--- → fixed
Updated•10 years ago
|
Comment 77•10 years ago
|
||
Please don't play with the flags.
Comment 78•9 years ago
|
||
I'm still experiencing this bug in TB 45.2.0 on OSX 10.11.5. As soon as I start typing some characters it hangs if it can't reach the LDAP server (on SSL port 636). The only way to fix it is to disable the LDAP server in the settings.
Comment 79•9 years ago
|
||
To clarify: it hangs = it must be killed (force quit). Then restarted and LDAP disabled. The LDAP server is the university's and doesn't allow connection from outside the university network. So my TB is happy while it's on the university network, but hangs everywhere else unless I disable LDAP.
You need to log in
before you can comment on or make changes to this bug.
Description
•