3/29/2007

DB2 HA fallover problem

DB2 HA runs on two AIX servers, A and B. The HACMP takeover test is OK.
But when we issue "halt -q" on server A, B takes over all of A's resources, yet it is very slow when it comes to "db2start": starting 16 nodes takes 30 minutes.
Some networking problems are reported:
1. The service IP of B is moved to another interface, and the service IP of A is moved to that same interface.
2. "Interface 192.168.7.3 has failed on node PDBA", "Interface 192.168.7.3 is now available on node PDBA". 192.168.7.3 is the boot IP of B.
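
To see how often an interface bounces during takeover, the two messages quoted above can be counted per interface. The following is a minimal sketch only: the regular expressions are modeled on the two message texts in this post, and the function name `count_flaps` and the idea of feeding it lines from a cluster log are my own assumptions, not anything HACMP provides.

```python
import re
from collections import defaultdict

# Patterns modeled on the two messages quoted in the post. Real log lines
# may carry timestamps or other prefixes, so we search instead of match.
FAILED = re.compile(r"Interface (\d+\.\d+\.\d+\.\d+) has failed on node (\w+)")
AVAILABLE = re.compile(r"Interface (\d+\.\d+\.\d+\.\d+) is now available on node (\w+)")


def count_flaps(lines):
    """Count completed fail -> available cycles per (interface, node)."""
    down = set()               # (ip, node) pairs currently reported as failed
    flaps = defaultdict(int)   # (ip, node) -> number of fail/recover cycles
    for line in lines:
        m = FAILED.search(line)
        if m:
            down.add(m.groups())
            continue
        m = AVAILABLE.search(line)
        if m and m.groups() in down:
            down.discard(m.groups())
            flaps[m.groups()] += 1
    return dict(flaps)


if __name__ == "__main__":
    # Hypothetical input: the two messages from item 2, one per line.
    sample = [
        "Interface 192.168.7.3 has failed on node PDBA",
        "Interface 192.168.7.3 is now available on node PDBA",
    ]
    for (ip, node), n in count_flaps(sample).items():
        print(f"{ip} on {node}: {n} fail/recover cycle(s)")
```

A count that keeps growing for the same boot IP would suggest the interface is flapping rather than failing once.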

I have to leave tomorrow, so I suggested calling 800 support. Waiting for further progress.

4 comments:

Yonghang Wang said...

Possible APAR
--------------------------------------

APAR status
Closed as program error.

Error description
During hacmp failover , db2start command is seeing
delays.

Local fix

Problem summary
1. Application startup is slow during hacmp takeover
2. Client lock reclaiming may be slow as client lock
threads sleeps for long time.

Problem conclusion
Changed the lm_delay function to wakeup and return after
predefined sleep.

Temporary fix

Comments

APAR information
APAR number IY92336
Reported component name AIX 5L POWER V5
Reported component ID 5765E6200
Reported release 520
Status CLOSED PER
PE NoPE
HIPER NoHIPER
Submitted date 2006-11-30
Closed date 2007-01-03
Last modified date 2007-01-03

APAR is sysrouted FROM one or more of the following:
IY92334

APAR is sysrouted TO one or more of the following:

Publications Referenced


Fix information
Fixed component name AIX 5L POWER V5
Fixed component ID 5765E6200

Applicable component levels
R520 PSY UP




-----------------------------------------------

APAR status
Closed as program error.

Error description
During hacmp failover , db2start command is seeing
delays.

Local fix

Problem summary
1. Application startup is slow during hacmp takeover
2. Client lock reclaiming may be slow as client lock
threads sleeps for long time.

Problem conclusion
Changed the lm_delay function to wakeup and return after
predefined sleep.

Temporary fix

Comments

APAR information
APAR number IY92334
Reported component name AIX 5.3
Reported component ID 5765G0300
Reported release 530
Status CLOSED PER
PE NoPE
HIPER NoHIPER
Submitted date 2006-11-30
Closed date 2007-01-03
Last modified date 2007-01-03

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:
IY92336

Publications Referenced


Fix information
Fixed component name AIX 5.3
Fixed component ID 5765G0300

Applicable component levels
R530 PSY UP
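
The two records above describe the same fix for two AIX releases, so which APAR applies depends on the release of the node. A minimal sketch of that lookup, assuming the release is given as an `oslevel`-style string such as "5.3.0.0" (the function name `apar_for_oslevel` is hypothetical; the release/APAR pairs are copied from the records above):

```python
# Release -> APAR pairs copied from the two APAR records quoted above.
APAR_BY_RELEASE = {
    "520": "IY92336",  # reported release 520, component ID 5765E6200
    "530": "IY92334",  # reported release 530, component ID 5765G0300
}


def apar_for_oslevel(oslevel):
    """Map a string like "5.3.0.0" to the matching APAR number, or None."""
    parts = oslevel.strip().split(".")
    if len(parts) < 3:
        return None
    release = "".join(parts[:3])  # "5.3.0.0" -> "530"
    return APAR_BY_RELEASE.get(release)


if __name__ == "__main__":
    for level in ("5.2.0.0", "5.3.0.0"):
        print(level, "->", apar_for_oslevel(level))
```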

Yonghang Wang said...

Still following up; I will go on site tomorrow to take a look.

I strongly suspect IY92334, IY89475...

Yonghang Wang said...

IY89475: RPC.STATD DROPS CORE ON HACMP FAILOVER

APAR status
Closed as duplicate of another APAR.

Error description

Steps to reproduce
1. start HACMP on both node1sp and node2sp node.
2. clstop on node2sp.
3. clstart on node2sp.
4. COREDUMP was created with rpc.statd on node2sp.

The stacks of "_svc_run_mt" and "svc_run.svc_run" show:

(dbx) t
__svc_rm_from_xlist(??, ??, ??) at 0xd0777c2c
__xprt_unregister_private(??, ??) at 0xd0777498
_svc_vc_destroy_private(??, ??) at 0xd0784ba4
_svc_destroy_private(??) at 0xd07781e0
_svc_done_private(??) at 0xd077ebb0
_svc_run_mt() at 0xd077f1d4
svc_run.svc_run() at 0xd077fc04
main(0x5, 0x2ff22da0) at 0x100011c4
(dbx) x
$r0:0xfb81ffe0 $stkp:0x2ff229d0 $toc:0xf03bc888
$r3:0xfba1ffe8
$r4:0x30009f48 $r5:0x00000008 $r6:0x00000002
$r7:0x100c51ff
$r8:0x000c51ff $r9:0x00000000 $r10:0x301b2410
$r11:0x00000000
$r12:0xd0777c04 $r13:0x00000001 $r14:0x301b21e8
$r15:0x301b20d8
$r16:0x301b22e8 $r17:0xf0443b98 $r18:0xf0402930
$r19:0xf03b5924
$r20:0xf0402ae8 $r21:0xf0443b90 $r22:0xf03bfda4
$r23:0xf03bfda0
$r24:0xf0400928 $r25:0xf0402b28 $r26:0x00000035
$r27:0x00000001
$r28:0xf0402af0 $r29:0x00019870 $r30:0x301b2238
$r31:0xf03bfa00
$iar:0xd0777c2c $msr:0x0000d0b2 $cr:0x8248822b
$link:0xd0777c04
$ctr:0xd07d7430 $xer:0x20000000 $mq:0xffffffff
Condition status = 0:l 1:e 2:g 3:l 4:l 5:e 6:e
7:leo
[unset $noflregs to view floating point
registers]
[unset $novregs to view vector registers]
in __svc_rm_from_xlist at 0xd0777c2c ($t1)
0xd0777c2c (__svc_rm_from_xlist+0x64) 80030004 lwz
r0,0x4(r3)

Local fix

Problem summary

Problem conclusion

Temporary fix

Comments

This APAR is a duplicate of IY91868

APAR information
APAR number IY89475
Reported component name AIX 5.3
Reported component ID 5765G0300
Reported release 530
Status CLOSED DUB
PE NoPE
HIPER NoHIPER
Submitted date 2006-09-13
Closed date 2007-01-03
Last modified date 2007-01-03

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:
IY89476 IY89477

Yonghang Wang said...

Got the IY92334 ifix and applied it. Problem resolved.