[Orca-users] Orcallator - Segmentation Fault
Brian Poole
pooleb at gmail.com
Sat Sep 9 18:20:24 PDT 2006
I believe I wrote something very close to the sequence where SE segfaults,
but it is probably a good idea for someone to double-check it.
iostat -n ran nearly instantly on the servers here.
Also as of tonight, my servers are all working fine again. We had a
quarterly outage window where we rebuilt the device trees so my
servers should be good until we start moving disks around again.
Hopefully this problem can be figured out before then.
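
For anyone double-checking, here is a minimal C sketch of the directory walk
in question. It is an illustration only, not the code referred to above: the
function names are hypothetical, and the directory to scan (for example
/dev/dsk) is an assumption passed in by the caller. It mirrors the loop quoted
from include/diskinfo.se further down: read each entry, skip "." and "..",
and keep only names ending in s0.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name ends in "s0" (the slice filter the SE loop applies). */
int ends_with_s0(const char *name)
{
    size_t len = strlen(name);
    return len >= 2 && strcmp(name + len - 2, "s0") == 0;
}

/* Walk dir_path the way the SE loop does and count the "s0" entries.
 * Returns -1 if the directory cannot be opened. */
int count_s0_entries(const char *dir_path)
{
    DIR *dirp;
    struct dirent *ld;
    int count = 0;

    assert(dir_path != NULL);
    dirp = opendir(dir_path);
    if (dirp == NULL)
        return -1;

    while ((ld = readdir(dirp)) != NULL) {
        /* Skip the self and parent entries. */
        if (strcmp(ld->d_name, ".") == 0 || strcmp(ld->d_name, "..") == 0)
            continue;
        /* Only slice 0 of each disk is of interest. */
        if (!ends_with_s0(ld->d_name))
            continue;
        count++;
    }
    closedir(dirp);
    return count;
}
```

If a loop like this completes on an affected server while disks.se still
crashes, that would point at SE's handling of the dirent rather than at
readdir itself.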
On 9/8/06, Cockcroft, Adrian <acockcroft at ebay.com> wrote:
>
> So it's reading a directory full of disk devices and failing. Is the same sequence working when coded in C?
> If so, then it's an internal problem in SE, and one way to work around it would be to build C code that works as a shared library and attach to it.
>
> The other workaround is to disable this part of the code entirely on systems that give problems. The disk name mapping will not be correct, that's all.
>
> What's the startup time for iostat -n on this system? There is a related problem in libdevinfo.
>
> % time iostat -n
>
> should be a millisecond or so; I've seen it take several seconds on some systems, blocked in a single system call where libdevinfo is trying to figure out the disk devices.
>
> Adrian
>
> -----Original Message-----
> From: Dmitry Berezin [mailto:dberezin at surfside.rutgers.edu]
> Sent: Fri 9/8/2006 7:05 AM
> To: 'Biju Joseph'
> Cc: Cockcroft, Adrian; 'Brian Poole'; orca-users at orcaware.com
> Subject: RE: [Orca-users] Orcallator - Segmentation Fault
>
> Can you rebuild the device tree on this server and try to run Orcallator again?
>
>
>
> -Dmitry.
>
>
>
> -----Original Message-----
> From: Biju Joseph [mailto:biju.joseph at gmail.com]
> Sent: Friday, September 08, 2006 9:49 AM
> To: Dmitry Berezin
> Cc: Cockcroft, Adrian; Brian Poole; orca-users at orcaware.com
> Subject: Re: [Orca-users] Orcallator - Segmentation Fault
>
>
>
> Today I tried to run orcallator on a different machine which has no cluster
> software, but has EMC disks attached and VxVM installed. I am getting the
> same problem (Segmentation Fault).
>
>
>
> But as mentioned earlier, it installed successfully on machines
> which do not have EMC SAN disks.
>
>
>
> Any solution would be highly appreciated.
>
>
>
> Thanks
>
> Biju..
>
>
>
> On 9/8/06, Dmitry Berezin <dberezin at acs.rutgers.edu> wrote:
>
> It is failing while dereferencing, but the pointer is not null -
>
> dp = *((dirent_t *) ld<4281687800>)
> Segmentation Fault (core dumped)
>
> -Dmitry.
>
>
> > -----Original Message-----
> > From: orca-users-bounces+dberezin=acs.rutgers.edu at orcaware.com
> > [mailto:orca-users-bounces+dberezin=acs.rutgers.edu at orcaware.com] On
> > Behalf Of Cockcroft, Adrian
> > Sent: Thursday, September 07, 2006 2:32 PM
> > To: Brian Poole
> > Cc: Dmitry Berezin; orca-users at orcaware.com; Biju Joseph
> > Subject: Re: [Orca-users] Orcallator - Segmentation Fault
> >
> > OK, so it's failing while walking the directory tree; I can see that the
> > renew is already in place a line or so earlier.
> >
> > It's dereferencing a directory structure that isn't there, so a test
> > needs to be added to skip this if readdir returns something bad. It's
> > already testing for null, so there is something bad happening between
> > the null test and the actual usage of the dirp.
> >
> > http://docs.sun.com/app/docs/doc/819-2243/6n4i099g0?q=readdir&a=view
> >
> > I'm not sure how to fix this, maybe a second test for null immediately
> > before it's de-referenced?
> >
> > Adrian
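
One possible shape for such a guard, sketched in C rather than SE. It rests
on an unconfirmed assumption: that the crash comes from copying a full
fixed-size dirent (with a 256-byte d_name) when the record readdir returned
is shorter, in which case a second null test would not help but bounding the
copy by the record length would. The function name and the buffer-size
constant are hypothetical; reclen and name_off stand in for d_reclen and the
offset of d_name in a real dirent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_BUF_SIZE 256

/* Copy the NUL-terminated name out of a variable-length directory record
 * without reading past the end of the record.
 *   rec      - start of the record
 *   reclen   - total length of the record in bytes
 *   name_off - offset of the name within the record
 * Returns 0 on success, -1 if no terminated name fits inside the record. */
int copy_record_name(const char *rec, size_t reclen, size_t name_off,
                     char out[NAME_BUF_SIZE])
{
    size_t avail;
    size_t n;
    const char *name;
    const char *nul;

    assert(rec != NULL && out != NULL);
    if (name_off >= reclen)
        return -1;

    /* Only bytes inside the record (and inside out) may be touched. */
    avail = reclen - name_off;
    if (avail > NAME_BUF_SIZE)
        avail = NAME_BUF_SIZE;

    name = rec + name_off;
    nul = memchr(name, '\0', avail);
    if (nul == NULL)
        return -1;

    n = (size_t)(nul - name) + 1;   /* include the terminator */
    memcpy(out, name, n);
    return 0;
}
```

The point of the design is that the copy length comes from the record
itself, never from the declared size of the destination structure.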
> >
> > -----Original Message-----
> > From: Brian Poole [mailto:pooleb at gmail.com]
> > Sent: Thursday, September 07, 2006 10:39 AM
> > To: Cockcroft, Adrian
> > Cc: Dmitry Berezin; Biju Joseph; orca-users at orcaware.com
> > Subject: Re: [Orca-users] Orcallator - Segmentation Fault
> >
> > Here is all of the information I've been able to gather on the crash
> > (SE Toolkit 3.4 on Solaris 10). I compiled it fresh using Forte with
> > debugging enabled. I took a quick look at trying to find where the
> > problem actually lies but was unable to come up with anything useful.
> >
> > Here is the output of running disks.se with debug:
> >
> > # /opt/RICHPse/bin/se.sparcv9 -d /opt/RICHPse/examples/disks.se
> > if (count<31> == GLOBAL_diskinfo_size<101>)
> > dp = *((dirent_t *) ld<4281687704>)
> > if (dp.d_name<c3t8d24s3> == <.> || dp.d_name<c3t8d24s3> == <..>)
> > if (!(dp.d_name<c3t8d24s3> =~ <s0$>))
> > ld = readdir(dirp<4281664128>)
> > if (count<31> == GLOBAL_diskinfo_size<101>)
> > dp = *((dirent_t *) ld<4281687736>)
> > if (dp.d_name<c3t8d24s4> == <.> || dp.d_name<c3t8d24s4> == <..>)
> > if (!(dp.d_name<c3t8d24s4> =~ <s0$>))
> > ld = readdir(dirp<4281664128>)
> > if (count<31> == GLOBAL_diskinfo_size<101>)
> > dp = *((dirent_t *) ld<4281687768>)
> > if (dp.d_name<c3t8d24s5> == <.> || dp.d_name<c3t8d24s5> == <..>)
> > if (!(dp.d_name<c3t8d24s5> =~ <s0$>))
> > ld = readdir(dirp<4281664128>)
> > if (count<31> == GLOBAL_diskinfo_size<101>)
> > dp = *((dirent_t *) ld<4281687800>)
> > Segmentation Fault (core dumped)
> >
> > So tracking that back shows the segfault occurs on line 215 of
> > include/diskinfo.se:
> >
> > for (ld = readdir(dirp); ld != 0; ld = readdir(dirp)) {
> > // grow the array if needed
> > if (count == GLOBAL_diskinfo_size) {
> > GLOBAL_diskinfo_size += 4;
> > GLOBAL_disk_info = renew GLOBAL_disk_info[GLOBAL_diskinfo_size];
> > }
> > dp = *((dirent_t *) ld); <---------
> >
> > Also the truss output:
> >
> > # truss -fo /tmp/truss.log /opt/RICHPse/bin/se.sparcv9
> > /opt/RICHPse/examples/disks.se
> > # tail -15 /tmp/truss.log
> > 5967: ioctl(4, KSTAT_IOC_READ, "sd3547,err") = 701015
> > 5967: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 701015
> > 5967: ioctl(4, KSTAT_IOC_READ, "sd2146,err") = 701015
> > 5967: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 701015
> > 5967: ioctl(4, KSTAT_IOC_READ, "sd2177,err") = 701015
> > 5967: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 701015
> > 5967: ioctl(4, KSTAT_IOC_READ, "sd3935,err") = 701015
> > 5967: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 701015
> > 5967: ioctl(4, KSTAT_IOC_READ, "sd1971,err") = 701015
> > 5967: ioctl(4, KSTAT_IOC_CHAIN_ID, 0x00000000) = 701015
> > 5967: ioctl(4, KSTAT_IOC_READ, "sd1972,err") = 701015
> > 5967: Incurred fault #6, FLTBOUNDS %pc = 0xFF2E08EC
> > 5967: siginfo: SIGSEGV SEGV_MAPERR addr=0xFF356000
> > 5967: Received signal #11, SIGSEGV [default]
> > 5967: siginfo: SIGSEGV SEGV_MAPERR addr=0xFF356000
> >
> > And perhaps more indicative, the trace:
> >
> > # /opt/SUNWspro/bin/dbx /opt/RICHPse/bin/se.sparcv9 core
> > For information about new features see `help changes'
> > To remove this message, put `dbxenv suppress_startup_message 7.5' in
> > your .dbxrc
> > Reading se.sparcv9
> > core file header read successfully
> > Reading ld.so.1
> > Reading libkvm.so.1
> > Reading libkstat.so.1
> > Reading libdl.so.1
> > Reading libelf.so.1
> > Reading libgen.so.1
> > Reading libm.so.2
> > Reading libsocket.so.1
> > Reading libnsl.so.1
> > Reading libc.so.1
> > Reading libc_psr.so.1
> > Reading libmp.so.2
> > Reading libmd5.so.1
> > Reading libscf.so.1
> > Reading libdoor.so.1
> > Reading libuutil.so.1
> > Reading librt.so.1
> > Reading libaio.so.1
> > program terminated by signal SEGV (no mapping at the fault address)
> > 0xff2e08ec: _memcpy+0x042c: ldd [%o1], %c2
> > Current function is member_fill
> > dbx: warning: can't find file "/tmp/se-src/run.c"
> > dbx: warning: see `help finding-files'
> > (dbx) where
> > [1] _memcpy(0x129938, 0xff356000, 0x8, 0xfffffffa, 0x4, 0x1), at
> > 0xff2e08ec
> > =>[2] member_fill(vp = 0x1297f0, area = 0xff355ef8 "", bias = 0), line
> > 994 in "run.c"
> > [3] struct_fill(vp = 0x1296b0, area = 0xff355ef8 "", bias = 0), line
> > 1043 in "run.c"
> > [4] run_indirection(sp = 0xffbfc4b8), line 1308 in "run.c"
> > [5] run_call(sp = 0xffbfc4b8), line 1608 in "run.c"
> > [6] resolve_expression(vp = 0xffbfcae0, ep = 0x129620, runit = 1),
> > line 2892 in "run.c"
> > [7] run_assign(sp = 0x127530), line 1675 in "run.c"
> > [8] run_statement_list(lp = 0x127510), line 513 in "run.c"
> > [9] run_for(sp = 0x12c078), line 2538 in "run.c"
> > [10] run_statement_list(lp = 0x127330), line 513 in "run.c"
> > [11] run_for(sp = 0x12c0b8), line 2538 in "run.c"
> > [12] run_statement_list(lp = 0x121208), line 513 in "run.c"
> > [13] run_block(bp = 0x133288), line 402 in "run.c"
> > [14] run_call(sp = 0xffbfcec8), line 1625 in "run.c"
> > [15] resolve_expression(vp = 0xffbfd450, ep = 0x13cd80, runit = 1),
> > line 2892 in "run.c"
> > [16] resolve_l_expression(ep = 0x13ae18), line 2659 in "run.c"
> > [17] run_if(sp = 0x13cf88), line 523 in "run.c"
> > [18] run_statement_list(lp = 0x13cf88), line 513 in "run.c"
> > [19] run_block(bp = 0x1426f8), line 402 in "run.c"
> > [20] se_run(argc = 1, argv = 0x74b88), line 366 in "run.c"
> > [21] main(argc = 2, argv = 0xffbffcc4), line 542 in "main.c"
> > *vp = {
> > var_flags = VF_MEMBER
> > var_special = 0
> > var_type = VAR_CHAR
> > var_struct = (nil)
> > var_name = 0xc44f0 "d_name"
> > var_qname = (nil)
> > var_attach_lib = (nil)
> > var_address = (nil)
> > var_initial = (nil)
> > var_un = {
> > var_string = 0x129840 "c3t8d24s6"
> > var_digit = 1218624
> > var_udigit = 1218624U
> > var_ldigit = 5233950226120704LL
> > var_uldigit = 5233950226120704ULL
> > var_rdigit = 2.5859149987693e-308
> > var_user = 0x129840
> > var_array = 0x129840
> > }
> > var_dimension = 256
> > var_subscript = (nil)
> > var_instances = (nil)
> > var_offset = 10
> > var_parent = 0xffbfd588
> > var_next = (nil)
> > }
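
The addresses in the output above can be lined up with a little arithmetic.
This is an inference from the numbers only, not something checked against
the SE source: it assumes member_fill copies the whole d_name member
(var_dimension = 256 bytes) starting at area + var_offset. The last debug
line shows ld<4281687800>, which is 0xff355ef8 and matches area in frames
[2] and [3]; each readdir result in the debug output is 32 bytes past the
previous one. Adding var_offset = 10 and 256 bytes gives an end address of
0xff356002, just past the faulting address 0xff356000 reported by truss. A
small C check of that arithmetic (function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Address one past the last byte read by a copy of len bytes
 * starting at area + offset. */
uint32_t copy_end(uint32_t area, uint32_t offset, uint32_t len)
{
    return area + offset + len;
}

/* Return 1 if a copy of len bytes starting at area + offset reads
 * the byte at fault_addr, otherwise 0. */
int copy_reaches(uint32_t area, uint32_t offset, uint32_t len,
                 uint32_t fault_addr)
{
    uint32_t start = area + offset;

    assert(len > 0);
    return fault_addr >= start && fault_addr < start + len;
}
```

With area = 0xff355ef8, offset = 10 and len = 256, copy_reaches() reports
that the copy touches 0xff356000, while a copy of only the bytes needed for
"c3t8d24s6" and its terminator would not.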
> >
> > I would be more than happy to provide any additional information on
> > the problem you might need. Feel free to contact me directly on this
> > issue.
> >
> > Thank you,
> >
> > Brian
> >
> > On 9/7/06, Cockcroft, Adrian <acockcroft at ebay.com> wrote:
> > > It should still be possible to avoid the crash by checking for a null
> > > at the right point.
> > >
> > > Is it crashing in kstat read of the iostats, or the devinfo name
> > > mapping at startup?
> > >
> > > Adrian
> > >
> > > -----Original Message-----
> > > From: Dmitry Berezin [mailto:dberezin at surfside.rutgers.edu]
> > > Sent: Thursday, September 07, 2006 8:43 AM
> > > To: Cockcroft, Adrian; 'Biju Joseph'; orca-users at orcaware.com
> > > Subject: RE: [Orca-users] Orcallator - Segmentation Fault
> > >
> > > Adrian,
> > >
> > > I believe that the actual problem is not with the array sizes, but has
> > > to do with the "stale" disk devices. SE "segfaults" when it tries to
> > > access a device that is not currently present on the system. That is
> > > why the problem is usually seen on the clustered systems with shared
> > > storage or systems with BCV devices that frequently change their state
> > > to offline. A number of people had previously reported that rebuilding
> > > device tree fixed the problem.
> > >
> > > I have not had time to look at the code, so I do not know if this
> > > could be solved by changing scripts or SE itself has to be patched.
> > >
> > > -Dmitry.
> > >
> > >
> > > > -----Original Message-----
> > > > From: orca-users-bounces+dberezin=acs.rutgers.edu at orcaware.com
> > > > [mailto:orca-users-bounces+dberezin=acs.rutgers.edu at orcaware.com] On
> > > > Behalf Of Cockcroft, Adrian
> > > > Sent: Thursday, September 07, 2006 11:13 AM
> > > > To: Biju Joseph; orca-users at orcaware.com
> > > > Subject: Re: [Orca-users] Orcallator - Segmentation Fault
> > > >
> > > > Years ago I fixed the code that looks at disks to resize the array
> > > > dynamically. I guess that this code got overwritten at some point,
> > > > but it's a simple fix; it just doesn't look much like C code...
> > > >
> > > > You can use the "renew" keyword to make a new array that is bigger
> > > > and contains the same items, so figure out where it's indexing into
> > > > the disk array, check the index and renew the array to be size+10 or
> > > > something. There's example code in the generic SE disk class, which
> > > > for some reason orcallator doesn't seem to use?
> > > >
> > > > I'm not currently working on a Solaris box, so it will take me a
> > > > while to get a setup I could test this fix on, probably a few weeks
> > > > when I get back from a business trip.
> > > >
> > > > Adrian
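
For readers more used to C, the closest analogue of "renew" is realloc:
grow the array, keep the existing items, and check the index before every
store. The sketch below is an analogy only and is not taken from SE or
orcallator; the struct and function names are hypothetical, and the growth
step of 10 simply follows the "size+10 or something" suggestion above.

```c
#include <assert.h>
#include <stdlib.h>

#define GROW_STEP 10

/* A growable list of integers standing in for the disk array. */
struct disk_list {
    int *items;
    size_t count;      /* entries in use */
    size_t capacity;   /* entries allocated */
};

/* Append value, growing the array by GROW_STEP when it is full.
 * Returns 0 on success, -1 if the allocation fails (list unchanged). */
int disk_list_append(struct disk_list *dl, int value)
{
    assert(dl != NULL);

    if (dl->count == dl->capacity) {
        size_t new_cap = dl->capacity + GROW_STEP;
        int *grown = realloc(dl->items, new_cap * sizeof *grown);

        if (grown == NULL)
            return -1;
        dl->items = grown;     /* existing items are preserved */
        dl->capacity = new_cap;
    }
    dl->items[dl->count++] = value;
    return 0;
}
```

Starting from an empty list ({NULL, 0, 0}), 25 appends leave the capacity
at 30 with all 25 values intact.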
> > > >
> > > > -----Original Message-----
> > > > From: orca-users-bounces+acockcroft=ebay.com at orcaware.com on behalf
> > > > of Biju Joseph
> > > > Sent: Thu 9/7/2006 7:28 AM
> > > > To: orca-users at orcaware.com
> > > > Subject: [Orca-users] Orcallator - Segmentation Fault
> > > >
> > > > Hello All,
> > > >
> > > > I am trying to start orcallator on two nodes of a VCS cluster (4.1)
> > > > with VxVM 4.1. The database is on EMC disks. Orcallator is giving a
> > > > segmentation fault.
> > > >
> > > > RICHPse version is 3.4 (03:59 PM 01/05/05). I tried using
> > > > orcallator.se 1.36 and 1.37. Both give the same problem.
> > > >
> > > > The same combination is working on non-clustered systems. All
> > > > systems are Solaris 10.
> > > >
> > > > Can any of you help?
> > > >
> > > > Appreciate your help.
> > > >
> > > > Regards
> > > > Biju K Joseph
> > > > +91-9866116298
> > > >
> > > > _______________________________________________
> > > > Orca-users mailing list
> > > > Orca-users at orcaware.com
> > > > http://www.orcaware.com/mailman/listinfo/orca-users
> > >