Guest Mike O.
Posted September 26, 2007

(This was also posted on the server clustering group.)

I'm trying to find out some information about using CHKDSK on a clustered drive. We have a two-node cluster (active/passive) running Windows 2003 R2 Enterprise 32-bit with SP1. The cluster has three shared drives located on an EMC CX700 SAN: a 500MB drive for the quorum and two data drives, 1.5TB (drive E:) and 2.4TB (drive W:). Drive E: is a basic disk; the 2.4TB drive W: is a GPT disk. They're both about 70% full. The E: drive has been active for about a year; the W: drive was added around June.

Yesterday the active node became sluggish and then stopped serving data. It still responded to low-level stuff like PING, but users were getting errors on the server, and logging in gave a blank screen. This has happened a couple of times before (that's a separate issue we're looking into).

We went to the inactive node and did a "move group" in Cluster Administrator. We've done this before for various reasons with no problems; it usually takes about 20 seconds to bring the resources up on the other node.

This time, when the resources came online on the second node, we started getting an application popup: "Windows - Corrupt File : The file or directory E:\$Mft is corrupt and unreadable. Please run the Chkdsk utility." The drive seems to be running OK, with users accessing the information normally. I did some research and it appears that Windows will use the duplicate copy of the MFT if the primary one is corrupted.

I know we need to run CHKDSK soon, but unfortunately, running chkdsk and taking the drive offline for several hours is not something we can do during daytime hours. If necessary we could run it overnight, but with a drive that size I don't know if it would finish by the next morning.

The server has dual fiber connections (we're using the EMC PowerPath software for SAN failover), and nothing happened with the SAN at that time. Based on the timing, I'm assuming the MFT corruption was related to the cluster failover, not a physical hardware issue, so I wasn't planning on running the sector scan. I would imagine a sector scan on a 1.5TB "disk" would run for a while…

At this point I'm planning on running CHKDSK over the weekend. I've never run it on a clustered disk before and I'm looking for some information about it. I've read Microsoft KB176970 and KB903650, but frankly they're a little confusing with the issues about "maintenance mode".

Also, is my understanding about the mirrored/secondary MFT valid? Since users appear to be getting information correctly, can the CHKDSK wait until the weekend? Our backup policy does a full backup each week and an incremental daily, so if something really bad happens we should be able to recover.

Any information on this would be appreciated.

Mike O.
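For readers weighing the same choice between a read-only check, a fixing pass, and a full sector scan, here is a minimal sketch of how the three chkdsk invocations differ and how the documented exit codes can be read. It assumes a Windows host with chkdsk on the PATH; the helper names (build_chkdsk_command, describe_exit_code, run_read_only_check) are hypothetical and not part of any Microsoft tooling. Note that a read-only check on a volume that is in use can report spurious errors, so treat its result as a hint rather than a verdict.

```python
import subprocess
import sys

# Exit codes as documented by Microsoft for chkdsk.
CHKDSK_EXIT_CODES = {
    0: "No errors were found.",
    1: "Errors were found and fixed.",
    2: "Disk cleanup was performed, or skipped because /f was not specified.",
    3: "Could not check the disk, or errors were not fixed because /f was not specified.",
}


def build_chkdsk_command(drive, fix=False, sector_scan=False):
    """Build the chkdsk argument list for a drive letter (hypothetical helper).

    No switch  -> read-only check, the volume stays mounted.
    /f         -> fix file system errors, needs exclusive access to the volume.
    /r         -> locate bad sectors as well (the sector scan); implies /f.
    """
    if len(drive) != 1 or not drive.isalpha():
        raise ValueError("drive must be a single letter, e.g. 'E'")
    command = ["chkdsk", f"{drive.upper()}:"]
    if sector_scan:
        command.append("/r")
    elif fix:
        command.append("/f")
    return command


def describe_exit_code(code):
    """Translate a chkdsk exit code into its documented meaning."""
    return CHKDSK_EXIT_CODES.get(code, f"Unrecognized exit code {code}.")


def run_read_only_check(drive):
    """Run chkdsk without /f and return (exit code, captured output)."""
    result = subprocess.run(
        build_chkdsk_command(drive), capture_output=True, text=True
    )
    return result.returncode, result.stdout


if __name__ == "__main__" and sys.platform == "win32":
    code, output = run_read_only_check("E")
    print(output)
    print(describe_exit_code(code))
```

The sketch only automates the read-only pass; the fixing passes are deliberately left as command builders, since on a clustered disk they require the volume to be taken away from users first.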
Guest Mathieu CHATEAU
Posted September 26, 2007

Re: Correcting a corrupted $MFT on a shared clustered disk

Hello,

GPT disks and clusters are not friends by default, and forcing them to be friends may lead to issues...

By default, server clusters do not support GPT shared disks in Windows Server 2003:
http://support.microsoft.com/kb/284134/en-us

That's the problem with such big data volumes. You should keep data recovery, defrag and chkdsk in mind when sizing data volumes. You will also start having issues once you reach 4 million files.

Now, it's clear you have to run the chkdsk. Downtime for downtime, run it on both drives if you can.

For the performance part:
- Did you exclude all shared data from real-time antivirus scanning on the cluster nodes?
- Do you have any huge MS Access databases?
- Do you have any monitoring/graphing tool to get some history on RAM, CPU and network usage?

--
Regards,
Mathieu CHATEAU
http://lordoftheping.blogspot.com
Guest Mike O.
Posted September 27, 2007

Re: Correcting a corrupted $MFT on a shared clustered disk

Per KB284134, clustering supports GPT if you apply the hotfix, which we did prior to connecting the GPT disk. I don't believe that applying the Microsoft-supported hotfix to correct the issue is "forcing" it.

Also, the problem I'm having is on the smaller basic disk; the GPT one is fine.

We thought about breaking the "drive" into smaller partitions, but the issue we run into is space allocation. Eventually we'll end up with one partition running out of space and another one with space to spare. Our backup system is an enterprise system, running over 1Gb Ethernet (we're looking at backing up over the SAN soon), so backing up 1-2 TB is not a problem.

As for your other questions, I'm not sure where you got the "performance problems" part. The server was working fine and performance was acceptable, then it quickly (over 30 minutes) failed. I'm still investigating it, but I'm wondering if a memory leak in one of the drivers or other processes running on it caused the issue.

We can't exclude real-time virus scanning since these are user files. We've had McAfee products and a support contract with them for years. According to the tech there are no problems with VirusScan 8.x on the cluster.

We don't have any large Access databases on this system. I'm sure there are some, but it's primarily a user file server, not supposed to be for applications.

As for the error I'm receiving, should I be able to wait until this weekend for the CHKDSK, or is it something that's only going to get worse? From some Microsoft KB articles (and other stuff I found), it seems that NTFS keeps two copies of the MFT and will use the other one if the primary is corrupted. Is this correct?

When I do run chkdsk, are there any special issues with the cluster? I know Windows normally can't chkdsk an active disk and would have to do it when the server is rebooted. The problem is that when the server reboots it doesn't see the clustered disk until the cluster service starts, so chkdsk can't access the disk "pre" bootup.

We have a test cluster (with a 50GB shared disk). I ran chkdsk /f on it and it said the drive needed to be dismounted and offered to do that for me. I told it yes and it seemed to work OK. Of course the disk was unavailable while the chkdsk was running, but it came back online as soon as it finished.
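The disagreement over volume sizing and backup time comes down to simple arithmetic, which can be sketched as follows. This is only a rough lower bound under stated assumptions: the 1 Gb/s link runs at full theoretical line rate with no protocol, disk or backup-software overhead, and drive sizes use decimal units (1 TB = 10^12 bytes). The function name min_transfer_hours is hypothetical; the sizes, the 70% fill level and the 1Gb Ethernet link are the figures given in the thread.

```python
def min_transfer_hours(volume_tb, fill_fraction, link_gbps):
    """Best-case hours to move the used data of a volume over a network link.

    Assumes the link runs at its full nominal rate, which real backups
    do not reach, so actual times will be longer.
    """
    data_bytes = volume_tb * 1e12 * fill_fraction  # used space in bytes
    bytes_per_second = link_gbps * 1e9 / 8         # Gb/s to bytes/s
    return data_bytes / bytes_per_second / 3600


# Figures from the thread: E: is 1.5TB, W: is 2.4TB, both about 70% full.
for name, size_tb in (("E", 1.5), ("W", 2.4)):
    hours = min_transfer_hours(size_tb, 0.70, 1.0)
    print(f"Drive {name}: at least {hours:.1f} hours")
```

Even in this best case a full restore of either drive is a matter of hours, not minutes, which is the kind of figure worth having in hand when deciding how large a single volume should be.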
Guest Mike O
Posted September 28, 2007

Re: Correcting a corrupted $MFT on a shared clustered disk

By the way, after posting my previous message, I realized some of my wording may have come off sounding a little cranky. It's been a long, tiring week; in addition to this I've had a couple of other issues, and I may have overreacted a little bit.
Guest Mathieu CHATEAU
Posted September 28, 2007

Re: Correcting a corrupted $MFT on a shared clustered disk

It's OK, I understand that you are under pressure!

I was just trying to make you think about that pressure, which might be lower if you only had to take part of the data offline and not the whole cake ;)

Let us know how it goes after the chkdsk.

--
Regards,
Mathieu CHATEAU
http://lordoftheping.blogspot.com
Guest Mike O
Posted September 30, 2007

Re: Correcting a corrupted $MFT on a shared clustered disk

It's not looking too good at the moment. I started it about 3:15 this afternoon. Phase 1 went through pretty fast; it found and corrected the 60 or so corrupted attribute and orphaned records that the read-only chkdsk passes were detecting.

However, it started phase 2 around 4:00, and now at 9:00 it's still at 0 percent... I seem to remember that the stage 2 steps go in 10% increments (at least I hope so!), and I know that this stage isn't linear and that it might move erratically, but I was hoping to see something besides 0 by now.

According to Task Manager the chkdsk is running; it shows the process at about 30+ CPU time.

Assuming it doesn't finish by the end of our maintenance window, do you know if there would be any problems cancelling the process? I know it wouldn't fix the main problem, but at least we could get the system up and running until another time (or relocate the data to another drive).
>>>> > >>>> > The server has dual fiber connections (we're using the EMC Powerpath >>>> > software for SAN failover), and we didn't have anything happen with >>>> > the >>>> > SAN >>>> > at that time, so based on the timing I'm assuming the MFT corruption >>>> > was >>>> > related to the cluster failover, not a physical hardware issue, so I >>>> > wasn't >>>> > planning on running the sector scan. I would imagine a sector scan >>>> > on a >>>> > 1.5TB "disk" would run for a while… >>>> > At this point I'm planning on running CHKDSK over the weekend. I've >>>> > never >>>> > run it on a clustered disk before and I'm looking for some >>>> > information >>>> > about >>>> > it. I've read Microsoft KB176970 and KB903650, but frankly they're a >>>> > little >>>> > confusing with the issues about "maintenance mode". >>>> > >>>> > Also, is my understanding about the mirrored/secondary MFT valid? >>>> > Since >>>> > users appear to be getting information correctly can the CHKDSK wait >>>> > until >>>> > the weekend?. Our backup policy does a full backup each week and an >>>> > incremental daily, so if something really bad happens we should be >>>> > able to >>>> > recover. >>>> > >>>> > Any information on this would be appreciated. >>>> > >>>> > Mike O. >>>> >>>> >> >
Guest Mike O Posted September 30, 2007
Re: Correcting a corrupted $MFT on a shared clustered disk

This is not looking good at all. It just jumped all the way to 1%. It's been running phase 2 since about 4:00pm today. 7-1/2 hours for 1% isn't a good sign.
Guest Mike O Posted September 30, 2007
Re: Correcting a corrupted $MFT on a shared clustered disk

It's gone from 1% to 76% in the last hour. As I'm watching it, it now seems to be moving about 1% every two minutes.

The optimism is slowly starting to come back.
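For a rough sense of what "about 1% every two minutes" means for the rest of phase 2, here is a minimal sketch of the arithmetic. It assumes the current rate holds, which is a big assumption given how erratically this phase has moved so far; the function name is made up for illustration.

```python
def minutes_remaining(percent_done, minutes_per_percent):
    """Linear estimate of the time left, assuming the current rate holds."""
    if not 0 <= percent_done <= 100:
        raise ValueError("percent_done must be between 0 and 100")
    if minutes_per_percent < 0:
        raise ValueError("minutes_per_percent must not be negative")
    return (100 - percent_done) * minutes_per_percent


# Figures from the post: 76% done, roughly 1% every two minutes.
print(minutes_remaining(76, 2))  # 48
```

So at the pace reported above, phase 2 would need a bit under an hour more. Since the earlier hours of the same phase were far slower, treat this as a best case rather than a prediction.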
Guest Mathieu CHATEAU Posted September 30, 2007
Re: Correcting a corrupted $MFT on a shared clustered disk

Keep going!

Otherwise, can you format the drive and restore from backup? That may go faster, depending on your backup storage and the type of files (the bigger the files, the better).

--
Cordialement,
Mathieu CHATEAU
English blog: http://lordoftheping.blogspot.com
French blog: http://www.lotp.fr
Guest Mike O Posted September 30, 2007 Re: Correcting a corrupted $MFT on a shared clustered disk

It finished around 1:00am. So it took 7-1/2 hours for the first 1%, then 90 minutes for the other 99%.

It displayed what looked like a couple of hundred "resetting security id to default" messages, but I've spot checked the drive and don't see anything out of the ordinary. Unfortunately I didn't redirect the output, and it doesn't look like chkdsk logs the errors (other than the event log entry), and that's not long enough to be useful.

I was thinking about the restore option, and if it had still been running this morning, I was going to start investigating that option.

Thanks for all the comments.
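For anyone repeating this, the lost chkdsk output could be kept by copying it to a file while it runs. Below is a minimal sketch, assuming Python is available on the node; `run_and_log` and the log file name are hypothetical names chosen for illustration, not anything used in this thread. It reads the output line by line, so a prompt that does not end with a newline (such as a dismount question) may not show up until it is answered; it is probably safest to try it on a read-only pass first.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run a command, echo each output line to the console and append it to log_path.

    Returns the exit code of the command.
    """
    with open(log_path, "a", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages in the same log
            text=True,
            errors="replace",          # tolerate odd console characters
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            log.flush()                # keep the log current if the run is cancelled
        return proc.wait()


# Hypothetical usage (read-only pass on drive E:):
#   run_and_log(["chkdsk", "E:"], "chkdsk_E.log")
```

From a plain command prompt, ordinary output redirection (`>`) achieves much the same thing; the sketch only adds the on-screen echo.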
Guest Mathieu CHATEAU Posted September 30, 2007 Re: Correcting a corrupted $MFT on a shared clustered disk

I have already run into the "resetting security id to default" messages. If it's the same problem, it may go badly.

Do you have more than 2 million files?

--
Regards,
Mathieu CHATEAU
English blog: http://lordoftheping.blogspot.com
French blog: http://www.lotp.fr
Guest Mike O Posted October 1, 2007 Re: Correcting a corrupted $MFT on a shared clustered disk

"Mathieu CHATEAU" <gollum123@free.fr> wrote in message news:%234kEXX3AIHA.2004@TK2MSFTNGP06.phx.gbl...
> I already got the "resetting security id to default". If it's the same problem, it may go bad.
> Do you more than 2 Millions of files ?

Do you mean do we have more than 2 million files? If so, then yes. I did see an issue related to the "security id" and chkdsk if you have over 4 million files, but the hotfix is a few years old and the versions of the system DLLs on the server are later than the ones in the hotfix, so it appears that the fix is already there.

After seeing the messages, I spot checked random places on the system. Everything I checked seemed to have the correct security on it. Besides, I still think this was a better option than telling everyone that the data was completely unavailable until we did a restore. Also, from the info I've found, at least it sets the security to locked down, not wide open.

Having to fix them may also have some benefits. Generally we don't give users full rights to modify security, because they almost always end up locking out the system admins, the backup system, etc. The most users get is "modify". Since the files on this server are a consolidation of about 30 servers from 10 different departments, file security is not set very consistently. There are a lot of individual users set on each file, inheritance is not set up efficiently, etc. Cleaning up file security has been on our list of things to do for quite a while; I guess this may force the issue.

Of course, after our help desk (and those of us in the server group) gets swamped with phone calls tomorrow I might feel differently. I may check and see if we can do a restore anyway and tell it not to overwrite anything newer. This way the existing files will get their security reset.

Mike O.
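The spot check described above can be made a little more systematic by picking the files to inspect at random across the whole volume rather than by hand. This is only a sketch, assuming Python is available on the server; `sample_paths` is a hypothetical helper, and looking at the permissions on each sampled path is left to whatever tool you normally use. It uses reservoir sampling, so it never has to hold the full list of several million paths in memory.

```python
import os
import random


def sample_paths(root, k, seed=None):
    """Return up to k file paths chosen uniformly at random from under root.

    Uses reservoir sampling, so memory use stays at k paths no matter how
    many files the tree contains.
    """
    rng = random.Random(seed)
    sample = []
    seen = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            seen += 1
            path = os.path.join(dirpath, name)
            if len(sample) < k:
                sample.append(path)       # fill the reservoir first
            else:
                j = rng.randrange(seen)   # 0 <= j < seen
                if j < k:
                    sample[j] = path      # replace with probability k / seen
    return sample


# Hypothetical usage: list 50 random files on E: to check by hand.
#   for p in sample_paths("E:\\", 50):
#       print(p)
```

A random sample only gives a rough indication: with a couple of hundred affected records among millions of files, a clean sample does not prove that nothing was reset.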
Downtime for >>>>>>>>>> downtime, run it on >>>>>>>>>> both if you can >>>>>>>>>> >>>>>>>>>> For the performance part: >>>>>>>>>> -did you exclude all shared data from real time antivirus scan on >>>>>>>>>> cluster >>>>>>>>>> node ? >>>>>>>>>> -Do you have huge MS Access database ? >>>>>>>>>> -Any monitoring/graphing tool to get some history on >>>>>>>>>> ram;cpu;network usage? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Cordialement, >>>>>>>>>> Mathieu CHATEAU >>>>>>>>>> http://lordoftheping.blogspot.com >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> "Mike O." <MikeO@discussions.microsoft.com> wrote in message >>>>>>>>>> news:3CDFA86D-D7A6-4512-AA64-276CB90144A2@microsoft.com... >>>>>>>>>> > (This was also posted on the server clustering group) >>>>>>>>>> > >>>>>>>>>> > I'm trying to find out some information about using CHKDSK on a >>>>>>>>>> > clustered >>>>>>>>>> > drive. >>>>>>>>>> > We have a two node cluster (active/passive) running Windows >>>>>>>>>> > 2003 R2 >>>>>>>>>> > enterprise 32 bit with SP1. The cluster has three shared >>>>>>>>>> > drives located >>>>>>>>>> > on >>>>>>>>>> > an EMC CX700 SAN. The three drives are a 500MB for the quorum, >>>>>>>>>> > and two >>>>>>>>>> > data >>>>>>>>>> > drives 1.5TB (drive "E") and 2.4TB (drive "W"). Drive E is a >>>>>>>>>> > basic disk, >>>>>>>>>> > the >>>>>>>>>> > 2.4TB drive W is a GPT disk. They're both about 70% full The >>>>>>>>>> > E drive has >>>>>>>>>> > been active for about a year, the W: one was added around June. >>>>>>>>>> > >>>>>>>>>> > Yesterday the active node became sluggish and then stopped >>>>>>>>>> > serving data. >>>>>>>>>> > It >>>>>>>>>> > still responded to low level stuff like PING, users were >>>>>>>>>> > getting errors on >>>>>>>>>> > the server. Logging in gave a blank screen. This has happened >>>>>>>>>> > a couple >>>>>>>>>> > of >>>>>>>>>> > times before (that's a separate issue we're looking into). 
>>>>>>>>>> > >>>>>>>>>> > We went to the inactive node and did a "move group" in the >>>>>>>>>> > cluster >>>>>>>>>> > administrator. We've done this before for various reasons with >>>>>>>>>> > no >>>>>>>>>> > problems, >>>>>>>>>> > it usually takes about 20 seconds to bring the resources up on >>>>>>>>>> > the other >>>>>>>>>> > node. >>>>>>>>>> > >>>>>>>>>> > This time when the resources came on line on the 2nd node, we >>>>>>>>>> > started >>>>>>>>>> > getting an application popup that "Windows - Corrupt File : The >>>>>>>>>> > file or >>>>>>>>>> > directory E:\$Mft is corrupt and unreadable. Please run the >>>>>>>>>> > Chkdsk >>>>>>>>>> > utility." >>>>>>>>>> > The drive seems to be running OK with users accessing the >>>>>>>>>> > information >>>>>>>>>> > normally. I did some research and it appears that Windows will >>>>>>>>>> > use the >>>>>>>>>> > duplicate copy of the MFT if the primary one is corrupted. >>>>>>>>>> > >>>>>>>>>> > I know we need to run CHKDSK soon, but unfortunately, running >>>>>>>>>> > chkdsk and >>>>>>>>>> > taking the drive off line for several hours is not something >>>>>>>>>> > we can do >>>>>>>>>> > during daytime hours. If necessary we could run it overnight, >>>>>>>>>> > but with >>>>>>>>>> > that >>>>>>>>>> > size of drive I don't know if it would finish by the next >>>>>>>>>> > morning. >>>>>>>>>> > >>>>>>>>>> > The server has dual fiber connections (we're using the EMC >>>>>>>>>> > Powerpath >>>>>>>>>> > software for SAN failover), and we didn't have anything happen >>>>>>>>>> > with the >>>>>>>>>> > SAN >>>>>>>>>> > at that time, so based on the timing I'm assuming the MFT >>>>>>>>>> > corruption was >>>>>>>>>> > related to the cluster failover, not a physical hardware issue, >>>>>>>>>> > so I >>>>>>>>>> > wasn't >>>>>>>>>> > planning on running the sector scan. 
I would imagine a sector >>>>>>>>>> > scan on a >>>>>>>>>> > 1.5TB "disk" would run for a while… >>>>>>>>>> > At this point I'm planning on running CHKDSK over the weekend. >>>>>>>>>> > I've never >>>>>>>>>> > run it on a clustered disk before and I'm looking for some >>>>>>>>>> > information >>>>>>>>>> > about >>>>>>>>>> > it. I've read Microsoft KB176970 and KB903650, but frankly >>>>>>>>>> > they're a >>>>>>>>>> > little >>>>>>>>>> > confusing with the issues about "maintenance mode". >>>>>>>>>> > >>>>>>>>>> > Also, is my understanding about the mirrored/secondary MFT >>>>>>>>>> > valid? Since >>>>>>>>>> > users appear to be getting information correctly can the CHKDSK >>>>>>>>>> > wait until >>>>>>>>>> > the weekend?. Our backup policy does a full backup each week >>>>>>>>>> > and an >>>>>>>>>> > incremental daily, so if something really bad happens we should >>>>>>>>>> > be able to >>>>>>>>>> > recover. >>>>>>>>>> > >>>>>>>>>> > Any information on this would be appreciated. >>>>>>>>>> > >>>>>>>>>> > Mike O. >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
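The spot check of file security described in the post above can be made a little more systematic. What follows is only a minimal Python sketch, not something anyone in this thread used: it draws a uniform random sample of files under a folder so that their permissions can be reviewed by hand or with whatever ACL tool is available. The function name `sample_paths` is hypothetical. Folders that cannot be listed are collected separately, on the assumption that a folder whose security was reset to a locked-down default may show up as unreadable.

```python
import os
import random


def sample_paths(root, k, seed=None):
    """Return (sample, unreadable).

    sample     -- up to k file paths chosen uniformly at random under root
    unreadable -- OSError objects for directories that could not be listed
    """
    rng = random.Random(seed)
    chosen = []
    unreadable = []
    seen = 0
    # Reservoir sampling: a single pass over the tree with memory
    # proportional to k, which matters on a volume with millions of files.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=unreadable.append):
        for name in filenames:
            seen += 1
            path = os.path.join(dirpath, name)
            if len(chosen) < k:
                chosen.append(path)
            else:
                # Keep the new path with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    chosen[j] = path
    return chosen, unreadable
```

A call such as `picked, unreadable = sample_paths(r"E:\Shares", 200)` (the folder name is made up for illustration) would return 200 paths to inspect plus any folders the walk was denied access to.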
Guest Mathieu CHATEAU Posted October 1, 2007 Re: Correcting a corrupted $MFT on a shared clustered disk
We had this issue even though we did not have 4 million files. The fix only prevents the problem from coming back; once it has happened, you still have to correct it. PSS provided us with tools and methods to correct it.
-- Regards, Mathieu CHATEAU English blog: http://lordoftheping.blogspot.com French blog: http://www.lotp.fr
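Correcting reset security afterwards depends on knowing which files were touched, and earlier in the thread the chkdsk messages were lost because the output was not redirected. As a hedged illustration only, the Python sketch below runs an arbitrary command, echoes each output line to the console, and writes the same lines to a log file. The helper name `run_and_log` and the log path in the usage note are hypothetical, and the sketch assumes the command does not stop to wait for interactive input, since a prompt without a trailing newline might not be displayed promptly.

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run command, echo its output line by line, and save it to log_path.

    Returns the exit code of the process.
    """
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages in the same log
            text=True,
            errors="replace",          # do not fail on odd console characters
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            log.flush()  # keep the log current if the run is interrupted
        return proc.wait()
```

For the case discussed here the call might look like `run_and_log(["chkdsk", "E:", "/F"], r"C:\chkdsk_E.log")`, where the log location is an assumption and the command line should follow whatever procedure applies to the clustered disk.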
Guest Mike O Posted October 2, 2007 Re: Correcting a corrupted $MFT on a shared clustered disk
I'm not sure I understand. Once chkdsk has reset the security to the default, how do you "correct" it? And how do you prevent it from coming back? Is there a hotfix or KB article available? Can you give me some idea of the tools that you used?
Unfortunately, we don't have a maintenance agreement with Microsoft at this time. I work for a local government, and apparently Microsoft won't agree to some terms in the city purchasing requirements. I can get the hotfixes they release, or request them through the web site, but for support questions we have a limited number of support calls with a third-party company, so I usually save them for something major (for example, if chkdsk hadn't fixed the corrupt disk).
Mike O.
>>>>>>>>>>>> > It >>>>>>>>>>>> > still responded to low level stuff like PING, users were >>>>>>>>>>>> > getting errors on >>>>>>>>>>>> > the server. Logging in gave a blank screen. This has >>>>>>>>>>>> > happened a couple >>>>>>>>>>>> > of >>>>>>>>>>>> > times before (that's a separate issue we're looking into). >>>>>>>>>>>> > >>>>>>>>>>>> > We went to the inactive node and did a "move group" in the >>>>>>>>>>>> > cluster >>>>>>>>>>>> > administrator. We've done this before for various reasons >>>>>>>>>>>> > with no >>>>>>>>>>>> > problems, >>>>>>>>>>>> > it usually takes about 20 seconds to bring the resources up >>>>>>>>>>>> > on the other >>>>>>>>>>>> > node. >>>>>>>>>>>> > >>>>>>>>>>>> > This time when the resources came on line on the 2nd node, we >>>>>>>>>>>> > started >>>>>>>>>>>> > getting an application popup that "Windows - Corrupt File : >>>>>>>>>>>> > The file or >>>>>>>>>>>> > directory E:\$Mft is corrupt and unreadable. Please run the >>>>>>>>>>>> > Chkdsk >>>>>>>>>>>> > utility." >>>>>>>>>>>> > The drive seems to be running OK with users accessing the >>>>>>>>>>>> > information >>>>>>>>>>>> > normally. I did some research and it appears that Windows >>>>>>>>>>>> > will use the >>>>>>>>>>>> > duplicate copy of the MFT if the primary one is corrupted. >>>>>>>>>>>> > >>>>>>>>>>>> > I know we need to run CHKDSK soon, but unfortunately, running >>>>>>>>>>>> > chkdsk and >>>>>>>>>>>> > taking the drive off line for several hours is not something >>>>>>>>>>>> > we can do >>>>>>>>>>>> > during daytime hours. If necessary we could run it >>>>>>>>>>>> > overnight, but with >>>>>>>>>>>> > that >>>>>>>>>>>> > size of drive I don't know if it would finish by the next >>>>>>>>>>>> > morning. 
>>>>>>>>>>>> > >>>>>>>>>>>> > The server has dual fiber connections (we're using the EMC >>>>>>>>>>>> > Powerpath >>>>>>>>>>>> > software for SAN failover), and we didn't have anything >>>>>>>>>>>> > happen with the >>>>>>>>>>>> > SAN >>>>>>>>>>>> > at that time, so based on the timing I'm assuming the MFT >>>>>>>>>>>> > corruption was >>>>>>>>>>>> > related to the cluster failover, not a physical hardware >>>>>>>>>>>> > issue, so I >>>>>>>>>>>> > wasn't >>>>>>>>>>>> > planning on running the sector scan. I would imagine a >>>>>>>>>>>> > sector scan on a >>>>>>>>>>>> > 1.5TB "disk" would run for a while… >>>>>>>>>>>> > At this point I'm planning on running CHKDSK over the >>>>>>>>>>>> > weekend. I've never >>>>>>>>>>>> > run it on a clustered disk before and I'm looking for some >>>>>>>>>>>> > information >>>>>>>>>>>> > about >>>>>>>>>>>> > it. I've read Microsoft KB176970 and KB903650, but frankly >>>>>>>>>>>> > they're a >>>>>>>>>>>> > little >>>>>>>>>>>> > confusing with the issues about "maintenance mode". >>>>>>>>>>>> > >>>>>>>>>>>> > Also, is my understanding about the mirrored/secondary MFT >>>>>>>>>>>> > valid? Since >>>>>>>>>>>> > users appear to be getting information correctly can the >>>>>>>>>>>> > CHKDSK wait until >>>>>>>>>>>> > the weekend?. Our backup policy does a full backup each week >>>>>>>>>>>> > and an >>>>>>>>>>>> > incremental daily, so if something really bad happens we >>>>>>>>>>>> > should be able to >>>>>>>>>>>> > recover. >>>>>>>>>>>> > >>>>>>>>>>>> > Any information on this would be appreciated. >>>>>>>>>>>> > >>>>>>>>>>>> > Mike O. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
Guest Mathieu CHATEAU
Posted October 2, 2007

Re: Correcting a corrupted $MFT on a shared clustered disk

I posted our chkdsk experience here:
http://lordoftheping.blogspot.com/2006/08/chkdsk-sd-resolved.html

We ended up restoring a backup of the folders known to have problems. The cool thing is that the PSS tool gave us the mapping between the file numbers reported by chkdsk and each file's name and location, so we only had to restore part of our data.

--
Cordialement,
Mathieu CHATEAU
English blog: http://lordoftheping.blogspot.com
French blog: http://www.lotp.fr

"Mike O" <put_the_spam@the.can> wrote in message news:O494TvJBIHA.1208@TK2MSFTNGP05.phx.gbl...
>
> "Mathieu CHATEAU" <gollum123@free.fr> wrote in message news:O2uGzG$AIHA.3900@TK2MSFTNGP02.phx.gbl...
>> We had this issue without having 4 million files. The fix only prevents it from coming back; once the problem is there, you have to correct it. PSS provided us with tools and methods to correct it.
>>
>> --
>> Cordialement,
>> Mathieu CHATEAU
>> English blog: http://lordoftheping.blogspot.com
>> French blog: http://www.lotp.fr
>
> I'm not sure I understand. Once chkdsk has reset the security to the default, how do you "correct" it? And how do you prevent it from coming back? Is there a hotfix or KB article available? Can you give me some idea of the tools that you used?
>
> Unfortunately we don't have a maintenance agreement with Microsoft at this time. I work for a local government, and apparently Microsoft won't agree to some terms in the city purchasing requirements. I can get the hotfixes they release, or request them through the web site, but for support questions we have some support calls with a 3rd party company. We're limited in the number of calls we can use, so I usually save them for something major (like if chkdsk didn't fix the corrupt disk).
>
> Mike O.
>
>> "Mike O" <put_the_spam@the.can> wrote in message news:OH3RKS9AIHA.4160@TK2MSFTNGP06.phx.gbl...
>>>
>>> "Mathieu CHATEAU" <gollum123@free.fr> wrote in message news:%234kEXX3AIHA.2004@TK2MSFTNGP06.phx.gbl...
>>>> I already got the "resetting security id to default". If it's the same problem, it may go bad.
>>>> Do you more than 2 Millions of files ?
>>>
>>> Do you mean do we have more than 2 million files? If so, then yes. I did see an issue related to the "security id" and chkdsk if you have over 4 million, but the hotfix is a few years old and the versions of the system DLLs on the server are later than the ones in the hotfix, so it appears that the fix is already there. After seeing the messages, I went and spot checked random places on the system. Everything I checked seemed to have the correct security on it. Besides, I still think this was a better option than telling everyone that the data was completely unavailable until we do a restore.
>>>
>>> Also, from the info I've found, at least it sets the security to locked down, not wide open. Besides, having to fix them may have some benefits. Generally we don't give out full rights to modify security to users, because they almost always end up locking out the system admins, backup system, etc. The most users get is "modify". Since the files on this server are a consolidation of about 30 servers from 10 different departments, file security is not set very consistently. There are a lot of individuals set on each file, inheritance is not set up efficiently, etc. Cleaning up file security has been on our list of things to do for quite a while; I guess this may force the issue.
>>>
>>> Of course, after our help desk (and us in the server group) gets swamped tomorrow with phone calls I might feel differently.
>>>
>>> I may check and see if we can do a restore anyway and tell it not to overwrite anything newer. This way the existing files will get their security reset.
>>>
>>> Mike O.
>>>
>>>> --
>>>> Cordialement,
>>>> Mathieu CHATEAU
>>>> English blog: http://lordoftheping.blogspot.com
>>>> French blog: http://www.lotp.fr
>>>>
>>>> "Mike O" <put_the_spam@the.can> wrote in message news:OXvnQ92AIHA.5752@TK2MSFTNGP02.phx.gbl...
>>>>> It finished around 1:00am. So it took 7-1/2 hours for the first 1%, then 90 minutes for the other 99%.
>>>>>
>>>>> It showed what looked like a couple of hundred "resetting security id to default" messages, but I've spot checked the drive and don't see anything out of the ordinary. Unfortunately I didn't redirect the output, and it doesn't look like chkdsk logs the errors (other than the event log entry), and that's not long enough to be useful.
>>>>>
>>>>> I was thinking about the restore option, and if it was still running this morning, I was going to start investigating that option.
>>>>>
>>>>> Thanks for all the comments.
>>>>>
>>>>> "Mathieu CHATEAU" <gollum123@free.fr> wrote in message news:uWMYOe0AIHA.5960@TK2MSFTNGP05.phx.gbl...
>>>>>> Keep going!
>>>>>>
>>>>>> Otherwise, can you format the drive and restore from backup? It may go faster, depending on your backup storage and the type of files (the bigger the better).
>>>>>>
>>>>>> --
>>>>>> Cordialement,
>>>>>> Mathieu CHATEAU
>>>>>> English blog: http://lordoftheping.blogspot.com
>>>>>> French blog: http://www.lotp.fr
>>>>>>
>>>>>> "Mike O" <put_the_spam@the.can> wrote in message news:uCaiwtxAIHA.4612@TK2MSFTNGP03.phx.gbl...
>>>>>>> It's gone from 1% to 76% in the last hour. As I'm watching it, it now seems to be moving about 1% every two minutes...
>>>>>>>
>>>>>>> The optimism is starting to slowly come back.
>>>>>>>
>>>>>>> "Mike O" <put_the_spam@the.can> wrote in message news:OE%23FbKxAIHA.4232@TK2MSFTNGP04.phx.gbl...
>>>>>>>> This is not looking good at all. It just jumped all the way to 1%. It's been running phase 2 since about 4:00pm today. 7-1/2 hours for 1% isn't a good sign.
>>>>>>>>
>>>>>>>> "Mike O" <put_the_spam@the.can> wrote in message news:u5mYq7vAIHA.4984@TK2MSFTNGP06.phx.gbl...
>>>>>>>>> It's not looking too good at the moment. I started it about 3:15 this afternoon. Phase 1 went through pretty fast; it found and corrected the 60 or so corrupted attribute & orphaned records that the read-only chkdsk passes were detecting.
>>>>>>>>>
>>>>>>>>> However, it started phase 2 around 4:00, and now at 9:00 it's still at 0 percent... I seem to remember that the stage 2 steps go in 10% increments (at least I hope so!), and I know that this stage isn't linear, and that it might move erratically, but I was hoping to see something besides 0 by now. According to Task Manager the chkdsk is running; it shows the process running about 30+ CPU time.
>>>>>>>>>
>>>>>>>>> Assuming it doesn't finish by the end of our maintenance window, do you know if there would be any problems cancelling the process? I know it wouldn't fix the main problem, but at least we could get the system up and running until another time (or relocate the data to another drive).
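On the point in the quoted messages about not having redirected the chkdsk output: below is a minimal sketch, assuming Python is available on the node, of a small wrapper that echoes a command's output to the console while also appending it to a log file, so messages such as "resetting security id to default" are kept for later review. The function name `run_and_log` and the log file layout are illustrative assumptions, not something from this thread or from Microsoft's documentation.

```python
import subprocess
import sys
from datetime import datetime


def run_and_log(cmd, log_path):
    """Run cmd, echo its output to the console, and append it to log_path.

    Output is copied one character at a time so that prompts without a
    trailing newline still show up on screen. Returns the exit code.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        # Header line so repeated runs can be told apart in the same file.
        log.write(f"=== {datetime.now().isoformat()} {' '.join(cmd)}\n")
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # errors go into the same log
            text=True,
            errors="replace",  # tolerate undecodable console bytes
        )
        while True:
            ch = proc.stdout.read(1)
            if not ch:  # empty string means the command closed its output
                break
            sys.stdout.write(ch)
            sys.stdout.flush()
            log.write(ch)
            if ch == "\n":
                log.flush()  # keep the log current if the run is interrupted
        return proc.wait()
```

As a hypothetical usage, `run_and_log(["chkdsk", "E:"], r"C:\logs\chkdsk_E.log")` would capture a read-only pass; the drive letter and log path are placeholders. Whether chkdsk behaves the same with its output piped on a clustered disk is not something this thread confirms, so it would be worth trying on a test cluster first.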