Report: everything up to the restoration of a DELL server that suddenly went down



The other day, a server that had been running without trouble for two weeks suddenly froze and became completely inaccessible. I hurried to the server room and found an error indicated on one of the hard disks.

This server uses two SAS hard disks in RAID 1 (mirroring), so the system should not freeze even if one hard disk fails. However, since the server did in fact freeze, I decided to investigate the cause and restore it.

The full story up to the server's restoration is as follows.
First, the server configuration looks like this.
· Enclosure: DELL PowerEdge T300
· RAID controller: SAS 6/iR Adapter
· OS: CentOS 5.2

As shown here, the lamp on the hard disk cycles through green, orange, and off. This is clearly not a normal lighting pattern.


The server was completely frozen and I could not do anything, so I forced a reboot, after which it resumed normal operation for the time being.
However, the hard disk error indication did not go away, so I decided to contact the DELL support center.

· Support center
When I described the symptoms to DELL's support center, I was told that this lamp pattern is only a warning that the HDD is likely to fail soon, and that the HDD has not actually failed yet. When I then mentioned that the OS had frozen, they said that this does occasionally happen and that the HDD is still the most likely cause. If the system stops on a mere warning even though the disk has not failed, RAID 1 seems pointless, and one could even argue it would be safer not to use RAID 1 at all.... Anyway, I requested an HDD repair and was told that the hard disk could be replaced the next day.

After some research, I found that with DELL's official tool "OpenManage Server Administrator" it is possible to replace the HDD without stopping the server (you can also check whether the rebuild is proceeding normally, etc.), so I decided to install it immediately.

· Installation of "OpenManage Server Administrator"

Repository / OMSA - Dell Linux Wiki

First, register the repository.
wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash

Next, install the package.
yum install srvadmin-all

Installation is complete.

Next, start the following services.
service instsvcdrv start
service dsm_sa_ipmi start
service mptctl start
service dsm_om_connsvc start
service dsm_om_shrsvc start
service dataeng start
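
The six commands above can also be wrapped in a small loop. This is only a sketch: the function name `start_omsa_services` and the dry-run argument are my own additions for illustration, not part of OpenManage, and the service names are simply the ones listed above.

```shell
# Services from the list above, in the same order.
OMSA_SERVICES="instsvcdrv dsm_sa_ipmi mptctl dsm_om_connsvc dsm_om_shrsvc dataeng"

# start_omsa_services [runner]
# runner defaults to "service"; pass "echo service" for a dry run that
# only prints the commands instead of executing them.
# (Hypothetical helper, written for sh/bash.)
start_omsa_services() {
    runner=${1:-service}
    for svc in $OMSA_SERVICES; do
        # $runner is intentionally unquoted so "echo service" splits into two words.
        $runner "$svc" start || { echo "failed to start $svc" >&2; return 1; }
    done
}
```

Running `start_omsa_services "echo service"` first shows what would be executed; running `start_omsa_services` as root then starts the services and stops at the first failure.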

Finally, connect via the URL below. If a screen like this is displayed, everything is working.
https://<domain or IP address>:1311/
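
If no browser is at hand, the same check can be roughly approximated from a shell. The sketch below assumes `curl` is installed and that the web interface presents a certificate the client does not trust (hence `-k`); the helper names `omsa_url` and `omsa_reachable` are hypothetical.

```shell
# omsa_url HOST [PORT]
# Prints the web interface URL; the port defaults to 1311 as in the URL above.
omsa_url() {
    printf 'https://%s:%s/\n' "$1" "${2:-1311}"
}

# omsa_reachable HOST [PORT]
# Succeeds if an HTTPS request to the URL completes within 10 seconds.
# -k skips certificate verification (assumption: untrusted certificate),
# -s silences progress output, -o /dev/null discards the page body.
omsa_reachable() {
    curl -k -s -o /dev/null --max-time 10 "$(omsa_url "$@")"
}
```

For example, `omsa_reachable 192.0.2.10 && echo "web interface is up"`, where 192.0.2.10 is a placeholder address.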

The login screen looks like this. Enter your user name and password.


After logging in, a caution mark appears on "Storage". Basically, if you follow the caution marks you will reach the location of the error. Start by selecting "Storage".


Select "SAS 6/iR Adapter"


Select "connector 0 (RAID)"


Select "Enclosure (Backplane)"


Select "Physical Disks"


You can see that an error has occurred on "Physical Disk 0:0:1"
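
The same information should also be available from the command line through the `omreport` tool that comes with OpenManage Server Administrator. The filter below is a rough sketch, not something from my actual session: it assumes the physical disk report prints one "key : value" pair per line with an "ID" line preceding each disk's "Status" line, and that a healthy disk reports the status "Ok". The function name `flag_bad_pdisks` is hypothetical.

```shell
# flag_bad_pdisks
# Reads a physical disk report on stdin and prints "ID STATUS" for every
# disk whose Status is not "Ok" (assumed healthy value).
flag_bad_pdisks() {
    awk '
        {
            pos = index($0, ":")              # split at the first colon only,
            if (pos == 0) next                # so IDs such as 0:0:1 stay intact
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)  # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "ID") id = val
            else if (key == "Status" && val != "Ok") print id, val
        }
    '
}
```

Assuming the adapter is controller 0, it would be used as `omreport storage pdisk controller=0 | flag_bad_pdisks`.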



· HDD replacement

First, remove the HDD that is showing the warning.

When the disk is pulled out, the front LED switches to the error display.


In a video, it looks like this.


The disk has been removed. The status changed from a warning mark to a full error.


Then, when you insert the new HDD, the rebuild starts automatically.

The LED goes from orange to green.


Rebuilding. After this, all I can do is hope that the rebuild finishes normally.



The rebuild is complete. The display is back to normal.


Of course, the system as a whole is also back to normal.


The disk lamp has also turned green to indicate normal operation. Restoration is complete.



Finally, here is how to check the alert log. Select the "Logs" tab, then "Alert".


This is the HDD warning log.


This is the log from when the hard disk was removed.


In addition, if you have a monitoring server with "IT Assistant" installed, it can also notify you of errors, warnings, and so on by e-mail.

Still, I really wish a RAID 1 system would not freeze just because one of the HDDs issued a warning.

in Review,   Software,   Hardware,   Video, Posted by darkhorse_log