How to troubleshoot an unexpected Linux server reboot


There could be many different reasons for an unexpected reboot of a Linux server. Here is one step-by-step troubleshooting case:

  1. Check the server reboot time with the last reboot command.

    The simplest way to find the last reboot time in Linux is the last reboot command. Open a terminal and run last reboot. It lists every boot recorded since the wtmp log file was created. To show only the most recent entries, run last reboot | head -2.

    Example:

    last reboot | head -2
    reboot   system boot  4.18.0-348.20.1. Mon Aug 29 08:43   still running
    reboot   system boot  4.18.0-348.20.1. Mon Aug 29 08:27 - 08:38  (00:11)

    In this example, the previous boot session ended at Mon Aug 29 08:38 and the server came back up at 08:43, so the log entries around 08:38 are the ones to examine.

  2. Check the system logs around the reboot time for error messages or other clues to the cause.

    The system logs are in the /var/log directory. Common log files to check include messages, syslog, dmesg, and kern.log; which of them exist depends on the distribution.

    Open the log file with vi and search for the time of the reboot.
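
Steps 1 and 2 above can also be partly scripted. The following is a minimal Python sketch, not a definitive tool. It assumes that each log line starts with a classic syslog timestamp such as "Aug 29 08:37:59" (which has no year, so the year is taken from the reboot time you pass in) and that month names are in English. The function names parse_syslog_time and lines_near and the five-minute default window are illustrative choices, not part of any standard utility.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Return the timestamp at the start of a classic syslog line, or None."""
    try:
        # Classic syslog lines begin with e.g. "Aug 29 08:37:59" (15 characters).
        # The year is not in the log, so the caller supplies it (an assumption).
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        # Not a timestamped line (continuation line, different log format, ...).
        return None


def lines_near(lines, when, minutes=5):
    """Keep the lines whose timestamp lies within `minutes` of `when`."""
    window = timedelta(minutes=minutes)
    kept = []
    for line in lines:
        stamp = parse_syslog_time(line, when.year)
        if stamp is not None and abs(stamp - when) <= window:
            kept.append(line)
    return kept
```

To use it, read the time at which the previous boot session ended from the last reboot output, build a datetime from it (the year has to be supplied by hand), and pass an open log file such as /var/log/messages as lines. Lines without a recognizable timestamp are skipped, so logs in other formats need a different parser.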

Here are three common causes of an unexpected server reboot.

  1. Check if the server experienced a power outage or other hardware failure that might have caused the reboot.
    This can be done by checking the system logs for messages related to power failure or hardware issues, or by inspecting the server hardware for any visible signs of damage.

  2. Check if the server was rebooted due to a kernel panic or other system error. This can be done by checking the system logs for messages related to kernel panics or other kernel errors.
    A kernel panic occurs when the Linux kernel hits an error so serious that it cannot safely continue. The kernel stops executing and prints an error message describing the failure. Whether the machine then halts or reboots by itself depends on the kernel.panic sysctl setting: a positive value is the number of seconds the kernel waits before rebooting.

    Here are some examples of kernel error messages that may appear in the logs before or during a panic:

    1. "BUG: unable to handle kernel NULL pointer dereference" - Kernel code tried to access memory through a NULL pointer and could not continue.

    2. "BUG: bad page state in process" - The kernel's memory management found a memory page in an inconsistent state. This often points to memory corruption, a driver bug, or faulty RAM.

    3. "BUG: soft lockup - CPU#N stuck for Xs!" - The kernel detected that CPU number N has been running kernel code for X seconds without letting other tasks run, typically because it is stuck in a loop.

    4. "BUG: unable to handle kernel paging request" - Kernel code accessed a virtual memory address that is not mapped to valid memory, so the page fault could not be resolved. It usually indicates a bad pointer or corrupted memory rather than a problem with swap.

    5. "BUG: scheduling while atomic" - Kernel code tried to sleep or yield the CPU while in an atomic context, for example while holding a spinlock, where rescheduling is not allowed. This is usually a kernel or driver bug.

  3. Check if the server was rebooted due to a software issue, such as a bug in an application or system component.

If the cause of the unexpected reboot cannot be determined, gather more information by collecting system logs and other diagnostic data, and consult other system administrators or technical support if needed.
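
The kernel messages listed under cause 2 can be searched for automatically instead of by eye. The following Python sketch is one way to do it, under stated assumptions: the patterns are built from the five messages above plus the "Kernel panic - not syncing" line that the kernel prints on a panic, the exact wording of these messages differs between kernel versions (so the patterns are kept loose and may still miss some variants), and the names SIGNATURES and find_signatures are hypothetical.

```python
import re

# Loose patterns for the kernel messages discussed above. The CPU number and
# the seconds in the soft-lockup message vary, so digits are matched generically.
SIGNATURES = {
    "NULL pointer dereference": re.compile(r"BUG: .*NULL pointer dereference"),
    "bad page state": re.compile(r"BUG: [Bb]ad page state in process"),
    "soft lockup": re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s!"),
    "paging request": re.compile(r"BUG: unable to handle kernel paging request"),
    "scheduling while atomic": re.compile(r"BUG: scheduling while atomic"),
    "kernel panic": re.compile(r"Kernel panic - not syncing"),
}


def find_signatures(lines):
    """Return (line_number, label, line) for every line matching a signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
    return hits
```

Pass it an open log file, for example one of the files under /var/log mentioned earlier, and print the returned tuples. An empty result does not rule out a panic: if the kernel could not write to disk before going down, the message may only have appeared on the console.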