When developing programs under Linux, have you ever run into an “Oops” message caused by a minor fault in the system? How do you troubleshoot it? By reading the code line by line? It does not have to be that complicated. This article introduces an efficient troubleshooting method for Linux programming.
Before analyzing an Oops, let’s look at an example that uses a GPIO interrupt for power-down detection. Based on the driver framework in “Embedded Linux Development Course Part II”, the design is as shown in the following block diagram:
The flow this framework was designed for is: application startup -> program initialization -> application opens the device -> wait for an interrupt event. In real projects, however, unpredictable things often happen. Suppose Xiao Wang is tuning a Qt application and finds that Lao Wang’s process keeps printing messages, so he stops Lao Wang’s process from starting automatically. After that, an Oops shows up from time to time. Following the habitual reasoning that “it did not appear before and it appears after the new addition, so the cause must be in the new code”, Xiao Wang assumes the newly added Qt application is to blame and keeps testing and hunting for bugs… and ten years go by.
The real cause is that, in Xiao Wang’s setup, the device is never opened, so the driver layer never initializes the timer queue, and the queue that the interrupt handler schedules with a 50 ms delay is a null pointer. When a pointer is null, the Linux kernel will of course say “Oops”. The message is only occasional because the power connector is loose from time to time: the GPIO detects that power is lost, and the interrupt is triggered.
Such cases are very common. The intended flow is A->B->C, but the actual usage is A->D->C, or a variable in the driver was never initialized, and so on. In these situations, analyzing the Oops helps solve the problem quickly. Next, we will use a standard Linux driver to trigger an Oops. That’s right: such an exception exists in the standard Linux kernel source, and we can fix it as well.
We use the EasyARM-iMX283 development board; the kernel source is Linux-2.6.35.3.tar.bz2 on the CD-ROM, which also describes how to compile it. The LCD backlight driver needs to be built as a loadable module (ko).
After flashing the new kernel, loading the newly compiled drivers/video/backlight/mxs_bl.ko file produces the following Oops message:
At first glance this looks like gibberish, but if you analyze it layer by layer, you will find that it has already told us the cause of the error. Let’s begin the Oops analysis.
1. Main error message
This states the type of error. Here it means that a null pointer was used.
2. Operation entrance
This states the operation that caused the error. Here the error occurred while loading the mxs_bl module, which corresponds to the command insmod mxs_bl.ko.
3. PC pointer
This gives the position of the PC when the error occurred. The PC (program counter) holds the address of the instruction currently being executed. Here it shows that the faulting function is regulator_set_current_limit, at offset 0xc from the function entry.
4. LR pointer
This gives the position of the LR when the error occurred. The LR (link register) holds the return address, which identifies the calling function and the offset within it. Here the caller is set_bl_intensity at offset 0xd8; that is, the error occurred when set_bl_intensity called regulator_set_current_limit.
5. Register value
This records the value of each register when the error occurred. Readers who are familiar with assembly can study this information.
6. Error process information
This gives the process ID and process name of the faulting process. Here the process is insmod and the PID is 2261. In a multitasking system, several processes may call the same interface, so this identifies which one hit the error.
7. Stack information when error occurs
This shows the register contents saved on the stack when the error occurred. When a program is interrupted or calls a subroutine, it pushes its running environment onto the stack so that the environment is unchanged after returning from the interrupt or subroutine.
The stack information records the environment of the running program. Many LR addresses can be found in it, which lets us work out the function call relationship; in that respect it serves a similar purpose to the next item.
8. The backtracking relationship of function execution
This shows the calling relationship between functions. From it we can reconstruct the whole execution path and see which function called which. The resulting execution flow is as follows:
Here we can see the familiar init and probe functions and learn which operation under the probe function went wrong. We now know the execution flow of the code and the PC value at the point of failure, but we still cannot see the code itself: at the faulting address we only have a string of numbers. Next, we turn the PC value into meaningful code.
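The PC and LR items above can also be pulled out of a saved Oops log mechanically. The following is a minimal sketch, not a complete Oops parser: it assumes the lines look like the “PC is at regulator_set_current_limit+0xc” form quoted in this article, optionally followed by a “/0x…” function size, and that the LR line has the same shape. The names LOCATION_RE, parse_locations and SAMPLE are made up for this illustration.

```python
import re

# Matches e.g. "PC is at regulator_set_current_limit+0xc".
# An optional "/0x..." suffix (function size) is tolerated but not required.
# Assumption: the LR line uses the same "<reg> is at <symbol>+0x<offset>" shape.
LOCATION_RE = re.compile(
    r"\b(?P<reg>PC|LR) is at "
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)"
    r"(?:/0x[0-9a-fA-F]+)?"
)


def parse_locations(oops_text):
    """Return {"PC": (symbol, offset), "LR": (symbol, offset)} for the lines found."""
    found = {}
    for line in oops_text.splitlines():
        match = LOCATION_RE.search(line)
        if match:
            offset = int(match.group("offset"), 16)
            found[match.group("reg")] = (match.group("symbol"), offset)
    return found


# Illustrative input built from the symbols and offsets named in the article;
# it is not a copy of the real log.
SAMPLE = """\
PC is at regulator_set_current_limit+0xc
LR is at set_bl_intensity+0xd8
"""

if __name__ == "__main__":
    for reg, (symbol, offset) in parse_locations(SAMPLE).items():
        print(f"{reg}: {symbol} + {offset:#x}")
```

Reading the two entries together gives the same conclusion as items 3 and 4: the fault is inside regulator_set_current_limit, which was called from set_bl_intensity.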
The first step is to identify where the error code is
The binaries involved in this experiment are the kernel firmware image that was flashed and the driver’s ko file, so the first step is to determine whether the faulting code is in the kernel image or in the ko file.
First, get the address range of the kernel code by disassembling the kernel with the following command.
The format of the resulting file is as follows:
The first column is the line number, the second is the run-time address, the third is the machine code, and the fourth is the assembly code. Because the second column is the run-time address, it equals the value of the PC when execution reaches that line. So by looking at the head and tail of this file, we can see that the PC range of the kernel code is c0008000~c0562338.
According to the register values in item 5 above, the PC was c02f1878 when the error occurred, which is within the kernel code range.
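The range check above is simple arithmetic, and a small helper makes it repeatable for other Oops reports. This is a sketch under the assumption that the first and last addresses of the disassembly bound the kernel image, as described above; the function names are invented here, and the out-of-range address in the usage note is a hypothetical value chosen only to show the other branch.

```python
# Address range read from the head and tail of the kernel disassembly
# (values taken from the article).
KERNEL_TEXT_START = 0xC0008000
KERNEL_TEXT_END = 0xC0562338


def in_kernel_image(pc, start=KERNEL_TEXT_START, end=KERNEL_TEXT_END):
    """True if pc lies inside the disassembled kernel image."""
    return start <= pc <= end


def classify(pc):
    """Say where to look next: the kernel image or somewhere else (e.g. the ko file)."""
    if in_kernel_image(pc):
        return "kernel image"
    return "outside kernel image"


if __name__ == "__main__":
    faulting_pc = 0xC02F1878  # PC value reported in the register dump
    print(f"{faulting_pc:#x}: {classify(faulting_pc)}")
```

For the PC in this article the helper reports the kernel image; an address such as 0xbf000000 (hypothetical) would fall outside the range, which would point the search at the ko file instead.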
The second step is to analyze the error statement of the error function
Based on the PC information in item 3, look up the assembly code of regulator_set_current_limit, which is as follows:
The function entry address is c02f186c.
In item 3, the PC was reported as an offset from that entry: “PC is at regulator_set_current_limit+0xc”.
PC = 0xc02f1878 = 0xc02f186c + 0xc, which corresponds to the assembly code address.
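The same relationship can be used in the other direction: given a raw address, find the function that contains it and the offset inside it. The sketch below does this with a sorted symbol table and a binary search. Only the single entry shown comes from this article; in practice the table would be filled from the kernel’s symbol list (for example System.map), and the helper names are assumptions made for illustration.

```python
import bisect

# (entry address, function name) pairs, sorted by address.
# Only this one entry is taken from the article; a real table would hold
# every kernel symbol, otherwise any higher address maps to the last entry.
SYMBOLS = [
    (0xC02F186C, "regulator_set_current_limit"),
]


def resolve(addr, symbols=SYMBOLS):
    """Return (name, offset) of the nearest symbol at or below addr, or None."""
    addresses = [entry for entry, _ in symbols]
    index = bisect.bisect_right(addresses, addr) - 1
    if index < 0:
        return None  # addr is below the first known symbol
    entry, name = symbols[index]
    return name, addr - entry


def format_location(addr, symbols=SYMBOLS):
    """Format addr in the "symbol+0xoffset" style used by the Oops message."""
    hit = resolve(addr, symbols)
    if hit is None:
        return hex(addr)
    name, offset = hit
    return f"{name}+{offset:#x}"


if __name__ == "__main__":
    print(format_location(0xC02F1878))
```

With the values from this article, 0xc02f1878 resolves to regulator_set_current_limit with offset 0xc, matching the PC line of the Oops.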
The third step is to find the C language code of the error function
This step is arguably the most difficult, because the kernel source has many layers and there may be several copies of a function with the same name. Several of them may be compiled into the kernel (file-local functions declared static), or some may not be compiled in at all, so you have to work out which copy is the one actually running.
I use a small trick. First, insert garbage characters at the entry of every function with that name and let the compiler pick out the files that are actually compiled into the kernel (the garbage causes a compile error there). Then add print statements to the remaining candidates. Usually only two or three candidates are left after the first pass, and printing confirms which one it is.
The following is the C code identified this way.
At this point the function seems to be located, but for readers who are not familiar with assembly, the C code and the assembly still do not line up, and it may feel like a dead end. Don’t be discouraged. From the assembly listing above we know that a function name stands for the entry address of the function, and to call a sub-function the CPU must be told which sub-function to jump to. So how does assembly call a sub-function? With the bl instruction: bl followed by the sub-function name. With that in mind, look at the assembly code again.
Line 582734 above “bl c0493104
The result is then obvious. Merely declaring a variable cannot cause an error, so the only statement that can be wrong is struct regulator_dev *rdev = regulator->rdev. The left-hand side of that statement only declares rdev, so, combined with the kernel’s message (a null pointer), the fault must come from evaluating regulator->rdev. Because the load itself faults, it is the regulator pointer being dereferenced that is null.
The question ultimately comes down to why regulator is null when regulator->rdev is evaluated. Digging into that code and reasoning is beyond the scope of this article, so the author boldly speculates that it resembles the A->B->C pattern above: the resource must be assigned a value before it is used.
Now we bring back the function call chain from item 8. Since we know that the regulator argument passed to the last function in that chain is not valid, we need to look at the regulator structure and what it means. From its meaning we can work out how to assign a value to it.
Search the relevant source files for the keyword “regulator” or “regulator =” (the latter is recommended, because it matches the assignment statement) to find the following code.
Analyzing this function shows that the regulator is actually a member of pdata and needs data to initialize it. The rest is simple: find a place in the call chain where data can be passed into pdata. This function happens to be the one that initializes the regulator, so it can be used directly.
Adding this code at this position inside the probe function performs the assignment between mxsbl_probe and mxsbl_do_probe.
After recompiling, the ko file loads normally.