[vmips] TLB question
Cable Guy
donimus at gmail.com
Sun Jan 16 03:24:00 CST 2005
After rebuilding BusyBox and VMIPS roms at least a hundred times over
the last few weeks I think I'm very close now to getting Linux working
on VMIPS. Here's my setup:
Linux stock 2.4.28 kernel
uClibc 0.9.27 (built without debugging symbols)
busybox-1.00 (built with debugging symbols and linked statically)
gcc 3.0.4 MIPS EL cross-compiler with software floating point enabled
binutils 2.13
VMIPS 1.3
I am constrained to the 2.4 kernel because the router I'm playing with
has a 2.4 kernel on it, and to gcc 3.0.x because newer versions insist
on generating instructions the R3000 can't handle with regard to
atomicity. The software floating point toolchain is working as
evidenced by the fact that I was able to compile and link a C program
with a floating point variable in it and printf() its value to the
console. I was worried about that aspect for quite a while but I'm
satisfied now I have a handle on those issues and my test program
proves it.
After booting, Linux starts the busybox init program. That works.
That's how I ran my FP test program, and now it's set to run the shell
script /etc/init.d/rcS. That also works and I can echo output to the
console and see it. All that is set up via the /etc/inittab file
which is a busybox variation on the standard inittab. So far so good.
The next thing bb tries to do is start a shell (ash) on the console
(ttyS3). It starts to run. I see the welcome/version message. It
initializes up to a point, but when it calls the routine that writes
the very first prompt to the screen it crashes. At that point a whole
bunch of signal handling has been set up so the only indication I get
that there's a problem is that ash terminates silently and bb
faithfully tries to start it up again as per the "respawn" command I
have tied to it in the inittab file. The sequence repeats over and
over at the same point.
Busybox is statically linked so the problem most likely isn't related
to the well-known MIPS dynamic library loader problem. Here's an
excerpt of the linker map from busybox:
004000f0 __start
00400150 __do_global_dtors_aux
00400350 main
00400490 busybox_main
...
00440534 cmdedit_read_input
...
00467300 options
The "cmdedit_read_input" routine is the one that fails to execute.
This routine is the one that prints the shell prompt to the terminal
and reads a line of input from the keyboard. When I looked at this
with a remote gdb session (vmips -o debug ...) it told me that the
address 00440534 could not be read. In fact, the last address that
COULD be read was 0042FFFC. 00430000 and above weren't available.
I'm not sure if this is an artifact of the VMIPS gdb processing or
what, but when ash (busybox) tries to call the routine at 00440534
it's not there. Crash without any error messages, lather, rinse,
repeat.
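
For reference, the addresses above can be turned into page numbers with
a little arithmetic. This is only a sketch: it assumes the R3000's fixed
4 KB page size and guesses from the linker map (__start at 004000f0)
that the image is loaded at 0x00400000. The helper names are made up
for illustration.

```python
PAGE_SIZE = 4096          # R3000 pages are 4 KB
TEXT_BASE = 0x00400000    # assumption: load base inferred from __start at 0x004000f0


def vpn(addr):
    """Virtual page number: the address with the 12-bit page offset dropped."""
    return addr >> 12


def pages_from_base(addr, base=TEXT_BASE):
    """Index of the page holding addr, counted from the assumed load base."""
    return (addr - base) // PAGE_SIZE


last_readable = 0x0042FFFC   # last address gdb could read
failing = 0x00440534         # cmdedit_read_input

for name, addr in (("last readable", last_readable), ("failing", failing)):
    print(name, hex(addr), "vpn", hex(vpn(addr)), "page index", pages_from_base(addr))
```

Under those assumptions the readable region ends on page index 47
(0x30000 bytes in total) while cmdedit_read_input sits on page index
64, so the failing routine is well past the readable region rather than
just over its edge.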
This could be due to any number of problems. There is a known problem
that has to do with how Linux's "mmap" is implemented on a MIPS
processor. I've tried it both with and without the patch for that
particular problem but the result is the same. Apparently only about
the first 0x30000 bytes of the static busybox ELF executable (48 pages
at the R3000's 4 KB page size, if the image starts at 00400000) are
being read into memory. Or else the paging/memory management is
screwed up somewhere.
I made a small mod to VMIPS in the CPU::exception() method to log
whenever a user TLB miss occurs. By the time the problem arises
there are already 1200 entries in the log. Near the end of the log
there's indication of a user TLB miss with EPC = 0x00440534 followed
immediately by another one with EPC = 0x00440538 (the very next
instruction). The 2 instructions at the start of the routine are an
"lui" and an "addiu", fairly standard stuff for the start of a
function. I can see where a page fault (TLB miss) would put the first
address in the EPC because that's where the program should restart,
but the fact that the second instruction is also causing a TLB miss
seems strange. Maybe it's normal, I don't know. I'm not completely
versed in the ins and outs of the R3000's TLB logic. Nor Linux's
memory management for that matter. Still it seems like the very next
instruction would only cause a TLB miss if the page holding the first
instruction never actually got paged in in the first place. Is that
reasoning off the mark?
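
One way to check that reasoning against the whole log, rather than just
its tail, is to look for back-to-back misses that land on the same 4 KB
page. The sketch below assumes the logged EPC values have already been
collected into a list of integers; the function name and the list are
hypothetical, and the log format of the VMIPS mod is not shown here.

```python
def same_page_repeats(epcs, page_shift=12):
    """Return (previous, current) pairs of consecutive misses on one page."""
    repeats = []
    for prev, cur in zip(epcs, epcs[1:]):
        if prev >> page_shift == cur >> page_shift:
            repeats.append((prev, cur))
    return repeats


# Tail of the log as described above (EPC values only).
log_tail = [0x00440534, 0x00440538, 0x800559DC]

for prev, cur in same_page_repeats(log_tail):
    print(f"{prev:#010x} and {cur:#010x} share page {prev >> 12:#x}")
```

If earlier entries in the log never show this pattern, the pair at
0x00440534/0x00440538 is the odd one out. Logging BadVAddr next to EPC
would also show which address actually missed, since EPC only names the
instruction to restart.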
And finally, the very next entry in the user TLB miss log has an EPC
of 0x800559dc which, according to my kernel map, is in kernel routine
"sys_wait4" which is the equivalent of "waitpid()". The busybox init
process does issue waitpids for forked processes like ash so this is
probably just a normal indication that the child ash process has
ended. Still, it seems strange that a USER TLB Miss would have a
kernel EPC. I guess that's "legal". Whatever the reason for the
crash, it happens silently. I guess this is because the fault is
handled by signal processing that has been set up. What a nightmare
that code is between kernel and user space.
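
On the kernel EPC: as far as I understand the R3000, the user TLB miss
vector is taken for a miss on any kuseg address (below 0x80000000),
whatever mode the CPU is in at the time. So kernel code that touches a
user pointer, as sys_wait4 would when storing the exit status, can
produce a "user" TLB miss with a kernel EPC. A small sketch for sorting
logged addresses into the standard MIPS32 segments (the function name
is made up for illustration):

```python
def mips32_segment(addr):
    """Classify a 32-bit MIPS virtual address by segment."""
    addr &= 0xFFFFFFFF
    if addr < 0x80000000:
        return "kuseg"   # user space, mapped through the TLB
    if addr < 0xA0000000:
        return "kseg0"   # kernel, unmapped, cached
    if addr < 0xC0000000:
        return "kseg1"   # kernel, unmapped, uncached
    return "kseg2"       # kernel, mapped through the TLB


for epc in (0x00440534, 0x00440538, 0x800559DC):
    print(f"{epc:#010x} -> {mips32_segment(epc)}")
```

An EPC in kseg0 never needs the TLB for its own instruction fetch, so
for that entry the missing address must be a data reference into kuseg.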
Next I'll try to trace where the kernel loads in the busybox binary
for execution to see if all the bytes are being read into memory.
That should be fun. I've seen it said that the places where Linux
transitions from kernel mode to userspace are some of the most
difficult things in the entire Linux universe to debug. There's only
one busybox binary. Every other busybox command is essentially a
symlink to it. For all I know it's only loaded once and references to
it, both in physical and virtual memory, are passed around from one
process address space to another, from kernel space to user space and
back again.
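
Before tracing the kernel loader it may be worth confirming what the
loader is supposed to map. The sketch below lists the PT_LOAD segments
of a 32-bit little-endian ELF image using only the standard ELF32
header layout; the function name is hypothetical and nothing here is
specific to busybox or to the 2.4 kernel.

```python
import struct

PT_LOAD = 1  # program header type for loadable segments


def load_segments(data):
    """Return (vaddr, filesz, memsz, offset) for each PT_LOAD entry
    of a 32-bit little-endian ELF image held in a bytes object."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 1 or data[5] != 1:
        raise ValueError("expected 32-bit little-endian ELF")
    # ELF32 header: e_phoff at offset 28, e_phentsize at 42, e_phnum at 44.
    (e_phoff,) = struct.unpack_from("<I", data, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 42)
    segments = []
    for i in range(e_phnum):
        fields = struct.unpack_from("<8I", data, e_phoff + i * e_phentsize)
        p_type, p_offset, p_vaddr, _paddr, p_filesz, p_memsz, _flags, _align = fields
        if p_type == PT_LOAD:
            segments.append((p_vaddr, p_filesz, p_memsz, p_offset))
    return segments
```

Feeding it the bytes of the busybox binary, for example
load_segments(open("busybox", "rb").read()), gives the virtual address
and file size of each segment. If the first segment's file size runs
well past 0x30000, the file itself is fine and the shortfall is
happening at load or page-in time.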
I hope whatever the problem is it jumps out at me soon because all
this is getting a bit, shall we say, overwhelming. I hate to give up,
however, because it's so close I can taste it.