[vmips] Re: TLB question
Brian R. Gaeke
brg at dgate.ORG
Sun Jan 16 23:27:48 CST 2005
On Sun, Jan 16, 2005 at 04:24:00AM -0500, Cable Guy wrote:
> After rebuilding BusyBox and VMIPS roms at least a hundred times over
> the last few weeks I think I'm very close now to getting Linux working
> on VMIPS. Here's my setup:
>
> Linux stock 2.4.28 kernel
> uClibc 0.9.27 (built without debugging symbols)
> busybox-1.00 (built with debugging symbols and linked statically)
> gcc 3.0.4 MIPS EL cross-compiler with software floating point enabled
> binutils 2.13
> VMIPS 1.3
This sounds very similar to what I use.
> I was worried about [software fp] for quite a while but I'm
> satisfied now I have a handle on those issues and my test program
> proves it.
Excellent, that's good to know.
> When I looked at this
> with a remote gdb session (vmips -o debug ...) it told me that the
> address 00440534 could not be read. In fact, the last address that
> COULD be read was 0042FFFC. 00430000 and above weren't available.
> I'm not sure if this is an artifact of the VMIPS gdb processing or
> what, but when ash (busybox) tries to call the routine at 00440534
> it's not there.
This sounds like you've hit a bug in the TLB.
> I made a small mod to VMIPS in the CPU::exception() method to log
> whenever a user TLB miss occurs. By the time the problem arises
> there are already 1200 entries in the log. Near the end of the log
> there's indication of a user TLB miss with EPC = 0x00440534 followed
> immediately by another one with EPC = 0x00440538 (the very next
> instruction).
> [...] the fact that the second instruction is also causing a TLB miss
> seems strange.
This definitely seems strange. If the kernel successfully handled the TLB miss
for an LUI instruction at 0x00440534, there should not have been another TLB miss for
an ADDIU at 0x00440538. (This wouldn't necessarily hold if the second
instruction were a load or store, such as LW; those instructions access
memory locations other than the PC.) The faulting address for the user TLB
miss is always stored into the coprocessor 0 register number 8 (BadVAddr).
You might want to add something to your logging routine that prints out the
BadVAddr along with the EPC.
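As a rough sketch of what such a log line could look like: the helper below is
hypothetical and not part of VMIPS. It assumes the caller already has the
values of the coprocessor 0 EPC (register 14) and BadVAddr (register 8)
registers in hand. The "likely" labels are only a heuristic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not a VMIPS function): formats one log line for a
// user TLB miss from the EPC and BadVAddr values supplied by the caller.
std::string format_tlb_miss(uint32_t epc, uint32_t badvaddr)
{
    // If the faulting address equals the EPC, the miss most likely happened
    // while fetching the instruction itself. Otherwise it was probably a
    // load/store, or a fetch in a branch delay slot (EPC then points at
    // the branch).
    const char *kind = (epc == badvaddr)
        ? "likely instruction fetch"
        : "likely data access or branch delay slot";

    char buf[128];
    const int n = std::snprintf(buf, sizeof buf,
                                "user TLB miss: EPC=0x%08x BadVAddr=0x%08x (%s)",
                                static_cast<unsigned>(epc),
                                static_cast<unsigned>(badvaddr), kind);
    // Guard against a truncated line.
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    (void)n;
    return buf;
}
```

Seeing the same BadVAddr page on two consecutive misses would point at the
refill not taking effect, rather than at two unrelated faults.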
Then, it might be interesting to look at what happens in the kernel between
those two TLB misses. I would suggest logging all the TLB instructions along
with your TLB miss log: any tlbr, tlbwi, tlbwr, or tlbp instruction that gets
executed, together with the values of coprocessor 0's Index, Random, EntryHi
and EntryLo registers at that point.
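A minimal sketch of how the four TLB instructions could be recognized and
logged follows. The function and struct names are made up for illustration
and are not VMIPS code; where to call them from inside the emulator is left
open. The encodings are the standard MIPS I ones: a COP0 operation (major
opcode 0x10) with the CO bit (bit 25) set, selected by the low six bits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Returns the mnemonic if `instr` is tlbr, tlbwi, tlbwr or tlbp,
// or nullptr for any other instruction word.
const char *tlb_mnemonic(uint32_t instr)
{
    const uint32_t opcode = instr >> 26;      // major opcode, bits 31..26
    const bool co = ((instr >> 25) & 1) != 0; // CO bit: coprocessor operation
    if (opcode != 0x10 || !co)
        return nullptr;
    switch (instr & 0x3f) {                   // function field, bits 5..0
        case 0x01: return "tlbr";
        case 0x02: return "tlbwi";
        case 0x06: return "tlbwr";
        case 0x08: return "tlbp";
        default:   return nullptr;            // e.g. rfe (0x10)
    }
}

// Hypothetical snapshot of the CP0 registers worth logging.
struct Cp0TlbRegs {
    uint32_t index;    // CP0 register 0
    uint32_t random;   // CP0 register 1
    uint32_t entry_lo; // CP0 register 2
    uint32_t entry_hi; // CP0 register 10
};

// Formats one log line for a TLB instruction executed at `pc`.
std::string format_tlb_op(uint32_t pc, uint32_t instr, const Cp0TlbRegs &r)
{
    const char *name = tlb_mnemonic(instr);
    assert(name != nullptr && "filter with tlb_mnemonic() before logging");
    char buf[160];
    std::snprintf(buf, sizeof buf,
                  "%s at PC=0x%08x: Index=0x%08x Random=0x%08x "
                  "EntryHi=0x%08x EntryLo=0x%08x",
                  name, static_cast<unsigned>(pc),
                  static_cast<unsigned>(r.index),
                  static_cast<unsigned>(r.random),
                  static_cast<unsigned>(r.entry_hi),
                  static_cast<unsigned>(r.entry_lo));
    return buf;
}
```

Interleaving these lines with the TLB miss log should show whether a tlbwr or
tlbwi with VPN 0x00440 in EntryHi ever runs between the two misses.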
> Maybe it's normal, I don't know. I'm not completely
> versed in the ins and outs of the R3000's TLB logic. Nor Linux's
> memory management for that matter. Still it seems like the very next
> instruction would only cause a TLB miss if the page holding the first
> instruction never actually got paged in in the first place. Is that
> reasoning off the mark?
It makes sense to me - at least, that's where I'd start trying to
investigate it. The TLB lookup is going to take apart that address,
0x00440534, and separate it into a virtual page number (VPN) (top
20 bits of virtual address = 0x00440) and an offset (0x534). Then
it will search for a TLB entry which has that VPN and is either
marked global or has the same PID as the faulting process (in R3000
lingo, these PIDs are called ASIDs = address space identifiers).
If it doesn't find one, you get a TLB miss exception. This is all
done in vmips/cpzero.cc:tlb_translate().
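To make the lookup concrete, here is a simplified sketch of that matching
rule. It is not the actual tlb_translate() code; the names are invented for
illustration. It assumes the R3000 register layout (VPN in EntryHi bits
31..12, ASID in EntryHi bits 11..6, PFN in EntryLo bits 31..12, global bit in
EntryLo bit 8), and it ignores the valid and dirty bits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One TLB entry, stored the way the R3000 CP0 registers lay it out.
struct TlbEntry {
    uint32_t entry_hi; // VPN: bits 31..12, ASID: bits 11..6
    uint32_t entry_lo; // PFN: bits 31..12, G (global): bit 8
};

constexpr uint32_t vpn_of(uint32_t vaddr)    { return vaddr >> 12; }
constexpr uint32_t offset_of(uint32_t vaddr) { return vaddr & 0xfffu; }

// Searches `tlb` for an entry whose VPN matches `vaddr` and which is either
// global or tagged with `asid`. On a hit, stores the physical address in
// `paddr` and returns true. Returning false corresponds to a TLB miss.
bool tlb_lookup(const std::vector<TlbEntry> &tlb, uint32_t vaddr,
                uint32_t asid, uint32_t &paddr)
{
    assert(asid < 64); // the ASID field is six bits wide
    for (const TlbEntry &e : tlb) {
        const bool vpn_match  = vpn_of(e.entry_hi) == vpn_of(vaddr);
        const bool global     = (e.entry_lo & (1u << 8)) != 0;
        const bool asid_match = ((e.entry_hi >> 6) & 0x3fu) == asid;
        if (vpn_match && (global || asid_match)) {
            // Physical address = page frame number + offset within the page.
            paddr = (e.entry_lo & 0xfffff000u) | offset_of(vaddr);
            return true;
        }
    }
    return false;
}
```

For 0x00440534, vpn_of() gives 0x00440 and offset_of() gives 0x534, so any
matching entry has to carry 0x00440 in the top 20 bits of its EntryHi.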
Since TLB misses are dealt with on a page-by-page basis, and the
first faulting address 0x00440534 is in the middle of a page (offset
0x534), the kernel's TLB miss handler should add a page to the TLB
with the virtual page number 0x00440, so that when it returns from
the exception to re-execute the instruction, fetching from PC
0x00440534 results in a successful TLB translation the second time
around. 0x00440538 is on the same page. I don't see how a second
TLB miss for that page on the very next instruction would make
sense, unless somehow virtual page 0x00440 were paged out between
those two instructions (unlikely).
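The expected behaviour can be shown with a toy model. This is purely
illustrative and has nothing to do with how VMIPS or Linux are implemented:
the "TLB" is just a set of mapped virtual page numbers, and the refill
handler is assumed to always succeed.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Toy model: tracks which virtual pages are mapped and counts misses.
struct ToyTlb {
    std::set<uint32_t> mapped_vpns;
    unsigned misses = 0;

    // Models an instruction fetch from `pc`. On a miss, a working refill
    // handler would map the page and the fetch would be retried; here that
    // is collapsed into inserting the page number.
    void fetch(uint32_t pc)
    {
        const uint32_t vpn = pc >> 12; // 4 KB pages: top 20 bits
        if (mapped_vpns.count(vpn) == 0) {
            ++misses;
            mapped_vpns.insert(vpn);
        }
    }
};

// Counts the misses taken when executing the instruction at `first_pc`
// and the one immediately after it, starting from an empty TLB.
unsigned misses_for_two_fetches(uint32_t first_pc)
{
    assert((first_pc & 3u) == 0); // MIPS instructions are word-aligned
    ToyTlb tlb;
    tlb.fetch(first_pc);
    tlb.fetch(first_pc + 4);
    return tlb.misses;
}
```

With a working refill, misses_for_two_fetches(0x00440534) is 1. A second miss
is only expected when the pair straddles a page boundary, as with 0x00440ffc.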
> Still, it seems strange that a USER TLB Miss would have a
> kernel EPC. I guess that's "legal".
Yeah, you could have, e.g., a load in a kernel routine that accesses user space.
> What a nightmare that code is between kernel and user space.
Amen to that.
> Next I'll try to trace where the kernel loads in the busybox binary
> for execution to see if all the bytes are being read into memory.
You may find that they have not yet been loaded. Linux loads executables using
"demand paging", which means that executable code pages are loaded into memory
only once they are needed. We are probably seeing Linux fail to page in a
piece of busybox, due to some bug in the VMIPS TLB.
> I hope whatever the problem is it jumps out at me soon because all
> this is getting a bit, shall we say, overwhelming. I hate to give up,
> however, because it's so close I can taste it.
It sounds like you're definitely on the right track. I'm excited
to hear what you find. Right now, I am in the middle of moving
across the country, so my ability to help will be limited, but
let me know if you have any questions.
-Brian