Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bytecode further work tracking issue #148

Open
11 of 18 tasks
Wuerfel21 opened this issue Jun 3, 2021 · 65 comments
Open
11 of 18 tasks

Bytecode further work tracking issue #148

Wuerfel21 opened this issue Jun 3, 2021 · 65 comments

Comments

@Wuerfel21
Copy link
Contributor

Wuerfel21 commented Jun 3, 2021

Here's all the stuff that'd be nice to do with the bytecode output. Mostly just so I don't forget when I come back to it.

  • Add bytecode tests to test suite (or repurpose existing tests) (Eric added exec tests, still no expected output tests)
  • update documentation with relevant info
  • CASE/LOOK* with lookup table (LOOK* still isn't, but whatever)
    • Needs way to word-align the table
  • Investigate cycle counts for different constant encodings
  • Use logic AND instead of two jumps when compiling a K_BOOL_AND condition where the right side is trivial (single memop and such)
  • Introduce -Os to use suboptimal constructs to save bytes (such as encoding negative constants as abs value + a NEG)
  • Optimize x + (-1) into x - 1
  • Optimize x == 0 and x <> 0 conditions
  • Optimize empty repeat while (currently compiles 1 unnecessary jump. Could be a peephole)
  • Improve listing output (incorporate source lines)
    • Perhaps difficult due to reordering optimization
  • Support some flexspin-isms (subobject variables, object pointers)
    • Maybe: compiling of runtime-loadable objects
  • implement byte/word locals
  • fix loop-reduce optimization
  • Integrate optimized RAM interpreter for P1
  • Implement limited inlining (for getter/setter functions that are common in Spin)
  • Investigate P2/spin2 support
  • Add C-specific optimization (calls to global modules)
@totalspectrum
Copy link
Owner

I've started work on C/BASIC support, or at least support for the runtime libraries. I'm able to compile a very simple BASIC program (LED blinker) now. I think we'll need a helper COG for printf and inline assembly, but that should only be pulled in if actually required.

@Wuerfel21
Copy link
Contributor Author

Running inline ASM in another cog seems not very useful - takes a cog and would acess that cog's registers.

I'd just not allow it with the ROM interpreter.

@Wuerfel21
Copy link
Contributor Author

You can also get away without a cog for serial if you can live with low baud rate (19200 seems to work)

@Wuerfel21
Copy link
Contributor Author

Wuerfel21 commented Jun 9, 2021

Not sure how to handle variadic functions though. Dynamic parameter count is easy, but then all the locals shift around. I think the locals would need to be put before the parameters and the caller needs to make space for them on the stack (presumably by using a REG_MODIFY to increment DCURR?)

@Wuerfel21
Copy link
Contributor Author

Also, how'd we include the interpreter code (when not using the ROM interpreter / on P2)? I guess the same kinda thing that happens with the syscode, though I have a very strong distaste for manual updating of intermediate files, i.e. it not automatically updating the header file from the source file (and IDK what tool is used to generate it to begin with). I think inline ASM can be used to include a file directly, though that'll only work for compilers with GAS-like inline asm. I think as is flexspin only builds on GCC/LLVM, anyways, so that wouldn't be an issue?

@totalspectrum
Copy link
Owner

The .spin files in the sys/ directory are converted to binary blobs with xxd automatically, and we can do something similar for the interpreter (although it will have to be compiled first, but we can do the compilation with flexspin itself). The .spin.h files are checked in to github to help people bootstrap and to make it easier to build on Windows (not sure if xxd is widely available there, it's a standard Unix tool though).

@Wuerfel21
Copy link
Contributor Author

Could you, like, not close the tracking issue please?

@Wuerfel21
Copy link
Contributor Author

Wuerfel21 commented Jun 9, 2021

Also,

not sure if xxd is widely available there

xxd seems to be included with Git for Windows, so it seems unlikely that it'd be not installed, though it isn't on PATH (Git Bash seems to add it though.

@totalspectrum totalspectrum reopened this Jun 9, 2021
@totalspectrum
Copy link
Owner

Sorry, clicked on the wrong button

@totalspectrum
Copy link
Owner

For multiple return values, we could perhaps use the DIRB, INB, and OUTB registers, e.g. the first return value goes on the stack as usual, second one goes in DIRB, third one in INB, and fourth one in OUTB. For plain Spin programs this won't be an issue (they don't have multiple returns) but it will allow us to compile Spin2, BASIC, and C.

@Wuerfel21
Copy link
Contributor Author

Yeah, that seems like a good workaround. Though for Spin2 they'll still need to be allocated on the stack for indexing and such. The same add-to-DCURR thing that'll need to happen for varargs, I guess.

@Wuerfel21
Copy link
Contributor Author

the same add-to-DCURR thing that'll need to happen for varargs, I guess.

Actually, results are zero-initialized in Spin2, too, right? so it'd have to be bunch of zero constants before the parameter expressions. That's smaller than adding to DCURR for small counts, anyways.

@totalspectrum
Copy link
Owner

totalspectrum commented Jun 11, 2021

There's a new test script, Test/runtests_bc.sh, for testing byte code. I've disabled the C and BASIC tests (we know they won't work). Quite a few of the Spin tests do work, and one of them showed a bug that I've fixed (converting division into right shift fails for negative numbers that are not powers of 2).

Of the remaining failing tests, I think the big missing features are multiple assignment (exec03) and method/object pointers (exec04). exec11 is a timing test so its failure is entirely expected (and I should add an #ifdef for bytecode output so we can ignore it).

@Wuerfel21
Copy link
Contributor Author

Wuerfel21 commented Jun 11, 2021

multi-assign between variables should be easy, I think? Unless there's a silly Spin2 quirk that makes it stupid, that is.
Just eval all the expressions, then the stores in reverse order. Not sure if that's the right eval order for the right hand though.

Calling through pointers is a problem with the P1 ROM interpreter. The obvious idea is to create a dummy header entry and rewrite it's offset before the call, but

  • not thread/multicore safe
  • could be made multicore safe by using 8 dummies and indexing with cogid, but that's lame.
  • function header entries need to know the local size

For object pointers, I think just writing them to VBASE between the call parameters and the call opcode would work? That'd be thread-safe and probably less opcodes. Would still need a dummy header entry with zero offset.

C-style Function pointers, uhhhh, idk. Again, need to know the local size to call a function. I guess make it like a Spin2 method pointer, but with PBASE instead of VBASE

Spin2-style method pointers would need quite a bit of helper code to work on P1 and will require implementing Spin2-style RTTI (wherein the first VAR long of an object gets PBASE written to it when taking a method pointer). There'll need to be a special helper function to handle calls to a method pointer that

  • sets VBASE from the method pointer
  • sets PBASE from the RTTI pointer (now at VBASE+0)
  • indexes into PBASE with the ID part of the method pointer to get the method's header entry
  • allocates the locals
  • jumps into the method

I think the pointer will need to be passed in a register, because it simplifies the helper function a lot.

I think this'd work:

PUB __methodptr_helper
' doesn't declare parameters, but will be called with whatever the target function expects 
  __interp_vbase := inb[0..15]
  __interp_pbase := long[__interp_vbase][0] ' There is no syntax for reading the variable at a given offset. I guess we could just optimize this workaround.
  result := long[__interp_pbase][inb[20..31]] ' This is where Spin2 keeps the method ID, but for P1 it'd be more sensible to be at 24..31
  __interp_dcurr += result.word[1]~ ' allocate locals
  inb := 0
  __interp_pcurr := result.word[0]~ + __interp_pbase ' jump into function

' Same thing but for function pointer
PUB __funcptr_helper
  __interp_pbase := inb[0..15]
  result := long[__interp_pbase][inb[20..31]]
  __interp_dcurr += result.word[1]~ ' allocate locals
  inb := 0
  __interp_pcurr := result.word[0]~ + __interp_pbase ' jump into function

@Wuerfel21
Copy link
Contributor Author

Also, I think:tm: the proper fix for the division to right-shift would be change the else if (shiftOptOp && isPowerOf2(rightVal)) further down into else if (shiftOptOp && rightVal > 0 && isPowerOf2(rightVal))

@totalspectrum
Copy link
Owner

Dave Hein created a method pointer class in Spin1 that used 4 words to store pcurr, vbase, stack variable offset, and pbase. This aligns pretty well with how the assembly code handles it (it uses 2 longs instead of 4 words, but same amount of space).

@totalspectrum
Copy link
Owner

The division right shift issue is a problem if the left hand side is negative. That is, -1 SHR 1 is still -1, whereas -1 / 2 should give 0. In the assembly for x / 2^N we generate something like (x < 0) ? -(ABS(x)>>N) : (x>>N), but I'm not sure if it's worth doing that for bytecode? I guess it's quite a bit faster than division, but it'll consume a lot more space, too.

@Wuerfel21
Copy link
Contributor Author

There's problems with that approach though

  • Spin2 interpeter supports method pointers natively, so using similar on P1 seems appropriate
  • Doesn't fit in a single long
  • Need to know the local size of the called function (which would prevent the aforementioned idea of loading objects at runtime)
  • More code to init each pointer

@Wuerfel21
Copy link
Contributor Author

The division right shift issue is a problem if the left hand side is negative. That is, -1 SHR 1 is still -1, whereas -1 / 2 should give 0. In the assembly for x / 2^N we generate something like (x < 0) ? -(ABS(x)>>N) : (x>>N), but I'm not sure if it's worth doing that for bytecode? I guess it's quite a bit faster than division, but it'll consume a lot more space, too.

Ahh, of course.

Doing that workaround is likely not (much) faster than an actual divide and doesn't work as an assignment, so yeah, probably just remove it. Keep the no-op optimization though.

@totalspectrum
Copy link
Owner

Here's another idea for function/method pointers: they could always point at the descriptor for the function inside the object (the one with the dcurr and pcurr offsets). Immediately after any function whose address was taken we would insert a dummy function descriptor which points back at pbase. The function pointer itself would then fit in 16 bits, and for method pointers we'd need an additional 16 bits for vbase.

@Wuerfel21
Copy link
Contributor Author

That could work well, yeah. Any major advantage over the Spin2-like approach though?

@Wuerfel21
Copy link
Contributor Author

Unrelatedly, I think I'm going to look into jump table CASE now. Saying that so we don't step on our feet.

@totalspectrum
Copy link
Owner

Sounds good, thanks. I'm going to take a stab at function/method pointers, so as to get C/BASIC printing to work (the file handles use a table of pointers).

@Wuerfel21
Copy link
Contributor Author

Regarding the zero-extend operator, it'd be useful to optimize constant zero-extend into an AND. Can you add that to OptimizeOperator? Also, setting unary to have it not compile the expression again seems really jank, I think it'd be better to just push the SHR/SAR op in the case and return, like BOOL_AND does.

@Wuerfel21
Copy link
Contributor Author

Also, I think it's just implemented wrong to begin with, shouldn't the shift amount be 31-x or something?

@Wuerfel21
Copy link
Contributor Author

Also, the current implementation for jump table generation relies on arbitrary GOTO and there being a variable holding the expression result. The former needs to be implemented for C/BASIC, anyways, the latter I'll see if I can work around.

@Wuerfel21
Copy link
Contributor Author

RE: the zeroext thing

As I said, the zeroext to AND transform should really be in OptimizeOperator, since that way it can apply to assignments. Also, if you want a constant, you can just use BCCompileInteger instead of allocating an integer and running it through BCCompileExpression

@Wuerfel21
Copy link
Contributor Author

I've got the jump tables implemented, will finish up tomorrow. Is in my BC-casefast branch. Main problem RN is that the jump table index needs unsigned min/max to be computed. I've built an optimization that converts unsigned ops to signed ones where it makes sense, but right now that only works if the first case is zero (because otherwise a subtraction is inserted) and the expression is explicitly masked (#157 blocks it from detecting small unsigned types).

But I think the actual solution is to insert an additional OTHER case at the beginning of a table, such that an expression like ((x-offset)<#max)>#0 can be used.

The unsigned->signed optimization will be neat for C though, I guess.

@Wuerfel21
Copy link
Contributor Author

Looking into multi-assign, wondering what's up with the eval order. PASM backend seems to eval both LHS and RHS from left to right, whereas the """intuitive""" order would be to eval the LHR right to left. Not sure what PNut/proptool do, listings are garbo and I don't have P2 HW hooked up right now.

@totalspectrum
Copy link
Owner

I'm not sure what PNut does, but I agree, it makes sense to evaluate the RHS in the same order as function parameters (so left to right) and hence do the LHS in the opposite order in order to pop things off the stack. Probably the PASM should be changed to do this too.
My thoughts on multi-assign were to add "numResults" fields to the IR attributes for calls and returns. When numResults is 0 or 1 the current code is generated. When numResults > 1, then the Spin1 bytecode generator adds instructions to pop the stack into INB, OUTB, and DIRB (as needed) before the RETURN bytecode. Similarly for calls we insert automatic pushes of those registers into the code after the call instruction. So the Spin1 backend generates larger code for multiple call/return.
I started work on this, checked in on branch test/bytecode_multiassign, but it's still unfinished and messy. I can finish it if you'd like, but it may be a few days before I get to it... or if you have some better ideas I'm certainly fine with that.

@Wuerfel21
Copy link
Contributor Author

Yours looks a bit nicer, so close my PR if you want.

Slight nitpick: In BCCompileExpression, you need to either set popResults to the number of results or set it to zero and not push anything. For expression lists, just set it to zero and pass asStatement through, I guess.

@Wuerfel21
Copy link
Contributor Author

Not sure if handling the register passing at the IR->bytecode step is a good idea though. Brings a lot of complexity there, when you can do it just as well in the code that emits the IR (that needs to know the result counts, anyways).

@Wuerfel21
Copy link
Contributor Author

Will have to check how Pnut handles multireturn though, maybe that needs the result count, anyways

@totalspectrum
Copy link
Owner

I think it needs to go in IR->bytecode because it's only something we do in the Spin1 ROM (in P2 we definitely won't be pushing/popping registers for the extra results, although I'm not sure exactly how it gets handled... maybe the RETURN and/or CALL encodes the number of results). I've got something checked in to the branch now that works well enough to pass the exec03 multiple assignment tests, although it's still incomplete (needs more error checking, for one thing!)

@Wuerfel21
Copy link
Contributor Author

That's not really a reason though, the higher level stuff can check gl_interp_kind, too, you know.

Can also spot some issues:

  • RETURN_PLAIN should not ever be popping values off the stack
  • You set popResults to zero for exprlists now, but don't pass through the asStatement argument

@totalspectrum
Copy link
Owner

I guess it does make sense to move the register push/pop up to outbc.c, because then bcir can see it and optimize it. So I've done that, thanks.
You're right RETURN_PLAIN needs to handle moving the local variables into INB etc. For now I've punted and thrown an error; I don't think the compiler will currently generate this. Multi-return ABORT also doesn't get handled correctly, but that's not surprising, the PASM backend makes ABORT do a lot of work that the bytecode isn't equipped for.

@totalspectrum
Copy link
Owner

I guess I don't understand the popResults/asStatement dichotomy. I think asStatement should always be false for EXPRLISTs, because we really do need to push the results onto the stack and explicitly pop them. I've probably misunderstood something there.

@Wuerfel21
Copy link
Contributor Author

Wuerfel21 commented Jun 16, 2021

I guess I don't understand the popResults/asStatement dichotomy. I think asStatement should always be false for EXPRLISTs, because we really do need to push the results onto the stack and explicitly pop them. I've probably misunderstood something there.

basically, BCCompileExpression with asStatement=true should evaluate the expression without actually pushing the result, such as with the third part of a for loop, but I think(tm) it's also used in at least one other place. Yeah, it's probably not going to happen with an exprlist, that's why I say its nitpick. popResults is simply the amount of results that should be popped off after. If the evaluation can happen without pushing anything in the first place, it needs to be set to zero

Multi-return ABORT also doesn't get handled correctly

In what way? Spin2 abort only allows a single parameter. Is multi-value ABORT an extension that exists?

@Wuerfel21
Copy link
Contributor Author

Perhaps interesting: new GEAR version that works with flexspin's bytecode output (previously, reverse math on MODIFY opcodes was unimplemented): https://github.com/davispuh/gear-emu/releases/tag/v21.06.01

@totalspectrum
Copy link
Owner

Thanks. BTW, I'm getting a very strange compilation error in C programs with break inside loops:

int main() {
    int i;
    for (i = 0; i < 10; i++) {
        __builtin_printf("%d\n", i);
        if (i == 5) break;
    }
    return 0;
}

gives me an error about ENDCASE outside of a CASE. I'm guessing the "break" is being mangled by mistake into an ENDCASE by some of the case handling code?

@Wuerfel21
Copy link
Contributor Author

Yeah, no idea why that'd happen

@totalspectrum
Copy link
Owner

Never mind, that's probably my fault... C doesn't have seperate keywords for "end case" and "break out of loop" so the AST_ENDCASE gets generated for both. I'll add some code to work around that.

@Wuerfel21
Copy link
Contributor Author

hmm, alloca is a tricky one. Can't just go above the current stack, because that'd break if the alloca is in an expression. Moving the temporaries up and allocating the memory inbetween the variables and temporaries would work for that, but requires a longmove and breaks if there's more than one alloca. If there was a high level transform moving the alloca into a standalone assignment expression if it isn't already, I think the top-of-stack thing would work, but the allocation size will need to be pushed again after the allocation so it can be popped off when out of scope, which means that the allocation amount needs to be in a variable (so we don't need to eval+pad it twice). I guess the variable that will get the pointer can be temporarily reused for that

@totalspectrum
Copy link
Owner

A global pass to pull out the allocas makes sense. I'll open another bug for it, alloca in general is a mess (AST_SCOPE doesn't really work right even in PASM).

@totalspectrum
Copy link
Owner

BTW, I've implemented some very primitive member variable code on the test/bytecode_objptrs branch.

@Wuerfel21
Copy link
Contributor Author

BTW, I've implemented some very primitive member variable code on the test/bytecode_objptrs branch.

looking good

@Wuerfel21
Copy link
Contributor Author

Wuerfel21 commented Jun 17, 2021

I think the top-of-stack thing would work, but the allocation size will need to be pushed again after the allocation so it can be popped off when out of scope, which means that the allocation amount needs to be in a variable (so we don't need to eval+pad it twice). I guess the variable that will get the pointer can be temporarily reused for that.

Actually, better idea: after the allocation, the first long of it still contains the allocation amount that was popped off to add onto DCURR, so we can just push that.

@Wuerfel21
Copy link
Contributor Author

Are you looking into object pointers (i.e. C struct pointers) yet? Because if not, I'll give it a look.

@totalspectrum
Copy link
Owner

No, I haven't had a chance to look at object / struct pointers yet. I'm working right now on trying to fix the assignment bug. So it'd be great if you could look at object pointers.

I think pointers to data shouldn't be too bad? Internally I think a pointer to an object will generally just be a pointer to the object's VBASE, so "p->x" should translate into something like "long[p + offsetof(x)]" (where long gets replaced by byte/word depending on the type of x).

Invoking methods on pointers will be a little bit trickier... for "p->foo()" we know the type info of p at compile time, so we can find all the object data. VBASE will come from p itself; PBASE comes from the object type info, and DCURR_OFFSET and PCURR come from the function info for "foo". I guess we could set up the 4 registers directly with bytecode commands like __interp_vbase = p?

@Wuerfel21
Copy link
Contributor Author

Ok, will look at it. For calls, I think the best method is to add a dummy entry for the used type to the object header with a zero offset and then just set VBASE after pushing the last parameter (and then do the normal function call opcode). Since the current VBASE is pushed by the anchor opcode, it will be correct again on return.

@Wuerfel21
Copy link
Contributor Author

The trouble with that though is figuring out which types are called through pointers before compiling the object.

@Wuerfel21
Copy link
Contributor Author

Also, accessing variables through pointers already works, it seems, though suboptimal because it indeed does long[p + offsetof(x)] when long[p][offsetof(x)/4] would be more appropriate. Can implement that as a general transform easily enough though.

@totalspectrum
Copy link
Owner

Unless you're already working on it, I was planning to look at method pointers again. My scheme (re-using some of the PASM code to allow for 8 byte method pointers) is probably suboptimal, but I think it will be easy to implement and will at least let us get some more of the BASIC tests working.

@Wuerfel21
Copy link
Contributor Author

No, not doing anything RN. Not sure if hacking the Spin2 kinda scheme onto the P1 interpreter is much better, either, so go ahead I guess.

@Wuerfel21
Copy link
Contributor Author

Wuerfel21 commented Jun 27, 2021

Just took a look at your method pointer branch.

'
' up to 8 parameters works OK; more than that and we could
' run into problems
'
pri __call_methodptr(a,b,c,d,e,f,g,h) | p, off, pc
  p := INB
  __interp_vbase := long[p]
  off := word[p+6]<<2  ' function offset as words
  p := word[p+4]       ' new pbase
  
  __interp_pbase := p
  p += off
  pc := word[p] + __interp_pbase   ' new pc
  'off := word[p][1] ' number of locals
  'off -= 12              ' we have 2 locals already
  'if off > 0
  '  __interp_dcurr += (off)
  __interp_pcurr := pc

I think this could be written better:

pri __call_methodptr
  __interp_vbase := long[INB]
  result := word[INB+6]<<1  ' function offset as words
  __interp_pbase := word[INB+4]       ' new pbase

  __interp_dcurr += word[__interp_pbase][result+1] ' number of locals
  __interp_pcurr := word[__interp_pbase][result~] + __interp_pbase   ' new pc

There should probably also be an optimization for memops where the baseExpr is a read from a base register....

@totalspectrum
Copy link
Owner

Thanks, I thought this could be improved but hadn't really worked it through -- your version is much better.

@Wuerfel21
Copy link
Contributor Author

There should probably also be an optimization for memops where the baseExpr is a read from a base register....

Relatedly, I think REG_WRITE to PCURR could be treated as a terminal op to get rid of the dead RETURN_PLAIN after it

@Wuerfel21
Copy link
Contributor Author

C's implicit top module is only ever instantiated once, right? Because currently calling a method on a global object in C has to go through the objpointer path, which is kinda less than ideal. So I think there could be a special optimization to just add global objects into the object table so they can be called natively....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants