Bytecode further work tracking issue #148

Wuerfel21 · 2021-06-03T11:18:42Z

totalspectrum · 2021-06-09T00:42:27Z

I've started work on C/BASIC support, or at least support for the runtime libraries. I'm able to compile a very simple BASIC program (LED blinker) now. I think we'll need a helper COG for printf and inline assembly, but that should only be pulled in if actually required.

Wuerfel21 · 2021-06-09T07:16:56Z

Running inline ASM in another cog seems not very useful - takes a cog and would acess that cog's registers.

I'd just not allow it with the ROM interpreter.

Wuerfel21 · 2021-06-09T07:18:28Z

You can also get away without a cog for serial if you can live with low baud rate (19200 seems to work)

Wuerfel21 · 2021-06-09T07:45:00Z

Not sure how to handle variadic functions though. Dynamic parameter count is easy, but then all the locals shift around. I think the locals would need to be put before the parameters and the caller needs to make space for them on the stack (presumably by using a REG_MODIFY to increment DCURR?)

Wuerfel21 · 2021-06-09T16:06:51Z

Also, how'd we include the interpreter code (when not using the ROM interpreter / on P2)? I guess the same kinda thing that happens with the syscode, though I have a very strong distaste for manual updating of intermediate files, i.e. it not automatically updating the header file from the source file (and IDK what tool is used to generate it to begin with). I think inline ASM can be used to include a file directly, though that'll only work for compilers with GAS-like inline asm. I think as is flexspin only builds on GCC/LLVM, anyways, so that wouldn't be an issue?

totalspectrum · 2021-06-09T18:02:25Z

The .spin files in the sys/ directory are converted to binary blobs with xxd automatically, and we can do something similar for the interpreter (although it will have to be compiled first, but we can do the compilation with flexspin itself). The .spin.h files are checked in to github to help people bootstrap and to make it easier to build on Windows (not sure if xxd is widely available there, it's a standard Unix tool though).

Wuerfel21 · 2021-06-09T18:10:13Z

Could you, like, not close the tracking issue please?

Wuerfel21 · 2021-06-09T18:33:33Z

Also,

not sure if xxd is widely available there

xxd seems to be included with Git for Windows, so it seems unlikely that it'd be not installed, though it isn't on PATH (Git Bash seems to add it though.

totalspectrum · 2021-06-09T19:13:47Z

Sorry, clicked on the wrong button

totalspectrum · 2021-06-10T12:33:37Z

For multiple return values, we could perhaps use the DIRB, INB, and OUTB registers, e.g. the first return value goes on the stack as usual, second one goes in DIRB, third one in INB, and fourth one in OUTB. For plain Spin programs this won't be an issue (they don't have multiple returns) but it will allow us to compile Spin2, BASIC, and C.

Wuerfel21 · 2021-06-10T13:09:46Z

Yeah, that seems like a good workaround. Though for Spin2 they'll still need to be allocated on the stack for indexing and such. The same add-to-DCURR thing that'll need to happen for varargs, I guess.

Wuerfel21 · 2021-06-10T20:21:02Z

the same add-to-DCURR thing that'll need to happen for varargs, I guess.

Actually, results are zero-initialized in Spin2, too, right? so it'd have to be bunch of zero constants before the parameter expressions. That's smaller than adding to DCURR for small counts, anyways.

totalspectrum · 2021-06-11T11:53:13Z

There's a new test script, Test/runtests_bc.sh, for testing byte code. I've disabled the C and BASIC tests (we know they won't work). Quite a few of the Spin tests do work, and one of them showed a bug that I've fixed (converting division into right shift fails for negative numbers that are not powers of 2).

Of the remaining failing tests, I think the big missing features are multiple assignment (exec03) and method/object pointers (exec04). exec11 is a timing test so its failure is entirely expected (and I should add an #ifdef for bytecode output so we can ignore it).

Wuerfel21 · 2021-06-11T12:42:29Z

multi-assign between variables should be easy, I think? Unless there's a silly Spin2 quirk that makes it stupid, that is.
Just eval all the expressions, then the stores in reverse order. Not sure if that's the right eval order for the right hand though.

Calling through pointers is a problem with the P1 ROM interpreter. The obvious idea is to create a dummy header entry and rewrite it's offset before the call, but

not thread/multicore safe
could be made multicore safe by using 8 dummies and indexing with cogid, but that's lame.
function header entries need to know the local size

For object pointers, I think just writing them to VBASE between the call parameters and the call opcode would work? That'd be thread-safe and probably less opcodes. Would still need a dummy header entry with zero offset.

C-style Function pointers, uhhhh, idk. Again, need to know the local size to call a function. I guess make it like a Spin2 method pointer, but with PBASE instead of VBASE

Spin2-style method pointers would need quite a bit of helper code to work on P1 and will require implementing Spin2-style RTTI (wherein the first VAR long of an object gets PBASE written to it when taking a method pointer). There'll need to be a special helper function to handle calls to a method pointer that

sets VBASE from the method pointer
sets PBASE from the RTTI pointer (now at VBASE+0)
indexes into PBASE with the ID part of the method pointer to get the method's header entry
allocates the locals
jumps into the method

I think the pointer will need to be passed in a register, because it simplifies the helper function a lot.

I think this'd work:

PUB __methodptr_helper
' doesn't declare parameters, but will be called with whatever the target function expects 
  __interp_vbase := inb[0..15]
  __interp_pbase := long[__interp_vbase][0] ' There is no syntax for reading the variable at a given offset. I guess we could just optimize this workaround.
  result := long[__interp_pbase][inb[20..31]] ' This is where Spin2 keeps the method ID, but for P1 it'd be more sensible to be at 24..31
  __interp_dcurr += result.word[1]~ ' allocate locals
  inb := 0
  __interp_pcurr := result.word[0]~ + __interp_pbase ' jump into function

' Same thing but for function pointer
PUB __funcptr_helper
  __interp_pbase := inb[0..15]
  result := long[__interp_pbase][inb[20..31]]
  __interp_dcurr += result.word[1]~ ' allocate locals
  inb := 0
  __interp_pcurr := result.word[0]~ + __interp_pbase ' jump into function

Wuerfel21 · 2021-06-11T12:54:42Z

Also, I think:tm: the proper fix for the division to right-shift would be change the else if (shiftOptOp && isPowerOf2(rightVal)) further down into else if (shiftOptOp && rightVal > 0 && isPowerOf2(rightVal))

totalspectrum · 2021-06-11T13:47:57Z

Dave Hein created a method pointer class in Spin1 that used 4 words to store pcurr, vbase, stack variable offset, and pbase. This aligns pretty well with how the assembly code handles it (it uses 2 longs instead of 4 words, but same amount of space).

totalspectrum · 2021-06-11T13:53:47Z

The division right shift issue is a problem if the left hand side is negative. That is, -1 SHR 1 is still -1, whereas -1 / 2 should give 0. In the assembly for x / 2^N we generate something like (x < 0) ? -(ABS(x)>>N) : (x>>N), but I'm not sure if it's worth doing that for bytecode? I guess it's quite a bit faster than division, but it'll consume a lot more space, too.

Wuerfel21 · 2021-06-11T13:57:00Z

There's problems with that approach though

Spin2 interpeter supports method pointers natively, so using similar on P1 seems appropriate
Doesn't fit in a single long
Need to know the local size of the called function (which would prevent the aforementioned idea of loading objects at runtime)
More code to init each pointer

Wuerfel21 · 2021-06-11T14:01:18Z

The division right shift issue is a problem if the left hand side is negative. That is, -1 SHR 1 is still -1, whereas -1 / 2 should give 0. In the assembly for x / 2^N we generate something like (x < 0) ? -(ABS(x)>>N) : (x>>N), but I'm not sure if it's worth doing that for bytecode? I guess it's quite a bit faster than division, but it'll consume a lot more space, too.

Ahh, of course.

Doing that workaround is likely not (much) faster than an actual divide and doesn't work as an assignment, so yeah, probably just remove it. Keep the no-op optimization though.

totalspectrum · 2021-06-12T01:14:44Z

Here's another idea for function/method pointers: they could always point at the descriptor for the function inside the object (the one with the dcurr and pcurr offsets). Immediately after any function whose address was taken we would insert a dummy function descriptor which points back at pbase. The function pointer itself would then fit in 16 bits, and for method pointers we'd need an additional 16 bits for vbase.

Wuerfel21 · 2021-06-12T09:33:50Z

That could work well, yeah. Any major advantage over the Spin2-like approach though?

Wuerfel21 · 2021-06-12T13:01:12Z

Unrelatedly, I think I'm going to look into jump table CASE now. Saying that so we don't step on our feet.

totalspectrum · 2021-06-12T13:28:27Z

Sounds good, thanks. I'm going to take a stab at function/method pointers, so as to get C/BASIC printing to work (the file handles use a table of pointers).

Wuerfel21 · 2021-06-12T13:51:08Z

Regarding the zero-extend operator, it'd be useful to optimize constant zero-extend into an AND. Can you add that to OptimizeOperator? Also, setting unary to have it not compile the expression again seems really jank, I think it'd be better to just push the SHR/SAR op in the case and return, like BOOL_AND does.

Wuerfel21 · 2021-06-12T13:55:40Z

Also, I think it's just implemented wrong to begin with, shouldn't the shift amount be 31-x or something?

Wuerfel21 · 2021-06-12T14:01:14Z

Also, the current implementation for jump table generation relies on arbitrary GOTO and there being a variable holding the expression result. The former needs to be implemented for C/BASIC, anyways, the latter I'll see if I can work around.

Wuerfel21 · 2021-06-12T16:14:31Z

RE: the zeroext thing

As I said, the zeroext to AND transform should really be in OptimizeOperator, since that way it can apply to assignments. Also, if you want a constant, you can just use BCCompileInteger instead of allocating an integer and running it through BCCompileExpression

Wuerfel21 · 2021-06-12T23:22:44Z

I've got the jump tables implemented, will finish up tomorrow. Is in my BC-casefast branch. Main problem RN is that the jump table index needs unsigned min/max to be computed. I've built an optimization that converts unsigned ops to signed ones where it makes sense, but right now that only works if the first case is zero (because otherwise a subtraction is inserted) and the expression is explicitly masked (#157 blocks it from detecting small unsigned types).

But I think the actual solution is to insert an additional OTHER case at the beginning of a table, such that an expression like ((x-offset)<#max)>#0 can be used.

The unsigned->signed optimization will be neat for C though, I guess.

Wuerfel21 · 2021-06-15T21:26:56Z

Looking into multi-assign, wondering what's up with the eval order. PASM backend seems to eval both LHS and RHS from left to right, whereas the """intuitive""" order would be to eval the LHR right to left. Not sure what PNut/proptool do, listings are garbo and I don't have P2 HW hooked up right now.

totalspectrum · 2021-06-15T21:47:38Z

I'm not sure what PNut does, but I agree, it makes sense to evaluate the RHS in the same order as function parameters (so left to right) and hence do the LHS in the opposite order in order to pop things off the stack. Probably the PASM should be changed to do this too.
My thoughts on multi-assign were to add "numResults" fields to the IR attributes for calls and returns. When numResults is 0 or 1 the current code is generated. When numResults > 1, then the Spin1 bytecode generator adds instructions to pop the stack into INB, OUTB, and DIRB (as needed) before the RETURN bytecode. Similarly for calls we insert automatic pushes of those registers into the code after the call instruction. So the Spin1 backend generates larger code for multiple call/return.
I started work on this, checked in on branch test/bytecode_multiassign, but it's still unfinished and messy. I can finish it if you'd like, but it may be a few days before I get to it... or if you have some better ideas I'm certainly fine with that.

Wuerfel21 · 2021-06-15T21:57:26Z

Yours looks a bit nicer, so close my PR if you want.

Slight nitpick: In BCCompileExpression, you need to either set popResults to the number of results or set it to zero and not push anything. For expression lists, just set it to zero and pass asStatement through, I guess.

Wuerfel21 · 2021-06-15T22:54:19Z

Not sure if handling the register passing at the IR->bytecode step is a good idea though. Brings a lot of complexity there, when you can do it just as well in the code that emits the IR (that needs to know the result counts, anyways).

Wuerfel21 · 2021-06-15T22:57:36Z

Will have to check how Pnut handles multireturn though, maybe that needs the result count, anyways

totalspectrum · 2021-06-15T23:32:33Z

I think it needs to go in IR->bytecode because it's only something we do in the Spin1 ROM (in P2 we definitely won't be pushing/popping registers for the extra results, although I'm not sure exactly how it gets handled... maybe the RETURN and/or CALL encodes the number of results). I've got something checked in to the branch now that works well enough to pass the exec03 multiple assignment tests, although it's still incomplete (needs more error checking, for one thing!)

Wuerfel21 · 2021-06-16T07:01:59Z

That's not really a reason though, the higher level stuff can check gl_interp_kind, too, you know.

Can also spot some issues:

RETURN_PLAIN should not ever be popping values off the stack
You set popResults to zero for exprlists now, but don't pass through the asStatement argument

totalspectrum · 2021-06-16T13:04:00Z

I guess it does make sense to move the register push/pop up to outbc.c, because then bcir can see it and optimize it. So I've done that, thanks.
You're right RETURN_PLAIN needs to handle moving the local variables into INB etc. For now I've punted and thrown an error; I don't think the compiler will currently generate this. Multi-return ABORT also doesn't get handled correctly, but that's not surprising, the PASM backend makes ABORT do a lot of work that the bytecode isn't equipped for.

totalspectrum · 2021-06-16T13:05:16Z

I guess I don't understand the popResults/asStatement dichotomy. I think asStatement should always be false for EXPRLISTs, because we really do need to push the results onto the stack and explicitly pop them. I've probably misunderstood something there.

Wuerfel21 · 2021-06-16T13:44:52Z

I guess I don't understand the popResults/asStatement dichotomy. I think asStatement should always be false for EXPRLISTs, because we really do need to push the results onto the stack and explicitly pop them. I've probably misunderstood something there.

basically, BCCompileExpression with asStatement=true should evaluate the expression without actually pushing the result, such as with the third part of a for loop, but I think(tm) it's also used in at least one other place. Yeah, it's probably not going to happen with an exprlist, that's why I say its nitpick. popResults is simply the amount of results that should be popped off after. If the evaluation can happen without pushing anything in the first place, it needs to be set to zero

Multi-return ABORT also doesn't get handled correctly

In what way? Spin2 abort only allows a single parameter. Is multi-value ABORT an extension that exists?

Wuerfel21 · 2021-06-16T20:07:30Z

Perhaps interesting: new GEAR version that works with flexspin's bytecode output (previously, reverse math on MODIFY opcodes was unimplemented): https://github.com/davispuh/gear-emu/releases/tag/v21.06.01

totalspectrum · 2021-06-17T00:45:51Z

Thanks. BTW, I'm getting a very strange compilation error in C programs with break inside loops:

int main() {
    int i;
    for (i = 0; i < 10; i++) {
        __builtin_printf("%d\n", i);
        if (i == 5) break;
    }
    return 0;
}

gives me an error about ENDCASE outside of a CASE. I'm guessing the "break" is being mangled by mistake into an ENDCASE by some of the case handling code?

Wuerfel21 · 2021-06-17T07:04:16Z

Yeah, no idea why that'd happen

totalspectrum · 2021-06-17T12:12:50Z

Never mind, that's probably my fault... C doesn't have seperate keywords for "end case" and "break out of loop" so the AST_ENDCASE gets generated for both. I'll add some code to work around that.

Wuerfel21 · 2021-06-17T12:33:10Z

hmm, alloca is a tricky one. Can't just go above the current stack, because that'd break if the alloca is in an expression. Moving the temporaries up and allocating the memory inbetween the variables and temporaries would work for that, but requires a longmove and breaks if there's more than one alloca. If there was a high level transform moving the alloca into a standalone assignment expression if it isn't already, I think the top-of-stack thing would work, but the allocation size will need to be pushed again after the allocation so it can be popped off when out of scope, which means that the allocation amount needs to be in a variable (so we don't need to eval+pad it twice). I guess the variable that will get the pointer can be temporarily reused for that

totalspectrum · 2021-06-17T14:48:20Z

A global pass to pull out the allocas makes sense. I'll open another bug for it, alloca in general is a mess (AST_SCOPE doesn't really work right even in PASM).

totalspectrum · 2021-06-17T14:49:06Z

BTW, I've implemented some very primitive member variable code on the test/bytecode_objptrs branch.

Wuerfel21 · 2021-06-17T15:05:06Z

BTW, I've implemented some very primitive member variable code on the test/bytecode_objptrs branch.

looking good

Wuerfel21 · 2021-06-17T17:58:07Z

I think the top-of-stack thing would work, but the allocation size will need to be pushed again after the allocation so it can be popped off when out of scope, which means that the allocation amount needs to be in a variable (so we don't need to eval+pad it twice). I guess the variable that will get the pointer can be temporarily reused for that.

Actually, better idea: after the allocation, the first long of it still contains the allocation amount that was popped off to add onto DCURR, so we can just push that.

Wuerfel21 · 2021-06-20T13:05:09Z

Are you looking into object pointers (i.e. C struct pointers) yet? Because if not, I'll give it a look.

totalspectrum · 2021-06-20T13:19:18Z

No, I haven't had a chance to look at object / struct pointers yet. I'm working right now on trying to fix the assignment bug. So it'd be great if you could look at object pointers.

I think pointers to data shouldn't be too bad? Internally I think a pointer to an object will generally just be a pointer to the object's VBASE, so "p->x" should translate into something like "long[p + offsetof(x)]" (where long gets replaced by byte/word depending on the type of x).

Invoking methods on pointers will be a little bit trickier... for "p->foo()" we know the type info of p at compile time, so we can find all the object data. VBASE will come from p itself; PBASE comes from the object type info, and DCURR_OFFSET and PCURR come from the function info for "foo". I guess we could set up the 4 registers directly with bytecode commands like __interp_vbase = p?

Wuerfel21 · 2021-06-20T13:25:50Z

Ok, will look at it. For calls, I think the best method is to add a dummy entry for the used type to the object header with a zero offset and then just set VBASE after pushing the last parameter (and then do the normal function call opcode). Since the current VBASE is pushed by the anchor opcode, it will be correct again on return.

Wuerfel21 · 2021-06-20T13:28:26Z

The trouble with that though is figuring out which types are called through pointers before compiling the object.

Wuerfel21 · 2021-06-20T13:32:14Z

Also, accessing variables through pointers already works, it seems, though suboptimal because it indeed does long[p + offsetof(x)] when long[p][offsetof(x)/4] would be more appropriate. Can implement that as a general transform easily enough though.

totalspectrum · 2021-06-26T12:53:53Z

Unless you're already working on it, I was planning to look at method pointers again. My scheme (re-using some of the PASM code to allow for 8 byte method pointers) is probably suboptimal, but I think it will be easy to implement and will at least let us get some more of the BASIC tests working.

Wuerfel21 · 2021-06-26T16:04:37Z

No, not doing anything RN. Not sure if hacking the Spin2 kinda scheme onto the P1 interpreter is much better, either, so go ahead I guess.

Wuerfel21 · 2021-06-27T16:18:25Z

Just took a look at your method pointer branch.

'
' up to 8 parameters works OK; more than that and we could
' run into problems
'
pri __call_methodptr(a,b,c,d,e,f,g,h) | p, off, pc
  p := INB
  __interp_vbase := long[p]
  off := word[p+6]<<2  ' function offset as words
  p := word[p+4]       ' new pbase
  
  __interp_pbase := p
  p += off
  pc := word[p] + __interp_pbase   ' new pc
  'off := word[p][1] ' number of locals
  'off -= 12              ' we have 2 locals already
  'if off > 0
  '  __interp_dcurr += (off)
  __interp_pcurr := pc

I think this could be written better:

pri __call_methodptr
  __interp_vbase := long[INB]
  result := word[INB+6]<<1  ' function offset as words
  __interp_pbase := word[INB+4]       ' new pbase

  __interp_dcurr += word[__interp_pbase][result+1] ' number of locals
  __interp_pcurr := word[__interp_pbase][result~] + __interp_pbase   ' new pc

There should probably also be an optimization for memops where the baseExpr is a read from a base register....

totalspectrum · 2021-06-27T20:34:27Z

Thanks, I thought this could be improved but hadn't really worked it through -- your version is much better.

Wuerfel21 · 2021-06-27T20:49:16Z

There should probably also be an optimization for memops where the baseExpr is a read from a base register....

Relatedly, I think REG_WRITE to PCURR could be treated as a terminal op to get rid of the dead RETURN_PLAIN after it

Wuerfel21 · 2021-09-19T13:10:54Z

C's implicit top module is only ever instantiated once, right? Because currently calling a method on a global object in C has to go through the objpointer path, which is kinda less than ideal. So I think there could be a special optimization to just add global objects into the object table so they can be called natively....

totalspectrum closed this as completed Jun 9, 2021

totalspectrum reopened this Jun 9, 2021

Bytecode further work tracking issue #148

Bytecode further work tracking issue #148

Comments

Wuerfel21 commented Jun 3, 2021 • edited Loading

totalspectrum commented Jun 9, 2021

Wuerfel21 commented Jun 9, 2021

Wuerfel21 commented Jun 9, 2021

Wuerfel21 commented Jun 9, 2021 • edited Loading

Wuerfel21 commented Jun 9, 2021

totalspectrum commented Jun 9, 2021

Wuerfel21 commented Jun 9, 2021

Wuerfel21 commented Jun 9, 2021 • edited Loading

totalspectrum commented Jun 9, 2021

totalspectrum commented Jun 10, 2021

Wuerfel21 commented Jun 10, 2021

Wuerfel21 commented Jun 10, 2021

totalspectrum commented Jun 11, 2021 • edited Loading

Wuerfel21 commented Jun 11, 2021 • edited Loading

Wuerfel21 commented Jun 11, 2021

totalspectrum commented Jun 11, 2021

totalspectrum commented Jun 11, 2021

Wuerfel21 commented Jun 11, 2021

Wuerfel21 commented Jun 11, 2021

totalspectrum commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

totalspectrum commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

Wuerfel21 commented Jun 12, 2021

Wuerfel21 commented Jun 15, 2021

totalspectrum commented Jun 15, 2021

Wuerfel21 commented Jun 15, 2021

Wuerfel21 commented Jun 15, 2021

Wuerfel21 commented Jun 15, 2021

totalspectrum commented Jun 15, 2021

Wuerfel21 commented Jun 16, 2021

totalspectrum commented Jun 16, 2021

totalspectrum commented Jun 16, 2021

Wuerfel21 commented Jun 16, 2021 • edited Loading

Wuerfel21 commented Jun 16, 2021

totalspectrum commented Jun 17, 2021

Wuerfel21 commented Jun 17, 2021

totalspectrum commented Jun 17, 2021

Wuerfel21 commented Jun 17, 2021

totalspectrum commented Jun 17, 2021

totalspectrum commented Jun 17, 2021

Wuerfel21 commented Jun 17, 2021

Wuerfel21 commented Jun 17, 2021 • edited Loading

Wuerfel21 commented Jun 20, 2021

totalspectrum commented Jun 20, 2021

Wuerfel21 commented Jun 20, 2021

Wuerfel21 commented Jun 20, 2021

Wuerfel21 commented Jun 20, 2021

totalspectrum commented Jun 26, 2021

Wuerfel21 commented Jun 26, 2021

Wuerfel21 commented Jun 27, 2021 • edited Loading

totalspectrum commented Jun 27, 2021

Wuerfel21 commented Jun 27, 2021

Wuerfel21 commented Sep 19, 2021

Wuerfel21 commented Jun 3, 2021 •

edited

Loading

Wuerfel21 commented Jun 9, 2021 •

edited

Loading

Wuerfel21 commented Jun 9, 2021 •

edited

Loading

totalspectrum commented Jun 11, 2021 •

edited

Loading

Wuerfel21 commented Jun 11, 2021 •

edited

Loading

Wuerfel21 commented Jun 16, 2021 •

edited

Loading

Wuerfel21 commented Jun 17, 2021 •

edited

Loading

Wuerfel21 commented Jun 27, 2021 •

edited

Loading