So I am writing a Ruby compiler (or shall I say translator)… YARV has only around 80 bytecode instructions and the majority of them are pretty simple. I can let YARV parse .rb files, create the AST, and compile that into bytecode. All I need to do then is implement all the bytecode instructions (well, kind of; of course there is more work than that).
So at first I started writing it in assembler. I made a little Ruby DSL so I could write Intel-syntax assembly as Ruby functions and mix it with Ruby code. But after some time I got tired of looking up the binary encodings for all the instructions I needed. So I was thinking of using metasm, an assembler written in Ruby. It seems like a cool project; however, it is not packaged as a gem, which is a minus. The biggest problem, though, is that documentation for it is virtually non-existent. Since I needed a little more control over things, that was a deal-breaker for me.
I started thinking that writing in ASM might not be a good idea… There are two big problems. First, I have to write for one very specific architecture, like x86_64. I can’t even target both IA-32 and x86_64 (btw did you know that x86_64 and IA-64 are not the same thing? IA-64 is Itanium). Since most of the code is written in ASM, it is basically unportable.
The other problem is that I can’t benefit from compiler optimizations. Compilers are really good nowadays and it’s very hard to hand-write better-optimized code, especially on newer architectures where you have tons of instructions and registers. For example, did you know that for the function prologue, ENTER is actually slower than PUSH rbp; MOV rbp, rsp; SUB rsp, X? I find this funny: why would there be an instruction that does exactly the same thing as three other instructions together, only slower? I hear it is still there only for backwards compatibility. That’s why, if you look at code generated by good compilers, you will never see ENTER; they know better.
So for these reasons I decided to instead write a kind of Ruby-to-C translator. What I mean by this is implementing all the bytecode instructions in C. One very obvious benefit is that I can use the native stack, and if you look at Ruby bytecode you can see that it is entirely built around a stack: almost every instruction either pops from or pushes to it. So if I can reduce the time it takes to push and pop things, I expect a big performance win. And with the native stack, pushing and popping is as fast as it can be.
There is one problem here that must be addressed, and that is procs (lambdas). Procs are even more of a problem than blocks. A proc can of course access the local variables of the parent scope (i.e. the method that created it). But you can also pass a proc somewhere else as an argument, and it may be executed after the parent method has already finished. This is a problem because the native stack frame is no longer accessible once the method/function returns. In YARV this isn’t a big deal: it just leaves its stack in memory and lets the garbage collector free it when it’s no longer needed. With the native stack that is not possible. A few possible solutions come to mind:
- Copy the local variables somewhere for the proc. This is not OK because the proc must always see the current values of the local variables, not the values they had when the proc was created. And if you don’t control the proc, it can’t signal you “hey, give me your locals now”. It’s probably possible to work around this somehow, but…
- Let the proc access the stack directly. This works great while the function is still running, and it is the solution I am currently using: when the function finishes, I replace the pointer to the parent’s stack with freshly allocated memory into which I copy all the local variables. I then have to garbage-collect this memory, and possibly optimize so the copy is skipped when no blocks need it. The copy itself is pretty fast since I use the rep movs instruction (the processor does the looping internally).
- Allocate some memory and use that as the stack. When the function ends, this memory can stay allocated until it’s no longer needed. This is similar to the previous solution, but it requires an allocation up front. Otherwise you can just mov into and out of it, so access is still fast. Combined with a memory pool this could be quite an interesting option. On the other hand, you can use a memory pool with the second method too, and that seems better.
So the biggest problem with the stack is garbage-collecting it somehow; I don’t think I can avoid that. I could probably hook into the garbage collector, mark the procs that use my stack, and free the memory once none are left. Garbage collection is a problem in itself because I am not quite sure yet how it works for C extensions. I will have to be careful there so I don’t leak.
I have been having a hard time with the x86_64 ABI. I already wrote about this; basically, it specifies that parameters are passed in registers instead of on the stack as on good old IA-32. This is a big deal in my case because I need the parameters on the stack: they are part of the local variables in Ruby. So if I honor the ABI, I have to copy everything to the stack when the function starts, which takes time. The problem is that it’s really hard not to honor the ABI… Compilers don’t seem to provide an option to pass parameters on the stack. At first I was using naked C functions and setting up the stack frame myself, but this is a real PITA because the compiler tries to put local variables in places where you have already put something; it doesn’t care what you’re doing. It also requires a lot of ASM, which is not good.
Today I came up with a reasonably good solution: using an unaligned struct and unioning it with my type. It sucks a little because you have to typecast things, but at least the ABI is no longer a problem, since it specifies that unaligned structs are passed on the stack. And this does not hurt performance at all, because the resulting binary code is the same as if you were using an stdcall-like calling convention on IA-32. So this is great. And since I generate the C code from Ruby, I don’t care that it is a little clunky to use.
I think the biggest problem will be passing things between interpreted Ruby and compiled Ruby, especially letting Ruby access the local variables of a C function. I looked at the struct Ruby uses for an instruction sequence and it’s really complex, so I will need a lot of time to find a good way to do this. The good part is that eval will work fine: I can just pass it to the interpreter and it’s all good (except for the same problem of accessing local variables from within the eval). There are a lot of things I can just let the interpreter do for me, which is pretty cool.