It may be beneficial to have a standard set of benchmarks.
Potential uses:
I don't plan to use benchmarking to contrast Retro with other Forth systems, merely as a guide to improving Retro itself.
I've added two small tests to start the process for this.
On my Linode running OpenBSD:
Python3
  1,000,000 iterations of empty loops
    1m49.14s real   1m47.41s user   0m00.05s system
  Push and discard a value 1,000,000 times
    2m15.99s real   2m13.74s user   0m00.07s system

PyPy
  1,000,000 iterations of empty loops
    0m02.95s real   0m02.67s user   0m00.25s system
  Push and discard a value 1,000,000 times
    0m03.19s real   0m03.01s user   0m00.15s system

C
  1,000,000 iterations of empty loops
    0m00.32s real   0m00.32s user   0m00.00s system
  Push and discard a value 1,000,000 times
    0m00.38s real   0m00.38s user   0m00.00s system
It's great to see PyPy (version 3, I assume) performing well, as that is the platform I am targeting.
Could be interesting to see what Jython is capable of given its JVM/JIT underpinning.
Also LuaJIT VM some day :]
Just out of interest, how many opcodes are decoded to perform 1,000,000 empty loops in Retro?
I would assume this number is highly dependent on which Retro functions have been moved into the VM.
I'm actually using 2.7 with PyPy at the moment (my CPython is v3):
Python 2.7.13 (4a68d8d3d2fc1faec2e83bcb4d28559099092574, May 08 2020, 21:47:03) [PyPy 7.2.0 with GCC 4.2.1 Compatible OpenBSD Clang 8.0.1 (tags/RELEASE_801/final)] on openbsd6
I'll have to see if OpenBSD has a port for Python3, or if I'll need to set aside some time to try building it from source.
I don't have a machine with Java at the moment, so can't run this under Jython. This is something I will look into doing.
I'll do some statistics tracking this weekend on the number of opcodes executed for the benchmarks; I haven't done so yet.
RE: Lua; this is on my todo list, hopefully for sometime next year.
Perhaps add some negative slots to the VM spec to pull opcode counts and the high-resolution timer? Do the other VMs return 0 for unsupported negative slots, or abort?
An additional negative slot that takes the top-of-stack value as a start time and returns the milliseconds elapsed since then (perhaps with distinct integer and float versions) would give a neat ABI for benchmarking. Ideally the timing code should have as little impact on runtime & opcode count as possible. Also a slot to convert now, or the top-of-stack value, to a UTC/ISO string, e.g. "2020-12-19T18:45:28.640919"
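To make the slot idea concrete, here is a rough sketch of how a Python host might service such queries. Everything in it is an assumption for illustration: the slot numbers, the `SlotHost` class and the `iso_utc` helper are made up and are not taken from the Retro VM spec. Answering 0 for unknown slots is just one of the two policies asked about above. It assumes Python 3.7+, so it would not run as-is under the PyPy 2.7 mentioned earlier.

```python
import time
from datetime import datetime, timezone

# Hypothetical slot numbers; nothing here comes from the Retro VM spec.
SLOT_OPCODE_COUNT = -20
SLOT_NOW_MS = -21
SLOT_ELAPSED_MS = -22


class SlotHost:
    """Toy host that services negative slots against a data stack."""

    def __init__(self, clock_ms=None):
        self.stack = []
        self.opcode_count = 0  # a VM loop would increment this once per opcode
        # Monotonic clock by default, so elapsed values never go backwards.
        self.clock_ms = clock_ms or (lambda: time.monotonic_ns() // 1_000_000)

    def query(self, slot):
        if slot == SLOT_OPCODE_COUNT:
            self.stack.append(self.opcode_count)
        elif slot == SLOT_NOW_MS:
            self.stack.append(self.clock_ms())
        elif slot == SLOT_ELAPSED_MS:
            start = self.stack.pop()  # start time left by SLOT_NOW_MS
            self.stack.append(self.clock_ms() - start)
        else:
            # One possible policy for unsupported slots: answer 0, don't abort.
            self.stack.append(0)


def iso_utc(epoch_seconds):
    """Format seconds since the Unix epoch like 2020-12-19T18:45:28.640919."""
    moment = datetime.fromtimestamp(epoch_seconds, timezone.utc)
    return moment.replace(tzinfo=None).isoformat(timespec="microseconds")
```

A benchmark would query the "now" slot, run the code under test, then query the "elapsed" slot, which costs two slot queries on top of the measured work. The ISO helper takes wall-clock epoch seconds separately, since a monotonic timer has no defined relation to UTC.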
To be fair, the benchmark suite should allow the JIT to warm up before the timings are performed.
IMHO the current numbers are not a true indication of the runtime performance possible with JIT-based VMs.
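A minimal sketch of what warm-up-aware timing could look like, assuming the benchmark can be called as a Python function. The `bench` and `empty_loop` names and the warm-up and run counts are made up for illustration; `empty_loop` is a plain Python stand-in, not the actual Retro test.

```python
import time


def bench(fn, warmup=5, runs=10):
    """Return per-run wall-clock seconds for fn, measured after warm-up calls."""
    for _ in range(warmup):
        fn()  # untimed: gives a tracing JIT the chance to compile the hot loop
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def empty_loop(n=1_000_000):
    # Python stand-in for the "empty loops" test, not the Retro benchmark itself.
    for _ in range(n):
        pass


if __name__ == "__main__":
    # Time one cold call first, then compare it with the best warmed-up run.
    cold_start = time.perf_counter()
    empty_loop()
    cold = time.perf_counter() - cold_start
    warm = min(bench(empty_loop))
    print("cold: %.4fs  warm best: %.4fs" % (cold, warm))
```

Reporting both the cold and the warmed-up figure keeps the non-interactive, run-once case visible alongside the steady-state one.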
re: negative slots; I've opened a separate issue related to that.
re: warm-up time: this depends on one's usage. I run Retro largely non-interactively, so the current approach reflects the performance as I use it. But this may admittedly not be how others use it; it's worth testing both ways IMO, so I'll ultimately work on both.