My name is Roy Ward. I specialize in writing highly optimized code.
I have expertise in optimization, algorithms, data structures, compilers, microprocessor architecture, assemblers, operating systems, number representation, concurrency and compression. I have over 20 years of full-time experience building high performance and low memory solutions, resulting in 12 US and 7 international software patents.
I have a deep understanding of what is going on at all levels, from digital circuitry, CPU design, assembly, higher level languages, data structures, all the way to compilers and operating systems, and I apply this multi-level understanding to everything I do. I also bring to bear my strong mathematics skills which help direct me towards a highly efficient solution to whatever problem I am working on.
I am currently seeking interesting and challenging work using these skills.
This is a portfolio website for some of the projects that I have been involved with. Like everything in life, it is a work in progress.
Work and views presented here are my own.
Contact Linkedin Resume Useful/Interesting External Links
Galaxy Generator Project - generate and navigate around a 400 billion star galaxy, on Android, Linux and Windows. Pre-alpha downloads now available for Android, Linux and Windows.
Open source projects#
Pseudo-Double Random Variate Poisson Proof of concept Prolog Compiler
Blog#
Note About blog postings: I have been much busier the last few months, so blog postings have become much more sporadic. Here are some notes about how I select topics. Next up: ‘Using processor vectorization for fast N-Queens’.
Introduction This post describes a reasonably performant Prolog to C compiler prototype written in Prolog. It only compiles a the subset of Prolog, but it does include unification and backtracking, which is enough to demonstrate:
N-queens, which is good for testing the performance of backtracking. the append predicate, which is an excellent example for demonstrating unification, backtracking and code being able to be run in more than one direction, Context: A few months ago, I applied for a job that involved developing code in Prolog. I hadn’t used Prolog for some time, so I was looking for a project to brush up on my Prolog skills. It occurred to me that an effective way of refreshing my Prolog skills and deepening my understanding of the internals of Prolog at the same time would be to write a compiler that would compile a subset of Prolog to a low level language. I was tempted to target x86-64 machine code (having written code generators targeting this before), but I only had a bit over a week to put something together to demonstrate, so I chose to target C++ instead, later changing it to C as I wasn’t using any features that required C++.
...
Introduction A little over three years ago, I had an idea - wouldn’t it be fun to have a whole procedurally generated universe on my phone, and have an open game where I can explore parts of it with no combat? This post describes some of the design decisions that I went through to start turning this idea into reality, and some of the consequences of those decisions.
I’m writing this post three years after the design process, and while it is easy to find out what the available options would be now, it is sometimes a little difficult to determine what the state of a particular software package or license was at the time, particularly as most web content is undated. If I had thought I was going to write this, having some documentation three years ago would have been useful. Also, this post is going to make the process sound more linear than it really was.
...
I’ve taken a break from blogging on optimization topics for a bit to get the Galaxy Generator into a releasable state - it is nowhere near feature complete, but the parts that I have done work fairly well. I’ve now got the rendering part somewhere close to my vision, and version 0.1.2 is available for download here for Android, Linux and Windows (Android will give some warnings about side-loading and self-signing, and Windows Defender will warn you about potential malware - there isn’t any).
...
Introduction I had a problem. The Galaxy Generator worked fine when I was moving around inside the Galaxy, but it seems important that I should be able to move outside it, turn the brightness up, and see the whole thing at once. When I tried that, the frame rate would drop to nearly zero and the machine would then run out of memory. Here was the sort of image that I was looking for:
...
Introduction While building my Galaxy Generator, one of the performance bottlenecks was in the generation of Poisson random variates to determine how many stars should be placed inside a voxel. This post looks at how to write a fast cross-platform Poisson random variate generator.
First I motivate this by writing about why use Poisson random variates at all, then I explore the floating point case (using double) which gives some good results, but may not reliably generate the same results cross-platform. Lastly I look at integer only versions of the Poisson random variate algorithms that can safely used cross-platform.
...
Introduction I have been working for some time on a Galaxy Generator project that procedurally generates and allows navigation in a 400 billion star galaxy. Stars are generated/removed as they enter/leave visual range. One of the performance bottlenecks is generating that data (which can be millions of stars per second) and getting it to the GPU for rendering. This post investigates doing that efficiently.
First I look at how a star is modeled and what information needs to be sent to the GPU. This is followed by my efforts to squeeze this into a small an amount of memory as reasonably possible (16 bytes). I finish with some notes about how this can be sped up at the CPU end.
...
Introduction This post continues documenting the process of optimizing a small problem - generating images of the Mandelbrot Set. This builds on what I wrote in part 1 and gets to speeds of up to 130x the naive case. Again, this is mostly meant as an exercise to demonstrate how I go about using my knowledge of software design and computer architecture to optimize software.
This post will be divided into four sections: a brief recap and some notes on the feedback I got from the last post, a section on micro-optimization where I get a small (~1.2x) extra improvement, a section on point pruning where contour following gives a significant improvement, and finally some notes on multi-threading. Again, many of the sections are independent of each other, so feel free to use the Contents to go to the sections of interest.
...
Introduction This post documents the process of optimizing a small problem - generating images of the Mandelbrot Set. Although I will show a speedup of ~8x over the naive implementation (there is a part 2 that will bring this to ~100x in some cases), this is mostly meant as an example of how I go about using my knowledge of software design and computer architecture to optimize software. This post is rather long, but many of the sections are independent of each other, so feel free to use the Contents to go to the sections of interest. In particular I spend some time setting up the problem, and the real optimizing starts here.
...