01-01-2013
Happy new year -- and a new update to the (Boost) QVM library (note, this library has not been reviewed yet and it is not part of Boost.)
Boost QVM provides a set of generic functions for working with quaternions, and vector and matrix types of static size. All operations use a type-traits system, which enables types from other vector/matrix libraries to be integrated in a common type-safe interface.
The current source code and documentation is available here. It contains minor bug fixes, and the following new features:
- view proxies that negate matrix rows and columns (negr/negc)
- view proxies that swap two rows or columns in a matrix (swapr/swapc)
- functions for indexing at run-time of any vector or matrix object of arbitrary size.
05-31-2012
If static type checking is useful for detecting bugs, it is an even more powerful refactoring tool. In a well designed program, it is often possible to make a change in an interface, then rely on the compiler to report errors for all places in the code that need to be altered to match. In general, it is A Good Thing to design code that relies on static type checking as much as possible.
But one of the most common problems programmers inflict on themselves is when they try to use the static type checking built in C and C++ compilers to detect bugs due to integer conversions that result in loss of precision.
Let's say in a class called point, we have reasons to store X and Y as shorts rather than ints:
Right, we might have a problem. On most platforms, the int and short types are 32- and 16-bits in size, respectively. This means that the call to the point constructor may result in loss of data.
It would be very nice if we could get the compiler to point out all instances where an int is implicitly converted to a short, so that we can verify that the loss of precision is desirable or at least that it isn't a problem.
So we enable the appropriate warning:
What makes this especially desirable is that function (and constructor) declarations are usually separated in a header file. The thinking is, if a careless programmer calls the constructor with int arguments, the compiler will point out the possible problem, the programmer will inspect the code carefully to confirm that the loss of precision is OK, and then... Well,
What do you say? Casting? Yes, this situation is a textbook example for using casts:
The trouble is, in the original program, the compiler would have truncated x and y only if they needed to be truncated. The program that uses casts to avoid the warning will truncate x and y even if they don't need to be truncated.
We've changed the semantics of the program in a way that's not just dangerous, it's more dangerous than the potential bug the warning was intended to detect. If during later refactoring we realize that we do need the extra precision of int, the point class will have to change accordingly:
And now we don't have a warning but we have a nasty bug, all because we explicitly instructed the compiler to do something bad (cast), something that it's carefully designed not to do!
The source of the bug is the ill-advised attempt to use static type checking to detect what is inherently a run-time error. Converting an int to a short is dangerous only if the run-time value of x exceeds the valid range for short integers. A better approach is to ditch the warning and the cast it comes with, and use assert:
But now... even if we assume that programmers will always assert before calling foo (quite an assumption!) the resulting code is pretty ugly to look at.
To solve the problem, we need to take a step back and look at the original reasons why the point constructor takes shorts instead of ints. Clearly, this is because the values are copied to the members of class point, which are of type short to save memory. Note, however, that the savings come from the point members, not from the constructor itself taking shorts -- the latter was done for "consistency", which A) doesn't make it correct, and B) as we'll see later isn't even consistent.
First, we should make the constructor take int parameters regardless of the type of the (private) members:
This promotes consistency throughout the program, regardless of what type the members of class point are this week. The constructor is also the perfect place to put our asserts, thus completely relieving the caller from the burden of truncating ints to shorts.
Second,
Using short or char in any kind of integer expression ends up promoting them to int anyway. This can be illustrated by the following program:
The typical output of this program is:
Note that it prints 131070 for x+x, even though x is of type short and (for 16-bit shorts anyway) wouldn't be able to hold that value. That's because x+x is of type int.
As another example, if we pass something like x+1 (which would be pretty common) to a function that takes short, the program will first produce an int which is then truncated to a short as part of the function call (and I'm not aware of a way to make compilers issue a warning in this case, not that it would be desirable anyway.)
The best way to be consistent is to use the consistency of C and C++ to always promote small integer types to int. This means to forget about char and short as function arguments; they are (almost) never justified. The net effect of using them is that the program will contain a lot more places where data gets truncated, which makes it more prone to bugs.
Avoiding char and short function arguments is not ideal, but it is better than using dangerous casts to suppress a retarded warning that should never have been enabled to begin with.
08-04-2011
In prototyping for a new video game, here at Reverge we needed to implement a depth of field effect in our engine. It turned out to be an interesting technical challenge so I decided to mention it here.
In photography, depth of field is an artistic effect that makes objects in the picture appear sharper or blurrier depending on their depth (how far away they are from the camera.) Here are a couple of examples (images courtesy of Wikipedia):


The following diagram shows the depth of field graphically, however keep in mind that there isn't a sudden drop in sharpness: in any photograph, only a single plane is in perfect focus -- at all other depths, the sharpness varies smoothly.

The blurriness is an artifact of projecting the scene onto a 2D surface using a lens. In the case of a camera, the 2D surface the image is projected on is the CCD; in the case of the human eye, it is the retina.
In general, a point from the observed object isn't projected into a single point on the 2D surface. Depending on the geometry of the lens, the depth of the object in the scene, and the distance between the 2D surface and the lens, a beam of light from the object will hit an area on the 2D surface which looks something like a circle, technically called the circle of confusion:

An important physical property of the retina is that it is composed of many individual photo receptors. Depending on the circle of confusion, the projection of a point from the scene will hit a different number of these receptors, and so the bigger the circle of confusion, the blurrier the object appears and the less detail is available for the brain to see. To compensate, the eye continuously adjust the geometry of its lens, minimizing the circle of confusion for the object we're focusing on, and that makes it appear sharp.
A CCD is also composed of many individual photo receptors, and so the effect of the circle of confusion on the resulting image is very similar to the effect it has on the retina. This is what makes depth of field such an important artistic tool in photography: it removes detail from parts of the image depending on their depth, very similarly to the way refocusing the lens of your eye does it normally. So, the brain focuses on the sharp areas of the image by default.
A lot of the depth of field effects you see in movies are done in post production. This means that they aren't a result of the physical properties of the lens the shots were taken with. In other words, they're fake.
Since each frame is affected by the circle of confusion, even sharp objects are not uniformly sharp: with high enough resolution, a small amount of blurring can be seen throughout the image. So, some algorithms for faking depth of field effect in post production simply increase the blurriness present in the image, which already depends "correctly" on the depth.
In computer graphics, raytracing techniques can easily be adapted to simulate lens and the resulting circle of confusion. However, games and other real-time applications use simple polygon rasterization to display images. There is no lens, and there is no circle of confusion. So, not only depth of field has to be faked, but it can not be faked by increasing the blurriness that already exists in the image, since the image is perfectly sharp (more precisely, any blurriness in the image has nothing to do with depth.)
Realtime depth of field effects are done as a post-effect, meaning, they operate on the final 2D image after all 3D rendering and lighting has been completed. The technique is sometimes called a depth blur, because it applies varying amount of blur to all pixels based on their depth. Here is a capture of what this looks like:

Depth blur is typically implemented in 3 passes:
Simple depth blur doesn't produce satisfactory results in general. Consider a foreground object in focus that is in front of a background that is supposed to be blurred. The background pixels that are immediately next to pixels of the foreground object must be blurred, meaning, some of the color from the neighboring objects should bleed into them -- but not from the pixels that are part of the foreground (in-focus) object. To get a correct result, we need to blur with pixels that are behind (deeper than) that object, and the problem is that we don't have those pixels.
Therefore, if we simply blur based on depth, a foreground in-focus object will bleed onto the background. To our eyes, this looks very wrong.
One possible fix is, in the last step where we average the 7 or 8 samples interpolated by their corresponding depth, we could exclude in-focus pixels that are closer than the rest of the samples. The results produced by this simple hack turned out to be quite satisfactory for us.
Here is the final result -- note how the foreground in-focus object does not bleed into the blurred background.


04-24-2011
It's official, we've licensed the Otter UI library to use in Skullgirls.
Why Otter and not Scaleform? Perhaps the most important design decision for any middleware is where it draws the line in the sand: how much it does, and how much it relies on the game to provide. You don't want it to leave a critical and difficult problem for the game to deal with, but you also don't want it doing trivial things, which often makes the middleware unnecessarily rigid and inflexible.
In a nutshell:
Lastly but perhaps most importantly, the Aonyx team has been a pleasure to work with.
04-06-2011
This week a coworker was porting a networking library from Windows to PS3. Once the library was compiling and sort of working, he started plowing through subtle bugs, most due to the different byte ordering.
He showed me a union similar to this:
The idea here is to use the union, bar, to convert foo to network byte order when sending, and (if necessary) back to host byte order when receiving.
Even if we set aside the obvious issues of size (we don't really know that an int is 32 bits) and aliasing rules (if you write x in the union, reading y is undefined behavior), such code is buggy also because when bitfields are used, it is unspecified whether the bits are allocated from the least significant to most significant bytes, or the other way around. The code worked on Windows only because both peers are compiled with the same compiler.
But what caught my eye was the int bitfield of size 1. Clearly, you can store two values in it, and it is safe to assume that one of them is 0. But what is the other value? I wasn't sure.
My thinking was, if the 1-bit two's complement format is used, then the possible values would be -1 and 0. But then again, we're talking about a bitfield, and in fact a 1-bit bitfield, so 0 and 1 also seem logical values.
It turns out, the C standard states that the behavior in this case is implementation-defined.
What's more, if a bitfield is of type unsigned int, then it is unsigned, but unless you explicitly say signed int, it is implementation-defined whether you're going to get a signed or unsigned int!
So, going back to the original 1-bit bitfield value of type int, it is implementation-defined whether its range is [-1,0] or [0,1].
In reality this subtlety is difficult to notice for 1-bit bitfields, because in boolean contexts 0 is false, and both -1 and 1 are true. However, to avoid portability problems, when larger bitfields are used, signed should be specified explicitly if negative values are needed.
01-24-2011
Last year I've started on a 3D math library project called (Boost) LA. As indicated by the parenthesis, my intention is to request an official Boost review. I don't have much time to devote to this project, but this last weekend I've finished a major refactoring based on feedback from the preliminary Boost review, while adding a few missing pieces -- most notably support for quaternions.
LA stands for Linear Algebra, but the initial feedback I got made it clear that the name is incorrect. To make the library domain clearer, I've decided to rename it to QVM, which stands for Quaternions, Vectors and Matrices.
Click boost-qvm.zip to download the source code and tests. This is a very preliminary announcement but of course I welcome all feedback here or privately at emildotchevski@gmail.com. Unfortunately I have not yet updated the documentation; for now, please refer to the original (Boost) LA documentation which is correct semantically, but keep in mind the differences I've outlined below.
Based on feedback, I've removed the operator| "pipe" overload which was used to chain up view proxies. View proxies can be used to access (with zero abstraction penalty) an object "as if" it is of a different type. For example in (Boost) LA you could say:
to multiply column 1 of the matrix m, as if it is a vector, by the scalar s. In (Boost) QVM the following expression achieves the same result:
Also based on feedback, I have consolidated the very many and small (Boost) LA header files into much fewer files. This makes using the library simpler.
A new naming convention helps make header file names shorter:
For example, to get function overloads that can be used with any vector type, you'd #include "boost/qvm/v.hpp", while if you only wanted to work with 2D vector types, you'd #include "boost/qvm/v2.hpp". To get operations between matrices and vectors, you'd #include "boost/qvm/vm.hpp" or "boost/qvm/vm3.hpp" if you needed only the 3D overloads.
Header files defining view proxies are named depending on the type of the input and the output type; for example view proxies that map matrix types to vector types are defined in "boost/qvm/map_mv.hpp".
The main namespace boost::qvm contains all (Boost) QVM types and functions. All function and operator overloads found in boost::qvm which use SFINAE are also available in the new namespace boost::qvm::sfinae. Therefore, while
may introduce ambiguities, it should now be always safe to say
The above directive does not bring any types in scope. The introduced overloads are extremely generic (meaning, any overload that takes a specific type will take precedence) yet are only available for types for which the appropriate boost::qvm::q_traits/v_traits/m_traits is defined.
The main innovation in this library is that it uses SFINAE to define operations which kick-in for any appropriate types, including user-defined types. This is what enables the view proxy functionality, which would have been impractical if each of the matrix or vector types it produces had to define its own operations.
This design needs a way to access quaternion, vector and matrix elements generically. Traditionally, vector and matrix types overload operator[] or operator() to provide element access, which is impossible to do generically since the C++ standard requires them to be member functions.
Using a function such as get<Y> or something similar to access elements is not very practical, so (Boost) LA used operator| overload for that purpose. (Boost) QVM uses operator% instead, because it has higher precedence (refer to this link). For example, in (Boost) QVM, to access the Y of a vector you'd use the following notation:
The same syntax is used for swizzling view proxies, for example:
01-08-2011
Call me arrogant, but when I'm not happy with how something works, besides looking for a better alternative, I start thinking: "now how'd I do this if I were to design it". Usually those thoughts are a total waste of time since I don't have the expertise or the capacity to get to anything tangible, but occasionally I get motivated and start working on problems that are way too ambitious for me to take on.
One of the projects that I've started years ago and never finished is a version control system. The main problem I have with Subversion, Perforce, Git and the like is that they're way too slow for us game developers. In games, besides code, we need to store very many and very large binary blobs. So many that even over a gigabit local network syncing the entire project tree to the latest revision is a major pain.
The net effect is that developers become hesitant to syncing: your local revision of the project is in good working order, why would you bring in changes that might break it? The answer, of course, is that frequent syncing makes merging easier and more reliable. Yes, you might sync to something broken, but why is this a problem? You have a version control system right? You should be able to undo a situation like that in no time!
But you can't. It takes forever to sync! So you keep working, until you're ready to commit. At this point you *should* first sync, but you don't. It takes forever to sync and forever more to undo a bad sync!
Is it possible for syncing to work faster? After all, the version control system already uses binary deltas and you can't easily improve the speed of your network. So it boils down to how many and how big your files are, nothing can be done about that right? Not quite.
Regardless of how large the files are, they are all being created and modified by people; and regardless of how productive people are, in practice all changes they make are highly localized. People work on individual files, and compared to the size of the entire project tree even Photoshop files are tiny on their own.
So, there is no reason whatsoever for the version control system to wait for me to request a sync before it begins to transfer the data. In fact, there is no reason whatsoever to wait for someone to commit a change. As soon as a file is modified locally, that change can be mirrored to all other users who are likely to need it in the future. That way whenever they're ready to sync, the operation would involve moving a bunch of local files. And that *is* fast.
Does such a version control system exist? Is it possible to emulate this behavior with existing software? Is there any reason why such a design is undesirable?
09-07-2010
Lately the Google logo is nice and animated. Or was, depending on when you read this post. Anyway, this is what it looks/looked like:

The colorful balls bounce away from the cursor and that's lots of fun I suppose, unless you're a geek and you notice that OMG this stupid little animation is taking 100% of my CPU time running at no more than 10 frames per second! And that comes from the same company whose product happily announces that it has found more than a billion results in 0.25 seconds. I know, not a fair comparison, but still -- I wonder how many millennia would the search engine need to find a billion matches had it been running on Javascript or whatever advanced, Just-In-Time compiled, Virtual Machine powered, Garbage Collected, uber-secure engine powers the colorful balls.
What's the big deal, you say? It's just a bunch of colorful balls, they'll be gone in a day or two!
I wish the same could be said about Java in general. Whenever I run any Java-powered program on my computer, even when the program is idle, the OS often needs to power up the fan. To collect a dead object perhaps, who knows. The point is, ever since Java's arrival, we've heard the same mantra over and over: "Oh, it's slow now but don't you worry it'll get faster. Soon. Like, tomorrow. JIT FTW!"
I've lost count on how many years I've been hearing that.
But aren't these colorful dots fun anyway? Not if you realize how wasteful and inefficient the whole thing is. It really is a Googleplex slower than it should be. :)
07-26-2010
I stumbled upon a very interesting post on concurrency. The author criticizes the transactional memory approach to simplifying multi-threading programming, compared to classical lock-based multi-threading models. Indeed, it is easy for a novice programmer to create a mess when using locks, the typical feared result being a system that exhibits random, very difficult to diagnose crashes and deadlocks.
What caught my eye is the focus of the author on a single reason why the TM model is flawed: complex software systems do not merely operate on memory.
This strikes me as important because I've observed a similar issue with garbage collection systems and languages. Like multi-threading programming, dealing with memory is difficult and it is easy for novice programmers to end up with a unmanageable mess of dangling pointers and memory leaks. And the solution? Simply don't worry about memory -- use as much as you need whenever you need it, the garbage collector will do its magic and it'll just work.
The problem with garbage collection systems is exactly the same: complex software systems do not merely operate on memory. Indeed, garbage collectors themselves do not manage memory. They manage objects lifetimes.
In the twisted mind of a Java developer we shouldn't worry about when the object will be collected: it can lurk around until an opportune moment arrives when the system has nothing better to do, and then it'll take care of it; and because the only type of resource most objects acquire is memory, most of the time we don't care when the object will be collected.
Yet, in the real world we must deal with all kinds of resources: files, sockets, locks, etc. We can't be sloppy with their acquisition and release. Their lifetime is just as hard to manage as memory and it is just as easy for a novice programmer to create a mess.
Solve that problem with finally and you're back to square one.
07-06-2010
Recently, here at Reverge we have been adding DirectX 10/11 support to some of our internal libraries. In the process we discovered a couple of problems in the new API design.
In DirectX 9 there were three types of "surfaces":
With this system, pure textures can be used by the CPU to efficiently write data to video memory for the GPU to read and render targets can be used to efficiently read data that was written by the GPU.
Textures that are also render targets can not be accessed by the CPU at all, but that's fine because if the CPU needs to read the result of a GPU pass, that pass can render into a render target (that is not a texture) which the CPU can read.
In DirectX 10/11 render targets that are not textures are no longer supported, so while the CPU can efficiently write data for the GPU to read, there is no efficient way for the application to read data written by the GPU. Instead, it is required to copy the data from a render target texture into another (non-render-target) texture which can then be mapped by the CPU for reading
In the good old DirectX 9 days, if the application needed to write data into a texture, it called the LockRect function to get a pointer. The data was written in the memory pointed by the pointer, and then UnlockRect was called.
The nice thing about this API is that it does not specify what kind of memory the returned pointer points. Depending on texture creation flags, the requested access (read/write/write-discard), the current state of the Direct3D device, and the capabilities of the graphics card, the driver might choose to map the texture video memory directly, or it might map a separate temporary video memory buffer, or it might even return a pointer to a system memory buffer which is copied to video memory when UnlockRect is called.
Instead, in DirectX 10 and 11 there seems to be only one choice: video memory is mapped directly if possible; if not, we get an error.
The new semantics are in fact in accord with a major overall architectural shift introduced with DirectX 10. It calls for removing the so-called "API magic": undocumented internal behavior designed to hide inefficiencies and/or lack of capabilities in the hardware. The rationale for avoiding API magic is that it makes application performance inconsistent across different hardware and driver versions.
It is true that depending on the actual behavior of LoctRect, the application performance can vary significantly. That is indeed a downside, but I wonder if the designers of DirectX 10 and 11 are old enough to remember DirectX 3. It also lacked any API magic. The idea was that the application would create execute buffers that would reside on the graphics card, removing all abstraction and maximizing performance. Yet, because the interface was designed with hardware engineers in mind, it was basically unusable. The more abstract DirectX 5 interface was much easier to use yet (surprise!) didn't lead to lower frame rates.
Similarly, the more abstract interface of LockRect in DirectX 9 is more practical than the less sophisticated DirectX 10/11 Map behavior. Sure, in the simplest use cases the difference is negligible, but then again in the simplest cases LockRect wouldn't use API magic either, so the user would not be penalized by it. It is the non-trivial, less frequent yet performance-critical uses that need a more abstract interface.
For example, if the texture is created and locked with the correct flags, DirectX 9 could let the application download new data to a texture even while it is currently used by a shader, which isn't possible in DirectX 10/11. How does that work in DirectX 9? If write-discard locking is requested, the driver could map a separate temporary video memory buffer for access by the application. As soon as the device is done using the texture, the new memory buffer can be associated with the texture at the cost of a pointer swap.
It can be argued that this is done behind the user's back: if that's what the application wanted to do it could do it itself right? The problem is that this type of systems are difficult to program. Besides, even if the user has the skill and resources needed to provide a high-quality implementation, they can't possibly compete with a driver programmer who is not only a specialist but also has the advantage of writing hardware-specific code and is enjoying much lower-level access to the graphics card.