Search

ReCode /

EMIL DOTCHEVSKI'S PROGRAMMING BLOG

(Boost) QVM Update

01-01-2013

Happy new year -- and a new update to the (Boost) QVM library (note, this library has not been reviewed yet and it is not part of Boost.)

Boost QVM provides a set of generic functions for working with quaternions, and vector and matrix types of static size. All operations use a type-traits system, which enables types from other vector/matrix libraries to be integrated in a common type-safe interface.

The current source code and documentation is available here. It contains minor bug fixes, and the following new features:

- view proxies that negate matrix rows and columns (negr/negc)

- view proxies that swap two rows or columns in a matrix (swapr/swapc)

- functions for indexing at run-time of any vector or matrix object of arbitrary size.

Post a comment


Possible Loss Of Data

05-31-2012

If static type checking is useful for detecting bugs, it is an even more powerful refactoring tool. In a well designed program, it is often possible to make a change in an interface, then rely on the compiler to report errors for all places in the code that need to be altered to match. In general, it is A Good Thing to design code that relies on static type checking as much as possible.

But one of the most common problems programmers inflict on themselves is when they try to use the static type checking built in C and C++ compilers to detect bugs due to integer conversions that result in loss of precision.

Let's say in a class called point, we have reasons to store X and Y as shorts rather than ints:

class point
{
  short x_, y_;
public:
  point( short x, short y ):
    x_(x),
    y_(y)
  {
  }
};

void bar( int x, int y )
{
  point p(x,y); //oops?
}

Right, we might have a problem. On most platforms, the int and short types are 32- and 16-bits in size, respectively. This means that the call to the point constructor may result in loss of data.

It would be very nice if we could get the compiler to point out all instances where an int is implicitly converted to a short, so that we can verify that the loss of precision is desirable or at least that it isn't a problem.

So we enable the appropriate warning:

warning C4242: 'argument' : conversion from 'int' to 'short', possible loss of data

What makes this especially desirable is that function (and constructor) declarations are usually separated in a header file. The thinking is, if a careless programmer calls the constructor with int arguments, the compiler will point out the possible problem, the programmer will inspect the code carefully to confirm that the loss of precision is OK, and then... Well,

Then what?

What do you say? Casting? Yes, this situation is a textbook example for using casts:

void bar( int x, int y )
{
  point p((short)x,(short)y); //Trust me, it's OK
}

The trouble is, in the original program, the compiler would have truncated x and y only if they needed to be truncated. The program that uses casts to avoid the warning will truncate x and y even if they don't need to be truncated.

We've changed the semantics of the program in a way that's not just dangerous, it's more dangerous than the potential bug the warning was intended to detect. If during later refactoring we realize that we do need the extra precision of int, the point class will have to change accordingly:

class point
{
  int x_, y_;
public:
  point( int x, int y ):
    x_(x),
    y_(y)
  {
  }
};

void bar( int x, int y )
{
  point p((short)x,(short)y); //Trust me, it's OK
}

And now we don't have a warning but we have a nasty bug, all because we explicitly instructed the compiler to do something bad (cast), something that it's carefully designed not to do!

The source of the bug is the ill-advised attempt to use static type checking to detect what is inherently a run-time error. Converting an int to a short is dangerous only if the run-time value of x exceeds the valid range for short integers. A better approach is to ditch the warning and the cast it comes with, and use assert:

void bar( int x, int y )
{
  assert(x>=std::numeric_limits<short>::min());
  assert(x<=std::numeric_limits<short>::max());
  assert(y>=std::numeric_limits<short>::min());
  assert(y<=std::numeric_limits<short>::max());
  point p(x,y); //Now we *know* we're OK
}

But now... even if we assume that programmers will always assert before calling foo (quite an assumption!) the resulting code is pretty ugly to look at.

To solve the problem, we need to take a step back and look at the original reasons why the point constructor takes shorts instead of ints. Clearly, this is because the values are copied to the members of class point, which are of type short to save memory. Note, however, that the savings come from the point members, not from the constructor itself taking shorts -- the latter was done for "consistency", which A) doesn't make it correct, and B) as we'll see later isn't even consistent.

First, we should make the constructor take int parameters regardless of the type of the (private) members:

class point
{
  short x_, y_;
public:
  point( int x, int y ):
    x_(x),
    y_(y)
  {
    assert(x>=std::numeric_limits<short>::min());
    assert(x<=std::numeric_limits<short>::max());
    assert(y>=std::numeric_limits<short>::min());
    assert(y<=std::numeric_limits<short>::max());
  }
};

This promotes consistency throughout the program, regardless of what type the members of class point are this week. The constructor is also the perfect place to put our asserts, thus completely relieving the caller from the burden of truncating ints to shorts.

Second,

C++ and C are obsessed with ints

Using short or char in any kind of integer expression ends up promoting them to int anyway. This can be illustrated by the following program:

#include <iostream>

int main()
{
  unsigned short x=65535;
  std::cout << sizeof(x) << ", " << sizeof(x+x) << ", " << x+x;
}

The typical output of this program is:

2, 4, 131070

Note that it prints 131070 for x+x, even though x is of type short and (for 16-bit shorts anyway) wouldn't be able to hold that value. That's because x+x is of type int.

As another example, if we pass something like x+1 (which would be pretty common) to a function that takes short, the program will first produce an int which is then truncated to a short as part of the function call (and I'm not aware of a way to make compilers issue a warning in this case, not that it would be desirable anyway.)

The best way to be consistent is to use the consistency of C and C++ to always promote small integer types to int. This means to forget about char and short as function arguments; they are (almost) never justified. The net effect of using them is that the program will contain a lot more places where data gets truncated, which makes it more prone to bugs.

Avoiding char and short function arguments is not ideal, but it is better than using dangerous casts to suppress a retarded warning that should never have been enabled to begin with.

Gregory — 02 June 2012, 04:46

I disagree.

Having Point's constructor take ints then assert then cast just hides information.

Now, your code breaks *eventually* at runtime while the user of the class could have been aware of coordinates range restrictions.

Also, your point about refactoring is contrived as many tutorials, blog posts or text books examples are :) The false promise of OOP: make a class, hide data in private members, provide public accessors. It's just piling up layers of useless code, coding for refactoring scenari that would never happen in real life imho

Emil — 05 June 2012, 23:45

What casts are you talking about, my suggestion doesn't have casts. Removing the asserts won't catch a possible runtime error, but even that is safer than the casts people commonly use to avoid the warning that they think protects them from something.

By the way, notice that standard functions that take or return characters (like getch, tolower, etc.) use int and not char. What'd be the point of returning char anyway? So that foo(getch()) could resolve to a different overload than foo(getch()+1)? :)

Arseny Kapoulkine — 11 June 2012, 14:52

getch() can return EOF. tolower can receive EOF. Oh, and C did not have overloading.

You can make a safe_cast utility function to avoid a hidden bug (assert) and to keep the natural type information.

Emil — 18 June 2012, 13:10

The real reason why getch() et al. communicate in terms of ints is that there is no static type checking in K&R C.

Anyway, do I understand correctly that you're proposing to use some kind of safe_cast every time you call a function that takes a short? Like:

void foo( short x );

void bar( int x )
{
  foo(safe_cast<short>(x));
}

?

Post a comment


Depth Of Field

08-04-2011

In prototyping for a new video game, here at Reverge we needed to implement a depth of field effect in our engine. It turned out to be an interesting technical challenge so I decided to mention it here.

Background

In photography, depth of field is an artistic effect that makes objects in the picture appear sharper or blurrier depending on their depth (how far away they are from the camera.) Here are a couple of examples (images courtesy of Wikipedia):

The following diagram shows the depth of field graphically, however keep in mind that there isn't a sudden drop in sharpness: in any photograph, only a single plane is in perfect focus -- at all other depths, the sharpness varies smoothly.

The blurriness is an artifact of projecting the scene onto a 2D surface using a lens. In the case of a camera, the 2D surface the image is projected on is the CCD; in the case of the human eye, it is the retina.

In general, a point from the observed object isn't projected into a single point on the 2D surface. Depending on the geometry of the lens, the depth of the object in the scene, and the distance between the 2D surface and the lens, a beam of light from the object will hit an area on the 2D surface which looks something like a circle, technically called the circle of confusion:

An important physical property of the retina is that it is composed of many individual photo receptors. Depending on the circle of confusion, the projection of a point from the scene will hit a different number of these receptors, and so the bigger the circle of confusion, the blurrier the object appears and the less detail is available for the brain to see. To compensate, the eye continuously adjust the geometry of its lens, minimizing the circle of confusion for the object we're focusing on, and that makes it appear sharp.

A CCD is also composed of many individual photo receptors, and so the effect of the circle of confusion on the resulting image is very similar to the effect it has on the retina. This is what makes depth of field such an important artistic tool in photography: it removes detail from parts of the image depending on their depth, very similarly to the way refocusing the lens of your eye does it normally. So, the brain focuses on the sharp areas of the image by default.

Depth of field in movie production

A lot of the depth of field effects you see in movies are done in post production. This means that they aren't a result of the physical properties of the lens the shots were taken with. In other words, they're fake.

Since each frame is affected by the circle of confusion, even sharp objects are not uniformly sharp: with high enough resolution, a small amount of blurring can be seen throughout the image. So, some algorithms for faking depth of field effect in post production simply increase the blurriness present in the image, which already depends "correctly" on the depth.

Depth blur

In computer graphics, raytracing techniques can easily be adapted to simulate lens and the resulting circle of confusion. However, games and other real-time applications use simple polygon rasterization to display images. There is no lens, and there is no circle of confusion. So, not only depth of field has to be faked, but it can not be faked by increasing the blurriness that already exists in the image, since the image is perfectly sharp (more precisely, any blurriness in the image has nothing to do with depth.)

Realtime depth of field effects are done as a post-effect, meaning, they operate on the final 2D image after all 3D rendering and lighting has been completed. The technique is sometimes called a depth blur, because it applies varying amount of blur to all pixels based on their depth. Here is a capture of what this looks like:

Depth blur is typically implemented in 3 passes:

1. The scene is rendered into a render target that can also be used as a texture, capturing the depth of each pixel either in the alpha channel or in a separate render target.
2. The captured 2D image is resampled to a lower resolution, and the result from that is processed by a 2D gaussian kernel (note, this processing includes the depth values captured in the previous step, not only the RGB component.)
As a side note, the 2D gaussian can be trivially decomposed into two 1D passes. This reduces the number of texture fetches significantly: for example a 5x5 2D gaussian requires 25 reads whereas decomposing it to two 1D gaussians requires only 10 reads total (5+5). Less obvious is that bilinear interpolation (which the texture samplers support for free!) can be used to further reduce the number of reads while still producing a mathematically correct result. The technique is described in this excellent article.
3. The normal and lower resolution versions of the image are bound as textures, and processed by a pixel shader which samples both of them, producing an output by linearly interpolating the samples based on the corresponding depth values. This is where the effect of the circle of confusion can be simulated: we take a fixed number of samples (for example, 7 or 8) arranged in a circle that is scaled appropriately for both images.
A second caution, keep in mind that due to mipmapping, objects that are closest to the camera typically appear at maximum sharpness in video games, which offsets the blurring of up-close objects that we need. This can be controlled by carefully lowering the maximum resolution in the textures.

A final refinement

Simple depth blur doesn't produce satisfactory results in general. Consider a foreground object in focus that is in front of a background that is supposed to be blurred. The background pixels that are immediately next to pixels of the foreground object must be blurred, meaning, some of the color from the neighboring objects should bleed into them -- but not from the pixels that are part of the foreground (in-focus) object. To get a correct result, we need to blur with pixels that are behind (deeper than) that object, and the problem is that we don't have those pixels.

Therefore, if we simply blur based on depth, a foreground in-focus object will bleed onto the background. To our eyes, this looks very wrong.

One possible fix is, in the last step where we average the 7 or 8 samples interpolated by their corresponding depth, we could exclude in-focus pixels that are closer than the rest of the samples. The results produced by this simple hack turned out to be quite satisfactory for us.

Here is the final result -- note how the foreground in-focus object does not bleed into the blurred background.

gemicha — 18 December 2012, 14:00

Post a comment


Otter

04-24-2011

It's official, we've licensed the Otter UI library to use in Skullgirls.

Why Otter and not Scaleform? Perhaps the most important design decision for any middleware is where it draws the line in the sand: how much it does, and how much it relies on the game to provide. You don't want it to leave a critical and difficult problem for the game to deal with, but you also don't want it doing trivial things, which often makes the middleware unnecessarily rigid and inflexible.

In a nutshell:

  • Otter does not use Flash, so it comes with its own animation editor.
  • Otter doesn't have a script of its own, which means that the UI logic can be implemented in C++ or in Lua. This removes the need for a separate Flash interpreter in your game.
  • Otter's runtime (the "player", in Flash terms) does not do any rendering. Instead, it just tells your game what textures to load and what polygons to draw -- which makes it very portable. Besides, it integrates seamlessly in your engine's shading framework.

Lastly but perhaps most importantly, the Aonyx team has been a pleasure to work with.

Post a comment


Bitfield Intricacies

04-06-2011

This week a coworker was porting a networking library from Windows to PS3. Once the library was compiling and sort of working, he started plowing through subtle bugs, most due to the different byte ordering.

He showed me a union similar to this:

struct foo
{
  int a:1;
  int b:31;
};

union bar
{
  foo x;
  int y;
};

The idea here is to use the union, bar, to convert foo to network byte order when sending, and (if necessary) back to host byte order when receiving.

Even if we set aside the obvious issues of size (we don't really know that an int is 32 bits) and aliasing rules (if you write x in the union, reading y is undefined behavior), such code is buggy also because when bitfields are used, it is unspecified whether the bits are allocated from the least significant to most significant bytes, or the other way around. The code worked on Windows only because both peers are compiled with the same compiler.

But what caught my eye was the int bitfield of size 1. Clearly, you can store two values in it, and it is safe to assume that one of them is 0. But what is the other value? I wasn't sure.

My thinking was, if the 1-bit two's complement format is used, then the possible values would be -1 and 0. But then again, we're talking about a bitfield, and in fact a 1-bit bitfield, so 0 and 1 also seem logical values.

It turns out, the C standard states that the behavior in this case is implementation-defined.

What's more, if a bitfield is of type unsigned int, then it is unsigned, but unless you explicitly say signed int, it is implementation-defined whether you're going to get a signed or unsigned int!

So, going back to the original 1-bit bitfield value of type int, it is implementation-defined whether its range is [-1,0] or [0,1].

In reality this subtlety is difficult to notice for 1-bit bitfields, because in boolean contexts 0 is false, and both -1 and 1 are true. However, to avoid portability problems, when larger bitfields are used, signed should be specified explicitly if negative values are needed.

Post a comment


(Boost) LA becomes (Boost) QVM

01-24-2011

Last year I've started on a 3D math library project called (Boost) LA. As indicated by the parenthesis, my intention is to request an official Boost review. I don't have much time to devote to this project, but this last weekend I've finished a major refactoring based on feedback from the preliminary Boost review, while adding a few missing pieces -- most notably support for quaternions.

LA stands for Linear Algebra, but the initial feedback I got made it clear that the name is incorrect. To make the library domain clearer, I've decided to rename it to QVM, which stands for Quaternions, Vectors and Matrices.

Click boost-qvm.zip to download the source code and tests. This is a very preliminary announcement but of course I welcome all feedback here or privately at emildotchevski@gmail.com. Unfortunately I have not yet updated the documentation; for now, please refer to the original (Boost) LA documentation which is correct semantically, but keep in mind the differences I've outlined below.

Operator| "pipe" overload gone

Based on feedback, I've removed the operator| "pipe" overload which was used to chain up view proxies. View proxies can be used to access (with zero abstraction penalty) an object "as if" it is of a different type. For example in (Boost) LA you could say:

m|col<1> *= s;

to multiply column 1 of the matrix m, as if it is a vector, by the scalar s. In (Boost) QVM the following expression achieves the same result:

col<1>(m) *= s;

Simplified header files

Also based on feedback, I have consolidated the very many and small (Boost) LA header files into much fewer files. This makes using the library simpler.

A new naming convention helps make header file names shorter:

  • "q" stands for quaternion, "v" stands for vector, and "m" stands for matrix.
  • numbers indicate dimensionality, so 2 means 2D, 3 means 3D, etc.

For example, to get function overloads that can be used with any vector type, you'd #include "boost/qvm/v.hpp", while if you only wanted to work with 2D vector types, you'd #include "boost/qvm/v2.hpp". To get operations between matrices and vectors, you'd #include "boost/qvm/vm.hpp" or "boost/qvm/vm3.hpp" if you needed only the 3D overloads.

Header files defining view proxies are named depending on the type of the input and the output type; for example view proxies that map matrix types to vector types are defined in "boost/qvm/map_mv.hpp".

New namespace sfinae

The main namespace boost::qvm contains all (Boost) QVM types and functions. All function and operator overloads found in boost::qvm which use SFINAE are also available in the new namespace boost::qvm::sfinae. Therefore, while

using namespace boost::qvm;

may introduce ambiguities, it should now be always safe to say

using namespace boost::qvm::sfinae;

The above directive does not bring any types in scope. The introduced overloads are extremely generic (meaning, any overload that takes a specific type will take precedence) yet are only available for types for which the appropriate boost::qvm::q_traits/v_traits/m_traits is defined.

Element access

The main innovation in this library is that it uses SFINAE to define operations which kick-in for any appropriate types, including user-defined types. This is what enables the view proxy functionality, which would have been impractical if each of the matrix or vector types it produces had to define its own operations.

This design needs a way to access quaternion, vector and matrix elements generically. Traditionally, vector and matrix types overload operator[] or operator() to provide element access, which is impossible to do generically since the C++ standard requires them to be member functions.

Using a function such as get<Y> or something similar to access elements is not very practical, so (Boost) LA used operator| overload for that purpose. (Boost) QVM uses operator% instead, because it has higher precedence (refer to this link). For example, in (Boost) QVM, to access the Y of a vector you'd use the following notation:

v%Y

The same syntax is used for swizzling view proxies, for example:

v%YXZ; //Swaps the X and Y of a 3D vecttor
v%YXZ1; //Maps a 3D vector as 4D vector, swapping X and Y and adding the constant 1 as the last element.

Alex — 24 January 2011, 20:10

Looks good, but the library is missing quaternion to rotation matrix conversions.

Also, slerp looks like it'll have precision issues when nm is really close to 0. I would recommend a fourth optional parameter that specifies at which point to just do linear interpolation. It should default to something like 0.01 though.

Not a fan of the % notation either. At least | looked better despite its precedence issues. Also % has precedence issues as well, in expressions like "5 * v%X".

Fairly minor stuff. I was rolling my own quaternion stuff, but now I'll probably just switch over to qvm.

Emil — 25 January 2011, 00:53

Conversions between types are handled by the make function template. As the first template parameter you specify the type you want to make, for example make<my_quat>(m), where my_quat is a quaternion type and m's type is a 3x3 matrix. See libs/qvm/test/make_test.cpp.

I hear you about the slerp precision problems, I'll fix it. I also agree that the % notation sucks but do we have a better alternative? I guess we can go back to | (for element access only.)

Thanks for the feedback!

Alex — 25 January 2011, 20:28

Whoops, I missed the make<my_mat>(my_quat) specialization. Cool!

As for the % notation. Personally I prefer the old | notation. However, I'm sidestepping the issue by overloading the [] operator in my custom vector class and the () operator in my custom matrix class for element access. With more effort, I could also support v[YXZ], but I don't currently need swizzling so I've not implemented all the overloads necessary for that.

I've switched my project over from boost::la and my own quaternion stuff (which was quite similar, but not as nicely integrated) to qvm today. Turned out to be an easy transition :)

This is a really good library, hope it makes it into boost!

Daniel — 15 February 2011, 01:12

You say that the function overloads using SFINAE are so generic that it is save to `using namespace boost::qvm::sfinae`. If that is really the case, what is the reason for not putting them directly into the global namespace?

Emil — 15 February 2011, 14:49

The reason has to do with hiding. For example, any operator* in a user-defined namespace N would hide a global namespace operator* for scopes within namespace N. To make the overload resolution consider an unrelated (in ADL terms) operator*, it needs to be brought in scope (and you can't specify the global namespace in a using directive.)

Realistically though, namespace sfinae gets undue attention in this post as well as in the documentation. By default, users should do using namespace boost::qvm. In case this introduces ambiguities, most likely they'd deal with them by qualifying the names; using boost::qvm::sfinae is simply another option which might be more convenient.

Post a comment


Version Control

01-08-2011

Call me arrogant, but when I'm not happy with how something works, besides looking for a better alternative, I start thinking: "now how'd I do this if I were to design it". Usually those thoughts are a total waste of time since I don't have the expertise or the capacity to get to anything tangible, but occasionally I get motivated and start working on problems that are way too ambitious for me to take on.

One of the projects that I've started years ago and never finished is a version control system. The main problem I have with Subversion, Perforce, Git and the like is that they're way too slow for us game developers. In games, besides code, we need to store very many and very large binary blobs. So many that even over a gigabit local network syncing the entire project tree to the latest revision is a major pain.

The net effect is that developers become hesitant to syncing: your local revision of the project is in good working order, why would you bring in changes that might break it? The answer, of course, is that frequent syncing makes merging easier and more reliable. Yes, you might sync to something broken, but why is this a problem? You have a version control system right? You should be able to undo a situation like that in no time!

But you can't. It takes forever to sync! So you keep working, until you're ready to commit. At this point you *should* first sync, but you don't. It takes forever to sync and forever more to undo a bad sync!

Is it possible for syncing to work faster? After all, the version control system already uses binary deltas and you can't easily improve the speed of your network. So it boils down to how many and how big your files are, nothing can be done about that right? Not quite.

Regardless of how large the files are, they are all being created and modified by people; and regardless of how productive people are, in practice all changes they make are highly localized. People work on individual files, and compared to the size of the entire project tree even Photoshop files are tiny on their own.

So, there is no reason whatsoever for the version control system to wait for me to request a sync before it begins to transfer the data. In fact, there is no reason whatsoever to wait for someone to commit a change. As soon as a file is modified locally, that change can be mirrored to all other users who are likely to need it in the future. That way whenever they're ready to sync, the operation would involve moving a bunch of local files. And that *is* fast.

Does such a version control system exist? Is it possible to emulate this behavior with existing software? Is there any reason why such a design is undesirable?

Andrew Armstrong — 11 January 2011, 09:05

That sounds awesome, especially if you used multicast somehow (because it's background it'd not matter if it lost some packets here and there as long as it does file checks), which would cut down on network congestion too, even for large files.

As long as hard drives were large, there would hardly be any reason not to do that, as long as there are 2 local syncs - the active one you're using and the background one (containing changed files only presumably!).

Would be fun to know if it exists anywhere already for sure.

Geoff — 15 June 2011, 11:13

This sounds like a suggestion to implement shared network drives as version control.

You would also have to solve the problem of people modifying one (potentially) large file multiple times in quick succession, possibly causing network congestion and beyond that; read/write errors or lock contention due to attempting to update the file while the file is still being broadcast to other computers over the network. Your best bet would probably instead be to sync files in an automated manner at scheduled times throughout the day (varying frequency by file size if desired).

Post a comment


Google Animation

09-07-2010

Lately the Google logo is nice and animated. Or was, depending on when you read this post. Anyway, this is what it looks/looked like:

The colorful balls bounce away from the cursor and that's lots of fun I suppose, unless you're a geek and you notice that OMG this stupid little animation is taking 100% of my CPU time running at no more than 10 frames per second! And that comes from the same company whose product happily announces that it has found more than a billion results in 0.25 seconds. I know, not a fair comparison, but still -- I wonder how many millennia would the search engine need to find a billion matches had it been running on Javascript or whatever advanced, Just-In-Time compiled, Virtual Machine powered, Garbage Collected, uber-secure engine powers the colorful balls.

What's the big deal, you say? It's just a bunch of colorful balls, they'll be gone in a day or two!

I wish the same could be said about Java in general. Whenever I run any Java-powered program on my computer, even when the program is idle, the OS often needs to power up the fan. To collect a dead object perhaps, who knows. The point is, ever since Java's arrival, we've heard the same mantra over and over: "Oh, it's slow now but don't you worry it'll get faster. Soon. Like, tomorrow. JIT FTW!"

I've lost count on how many years I've been hearing that.

But aren't these colorful dots fun anyway? Not if you realize how wasteful and inefficient the whole thing is. It really is a Googleplex slower than it should be. :)

OvermindDL1 — 19 September 2010, 17:52

Actually it takes 7% of my CPU and runs a *lot* faster then 10fps, but then I am running Firefox4 with its new javascript JIT, what are you using?

Yang Zhang — 29 September 2010, 01:01

sonicth — 05 April 2011, 13:10

java or javascript its all the same, about collecting of dead objects i guess ;)

a faster gc language is objective-c, or even c#

Post a comment


It's Not Just About Memory!

07-26-2010

I stumbled upon a very interesting post on concurrency. The author criticizes the transactional memory approach to simplifying multi-threading programming, compared to classical lock-based multi-threading models. Indeed, it is easy for a novice programmer to create a mess when using locks, the typical feared result being a system that exhibits random, very difficult to diagnose crashes and deadlocks.

What caught my eye is the focus of the author on a single reason why the TM model is flawed: complex software systems do not merely operate on memory.

This strikes me as important because I've observed a similar issue with garbage collection systems and languages. Like multi-threading programming, dealing with memory is difficult and it is easy for novice programmers to end up with a unmanageable mess of dangling pointers and memory leaks. And the solution? Simply don't worry about memory -- use as much as you need whenever you need it, the garbage collector will do its magic and it'll just work.

The problem with garbage collection systems is exactly the same: complex software systems do not merely operate on memory. Indeed, garbage collectors themselves do not manage memory. They manage objects lifetimes.

In the twisted mind of a Java developer we shouldn't worry about when the object will be collected: it can lurk around until an opportune moment arrives when the system has nothing better to do, and then it'll take care of it; and because the only type of resource most objects acquire is memory, most of the time we don't care when the object will be collected.

Yet, in the real world we must deal with all kinds of resources: files, sockets, locks, etc. We can't be sloppy with their acquisition and release. Their lifetime is just as hard to manage as memory and it is just as easy for a novice programmer to create a mess.

Solve that problem with finally and you're back to square one.

Steve — 01 August 2010, 08:05

I recently attended a code review of a rather convoluted C++ scheme (with timers, threads, and smart pointers) for acquiring and releasing a resource in a timely manner. I saw that everything of interest happened between a pair of curly braces and suggested wrapping the code (in this case a bunch of ODBC) with a simple stack-based object (RAII). I was taken aback by their response that this was 'Java-based thinking' and that I was trying to 'do garbage collection in C++', revealing no understanding of the difference between eventual vs. deterministic freeing of resources.

Emil — 03 August 2010, 16:17

Right, this is typical for programmers who learned C or C++ after they've had some experience writing Java-style code. Even the ones who do understand the semantics of local objects in C++ don't use them because they feel alien to them.

And so they use smart pointers to solve an artificial problem they created themselves. :)

OvermindDL1 — 16 August 2010, 16:31

I would prefer to handle memory like I handle files and sockets and such in C++ as I would handle them in Erlang. Erlang has no memory sharing between threads, that vastly simplifies things and allows Erlang to perform a variety of optimizations. You should take a look at that and write a blog about your thoughts (not the language of Erlang, prolog-like, but the style it operates in, pure Actor-oriented).

Post a comment


From DirectX 9 to DirectX 10/11

07-06-2010

Recently, here at Reverge we have been adding DirectX 10/11 support to some of our internal libraries. In the process we discovered a couple of problems in the new API design.

Gone: CreateRenderTarget

In DirectX 9 there were three types of "surfaces":

  • Pure textures, which the GPU can only read,
  • Render targets, which the GPU can only write (render) into, and
  • Textures that are also render targets, which the GPU can read and write.

With this system, pure textures can be used by the CPU to efficiently write data to video memory for the GPU to read and render targets can be used to efficiently read data that was written by the GPU.

Textures that are also render targets can not be accessed by the CPU at all, but that's fine because if the CPU needs to read the result of a GPU pass, that pass can render into a render target (that is not a texture) which the CPU can read.

In DirectX 10/11 render targets that are not textures are no longer supported, so while the CPU can efficiently write data for the GPU to read, there is no efficient way for the application to read data written by the GPU. Instead, it is required to copy the data from a render target texture into another (non-render-target) texture which can then be mapped by the CPU for reading

D3D9 LockRect vs. D3D10/11 Map

In the good old DirectX 9 days, if the application needed to write data into a texture, it called the LockRect function to get a pointer. The data was written in the memory pointed by the pointer, and then UnlockRect was called.

The nice thing about this API is that it does not specify what kind of memory the returned pointer points. Depending on texture creation flags, the requested access (read/write/write-discard), the current state of the Direct3D device, and the capabilities of the graphics card, the driver might choose to map the texture video memory directly, or it might map a separate temporary video memory buffer, or it might even return a pointer to a system memory buffer which is copied to video memory when UnlockRect is called.

Instead, in DirectX 10 and 11 there seems to be only one choice: video memory is mapped directly if possible; if not, we get an error.

The new semantics are in fact in accord with a major overall architectural shift introduced with DirectX 10. It calls for removing the so-called "API magic": undocumented internal behavior designed to hide inefficiencies and/or lack of capabilities in the hardware. The rationale for avoiding API magic is that it makes application performance inconsistent across different hardware and driver versions.

It is true that depending on the actual behavior of LoctRect, the application performance can vary significantly. That is indeed a downside, but I wonder if the designers of DirectX 10 and 11 are old enough to remember DirectX 3. It also lacked any API magic. The idea was that the application would create execute buffers that would reside on the graphics card, removing all abstraction and maximizing performance. Yet, because the interface was designed with hardware engineers in mind, it was basically unusable. The more abstract DirectX 5 interface was much easier to use yet (surprise!) didn't lead to lower frame rates.

Similarly, the more abstract interface of LockRect in DirectX 9 is more practical than the less sophisticated DirectX 10/11 Map behavior. Sure, in the simplest use cases the difference is negligible, but then again in the simplest cases LockRect wouldn't use API magic either, so the user would not be penalized by it. It is the non-trivial, less frequent yet performance-critical uses that need a more abstract interface.

For example, if the texture is created and locked with the correct flags, DirectX 9 could let the application download new data to a texture even while it is currently used by a shader, which isn't possible in DirectX 10/11. How does that work in DirectX 9? If write-discard locking is requested, the driver could map a separate temporary video memory buffer for access by the application. As soon as the device is done using the texture, the new memory buffer can be associated with the texture at the cost of a pointer swap.

It can be argued that this is done behind the user's back: if that's what the application wanted to do it could do it itself right? The problem is that this type of systems are difficult to program. Besides, even if the user has the skill and resources needed to provide a high-quality implementation, they can't possibly compete with a driver programmer who is not only a specialist but also has the advantage of writing hardware-specific code and is enjoying much lower-level access to the graphics card.

OvermindDL1 — 16 August 2010, 16:40

What about how the latest OpenGL handles such things, how does it compare?

Emil — 17 August 2010, 12:14

As far as I know, in terms of video memory access, OpenGL's API only provides memcpy-type functionality.

Post a comment


View all posts