Tuesday, April 26, 2011

??? CuFFT what ARE you doing ???

OK, so here is what I've figured out so far.
  • CuFFT doesn't fail all the time; it fails every OTHER time it's called.
  • When it does fail, which part fails is data dependent; that is, if you break up line 466 of FourierOptics/src/Private/ so that pointAccumulator has its real and imaginary parts updated separately, then comment out the update of the imaginary part only, CuFFT doesn't fail, provided you input only one triangle and it is in a 'goldilocks' size range.  I have yet to find a range of sizes and positions that makes both the real and imaginary parts succeed.  Regardless, this is wrong; it shouldn't fail just because the data is an unexpected size.
I am going to continue investigating this, but won't really be able to do so for another week; other projects demand my time.  However, I talked with Joe Kider and know that he's interested in working on Fourier optics as well; turns out that it is getting interest in the graphics community.  So I'm going to work with him over the summer to see what can be done. 

Monday, April 25, 2011

Rules to live by

Don't upgrade your drivers before a deadline.  You will be unhappy.

As for the movie, it's still uploading to bitbucket.  And it's after midnight now... yay.  Whenever it finishes uploading, I'll post a link to it.  You can pull the code as it is, but I want to figure out WHY cufft is failing on the inverse transform, so I'm going to keep on debugging while the movie slowly (oh, SO slowly) uploads.  If it turns out it's something in my code that wasn't tripping up the old drivers, I'll post that as well.

EDIT It's been 20 minutes, and it's still uploading.  Which is unfortunate, as there is a chance I can use cuda-gdb to figure out what's going on, but to do so, I have to log out and then log back in in console mode, which means interrupting the upload.  At this point, I'm going to go to bed and let the upload take as long as it takes.  The movie will get up there in its own sweet time.

EDIT 2  It finished uploading sometime overnight.  You can download it here.  The movie is cheesy because everything went wrong; I'm going to keep on beating on cuFFT to figure out why it's not doing the right thing, but if I can't figure it out, eventually I'm going to rip it out and use FFTW instead, or maybe look into OpenCL-based implementations.  At the very least, I want to write a wrapper that makes all of the various flavors look the same, so I can have CPU & GPU unit tests that can cross-compare.  With enough time, that should make things work right (I hope).
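The wrapper idea above might look something like the sketch below. This is a hedged, minimal sketch, not the library's actual code: the names `FftEngine`, `NaiveCpuDft`, and `agree` are made up for illustration. The point is that every backend (cuFFT, FFTW, a slow CPU reference) would sit behind one interface, so unit tests can cross-compare any two of them:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical common interface: every FFT backend (cuFFT, FFTW, a CPU
// reference) would implement this, so tests can cross-compare backends.
struct FftEngine {
    virtual ~FftEngine() = default;
    virtual std::vector<std::complex<double>>
    forward(const std::vector<std::complex<double>>& in) = 0;
};

// Reference backend: a naive O(N^2) DFT.  Slow, but trivially correct,
// which is exactly what you want when a fast backend misbehaves.
struct NaiveCpuDft : FftEngine {
    std::vector<std::complex<double>>
    forward(const std::vector<std::complex<double>>& in) override {
        const std::size_t n = in.size();
        const double pi = std::acos(-1.0);
        std::vector<std::complex<double>> out(n);
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                out[k] += in[j] * std::polar(1.0, -2.0 * pi * k * j / n);
        return out;
    }
};

// Cross-comparison helper: two backends agree if every bin matches to
// within a tolerance (never exact equality across different hardware).
bool agree(FftEngine& a, FftEngine& b,
           const std::vector<std::complex<double>>& in, double tol) {
    auto ra = a.forward(in), rb = b.forward(in);
    for (std::size_t i = 0; i < ra.size(); ++i)
        if (std::abs(ra[i] - rb[i]) > tol) return false;
    return true;
}
```

A GPU-backed `FftEngine` would then be a thin class around the cuFFT plan/exec calls, and the CPU & GPU tests reduce to calls to `agree`.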

Meh, onto other projects

Almost at the deadline!

I'm still working out how to display the output of my code.  I've downloaded and installed the cairo graphics package, and unlike AGG, it works.  Now the trick is to figure out its API between now and midnight! :(

Oh well, at least the poster and paper are up:
The only problem I've got when I look at both of those is realizing how much more I want to get done in order for this code to be complete.  It's never-ending...

Saturday, April 23, 2011


In my last post, I mentioned trying to use AGG to visualize my output.  Unfortunately, it doesn't even compile on my computer, which makes displaying the output impossible at the moment.  The other ways I know of are matplotlib (which requires python) and matlab (which I have no idea how to connect to in order to get it to do what I want).  I have some idea of how to do it on a mac, but the mac REALLY DOES NOT LIKE IT when you try to beat it into doing what you want in that way, which means many, many days of work to make that happen.  Right now, I'm fresh out of ideas, so I'm going to work on my poster and video instead.  Maybe I'll think of some nifty visualization of my output later.


I've finally gotten all my CUDA code in place, and I've realized that I've hit a brick wall.  I've written a number of unit tests using googletest, which is a fairly nice unit testing framework, and my code passes all of my tests.  The problem is that it is limited, as all unit testing frameworks are, in that it can really only test for equality well.  This is a big problem for the code I'm writing because after reading through quite a bit of material on the web, I've come to realize that not all cards support IEEE 754 in quite the same way, or as completely, as they should.  That means that even if I had a golden test set to test against, I could fail the test, while still being reasonably accurate.  The only way I may be able to bypass this problem is if I can render my output to a file or to the screen directly, and then eyeball it.  Towards that end, I just downloaded the AGG library, which claims to use only portable C++ to render images to files.  I'm going to briefly look at it, and see if I can get it to do what I want it to do (make pretty pictures) in the amount of time I have left.  However, since this is a brand-new (to me) API, and since I STILL don't have a poster or video up, if I can't get it all working within an hour or less, I'm going to have to abandon it to get my presentation working.
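One way around the exact-equality problem is to compare with a tolerance rather than bit-for-bit. Here's a minimal sketch (the helper name `nearly_equal` and the default tolerances are my own, not from the project): pass if two values agree to within an absolute floor OR a relative bound, so cards with slightly different IEEE 754 behavior can still pass the same golden test.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Tolerance check for cross-hardware float comparisons: accept values
// that agree to within an absolute floor (for results near zero) OR a
// relative bound, instead of demanding bit-exact IEEE 754 results.
bool nearly_equal(double a, double b,
                  double abs_tol = 1e-9, double rel_tol = 1e-6) {
    const double diff = std::fabs(a - b);
    if (diff <= abs_tol) return true;  // both effectively zero
    return diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}
```

For what it's worth, googletest does ship assertions along these lines: EXPECT_NEAR takes an explicit absolute error, and EXPECT_FLOAT_EQ/EXPECT_DOUBLE_EQ compare to within a few ULPs; the real limitation is having a trustworthy golden set, not the framework itself.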

Thursday, April 21, 2011

Dual Quaternions

I spent much of today hacking out code so that I can use dual quaternions, only to realize that there is every chance that they are significantly slower than using homogeneous coordinates.  Why?  Because a dual quaternion requires an 8 x 1 column vector to represent, while a rotation and translation (the same power as a dual quaternion) requires a 4 x 4 matrix; the important point is that the 4 x 4 matrix may be hardware accelerated on the GPU, as mat4 is an OpenGL type.  The only other advantage that dual quaternions have is that they are relatively easy to blend (interpolate & extrapolate) over.  However, for my purposes, there won't be any blending being done; you need to rotate & translate by some fixed amount, and that's it. 
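For the curious, here's what those eight numbers buy you. This is a minimal sketch under my own naming (`Quat`, `DualQuat`), not the code from the repository: a dual quaternion is a real part qr (the rotation) plus a dual part qd = ½·t·qr encoding the translation t, and applying it to a point gives exactly rotate-then-translate.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Quat { double w, x, y, z; };

// Hamilton product of two quaternions.
Quat mul(const Quat& a, const Quat& b) {
    return { a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
             a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
             a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
             a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w };
}
Quat conj(const Quat& q)            { return {q.w, -q.x, -q.y, -q.z}; }
Quat scale(const Quat& q, double s) { return {q.w*s, q.x*s, q.y*s, q.z*s}; }

// A dual quaternion is eight numbers: real part qr (rotation) and dual
// part qd = 0.5 * t * qr, where t = (0, tx, ty, tz) is the translation.
struct DualQuat {
    Quat qr, qd;
    static DualQuat from(const Quat& rot, std::array<double,3> t) {
        Quat tq{0, t[0], t[1], t[2]};
        return { rot, scale(mul(tq, rot), 0.5) };
    }
    // Transform a point: rotate by qr, then add t = 2 * qd * conj(qr).
    std::array<double,3> apply(std::array<double,3> p) const {
        Quat pv{0, p[0], p[1], p[2]};
        Quat r = mul(mul(qr, pv), conj(qr));   // rotation: qr * p * qr^-1
        Quat t = scale(mul(qd, conj(qr)), 2.0); // recover the translation
        return { r.x + t.x, r.y + t.y, r.z + t.z };
    }
};
```

Which is precisely the same result a mat4 multiply gives; the mat4 just gets the hardware on its side.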

For the time being, I'm side-stepping the issue.  Originally, I had the ability to embed frames within other frames, forming a mesh of frames (along with embedding objects within a frame).  Although I still allow all the embedding to happen, I don't walk out to find all the embedded parts, which means that if it isn't embedded in the top-level frame, it isn't found.  This isn't ideal, but it will allow me to get to the heart of the problem more quickly.  Once I have some really good results, I'll revisit this problem, and see what can be done about it.  The more I look at it though, the more I think I'm going to have to go with homogeneous coordinates, and just be done with it.

Saturday, April 16, 2011

Computational Holography

I've been going over the math for Fourier optics carefully for several days now, and have come to realize that what I'm actually doing is computational holography. This may appear surprising at first, but if you think about what I'm trying to do, it'll make sense.  When you create a hologram, what you're doing is recording the interference patterns of light reflected off of an object.  The only difference between what I'm doing and what a holographer does is that I'm not interested in reconstructing the scene, I'm interested in the interference patterns to determine how much power is received at some particular point in space.  I've also come to realize that pure computational holography may be inadequate to fully describe a scene; none of the papers or sources I've read take reflection into account.  That said, I do have a plan to handle reflection, provided I can handle a few, relatively simple cases first.  Basically, I'll do what any good ray-tracing program would do, and let surfaces accumulate light that they re-radiate.  However, that will likely be outside of the scope of what I'm trying to do for my class; it's just something that I need to keep in mind for the future.
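To make the "interference patterns, not scene reconstruction" point concrete, here's a toy sketch (my own simplification, not the project's code): each source contributes a complex field amplitude e^{-ikr}/r at the receiver, and received power is the squared magnitude of the *coherent sum* of those amplitudes, which is where constructive and destructive interference comes from.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Complex field contribution of one unit source at distance r:
// amplitude falls off as 1/r, phase advances as -k*r.
std::complex<double> field_at(double r, double wavelength) {
    const double k = 2.0 * std::acos(-1.0) / wavelength;  // wavenumber
    return std::polar(1.0 / r, -k * r);
}

// Power at a receiver is |sum of complex fields|^2 -- the coherent sum,
// NOT the sum of the individual powers.
double received_power(const std::vector<double>& source_distances,
                      double wavelength) {
    std::complex<double> total{0.0, 0.0};
    for (double r : source_distances) total += field_at(r, wavelength);
    return std::norm(total);  // |total|^2
}
```

Two in-phase sources give four times the power of one (not two times), and two sources half a wavelength out of step nearly cancel; that interference structure is exactly what a hologram records.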

Here are some papers to take a look at:

Friday, April 8, 2011

Memory Management

This is an open reply to Patrick's comment, which was important enough that it really needs to be addressed sooner rather than later.

The architecture I'm putting together for the Fourier optics library allows a mesh of contexts to form, including cycles of contexts.  Patrick pointed out that if you try automatic memory management (garbage collection, reference counting, etc.) you can get into trouble when you have a cycle.  It can get MUCH worse if you include the fact that there may be multiple threads of control, multiple processes, and (worse) a distributed network of machines that have slow, unreliable connections to one another.  In short, we're talking about 'the cloud'... and all the problems that entails.  This is many, many PhD dissertations in itself.  It is not a problem I intend to try to solve myself!

Instead, I'm going to require that users of the library be aware of the objects that they are allocating, and know when they should all be live, and when they are dead.  In short, I'm passing the buck to you, the end user.

The bit of good news is that every object in the framework has its own UUID; you can query any object for its UUID, and you will know that if two objects have the same UUID, then you're really talking about the same object.  In addition, since UUIDs are universally unique, it means that distributed garbage collection (garbage collection across a network of machines) becomes a real possibility.  I am not going to implement this, but if you have the urge to do so, go ahead.
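In code, identity-by-UUID looks something like this sketch (the `ObjectHandle` type and `same_object` helper are hypothetical names, not the framework's API): two handles refer to the same underlying object iff their UUIDs match, regardless of which pointer, process, or machine the handle lives in.

```cpp
#include <cassert>
#include <string>

// Hypothetical handle to a framework object.  The UUID is assigned once
// at creation and never reused.
struct ObjectHandle {
    std::string uuid;
    // ... payload, remote-proxy info, etc.
};

// Identity is a UUID comparison, not a pointer comparison.
bool same_object(const ObjectHandle& a, const ObjectHandle& b) {
    return a.uuid == b.uuid;
}
```

A copy, as discussed below, would be constructed with a fresh UUID, so copies are equivalent objects but never identical ones.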

A note about copying objects

Copying implies different objects.  That means that even if the objects are equivalent, they are two separate chunks of memory.  For this reason, I require that they have separate UUIDs.  This can become extremely confusing if anyone tries to implement proxy objects; after all, the proxy is a stand-in for the real object, so what does it mean to copy the proxy, or to delete it?  While this isn't a big deal if you only have one process, it becomes a much bigger problem if you want something like distributed objects that are shared across the network.  I have no idea how to handle this well.  Again, it's up to you, the end user, to cook up something intelligent. 

Let me know if anyone has a good solution that works across multiple platforms; I'm not touching any of that!

Wednesday, April 6, 2011

Updated API

Bit by bit, I'm getting there with the API.  Here's what I've added:
The two pieces above imply that I'm using a hierarchy, where there is some kind of tree-based system; I'm not.  Two contexts, A & B, can actually have each other as children; that is, A can be a child of B, and B can be a child of A.  There are a couple of reasons for permitting this:
  • Cameras/sensors/antennas/robots can all be placed at the origin of their own reference frame.  This means that if there are 3 robots, A, B, and C, A might want to know what C can see from A's point of view, and B might want to do the same thing from B's point of view.  If we calculate C's output across the whole world from within its own point of view, and then translate this into A & B's point of view, we don't need to waste time/energy in recalculating the whole mess multiple times in a row.
  • If we have some relatively enclosed, and many-times-repeated system, it makes more sense to calculate what it radiates/receives once, and then copy & translate it to new positions.  For example, a really weird illumination source, like a candle within a lantern, may take a long time to calculate out, but once the light leaves the source, it is no longer affected by the glass, plastic, lava lamp fluid, etc. of the lantern.  Save all the illumination information, and replace it with a sphere that radiates differently depending on which point of the sphere is visible.  You still have to trace the rays out of the surface of the sphere, but you don't have to trace how it bounces around inside of the lantern to get out. The downside to this is that every lantern will be a copy of each other, but depending on what you're trying to model, this might be acceptable.
Also note that despite what you might think, it is extremely easy to break cycles; a breadth- or depth-first search with a visited set will handle it without any trouble.
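To show why the cycles really are harmless, here's a minimal sketch (the `Context` struct and `reachable` function are illustrative names, not the library's): a plain BFS with a visited set walks a mesh of contexts exactly once per context, even when A is a child of B and B is a child of A.

```cpp
#include <cassert>
#include <queue>
#include <unordered_set>
#include <vector>

// Toy stand-in for a context in the mesh; children may form cycles.
struct Context {
    int id;
    std::vector<Context*> children;
};

// Breadth-first walk: the visited set guarantees each context is
// enqueued at most once, so cycles can't cause infinite traversal.
std::vector<int> reachable(Context* root) {
    std::vector<int> order;
    std::unordered_set<Context*> seen;
    std::queue<Context*> q;
    q.push(root);
    seen.insert(root);
    while (!q.empty()) {
        Context* c = q.front(); q.pop();
        order.push_back(c->id);
        for (Context* child : c->children)
            if (seen.insert(child).second)  // false if already visited
                q.push(child);
    }
    return order;
}
```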

I've also updated the illumination API slightly; right now, the only things you can change are the total energy and the frequency of the illumination.  In the future, I'm hoping to add the ability to specify ranges of frequencies, but that will require a great deal of time, and a great deal more thought on how to do it well.  Note that the illumination specifies something akin to a single photon or light packet.  That means that one illumination source will be generating many, many of these photon-like objects at one time.  I'm not sure if that is the best way of doing things yet, but it's a start.

Finally, and perhaps most oddly, I've decided to use the GNU MP Library to pass in all scalar information as rational numbers.  Internally, I use floats or doubles, but by having the interface use rationals, I leave open the possibility of arbitrary precision sometime in the future.  At the moment, my plan is to use the rationals to calculate combined rotations (since I can have an arbitrarily deep ancestor/descendant chain, accuracy can be a problem), and then convert the rotations into floats for use on the GPU.   My plan is that if you have a higher precision card (one that uses doubles instead of floats), the precision can automatically scale up.  In short, no built-in limits to my API!
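The exactness argument is easy to demonstrate. Below is a minimal stand-in for GMP's mpq rationals (the real library uses arbitrary-precision integers; `int64_t` here is only for illustration, and the `Rational` type is mine, not the API's): addition and multiplication stay exact, with the single lossy conversion deferred to the very end.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>   // std::gcd (C++17)

// Toy stand-in for GMP's mpq_t: an always-reduced fraction with a
// positive denominator.  Arithmetic is exact; rounding happens only
// when you finally convert for the GPU.
struct Rational {
    std::int64_t num, den;
    Rational(std::int64_t n, std::int64_t d) : num(n), den(d) { reduce(); }
    void reduce() {
        if (den < 0) { num = -num; den = -den; }
        std::int64_t g = std::gcd(num < 0 ? -num : num, den);
        if (g > 1) { num /= g; den /= g; }
    }
    Rational operator+(const Rational& o) const {
        return {num * o.den + o.num * den, den * o.den};  // no rounding
    }
    Rational operator*(const Rational& o) const {
        return {num * o.num, den * o.den};                // no rounding
    }
    // One lossy step, at the end: pick the precision the card supports.
    double to_double() const { return double(num) / double(den); }
    float  to_float()  const { return float(to_double()); }
};
```

Composing a deep chain of rotations in `Rational` arithmetic and converting once at the end is exactly the accuracy argument above: the error is one rounding step instead of one per ancestor.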

Monday, April 4, 2011


Well, my project is approximately 12 days old, but when I mentioned what I was working on at an all-day meeting on Friday, I got immediate interest from 2 different groups, along the lines of 'why isn't it done yet?'.  I knew that there was a need for fast, accurate radio propagation software, but I didn't realize HOW MUCH need there was! 

On to more prosaic matters.

I spent a little while finding a good UUID generation library, which I have.  I'm using the OSSP UUID library for all the reasons I outlined in the wiki (which I updated), and more reasons besides.  I still haven't decided if I should wrap their functions to make it easy to port to other platforms or not, but I think that for the time being I'm not going to bother wrapping them.  I don't have enough time to do so at the moment, and the UUIDs will never be exposed in the public API, so in the worst case, someone is going to foolishly try to extract a UUID, and then get burned when the internals all suddenly change.  I'm fine with that; private APIs should never be linked against, since doing so breaks modularity in a big way.

I also slightly reorganized my code so that everything that is supposed to be private will live in the Private directory from now on.  I can't think of any way to make it more clear that you shouldn't link to stuff down there than that, but I know someone is going to try anyway.

Finally, I think I'm going to have to go to a client/server architecture; I need to cache a lot of stuff on the backend, and it's the only way to get the necessary performance.  I was originally thinking I could do everything in C, but I need fast, well-written containers.  I can use uthash, but I know the STL is supported absolutely everywhere, which means C++.  I don't like it, but it's practical, and it lets me use Boost and Thrust, if I really need them.  I'm going to keep the public API in C though; it makes getting SWIG working easier, if I decide to add it in the future.  Thoughts?
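The C-public-API-over-C++-internals plan sketches out like this (all names here, `fo_scene` and friends, are made up for illustration; they're not the actual API): the header exposes only an opaque handle and free functions with C linkage, so SWIG and plain C callers never see the STL behind it.

```cpp
#include <cassert>
#include <vector>

// --- public header would contain only this C-linkage surface ---
extern "C" {
    typedef struct fo_scene fo_scene;            // opaque handle
    fo_scene* fo_scene_create(void);
    void      fo_scene_add_triangle(fo_scene* s);
    int       fo_scene_triangle_count(const fo_scene* s);
    void      fo_scene_destroy(fo_scene* s);
}

// --- private C++ implementation behind the opaque handle ---
struct fo_scene {
    std::vector<int> triangles;   // STL stays hidden from C callers
};

fo_scene* fo_scene_create(void)              { return new fo_scene; }
void fo_scene_add_triangle(fo_scene* s)      { s->triangles.push_back(0); }
int  fo_scene_triangle_count(const fo_scene* s) {
    return static_cast<int>(s->triangles.size());
}
void fo_scene_destroy(fo_scene* s)           { delete s; }
```

Since the opaque struct is only ever defined on the C++ side, the internals (containers, Boost, Thrust) can change freely without touching the public header.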