New Blog

After many years of using this website as a kind of blog, I finally decided to set up a separate blog, and I turned to Blogger. Publishing things here involved too much formatting effort and I had become lazy about posting. From now on, I will keep this website to publish my personal works and creations, while the blog will cover much wider topics. The intent of the new blog is to publish my thoughts and findings about GPUs, parallel programming, and computer graphics in general more regularly.

You can follow it here, or use the top menu.

GigaBroccoli: The Mandelbulb into GigaVoxels

Last week I discovered this website: it describes a new way of rendering 3D Mandelbrot fractals using three-component hypercomplex ("triplex") numbers instead of the traditional 4D quaternions. This new function produces far more interesting 3D fractal details and leads to very impressive renderings.

I have implemented this function as a GPU producer inside GigaVoxels, and I am able to render it in real time, as you can see in this video. It is a work in progress, but it already works quite well (around 20 FPS)!

The fractal is computed on the GPU, not during the ray-casting as is usually done, but as voxels stored in an octree.
Voxels are produced on the fly and stored in a cache in video memory so they can be reused while they stay visible. The octree is also subdivided on the fly, with the subdivision triggered directly by the ray-casting kernel. This prevents any occluded data from being generated.
I compute ambient occlusion very efficiently using filtered low-resolution voxels, and soft shadows are computed with secondary rays.
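For readers curious about the fractal function itself, here is a minimal CPU sketch (my own, not the GigaVoxels producer code) of the power-8 "triplex" escape-time iteration z ← z^8 + c, where z^n is evaluated in spherical coordinates as (r^n, nθ, nφ):

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Escape-time test for the power-8 Mandelbulb. All names and thresholds
// here are illustrative choices, not values from the GigaVoxels code.
bool insideMandelbulb(Vec3 c, int maxIter = 32, double bailout = 2.0) {
    Vec3 z = c;
    const double n = 8.0;
    for (int i = 0; i < maxIter; ++i) {
        double r = std::sqrt(z.x*z.x + z.y*z.y + z.z*z.z);
        if (r > bailout) return false;            // escaped: outside the set
        double theta = std::acos(z.z / (r + 1e-12));
        double phi   = std::atan2(z.y, z.x);
        double rn    = std::pow(r, n);
        // z = z^n + c, with z^n computed in spherical form
        z.x = rn * std::sin(n*theta) * std::cos(n*phi) + c.x;
        z.y = rn * std::sin(n*theta) * std::sin(n*phi) + c.y;
        z.z = rn * std::cos(n*theta) + c.z;
    }
    return true;  // never escaped within maxIter: treat as inside
}
```

In the real producer this test is evaluated per voxel (with a distance estimator for smooth surfaces), but the iteration above is the core of it.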


Better quality video files can be downloaded here: GigaBroccoli1.avi GigaBroccoli2.avi

(continue reading…)

GigaVoxels summary report

Martin Wahnschaffe, a Bachelor student at the Technische Universität Braunschweig, has written a very nice summary report on our GigaVoxels I3D paper for a seminar he did about it as part of his computer graphics course. The report is very detailed, nicely written, and very instructive, so if you feel too lazy to read our paper, go read this report!

The report can be read there:

Congratulations to Martin for this very good work!

GigaVoxels Siggraph 2009 Slides

I just came back from one week of vacation in Louisiana that I took right after Siggraph. My Siggraph talk about GigaVoxels went very well, I think; I got interesting feedback and discussions right after the talk, and people seemed very interested. It was the first time I talked about our CUDA implementation and the new cache mechanism fully implemented on the GPU. It is based on scattered visibility information, stream compaction, and an LRU mechanism implemented entirely in GPU memory and managed in a data-parallel manner from CUDA kernels. This way, the CPU's only role is to answer the GPU cache's requests by uploading bricks and constant-area information.
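To give an intuition of the data-parallel LRU idea, here is a heavily simplified CPU sketch (my own simplification, not the actual GPU code): each frame, rays mark the cache slots they touched; a stream compaction then moves untouched slots to the front of the slot list, so new bricks are allocated there first, evicting least-recently-used data:

```cpp
#include <algorithm>
#include <vector>

// Toy LRU cache driven by per-frame usage masks. On the GPU the compaction
// step would be a parallel stream compaction; stable_partition plays that
// role here. All names are illustrative.
struct BrickCache {
    std::vector<int>  slots;   // slot indices ordered by recency (front = LRU)
    std::vector<bool> used;    // usage mask written by the "ray-casting kernel"

    explicit BrickCache(int n) : slots(n), used(n, false) {
        for (int i = 0; i < n; ++i) slots[i] = i;
    }
    void mark(int slot) { used[slot] = true; }

    // End of frame: unused slots float to the front, masks are cleared.
    void compact() {
        std::stable_partition(slots.begin(), slots.end(),
                              [&](int s) { return !used[s]; });
        std::fill(used.begin(), used.end(), false);
    }
    // Allocate the least-recently-used slot and mark it most recent.
    int allocate() {
        int s = slots.front();
        slots.erase(slots.begin());
        slots.push_back(s);
        return s;
    }
};
```

The real mechanism keeps everything in GPU memory and never round-trips through the CPU, but the ordering-by-compaction principle is the same.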

I also demonstrated the cone-tracing approach, using our continuous 3D mip-mapping to implement very efficient soft shadows and depth of field, and I showed examples of scene instancing using a BVH structure ray-traced on the GPU. I think these were the most interesting parts for those who already knew the technique.

For those who are interested and were not able to attend, I have put my slides here:


Siggraph 2009 live report

Day 1: Sunday August 2

I was at High Performance Graphics on Sunday. HPG is the merger of two previous conferences: Graphics Hardware and Interactive Ray Tracing. There were a lot of interesting things, especially on low-level GPU topics and the evolution of real-time graphics toward ray-tracing-based algorithms.

In particular, there was a "Hot 3D" panel where NVIDIA's Austin Robison gave more information on NVIRT, now called OptiX, with very interesting implementation details. In particular, Austin explained a little bit how the computation steps of rays are scheduled on the GPU (in particular in the case of recursive operations), using persistent threads, launched once to maximize MP occupancy and used as a state machine switching between computation steps. More on this on Tuesday at Siggraph. In the same session, James McCombe of Caustic Graphics presented their own real-time ray-tracing API, which seems quite similar to the NVIDIA solution, and Larry Seiler from Intel presented ray tracing on Larrabee. About Larrabee, it appears more and more to me that Larrabee won't be able to compete with GPUs for rasterization applications, but will really be a killer platform for ray tracing.

In the evening, there was the social event of the conference on a steamboat on the Mississippi, very nice :-)

Day 2: Monday August 3

9:30: Still at HPG, the first talk is from Tim Sweeney, founder of Epic Games. Tim exposes his vision of the future of real-time graphics: the end of dedicated GPUs as we currently know them (with dedicated units) and of their graphics APIs, replaced by general computing devices using multi-core vector-processing units to run all code, graphics and non-graphics, uniformly, 100% in software. These would be programmed directly in C++ and would allow a wide variety of new algorithms to be implemented, taking advantage of very good load balancing and of memory latency hiding through large caches to provide high performance. In particular, Tim talked about the use of a REYES rendering pipeline in future video games, and of ray tracing for secondary-ray effects. For Tim, while current video game engines take 3 years to develop, next-generation engines will need 5 years of development, due to the increasing complexity of these graphics engines.

9:45: Just got the news that the OpenGL 3.2 specification has been announced by Khronos. The full spec can be downloaded there and, as usual, NVIDIA announced support in their drivers. The main additions to the spec are:

  • Increased performance for vertex arrays and fence sync
    objects to avoid idling while waiting for resources shared between the
    CPU and GPU, or multiple CPU threads;
  • Improved pipeline programmability, including geometry shaders in the OpenGL core;
  • Boosted cube map visual quality and multisampling rendering flexibility
    by enabling shaders to directly process texture samples.

As well as new extensions: ARB_fragment_coord_convention, ARB_provoking_vertex, ARB_vertex_array_bgra, ARB_depth_clamp, WGL_ARB_create_context (updated to create profiles), GLX_ARB_create_context (updated to create profiles), GL_EXT_separate_shader_objects, GL_NV_parameter_buffer_object2, GL_NV_copy_image.

3:00pm: The first Siggraph session I partially followed was the "Advances in Real-Time Rendering in 3D Graphics and Games" course. It seems that the most interesting part was the talk on Light Propagation Volumes in CryEngine 3 by Kaplanyan, where he presented the very good results they got using VPLs (Virtual Point Lights) and Light Propagation Volumes. "Graphics Techniques From Disney's Pure" by Moore and Jeffries was also interesting, as well as "Making it Smooth: Advances in Antialiasing and Depth of Field Techniques" by Yang. It seems that the slides will be available there in a few days. I didn't see the talk on "Graphics Engine Postmortem from LittleBigPlanet" by Evans, but I was told that it was also very interesting, especially to see how game developers make their own mixture of published techniques and develop ad-hoc hacks that appear to be sufficient in most cases.

16:00: Final HPG panel : Tim Sweeney (Epic Games),
Larry Gritz (Sony Pictures Imageworks),
Steve Parker (NVIDIA), Aaron Lefohn (Intel), Vineet Goel (AMD).

Aaron talked about his vision of a future cross-architecture rendering abstraction: running mostly in user space and made of multiple specialized pipelines (REYES, raster, RT). Steve predicted the death of pure rasterization within 7 years, as well as of pure REYES and RT: future graphics algorithms will be a combination of RT, rasterization, and REYES (blended or unified); the question is how hardware manufacturers will fit into this market.

Day 3 : Tuesday August 4

Not so many things today. In the morning, I went to the Real-Time Global Illumination for Dynamic Scenes course. It was quite interesting and provided a good synthesis of current state-of-the-art GI techniques. I also presented my poster on GigaVoxels at the poster session during the lunch break.

At the end of the day, an interesting session was the OpenCL Birds of a Feather. Mike Houston from ATI presented the OpenCL specification and the model and features it proposes. Mike also explained how OpenCL will be implemented on ATI GPUs, and in particular that, due to the R7xx architecture, developers will need to vectorize their algorithms using vec4, as in shaders, to take advantage of the 5-component SIMD units (which compose each of the 16 cores, themselves also working in SIMD, that make up their equivalent of multiprocessors). He announced the release of an OpenCL implementation from AMD, but only for CPUs! It seems they still don't have a GPU implementation, and I wonder if they are waiting for the release of the Evergreen architecture to provide one. Next were Aaron Lefohn and Larry Seiler, who presented the future Intel implementation. They particularly pushed the task API of OpenCL, which seems to allow task-parallel algorithms to be implemented efficiently, with multiple kernels running concurrently and communicating. This kind of model supposes that different kernels can run on different cores of the GPU, something that is not possible with the current NVIDIA architecture (nor the ATI one, I think). If I understood correctly, it also seems that their first OpenCL implementation will require the programmer to use 16-component vectors to fill the SIMD lanes. Finally, Simon Green presented the NVIDIA implementation; as you know, they are the only ones to offer a working OpenCL implementation right now (and have since December 2008).

Day 4: Wednesday August 5

Interesting session from NVIDIA this afternoon: Alternative Rendering Pipelines on NVIDIA CUDA. Andrew Tatarinov and Alexander Kharlamov presented their work on CUDA implementations of a ray-tracing pipeline as well as a REYES pipeline. Both implementations use persistent threads to improve load balancing and per-MP resource usage, with uber-kernels using dynamic branching to switch between multiple tasks. In addition, the REYES implementation uses work queues, implemented with prefix-sum (scan) operations, to feed the persistent threads with work and to let them communicate. That is, for me, a really awesome model, but the problem is the register usage of such uber-kernels. Ideally, better thread scheduling on the MPs and the ability to launch different threads on each MP should be even more efficient. More details there:

GigaVoxels at Siggraph 2009

Come see my talk on GigaVoxels at Siggraph 2009!

Friday, 7 August | 3:45 PM | Room 260-262

I will present the new GigaVoxels pipeline implemented in CUDA and discuss its integration into video games. I will also present the new results we have obtained, especially the very efficient implementation of soft shadows and depth-of-field effects, thanks to intrinsic properties of the GigaVoxels hierarchical structure and volume mip-mapping mechanism.


Farrarfocus thoughts on GPU evolution

Timothy Farrar's blog is always a very good source of information if you are looking for in-depth thoughts and experiments on GPUs.

Wednesday, Timothy published a post about his vision of the evolution of GPUs in the near future. In particular, Timothy exposes the idea that future GPUs could offer a new, highly flexible mechanism for job distribution based on generic hardware-managed queues (FIFOs) associated with kernels.

Current GPUs start threads by scheduling groups of independent jobs between dependent state changes from a master control stream (a command buffer filled by the CPU). OpenGL conditional rendering provides a starting point for modifying the task list in this stream on the fly, and DX11 seems to go further with the DispatchIndirect function, which enables DX Compute grid dimensions to come directly from device memory. The idea is that future hardware may provide generic queues that could be filled by kernels and used proactively by the hardware scheduler to set up thread blocks and route data to an available core, starting new thread blocks using the kernel associated with the queue.

Much of the work in parallel processing is related to grouping, moving, and compacting or expanding data, and ends up being a data routing problem. This model seems to provide a very good way to handle grouping for data locality. It would allow kernels that reach a divergence point (such as branch divergence or data locality divergence) to output threads to new queues with a new domain coordinate, ensuring good grouping again for the continued computation. Data associated with a kernel would also be in the queue and managed in hardware, to provide very fast access to thread parameters.
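To illustrate the regrouping idea in miniature (this is purely my own toy, nothing from Timothy's post): when work items diverge, each is emitted into the queue matching the branch it took, and each queue is then processed as a coherent, non-divergent batch:

```cpp
#include <functional>
#include <vector>

// Route items into per-branch queues at the divergence point, then run
// each queue as one coherent batch. branchOf() decides which path an
// item takes; runBatch() stands in for relaunching a kernel on a queue.
void processWithQueues(const std::vector<int>& items,
                       const std::function<int(int)>& branchOf,
                       const std::function<void(const std::vector<int>&)>& runBatch,
                       int numBranches) {
    std::vector<std::vector<int>> queues(numBranches);
    for (int it : items)
        queues[branchOf(it)].push_back(it);   // divergence point: enqueue
    for (const auto& q : queues)
        if (!q.empty()) runBatch(q);          // coherent batch, no divergence
}
```

In hardware, the interesting part is that the enqueue and the batch launch would be handled by the scheduler itself, not by software as here.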

This can be done using a CPU-like coherent cache with a large vector processor, as in Larrabee, but data routing becomes expensive with a coherent cache, which consumes transistors for routing that could have been defined explicitly by the programmer. When you attempt to do all this routing manually with dedicated local memory and high-throughput global memory, it is still expensive, just less so. Timothy's idea is that this mechanism could be heavily hardware-accelerated and could give "traditional" GPUs a big advantage over more generic Larrabee-like architectures. I really think this is the way for GPUs to keep providing high performance to more generic graphics rendering pipelines.

The same idea is developed in a TOG paper that will be presented at Siggraph this year. This paper presents GRAMPS, a programming model that generalizes concepts from modern real-time graphics pipelines by exposing an execution model mixing task parallelism and data parallelism, containing both fixed-function and application-programmable processing stages that exchange data via queues.

NVIDIA OpenGL Bindless Graphics

Yesterday, NVIDIA made public two new OpenGL extensions named NV_shader_buffer_load and NV_vertex_buffer_unified_memory. These new extensions allow OpenGL to be used in a totally new way they call Bindless Graphics. With Bindless Graphics, you can manipulate buffer objects directly through their GPU global memory addresses and control the residency of these objects from the application. This removes the bottleneck of binding objects before being able to use them, which forces the driver to fetch all object states before using or modifying them.

The NV_shader_buffer_load extension provides a mechanism to bind buffer objects to the context in such a way that they can be accessed by reading from a flat, 64-bit GPU address space directly from any shader stage, and to query the GPU addresses of buffer objects at the API level. The intent is that applications can avoid re-binding buffer objects or updating constants between each draw call, and instead simply use a VertexAttrib (or TexCoord, or InstanceID, or…) to "point" to the new object's state.

The NV_vertex_buffer_unified_memory extension provides a mechanism to specify vertex attributes and element array locations using these GPU addresses. Binding vertex buffers is one of the most frequent and expensive operations in many GL applications, due to the cost of chasing pointers and binding objects. With this extension, applications can specify vertex attribute state directly using VBO addresses, which alleviates the overhead of object binds and driver memory management.

NVIDIA provides a small bindless graphics tutorial, with a presentation of the new features.

That seems very useful, but what scares me a little bit is that each time you give the developer lower-level access like this, you greatly reduce the potential for automatic driver optimizations. In particular, I wonder how this mechanism interacts with NVIDIA's SLI mode, which provides automatic scaling of OpenGL applications across multiple GPUs. This mode duplicates data on each GPU and broadcasts drawing commands to all of them, letting them produce different parts of a frame and compose them before display. With these extensions, the same address space has to be maintained on all GPUs involved in SLI drawing, which seems very difficult, especially with heterogeneous SLI configurations.

CUDA visual studio integration

I would just like to share a few tips I found helpful when working with CUDA under Visual Studio.

First, syntax highlighting for .cu files can be enabled with these few steps:

  1. Copy the content of the "usertype.dat" file provided by NVIDIA (NVIDIA CUDA SDK\doc\syntax_highlighting\visual_studio_8) into the "Microsoft Visual Studio 8\Common7\IDE" folder inside your Program Files folder.

  2. Open Visual Studio and go to Tools -> Options. Under the Text Editor -> File Extension tab, register the extension "cu" as a new type.

Visual Studio relies on a feature named IntelliSense to provide function and variable name completion, definition lookup, and similar features. To get IntelliSense working with .cu files, you have to modify a Windows registry key: add the cu and cuh extensions to the "NCB Default C/C++ Extensions" key under the "HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\9.0\Languages\Language Services\C/C++" path. (Thanks to for the tip)

For those using Visual Assist X, you can do the following. First, find the Visual Assist X install directory (X:\Program Files\Visual Assist X\AutoText\latest), then make a copy of Cpp.tpl and rename it to Cu.tpl. Second, open and close Visual Studio (this initializes the Visual Assist X parameters by creating some folders/variables in the registry). Third, open regedit, go to "HKEY_CURRENT_USER\Software\Whole Tomato\Visual Assist X\VANet9", add ".cu;" to the ExtSource key and ".cuh;" to the ExtHeader key. (Thanks to ciberxtrem for the tip)

Finally, build rules that allow .cu files to be compiled easily, without writing the rules manually, can be integrated by installing this little wizard: More details on CUDA build rules can be found on this website:

Larrabee ISA at GDC

Last week, Intel gave two talks about the Larrabee ISA, called Larrabee New Instructions (LRBni).

The most significant thing to note is that Larrabee will expose a vector assembly, very similar to SSE instructions, but operating on 16-component vectors instead of 4. To program this, they will provide C intrinsics whose names look… really weird!

C++ Larrabee prototype library:

Intel provides headers with x86 implementations of these instructions so that developers can start using them now. But I can't imagine anybody using this kind of vector intrinsics to program a data-parallel architecture. As we have seen with SSE, very few programmers ever used the intrinsics directly, and only for very specific parts of their algorithms. So I think these intrinsics will only be used to implement higher-level programming layers, like an OpenCL implementation, which is, to me, a much better and more flexible way to program these architectures.

The scalar model exposed for the G80 through CUDA and the PTX assembly (and that will be exposed by OpenCL) uses scalar operations over scalar registers. In this model, the underlying SIMD architecture is visible through the notion of warps, inside which programmers know that divergent branches are serialized. Inter-thread communication is exposed through the notion of a CTA (Cooperative Thread Array), a group of threads able to communicate through a very fast shared memory. Coalescing rules are given to programmers to let them make the best use of the underlying SIMD architecture, but the model is far more scalable (not restricted to a given vector size) and allows code to be written in a much more natural way than a vector model.
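To make the contrast concrete, here is a toy sketch in plain C++ (entirely my own, not Intel's or NVIDIA's code) of the same computation y = a·x + b written in both styles: a "vector" version where the programmer manages the 16-wide SIMD width and the scalar tail loop explicitly, and a "scalar" per-thread body as one would write it in CUDA, with the widening left to the hardware:

```cpp
#include <vector>
#include <cstddef>

// Vector style: the programmer handles the SIMD width (16) and the tail.
// The inner lane loop stands in for one 16-wide vector instruction.
void saxpb_vector(std::vector<float>& y, const std::vector<float>& x,
                  float a, float b) {
    const std::size_t W = 16;
    std::size_t i = 0;
    for (; i + W <= x.size(); i += W)
        for (std::size_t lane = 0; lane < W; ++lane)
            y[i + lane] = a * x[i + lane] + b;
    for (; i < x.size(); ++i)      // scalar tail for sizes not multiple of 16
        y[i] = a * x[i] + b;
}

// Scalar style: write only the per-element body; "tid" plays the role of
// the thread index, and the hardware groups threads into warps.
void saxpb_scalar_body(float* y, const float* x, float a, float b,
                       std::size_t tid) {
    y[tid] = a * x[tid] + b;
}
```

Both produce identical results; the point is that the scalar body stays unchanged whatever the SIMD width of the hardware, while the vector version bakes the width into the source.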

Even if, for now, Larrabee exposes a vector assembly where the G80 exposes a scalar one, only the programming model varies; the underlying architectures are ultimately very similar. Each Larrabee core can dual-issue instructions to an x86 unit and to 16 scalar processors working in SIMD, which is very similar to a G80 multiprocessor, which can dual-issue instructions to a special unit or to 8 scalar processors working in SIMD over 4 cycles (providing a 32-wide SIMD). Larrabee exposes 16-wide vector registers, where the G80 exposes scalar ones that are in fact aligned parts of a vector memory bank.

The true difference between the two architectures is that Larrabee will implement the whole graphics pipeline using these general-purpose cores (plus dedicated texture units), whereas the G80 still has a lot of highly optimized units and data paths dedicated to graphics operations, connected into a fixed pipeline. The bet Intel is making is that the flexibility provided by the fully programmable pipeline will allow better load balancing, compensating for the architecture's lower efficiency at graphics operations. The major asset they rely on is a binning rasterization model where, after the transform stage, triangles are assigned by screen-tile locality to the cores where all the rasterization, shading, and blending are done. Thanks to this model, they can keep local screen regions per core in dedicated parts of a global L2 cache used for inter-core communication. That should allow efficient programmable blending, for instance. But I think that even they don't know if it will really be competitive for consumer graphics!
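The binning step itself is easy to sketch. In this CPU toy (my own illustration, not Larrabee code), each transformed triangle's screen-space bounding box decides which tiles, and hence which cores, receive it:

```cpp
#include <algorithm>
#include <vector>

struct Tri { float minX, minY, maxX, maxY; };  // screen-space bounding box

// Returns, for each tile, the indices of the triangles binned into it.
// Each bin would then be rasterized, shaded, and blended by one core,
// with the tile's framebuffer region kept in its slice of the L2 cache.
std::vector<std::vector<int>> binTriangles(const std::vector<Tri>& tris,
                                           int screenW, int screenH, int tile) {
    int tx = (screenW + tile - 1) / tile, ty = (screenH + tile - 1) / tile;
    std::vector<std::vector<int>> bins(tx * ty);
    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        int x0 = std::max(0, static_cast<int>(tris[i].minX) / tile);
        int y0 = std::max(0, static_cast<int>(tris[i].minY) / tile);
        int x1 = std::min(tx - 1, static_cast<int>(tris[i].maxX) / tile);
        int y1 = std::min(ty - 1, static_cast<int>(tris[i].maxY) / tile);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[y * tx + x].push_back(i);  // overlapping tiles all get it
    }
    return bins;
}
```

A triangle overlapping several tiles is duplicated into each of their bins, which is the usual price of sort-middle architectures.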

And even on that point, the Larrabee approach is not so different from the G80 approach, where triangles are globally rasterized and then fragments are spread among multiprocessors based on screen tiles for fragment shading; the difference is that z-test and blending are done by fixed ROP units, connected to the MPs via a crossbar.

Finally, with these talks, Intel seems to present as a revolutionary new architecture something that, for the most part, has been here for more than two years now with the G80, together with a programming model that seems really weird compared to the CUDA model. This is even weirder given that Larrabee may not be released before Q1 2010, and by then NVIDIA and ATI will have already released their next-generation architectures, which may look even more similar to Larrabee. With Larrabee, Intel has been feeding the industry a lot of promises, like "it will be x86, so you won't have to do anything particular to use it", which we have always known to be wrong, since by nature the efficiency of a data-parallel architecture comes from its particular programming model. If proof were needed, I think this ISA is it.

Intel GDC presentations:

Larrabee at GDC, PCWATCH review:

Very good article about Larrabee and LRBNi:



OpenGL 3.1 Specifications

The OpenGL 3.1 specification was released at GDC 2009 a few days ago. I think this is the first time in ARB history that a new version of OpenGL has been released so quickly, less than one year after OpenGL 3.0 (at Siggraph last year).

This new revision promotes to the core some remaining G80 features that were not promoted into OpenGL 3.0: Texture Buffer Objects (GL_ARB_texture_buffer_object, a one-dimensional array of texels used as a texture without filtering, equivalent to CUDA linear textures), Uniform Buffer Objects (GL_ARB_uniform_buffer_object, enabling rapid swapping of blocks of uniforms, rapid updates, and sharing across program objects), Primitive Restart (GL_NV_primitive_restart, restarting an executing primitive, which has existed as an extension since the GeForce 6, I think), Instancing (GL_ARB_draw_instanced), and Texture Rectangle (GL_ARB_texture_rectangle). Uniform Buffer Objects have been enhanced quite a lot compared to the original GL_EXT_bindable_uniform; among other things, several buffers can be combined to populate a shader uniform block, and a standard cross-platform data storage layout is proposed.
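For those unfamiliar with primitive restart: conceptually, a sentinel value in the index stream ends the current strip and starts a new one, letting many strips share one draw call. A small CPU sketch (my own, just decoding the index stream as the hardware conceptually would):

```cpp
#include <cstdint>
#include <vector>

// Split an index stream into separate strips at each restart index.
std::vector<std::vector<uint32_t>>
splitStrips(const std::vector<uint32_t>& indices, uint32_t restart) {
    std::vector<std::vector<uint32_t>> strips(1);
    for (uint32_t idx : indices) {
        if (idx == restart) {
            if (!strips.back().empty()) strips.emplace_back();  // end strip
        } else {
            strips.back().push_back(idx);                       // grow strip
        }
    }
    if (strips.back().empty()) strips.pop_back();  // drop trailing empty strip
    return strips;
}
```

In GL the restart index is simply set as state, and the driver/hardware performs this split on the fly during indexed drawing.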

There are also two new "features" that were not available as extensions before. A CopyBuffer API allows fast copies between buffer objects (VBO/PBO/UBO), which will also be useful for sharing buffers with OpenCL. The other feature is signed normalized texture formats, new integer texture formats that represent values in the range [-1.0, 1.0].
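The signed-normalized mapping, as I understand it for an 8-bit snorm component, maps the stored integer i in [-128, 127] to clamp(i/127, -1, 1), so both -128 and -127 decode to -1.0:

```cpp
#include <algorithm>
#include <cstdint>

// Decode one 8-bit signed-normalized component to a float in [-1, 1].
float snorm8ToFloat(int8_t i) {
    float v = static_cast<float>(i) / 127.0f;
    return std::max(-1.0f, std::min(1.0f, v));  // clamps -128 to exactly -1.0
}
```

This is handy for storing normals or tangents compactly without the bias/scale dance needed with unsigned normalized formats.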

Geometry shaders (GL_ARB_geometry_shader4) were not promoted, and maybe they never will be. The extension is not implemented by ATI and is not used much, since this feature is useful only in a few cases (due to implementation performance). Direct state access (GL_EXT_direct_state_access) was not promoted either; it's a very useful extension that reduces the cost of state changes, but it is really new (released with GL 3.0) and I didn't expect it to be promoted yet.

The deprecation model is a design mechanism introduced in GL 3.0 that allows outdated features and commands to be removed (the reverse of the extension mechanism). Core features are first marked as deprecated, then moved to an ARB extension, then eventually to an EXT or vendor extension, or removed entirely. The OpenGL 3.0 specification marks several features as deprecated, including the venerable glBegin/glEnd mechanism, display lists, matrix and attribute stacks, and the portion of the fixed-function pipe subsumed by shaders (lighting, fog, texture mapping, and texture coordinate generation).

Regarding deprecation, the specification is available in two formats, one with deprecated features and one with only "pure" GL 3.1 features. An extension called ARB_compatibility has been introduced; if supported by an implementation, this extension ensures that all deprecated features are available. This mechanism avoids breaking compatibility for old GL applications, keeping every feature in the driver, while cleaning up the API and providing new high-performance paths. It seems to be a good mechanism, more convenient than the initial idea of creating specific contexts. NVIDIA, for instance, promises to keep all deprecated features in their drivers to answer customer needs (mainly CAD customers, I think).

Once again, as with OpenGL 3.0, while ATI only declared that they will support GL 3.1, NVIDIA announced BETA support for GL 3.1 and released drivers:

To conclude, it's good to see Khronos/ARB remaining so active since the release of OpenGL 3.0, and it's good to see OpenGL evolving in the right direction :-)


- The announcement:

- The specifications:

- More informations:,

I3D 2009 and NVIRT

I was in Boston these last few days to attend the I3D (ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games) conference, where I presented our paper on GigaVoxels. The presentation went well and a lot of people seemed to be interested in our method. I like the I3D conference because it's a good opportunity to meet people and to share and discuss everybody's research. In addition, there are usually a lot of GPU guys here ;-)

This year, the invited banquet speaker was Austin Robison from the NVIDIA research group, and he announced, as an exclusive, NVIRT, the NVIDIA ray-tracing engine. NVIRT is a low-level API layered over CUDA, and it seems to use a lot of functionality from NVSG (the NVIDIA scene graph).

The principle is to provide the API with an object scene graph and a ray generator; it then gives you the ray intersections through a traversal black box. It seems to be quite flexible, since intersection shaders can be written, allowing arbitrary shading computations or the launch of secondary rays (for reflections, refractions, or shadows). Shaders are written in CUDA, and the whole API generates PTX assembly. Efficiency strategies can be defined for rays: they can be configured to return the closest intersection or to terminate on the first intersection found, which is useful for shadow computations. Different acceleration structures seem to be available to store objects, like kd-trees for static objects and BVHs (Bounding Volume Hierarchies) for dynamic ones. The SDK seems to have been designed generically, to allow more than just ray-traced rendering (collision detection, illumination, or why not AI). It should come with a lot of samples.
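The closest-hit versus any-hit distinction is worth a tiny illustration. This CPU sketch (names and structure entirely mine, not the NVIRT API) shows why shadow rays are cheaper: "closest hit" must scan every object and keep the nearest intersection, while "any hit" can stop at the first one found:

```cpp
#include <cmath>
#include <vector>

struct Sphere { float cx, cy, cz, r; };
struct Ray    { float ox, oy, oz, dx, dy, dz; };  // direction assumed unit-length

// Distance to the sphere along the ray, or -1 if missed or behind the origin.
float hitT(const Sphere& s, const Ray& ray) {
    float lx = s.cx - ray.ox, ly = s.cy - ray.oy, lz = s.cz - ray.oz;
    float tca = lx*ray.dx + ly*ray.dy + lz*ray.dz;
    float d2  = lx*lx + ly*ly + lz*lz - tca*tca;
    if (d2 > s.r * s.r) return -1.0f;
    float t = tca - std::sqrt(s.r * s.r - d2);
    return (t > 0.0f) ? t : -1.0f;
}

// "Closest hit" query: scan everything, keep the nearest intersection.
float closestHit(const std::vector<Sphere>& scene, const Ray& ray) {
    float best = -1.0f;
    for (const Sphere& s : scene) {
        float t = hitT(s, ray);
        if (t > 0.0f && (best < 0.0f || t < best)) best = t;
    }
    return best;
}

// "Any hit" query: early-out on the first intersection (enough for shadows).
bool anyHit(const std::vector<Sphere>& scene, const Ray& ray) {
    for (const Sphere& s : scene)
        if (hitT(s, ray) > 0.0f) return true;
    return false;
}
```

A real engine would of course traverse an acceleration structure instead of a flat list, but the two query semantics are the same.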

Since it runs on CUDA, it inherits its limitations, like the cost of context switches between the graphics API and CUDA, and the current impossibility of sharing textures or render targets directly. That will at first limit the usability of the API for mixed algorithms, but it seems to be a really cool toy to test ray-tracing algorithms with, and it will give NVIDIA a good black box for enhancing ray-tracing support in their future hardware.

EDIT: It seems it will take a little time before something appears on the NVIDIA website, so since Austin shared his slides with some people present at I3D, Eric Haines put them on his blog:


A first post to talk about my work as a PhD student! I have been doing a PhD for a little more than one year now at INRIA Rhône-Alpes in Grenoble, on the rendering and exploration of very (very!) large voxel scenes. I am working more specifically on GPU voxel ray-casting and ray-tracing, complex GPU data structures, and interactive streaming of large datasets. I am working in close collaboration with Fabrice Neyret, my PhD supervisor, on this project. Sylvain Lefebvre and Elmar Eisemann are also collaborating on it.

My research webpage can be found here:

Voxel representations are useful for semi-transparent phenomena and for rendering advanced visual effects such as accurate reflections and refractions. Such representations also provide faster rendering and higher quality (allowing better and easier data filtering) than triangle-based representations for very complex meshes (typically resulting in one or more triangles per pixel).

The first result of this work was a research report named Interactive GigaVoxels. We now have a paper accepted at I3D 2009 that is better written, more complete, and presents our latest work and results. It introduces a rendering system we have named GigaVoxels, our real-time voxel engine. GigaVoxels is based on a kind of lightweight sparse voxel octree data structure, a fully GPU voxel ray-caster that provides very high quality and real-time rendering, and a data streaming strategy based on visibility information provided directly by the ray-casting. 3D mip-mapping is used to provide very high filtering quality, and the out-of-core algorithm allows virtually unlimited scene resolution.
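The 3D mip-mapping step is conceptually simple: each voxel of a coarser level averages the 2x2x2 block of the finer level, which is what makes filtered, cone-tracing-style lookups cheap. A minimal CPU sketch (my own, for a single density channel):

```cpp
#include <vector>

// One 3D mip level reduction: fine is an n*n*n density volume (n even),
// the result is an (n/2)^3 volume where each voxel averages its 8 children.
std::vector<float> downsample(const std::vector<float>& fine, int n) {
    int h = n / 2;
    std::vector<float> coarse(h * h * h);
    for (int z = 0; z < h; ++z)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < h; ++x) {
                float sum = 0.0f;
                for (int dz = 0; dz < 2; ++dz)
                    for (int dy = 0; dy < 2; ++dy)
                        for (int dx = 0; dx < 2; ++dx)
                            sum += fine[(2*z + dz)*n*n + (2*y + dy)*n + (2*x + dx)];
                coarse[z*h*h + y*h + x] = sum / 8.0f;  // box filter over 8 children
            }
    return coarse;
}
```

In GigaVoxels this filtering happens per brick inside the octree (and with more channels than a single density), but the averaging principle is the same.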

GigaVoxels : Ray-Guided Streaming for Efficient and Detailed Voxel Rendering


OpenCL specification

The first specification of OpenCL (the Open Computing Language) has just been released by Khronos. The specification can be downloaded here:

OpenCL is a general-purpose data-parallel computing API, first initiated by Apple and then standardized within Khronos. It provides a common hardware abstraction layer for programming data-parallel architectures. It is very (very!) close to CUDA and supported by both NVIDIA and AMD, but also Intel and IBM (among others). This API doesn't target GPUs exclusively: we should also see CPU implementations, maybe a Cell implementation, and a Larrabee implementation (when it is out)? NVIDIA is likely to be one of the first to provide an OpenCL implementation, since they already have their own, now well-tested, CUDA API to build on.

In addition to being cross-platform and cross-architecture, one interesting feature of OpenCL is that it has been designed to provide good interoperability with OpenGL, especially for sharing data and textures. It is meant to be the exact competitor to the DX11 compute API.

NVIDIA first OpenGL 3.0 drivers released

I am at the Siggraph OpenGL 3.0 BOF right now, and NVIDIA just gave us the link to their first driver supporting OpenGL 3.0 (with some limitations still).

The driver can be downloaded here : 

I will do a complete post on OpenGL 3.0 in a few days, but as you may know it is not the OpenGL 3.0 we were waiting for, since the complete rebuild of the API have not been done (the main idea of Long Peak). But it don’t seems so bad that we could have been afraid of, a deprecation and profiles mechanism have been introduced and everybody here seems to be confident on its usage and capacity to enhance OpenGL evolution speed.

ATI is also here, and they announced their GL3.0 drivers for Q1 2009… Hum I don’t know why I doubt we will get full functional GL3.0 support for them at this date, even if it is already late compared to NVIDIA.

OpenGL 3.0 signs of life

We have been without news of OpenGL 3.0 for nearly 9 months now. A specification was promised for September 2007 at last year's Siggraph, and then nothing more was heard after an ARB member reported that the spec was not ready because of some unresolved issues they had to address. The situation was starting to become really worrying, and a lot of speculation has been made about the delay (see the forum). The most likely explanation is that disagreements among ARB members delayed the spec release.

But recently we had two comforting pieces of news tending to prove that OpenGL 3.0 is not dead. The first is the creation of the OpenGL 3.0 website, with announcements for the Siggraph OpenGL BOF:

- “OpenGL 3.0 Specification Overview”

- “OpenGL hardware and driver plans – AMD, Intel, NVIDIA”

- “Developer’s perspective on OpenGL 3.0 ”

The second is a leaked picture of an NVIDIA presentation slide about their future driver release (codename Big Bang 2, see the news here) in which one can make out a reference to OpenGL 3.0 support for September. This tends to prove that the specification is now almost done and that the IHVs are already working on implementations.

I will be at Siggraph this year and I will attend the OpenGL BOF; I hope I will be able to get the final spec there!


IEEE Article on G80/G92 architecture

There is a very interesting article on the G80/G92 hardware architecture, just published by NVIDIA engineers (hardware architects) in the March/April edition of IEEE Micro:

NVIDIA Tesla: A Unified Graphics and Computing Architecture

Erik Lindholm, NVIDIA
John Nickolls, NVIDIA
Stuart Oberman, NVIDIA
John Montrym, NVIDIA

The full article can be downloaded here:

Most of the information can be found in, or deduced from, the CUDA manual, but the article also gives a few details about the hardware model that NVIDIA had not provided so directly before. In particular, it mentions fragments being grouped into 2×2 quads before being scheduled into warps, something we had deduced in our G80 hardware analysis. There is also the first publicly available picture of the G80 die (hum, OK, not very clear and in grayscale)!
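The 2×2 grouping is worth spelling out: fragments are rasterized in 2×2 quads (so screen-space derivatives can be computed by differencing within a quad), and a 32-lane warp then packs 8 such quads. Partially covered quads therefore waste "helper" lanes. A toy Python model of this packing (my own simplification, not NVIDIA's actual scheduler):

```python
# Toy model of 2x2 quad packing: every covered pixel drags in its whole
# 2x2 block, so partially covered quads waste "helper" lanes.
# 8 quads (32 lanes) then fill one G80 warp. Not the real scheduler.

def quad_lanes(covered_pixels):
    quads = {(x // 2, y // 2) for (x, y) in covered_pixels}
    lanes = 4 * len(quads)                 # every quad occupies 4 lanes
    helpers = lanes - len(covered_pixels)  # lanes shading nothing useful
    return lanes, helpers

# One isolated pixel still occupies a full 2x2 quad: 4 lanes, 3 helpers.
print(quad_lanes([(5, 5)]))          # (4, 3)

# 8 pixels along a diagonal touch 4 quads: 16 lanes, 8 of them helpers.
diagonal = [(i, i) for i in range(8)]
print(quad_lanes(diagonal))          # (16, 8)
```

Thin or small triangles are the worst case, which is one reason quad-level overhead matters for finely tessellated geometry.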

A vast GPU based supercomputer in france

The French national High-Performance Computing organization (GENCI), which includes among others the French Atomic Energy Authority (CEA), has just ordered from Bull one of the first supercomputers (a large-scale PC cluster) using a hybrid architecture that exploits GPUs as high-performance computational resources. This cluster will be part of the CCRT (the Center for Research and Technology computing) and installed in the French Atomic Energy Authority's Directorate of Military Affairs center (CEA/DAM) at Bruyères-le-Châtel.

This cluster will use 48 NVIDIA Tesla GPU modules that are very likely to be based on the next NVIDIA processor generation, the GT200, especially since this generation will support double precision (64-bit floating point) in hardware (though the question of its cost remains). Each GPU module is said to contain 512 cores, which suggests it will be composed of two GT200 GPUs providing 256 stream processors each. Since I have already heard rumors of a 240-stream-processor configuration for the GT200, it is likely to be composed of 16 multiprocessors of 16 stream processors each. This would allow a GTX version with 15 MPs and an Ultra version with 16 MPs.
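For what it is worth, the rumoured figures above are at least self-consistent, as a quick check shows (all numbers are the speculation from this paragraph, not confirmed specifications):

```python
# Cross-checking the speculated GT200 configurations from the paragraph
# above (rumoured figures, not confirmed specifications).
mps, sps_per_mp = 16, 16
cores_per_gpu = mps * sps_per_mp    # 256 SPs per GT200
module_cores = 2 * cores_per_gpu    # dual-GPU Tesla module -> 512
print(cores_per_gpu, module_cores)  # 256 512
print(15 * sps_per_mp)              # "GTX" cut-down with 15 MPs: 240
```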

I know the CEA/DAM center quite well, since I did an internship there a few years ago, and I know a few guys there who I am sure will be very happy with this very cool new toy! ;-D



CUDA 2.0 Beta

cuda_logo NVIDIA just published a beta version of CUDA 2.0 with its associated SDK. Among other things, it brings Windows Vista support; Linux support should follow soon. The really good news for graphics programmers is that 3D texture access is now supported by the API :-D

Everything can be downloaded here:

GPU Gems 1 Readable Online

A few days ago, NVIDIA released an online version of GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics, the first volume of what I consider one of the best GPU book series.

GPU Gems is a compilation of articles covering practical real-time graphics techniques arising from the research and practice of cutting-edge developers. It focuses on the programmable graphics pipeline available in today’s graphics processing units (GPUs) and highlights quick and dirty tricks used by leading developers, as well as fundamental, performance-conscious techniques for creating advanced visual effects. The contributors and editors, collectively, bring countless years of experience to enlighten and propel the reader into the fascinating world of programmable real-time graphics.

The book can be read here:

Understanding G80 behavior


 —=== Understanding G80 behavior and performances ===—

Cyril Crassin, Fabrice Neyret LJK-INRIA, Grenoble, France

This is an evolving document (corrections, additions): please link it, but don’t copy it !

This is also an open document and we hope to engage discussions on the forum here.

version: March, 8 2008


(continue reading…)

AMD opens R5xx technical informations

amdatiAfter releasing 2D hardware register information for their R5xx and R6xx GPUs a few months ago, AMD/ATI has just published the first bits of open-source 3D programming documentation for their R5xx GPUs (Radeon X1000-class cards). Even if the R5xx is pretty old now, AMD plans to release 3D specifications very soon for the R6xx (Radeon HD 2000-class cards), their latest architecture (supporting SM 4.0 and featuring a unified shader architecture like the G80). This release is part of a larger AMD plan to open up their whole hardware, allowing the development of an open-source driver that fully supports the latest ATI chips (more info here). Since ATI has never managed to deliver correct OpenGL support (nor Linux drivers), this also sounds to me like very good news for OpenGL support on ATI hardware. Who knows, maybe we will get OpenGL 3.0 support on ATI before NVIDIA hardware… Hum, sounds like science fiction!

As with Intel's documentation published a few weeks ago (see here), this documentation gives very interesting details on the hardware architecture, but this time for a "true" high-performance 3D architecture. Among the areas covered are the general command processor, vertex and fragment shaders, Hyper-Z, and the various 3D registers.
In addition to these specifications, AMD should soon release the source of their new proprietary OpenGL implementation, as well as their "TCore" tool, a kind of GPU emulator used internally to test software before chips are available.

The R5xx documentation can be found here:

More information here.

Intel opens GPU technical informations

Intel just publicly released the full technical information and hardware specifications of their unified, Shader Model 4-class i965 and G35 chipsets and graphics processors. These specifications cover all the GPU parts (2D, 3D, and video encoding and decoding) and will allow third-party open-source driver development.
Among other things, they include very detailed and interesting information on the hardware implementation of the graphics pipeline shaders, such as hardware states, internal formats, rasterization policies, and thread batching. Even if Intel does not provide the highest-performance hardware, these specifications are very interesting and give a good idea of how this kind of hardware can work internally.

Full specifications can be found here:


nvidia_logoWladimir Jasper van der Laan released, a few weeks ago, a disassembler for the G80/G90 CUBIN format (CUDA's true hardware binary format, which NVIDIA never documented), and today he released a first version of an assembler for the same format. With these two tools, we now have a complete toolchain allowing very interesting optimizations of the code produced by CUDA.

Cg 2.0 SDK Beta

cg_toolkitOne year after the launch of the G80, NVIDIA has finally published a first beta of its Cg 2.0 SDK, bringing Shader Model 4.0 support. A bit late, now that I have switched to GLSL… Moreover, the compiler version does not even seem more recent than the one included in the driver to compile GLSL ( versus

The SDK can nevertheless be downloaded here.

G80 disassembler

nvidia_logo A Dutch student (Wladimir Jasper van der Laan) has amused himself writing a disassembler that recovers the low-level assembly used directly by the G80 from a CUDA .cubin binary file. Unlike the PTX assembly that CUDA can generate, this assembly really exposes the hardware's ISA (Instruction Set Architecture). It thus becomes possible to know exactly which instructions the GPU executes, which also gives a very precise idea of how it works internally. He is now working on the corresponding assembler, which will make it possible to write G80 assembly programs directly.

The tool can be downloaded here


120407-1Crysis is THE game awaited by every FPS fan and every owner of a big GeForce 8800, and this morning the first demo of Crysis's single-player mode was finally made public. Belonging to both categories, I naturally rushed to the demo and, unsurprisingly, it is rather pretty on the graphics side (although after such a long wait, one cannot help thinking we expected better!). The natural effects, like clouds, water, sun rays, and so on, are not bad, and the physics seems to work rather well.

But what I find really funny about Crysis is the enormous buzz maintained for two years around its use of DirectX 10, and thus the need to run it under Windows Vista (the only operating system supporting that graphics API) to enjoy all the "next-gen" visual effects. Many comparison videos and screenshots have been published (some can be found here, for instance) showing very large rendering differences between the two modes, differences that cannot be justified in any way by the new capabilities of the DirectX 10 API (and thus of G80-generation cards): all of these effects can be achieved with DirectX 9 (and OpenGL 2.0, of course :) and a card supporting Shader Model 3.0. Every informed observer therefore realized that the rendering engine was very clearly crippled in DirectX 9 mode, with the sole purpose of selling both DirectX 10-compatible graphics cards and, of course, an operating system that otherwise sorely lacks selling points. With this demo, we now have confirmation of that crippling.

(continue reading…)

OpenGL 3.0: Presentation

longpeaksAlmost a year ago now, the Khronos Group, which had just been handed control of OpenGL, unveiled the future direction of our favorite API (well, mine at least ;-): code names "Longs Peak" and "Mt. Evans", two revisions planned for 2007 that amount to much more than a simple evolution of the API. Finally, a few weeks ago at SIGGRAPH 2007, "Longs Peak" was officially presented and christened: OpenGL 3.0.

In this article, I will try to give a quick overview of the changes these new versions will bring. So put on your harnesses and climbing shoes, it's a steep climb ;-)

(continue reading…)

NVidia GLSL compilation options

opengl_logoDue to the lack of updates to the Cg toolkit for the G80, I have had to work with GLSL for a couple of months now. One of the main problems I had with GLSL was the lack of control over the compilation process (the parameters usually passed to the Cg compiler). Happily, I have found a series of pragma commands that the NVIDIA GLSL compiler (and only the NVIDIA one) can interpret and use as compilation options. I think some people may be interested, so here are the commands (thanks to Gernot Ziegler for the information):

#pragma optionNV(fastmath on)
#pragma optionNV(fastprecision on)
#pragma optionNV(ifcvt none)
#pragma optionNV(inline all)
#pragma optionNV(strict on)
#pragma optionNV(unroll all)
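For reference, these pragmas go in the shader source itself, typically right after the #version directive. Below is a minimal, untested fragment-shader sketch using two of them (whether they actually change anything depends on the NVIDIA driver in use; other vendors' compilers will ignore or reject them):

```glsl
#version 120
// NVIDIA-only compile options (vendor-specific, not portable GLSL):
#pragma optionNV(fastmath on)   // allow faster, less precise math
#pragma optionNV(unroll all)    // ask the compiler to fully unroll loops

uniform sampler2D tex;

void main() {
    vec4 c = vec4(0.0);
    for (int i = 0; i < 4; i++)  // candidate loop for unrolling
        c += texture2D(tex, gl_TexCoord[0].xy + vec2(float(i) * 0.01, 0.0));
    gl_FragColor = c * 0.25;
}
```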

There are also some environment variables that can be defined (the options typically controlled globally through NVEmulate).

NVidia SDK 10, Forceware 100 XP and Cg 2.0 beta

NVIDIA published version 10 of its SDK a few days ago, offering interesting example OpenGL and DirectX 10 applications that exploit the new capabilities of the G80. The DirectX 10 SDK naturally requires Windows Vista as well as the latest Microsoft DirectX SDK.
Note that it includes a beta version of the Cg 2.0 toolkit, which brings support for the G80's new features.

The SDK can be downloaded here.


Still on the NVIDIA side, the first beta versions of the Forceware release 100 drivers for Windows XP are starting to appear on the internet. However, they seem to be merely XP-modified builds of the equivalent Vista versions. They thus suffer from the same limitations and are still far less mature than the 97.92 releases, for example. On the other hand, they bring support for OpenGL features missing from the release 95 drivers. For instance, I was able to test support for the gl_InstanceID attribute (tied to draw_instanced) in the vertex shader. In theory, they should also allow binding multiple layers of a 3D or layered texture simultaneously via MRT (but I have not tested this).

Forceware 100.95 on Guru3D. The final versions are really taking their time at NVIDIA these days…

UPDATE (29/03/07):
A more recent version of the Forceware 100 drivers for XP (101.02) has been available for a few days. I am using them; they seem more stable and, in my tests, now offer better performance than the release 95 drivers in quite a few applications.
Forceware 101.02 on Guru3D

UPDATE 2 (16/04/07):
New version: Forceware 158.16 XP

Technology Vision and Stream Computing

A short analysis, written a few months ago now, on the evolution of consumer computing toward architectures increasingly dedicated to intensive computation, based on the stream-processing architectures of GPUs.
At the time, ATI seemed the most advanced, with the upcoming R600, CTM, and the acquisition by AMD; since then there have been the G80 and CUDA. In any case, the stakes are more topical than ever, and GPGPU, now rebranded Stream Computing, is increasingly leaving the laboratories and will unquestionably be the technology everyone talks about in the coming months.

I thought of bringing this little article out again, written at the end of the final-year internship during which I worked on these topics, because recently there was an article by David Strom in Information Week on the major technologies of 2007 that mentions GPGPU ( Today, the founder of PeakStream, a start-up offering software solutions for stream computing, also published an interesting article:

(continue reading…)

NVidia G80: OpenGL


While NVIDIA does not yet provide a DirectX 10 driver (and thus a Windows Vista driver, Vista being the only system supporting this version of DirectX, which introduces all the new features), it
already offers a series of OpenGL extensions exposing all of the GPU's new capabilities.

I will quickly describe what seems interesting to me in what I have read of these extensions in NVIDIA's specs ( object/nvidia_opengl_specs.html ).

(continue reading…)

NVidia G80: Architecture


Wednesday saw the launch of the G80, or GeForce 8800, NVIDIA's new GPU. It is the first GPU to support the new DirectX 10 features. For this, it sports a totally new architecture, about which very little information had leaked before its official launch.

I will write a small review of this GPU, starting with a quick description of its architecture. Then I will present what seems interesting to me and what I have discovered so far in the new OpenGL extensions NVIDIA provides with it.

(continue reading…)
