General - Site
Saturday, 12 June 2004

Welcome to my personal and technical website. The aim of this website is to publish the various projects I work on as part of my studies, internships or simply in my free time, mainly related to computer graphics, real-time 3D programming and scientific visualisation. The second goal of this website is to provide technical news about GPU-related subjects.
Cyril

TechnoBlog - GPU
Monday, 19 April 2010
After many years of using this website as a kind of blog, I finally decided to create a separate blog, and I turned to Blogger. Publishing here involved too much formatting effort and I became lazy about posting. I will keep this website for publishing my personal work and creations, while the blog will cover much wider topics. The intent of this new blog is to publish my thoughts and findings about GPUs, parallel programming and computer graphics in general more regularly.
You can follow it here: http://blog.icare3d.org

TechnoBlog - GPU
Thursday, 20 August 2009
Martin Wahnschaffe, a Bachelor student at the Technische Universität Braunschweig, has written a very nice summary report on our GigaVoxels I3D paper for a seminar he gave about it as part of his computer graphics course. The report is very detailed, nicely written and very instructive, so if you feel too lazy to read our paper, go read this report!
The report can be read here: http://graphics.tu-bs.de/teaching/seminars/ss09/CG/studentwebsites/MartinWahnschaffe/
Congratulations to Martin for this very good work!

TechnoBlog - GPU
Monday, 17 August 2009
I am back from a week of vacation in Louisiana that I took right after Siggraph. My Siggraph talk about GigaVoxels went very well, I think: I got interesting feedback and discussions right after the talk, and people seemed very interested. It was the first time I talked about our CUDA implementation and the new cache mechanism fully implemented on the GPU. It is based on scattered visibility information, stream compaction and an LRU mechanism implemented entirely in GPU memory and managed in a data-parallel manner from CUDA kernels. This way, the CPU's only role is to serve the GPU cache by uploading bricks and constant-area information.
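As a rough illustration of the kind of cache management described above, here is a minimal CPU-side sketch in Python. It is not the actual CUDA implementation: it only assumes a list of cache slots kept in least-recently-used order, a set of slots flagged as used during rendering, and a stable stream compaction that keeps unused slots at the front so they are recycled first. All names (`compact`, `update_lru`, `serve_requests`) are hypothetical.

```python
def compact(items, keep):
    """Stream compaction: keep, in order, the items whose flag is set."""
    return [item for item, flag in zip(items, keep) if flag]


def update_lru(lru_order, used_this_frame):
    """Return a new LRU order with the least recently used slots first.

    lru_order: slot indices, least recently used first
    used_this_frame: set of slot indices touched during rendering
    """
    flags = [slot in used_this_frame for slot in lru_order]
    unused = compact(lru_order, [not f for f in flags])
    used = compact(lru_order, flags)
    # Unused slots stay at the front: they are recycled first.
    return unused + used


def serve_requests(lru_order, requests):
    """Assign the oldest slots to the requested bricks (the upload step)."""
    n = min(len(requests), len(lru_order))
    assignments = dict(zip(requests[:n], lru_order[:n]))
    # Freshly written slots become the most recently used ones.
    new_order = lru_order[n:] + lru_order[:n]
    return assignments, new_order


order = update_lru([0, 1, 2, 3, 4], used_this_frame={0, 3})
assignments, order = serve_requests(order, ["brick_a", "brick_b"])
```

On a GPU, the two compaction passes would typically be done in parallel with a prefix sum rather than with a sequential loop; the sequential version here only shows the data flow.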
I also demonstrated the cone tracing approach, which uses our continuous 3D mipmapping to implement very efficient soft shadows and depth of field, and I showed examples of scene instancing using a BVH structure ray-traced on the GPU. I think these were the most interesting parts for those who already knew the technique.
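To give an idea of how cone tracing can rely on a continuous 3D mipmap, here is a small Python sketch. It is a simplified illustration under stated assumptions, not the GigaVoxels code: the cone diameter grows linearly with distance, and the fractional mipmap level is chosen so that one voxel at that level is roughly as large as the cone diameter. The parameters `voxel_size` (size of a level-0 voxel) and `max_level` are hypothetical.

```python
import math


def cone_diameter(distance, half_angle):
    """Diameter of a cone of the given half-angle (radians) at a distance."""
    return 2.0 * distance * math.tan(half_angle)


def mip_level(distance, half_angle, voxel_size, max_level):
    """Fractional mipmap level whose voxels match the cone diameter.

    Level 0 has voxels of size voxel_size; each level doubles that size.
    """
    diameter = cone_diameter(distance, half_angle)
    if diameter <= voxel_size:
        return 0.0
    return min(math.log2(diameter / voxel_size), float(max_level))
```

In such a scheme, a wider cone (a larger light source for soft shadows, or a larger lens aperture for depth of field) reaches coarse levels sooner, so blurrier effects read less data.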
For those who are interested and were not able to attend, I have put my slides here:
http://artis.imag.fr/Publications/2009/CNLSE09/GigaVoxels_Siggraph09_Slides.pdf

TechnoBlog - GPU
Monday, 03 August 2009
Day 1: Sunday August 2
I was at High Performance Graphics on Sunday. HPG is the merger of two previous conferences: Graphics Hardware and Interactive Ray Tracing. There were a lot of interesting things, especially on low-level GPU topics and the evolution of real-time graphics towards ray-tracing-based algorithms.
In particular, there was a "Hot 3D" panel where Austin Robison from NVIDIA gave more information on NVIRT, now called OptiX, with very interesting implementation details. Austin explained a little how the computation steps of rays are scheduled on the GPU (in particular for recursive operations), using persistent threads that are launched once to maximize MP occupancy and used as a state machine switching between computation steps. More on this on Tuesday at Siggraph. In the same session, James McCombe of Caustic Graphics presented their own real-time RT API, which seems quite similar to NVIDIA's solution, and Larry Seiler from Intel presented RT on Larrabee. Regarding Larrabee, it seems more and more to me that it won't be able to compete with GPUs for rasterisation applications, but will really be a killer platform for ray tracing.
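The persistent-thread idea can be sketched on the CPU as a worker loop that pulls tasks from a queue and switches on their state. This is only a toy Python model under my own assumptions, not OptiX code: the two states, the task layout and the `scene_hit` callback are all hypothetical.

```python
from collections import deque

TRAVERSE, SHADE = "traverse", "shade"


def persistent_worker(work, scene_hit, max_bounces):
    """One persistent worker: pulls tasks and switches on their state.

    work: deque of (state, ray_id, bounce) tasks
    scene_hit: hypothetical callback telling whether a ray hits something
    """
    log = []
    while work:
        state, ray_id, bounce = work.popleft()
        if state == TRAVERSE:
            if scene_hit(ray_id, bounce):
                work.append((SHADE, ray_id, bounce))
            log.append((TRAVERSE, ray_id, bounce))
        elif state == SHADE:
            # Recursion becomes a new task instead of a nested call.
            if bounce + 1 < max_bounces:
                work.append((TRAVERSE, ray_id, bounce + 1))
            log.append((SHADE, ray_id, bounce))
    return log


tasks = deque([(TRAVERSE, 0, 0), (TRAVERSE, 1, 0)])
log = persistent_worker(tasks, lambda ray_id, bounce: ray_id == 0, 2)
```

The point of the sketch is that a secondary ray is just another task pushed back into the queue, so the worker never has to recurse or be relaunched.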
In the evening, the conference social event took place on a steamboat on the Mississippi, very nice :-)
Day 2: Monday August 3
9:30: Still at HPG, the first talk is from Tim Sweeney, founder of Epic Games. Tim presents his vision of the future of real-time graphics: the end of GPUs as we currently know them (with dedicated units) and of their graphics APIs, replaced by general computing devices using multi-core vector-processing units to run all code, graphics and non-graphics, uniformly and 100% in software. These would be programmed directly in C++ and would allow a wide variety of new algorithms to be implemented, taking advantage of very good load balancing and of memory latency hiding through large caches to provide high performance. In particular, Tim talked about using the REYES rendering pipeline in future video games, and ray tracing for secondary-ray effects. According to Tim, while developing a current video game engine takes 3 years, next-generation engines will need 5 years of development because of their increasing complexity.
9:45: Just got the news that the OpenGL 3.2 specification has been announced by Khronos. The full spec can be downloaded here: http://www.opengl.org/registry/ and, as usual, NVIDIA announced support in their driver: http://developer.nvidia.com/object/opengl_3_driver.html. The main additions to the spec are:
- Increased performance for vertex arrays and fence sync objects to avoid idling while waiting for resources shared between the CPU and GPU, or multiple CPU threads;
- Improved pipeline programmability, including geometry shaders in the OpenGL core;
- Boosted cube map visual quality and multisampling rendering flexibility by enabling shaders to directly process texture samples.
As well as new extensions: ARB_fragment_coord_convention, ARB_provoking_vertex, ARB_vertex_array_bgra, ARB_depth_clamp, WGL_ARB_create_context (updated to create profiles), GLX_ARB_create_context (updated to create profiles), GL_EXT_separate_shader_objects, GL_NV_parameter_buffer_object2, GL_NV_copy_image.
3:00pm: The first Siggraph session I partially attended was the "Advances in Real-Time Rendering in 3D Graphics and Games" course. The most interesting part seemed to be the talk on Light Propagation Volumes in CryEngine 3 by Kaplanyan, where he showed the very good results they obtained using VPLs (Virtual Point Lights) and Light Propagation Volumes. Graphics Techniques From Disney’s Pure by Moore and Jeffries was also interesting, as was Making it Smooth: Advances in Antialiasing and Depth of Field Techniques by Yang. It seems that the slides will be available here: www.bungie.net/publications in a few days. I didn't see the talk on Graphics Engine Postmortem from LittleBigPlanet by Evans, but I was told that it was also very interesting, especially to see how game developers make their own mixture of published techniques and develop ad-hoc hacks that appear to be sufficient in most cases.
16:00: Final HPG panel: Tim Sweeney (Epic Games), Larry Gritz (Sony Pictures Imageworks), Steve Parker (NVIDIA), Aaron Lefohn (Intel), Vineet Goel (AMD).
Aaron talked about his vision of a future cross-architecture rendering abstraction: running mostly in user space and made of multiple specialized pipelines (REYES, raster, RT). Steve predicted the death of rasterisation in 7 years, as well as that of REYES and RT: future graphics algorithms will be a combination of RT, rasterisation and REYES (blend or unification); the question is how hardware manufacturers will fit into this market.
Day 3 : Tuesday August 4
Not so many things today. In the morning, I went to the Real-Time Global Illumination for Dynamic Scenes course. It was quite interesting and provided a good synthesis of current state-of-the-art GI techniques. I also presented my poster on GigaVoxels at the poster session during the lunch break.
At the end of the day, an interesting session was the OpenCL Birds of a Feather. Mike Houston from ATI presented the OpenCL specification, its model and its features. Mike also explained how OpenCL will be implemented on ATI GPUs, and in particular that, because of the R7xx architecture, developers will need to vectorize their algorithms using vec4, as in shaders, to take advantage of the 5-component SIMD units (which make up each of the 16 cores, themselves working in SIMD and forming their equivalent of a multiprocessor). He announced the release of an OpenCL implementation from AMD, but only for CPUs! It seems they still don't have a GPU implementation, and I wonder whether they are waiting for the release of the Evergreen architecture to provide one. Next were Aaron Lefohn and Larry Seiler, who presented Intel's future implementation. They particularly pushed the task API of OpenCL, which seems to allow efficient implementation of task-parallel algorithms, with multiple kernels running concurrently and communicating. This kind of model supposes that different kernels can run on different cores of the GPU, something that is not possible with the current NVIDIA architecture (nor with ATI's, I think). If I understood correctly, it also seems that their first OpenCL implementation will require the programmer to use 16-component vectors to fill their SIMD lanes. Finally, Simon Green presented NVIDIA's implementation; as you know, they are the only ones offering a working OpenCL implementation now (and have been since December 2008).
Day 4: Wednesday August 5
Interesting session from NVIDIA this afternoon: Alternative Rendering Pipelines on NVIDIA CUDA. Andrew Tatarinov and Alexander Kharlamov presented their CUDA implementations of a ray-tracing pipeline and a REYES pipeline. Both implementations use persistent threads to improve work balancing and per-MP usage of computation resources, with uber-kernels that use dynamic branching to switch between multiple tasks. In addition, the REYES implementation uses work queues, implemented with prefix-sum scan operations, to feed the persistent threads with work and to let them communicate. To me that is a really awesome model, but the problem is the register usage of this kind of uber-kernel. Ideally, better thread scheduling on the MPs and the ability to launch different threads on each MP should be even more efficient. More details here: http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/3DVision_Develop_Design_Play_in_3D_Stereo.pdf
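As an illustration of how a prefix-sum scan can be used to build a work queue, here is a small sequential Python sketch. It is not the code presented in the session: it only shows the general technique, where each thread emits a variable number of items, an exclusive scan over the counts gives each thread its write offset, and all outputs end up densely packed. The function names are hypothetical.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i]); also returns the total."""
    out, total = [], 0
    for v in values:
        out.append(total)
        total += v
    return out, total


def enqueue(outputs_per_thread):
    """Pack a variable number of outputs per thread into one dense queue.

    outputs_per_thread: list of lists; entry i holds what thread i emits
    """
    counts = [len(items) for items in outputs_per_thread]
    offsets, total = exclusive_scan(counts)
    queue = [None] * total
    # Each thread writes at its own offset, so no two threads collide.
    for offset, items in zip(offsets, outputs_per_thread):
        for k, item in enumerate(items):
            queue[offset + k] = item
    return queue


queue = enqueue([["a0", "a1"], [], ["c0"]])
```

For example, in a REYES-like split step a patch may produce zero, one or several sub-patches; the scan lets every thread know where to write without any locking.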

TechnoBlog - GPU
Thursday, 18 June 2009
Come and see my talk on GigaVoxels at Siggraph 2009!
Friday, 7 August | 3:45 PM | Room 260-262
I will present the new GigaVoxels pipeline implemented in CUDA and discuss its integration into video games. I will also present the new results we obtained, especially the very efficient implementation of soft shadows and depth-of-field effects made possible by the intrinsic properties of the GigaVoxels hierarchical structure and volume mipmapping mechanism.
http://artis.imag.fr/Membres/Cyril.Crassin/

TechnoBlog - GPU
Friday, 08 May 2009
Timothy Farrar's blog is always a very good source of information if you are looking for in-depth thoughts and experience on GPUs. On Wednesday, Timothy published a post about his vision of the evolution of the GPU in the near future. In particular, Timothy puts forward the idea that future GPUs could expose a new, highly flexible mechanism for job distribution based on generic hardware-managed queues (FIFOs) associated with kernels.
Current GPUs start threads by scheduling groups of independent jobs between dependent state changes from a master control stream (a command buffer filled by the CPU). OpenGL conditional rendering provides a starting point for modifying the task list in this stream on the fly, and DX11 seems to go further with the DispatchIndirect function, which enables DX Compute grid dimensions to come directly from device memory. The idea is that future hardware may provide generic queues that could be filled by kernels and used proactively by the hardware scheduler to set up thread blocks and route data to an available core, starting new thread blocks with the kernel associated with the queue.
Much of the work in parallel processing is related to grouping, moving and compacting or expanding data, and ends up being a data routing problem. This model seems to provide a very good way to handle grouping for data locality. It could allow kernels that reach a divergent point (such as branch divergence or data locality divergence) to output threads to new queues with a new domain coordinate, to ensure a good new grouping for continued computation. Data associated with a kernel would also be in the queue and managed in hardware, to provide very fast access to thread parameters.
This can be done using a CPU-like coherent cache with a large vector processor like Larrabee, but data routing becomes expensive with a coherent cache that consumes transistors for routing that could have been defined explicitly by the programmer. When you attempt to do all this routing manually with dedicated local memory and high-throughput global memory, it is still expensive, just less so. Timothy's idea is that this mechanism could be heavily hardware accelerated and could give "traditional" GPUs a big advantage over more generic, Larrabee-like architectures. I really think this is the way to go for GPUs to continue to provide high performance for more generic graphics rendering pipelines.
The same idea is developed in a TOG paper that will be presented at Siggraph this year. This paper presents GRAMPS, a programming model that generalizes concepts from modern real-time graphics pipelines by exposing an execution model that mixes task parallelism and data parallelism, with both fixed-function and application-programmable processing stages that exchange data via queues.
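To make the queue-per-kernel idea more concrete, here is a toy software model in Python. It is only a sketch under my own assumptions, not a description of any real hardware or of GRAMPS itself: each queue is tied to one kernel, a scheduler launches blocks of items from non-empty queues, and a kernel that reaches a divergent point pushes each item to the queue matching its branch. The class and kernel names are hypothetical.

```python
from collections import deque


class QueueScheduler:
    """Toy model: each queue is tied to one kernel; kernels feed queues."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.queues = {}   # queue name -> deque of work items
        self.kernels = {}  # queue name -> kernel function

    def register(self, name, kernel):
        self.queues[name] = deque()
        self.kernels[name] = kernel

    def push(self, name, item):
        self.queues[name].append(item)

    def run(self):
        """Launch blocks of work until every queue is empty."""
        launches = []
        while any(self.queues.values()):
            for name, queue in self.queues.items():
                if not queue:
                    continue
                n = min(self.block_size, len(queue))
                block = [queue.popleft() for _ in range(n)]
                launches.append((name, block))
                for item in block:
                    self.kernels[name](self, item)
        return launches


results = []


def classify(sched, x):
    # Divergent point: route each item to the queue matching its branch.
    sched.push("even" if x % 2 == 0 else "odd", x)


def even_kernel(sched, x):
    results.append(x // 2)


def odd_kernel(sched, x):
    results.append(3 * x + 1)


sched = QueueScheduler(block_size=4)
sched.register("classify", classify)
sched.register("even", even_kernel)
sched.register("odd", odd_kernel)
for x in range(6):
    sched.push("classify", x)
launches = sched.run()
```

After the classify step, every block launched for the "even" or "odd" kernel contains only items that take the same branch, which is the regrouping effect the queues are meant to provide.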

TechnoBlog - GPU
Tuesday, 28 April 2009
Yesterday, NVIDIA made public two new OpenGL extensions named NV_shader_buffer_load and NV_vertex_buffer_unified_memory. These new extensions allow OpenGL to be used in a totally new way that they call Bindless Graphics. With Bindless Graphics, you can manipulate buffer objects directly through their GPU global memory addresses and control the residency of these objects from the application. This removes the bottleneck of having to bind objects before using them, which forces the driver to fetch all of an object's state before it can be used or modified.
The NV_shader_buffer_load extension provides a mechanism to bind buffer objects to the context in such a way that they can be accessed by reading from a flat, 64-bit GPU address space directly from any shader stage and to query GPU addresses of buffer objects at the API level. The intent is that applications can avoid re-binding buffer objects or updating constants between each Draw call and instead simply use a VertexAttrib (or TexCoord, or InstanceID, or...) to "point" to the new object's state.
The NV_vertex_buffer_unified_memory extension provides a mechanism to specify vertex attributes and element array locations using these GPU addresses. Binding vertex buffers is one of the most frequent and expensive operations in many GL applications, due to the cost of chasing pointers and binding objects. With this extension, applications can specify vertex attribute state directly using VBO addresses, which alleviates the overhead of object binds and driver memory management.
NVIDIA provides a small bindless graphics tutorial, with a presentation of the new features.
That seems very useful, but what scares me a little is that each time you give the developer lower-level access like this, you considerably reduce the potential for automatic driver optimizations. In particular, I wonder how this mechanism interacts with NVIDIA's SLI mode, which provides automatic scaling of OpenGL applications across multiple GPUs. This mode duplicates data on each GPU and broadcasts drawing commands to all the GPUs so that they can produce different parts of a frame and compose them before display. With these extensions, the same address space has to be maintained on all GPUs involved in SLI drawing, which seems very difficult, especially for heterogeneous SLI configurations.

TechnoBlog - GPU
Wednesday, 22 April 2009
I just want to share a few tips I found that help when working with CUDA under Visual Studio.
First, syntax highlighting for .cu files can be enabled with these few steps:
1. Copy the content of the “usertype.dat” file provided by NVIDIA (NVIDIA CUDA SDK\doc\syntax_highlighting\visual_studio_8) into the “Microsoft Visual Studio 8\Common7\IDE” folder inside your Program Files folder.
2. Open Visual Studio and go to Tools -> Options. In the Text Editor -> File Extension tab, specify the extension “cu” as a new type.
Visual Studio relies on a feature named IntelliSense to provide function and variable name completion, definition lookup and similar features. To get IntelliSense working with .cu files, you have to modify a Windows registry key: add the cu and cuh extensions to the "NCB Default C/C++ Extensions" key under the "HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\9.0\Languages\Language Services\C/C++" path. (Thanks to http://www.wizardsofeast.com/?p=378 for the tip)
For those using Visual Assist X, you can do the following. First, find the Visual Assist X install directory (X:\Program Files\Visual Assist X\AutoText\latest), make a copy of Cpp.tpl and rename it to Cu.tpl. Second, open and close Visual Studio (this initializes the Visual Assist X parameters by creating some folders/variables in the registry). Third, open regedit, go to "HKEY_CURRENT_USER\Software\Whole Tomato\Visual Assist X\VANet9", and add ".cu;" to the ExtSource key and ".cuh;" to the ExtHeader key. (Thanks to ciberxtrem for the tip)
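The ExtSource and ExtHeader values are plain lists of extensions. As a small convenience, the following Python sketch builds the new value to paste into regedit. It assumes the value is a list of extensions separated and terminated by ";" (as the ".cu;" form above suggests) and it does not touch the registry itself; the function name and the sample values are hypothetical.

```python
def add_extension(value, ext):
    """Append ext (e.g. ".cu") to a ';'-separated extension list if missing."""
    entries = [e for e in value.split(";") if e]
    if ext.lower() not in (e.lower() for e in entries):
        entries.append(ext)
    return ";".join(entries) + ";"


# Hypothetical current value of the key, for illustration only.
new_value = add_extension(".cpp;.c;", ".cu")
```

Running the helper twice with the same extension leaves the value unchanged, which avoids duplicate entries if the tip has already been applied.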
Finally, build rules that make it easy to compile .cu files without writing the rules manually can be integrated by installing this little wizard: http://forums.nvidia.com/index.php?showtopic=65111. More details on CUDA build rules can be found on this website: http://sarathc.wordpress.com/2008/09/26/how-to-integrate-cuda-with-visual-c/.

TechnoBlog - GPU
Monday, 06 April 2009
Last week, Intel gave two talks about the Larrabee ISA, called Larrabee New Instructions (LRBNi).
The most significant thing to note is that Larrabee will expose a vector assembly, very similar to SSE instructions but operating on 16-component vectors instead of 4. To program this, they will provide C intrinsics whose names look... really weird!
C++ Larrabee prototype library: http://software.intel.com/en-us/articles/prototype-primitives-guide/
Intel provides headers with x86 implementations of these instructions so that developers can start using them now. But I can't imagine anybody using this kind of vector intrinsics to program a data-parallel architecture. As we have seen with SSE instructions, very few programmers ended up using them, and only for very specific parts of algorithms. So I think these intrinsics will only be used to implement higher-level programming layers, such as an OpenCL implementation, which to me is a much better and more flexible way to program these architectures.
The scalar model exposed for the G80 through CUDA and the PTX assembly (and that will be exposed by OpenCL) uses scalar operations over scalar registers. In this model, the underlying SIMD architecture is visible through the notion of warps, inside which programmers know that divergent branches are serialized. Inter-thread communication is exposed through the notion of a CTA (Cooperative Thread Array), a group of threads able to communicate through a very fast shared memory. Coalescing rules are given to programmers so that they can make the best use of the underlying SIMD architecture, but the model is far more scalable (not restricted to a given vector size) and allows code to be written in a much more natural way than a vector model.
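The serialization of divergent branches inside a warp can be illustrated with a small Python simulation. This is a simplified model, not a description of the actual hardware: the warp evaluates the condition for all its threads, then executes each side of the branch in a separate pass with the other threads masked out. The function name and the example operations are hypothetical.

```python
def run_warp(values, predicate, then_op, else_op):
    """Simulate one SIMD warp executing an if/else over its threads.

    Returns the per-thread results and the number of serialized passes.
    """
    mask = [predicate(v) for v in values]
    results = list(values)
    passes = 0
    if any(mask):        # pass over the threads taking the branch
        passes += 1
        results = [then_op(v) if m else r
                   for v, m, r in zip(values, mask, results)]
    if not all(mask):    # pass over the remaining threads
        passes += 1
        results = [else_op(v) if not m else r
                   for v, m, r in zip(values, mask, results)]
    return results, passes


def is_even(v):
    return v % 2 == 0


coherent, coherent_passes = run_warp(
    list(range(0, 64, 2)), is_even, lambda v: v // 2, lambda v: 3 * v + 1)
divergent, divergent_passes = run_warp(
    list(range(32)), is_even, lambda v: v // 2, lambda v: 3 * v + 1)
```

When all 32 threads agree on the condition, only one pass is executed; when they disagree, both sides are executed one after the other, which is why coherent warps are cheaper.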
Even if, for now, Larrabee exposes a vector assembly where the G80 exposes a scalar one, only the programming model varies; the underlying architectures are in the end very similar. Each Larrabee core can dual-issue instructions to an x86 unit and to 16 scalar processors working in SIMD, which is very similar to a G80 multiprocessor, which can dual-issue instructions to a special unit or to 8 scalar processors working in SIMD over 4 cycles (providing a 32-wide SIMD). Larrabee exposes 16-wide vector registers where the G80 exposes scalar ones, which are in fact aligned parts of a vector memory bank.
The true difference between the two architectures is that Larrabee will implement the whole graphics pipeline using these general-purpose cores (plus dedicated texture units), where the G80 still has a lot of highly optimized units and data paths dedicated to graphics operations and connected into a fixed pipeline. Intel's bet is that the flexibility of a fully programmable pipeline will allow better load balancing, which will compensate for the architecture's lower efficiency on graphics operations. The major asset they rely on is a binning rasterisation model in which, after the transform stage, triangles are assigned according to screen-tile locality to the cores, where all the rasterisation, shading and blending are done. Thanks to this model, they can keep local screen regions per core in dedicated parts of a global L2 cache, which is also used for inter-core communication. That should allow efficient programmable blending, for instance. But I think that even they don't know whether it will really be competitive for consumer graphics!
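The binning step can be sketched in a few lines of Python. This is a generic illustration of tile binning, not Larrabee's actual implementation: it assumes screen-space triangles and assigns each one to every tile overlapped by its bounding box, which is conservative (a real binner may test tiles more precisely). The function name and parameters are hypothetical.

```python
def bin_triangles(triangles, width, height, tile_size):
    """Assign each triangle to every screen tile its bounding box overlaps.

    triangles: list of triangles, each a list of three (x, y) vertices
    Returns {(tile_x, tile_y): [triangle indices]}.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, in tile coordinates.
        x0 = max(int(min(xs)) // tile_size, 0)
        x1 = min(int(max(xs)) // tile_size, tiles_x - 1)
        y0 = max(int(min(ys)) // tile_size, 0)
        y1 = min(int(max(ys)) // tile_size, tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


bins = bin_triangles(
    [[(10, 10), (50, 20), (30, 60)], [(60, 10), (130, 10), (100, 50)]],
    width=256, height=128, tile_size=64)
```

Each tile's list can then be processed independently by one core, which keeps that tile's colour and depth data local while it rasterizes, shades and blends.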
And even on that point, Larrabee's approach is not so different from the G80's, where triangles are globally rasterized and fragments are then spread among multiprocessors based on screen tiles (cf. http://www.icare3d.org/GPU/CN08) for fragment shading. The difference is that the z-test and blending are done by fixed ROP units connected to the MPs via a crossbar (cf. http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242).
Finally, with these talks, Intel seems to present as a revolutionary new architecture something that, for the most part, has been around for more than 2 years with the G80, and that comes with a programming model that seems really weird compared to the CUDA model. It is all the more strange given that Larrabee may not be released before Q1 2010, by which time NVIDIA and ATI will already have released their next-generation architectures, which may look even more similar to Larrabee. With Larrabee, Intel has been feeding the industry a lot of promises, like "it will be x86, so you won't have to do anything particular to use it", that we have always known to be wrong, since by nature the efficiency of a data-parallel architecture comes from its particular programming model. If proof was needed, I think this ISA is it.
Intel GDC presentations: http://software.intel.com/en-us/articles/intel-at-gdc/
Larrabee at GDC, PCWATCH review: http://translate.google.fr/translate?u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2F2009%2F0330%2Fkaigai498.htm&sl=ja&tl=en&hl=fr&ie=UTF-8
Very good article about Larrabee and LRBNi: http://www.ddj.com/architect/216402188

TechnoBlog - GPU
Monday, 30 March 2009
The OpenGL 3.1 specification was released at GDC 2009 a few days ago. I think this is the first time in ARB history that a new version of OpenGL has been released so quickly, less than one year after OpenGL 3.0 (released at Siggraph last year).
This new revision promotes to the core some remaining G80 features that were not promoted into OpenGL 3.0: Texture Buffer Objects (GL_ARB_texture_buffer_object, a one-dimensional array of texels used as a texture without filtering, equivalent to CUDA linear textures), Uniform Buffer Objects (GL_ARB_uniform_buffer_object, which enables rapid swapping of blocks of uniforms, rapid updates and sharing across program objects), Primitive Restart (GL_NV_primitive_restart, which restarts an executing primitive and has existed as an extension since the GeForce 6, I think), Instancing (GL_ARB_draw_instanced) and Texture Rectangle (GL_ARB_texture_rectangle). Uniform Buffer Objects have been enhanced quite a lot compared to the original GL_EXT_bindable_uniform: among other things, several buffers can be combined to populate a shader uniform block, and a standard cross-platform data storage layout is proposed.
There are also two new "features" that were not available as extensions before. The first is a CopyBuffer API that allows fast copies between buffer objects (VBO/PBO/UBO) and will also be useful for sharing buffers with OpenCL. The other is signed normalized textures, new integer texture formats that represent values in the range [-1.0,1.0].
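To show what a signed normalized format means in practice, here is a short Python sketch of the integer-to-float mapping. It assumes the symmetric convention in which the largest positive integer maps to 1.0 and the two most negative integers both map to -1.0; this is an assumption made for the illustration, and the exact conversion rule should be checked in the specification. The function names are hypothetical.

```python
def snorm_to_float(c, bits):
    """Map a signed integer of the given bit width to [-1.0, 1.0]."""
    return max(c / (2 ** (bits - 1) - 1), -1.0)


def float_to_snorm(f, bits):
    """Quantize a float to a signed integer, clamping to [-1.0, 1.0] first."""
    f = min(max(f, -1.0), 1.0)
    return int(round(f * (2 ** (bits - 1) - 1)))
```

With 8 bits, for instance, the stored integers range from -128 to 127 and the decoded values cover [-1.0, 1.0] in steps of 1/127.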
Geometry shaders (GL_ARB_geometry_shader4) were not promoted, and maybe they never will be. The extension is not implemented by ATI and is not used much, since this feature is useful only in a few cases (because of the performance of implementations). Direct state access (GL_EXT_direct_state_access) was not promoted either. It is a very useful extension that reduces the cost of state changes, but it is really new (released with GL 3.0) and I didn't expect it to be promoted yet.
The deprecation model is a design mechanism introduced in GL 3.0 to allow outdated features and commands to be removed (the reverse of the extension mechanism). Core features are first marked as deprecated, then moved to an ARB extension, then eventually to an EXT or vendor extension, or removed entirely. The OpenGL 3.0 specification marks several features as deprecated, including the venerable glBegin/glEnd mechanism, display lists, matrix and attribute stacks, and the portion of the fixed-function pipe subsumed by shaders (lighting, fog, texture mapping, and texture coordinate generation).
About deprecation, the specification is available in two formats, one with deprecated features (http://www.opengl.org/registry/doc/glspec31undep.20090324.pdf) and one with only "pure" GL 3.1 features (http://www.opengl.org/registry/doc/glspec31.20090324.pdf). An extension called ARB_compatibility has been introduced. If supported by an implementation, this extension ensures that all deprecated features are available. This mechanism avoids breaking compatibility for old GL applications by keeping every feature in the driver, while cleaning up the API and providing new high-performance paths. It seems to be a good mechanism, more convenient than the initial idea of creating specific contexts. NVIDIA, for instance, promises to keep all deprecated features in their drivers to meet customers' needs (I think mainly CAD customers).
Once again, as with OpenGL 3.0, while ATI declared that they will support GL 3.1, NVIDIA announced beta support for GL 3.1 and released drivers: http://developer.nvidia.com/object/opengl_3_driver.html
To conclude, it's good to see Khronos/ARB remaining so active since the release of OpenGL 3.0, and it's good to see OpenGL evolving in the right direction :-)
Links:
- The announcement: http://www.khronos.org/news/press/releases/khronos-releases-streamlined-opengl-3.1-specification/
- The specifications: http://www.opengl.org/registry/
- More information: http://www.g-truc.net/#news0152, http://www.skew-matrix.com/bb/viewtopic.php?f=3&t=4