Irrlicht isometric game engine

Noiecity · Post by **Noiecity** » Sat Nov 01, 2025 12:26 am

This example works with the Irrlicht base files. I compiled and ran it on Linux with Irrlicht 1.9.0 svn trunk.
It is an example of isometric graphics with a bloom effect, plus saving the settings to a file. Simple but functional.
It always renders at 256x256, applies the filters, and then scales the image to the height of the window.
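The scaling step can be sketched on its own, without Irrlicht. This is only a minimal illustration: `DestRect` and `FitSquareToWindow` are hypothetical names, and taking the smaller window side (rather than always the height) is an assumption added here so that a portrait window also works.

```cpp
#include <algorithm>

// Hypothetical plain rectangle, standing in for irr::core::rect<s32>.
struct DestRect {
    int left, top, right, bottom;
};

// Fit a square render (e.g. 256x256) into a window and centre it.
// Assumption: the square's side is the smaller window dimension.
DestRect FitSquareToWindow(int windowWidth, int windowHeight) {
    const int side = std::min(windowWidth, windowHeight);
    const int xOffset = (windowWidth - side) / 2;
    const int yOffset = (windowHeight - side) / 2;
    return DestRect{xOffset, yOffset, xOffset + side, yOffset + side};
}
```

For a 640x480 window this gives the rectangle from (80, 0) to (560, 480), which is the same area a height-based square occupies in a landscape window.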

Code: Select all

#include <irrlicht.h>
#include <vector>
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <cstdio>   // printf

using namespace irr;
using namespace core;
using namespace scene;
using namespace video;
using namespace io;
using namespace gui;

struct BloomParams {
    float threshold;
    float softness;
    int radius;
    float strength;
};

const int RENDER_WIDTH = 256;
const int RENDER_HEIGHT = 256;

// Function to apply Gaussian blur
void ApplyGaussianBlur(std::vector<SColor>& source, std::vector<SColor>& dest,
                      int width, int height, int radius) {
    if (radius < 1) {
        dest = source;
        return;
    }

    std::vector<SColor> temp(width * height);

    // Create Gaussian kernel
    int kernelSize = radius * 2 + 1;
    std::vector<float> kernel(kernelSize);
    float sigma = radius / 2.0f;
    float sum = 0.0f;

    for (int i = 0; i < kernelSize; ++i) {
        int x = i - radius;
        kernel[i] = std::exp(-(x * x) / (2 * sigma * sigma));
        sum += kernel[i];
    }

    // Normalize kernel
    for (int i = 0; i < kernelSize; ++i) {
        kernel[i] /= sum;
    }

    // Horizontal blur
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float r = 0, g = 0, b = 0;

            for (int k = -radius; k <= radius; ++k) {
                int px = std::max(0, std::min(width - 1, x + k));
                SColor pixel = source[y * width + px];
                float weight = kernel[k + radius];

                r += pixel.getRed() * weight;
                g += pixel.getGreen() * weight;
                b += pixel.getBlue() * weight;
            }

            temp[y * width + x] = SColor(255, (u32)r, (u32)g, (u32)b);
        }
    }

    // Vertical blur
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float r = 0, g = 0, b = 0;

            for (int k = -radius; k <= radius; ++k) {
                int py = std::max(0, std::min(height - 1, y + k));
                SColor pixel = temp[py * width + x];
                float weight = kernel[k + radius];

                r += pixel.getRed() * weight;
                g += pixel.getGreen() * weight;
                b += pixel.getBlue() * weight;
            }

            dest[y * width + x] = SColor(255, (u32)r, (u32)g, (u32)b);
        }
    }
}

// Function to apply bloom effect
void ApplyBloomToImage(IImage* image, const BloomParams& params) {
    const dimension2du size = image->getDimension();
    const int width = size.Width;
    const int height = size.Height;

    std::vector<SColor> originalPixels(width * height);
    std::vector<SColor> brightPixels(width * height);
    std::vector<SColor> bloomPixels(width * height);

    // 1. Read original pixels
    for (u32 y = 0; y < height; ++y) {
        for (u32 x = 0; x < width; ++x) {
            originalPixels[y * width + x] = image->getPixel(x, y);
        }
    }

    // 2. Extract bright areas with smoothing
    for (int i = 0; i < width * height; ++i) {
        SColor pixel = originalPixels[i];
        float brightness = (pixel.getRed() * 0.299f +
                          pixel.getGreen() * 0.587f +
                          pixel.getBlue() * 0.114f) / 255.0f;

        if (brightness > params.threshold) {
            // Apply smoothing based on softness parameter
            float softFactor = 1.0f - params.softness;
            float intensity = (brightness - params.threshold) / (1.0f - params.threshold);
            intensity = std::pow(intensity, softFactor * 2.0f + 0.5f);

            brightPixels[i] = SColor(
                255,
                (u32)(pixel.getRed() * intensity),
                (u32)(pixel.getGreen() * intensity),
                (u32)(pixel.getBlue() * intensity)
            );
        } else {
            brightPixels[i] = SColor(0, 0, 0, 0);
        }
    }

    // 3. Apply blur
    ApplyGaussianBlur(brightPixels, bloomPixels, width, height, params.radius);

    // 4. Combine with original image
    for (int i = 0; i < width * height; ++i) {
        SColor original = originalPixels[i];
        SColor bloom = bloomPixels[i];

        u32 r = original.getRed() + (u32)(bloom.getRed() * params.strength);
        u32 g = original.getGreen() + (u32)(bloom.getGreen() * params.strength);
        u32 b = original.getBlue() + (u32)(bloom.getBlue() * params.strength);

        r = r > 255 ? 255 : r;
        g = g > 255 ? 255 : g;
        b = b > 255 ? 255 : b;

        image->setPixel(i % width, i / width, SColor(255, r, g, b));
    }
}

// Function to save bloom parameters to XML
bool SaveBloomConfig(const BloomParams& params, const char* filename) {
    IrrlichtDevice* device = createDevice(video::EDT_NULL);
    if (!device) return false;

    IXMLWriter* writer = device->getFileSystem()->createXMLWriter(filename);
    if (!writer) {
        device->drop();
        return false;
    }

    writer->writeXMLHeader();
    writer->writeElement(L"BloomConfig");
    writer->writeLineBreak();

    // Write parameters
    core::stringw thresholdStr = core::stringw(params.threshold);
    core::stringw softnessStr = core::stringw(params.softness);
    core::stringw radiusStr = core::stringw(params.radius);
    core::stringw strengthStr = core::stringw(params.strength);

    writer->writeElement(L"Threshold", false, L"value", thresholdStr.c_str());
    writer->writeLineBreak();
    writer->writeElement(L"Softness", false, L"value", softnessStr.c_str());
    writer->writeLineBreak();
    writer->writeElement(L"Radius", false, L"value", radiusStr.c_str());
    writer->writeLineBreak();
    writer->writeElement(L"Strength", false, L"value", strengthStr.c_str());
    writer->writeLineBreak();

    writer->writeClosingTag(L"BloomConfig");
    writer->writeLineBreak();

    writer->drop();
    device->drop();
    return true;
}

// Function to load bloom parameters from XML
bool LoadBloomConfig(BloomParams& params, const char* filename) {
    IrrlichtDevice* device = createDevice(video::EDT_NULL);
    if (!device) return false;

    // Check if file exists
    if (!device->getFileSystem()->existFile(filename)) {
        device->drop();
        return false;
    }

    IXMLReader* reader = device->getFileSystem()->createXMLReader(filename);
    if (!reader) {
        device->drop();
        return false;
    }

    // Default values in case of error
    params.threshold = 0.3f;
    params.softness = 0.8f;
    params.radius = 8;
    params.strength = 0.6f;

    while (reader->read()) {
        switch (reader->getNodeType()) {
            case io::EXN_ELEMENT: {
                core::stringw nodeName = reader->getNodeName();

                if (nodeName == L"Threshold") {
                    // Read as string and convert to float
                    const wchar_t* value = reader->getAttributeValue(L"value");
                    if (value) {
                        char buffer[32];
                        wcstombs(buffer, value, sizeof(buffer));
                        params.threshold = strtof(buffer, NULL);
                    }
                }
                else if (nodeName == L"Softness") {
                    const wchar_t* value = reader->getAttributeValue(L"value");
                    if (value) {
                        char buffer[32];
                        wcstombs(buffer, value, sizeof(buffer));
                        params.softness = strtof(buffer, NULL);
                    }
                }
                else if (nodeName == L"Radius") {
                    params.radius = reader->getAttributeValueAsInt(L"value");
                }
                else if (nodeName == L"Strength") {
                    const wchar_t* value = reader->getAttributeValue(L"value");
                    if (value) {
                        char buffer[32];
                        wcstombs(buffer, value, sizeof(buffer));
                        params.strength = strtof(buffer, NULL);
                    }
                }
                break;
            }
        }
    }

    reader->drop();
    device->drop();
    return true;
}

class SaveButtonEventReceiver : public IEventReceiver {
private:
    bool saveButtonPressed;
    BloomParams* bloomParams;
    IGUIButton* saveButton;

public:
    SaveButtonEventReceiver() : saveButtonPressed(false), bloomParams(nullptr), saveButton(nullptr) {}

    virtual bool OnEvent(const SEvent& event) {
        if (event.EventType == EET_GUI_EVENT) {
            if (event.GUIEvent.EventType == EGET_BUTTON_CLICKED) {
                if (event.GUIEvent.Caller == saveButton) {
                    saveButtonPressed = true;
                    return true;
                }
            }
        }
        return false;
    }

    void setBloomParams(BloomParams* params) {
        bloomParams = params;
    }

    void setSaveButton(IGUIButton* button) {
        saveButton = button;
    }

    bool isSaveButtonPressed() {
        if (saveButtonPressed) {
            saveButtonPressed = false;
            return true;
        }
        return false;
    }
};

int main() {
    SaveButtonEventReceiver receiver;

    IrrlichtDevice *device = createDevice(
        video::EDT_OPENGL,
        dimension2d<u32>(640, 480),
        16, false, true, false, &receiver
    );

    if (!device)
        return 1;

    device->setWindowCaption(L"256x256 Render with Bloom - Floor, Light & Shadows");

    IVideoDriver* driver = device->getVideoDriver();
    ISceneManager* smgr = device->getSceneManager();
    IGUIEnvironment* guienv = device->getGUIEnvironment();

    // Simple config file name - will be saved in current directory
    const char* configFileName = "bloom_config.xml";

    // Bloom parameters with default values
    BloomParams bloomParams;
    bloomParams.threshold = 0.3f;
    bloomParams.softness = 0.8f;
    bloomParams.radius = 8;
    bloomParams.strength = 0.6f;

    // Try to load configuration from file
    if (LoadBloomConfig(bloomParams, configFileName)) {
        printf("Configuration loaded from %s\n", configFileName);
    } else {
        printf("No configuration file found, using default values\n");
    }

    // Bloom controls with unique IDs
    guienv->addStaticText(L"Threshold:", rect<s32>(10,10,100,30), false, false, 0, 1000);
    IGUIScrollBar* thresholdScroll = guienv->addScrollBar(true, rect<s32>(110,10,250,30), 0, 1001);
    thresholdScroll->setMax(100);
    thresholdScroll->setPos((s32)(bloomParams.threshold * 100));

    guienv->addStaticText(L"Softness:", rect<s32>(10,40,100,60), false, false, 0, 1002);
    IGUIScrollBar* softnessScroll = guienv->addScrollBar(true, rect<s32>(110,40,250,60), 0, 1003);
    softnessScroll->setMax(100);
    softnessScroll->setPos((s32)(bloomParams.softness * 100));

    guienv->addStaticText(L"Radius:", rect<s32>(10,70,100,90), false, false, 0, 1004);
    IGUIScrollBar* radiusScroll = guienv->addScrollBar(true, rect<s32>(110,70,250,90), 0, 1005);
    radiusScroll->setMax(20);
    radiusScroll->setPos(bloomParams.radius);

    guienv->addStaticText(L"Strength:", rect<s32>(10,100,100,120), false, false, 0, 1006);
    IGUIScrollBar* strengthScroll = guienv->addScrollBar(true, rect<s32>(110,100,250,120), 0, 1007);
    strengthScroll->setMax(100);
    strengthScroll->setPos((s32)(bloomParams.strength * 100 / 1.5f));

    // Save button
    IGUIButton* saveButton = guienv->addButton(rect<s32>(10, 130, 250, 160), 0, 1008, L"Save Configuration");
    receiver.setSaveButton(saveButton);
    receiver.setBloomParams(&bloomParams);

    // Load model
    IAnimatedMesh* mesh = smgr->getMesh("../../media/sydney.md2");
    IAnimatedMeshSceneNode* node = nullptr;

    if (!mesh) {
        printf("Could not load ../../media/sydney.md2\n");
        device->drop();
        return 1;
    } else {
        node = smgr->addAnimatedMeshSceneNode(mesh);
        if (node) {
            node->setMaterialFlag(EMF_LIGHTING, true);
            node->setMD2Animation(scene::EMAT_STAND);
            node->setMaterialTexture(0, driver->getTexture("../../media/sydney.bmp"));

            // Enable shadow volume
            node->addShadowVolumeSceneNode();
            node->setMaterialFlag(EMF_NORMALIZE_NORMALS, true);
        }
    }

    // Create floor plane using addHillPlaneMesh with correct parameters
    IAnimatedMesh* planeMesh = smgr->addHillPlaneMesh("floor",
        dimension2d<f32>(20,20),   // Tile size
        dimension2d<u32>(25,25),   // Tile count
        0,                         // Material (0 for default)
        0.0f,                      // Hill height (0 for flat plane)
        dimension2d<f32>(0,0),     // Hill count
        dimension2d<f32>(20,20));  // Texture repeat count

    IMeshSceneNode* floor = smgr->addMeshSceneNode(planeMesh->getMesh(0));
    if (floor) {
        floor->setPosition(vector3df(0,-25,0));
        floor->setMaterialTexture(0, driver->getTexture("../../media/wall.bmp"));
        floor->setMaterialFlag(EMF_LIGHTING, true);
        floor->setMaterialFlag(EMF_BILINEAR_FILTER, false);
    }

    // Create rotating light
    ILightSceneNode* light = smgr->addLightSceneNode(0, vector3df(0,0,0),
        SColorf(1.0f, 1.0f, 1.0f, 1.0f), 100.0f);

    if (light) {
        light->setPosition(vector3df(0, 20, 30));
        light->getLightData().DiffuseColor.set(1.0f, 1.0f, 1.0f);
        light->getLightData().SpecularColor.set(0.5f, 0.5f, 0.5f);
        light->getLightData().AmbientColor.set(0.2f, 0.2f, 0.2f);
    }

    // Create isometric camera
    ICameraSceneNode* camera = smgr->addCameraSceneNode(0);

    // Configure orthographic projection
    matrix4 proj;
    f32 viewWidth = 200.0f;
    f32 viewHeight = 200.0f;

    proj.buildProjectionMatrixOrthoLH(viewWidth, viewHeight, 0.1f, 1000.0f);
    camera->setProjectionMatrix(proj, true);

    vector3df cameraPosition(30, 30, 30);
    vector3df cameraTarget(0, 5, 0);
    camera->setPosition(cameraPosition);
    camera->setTarget(cameraTarget);

    // Create 256x256 render target texture
    ITexture* renderTexture = driver->addRenderTargetTexture(
        dimension2d<u32>(RENDER_WIDTH, RENDER_HEIGHT), "render256");

    // Create image for processing
    IImage* processedImage = driver->createImage(ECF_A8R8G8B8,
        dimension2d<u32>(RENDER_WIDTH, RENDER_HEIGHT));
    ITexture* displayTexture = driver->addTexture("display", processedImage);

    // Variables for light rotation animation
    f32 lightAngle = 0.0f;
    f32 lightRadius = 30.0f;
    f32 lightHeight = 20.0f;

    // Main loop
    while(device->run()) {
        // UPDATE PARAMETERS - ensure they are read correctly
        bloomParams.threshold = thresholdScroll->getPos() / 100.0f;
        bloomParams.softness = softnessScroll->getPos() / 100.0f;
        bloomParams.radius = radiusScroll->getPos();
        bloomParams.strength = strengthScroll->getPos() / 100.0f * 1.5f;

        // Check if save button was pressed
        if (receiver.isSaveButtonPressed()) {
            if (SaveBloomConfig(bloomParams, configFileName)) {
                printf("Configuration saved to %s\n", configFileName);
            } else {
                printf("Error saving configuration\n");
            }
        }

        // Animate light rotation
        lightAngle += 0.01f;
        // cosf/sinf take radians, so wrap at a full turn (2*PI), not at 360
        if (lightAngle > 2.0f * core::PI) lightAngle -= 2.0f * core::PI;

        if (light) {
            f32 x = cosf(lightAngle) * lightRadius;
            f32 z = sinf(lightAngle) * lightRadius;
            light->setPosition(vector3df(x, lightHeight, z));
        }

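        // Start the frame (matches the endScene() call below), then render the 3D scene to the 256x256 texture
        driver->beginScene(true, true, SColor(255,100,101,140));
        driver->setRenderTarget(renderTexture, true, true, SColor(255,100,101,140));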
        smgr->drawAll();

        // Read and process texture
        IImage* renderImage = driver->createImage(renderTexture,
            position2d<s32>(0,0), dimension2d<u32>(RENDER_WIDTH, RENDER_HEIGHT));

        if (renderImage) {
            // Copy and apply bloom
            for (u32 y = 0; y < RENDER_HEIGHT; ++y) {
                for (u32 x = 0; x < RENDER_WIDTH; ++x) {
                    processedImage->setPixel(x, y, renderImage->getPixel(x, y));
                }
            }

            ApplyBloomToImage(processedImage, bloomParams);
            renderImage->drop();
        }

        // Update display texture
        driver->removeTexture(displayTexture);
        displayTexture = driver->addTexture("display", processedImage);

        // Display on main screen
        driver->setRenderTarget(0, true, true, SColor(255,100,101,140));

        // Calculate dimensions for 1:1 aspect ratio
        const dimension2du screenSize = driver->getScreenSize();
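        // Signed arithmetic: with u32 the offset would wrap around when the window is taller than wide
        const s32 displaySize = (s32)screenSize.Height;
        const s32 xOffset = ((s32)screenSize.Width - displaySize) / 2;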

        // Draw scaled texture
        driver->draw2DImage(displayTexture,
                           rect<s32>(xOffset, 0, xOffset + displaySize, displaySize),
                           rect<s32>(0, 0, RENDER_WIDTH, RENDER_HEIGHT),
                           0, 0, true);

        // Draw GUI on top
        guienv->drawAll();

        driver->endScene();
    }

    // Cleanup
    driver->removeTexture(renderTexture);
    driver->removeTexture(displayTexture);
    processedImage->drop();
    device->drop();
    return 0;
}

Thanks mr cute alien

CuteAlien · Post by **CuteAlien** » Sat Nov 01, 2025 12:23 pm

Looks kinda nice. Code can maybe be optimized a bit - the setPixel loop can probably be replaced by IImage::copyTo. And to avoid creating/destroying renderImage in the loop you can create it before and then use copyToScaling with renderTexture->lock(ETLM_READ_ONLY) as data. I assume you use a bit older Irrlicht and not trunk, in there the functions might have been renamed/changed a bit.
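Both suggestions come down to the same idea: allocate the buffer once outside the loop and copy in bulk instead of pixel by pixel. A library-free sketch of that idea follows; `PixelBuffer` and `CopyPixels` are hypothetical stand-ins for illustration, not Irrlicht API, and the real code would use the IImage functions named above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an image: tightly packed 32-bit ARGB pixels.
struct PixelBuffer {
    int width;
    int height;
    std::vector<std::uint32_t> pixels;
    PixelBuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0u) {}
};

// Copy every pixel of src into dst in one bulk operation instead of
// width*height individual get/set calls. dst keeps its own storage,
// so it can be created once and reused every frame.
// Returns false when the dimensions differ.
bool CopyPixels(const PixelBuffer& src, PixelBuffer& dst) {
    if (src.width != dst.width || src.height != dst.height) return false;
    std::copy(src.pixels.begin(), src.pixels.end(), dst.pixels.begin());
    return true;
}
```

Created once before the main loop, the destination buffer is only overwritten each frame, so nothing is allocated or freed per frame.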

Noiecity · Post by **Noiecity** » Sat Nov 01, 2025 2:34 pm

CuteAlien wrote: Sat Nov 01, 2025 12:23 pm Looks kinda nice. Code can maybe be optimized a bit - the setPixel loop can probably be replaced by IImage::copyTo. And to avoid creating/destroying renderImage in the loop you can create it before and then use copyToScaling with renderTexture->lock(ETLM_READ_ONLY) as data. I assume you use a bit older Irrlicht and not trunk, in there the functions might have been renamed/changed a bit.

Thanks, I'll try what you suggest. I downloaded the version from the svn repository that's out there; it has the shadow volume update with freeze, so I don't know whether it gets updated somewhere else.

Noiecity · Post by **Noiecity** » Sat Nov 01, 2025 10:25 pm

Okay, fixed. The XML file is now saved only in the executable path (previously, if you ran it from the compiler, it was saved in the project directory).
I also fixed a bug that occurred if you manually raised the values in the functions to increase the bloom strength: once they left the expected range, the lines of the tiles became visible. I fixed a few other things as well.
*edit: lol update again*
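One simple way to keep hand-edited values inside the expected range is to clamp them before use. The sketch below assumes the ranges of the sliders in the first listing (threshold and softness 0 to 1, radius 0 to 20, strength 0 to 1.5). `ClampBloomParams` is a hypothetical helper and not necessarily how the fix was done in the code that follows.

```cpp
#include <algorithm>

// Same fields as the BloomParams struct used in the listings.
struct BloomParams {
    float threshold;
    float softness;
    int radius;
    float strength;
};

// Hypothetical helper: clamp each field to the range its slider allows.
// The limits are assumptions taken from the GUI setup of the first listing.
BloomParams ClampBloomParams(BloomParams p) {
    p.threshold = std::min(1.0f, std::max(0.0f, p.threshold));
    p.softness  = std::min(1.0f, std::max(0.0f, p.softness));
    p.radius    = std::min(20, std::max(0, p.radius));
    p.strength  = std::min(1.5f, std::max(0.0f, p.strength));
    return p;
}
```

Calling it right after loading the XML file, and again after any manual change, would keep a stray value such as a radius of 64 from reaching the blur.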

Code: Select all

// bloom_sse2_tile_full.cpp
// CPU-only optimized bloom for Irrlicht 1.9 trunk
// SSE2 accelerated separable blur (horizontal + vertical) + tile processing
// Compile: g++ -O3 -march=native -msse2 -ffast-math -funroll-loops bloom_sse2_tile_full.cpp -o bloom -lIrrlicht -lGL -lX11
// In Visual Studio you need SSE2: Project Properties -> C/C++ -> Code Generation -> Enable Enhanced Instruction Set: SSE2
// In Visual Studio I also recommend Project Properties -> C/C++ -> Code Generation -> Floating Point Model: /fp:fast, and enabling all optimizations under C/C++ -> Optimization

#include <irrlicht.h>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <cstdio>
#include <cstdint>
#include <algorithm>
#include <xmmintrin.h>  // SSE
#include <emmintrin.h>  // SSE2

using namespace irr;
using namespace core;
using namespace scene;
using namespace video;
using namespace io;
using namespace gui;
//for windows:
/*
#ifdef _IRR_WINDOWS_
#pragma comment(lib, "Irrlicht.lib")
#endif*/
// ============================================================================
// TYPE DEFINITIONS AND CONSTANTS
// ============================================================================

typedef uint32_t PIX; // ARGB8888 packed

// Bloom parameters structure
struct BloomParams {
    float threshold;
    float softness;
    int radius;
    float strength;
};

// Configuration constants
const int RENDER_WIDTH = 256;
const int RENDER_HEIGHT = 256;
const int TILE_SIZE = 64;
const int COLOR_QUANTIZATION = 64;

// ============================================================================
// GLOBAL BUFFERS
// ============================================================================

static PIX* g_srcBuffer = NULL;
static PIX* g_brightBuffer = NULL;
static PIX* g_tmpTile = NULL;
static PIX* g_bloomBuffer = NULL;
static PIX* g_blurHorizontalBuffer = NULL;
static std::vector<float> gaussianKernelFloat;
static int gaussianRadius = -1;

// ============================================================================
// UTILITY FUNCTIONS
// ============================================================================

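/**
 * Allocates 16-byte aligned memory where posix_memalign is available.
 * Note: _POSIX_VERSION is defined by <unistd.h>; if it is not defined here,
 * this falls back to plain malloc. The blur code only uses unaligned
 * loads/stores, so alignment is an optimization, not a requirement.
 */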
static void* aligned_alloc_16(size_t bytes) {
#if defined(_POSIX_VERSION)
    void* ptr = NULL;
    const size_t align = 16;
    if (posix_memalign(&ptr, align, bytes) != 0) return NULL;
    return ptr;
#else
    void* p = malloc(bytes);
    return p;
#endif
}

/**
 * Frees aligned memory
 */
static void aligned_free(void* p) { free(p); }

/**
 * Packs ARGB components into a 32-bit pixel value
 */
static inline PIX PackARGB(unsigned int a, unsigned int r, unsigned int g, unsigned int b) {
    return ( ( (PIX)a << 24 ) | ( (PIX)r << 16 ) | ( (PIX)g << 8 ) | (PIX)b );
}

/**
 * Extracts red component from packed pixel
 */
static inline unsigned int UnpackR(PIX p){ return (p >> 16) & 0xFF; }

/**
 * Extracts green component from packed pixel
 */
static inline unsigned int UnpackG(PIX p){ return (p >> 8) & 0xFF; }

/**
 * Extracts blue component from packed pixel
 */
static inline unsigned int UnpackB(PIX p){ return p & 0xFF; }

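/**
 * Calculates luminance from RGB components using integer arithmetic.
 * 77/256, 150/256 and 29/256 approximate the 0.299, 0.587, 0.114 weights;
 * they sum to 256, so pure white maps to 255.
 */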
static inline unsigned char LumaByte(unsigned int r, unsigned int g, unsigned int b) {
    return (unsigned char)(((r * 77u) + (g * 150u) + (b * 29u)) >> 8);
}

/**
 * Precalculates Gaussian kernel coefficients for blur operations
 * Uses memoization to avoid recalculating for same radius
 */
void PrecalcGaussianFloat(int radius) {
    if (radius == gaussianRadius && !gaussianKernelFloat.empty()) return;
    gaussianRadius = radius;
    int ksize = radius * 2 + 1;
    gaussianKernelFloat.assign(ksize, 0.0f);
    float sigma = (radius > 0) ? (radius / 8.0f) : 1.0f;
    float sum = 0.0f;
    for (int i = 0; i < ksize; ++i) {
        int x = i - radius;
        float v = std::exp(-(x * x) / (2.0f * sigma * sigma));
        gaussianKernelFloat[i] = v;
        sum += v;
    }
    if (sum == 0.0f) sum = 1.0f;
    for (int i = 0; i < ksize; ++i) gaussianKernelFloat[i] /= sum;
}

// ============================================================================
// SSE BLUR FUNCTIONS
// ============================================================================

/**
 * Applies horizontal Gaussian blur to a tile using SSE2 instructions
 * Processes 4 pixels simultaneously for optimal performance
 */
void TileHorizontalBlurSSE(const PIX* src, PIX* dst,
                          int imgW, int imgH,
                          int tileX, int tileY, int tileW, int tileH,
                          int radius)
{
    PrecalcGaussianFloat(radius);
    const int ksize = radius * 2 + 1;
    const __m128i maskFF = _mm_set1_epi32(0xFF);
    const __m128 one255 = _mm_set1_ps(255.0f);
    
    for (int ty = 0; ty < tileH; ++ty) {
        int y = tileY + ty;
        const PIX* rowSrcBase = src + y * imgW;
        PIX* rowDstBase = dst + y * imgW;
        
        for (int tx = 0; tx < tileW; tx += 4) {
            int x0_tile = tx;
            int x0_img = tileX + x0_tile;
            int numOutputs = std::min(4, tileW - tx);
            
            if (numOutputs == 4) {
                // Process 4 pixels simultaneously using SSE
                __m128 accR = _mm_setzero_ps();
                __m128 accG = _mm_setzero_ps();
                __m128 accB = _mm_setzero_ps();
                
                for (int k = -radius, ki = 0; k <= radius; ++k, ++ki) {
                    float w = gaussianKernelFloat[ki];
                    int pxStart = x0_img + k;
                    
                    if (pxStart >= 0 && (pxStart + 3) < imgW) {
                        // Fast path: all 4 pixels in bounds, use one unaligned 128-bit load
                        const PIX* ptr = rowSrcBase + pxStart;
                        __m128i v = _mm_loadu_si128((const __m128i*)ptr);
                        __m128i r_i = _mm_and_si128(_mm_srli_epi32(v, 16), maskFF);
                        __m128i g_i = _mm_and_si128(_mm_srli_epi32(v, 8), maskFF);
                        __m128i b_i = _mm_and_si128(v, maskFF);
                        __m128 r_f = _mm_cvtepi32_ps(r_i);
                        __m128 g_f = _mm_cvtepi32_ps(g_i);
                        __m128 b_f = _mm_cvtepi32_ps(b_i);
                        __m128 wv = _mm_set1_ps(w);
                        accR = _mm_add_ps(accR, _mm_mul_ps(r_f, wv));
                        accG = _mm_add_ps(accG, _mm_mul_ps(g_f, wv));
                        accB = _mm_add_ps(accB, _mm_mul_ps(b_f, wv));
                    } else {
                        // Slow path: handle boundary conditions
                        float tmpR[4] = {0,0,0,0}, tmpG[4] = {0,0,0,0}, tmpB[4] = {0,0,0,0};
                        for (int out = 0; out < 4; ++out) {
                            int px = x0_img + out + k;
                            if (px < 0) px = 0;
                            else if (px >= imgW) px = imgW - 1;
                            PIX p = rowSrcBase[px];
                            tmpR[out] = (float)UnpackR(p);
                            tmpG[out] = (float)UnpackG(p);
                            tmpB[out] = (float)UnpackB(p);
                        }
                        __m128 r_f = _mm_set_ps(tmpR[3], tmpR[2], tmpR[1], tmpR[0]);
                        __m128 g_f = _mm_set_ps(tmpG[3], tmpG[2], tmpG[1], tmpG[0]);
                        __m128 b_f = _mm_set_ps(tmpB[3], tmpB[2], tmpB[1], tmpB[0]);
                        __m128 wv = _mm_set1_ps(w);
                        accR = _mm_add_ps(accR, _mm_mul_ps(r_f, wv));
                        accG = _mm_add_ps(accG, _mm_mul_ps(g_f, wv));
                        accB = _mm_add_ps(accB, _mm_mul_ps(b_f, wv));
                    }
                }
                
                // Convert results back to integers and pack into output pixel
                __m128i r_i = _mm_cvtps_epi32(_mm_min_ps(accR, one255));
                __m128i g_i = _mm_cvtps_epi32(_mm_min_ps(accG, one255));
                __m128i b_i = _mm_cvtps_epi32(_mm_min_ps(accB, one255));
                __m128i r_sh = _mm_slli_epi32(r_i, 16);
                __m128i g_sh = _mm_slli_epi32(g_i, 8);
                __m128i rgb = _mm_or_si128(_mm_or_si128(r_sh, g_sh), b_i);
                __m128i a255 = _mm_set1_epi32((int)0xFF000000); // explicit cast: the intrinsic takes a signed int
                __m128i out = _mm_or_si128(a255, rgb);
                _mm_storeu_si128((__m128i*)(rowDstBase + x0_img), out);
            } else {
                // Process remaining pixels (less than 4) using scalar code
                for (int out = 0; out < numOutputs; ++out) {
                    float accR = 0.0f, accG = 0.0f, accB = 0.0f;
                    int x_img = x0_img + out;
                    for (int k = -radius, ki = 0; k <= radius; ++k, ++ki) {
                        int px = x_img + k;
                        if (px < 0) px = 0;
                        else if (px >= imgW) px = imgW - 1;
                        PIX p = rowSrcBase[px];
                        float w = gaussianKernelFloat[ki];
                        accR += UnpackR(p) * w;
                        accG += UnpackG(p) * w;
                        accB += UnpackB(p) * w;
                    }
                    unsigned int rr = (unsigned int)std::min(255.0f, accR);
                    unsigned int gg = (unsigned int)std::min(255.0f, accG);
                    unsigned int bb = (unsigned int)std::min(255.0f, accB);
                    rowDstBase[x_img] = PackARGB(255u, rr, gg, bb);
                }
            }
        }
    }
}

/**
 * Applies vertical Gaussian blur to a tile using SSE2 instructions
 * Processes 4 columns simultaneously for optimal performance
 */
void TileVerticalBlurSSE(const PIX* src, PIX* dst,
                        int imgW, int imgH,
                        int tileX, int tileY, int tileW, int tileH,
                        int radius)
{
    PrecalcGaussianFloat(radius);
    const int ksize = radius * 2 + 1;
    const __m128i maskFF = _mm_set1_epi32(0xFF);
    const __m128 one255 = _mm_set1_ps(255.0f);
    
    for (int tx = 0; tx < tileW; tx += 4) {
        int numCols = std::min(4, tileW - tx);
        
        for (int ty = 0; ty < tileH; ++ty) {
            int y_img = tileY + ty;
            PIX* dstRow = dst + y_img * imgW;
            
            if (numCols == 4) {
                // Process 4 columns simultaneously using SSE
                __m128 accR = _mm_setzero_ps();
                __m128 accG = _mm_setzero_ps();
                __m128 accB = _mm_setzero_ps();
                
                for (int k = -radius, ki = 0; k <= radius; ++k, ++ki) {
                    int py = y_img + k;
                    if (py < 0) py = 0;
                    else if (py >= imgH) py = imgH - 1;
                    const PIX* srcRow = src + py * imgW + tileX + tx;
                    __m128i v = _mm_loadu_si128((const __m128i*)srcRow);
                    __m128i r_i = _mm_and_si128(_mm_srli_epi32(v, 16), maskFF);
                    __m128i g_i = _mm_and_si128(_mm_srli_epi32(v, 8), maskFF);
                    __m128i b_i = _mm_and_si128(v, maskFF);
                    __m128 r_f = _mm_cvtepi32_ps(r_i);
                    __m128 g_f = _mm_cvtepi32_ps(g_i);
                    __m128 b_f = _mm_cvtepi32_ps(b_i);
                    __m128 wv = _mm_set1_ps(gaussianKernelFloat[ki]);
                    accR = _mm_add_ps(accR, _mm_mul_ps(r_f, wv));
                    accG = _mm_add_ps(accG, _mm_mul_ps(g_f, wv));
                    accB = _mm_add_ps(accB, _mm_mul_ps(b_f, wv));
                }
                
                // Convert results back to integers and pack into output pixel
                __m128i r_out = _mm_cvtps_epi32(_mm_min_ps(accR, one255));
                __m128i g_out = _mm_cvtps_epi32(_mm_min_ps(accG, one255));
                __m128i b_out = _mm_cvtps_epi32(_mm_min_ps(accB, one255));
                __m128i r_sh = _mm_slli_epi32(r_out, 16);
                __m128i g_sh = _mm_slli_epi32(g_out, 8);
                __m128i rgb = _mm_or_si128(_mm_or_si128(r_sh, g_sh), b_out);
                __m128i a255 = _mm_set1_epi32((int)0xFF000000); // explicit cast: the intrinsic takes a signed int
                __m128i out = _mm_or_si128(a255, rgb);
                _mm_storeu_si128((__m128i*)(dstRow + tileX + tx), out);
            } else {
                // Process remaining columns (less than 4) using scalar code
                for (int col = 0; col < numCols; ++col) {
                    float accR = 0.0f, accG = 0.0f, accB = 0.0f;
                    int x_img = tileX + tx + col;
                    for (int k = -radius, ki = 0; k <= radius; ++k, ++ki) {
                        int py = y_img + k;
                        if (py < 0) py = 0;
                        else if (py >= imgH) py = imgH - 1;
                        PIX p = src[py * imgW + x_img];
                        float w = gaussianKernelFloat[ki];
                        accR += UnpackR(p) * w;
                        accG += UnpackG(p) * w;
                        accB += UnpackB(p) * w;
                    }
                    unsigned int rr = (unsigned int)std::min(255.0f, accR);
                    unsigned int gg = (unsigned int)std::min(255.0f, accG);
                    unsigned int bb = (unsigned int)std::min(255.0f, accB);
                    dstRow[x_img] = PackARGB(255u, rr, gg, bb);
                }
            }
        }
    }
}

// ============================================================================
// MAIN BLOOM PROCESSING FUNCTION
// ============================================================================

/**
 * Main bloom application function with tile-based processing
 * Implements complete bloom pipeline: bright extraction -> horizontal blur -> vertical blur -> composition
 */
void ApplyBloomToImageOptimized(IImage* image, const BloomParams& params) {
    const dimension2du size = image->getDimension();
    const int width = (int)size.Width;
    const int height = (int)size.Height;
    const int npixels = width * height;

    // --- Allocate global buffers ---
    if (!g_srcBuffer) {
        g_srcBuffer = (PIX*)aligned_alloc_16(sizeof(PIX) * (size_t)npixels);
        g_brightBuffer = (PIX*)aligned_alloc_16(sizeof(PIX) * (size_t)npixels);
        g_blurHorizontalBuffer = (PIX*)aligned_alloc_16(sizeof(PIX) * (size_t)npixels);
        g_bloomBuffer = (PIX*)aligned_alloc_16(sizeof(PIX) * (size_t)npixels);
        g_tmpTile = (PIX*)aligned_alloc_16(sizeof(PIX) * (size_t)TILE_SIZE * (size_t)TILE_SIZE);

        // Fallback to regular malloc if aligned allocation fails
        if (!g_srcBuffer || !g_brightBuffer || !g_blurHorizontalBuffer || !g_bloomBuffer || !g_tmpTile) {
            if (!g_srcBuffer) g_srcBuffer = (PIX*)malloc(sizeof(PIX) * (size_t)npixels);
            if (!g_brightBuffer) g_brightBuffer = (PIX*)malloc(sizeof(PIX) * (size_t)npixels);
            if (!g_blurHorizontalBuffer) g_blurHorizontalBuffer = (PIX*)malloc(sizeof(PIX) * (size_t)npixels);
            if (!g_bloomBuffer) g_bloomBuffer = (PIX*)malloc(sizeof(PIX) * (size_t)npixels);
            if (!g_tmpTile) g_tmpTile = (PIX*)malloc(sizeof(PIX) * (size_t)TILE_SIZE * (size_t)TILE_SIZE);
            if (!g_srcBuffer || !g_brightBuffer || !g_blurHorizontalBuffer || !g_bloomBuffer || !g_tmpTile) {
                fprintf(stderr, "Failed to allocate bloom buffers\n");
                return;
            }
        }
    }

    // --- OPTIMIZATION 2: Replace getPixel loop with memcpy ---
    // Fast data copy from IImage to working buffer
    void* inData = image->getData();
    if (!inData) {
        fprintf(stderr, "Failed to get image data for processing\n");
        return;
    }
    std::memcpy(g_srcBuffer, inData, (size_t)npixels * 4);
    // --- END OPTIMIZATION 2 ---

    // Bright extraction pass
    const float thr = params.threshold;
    const float softness = params.softness;
    const float denom = (1.0f - thr) > 1e-6f ? (1.0f - thr) : 1.0f;

    for (int i = 0; i < npixels; ++i) {
        PIX p = g_srcBuffer[i];
        unsigned int r = UnpackR(p), g = UnpackG(p), b = UnpackB(p);
        unsigned char lum = LumaByte(r, g, b);
        float brightness = lum / 255.0f;

        if (brightness > thr) {
            float intensity = (brightness - thr) / denom;
            float softFactor = (1.0f - softness) * 2.0f + 0.5f;
            if (softFactor > 1.5f) intensity = intensity * intensity;
            unsigned int rq = (r / COLOR_QUANTIZATION) * COLOR_QUANTIZATION;
            unsigned int gq = (g / COLOR_QUANTIZATION) * COLOR_QUANTIZATION;
            unsigned int bq = (b / COLOR_QUANTIZATION) * COLOR_QUANTIZATION;
            unsigned int rr = (unsigned int)std::min(255.0f, rq * intensity);
            unsigned int gg = (unsigned int)std::min(255.0f, gq * intensity);
            unsigned int bb = (unsigned int)std::min(255.0f, bq * intensity);
            g_brightBuffer[i] = PackARGB(255u, rr, gg, bb);
        } else {
            g_brightBuffer[i] = 0u;
        }
    }

    // --- Blur passes ---
    const int tsize = TILE_SIZE;
    for (int ty = 0; ty < height; ty += tsize) {
        int th = std::min(tsize, height - ty);
        for (int tx = 0; tx < width; tx += tsize) {
            int tw = std::min(tsize, width - tx);
            TileHorizontalBlurSSE(g_brightBuffer, g_blurHorizontalBuffer,
                                 width, height, tx, ty, tw, th, params.radius);
        }
    }
    for (int ty = 0; ty < height; ty += tsize) {
        int th = std::min(tsize, height - ty);
        for (int tx = 0; tx < width; tx += tsize) {
            int tw = std::min(tsize, width - tx);
            TileVerticalBlurSSE(g_blurHorizontalBuffer, g_bloomBuffer,
                               width, height, tx, ty, tw, th, params.radius);
        }
    }

    // Final composition
    const float strength = params.strength;
    const __m128 mulStrengthF = _mm_set1_ps(strength);
    const __m128i maskFF = _mm_set1_epi32(0xFF);
    const __m128 one255 = _mm_set1_ps(255.0f);
    int limit = (npixels / 4) * 4;
    int j = 0;
    
    // Process 4 pixels at a time using SSE
    for (; j < limit; j += 4) {
        __m128i orig = _mm_loadu_si128((__m128i*)(g_srcBuffer + j));
        __m128i bl   = _mm_loadu_si128((__m128i*)(g_bloomBuffer + j));
        __m128i r_orig = _mm_and_si128(_mm_srli_epi32(orig, 16), maskFF);
        __m128i g_orig = _mm_and_si128(_mm_srli_epi32(orig, 8), maskFF);
        __m128i b_orig = _mm_and_si128(orig, maskFF);
        __m128i r_bl = _mm_and_si128(_mm_srli_epi32(bl, 16), maskFF);
        __m128i g_bl = _mm_and_si128(_mm_srli_epi32(bl, 8), maskFF);
        __m128i b_bl = _mm_and_si128(bl, maskFF);
        __m128 r_of = _mm_cvtepi32_ps(r_orig);
        __m128 g_of = _mm_cvtepi32_ps(g_orig);
        __m128 b_of = _mm_cvtepi32_ps(b_orig);
        __m128 r_bf = _mm_cvtepi32_ps(r_bl);
        __m128 g_bf = _mm_cvtepi32_ps(g_bl);
        __m128 b_bf = _mm_cvtepi32_ps(b_bl);
        __m128 r_resf = _mm_add_ps(r_of, _mm_mul_ps(r_bf, mulStrengthF));
        __m128 g_resf = _mm_add_ps(g_of, _mm_mul_ps(g_bf, mulStrengthF));
        __m128 b_resf = _mm_add_ps(b_of, _mm_mul_ps(b_bf, mulStrengthF));
        r_resf = _mm_min_ps(r_resf, one255);
        g_resf = _mm_min_ps(g_resf, one255);
        b_resf = _mm_min_ps(b_resf, one255);
        __m128i r_res = _mm_cvtps_epi32(r_resf);
        __m128i g_res = _mm_cvtps_epi32(g_resf);
        __m128i b_res = _mm_cvtps_epi32(b_resf);
        __m128i a255 = _mm_set1_epi32(0xFF000000);
        __m128i r_sh = _mm_slli_epi32(r_res, 16);
        __m128i g_sh = _mm_slli_epi32(g_res, 8);
        __m128i rgb = _mm_or_si128(_mm_or_si128(r_sh, g_sh), b_res);
        __m128i out = _mm_or_si128(a255, rgb);
        _mm_storeu_si128((__m128i*)(g_srcBuffer + j), out);
    }
    
    // Process remaining pixels using scalar code
    for (; j < npixels; ++j) {
        PIX orig = g_srcBuffer[j];
        PIX bl = g_bloomBuffer[j];
        unsigned int r = UnpackR(orig) + (unsigned int)(UnpackR(bl) * strength);
        unsigned int g = UnpackG(orig) + (unsigned int)(UnpackG(bl) * strength);
        unsigned int b = UnpackB(orig) + (unsigned int)(UnpackB(bl) * strength);
        if (r > 255u) r = 255u;
        if (g > 255u) g = 255u;
        if (b > 255u) b = 255u;
        g_srcBuffer[j] = PackARGB(255u, r, g, b);
    }

    // --- Write back to IImage ---
    // getData() exposes the pixel buffer directly, so a single memcpy is enough
    void* outData = image->getData();
    if (outData) {
        std::memcpy(outData, g_srcBuffer, (size_t)npixels * 4);
    } else {
        // Fallback (slow)
        int oi = 0;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x, ++oi) {
                PIX p = g_srcBuffer[oi];
                SColor sc(255, (u32)UnpackR(p), (u32)UnpackG(p), (u32)UnpackB(p));
                image->setPixel(x, y, sc);
            }
        }
    }
}

// ============================================================================
// CONFIGURATION FUNCTIONS
// ============================================================================

/**
 * Saves bloom configuration parameters to XML file
 */
bool SaveBloomConfig(const BloomParams& params, const char* filename) {
    IrrlichtDevice* device = createDevice(video::EDT_NULL);
    if (!device) return false;
    IXMLWriter* writer = device->getFileSystem()->createXMLWriter(filename);
    if (!writer) { device->drop(); return false; }
    writer->writeXMLHeader();
    writer->writeElement(L"BloomConfig");
    writer->writeLineBreak();
    core::stringw thresholdStr = core::stringw(params.threshold);
    core::stringw softnessStr = core::stringw(params.softness);
    core::stringw radiusStr = core::stringw(params.radius);
    core::stringw strengthStr = core::stringw(params.strength);
    writer->writeElement(L"Threshold", false, L"value", thresholdStr.c_str());
    writer->writeLineBreak();
    writer->writeElement(L"Softness", false, L"value", softnessStr.c_str());
    writer->writeLineBreak();
    writer->writeElement(L"Radius", false, L"value", radiusStr.c_str());
    writer->writeLineBreak();
    writer->writeElement(L"Strength", false, L"value", strengthStr.c_str());
    writer->writeLineBreak();
    writer->writeClosingTag(L"BloomConfig");
    writer->writeLineBreak();
    writer->drop();
    device->drop();
    return true;
}

/**
 * Loads bloom configuration parameters from XML file
 */
bool LoadBloomConfig(BloomParams& params, const char* filename) {
    IrrlichtDevice* device = createDevice(video::EDT_NULL);
    if (!device) return false;
    if (!device->getFileSystem()->existFile(filename)) { device->drop(); return false; }
    IXMLReader* reader = device->getFileSystem()->createXMLReader(filename);
    if (!reader) { device->drop(); return false; }
    params.threshold = 0.0f;
    params.softness = 1.0f;
    params.radius = 16;
    params.strength = 1.6f;
    while (reader->read()) {
        switch (reader->getNodeType()) {
            case io::EXN_ELEMENT: {
                core::stringw nodeName = reader->getNodeName();
                if (nodeName == L"Threshold") {
                    const wchar_t* value = reader->getAttributeValue(L"value");
                    if (value) params.threshold = (float)wcstod(value, NULL);
                } else if (nodeName == L"Softness") {
                    const wchar_t* value = reader->getAttributeValue(L"value");
                    if (value) params.softness = (float)wcstod(value, NULL);
                } else if (nodeName == L"Radius") {
                    params.radius = reader->getAttributeValueAsInt(L"value");
                } else if (nodeName == L"Strength") {
                    const wchar_t* value = reader->getAttributeValue(L"value");
                    if (value) params.strength = (float)wcstod(value, NULL);
                }
                break;
            }
            default: break;
        }
    }
    reader->drop();
    device->drop();
    return true;
}

// ============================================================================
// EVENT RECEIVER CLASS
// ============================================================================

/**
 * Event receiver for handling GUI interactions
 * Specifically handles save button press events
 */
class SaveButtonEventReceiver : public IEventReceiver {
private:
    bool saveButtonPressed;
    BloomParams* bloomParams;
    IGUIButton* saveButton;
public:
    SaveButtonEventReceiver() : saveButtonPressed(false), bloomParams(NULL), saveButton(NULL) {}
    
    virtual bool OnEvent(const SEvent& event) {
        if (event.EventType == EET_GUI_EVENT) {
            if (event.GUIEvent.EventType == EGET_BUTTON_CLICKED) {
                if (event.GUIEvent.Caller == saveButton) {
                    saveButtonPressed = true;
                    return true;
                }
            }
        }
        return false;
    }
    
    void setBloomParams(BloomParams* params) { bloomParams = params; }
    void setSaveButton(IGUIButton* button) { saveButton = button; }
    
    bool isSaveButtonPressed() {
        if (saveButtonPressed) {
            saveButtonPressed = false;
            return true;
        }
        return false;
    }
};

// ============================================================================
// MAIN APPLICATION
// ============================================================================

/**
 * Main application entry point
 * Initializes Irrlicht engine, sets up scene, and runs main loop
 */
int main(int argc, char** argv) {
    PrecalcGaussianFloat(8);

    SaveButtonEventReceiver receiver;
    IrrlichtDevice *device = createDevice(
        video::EDT_OPENGL,
        dimension2d<u32>(640, 480),
        16, false, true, false, &receiver
    );
    if (!device) return 1;
    device->setWindowCaption(L"Optimized Bloom SSE2 Tile - 256x256 (CPU)");

    IVideoDriver* driver = device->getVideoDriver();
    ISceneManager* smgr = device->getSceneManager();
    IGUIEnvironment* guienv = device->getGUIEnvironment();

    io::IFileSystem* fs = device->getFileSystem();
    core::stringc exePath = fs->getAbsolutePath(argv[0]);
    core::stringc exeDir = fs->getFileDir(exePath);
    core::stringc configPath = exeDir + "/bloom_config.xml";
    const char* configFileName = configPath.c_str();

    BloomParams bloomParams;
    bloomParams.threshold = 0.3f;
    bloomParams.softness = 0.8f;
    bloomParams.radius = 8;
    bloomParams.strength = 0.6f;

    if (LoadBloomConfig(bloomParams, configFileName)) {
        printf("Configuration loaded from %s\n", configFileName);
    } else {
        printf("No configuration file found (%s), using default values\n", configFileName);
    }

    // --- GUI Setup ---
    guienv->addStaticText(L"Threshold:", rect<s32>(10,10,100,30), false, false, 0, 1000);
    IGUIScrollBar* thresholdScroll = guienv->addScrollBar(true, rect<s32>(110,10,250,30), 0, 1001);
    thresholdScroll->setMax(100);
    thresholdScroll->setPos((s32)(bloomParams.threshold * 100));
    guienv->addStaticText(L"Softness:", rect<s32>(10,40,100,60), false, false, 0, 1002);
    IGUIScrollBar* softnessScroll = guienv->addScrollBar(true, rect<s32>(110,40,250,60), 0, 1003);
    softnessScroll->setMax(100);
    softnessScroll->setPos((s32)(bloomParams.softness * 100));
    guienv->addStaticText(L"Radius:", rect<s32>(10,70,100,90), false, false, 0, 1004);
    IGUIScrollBar* radiusScroll = guienv->addScrollBar(true, rect<s32>(110,70,250,90), 0, 1005);
    radiusScroll->setMax(20);
    radiusScroll->setPos(bloomParams.radius);
    guienv->addStaticText(L"Strength:", rect<s32>(10,100,100,120), false, false, 0, 1006);
    IGUIScrollBar* strengthScroll = guienv->addScrollBar(true, rect<s32>(110,100,250,120), 0, 1007);
    strengthScroll->setMax(100);
    strengthScroll->setPos((s32)(bloomParams.strength * 100 / 5.0f));
    IGUIButton* saveButton = guienv->addButton(rect<s32>(10,130,250,160), 0, 1008, L"Save Configuration");
    receiver.setSaveButton(saveButton);
    receiver.setBloomParams(&bloomParams);

    // --- 3D Scene Setup ---
    IAnimatedMesh* mesh = smgr->getMesh("../../media/sydney.md2");
    IAnimatedMeshSceneNode* node = NULL;
    if (!mesh) {
        printf("Could not load ../../media/sydney.md2\n");
        device->drop();
        return 1;
    } else {
        node = smgr->addAnimatedMeshSceneNode(mesh);
        if (node) {
            node->setMaterialFlag(EMF_LIGHTING, true);
            node->setMD2Animation(scene::EMAT_STAND);
            node->setMaterialTexture(0, driver->getTexture("../../media/sydney.bmp"));
            node->addShadowVolumeSceneNode();
            node->setMaterialFlag(EMF_NORMALIZE_NORMALS, true);
        }
    }
    IAnimatedMesh* planeMesh = smgr->addHillPlaneMesh("floor",
        dimension2d<f32>(20,20), dimension2d<u32>(25,25),
        0, 0.0f, dimension2d<f32>(0,0), dimension2d<f32>(20,20));
    IMeshSceneNode* floor = smgr->addMeshSceneNode(planeMesh->getMesh(0));
    if (floor) {
        floor->setPosition(vector3df(0,-25.2,0));
        floor->setMaterialTexture(0, driver->getTexture("../../media/wall.bmp"));
        floor->setMaterialFlag(EMF_LIGHTING, true);
        floor->setMaterialFlag(EMF_BILINEAR_FILTER, false);
    }
    ILightSceneNode* light = smgr->addLightSceneNode(0, vector3df(0,0,0),
        SColorf(1.0f, 1.0f, 1.0f, 1.0f), 100.0f);
    if (light) {
        light->setPosition(vector3df(0, 20, 30));
        light->getLightData().DiffuseColor.set(1.0f, 1.0f, 1.0f);
        light->getLightData().SpecularColor.set(0.5f, 0.5f, 0.5f);
        light->getLightData().AmbientColor.set(0.2f, 0.2f, 0.2f);
    }
    ICameraSceneNode* camera = smgr->addCameraSceneNode(0);
    matrix4 proj;
    f32 viewWidth = 200.0f;
    f32 viewHeight = 200.0f;
    proj.buildProjectionMatrixOrthoLH(viewWidth, viewHeight, 0.1f, 1000.0f);
    camera->setProjectionMatrix(proj, true);
    camera->setPosition(vector3df(30, 30, 30));
    camera->setTarget(vector3df(0, 5, 0));

    // --- Resources Setup ---
    ITexture* renderTexture = driver->addRenderTargetTexture(
        dimension2d<u32>(RENDER_WIDTH, RENDER_HEIGHT), "render256");

    IImage* processedImage = driver->createImage(ECF_A8R8G8B8,
        dimension2d<u32>(RENDER_WIDTH, RENDER_HEIGHT));

    // --- OPTIMIZATION 3: Remove renderImage, no longer needed ---
    // IImage* renderImage = driver->createImage(ECF_A8R8G8B8,
    //    dimension2d<u32>(RENDER_WIDTH, RENDER_HEIGHT));

    // --- OPTIMIZATION 1: Create displayTexture only once ---
    // (We use processedImage to create it with the correct size and format)
    ITexture* displayTexture = driver->addTexture("display", processedImage);

    // --- Animation and FPS variables ---
    f32 lightAngle = 0.0f;
    f32 lightRadius = 30.0f;
    f32 lightHeight = 20.0f;
    u32 lastFPSTime = device->getTimer()->getTime();
    u32 frameCount = 0;
    const u32 fpsUpdateInterval = 1000;

    // --- Main Loop ---
    while (device->run()) {
        frameCount++;
        u32 currentTime = device->getTimer()->getTime();
        u32 deltaTime = currentTime - lastFPSTime;

        if (deltaTime >= fpsUpdateInterval) {
            f32 fps = (f32)frameCount / (f32)deltaTime * 1000.0f;
            core::stringw str = L"Optimized Bloom SSE2 Tile - 256x256 (CPU) - FPS: ";
            str += (int)fps;
            device->setWindowCaption(str.c_str());
            lastFPSTime = currentTime;
            frameCount = 0;
        }

        // --- GUI and Scene Update ---
        bloomParams.threshold = thresholdScroll->getPos() / 100.0f;
        bloomParams.softness = softnessScroll->getPos() / 100.0f;
        bloomParams.radius = radiusScroll->getPos();
        bloomParams.strength = strengthScroll->getPos() / 100.0f * 5.0f;
        if (receiver.isSaveButtonPressed()) {
            if (SaveBloomConfig(bloomParams, configFileName)) {
                printf("Configuration saved to %s\n", configFileName);
            }
        }
        lightAngle += 0.01f;
        if (lightAngle > 2.0f * 3.14159265f) lightAngle -= 2.0f * 3.14159265f;
        if (light) {
            float cosv = cosf(lightAngle);
            float sinv = sinf(lightAngle);
            float x = cosv * lightRadius;
            float z = sinv * lightRadius;
            light->setPosition(vector3df(x, lightHeight, z));
        }

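        // Begin the frame (pairs with endScene() below), then render the scene to the texture
        driver->beginScene(true, true, SColor(255,100,101,140));
        driver->setRenderTarget(renderTexture, true, true, SColor(255,100,101,140));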
        smgr->drawAll();

        // --- OPTIMIZATION 3: Copy RTT (GPU) directly to processedImage (CPU) ---
        // Removes the intermediate copy to 'renderImage'
        void* lockedData = renderTexture->lock(ETLM_READ_ONLY);
        if (lockedData) {
            void* processedData = processedImage->lock();
            if (processedData) {
                std::memcpy(processedData, lockedData, RENDER_WIDTH * RENDER_HEIGHT * 4);
                processedImage->unlock();
            }
            renderTexture->unlock();
        }
        // --- END OPTIMIZATION 3 ---

        // Apply CPU bloom filter
        ApplyBloomToImageOptimized(processedImage, bloomParams);

        // --- OPTIMIZATION 1: Update displayTexture instead of recreating it ---
        if (displayTexture) {
            void* textureData = displayTexture->lock(ETLM_WRITE_ONLY);
            if (textureData) {
                // Copy processed data (CPU) to texture (GPU)
                std::memcpy(textureData, processedImage->getData(), RENDER_WIDTH * RENDER_HEIGHT * 4);
                displayTexture->unlock();
            }
        }
        // --- END OPTIMIZATION 1 ---

        // Present final image
        driver->setRenderTarget(0, true, true, SColor(255,100,101,140));
        const dimension2du screenSize = driver->getScreenSize();
        const u32 displaySize = screenSize.Height;
        const u32 xOffset = (screenSize.Width - displaySize) / 2;

        driver->draw2DImage(displayTexture,
                            rect<s32>(xOffset, 0, xOffset + displaySize, displaySize),
                            rect<s32>(0, 0, RENDER_WIDTH, RENDER_HEIGHT),
                            0, 0, true);

        // Frame rate limiting and GUI rendering
        //device->sleep(17);
        guienv->drawAll();
        driver->endScene();
    }

    // --- Cleanup Resources ---
    // --- OPTIMIZATION 3: Remove renderImage->drop() ---
    // renderImage->drop();
    driver->removeTexture(renderTexture);
    driver->removeTexture(displayTexture);
    processedImage->drop();

    // Free global buffers
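    if (g_srcBuffer) { aligned_free(g_srcBuffer); g_srcBuffer = NULL; }
    if (g_brightBuffer) { aligned_free(g_brightBuffer); g_brightBuffer = NULL; }
    // The horizontal blur buffer was allocated with the others but never released
    if (g_blurHorizontalBuffer) { aligned_free(g_blurHorizontalBuffer); g_blurHorizontalBuffer = NULL; }
    if (g_bloomBuffer) { aligned_free(g_bloomBuffer); g_bloomBuffer = NULL; }
    if (g_tmpTile) { aligned_free(g_tmpTile); g_tmpTile = NULL; }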

    device->drop();
    return 0;
}

I don't know how to fix the shadow volume problem at the moment (Irrlicht may have options for this, and I may not have configured the shadow volume properly).

CuteAlien · Post by **CuteAlien** » Sat Nov 01, 2025 11:42 pm

Yeah, the Sydney model isn't fully closed, so some optimizations can't be used. It's a bit slower, unfortunately, but you can try this:

Code: Select all

scene::IShadowVolumeSceneNode * shadVol = node->addShadowVolumeSceneNode();
if(shadVol) shadVol->setOptimization(scene::ESV_NONE);

Noiecity · Post by **Noiecity** » Sun Nov 02, 2025 10:44 pm

CuteAlien wrote: Sat Nov 01, 2025 11:42 pm Yeah, the Sydney model isn't fully closed, so some optimizations can't be used. It's a bit slower, unfortunately, but you can try this:
Code: Select all
scene::IShadowVolumeSceneNode * shadVol = node->addShadowVolumeSceneNode();
if(shadVol) shadVol->setOptimization(scene::ESV_NONE);

Thanks, mr cuteGod. What do you think of the optimizations? Is there anything else I'm not including that Irrlicht does better? I didn't want to include bitwise operators, bit masks, and two-component values, nor labels as values, since the code would become too long and difficult to read.

I have an example of realistic perspective, rendered with a FOV of 19 several times at different angles, then I created a single image from those renders, using algorithms and other techniques to achieve a high FOV without recurring distortion...
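
As a rough illustration of the multi-render idea, the sketch below works out how many narrow renders are needed and which way each one would face. It assumes the 19 is in degrees and that the slices sit side by side by yaw with no overlap; `sliceYawAngles` is a hypothetical helper, and the actual stitching code may work differently.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: returns the yaw (in degrees) of each narrow render,
// centered on 0, so that adjacent slices of 'sliceFovDeg' meet edge to edge
// and together cover at least 'totalFovDeg'.
std::vector<double> sliceYawAngles(double sliceFovDeg, double totalFovDeg) {
    std::vector<double> yaws;
    if (sliceFovDeg <= 0.0 || totalFovDeg <= 0.0) return yaws;

    // Round up so the slices never cover less than the requested field.
    const int count = (int)std::ceil(totalFovDeg / sliceFovDeg);
    const double covered = count * sliceFovDeg;

    // Center of slice i, measured from the left edge of the covered range.
    for (int i = 0; i < count; ++i)
        yaws.push_back(-0.5 * covered + sliceFovDeg * (i + 0.5));
    return yaws;
}
```

For a 19 degree slice and a 90 degree target this yields five renders, at yaws of -38, -19, 0, 19 and 38 degrees.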

CuteAlien · Post by **CuteAlien** » Mon Nov 03, 2025 12:20 am

The code is getting a bit too complex for a quick look ;-) But one somewhat unexpected thing that can slow it down is calling setWindowCaption too often. That function is far more expensive than makes sense (not sure what Windows does when updating it). That's one reason we only update it in the examples when the FPS really changes.
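
The caption advice can be sketched as a small helper that only asks for a caption update when the displayed FPS value actually changes. `CaptionThrottle` and `makeCaption` are hypothetical names for this sketch, not Irrlicht API; the Irrlicht calls themselves are left to the usage note so the block compiles on its own.

```cpp
#include <string>

// Hypothetical helper (not part of Irrlicht): remembers the last FPS value
// written to the caption and reports whether a new value differs from it.
class CaptionThrottle {
public:
    CaptionThrottle() : lastFps(-1) {}

    // Returns true only when 'fps' differs from the value shown last time;
    // the caller should touch the window caption only in that case.
    bool shouldUpdate(int fps) {
        if (fps == lastFps) return false;
        lastFps = fps;
        return true;
    }

private:
    int lastFps; // -1 means nothing has been shown yet
};

// Builds the caption text; kept separate so it only runs when needed.
std::wstring makeCaption(const std::wstring& title, int fps) {
    return title + L" - FPS: " + std::to_wstring(fps);
}
```

In the main loop this could be used roughly as `int fps = driver->getFPS(); if (throttle.shouldUpdate(fps)) device->setWindowCaption(makeCaption(L"Optimized Bloom", fps).c_str());`, so the expensive call runs only when the number changes.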

But basically the big problem when doing this kind of algorithm on the CPU is a) image size (as it runs on every pixel) and b) moving textures between CPU and GPU. So it might work for some stuff, but in games you'll usually have to switch to shaders at some point.
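
To put rough numbers on points a) and b), here is a back-of-the-envelope sketch. `FrameCost` and `estimateFrameCost` are made-up names, and the model is an assumption: a separable blur with one horizontal and one vertical pass, 4 bytes per pixel, and one readback plus one upload per frame. It ignores the bright pass and the composition step, so treat the result as a lower bound.

```cpp
#include <cstddef>

// Rough per-frame workload for a CPU post-process on a 32-bit image.
struct FrameCost {
    std::size_t pixels;        // pixels touched by each full-image pass
    std::size_t blurSamples;   // source samples read by the separable blur
    std::size_t transferBytes; // GPU->CPU readback plus CPU->GPU upload
};

// Hypothetical estimator; the cost model is an assumption, not a measurement.
FrameCost estimateFrameCost(std::size_t width, std::size_t height, std::size_t radius) {
    FrameCost c;
    c.pixels = width * height;
    // Two passes (horizontal + vertical), each reading 2*radius + 1 samples per pixel.
    c.blurSamples = c.pixels * 2 * (2 * radius + 1);
    // 4 bytes per pixel, moved once in each direction.
    c.transferBytes = c.pixels * 4 * 2;
    return c;
}
```

For 256x256 with radius 8 this gives 65,536 pixels, about 2.2 million blur samples, and 512 KiB moved per frame. At 1024x1024 every figure is 16 times larger, which fits the point about eventually switching to shaders.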

Noiecity · Post by **Noiecity** » Mon Nov 03, 2025 12:29 am

CuteAlien wrote: Mon Nov 03, 2025 12:20 am The code is getting a bit too complex for a quick look. But one somewhat unexpected thing that can slow it down is calling setWindowCaption too often. That function is far more expensive than makes sense (not sure what Windows does when updating it). That's one reason we only update it in the examples when the FPS really changes.

But basically the big problem when doing this kind of algorithm on the CPU is a) image size (as it runs on every pixel) and b) moving textures between CPU and GPU. So it might work for some stuff, but in games you'll usually have to switch to shaders at some point.

Thanks. Look, this example is the best I can offer without shaders. For some reason, older computers run faster for me if I move part of the work to the CPU. Without shaders, I can't increase the resolution unless I get a better CPU, of course (example of realistic perspective with almost no distortion):

Code: Select all

//-O3 -march=native -funroll-loops -ffast-math -msse2
#include <irrlicht.h>
#include <sstream>
#include <iomanip>
#include <cmath>
#include <vector> // Required for the LUT
#include <cstdlib> // malloc/free for the blur buffers

using namespace irr;
using namespace core;
using namespace scene;
using namespace video;
using namespace io;
using namespace gui;

#ifdef _IRR_WINDOWS_
#pragma comment(lib, "Irrlicht.lib")
#endif

// --- MyEventReceiver (unchanged) ---
class MyEventReceiver : public IEventReceiver {
public:
    virtual bool OnEvent(const SEvent& event) {
        if (event.EventType == EET_KEY_INPUT_EVENT) {
            keys[event.KeyInput.Key] = event.KeyInput.PressedDown;
        }
        return false;
    }
    virtual bool IsKeyDown(EKEY_CODE keyCode) const {
        return keys[keyCode];
    }
    MyEventReceiver() {
        for (u32 i = 0; i < KEY_KEY_CODES_COUNT; ++i)
            keys[i] = false;
    }
private:
    bool keys[KEY_KEY_CODES_COUNT];
};

// --- intToString (Optimized) ---
// stringw(int) is more direct and probably faster than std::to_wstring
stringw intToString(int value) {
    return stringw(value);
}

// --- buildOffCenterPerspectiveLH (Optimized) ---
// Removed redundant initialization loop since all 16 values are explicitly set.
core::matrix4 buildOffCenterPerspectiveLH(f32 l, f32 r, f32 b, f32 t, f32 zn, f32 zf) {
    core::matrix4 m;
    const f32 A = 2.0f * zn / (r - l);
    const f32 B = 2.0f * zn / (t - b);
    const f32 C = (l + r) / (l - r);
    const f32 D = (t + b) / (b - t);
    const f32 E = zf / (zf - zn);
    const f32 F = (zn * zf) / (zn - zf);
    m(0, 0) = A;    m(0, 1) = 0.0f; m(0, 2) = 0.0f; m(0, 3) = 0.0f;
    m(1, 0) = 0.0f; m(1, 1) = B;    m(1, 2) = 0.0f; m(1, 3) = 0.0f;
    m(2, 0) = C;    m(2, 1) = D;    m(2, 2) = E;    m(2, 3) = 1.0f;
    m(3, 0) = 0.0f; m(3, 1) = 0.0f; m(3, 2) = F;    m(3, 3) = 0.0f;
    return m;
}


// --- FAST BLUR IMPLEMENTATION START ---
// fast_blur_for_irrlicht.cpp

struct PixelF { float a, r, g, b; }; // alpha first for cache locality if desired

static PixelF* g_bufA = 0;
static PixelF* g_bufB = 0;
static int g_bufSize = 0;    // in pixels
static int g_tileW = 32;
static int g_tileH = 32;

static inline void freeBuffers()
{
    if (g_bufA) { free(g_bufA); g_bufA = 0; }
    if (g_bufB) { free(g_bufB); g_bufB = 0; }
    g_bufSize = 0;
}

static inline void ensureBuffers(int pixels)
{
    if (pixels <= g_bufSize) return;
    freeBuffers();
    g_bufA = (PixelF*)malloc(sizeof(PixelF) * pixels);
    g_bufB = (PixelF*)malloc(sizeof(PixelF) * pixels);
    if (!g_bufA || !g_bufB) {
        freeBuffers();
        return;
    }
    g_bufSize = pixels;
}

void BlurInit(int maxWidth, int maxHeight, int maxDownsampleFactor, int tileW, int tileH)
{
    if (tileW > 0) g_tileW = tileW;
    if (tileH > 0) g_tileH = tileH;
    if (maxDownsampleFactor < 1) maxDownsampleFactor = 1; // avoid division by zero
    int w = (maxWidth + (maxDownsampleFactor - 1)) / maxDownsampleFactor;
    int h = (maxHeight + (maxDownsampleFactor - 1)) / maxDownsampleFactor;
    if (w < 1) w = 1;
    if (h < 1) h = 1;
    ensureBuffers(w * h);
}

void BlurShutdown()
{
    freeBuffers();
}

// ---------------------- Helpers (float premultiplied) ----------------------

// Horizontal sliding-window box blur for rows y0..y1-1 (float accumulators)
static void boxBlurH_range_float(const PixelF* src, PixelF* dst, int w, int h, int radius, int y0, int y1)
{
    int diameter = radius * 2 + 1;
    for (int y = y0; y < y1; ++y) {
        const PixelF* row = src + y * w;
        PixelF* out = dst + y * w;

        double sumA = 0.0, sumR = 0.0, sumG = 0.0, sumB = 0.0;
        int x;
        for (x = -radius; x <= radius; ++x) {
            int xi = x;
            if (xi < 0) xi = 0;
            if (xi >= w) xi = w - 1;
            const PixelF& c = row[xi];
            sumA += c.a;
            sumR += c.r;
            sumG += c.g;
            sumB += c.b;
        }
        for (x = 0; x < w; ++x) {
            PixelF p;
            float inv = 1.0f / (float)diameter;
            p.a = (float)(sumA * inv);
            p.r = (float)(sumR * inv);
            p.g = (float)(sumG * inv);
            p.b = (float)(sumB * inv);
            out[x] = p;

            int idxRemove = x - radius;
            int idxAdd = x + radius + 1;
            if (idxRemove < 0) idxRemove = 0;
            if (idxAdd >= w) idxAdd = w - 1;
            const PixelF& cremove = row[idxRemove];
            const PixelF& cadd = row[idxAdd];
            sumA = sumA - cremove.a + cadd.a;
            sumR = sumR - cremove.r + cadd.r;
            sumG = sumG - cremove.g + cadd.g;
            sumB = sumB - cremove.b + cadd.b;
        }
    }
}

// Vertical sliding-window box blur for cols x0..x1-1
static void boxBlurV_range_float(const PixelF* src, PixelF* dst, int w, int h, int radius, int x0, int x1)
{
    int diameter = radius * 2 + 1;
    for (int x = x0; x < x1; ++x) {
        double sumA = 0.0, sumR = 0.0, sumG = 0.0, sumB = 0.0;
        int y;
        for (y = -radius; y <= radius; ++y) {
            int yi = y;
            if (yi < 0) yi = 0;
            if (yi >= h) yi = h - 1;
            const PixelF& c = src[yi * w + x];
            sumA += c.a;
            sumR += c.r;
            sumG += c.g;
            sumB += c.b;
        }
        for (y = 0; y < h; ++y) {
            PixelF p;
            float inv = 1.0f / (float)diameter;
            p.a = (float)(sumA * inv);
            p.r = (float)(sumR * inv);
            p.g = (float)(sumG * inv);
            p.b = (float)(sumB * inv);
            dst[y * w + x] = p;

            int idxRemove = y - radius;
            int idxAdd = y + radius + 1;
            if (idxRemove < 0) idxRemove = 0;
            if (idxAdd >= h) idxAdd = h - 1;
            const PixelF& cremove = src[idxRemove * w + x];
            const PixelF& cadd = src[idxAdd * w + x];
            sumA = sumA - cremove.a + cadd.a;
            sumR = sumR - cremove.r + cadd.r;
            sumG = sumG - cremove.g + cadd.g;
            sumB = sumB - cremove.b + cadd.b;
        }
    }
}

// tiled passes: apply 3 box blur passes, process H and V in tiles
static void fastGaussianApprox3Boxes_tiled_float(PixelF* bufA, PixelF* bufB, int w, int h, int radius)
{
    if (!bufA || !bufB) return;
    int passes = 3;
    PixelF* curSrc = bufA;
    PixelF* curDst = bufB;

    for (int pass = 0; pass < passes; ++pass) {
        // H pass in tile rows
        for (int y0 = 0; y0 < h; y0 += g_tileH) {
            int y1 = y0 + g_tileH;
            if (y1 > h) y1 = h;
            boxBlurH_range_float(curSrc, curDst, w, h, radius, y0, y1);
        }
        PixelF* tmp = curSrc; curSrc = curDst; curDst = tmp;

        // V pass in tile cols
        for (int x0 = 0; x0 < w; x0 += g_tileW) {
            int x1 = x0 + g_tileW;
            if (x1 > w) x1 = w;
            boxBlurV_range_float(curSrc, curDst, w, h, radius, x0, x1);
        }
        tmp = curSrc; curSrc = curDst; curDst = tmp;
    }

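    // Each pass swaps the buffers twice, so curSrc is normally bufA again here;
    // the copy below only runs if that invariant is ever broken.
    if (curSrc != bufA) {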
        int pix = w * h;
        for (int i = 0; i < pix; ++i) bufA[i] = curSrc[i];
    }
}

// Downsample from texture memory into g_bufA (store premultiplied floats)
static void downsample_area_fromTex_float(const unsigned char* srcData, u32 pitch, int srcW, int srcH, int factor, PixelF* dst, int dstW, int dstH)
{
    const int K = factor;
    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            int sx = x * K;
            int sy = y * K;
            double sumA=0.0, sumR=0.0, sumG=0.0, sumB=0.0;
            int count = 0;
            for (int yy = 0; yy < K; ++yy) {
                int py = sy + yy;
                if (py >= srcH) break;
                const unsigned char* row = srcData + py * pitch;
                for (int xx = 0; xx < K; ++xx) {
                    int px = sx + xx;
                    if (px >= srcW) break;
                    unsigned int c = *((const unsigned int*)(row + px * 4)); // keep the source read-only
                    unsigned int A = (c >> 24) & 0xFF;
                    unsigned int R = (c >> 16) & 0xFF;
                    unsigned int G = (c >> 8) & 0xFF;
                    unsigned int B = (c) & 0xFF;
                    float af = (float)A * (1.0f/255.0f);
                    // premultiplied color
                    sumA += af;
                    sumR += ((float)R * (1.0f/255.0f)) * af;
                    sumG += ((float)G * (1.0f/255.0f)) * af;
                    sumB += ((float)B * (1.0f/255.0f)) * af;
                    ++count;
                }
            }
            if (count == 0) count = 1;
            PixelF p;
            float invCount = 1.0f / (float)count;
            p.a = (float)(sumA * invCount);
            p.r = (float)(sumR * invCount);
            p.g = (float)(sumG * invCount);
            p.b = (float)(sumB * invCount);
            dst[y * dstW + x] = p;
        }
    }
}

// Upsample from small buffer and write back to texture memory blending with original by intensity
static void upsample_blend_writeToTex(PixelF* smallBuf, int smallW, int smallH, unsigned char* dstData, u32 dstPitch, int dstW, int dstH, float intensity)
{
    // clamp intensity 0..1
    if (intensity <= 0.0f) return; // nothing to do
    if (intensity > 1.0f) intensity = 1.0f;

    for (int y = 0; y < dstH; ++y) {
        unsigned char* rowOut = dstData + y * dstPitch;
        float fy = ((float)y + 0.5f) * ((float)smallH / (float)dstH) - 0.5f;
        int sy = (int)floorf(fy);
        float wy = fy - sy;
        if (sy < 0) { sy = 0; wy = 0.0f; }
        if (sy >= smallH - 1) { sy = smallH - 1; wy = 0.0f; }
        for (int x = 0; x < dstW; ++x) {
            float fx = ((float)x + 0.5f) * ((float)smallW / (float)dstW) - 0.5f;
            int sx = (int)floorf(fx);
            float wx = fx - sx;
            if (sx < 0) { sx = 0; wx = 0.0f; }
            if (sx >= smallW - 1) { sx = smallW - 1; wx = 0.0f; }

            PixelF c00 = smallBuf[sy * smallW + sx];
            PixelF c10 = smallBuf[sy * smallW + (sx+1 < smallW ? sx+1 : sx)];
            PixelF c01 = smallBuf[(sy+1 < smallH ? sy+1 : sy) * smallW + sx];
            PixelF c11 = smallBuf[(sy+1 < smallH ? sy+1 : sy) * smallW + (sx+1 < smallW ? sx+1 : sx)];

            float w00 = (1.0f - wx) * (1.0f - wy);
            float w10 = wx * (1.0f - wy);
            float w01 = (1.0f - wx) * wy;
            float w11 = wx * wy;

            // interpolate premultiplied floats and alpha
            float a_prem = c00.a * w00 + c10.a * w10 + c01.a * w01 + c11.a * w11;
            float r_prem = c00.r * w00 + c10.r * w10 + c01.r * w01 + c11.r * w11;
            float g_prem = c00.g * w00 + c10.g * w10 + c01.g * w01 + c11.g * w11;
            float b_prem = c00.b * w00 + c10.b * w10 + c01.b * w01 + c11.b * w11;

            // un-premultiply (if alpha > 0)
            float outA_f = a_prem;
            float outR_f = 0.0f, outG_f = 0.0f, outB_f = 0.0f;
            if (outA_f > 1e-6f) {
                outR_f = r_prem / outA_f;
                outG_f = g_prem / outA_f;
                outB_f = b_prem / outA_f;
            }

            // convert to 0..255 ints
            int blurA = (int)(outA_f * 255.0f + 0.5f);
            int blurR = (int)(outR_f * 255.0f + 0.5f);
            int blurG = (int)(outG_f * 255.0f + 0.5f);
            int blurB = (int)(outB_f * 255.0f + 0.5f);
            blurA = core::clamp(blurA, 0, 255);
            blurR = core::clamp(blurR, 0, 255);
            blurG = core::clamp(blurG, 0, 255);
            blurB = core::clamp(blurB, 0, 255);

            // read original pixel
            unsigned int orig = *((unsigned int*)(rowOut + x * 4));
            int origA = (orig >> 24) & 0xFF;
            int origR = (orig >> 16) & 0xFF;
            int origG = (orig >> 8) & 0xFF;
            int origB = orig & 0xFF;

            // Lerping between original and blurred by intensity
            int finalA = (int)(origA * (1.0f - intensity) + blurA * intensity + 0.5f);
            int finalR = (int)(origR * (1.0f - intensity) + blurR * intensity + 0.5f);
            int finalG = (int)(origG * (1.0f - intensity) + blurG * intensity + 0.5f);
            int finalB = (int)(origB * (1.0f - intensity) + blurB * intensity + 0.5f);

            finalA = core::clamp(finalA, 0, 255);
            finalR = core::clamp(finalR, 0, 255);
            finalG = core::clamp(finalG, 0, 255);
            finalB = core::clamp(finalB, 0, 255);

            unsigned int out = ((finalA & 0xFF) << 24) | ((finalR & 0xFF) << 16) | ((finalG & 0xFF) << 8) | (finalB & 0xFF);
            *((unsigned int*)(rowOut + x * 4)) = out;
        }
    }
}

// *** NEW OPTIMIZED WRAPPER FUNCTION ***
// This function operates on an IImage (RAM) instead of an ITexture (VRAM)
void applyFastBloomBlurToImage(video::IImage* image, int downsample, int radius, float intensity)
{
    if (!image) return;
    if (downsample < 1) downsample = 1;
    if (radius < 1) radius = 1;
    if (intensity <= 0.0f) return;

    if (image->getColorFormat() != ECF_A8R8G8B8) {
        return;
    }

    const core::dimension2d<u32>& size = image->getDimension();
    int srcW = (int)size.Width;
    int srcH = (int)size.Height;
    if (srcW <= 0 || srcH <= 0) return;

    int smallW = (srcW + downsample - 1) / downsample;
    int smallH = (srcH + downsample - 1) / downsample;
    if (smallW <= 0) smallW = 1;
    if (smallH <= 0) smallH = 1;
    int pixels = smallW * smallH;
    ensureBuffers(pixels);
    if (!g_bufA || !g_bufB) return;

    // Lock image
    image->lock();
    unsigned char* data = (unsigned char*)image->getData();
    u32 pitch = image->getPitch();

    // Downsample from IImage to g_bufA
    // (We reuse the existing function since it takes a char* pointer)
    downsample_area_fromTex_float(data, pitch, srcW, srcH, downsample, g_bufA, smallW, smallH);

    // Apply tiled blur on small buffer
    fastGaussianApprox3Boxes_tiled_float(g_bufA, g_bufB, smallW, smallH, radius);

    // Upsample and blend back to IImage
    // (We reuse the existing function)
    upsample_blend_writeToTex(g_bufA, smallW, smallH, data, pitch, srcW, srcH, intensity);

    image->unlock();
}
// --- FAST BLUR IMPLEMENTATION END ---


// *** NEW STRUCTURE FOR THE LUT ***
// A simple struct to store the source coordinate (as an offset) and its validity
struct ReprojectionCoord
{
    u32 offset; // Pixel offset in the source image
    bool valid; // Is this pixel within the rendered FOV?
};
// Global pointer for our LUT. Will be initialized in main().
ReprojectionCoord* g_reprojectionLUT = 0;


int main() {
    MyEventReceiver receiver;

    // --- Configuration (MODIFIED for over-rendering) ---
    const u32 TILE_W = 64;
    const u32 TILE_H = 64;
    const u32 H_COUNT = 10; // Was 8. Now rendering 10 tiles wide (10 * 64 = 640)
    const u32 V_COUNT = 8;  // Was 6. Now rendering 8 tiles high (8 * 64 = 512)
    const dimension2du RENDER_SIZE(TILE_W, TILE_H);
    const dimension2du FINAL_SIZE(TILE_W * H_COUNT, TILE_H * V_COUNT); // Now 640x512

    IrrlichtDevice* device = createDevice(EDT_OPENGL,
        dimension2d<u32>(FINAL_SIZE.Width, FINAL_SIZE.Height), 16, false, true, true, &receiver);
    if (!device) return 1;

    IVideoDriver* driver = device->getVideoDriver();
    ISceneManager* smgr = device->getSceneManager();
    IGUIEnvironment* guienv = device->getGUIEnvironment();

    device->getCursorControl()->setVisible(false);

    // --- Camera / projection parameters ---
    // This is the FOV of *a single tile*
    const f32 individualFOV = 19.0f * core::DEGTORAD;
    // This is now the *total source FOV*
    const f32 totalHorizontalFOV = (f32)H_COUNT * individualFOV; // 10 * 19 = 190 degrees
    const f32 totalVerticalFOV = (f32)V_COUNT * individualFOV;   // 8 * 19 = 152 degrees
    const f32 zn = 0.1f;
    const f32 zf = 1000.0f;
    const u32 RENDER_COUNT = H_COUNT * V_COUNT; // Now 80

    // --- Pre-calculations for reprojection (Source) ---
    const f32 halfHFOV = totalHorizontalFOV * 0.5f; // 95 degrees (past 90, so tanf() below is negative)
    const f32 halfVFOV = totalVerticalFOV * 0.5f;   // 76 degrees
    const f32 tanHalfHFOV = tanf(halfHFOV);
    const f32 tanHalfVFOV = tanf(halfVFOV);
    const f32 finalWidth = (f32)FINAL_SIZE.Width;
    const f32 finalHeight = (f32)FINAL_SIZE.Height;

    // --- Variables for window scaling ---
    const f32 targetAspectRatio = finalWidth / finalHeight; // 640/512 = 1.25
    dimension2du currentWindowSize = device->getVideoDriver()->getScreenSize();

    array<ICameraSceneNode*> multiViewCameras;
    array<ITexture*> renderTextures;
    ITexture* finalCompositeTexture = 0;

    // --- Create cameras and render targets (now 80) ---
    for (u32 i = 0; i < RENDER_COUNT; ++i) {
        ICameraSceneNode* cam = smgr->addCameraSceneNode();
        cam->setNearValue(zn);
        cam->setFarValue(zf);
        stringw texName = L"RenderTex_";
        texName += i;
        ITexture* rt = driver->addRenderTargetTexture(RENDER_SIZE, texName.c_str(), ECF_A8R8G8B8);
        multiViewCameras.push_back(cam);
        renderTextures.push_back(rt);
    }

    // --- MODIFIED: Create work textures ONCE ---
    finalCompositeTexture = driver->addRenderTargetTexture(FINAL_SIZE, "finalComposite", ECF_A8R8G8B8);

    // --- Resources for the re-projected image ---
    IImage* remappedImage = driver->createImage(ECF_A8R8G8B8, FINAL_SIZE);
    ITexture* remappedTexture = driver->addTexture("remapped", remappedImage);

    // --- Scene (unchanged) ---
    ILightSceneNode* light = smgr->addLightSceneNode(0, vector3df(0, 20, 0), SColorf(1.0f, 1.0f, 1.0f), 500.0f);
    light->getLightData().Type = video::ELT_POINT;
    light->getLightData().DiffuseColor = SColorf(0.0f, 0.0f, 0.9f);
    light->getLightData().SpecularColor = SColorf(0.0f, 0.0f, 0.0f);
    light->getLightData().AmbientColor = SColorf(0.08f, 0.08f, 0.08f);
    // Lit materials: diffuse, ambient and specular colours plus shininess
    video::SMaterial redMaterial;
    redMaterial.Lighting = true;
    redMaterial.DiffuseColor = video::SColor(255, 255, 80, 80);
    redMaterial.AmbientColor = video::SColor(255, 100, 30, 30);
    redMaterial.SpecularColor = video::SColor(0, 255, 220, 220);
    redMaterial.Shininess = 35.0f;

    video::SMaterial greenMaterial;
    greenMaterial.Lighting = true;
    greenMaterial.DiffuseColor = video::SColor(255, 80, 255, 80);
    greenMaterial.AmbientColor = video::SColor(255, 30, 100, 30);
    greenMaterial.SpecularColor = video::SColor(0, 220, 255, 220);
    greenMaterial.Shininess = 35.0f;

    video::SMaterial blueMaterial;
    blueMaterial.Lighting = true;
    blueMaterial.DiffuseColor = video::SColor(255, 80, 80, 255);
    blueMaterial.AmbientColor = video::SColor(255, 30, 30, 100);
    blueMaterial.SpecularColor = video::SColor(0, 220, 220, 255);
    blueMaterial.Shininess = 35.0f;

    video::SMaterial yellowMaterial;
    yellowMaterial.Lighting = true;
    yellowMaterial.DiffuseColor = video::SColor(255, 255, 255, 120);
    yellowMaterial.AmbientColor = video::SColor(255, 120, 120, 40);
    yellowMaterial.SpecularColor = video::SColor(0, 255, 255, 220);
    yellowMaterial.Shininess = 30.0f;

    const int cubeCount = 3;
    const f32 spacing = 24.0f;
    for (int x = -cubeCount; x <= cubeCount; ++x) {
        for (int z = -cubeCount; z <= cubeCount; ++z) {
            IMeshSceneNode* cube = smgr->addCubeSceneNode(4.0f);
            cube->setPosition(vector3df(x * spacing, 4.0f, z * spacing));
            cube->setRotation(vector3df(x * 20.0f, z * 15.0f, x * 10.0f + z * 5.0f));
            if ((x + z) % 3 == 0) cube->getMaterial(0) = redMaterial;
            else if ((x + z) % 3 == 1) cube->getMaterial(0) = greenMaterial;
            else cube->getMaterial(0) = blueMaterial;
        }
    }
    IMeshSceneNode* floor = smgr->addCubeSceneNode(1.0f);
    floor->setScale(vector3df(60.0f, 1.0f, 60.0f));
    floor->setPosition(vector3df(0, -8.0f, 0));
    floor->getMaterial(0) = yellowMaterial;
    IMeshSceneNode* centerCube = smgr->addCubeSceneNode(6.0f);
    centerCube->setPosition(vector3df(0, 6.0f, 0));
    centerCube->getMaterial(0) = redMaterial;
    centerCube->getMaterial(0).Shininess = 0.0f;
    IMeshSceneNode* frontCube = smgr->addCubeSceneNode(5.0f);
    frontCube->setPosition(vector3df(0, 5.0f, 20.0f));
    frontCube->getMaterial(0) = blueMaterial;
    frontCube->getMaterial(0).Shininess = 0.0f;
    IMeshSceneNode* sideCube = smgr->addCubeSceneNode(5.0f);
    sideCube->setPosition(vector3df(20.0f, 5.0f, 0));
    sideCube->getMaterial(0) = greenMaterial;
    sideCube->getMaterial(0).Shininess = 0.0f;
    IMeshSceneNode* sphere1 = smgr->addSphereSceneNode(3.0f, 32);
    sphere1->setPosition(vector3df(-15.0f, 3.0f, -15.0f));
    sphere1->getMaterial(0) = blueMaterial;
    IMeshSceneNode* sphere2 = smgr->addSphereSceneNode(2.5f, 32);
    sphere2->setPosition(vector3df(15.0f, 3.0f, -15.0f));
    sphere2->getMaterial(0) = greenMaterial;
    // --- End of Scene ---

    // --- UI ---
    IGUIStaticText* infoText = guienv->addStaticText(
        L"Spherical Reprojection (CPU) | ESC to exit",
        rect<s32>(10, 10, FINAL_SIZE.Width - 10, 80),
        true, true, 0, -1, true
    );
    infoText->setOverrideColor(SColor(255, 255, 255, 255));

    bool running = true;
    u32 frameCount = 0;

    // --- Frustum pre-calculations (for 10x8) ---
    const f32 halfWidthTotal = tanf(totalHorizontalFOV * 0.5f) * zn;
    const f32 halfHeightTotal = tanf(totalVerticalFOV * 0.5f) * zn;
    const f32 tileW_near = (2.0f * halfWidthTotal) / (f32)H_COUNT;
    const f32 tileH_near = (2.0f * halfHeightTotal) / (f32)V_COUNT;


    // --- *** MODIFIED: PRE-CALCULATION OF REPROJECTION LUT (Over-render) *** ---

    // 1. Define the *destination* FOV (the view we want on screen)
    // This was the *original* FOV of 8x6 tiles.
    const f32 dest_HFOV = (8.0f * individualFOV); // 152 degrees
    const f32 dest_VFOV = (6.0f * individualFOV); // 114 degrees
    const f32 dest_halfHFOV = dest_HFOV * 0.5f;
    const f32 dest_halfVFOV = dest_VFOV * 0.5f;

    // 2. The *source* FOV is already available in the constants computed above
    //    (tanHalfHFOV and tanHalfVFOV) thanks to H_COUNT=10 and V_COUNT=8
    //    - totalHorizontalFOV is 190°
    //    - totalVerticalFOV is 152°

    const u32 destTotalPixels = FINAL_SIZE.Width * FINAL_SIZE.Height;
    const u32 sourcePitchInU32s = FINAL_SIZE.Width;

    g_reprojectionLUT = new ReprojectionCoord[destTotalPixels];

    u32 lut_idx = 0;
    for (u32 y = 0; y < FINAL_SIZE.Height; ++y) {
        for (u32 x = 0; x < FINAL_SIZE.Width; ++x, ++lut_idx) {

            // Normalized coordinate of the *destination* pixel (screen 640x512)
            const f32 nx = ( (f32)x / finalWidth ) * 2.0f - 1.0f;
            const f32 ny = ( (f32)y / finalHeight) * 2.0f - 1.0f;

            // 3. Map (nx, ny) to a spherical angle
            //    USING DESTINATION FOV (152° x 114°)
            const f32 longitude = nx * dest_halfHFOV;
            const f32 latitude  = ny * dest_halfVFOV;

            // Convert angle to 3D vector
            const f32 f_cos_lat = cosf(latitude);
            const f32 dirX = f_cos_lat * sinf(longitude);
            const f32 dirY = sinf(latitude);
            const f32 dirZ = f_cos_lat * cosf(longitude);

            if (dirZ > 0.0001f) {
                // 4. Convert 3D vector back to texture coordinate (u,v)
                //    USING SOURCE FOV (190° x 152°)
                const f32 u_norm = (dirX / dirZ) / tanHalfHFOV;
                const f32 v_norm = (dirY / dirZ) / tanHalfVFOV;

                // 5. Check if (u,v) is WITHIN the *source* texture
                if (u_norm >= -1.0f && u_norm <= 1.0f && v_norm >= -1.0f && v_norm <= 1.0f) {
                    const f32 u_source = (u_norm + 1.0f) * 0.5f * finalWidth;
                    const f32 v_source = (v_norm + 1.0f) * 0.5f * finalHeight;

                    s32 u_clamped = (s32)core::clamp(u_source, 0.0f, finalWidth - 1.0f);
                    s32 v_clamped = (s32)core::clamp(v_source, 0.0f, finalHeight - 1.0f);

                    g_reprojectionLUT[lut_idx].valid = true;
                    // Store linear offset instead of (u, v)
                    g_reprojectionLUT[lut_idx].offset = (u32)v_clamped * sourcePitchInU32s + (u32)u_clamped;
                } else {
                    g_reprojectionLUT[lut_idx].valid = false; // (As backup, just in case)
                }
            } else {
                g_reprojectionLUT[lut_idx].valid = false; // Behind camera (black)
            }
        }
    }
    // --- *** END OF LUT PRE-CALCULATION *** ---


    while (running && device->run()) {
        if (receiver.IsKeyDown(KEY_ESCAPE)) { running = false; break; }

        // --- Camera Setup ---
        f32 time = device->getTimer()->getTime() / 1000.0f;
        vector3df basePos(
            sin(time * 0.15f) * 28.0f,
            12.0f + sin(time * 0.3f) * 3.0f,
            cos(time * 0.15f) * 28.0f
        );
        vector3df baseTarget(0, 5, 0);
        for (u32 i = 0; i < multiViewCameras.size(); ++i) {
            multiViewCameras[i]->setPosition(basePos);
            multiViewCameras[i]->setTarget(baseTarget);
        }
        // Configure the 80 frustums
        for (u32 i = 0; i < multiViewCameras.size(); ++i) {
            u32 row = i / H_COUNT; // H_COUNT is 10
            u32 col = i % H_COUNT;
            f32 l = -halfWidthTotal + (f32)col * tileW_near;
            f32 r = l + tileW_near;
            f32 t = halfHeightTotal - (f32)row * tileH_near;
            f32 b = t - tileH_near;
            core::matrix4 proj = buildOffCenterPerspectiveLH(l, r, b, t, zn, zf);
            multiViewCameras[i]->setProjectionMatrix(proj);
        }

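        // Begin the frame (paired with endScene() at the end of the loop)
        driver->beginScene(true, true, SColor(255, 0, 0, 0));

        // --- PHASE 1: Render to each render target (80 renders) ---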
        for (u32 i = 0; i < multiViewCameras.size(); ++i) {
            if (!renderTextures[i]) continue;
            driver->setRenderTarget(renderTextures[i], true, true, SColor(255, 0, 0, 0));
            smgr->setActiveCamera(multiViewCameras[i]);
            smgr->drawAll();
        }

        // --- PHASE 2: Compose into finalCompositeTexture (640x512) ---
        driver->setRenderTarget(finalCompositeTexture, true, true, SColor(0, 0, 0, 0));
        const s32 tileW_int = FINAL_SIZE.Width / (s32)H_COUNT; // 640 / 10 = 64
        const s32 tileH_int = FINAL_SIZE.Height / (s32)V_COUNT; // 512 / 8 = 64
        for (u32 i = 0; i < renderTextures.size(); ++i) {
            if (!renderTextures[i]) continue;
            u32 col = i % H_COUNT;
            u32 row = i / H_COUNT;
            position2d<s32> destPos(col * tileW_int, row * tileH_int);
            rect<s32> srcRect(0, 0, (s32)renderTextures[i]->getSize().Width, (s32)renderTextures[i]->getSize().Height);
            driver->draw2DImage(renderTextures[i], destPos, srcRect, 0, SColor(255, 255, 255, 255), false);
        }

        // --- *** NEW PHASE 2.5: SUPER FAST REPROJECTION (CPU) + BLUR *** ---

        // 1. Copy texture from GPU (finalCompositeTexture) to RAM (sourceImage)
        IImage* sourceImage = driver->createImage(finalCompositeTexture, position2di(0, 0), FINAL_SIZE);

        if (sourceImage) {
            // 2. Lock both images for direct pointer access
            sourceImage->lock();
            remappedImage->lock();

            u32* sourcePtr = (u32*)sourceImage->getData();
            u32* destPtr = (u32*)remappedImage->getData();

            // 3. Reprojection loop using LUT (VERY FAST!)
            //    (destTotalPixels is now 640*512)
            for (u32 i = 0; i < destTotalPixels; ++i) {
                if (g_reprojectionLUT[i].valid) {
                    // Copy pixel from source to destination
                    destPtr[i] = sourcePtr[g_reprojectionLUT[i].offset];
                } else {
                    // Out of range pixel (black)
                    destPtr[i] = 0xFF000000; // A=255, R=0, G=0, B=0
                }
            }

            // 4. Unlock both images
            remappedImage->unlock();
            sourceImage->unlock();

            // 5. Free temporary image
            sourceImage->drop();

            // 6. Apply blur DIRECTLY to IImage in RAM (remappedImage)
            applyFastBloomBlurToImage(remappedImage, 8, 1, 0.9f);

            // 7. Update GPU texture efficiently
            void* textureData = remappedTexture->lock(video::ETLM_WRITE_ONLY);
            if (textureData) {
                remappedImage->lock(); // Lock for reading
                memcpy(textureData, remappedImage->getData(), remappedImage->getImageDataSizeInBytes());
                remappedImage->unlock();
                remappedTexture->unlock();
            }
        }

        // --- PHASE 3: DRAW FINAL TO SCREEN (MODIFIED FOR SCALING) ---
        driver->setRenderTarget(0, true, true, SColor(255, 0, 0, 0));

        currentWindowSize = driver->getScreenSize();
        const f32 windowWidth = (f32)currentWindowSize.Width;
        const f32 windowHeight = (f32)currentWindowSize.Height;

        f32 scaledWidth = windowHeight * targetAspectRatio;
        f32 scaledHeight = windowHeight;

        if (scaledWidth > windowWidth) {
            scaledWidth = windowWidth;
            scaledHeight = windowWidth / targetAspectRatio;
        }

        const s32 posX = (s32)((windowWidth - scaledWidth) * 0.5f);
        const s32 posY = (s32)((windowHeight - scaledHeight) * 0.5f);

        if (remappedTexture) {
            rect<s32> destRect(posX, posY, posX + (s32)scaledWidth, posY + (s32)scaledHeight);
            rect<s32> srcRect(0, 0, (s32)FINAL_SIZE.Width, (s32)FINAL_SIZE.Height);
            driver->draw2DImage(remappedTexture, destRect, srcRect, 0, 0, true);
        }

        // --- UI + end of frame (MODIFIED to reflect optimization) ---
        frameCount++;
        stringw statusText = L"Optimized: 10x8 Over-render + LUT + CPU Blur | FPS: ";
        statusText += intToString(driver->getFPS());
        statusText += L" | Renders: ";
        statusText += RENDER_COUNT;
        statusText += L" | Res: ";
        statusText += intToString(FINAL_SIZE.Width);
        statusText += L"x";
        statusText += intToString(FINAL_SIZE.Height);
        infoText->setText(statusText.c_str());
        guienv->drawAll();

        driver->endScene();
    }

    // --- Cleanup (MODIFIED) ---
    delete[] g_reprojectionLUT; // Free LUT memory
    g_reprojectionLUT = 0;

    BlurShutdown(); // Free blur buffers

    remappedImage->drop();

    device->drop();
    return 0;
}
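
For anyone following the LUT pre-calculation in the listing above, the per-pixel mapping can be pulled out into a small function with no Irrlicht dependency. This is only a sketch: `Mapped` and `mapDestToSource` are names made up for illustration, the field-of-view values are passed in as parameters rather than taken from the listing, and the sketch assumes a source half-angle below 90 degrees.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of the posted code): maps a normalized
// destination coordinate (nx, ny in [-1, 1]) to a normalized source
// coordinate, following the same steps as the LUT loop.
struct Mapped {
    bool valid;
    float u, v; // normalized source coordinates in [-1, 1] when valid
};

static Mapped mapDestToSource(float nx, float ny,
                              float destHalfHFov, float destHalfVFov,
                              float srcHalfHFov, float srcHalfVFov)
{
    // Assumption of this sketch: a planar source needs half-angles in (0, 90 deg).
    assert(srcHalfHFov > 0.0f && srcHalfHFov < 1.5707963f);
    assert(srcHalfVFov > 0.0f && srcHalfVFov < 1.5707963f);

    Mapped m = { false, 0.0f, 0.0f };

    // Destination pixel -> longitude/latitude.
    const float lon = nx * destHalfHFov;
    const float lat = ny * destHalfVFov;

    // Angles -> unit direction vector.
    const float cosLat = std::cos(lat);
    const float dx = cosLat * std::sin(lon);
    const float dy = std::sin(lat);
    const float dz = cosLat * std::cos(lon);
    if (dz <= 0.0001f) return m; // at or behind the image plane

    // Direction -> planar (perspective) source coordinate.
    const float u = (dx / dz) / std::tan(srcHalfHFov);
    const float v = (dy / dz) / std::tan(srcHalfVFov);
    if (u < -1.0f || u > 1.0f || v < -1.0f || v > 1.0f) return m;

    m.valid = true;
    m.u = u;
    m.v = v;
    return m;
}
```

Calling `mapDestToSource(0.0f, 0.0f, ...)` returns the source centre, and a destination edge maps to a valid source coordinate only if the source covers a wide enough angle. With a source half-angle above 90 degrees the tangent changes sign, which is why the sketch asserts on it.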

CuteAlien · Post by **CuteAlien** » Mon Nov 03, 2025 11:39 am

You render the scene 80 times per frame in your latest code?

Noiecity · Post by **Noiecity** » Mon Nov 03, 2025 2:28 pm

CuteAlien wrote: Mon Nov 03, 2025 11:39 am You render the scene 80 times per frame in your latest code?

No, I had to correct something caused by the correction. First, when joining the renders, I had geometric discontinuities between them, so I had to apply an off-center projection or something like that. Once I did that, the edges of the final result were distorted, so I applied a mathematical calculation similar to fisheye. That fixed it, but it produced a circular edge instead of a square one (a circle formed at the border of the joined final texture). I had to create more renders and apply zoom...
By the way, the renders of the two examples are optimized to take advantage of the L1 cache. If a higher resolution is used, it will go to the L2 cache (although there is no absolute guarantee that they will go to the L1 cache).
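
The off-center step can be illustrated without Irrlicht. The sketch below splits one wide frustum into a grid of sub-frustums on the near plane, in the same way as the frustum setup in the posted render loop. `TileBounds` and `tileBounds` are illustrative names, and the sketch assumes a total FOV below 180 degrees so that the tangent stays positive.

```cpp
#include <cassert>
#include <cmath>

// Illustrative type and function names, not taken from the posted code.
struct TileBounds { float l, r, b, t; };

// Near-plane bounds for tile (col, row) of a cols x rows grid that together
// cover a frustum with the given total horizontal/vertical FOV (radians).
static TileBounds tileBounds(int col, int row, int cols, int rows,
                             float totalHFov, float totalVFov, float zNear)
{
    assert(cols > 0 && rows > 0);
    assert(col >= 0 && col < cols && row >= 0 && row < rows);

    const float halfW = std::tan(totalHFov * 0.5f) * zNear;
    const float halfH = std::tan(totalVFov * 0.5f) * zNear;
    const float tileW = 2.0f * halfW / (float)cols;
    const float tileH = 2.0f * halfH / (float)rows;

    TileBounds tb;
    tb.l = -halfW + (float)col * tileW; // left edge grows with the column
    tb.r = tb.l + tileW;
    tb.t = halfH - (float)row * tileH;  // row 0 is the top row
    tb.b = tb.t - tileH;
    return tb;
}
```

Because the right edge of one tile is computed from the same values as the left edge of its neighbour, adjacent sub-frustums share their boundaries, which is what removes the seams between the joined renders.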

I'm redoing the models I promised... if I had $19 in PayPal, PayPal would charge me $5 for withdrawal and I could buy the SATA-to-USB connector, damn poverty.

This could be avoided if I could create a camera with certain parameters, but graphics APIs such as older versions of OpenGL or DirectX do not allow you to configure their default pipeline (according to ChatGPT and DeepSeek). That is, if the camera were 100% CPU (or GPU with OpenCL), I could do something like this:

Code: Select all

// HumanEyeProject and dependencies extracted from retro_undefinied_branchless.cpp

#include <cmath> // Only for basic math constants if needed

// Type definitions
typedef unsigned int u32;

// Branchless helper macros
#define U32(x) ((unsigned int)(x))
#define NZ32(u) ((U32(u) | (0u - U32(u))) >> 31)

// Float-bit manipulation utilities
static inline unsigned float_bits(float f) {
    union { float f; unsigned u; } v;
    v.f = f;
    return v.u;
}

static inline float bits_float(unsigned u) {
    union { unsigned u; float f; } v;
    v.u = u;
    return v.f;
}

// Fast inverse square root (Quake-style approximation)
static inline float fast_inv_sqrt(float number) {
    const float threehalfs = 1.5f;
    float x2 = number * 0.5f;
    float y = number;
    
    // Bit hack for initial approximation
    unsigned i = float_bits(y);
    i = 0x5f3759dfu - (i >> 1);
    y = bits_float(i);
    
    // One Newton-Raphson iteration for refinement
    y = y * (threehalfs - (x2 * y * y));
    return y;
}

// Branchless float >= 0 check returning mask (0xFFFFFFFF if true, 0 if false)
static inline unsigned float_ge_zero_mask(float f) {
    unsigned u = float_bits(f);
    unsigned sign = u >> 31; // 0 if >=0, 1 if <0
    return ~((unsigned)(-(int)sign)); // 0xFFFFFFFF if >=0, 0 if <0
}

/**
 * Human-eye-inspired projection with curved retina and cortical magnification
 * 
 * This function simulates human visual perception with:
 * - Curved retina projection (non-linear mapping)
 * - Cortical magnification (foveal emphasis)
 * - Physically-based depth calculation
 * 
 * @param rx, ry, rz: 3D point in camera space
 * @param camera_plane_distance: Distance to camera projection plane
 * @param focal_scale: Scaling factor for field of view
 * @param cortical_gamma: Cortical magnification factor (higher = more foveal emphasis)
 * @param center_x, center_y: Screen center coordinates for projection
 * @param out_sx, out_sy: Output screen coordinates
 * @param out_w: Output depth value (1/z for depth buffering)
 */
static inline void HumanEyeProject(
    float rx, float ry, float rz,
    float camera_plane_distance,
    float focal_scale,
    float cortical_gamma,
    int center_x, int center_y,
    float* out_sx,
    float* out_sy,
    float* out_w) {

    float nx = rx;
    float ny = ry;
    float nz = rz;

    // Perspective projection denominator with epsilon to avoid division by zero
    float denom = 1.0f - nz + 1e-9f;
    float inv_denom = 1.0f / denom;
    float sx = nx * inv_denom;
    float sy = ny * inv_denom;

    // Calculate squared radius from optical center with epsilon
    float r2 = sx * sx + sy * sy + 1e-12f;
    
    // Fast inverse square root for radius calculation
    float inv_r = fast_inv_sqrt(r2);
    float r = r2 * inv_r; // r = sqrt(r2) without direct sqrt()

    // Cortical magnification: scale decreases with eccentricity
    // scale_candidate = 1/(1 + gamma*r) - models foveal compression
    float tiny = 1e-12f;
    float scale_candidate = 1.0f / (1.0f + cortical_gamma * r + tiny);

    // Branchless scale selection: use scale_candidate if r >= eps, else 1.0
    const float eps = 1e-9f;
    unsigned mask_r_nonzero = float_ge_zero_mask(r - eps); // 0xFFFFFFFF if r >= eps
    unsigned bits_scale_candidate = float_bits(scale_candidate);
    unsigned bits_one = float_bits(1.0f);
    unsigned out_scale_bits = (bits_scale_candidate & mask_r_nonzero) | 
                             (bits_one & ~mask_r_nonzero);
    float scale = bits_float(out_scale_bits);

    // Apply cortical magnification and focal scaling
    float sx_m = sx * scale;
    float sy_m = sy * scale;

    // Project to screen coordinates (Y inverted for screen space)
    float screen_x = (float)center_x + focal_scale * sx_m;
    float screen_y = (float)center_y - focal_scale * sy_m;

    // Calculate effective depth for depth buffering
    float z_eff = camera_plane_distance - rz + 1e-9f;
    float inv_z_eff = 1.0f / z_eff;

    // Output results
    *out_sx = screen_x;
    *out_sy = screen_y;
    *out_w  = inv_z_eff;
}
//I include a clamp in case it is needed...
static inline float clamp01_branchless(float f) {
    unsigned u = float_bits(f);
    unsigned mask_ge0 = float_ge_zero_mask(f);
    unsigned mask_le1 = float_ge_zero_mask(1.0f - f);
    unsigned mask_inrange = mask_ge0 & mask_le1;
    unsigned mask_neg = ~mask_ge0;
    unsigned mask_gt1 = mask_ge0 & ~mask_le1;
    unsigned bits0 = float_bits(0.0f);
    unsigned bits1 = float_bits(1.0f);
    unsigned out_bits = (u & mask_inrange) | (bits0 & mask_neg) | (bits1 & mask_gt1);
    return bits_float(out_bits);
}

I would avoid having to render so many times...
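
One portability note on the bit helpers in the listing above: reading a union member other than the one last written is undefined behaviour in standard C++ (GCC documents it as a supported extension, which is probably why it works in practice). Below is a sketch of the same two helpers using `std::memcpy`, which is well defined and which mainstream compilers typically reduce to a plain register move. The `_safe` names and `select_float` are mine, for illustration only, and the code assumes a 32-bit `unsigned` and IEEE 754 floats.

```cpp
#include <cassert>
#include <cstring>

// Well-defined replacement for the union-based float_bits (illustrative name).
static inline unsigned float_bits_safe(float f) {
    assert(sizeof(unsigned) == sizeof(float)); // assumption: both are 32 bits
    unsigned u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Well-defined replacement for the union-based bits_float (illustrative name).
static inline float bits_float_safe(unsigned u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Branchless select: returns a where mask is all ones, b where mask is zero.
static inline float select_float(unsigned mask, float a, float b) {
    return bits_float_safe((float_bits_safe(a) & mask) |
                           (float_bits_safe(b) & ~mask));
}
```

`select_float(mask, scale_candidate, 1.0f)` would express the same mask-based choice that `HumanEyeProject` performs by hand on the raw bits.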

I have a complete example, but I can't compile it in CodeBlocks, only in Dev C++ 4.9.9.2 on Windows 7 (I tried to compile it on Windows 10 and got an error). However, the executables created with it are compatible with Windows 10... I was planning to create another software renderer, but I gave up when I saw the huge number of functions I had to translate. Besides, I also had to create a version for Linux using X11 (if you want, I can give you a link so you can see it).

CuteAlien · Post by **CuteAlien** » Mon Nov 03, 2025 4:09 pm

Yeah, I'm not understanding whatever AI said here. It can spin out code fast. I can't keep up reading this. But rendering scene 80 times per frame will be slow. And adding micro-optimizations won't change this as long as the elephant is in the room. Rendering 80 times per frame is not a correction - that's giving up on real-time :-)

If you work with AI anyway - why not ask it to write shaders instead?

Noiecity · Post by **Noiecity** » Mon Nov 03, 2025 4:29 pm

CuteAlien wrote: Mon Nov 03, 2025 4:09 pm Yeah, I'm not understanding whatever AI said here. It can spin out code fast. I can't keep up reading this. But rendering scene 80 times per frame will be slow. And adding micro-optimizations won't change this as long as the elephant is in the room. Rendering 80 times per frame is not a correction - that's giving up on real-time

If you work with AI anyway - why not ask it to write shaders instead?

Because I specified in the first prompt that it shouldn't, since I want to understand what I'm doing, so I asked it to do it with whatever it had in its pipeline. Besides, I do it for compatibility reasons. On older hardware, shaders may not have any effect on the GPU, or they may have a completely unexpected effect. This is unlikely, but it can happen, since those GPUs are configured to run applications with changes, such as another proprietary bilinear filter, normal maps, etc. (although these changes can currently be disabled on not-so-old GPUs from their own software). There is also an energy issue: older GPUs do not seem to apply optimizations correctly, because they consume a lot of electricity or get very hot.

Similarly, on modern GPUs, rendering many times in small quantities may be optimal, as they are better equipped to work on tasks in parallel, generally having more cores than a CPU. However, if the GPU performs its own optimizations, it may worsen the result in this specific case rather than improving it.

Similarly, a CPU can compete with the GPU in terms of speed when it comes to executing a single task on a single thread, on a single core. For example, the ALU works directly on the CPU's internal registers, and if your CPU supports vectorization (as ARM does), it can perform four calculations in a single clock cycle (modern GPUs can also do this). The same work would normally take the GPU more cycles, as it is optimized for heavier, parallel loads.
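
The "four calculations at once" idea can be sketched as follows. This is a minimal illustration, not code from the thread: the helper names `add4` and `addArrays` are hypothetical, and the sketch only shows one instruction operating on four floats. It makes no claim about actual cycle counts on any particular CPU or GPU.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
// x86 SSE: one 128-bit register holds four floats.
void add4(const float* a, const float* b, float* out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
// ARM NEON: the same idea with a float32x4_t register.
void add4(const float* a, const float* b, float* out) {
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
}
#else
// Portable fallback: four scalar additions.
void add4(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
}
#endif

// Adds two arrays four lanes at a time; count must be a multiple of 4.
void addArrays(const float* a, const float* b, float* out, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        add4(a + i, b + i, out + i);
    }
}
```

Calling `addArrays(a, b, out, 8)` on two 8-element arrays performs two four-wide additions on platforms where SSE or NEON is available, and eight scalar additions otherwise.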

However, relying on internal registers is only good practice if you control, or can predict with certainty, when those registers will be available. Otherwise, if they are not available, the values enter a kind of queue where they are transferred to memory until the registers can be used. This is costly in terms of CPU clock cycles... In other words, it is good practice if you create your own kind of PS4 with a development board, for example...

Ideally, you would use the GPU in conjunction with OpenCL to have more control over what is calculated, but not all graphics cards support OpenCL.

This development board, for example, supports vectorization using internal CPU registers, i.e., performing more calculations using ALU in a single clock cycle, and supports OpenCL:
https://www.aliexpress.com/item/1005008645669387.html

CuteAlien · Post by **CuteAlien** » Mon Nov 03, 2025 7:40 pm

Calculations are one problem. But you have another fat problem, which is transferring memory from/to the graphics card. So it doesn't matter too much how fast any processor is if the images do some round-trips.
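
To put a rough number on the round-trip cost, here is a back-of-the-envelope sketch. It assumes every render pass is copied once between GPU and system memory, uses the 256x256 render size and the 80 passes per frame mentioned in this thread, and assumes 4 bytes per pixel and a 60 fps target. The function name `readbackBytesPerSecond` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved per second if every render pass is copied once between
// GPU and system memory. All parameter values are illustrative assumptions.
std::uint64_t readbackBytesPerSecond(std::uint32_t width,
                                     std::uint32_t height,
                                     std::uint32_t bytesPerPixel,
                                     std::uint32_t passesPerFrame,
                                     std::uint32_t framesPerSecond) {
    assert(bytesPerPixel > 0);
    const std::uint64_t bytesPerPass =
        static_cast<std::uint64_t>(width) * height * bytesPerPixel;
    return bytesPerPass * passesPerFrame * framesPerSecond;
}
```

`readbackBytesPerSecond(256, 256, 4, 80, 60)` returns 1,258,291,200, roughly 1.26 GB/s in one direction. Raw bandwidth is likely only part of the cost, since each read-back can also make the CPU and GPU wait on each other.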

As for the 80 cameras... are you trying to approximate some kind of spherical render that way? A better (still expensive) solution for that: use 6 cameras (one per direction), then go over the final image to do the spherical projection. But maybe I'm guessing wrong. I mean, this post was about an isometric engine, which usually is a mixture of clever art and maybe some camera panning (wasn't in Irrlicht 1.8, but it is supported by Irrlicht trunk and wouldn't be too much code to backport).
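
The six-camera idea can be sketched as a lookup from each output pixel to one of six face images. This is only an illustration under stated assumptions: it takes "spherical projection" to mean an equirectangular (longitude/latitude) output, the per-face orientation of `u` and `v` is one possible convention that would have to match how the six cameras are actually set up, and all names here are hypothetical rather than Irrlicht API.

```cpp
#include <cassert>
#include <cmath>

enum CubeFace { PosX, NegX, PosY, NegY, PosZ, NegZ };

struct CubeSample {
    CubeFace face;
    float u; // 0..1 across the face image
    float v; // 0..1 down the face image
};

// Picks the face whose axis dominates the direction, then projects the
// other two components onto that face.
CubeSample directionToCubeFace(float x, float y, float z) {
    const float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    assert(ax + ay + az > 0.0f);
    CubeFace face;
    float sc, tc, ma;
    if (ax >= ay && ax >= az) {
        ma = ax;
        if (x > 0) { face = PosX; sc = -z; tc = -y; }
        else       { face = NegX; sc =  z; tc = -y; }
    } else if (ay >= az) {
        ma = ay;
        if (y > 0) { face = PosY; sc = x; tc =  z; }
        else       { face = NegY; sc = x; tc = -z; }
    } else {
        ma = az;
        if (z > 0) { face = PosZ; sc =  x; tc = -y; }
        else       { face = NegZ; sc = -x; tc = -y; }
    }
    return { face, 0.5f * (sc / ma + 1.0f), 0.5f * (tc / ma + 1.0f) };
}

// Maps a pixel of the final panorama (px, py in 0..1) to a cube sample.
// The centre of the panorama looks down +Z.
CubeSample sampleForPanoramaPixel(float px, float py) {
    const float kPi = 3.14159265358979f;
    const float lon = (px - 0.5f) * 2.0f * kPi; // -pi..pi
    const float lat = (0.5f - py) * kPi;        // +pi/2 (top)..-pi/2
    const float x = std::cos(lat) * std::sin(lon);
    const float y = std::sin(lat);
    const float z = std::cos(lat) * std::cos(lon);
    return directionToCubeFace(x, y, z);
}
```

For each pixel of the final image, `sampleForPanoramaPixel` says which of the six renders to read and where. The scene is then rendered 6 times per frame instead of 80, at the cost of one extra pass over the output image.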

Sorry, you posted code several times and each time it was completely different, so you lost me here. Despite my name I'm only human :-) (the secret is out!)

Noiecity · Post by **Noiecity** » Mon Nov 03, 2025 7:50 pm

CuteAlien wrote: Mon Nov 03, 2025 7:40 pm Calculations are one problem. But you have another fat problem, which is transferring memory from/to the graphics card. So it doesn't matter too much how fast any processor is if the images do some round-trips.

As for the 80 cameras... are you trying to approximate some kind of spherical render that way? A better (still expensive) solution for that: use 6 cameras (one per direction), then go over the final image to do the spherical projection. But maybe I'm guessing wrong. I mean, this post was about an isometric engine, which usually is a mixture of clever art and maybe some camera panning (wasn't in Irrlicht 1.8, but it is supported by Irrlicht trunk and wouldn't be too much code to backport).

Sorry, you posted code several times and each time it was completely different, so you lost me here. Despite my name I'm only human (the secret is out!)

Oh no, I thought you were actually a bunch of sun stars in space. Well, as for why so many cameras, the answer is short and simple: I wanted to cover the entire range of vision of the human eye, about 150 degrees horizontally and 120 degrees vertically, or something like that. However, using a wide field of view increases distortion and causes a lot of information to be lost. In my method, I try to reduce that loss by rendering with a smaller field of view per camera and then reconstructing the full image at the end... basically a better algorithm than a single camera with Panini projection. In fact, the version without reconstruction is quite fast. Maybe I should settle for that?
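
The trade-off described above can be put into numbers with a small sketch. It assumes a standard rectilinear (perspective) camera, where a point at angle t off-axis lands at tan(t) on the image, so the radial stretch relative to the centre is 1 / cos^2(t). The function names and the 15-degree tile size are hypothetical; the posted code may divide the view differently.

```cpp
#include <cassert>
#include <cmath>

// Radial stretch of a rectilinear image at the edge of the view relative
// to its centre: d(tan t)/dt = 1 / cos^2(t), evaluated at t = fov / 2.
double edgeStretch(double fovDegrees) {
    assert(fovDegrees > 0.0 && fovDegrees < 180.0);
    const double kPi = 3.14159265358979323846;
    const double half = 0.5 * fovDegrees * kPi / 180.0;
    const double c = std::cos(half);
    return 1.0 / (c * c);
}

// Number of narrow views needed to span a field of view, ignoring overlap.
int tilesNeeded(double totalDegrees, double tileDegrees) {
    assert(tileDegrees > 0.0);
    return static_cast<int>(std::ceil(totalDegrees / tileDegrees));
}
```

A single 150-degree camera stretches the image edge by a factor of about 15 compared with the centre, while a 15-degree camera stretches it by less than 2 percent. Covering 150 by 120 degrees with 15-degree tiles takes `tilesNeeded(150, 15) * tilesNeeded(120, 15)` = 10 x 8 = 80 views, which happens to match the camera count discussed here. Splitting by equal angle steps on both axes is a simplification, since tiles on a sphere overlap more away from the horizon.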
