Optimized and Extended Mandelbrot Set Program

In our previous Mandelbrot Set Program we implemented a Perturbation Theory algorithm that allowed us to compute images faster using most IEEE 754 double precision floats. We can make a few tweaks to our program to make it even faster using a number of optimizations.

The first optimization is a simple one. We notice that mpreal objects are 32 bytes. In numerous places we pass mpreals as function arguments by value. If we change those function arguments to by reference, we can save some time when we call functions. For example the ULP() function is called frequently so we can change it to:

        inline mpreal ULP(const mpreal &x) {
            mpreal next = nextabove(x);             // Smallest representable number greater than x
            mpreal ulp = next - x;
            return ulp;
        }

Another function that is called a lot is the norm_squared() function. It takes a complex<mpreal> which is at least twice as large as a single mpreal so passing by reference speeds up calls to this function.

template<typename T>
inline T norm_squared(const complex<T>& z) {
    return z.real() * z.real() + z.imag() * z.imag();
}

There are a number of other functions that can be modified using this technique to speed up function calls.

Notice that we pass a const reference so that the functions can't change the original variable that are referenced by the arguments. This is good practice to avoid any hard to trace side-effects.

Another technique to speed up our compute_mandelbrot() function (which is were a lot of the work is done) is to pre-allocate the large mpreal objects outside of loops instead of creating them over and over again inside loops. Normally we would prefer to allocate variables closer to where they are used, but mpreals are costly to create because they make a big heap allocation to store all the digits of the number. This allocation can be reused when the mpreal is given a new value. By moving these variable declarations to before the loops we can avoid costly allocations with mpreals as well as with complex<double> variables.

    double_type cell_x_min, cell_x_max, cell_y_min, cell_y_max;
    double_type x0, y0;
    complex<double_type> center, point, dc_full, ref;
    complex<double> ref_z, dc, dz;

With the mpreals we just assign new values to the variables to reuse them. With complex<double> we use the real() and imag() methods to set the components of an existing complex<double>.

            cell_x_min = x_min + i * cell_width;
            cell_x_max = x_min + i * cell_width + cell_width;
            cell_y_min = y_max - j * cell_width - cell_width;
            cell_y_max = y_max - j * cell_width;
            center.real((cell_x_max + cell_x_min) / 2.);
            center.imag((cell_y_max + cell_y_min) / 2.);

Another optimization is to precompute the double precision version of the orbit points. Even though the orbit is just computed once for each cell, the orbit points of type complex<mpreal> are converted to complex<double> over and over during the computations for each point in the cell. If we just compute those complex<double> values once and store them in a vector<complex<double>>, we can avoid that computational overhead.

            vector<complex<double>> reference_as_double;
            for (vector<complex<double_type>>::size_type c = 0; c <= max_ref_iteration; ++c) {
                ref = reference[c];
                reference_as_double.emplace_back(
                    static_cast<double>(ref.real()),
                    static_cast<double>(ref.imag()));
            }

Using emplace_back() to create the complex<> elements of the vector in place is another optimization technique. It avoids creating, then passing a complex<> as would be the case if we used push_back() instead.

Complete mandelbrot_threaded_color_mp_fast_opt.cpp
/**
* @file mandelbrot_threaded_color_mp_fast_opt.cpp
* @brief Program to compute the Mandelbrot Set.
* 
* This program can compute images of the Mandelbrot Set in its
* entirety or close ups of boundary areas.  It will use as many
* CPUs as are available to speed up the rendering.  This program
* will use extended precision floating point.  The Gnu Multi-Precision
* library is required if hardware extended precision FP is not
* available.  The algorithm used is based on perturbation theory
* so that most computations can be done with doubles.
*
* It also allows two RGB colors to be specified to define the
* coloring of the set based on the escape value of a pixel.
* Colors can be specified using #xxx or #xxxxxx hexadecimal.
*
* Requires:
* Linux<br>
* get cli11.hpp at https://github.com/CLIUtils/CLI11/releases
* get mpreal.h at https://github.com/advanpix/mpreal.git
* sudo apt install libgmp-dev<br>
* sudo apt install libmpfr-dev<br>
*
* macOS<br>
* brew install cli11<br>
* brew install gmp<br>
* brew install mpfr<br>
* get mpreal.h at https://github.com/advanpix/mpreal.git
*
* Build:
* Linux<br>
* put CLI.hpp in /usr/local/include <br>
* put mpreal.h in /usr/local/include <br>
*
* macOS<br>
* put mpreal.h in /opt/homebrew/include

* Add to .zshrc<br>
* export CPPFLAGS="-I/opt/homebrew/include"<br>
* export LDFLAGS="-L/opt/homebrew/lib"
*
* Compile with<br>
* Linux<br>
* If 128-bit doubles are available on your platform <br>
* g++ --std=gnu++20 -o mandelbrot_threaded_color_mp_fast_opt mandelbrot_threaded_color_mp_fast_opt.cpp -lquadmath

* to use GMP<br>
* g++ --std=gnu++20 -o mandelbrot_threaded_color_mp_fast_opt mandelbrot_threaded_color_mp_fast_opt.cpp -lmpfr -lgmp

* macOS<br>
* g++ $CPPFLAGS $LDFLAGS --std=gnu++20 -o mandelbrot_threaded_color_mp_fast_opt mandelbrot_threaded_color_mp_fast_opt.cpp -lmpfr -lgmp
*
* Execute:
* mandelbrot_threaded_color_mp_fast_opt real imaginary radius [--filename=basename]
*     [--image_size=pixels] [--max_iterations=count]
*     [--low_color=#xxx] [--high_color=#xxx] [--rainbow]
*     [--decimal_digits=n] [--cycle_detection]
*
* Test multi-precision with command <br>
* ./mandelbrot_threaded_color_mp_fast 0.3663629834227643413289327308 0.59153377326144522757749349375 1e-27 -m 100000 -d 30
*
* @author Richard Lesh
* @date 2025-03-12
* @version 1.0
*/

#include <bit>
#include <CLI/CLI.hpp>
#include <cassert>
#include <complex>
#include <fstream>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>
#include "Color.hpp"

#define MAX_PIXEL_LEVELS 256            // 256 for 8-bit
bool USE_COLOR = true;                  // True to allow colors to be used
bool USE_RAINBOW = false;               // True to use rainbow colors
#define USE_EXTENDED_FP                 // Define to use extended precision floating point
#define FORCE_MPREAL                    // Force use of GMP MPREALs

#if MAX_PIXEL_LEVELS > 256
    #define PIXEL_TYPE uint16_t
#else
    #define PIXEL_TYPE uint8_t
#endif

int DIGITS = 16;                                    // Decimal digits of accuracy
#ifdef USE_EXTENDED_FP
    #if defined __FLOAT128__ && !defined FORCE_MPREAL
        typedef __float128 double_type;
    #elif defined(__SIZEOF_LONG_DOUBLE__) && (__SIZEOF_LONG_DOUBLE__ > 8) && !defined FORCE_MPREAL
        #define LONG_DOUBLE
        typedef long double double_type;
    #else
        #define HAS_MPREAL
        #include "mpreal.h"                         // MPFR C++ Wrapper
        using namespace mpfr;
        #define PRECISION int(ceil(DIGITS * 3.3))   // Precision in bits (~3.3 bits per decimal digit)
        #define DOUBLE_DIGITS DIGITS
        typedef mpreal double_type;
        inline mpreal ULP(const mpreal &x) {
            mpreal next = nextabove(x);             // Smallest representable number greater than x
            mpreal ulp = next - x;
            return ulp;
        }
    #endif
#else
    typedef double double_type;
#endif
#ifndef HAS_MPREAL
    #define DOUBLE_DIGITS numeric_limits<double_type>::digits10
    #define PRECISION numeric_limits<double_type>::digits
/**
 * @brief Computes the Unit in the Last Place (ULP) of a double-precision value.
 *
 * This function calculates the distance between the given value @p x and the next
 * representable double greater than @p x in the direction of positive infinity.
 * The result gives a measure of the precision at the value @p x.
 *
 * @param x The input double value.
 * @return The ULP of @p x (i.e., the difference to the next larger representable double).
 *
 * @note If @p x is NaN or infinity, the result may be undefined.
 */
    inline double_type ULP(const double_type &x) {
        double_type next = nextafter(x, numeric_limits<double_type>::infinity());
        double_type ulp = next - x;
        return ulp;
    }
#endif

using namespace std;

typedef Color<PIXEL_TYPE> RGB;

constexpr std::string::size_type operator"" _sz(unsigned long long n) {
    return static_cast<size_t>(n);
}

// Array used to store the image in a thread safe way as an array of arrays
// Each element of the array will be a pointer to an array containing the
// pixels for a row of the image.  Rows will be computed by threads.
unique_ptr<unique_ptr<int32_t[]>[]> image = nullptr;
// Atomic int used to assign tasks (row to compute) to the threads
atomic_int counter{0};

/**
 * @brief Computes the squared magnitude (norm) of a complex number without
 * taking the square root.
 *
 * This function returns the sum of the squares of the real and imaginary parts
 * of the complex number @p z.  It avoids the overhead of computing a square
 * root, making it useful for performance-sensitive contexts like Mandelbrot
 * set calculations where only relative magnitude comparisons are needed.
 *
 * @tparam T The numeric type (e.g., float, double, mpreal).
 * @param z The complex number whose squared magnitude is to be computed.
 * @return The squared magnitude of @p z, equivalent to |z|^2.
 *
 * @see norm may be used as a standard alternative, but may perform
 * additional checks.
 */
template<typename T>
inline T norm_squared(const complex<T>& z) {
    return z.real() * z.real() + z.imag() * z.imag();
}

// Number of pixels that escape
atomic_int num_escaped{0};
// Number of cycles detected
atomic_int num_cycles{0};
// Number of pixels that reach max. iterations
atomic_int num_max_iter{0};

/**
* @brief Compute the number of iterations to escape.
*
* Function computes the number of iterations of the formula
* z(n+1) = z(n)^2 + c that can be executed before the value of |z|
* exceeds the escape limit radius of 2.0.
*
* @param c Complex point used to perform the calculation.
* @param max_iteration Number of iterations, if exceeded, when we will declare
*                      as not escaping and therefore in the Mandelbrot set.
* @return vector of complex points representing the orbit of c.
*/

vector<complex<double_type>>
compute_escape(const complex<double_type> &c, int max_iteration) {
    int iteration = 0;                  // Number of iterations before escaping
    complex<double_type> z{0., 0.};
    vector<complex<double_type>> reference;
    reference.push_back(z);

    for (int i = 0; i < max_iteration && norm_squared(z) < 1e20; ++i) {
        z *= z;
        z += c;
        reference.emplace_back(z.real(), z.imag());
        ++iteration;
    }

    return reference;
}

// Convert double to its bitwise representation as int64_t
inline int64_t toULP(double x) {
    auto i = std::bit_cast<int64_t>(x);
    // adjust negative ones so that the resulting int64_t sort the same as the doubles.
    return i < 0 ? numeric_limits<int64_t>::min() - i : i;
}

struct QuantizedComplex {
    int64_t ulp_real;
    int64_t ulp_imag;
    int64_t d_ulp_real;
    int64_t d_ulp_imag;

    QuantizedComplex(const complex<double> &x, const complex<double> &dx, int ulp_bin_shift = 2) {
        // Quantize by grouping into bins of ulp_bin ULPs
        ulp_real = toULP(x.real()) >> ulp_bin_shift;
        ulp_imag = toULP(x.imag()) >> ulp_bin_shift;
        d_ulp_real = toULP(dx.real()) >> ulp_bin_shift;
        d_ulp_imag = toULP(dx.imag()) >> ulp_bin_shift;
    }

    bool operator==(const QuantizedComplex& other) const {
        return ulp_real == other.ulp_real && ulp_imag == other.ulp_imag &&
            d_ulp_real == other.d_ulp_real && d_ulp_imag == other.d_ulp_imag;
    }

    static bool ulp_diff(double a, double b, int max_ulps) {
        return std::abs(toULP(a) - toULP(b)) <= max_ulps;
    }
};

// Hash based on quantized ULPs
template <>
struct std::hash<QuantizedComplex> {
    size_t operator()(const QuantizedComplex& q) const noexcept {
        return q.ulp_real ^ q.ulp_imag ^ q.d_ulp_real ^ q.d_ulp_imag;
    }
};

/**
* @brief Compute a portion of the Mandelbrot Set.
* 
* Function computes a portion of the Mandelbrot Set row by row.
* The atomic integer counter is used to get the row to compute.
* The row is stored in the image array.
*
* @param cell_size size of the computation cell in pixels
* @param x_min Minimum real axis bound to plot.
* @param x_max Maximum real axis bound to plot.
* @param y_min Minimum imaginary axis bound to plot.
* @param y_max Maximum imaginary axis bound to plot.
* @param max_iteration Limit to the number of iterations we will compute per point.
* @param image_size Size, in pixels, of the image to generate.
* @param use_cycle_detection true to use cycle detection algorithm.  Impacts
*                           performance negatively if no cycles present.
*/

void compute_mandelbrot(int cell_size, const double_type& x_min, const double_type& x_max, const double_type& y_min,
    const double_type& y_max, int max_iteration, int image_size, bool use_cycle_detection = false) {
#ifdef HAS_MPREAL
    mpreal::set_default_prec(PRECISION);      // Set default precision in bits for each thread
#endif
    int num_cells = image_size / cell_size;
    double_type width = x_max - x_min;
    double_type height = y_max - y_min;
    int modulus = ceil(num_cells * num_cells / 40.);
    int cycle_detection_limit = 10000;
    double_type cell_width = (x_max - x_min) / num_cells;
    double_type cell_height = (y_max - y_min) / num_cells;
    double_type cell_x_min, cell_x_max, cell_y_min, cell_y_max;
    double_type x0, y0;
    complex<double_type> center, point, dc_full, ref;
    complex<double> ref_z, dc, dz;
    while (true) {
        int cell_num = counter.fetch_add(1);
        if (cell_num < num_cells * num_cells) {
            int i = cell_num % num_cells;
            int j = cell_num / num_cells;
            cell_x_min = x_min + i * cell_width;
            cell_x_max = x_min + i * cell_width + cell_width;
            cell_y_min = y_max - j * cell_width - cell_width;
            cell_y_max = y_max - j * cell_width;
            center.real((cell_x_max + cell_x_min) / 2.);
            center.imag((cell_y_max + cell_y_min) / 2.);
            vector<complex<double_type>> reference = compute_escape(center, max_iteration);
            vector<double_type>::size_type max_ref_iteration = reference.size() - 1;
            vector<complex<double>> reference_as_double;
            for (vector<complex<double_type>>::size_type c = 0; c <= max_ref_iteration; ++c) {
                ref = reference[c];
                reference_as_double.emplace_back(
                    static_cast<double>(ref.real()),
                    static_cast<double>(ref.imag()));
            }
            assert(max_ref_iteration > 1);
            bool enable_cycle_detection = false;
            for (int i0 = 0; i0 < cell_size; ++i0) {
                for (int j0 = 0; j0 < cell_size; ++j0) {
                    y0 = j0 + 0.5;
                    y0 *= cell_height;
                    y0 /= cell_size;
                    mpfr::negate(y0);
                    y0 += cell_y_max;
                    x0 = i0 + 0.5;
                    x0 *= cell_width;
                    x0 /= cell_size;
                    x0 += cell_x_min;
                    point.real(x0);
                    point.imag(y0);
                    dc_full = point - center;
                    dc.real(static_cast<double>(dc_full.real()));
                    dc.imag(static_cast<double>(dc_full.imag()));
                    int iteration = 0;
                    vector<complex<double_type>>::size_type ref_iteration = 0;
                    dz.real(0.);
                    dz.imag(0.);
                    ref_z = reference_as_double[ref_iteration];
                    std::unordered_set<QuantizedComplex> seen;
                    bool cycle_found = false;
                    while (iteration < max_iteration) {
                        dz = 2. * dz * ref_z + dz * dz + dc;
                        ++ref_iteration;
                        ref_z = reference_as_double[ref_iteration];
                        complex<double> z = ref_z + dz;
                        double norm = norm_squared(z);
                        if (norm > 4.0) break;
                        if (enable_cycle_detection && iteration > cycle_detection_limit) {
                            QuantizedComplex qz(ref_z, dz);
                            if (seen.count(qz)) {
                                cycle_found = true;
                                break;
                            } else {
                                seen.insert(qz);
                            }
                        }
                        if (norm < norm_squared(dz) || ref_iteration == max_ref_iteration) {
                            dz = z;
                            ref_iteration = 0;
                            ref_z = reference_as_double[ref_iteration];
                        }
                        ++iteration;
                    }
                    if (cycle_found) {
                        ++num_cycles;
                        cycle_detection_limit = static_cast<int>(0.9 * iteration);
                       iteration = 0;
                    } else if (iteration == max_iteration) {
                        ++num_max_iter;
                        enable_cycle_detection = use_cycle_detection;
                        cycle_detection_limit = static_cast<int>(0.9 * max_iteration);
                    } else {
                        ++num_escaped;
                        if (cycle_detection_limit < iteration) {
                            cycle_detection_limit = static_cast<int>(1.1 * iteration);
                        }
                    }
                    image[j * cell_size + j0][i * cell_size + i0] = iteration == max_iteration ? 0 : iteration;
                }
            }
            if (cell_num % modulus == 0) cout << '.' << flush;
        } else {
            break;
        }
    }
}

/**
* @brief Write out a PGM/PPM file.
* 
* Function writes out image data to a PGM or PPM format file.  Can be an 
* 8-bit/16-bit PGM or an 8-bit/16-bit PPM.  This function normalizes the
* level values using a histogram.
* 
* @param filename Pathspec to the file that will be created.
* @param max_iteration Limit to the number of iterations we used to compute each point.
* @param image_size Size, in pixels, of the image to save.
* @param color1 First color to use for near 1 iteration before escaping.
* @param color2 Second color to use for near max_iterations iterations before escaping.
* @param use_rainbow true to use rainbow coloring.
*/
void output_file(const string &filename, int max_iteration, int image_size, RGB color1, RGB color2, bool use_rainbow) {
    string ext = ".pgm";
    if (USE_COLOR) ext = ".ppm";

    cout << endl << "Creating " << (filename + ext) << "..." << endl;
    ofstream out_file((filename + ext).c_str(), ios::binary);
    if (!out_file) {                    // Check if file opened successfully
        cerr << "Error opening file " << (filename + ext) << " for writing!" << endl;
        exit(1);
    }
    if (USE_COLOR) {
        // File header is P6 for binary pix map, x_size y_size, max pixel value per RGB component
        out_file << "P6\n" << image_size << " " << image_size << "\n" << (MAX_PIXEL_LEVELS - 1) << "\n";
    } else {
        // File header is P5 for binary gray map, x_size y_size, max pixel value
        out_file << "P5\n" << image_size << " " << image_size << "\n" << (MAX_PIXEL_LEVELS - 1) << "\n";
    }

    // Find the maximum iteration count
    int max = 0;
    for (int j = 0; j < image_size; ++j) {
        for (int i = 0; i < image_size; ++i) {
            if (image[j][i] > max) max = image[j][i];
        }
    }
    max_iteration = max + 1;
    // Compute bucket size for histogram
    int max_iteration_bits = static_cast<int>(ceil(log(max_iteration)/log(2.0)));
    int bucket_shift_bits = max_iteration_bits > 20 ? max_iteration_bits - 20 : 0;
    vector<int>::size_type histogram_size = (max_iteration >> bucket_shift_bits) + 1;
    // Compute histogram
    vector<int> histogram(histogram_size, 0);
    int totalCount = 0;
    for (int j = 0; j < image_size; ++j) {
        for (int i = 0; i < image_size; ++i) {
            if (image[j][i] > 0) {
                ++histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)];
                ++totalCount;
            }
        }
    }
    // Compute cumulative histogram
    int count = 0;
    for (vector<int>::size_type i = 1; i < histogram_size; ++i) {
        count += histogram[i];
        histogram[i] = count;
    }
    // Normalize using histogram
    for (int j = 0; j < image_size; ++j) {
        for (int i = 0; i < image_size; ++i) {
            if (USE_COLOR) {
                RGB rgb(0,0,0);
                if (image[j][i] != 0) {
                    if (use_rainbow)
                        rgb = level_2_RGB<PIXEL_TYPE>(
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1));
                    else
                        rgb = blend_colors(
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1), color1, color2);
                }
                if constexpr (sizeof(PIXEL_TYPE) == 2) {
                    out_file.write(reinterpret_cast<const char*>(&rgb.R) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.R), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.G) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.G), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.B) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.B), 1);
                } else {
                    out_file.write(reinterpret_cast<const char*>(&rgb.R), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.G), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.B), 1);
                }
            } else {
                if constexpr (sizeof(PIXEL_TYPE) == 2) {
                    PIXEL_TYPE level = 0;
                    if (image[j][i] != 0)
                        level = static_cast<uint8_t>(color1.R +
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1) * (color2.R - color1.R + 1));
                    out_file.write(reinterpret_cast<const char*>(&level) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&level), 1);
                } else {
                    PIXEL_TYPE level = 0;
                    if (image[j][i] != 0)
                        level = static_cast<uint8_t>(color1.R +
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1) * (color2.R - color1.R + 1));
                    out_file.write(reinterpret_cast<const char*>(&level), 1);
                }
            }
        }
    }
    out_file.close();
}

/**
* @brief Main function.
* 
* The entry point of the program.
*
* Syntax: mandelbrot_threaded_color_mp_fast_opt real imaginary radius [--filename=basename]
*     [--image_size=n] [--max_iterations=n] [--high_color=#xxx] [--low_color=#xxx]
*     [--rainbow] [--decimal_digits=n] [--cycle_detection]
*
* real & imaginary specify the coordinates of the Mandelbrot Set to display at
* the center of the image.
*
* radius is the radius in Mandelbrot Set coordinates out to the extreme edges
* of the image.
*
* max_iterations is the maximum number of iterations per pixel allowed to
* determine if it is in the set.  The max_iterations will generally need to get
* larger as radius gets smaller.
*
* image_size is the size in pixels for the width and height of the square image.
*
* high_color and low_color are hexadecimal color values.
*
* filename is the basename of the image file.  It defaults to "image".
*
* decimal_digits sets the accuracy of floating point computations when using GMP.
*
* @return int Returns 0 on successful execution.
*/
int main(int argc, char** argv) {
    CLI::App app{"Program to generate Mandelbrot Set images."};
    app.set_help_flag("", "");
    argv = app.ensure_utf8(argv);
    // use CLI11 to process command line options.
    vector<string> coords;
    app.add_option("coords", coords, "Location to plot (x y radius)")
        ->expected(3);

    // Optional arguments
    string filename = "image";
    int max_iteration = 10000;
    int image_size = 512;
    string high_color = "#FFF";
    string low_color = "#000";
    bool use_rainbow = false;
    int decimal_digits = 16;
    int cell_size = 0;
    bool use_cycle_detection = false;
    bool show_help = false;

    app.add_flag("--help", show_help, "Show help");
    app.add_option("-f,--filename", filename, "Base filename (no extension)");
    app.add_option("-m,--max_iterations", max_iteration, "Max. iterations");
    app.add_option("-s,--image_size", image_size, "Image size in pixels");
    app.add_option("-h,--high_color", high_color, "High color hex code (#xxx or #xxxxxx)");
    app.add_option("-l,--low_color", low_color, "Low color hex code (#xxx or #xxxxxx)");
    app.add_flag("-r,--rainbow", use_rainbow, "Use rainbow colors");
    app.add_option("-d,--decimal_digits", decimal_digits, "Decimal digits of accuracy to use");
    app.add_option("-c,--cell_size", cell_size, "Size in pixels of computation cell");
    app.add_flag("-C,--cycle_detection", use_cycle_detection, "Use cycle detection algorithm");

    CLI11_PARSE(app, argc, argv);

    if (show_help) {
        cout << app.help() << endl;
        return 0;
    }

    // Get the three required coordinates to plot
    double_type x, y, radius;
    if (coords.size() == 3) {
#ifdef HAS_MPREAL
        DIGITS = decimal_digits;
        mpreal::set_default_prec(PRECISION);    // Set default precision in bits
        x = double_type(coords[vector<string>::size_type(0)]);
        y = double_type(coords[vector<string>::size_type(1)]);
        radius = double_type(coords[vector<string>::size_type(2)]);
#elif defined __FLOAT128__
        x = strtoflt128(coords[0], nullptr);
        y = strtoflt128(coords[1], nullptr);
        radius = strtoflt128(coords[2], nullptr);
#elif defined LONG_DOUBLE
        x = stold(coords[0]);
        y = stold(coords[1]);
        radius = stold(coords[2]);
#else
        x = stod(coords[0]);
        y = stod(coords[1]);
        radius = stod(coords[2]);
#endif
    } else {
        cout << "Arguments x, y and radius required but not found!" << endl;
        return 1;
    }

    // Check the image size
    if (image_size > sqrt(numeric_limits<int>::max())) {
        cout << "Image size can not exceed " << floor(sqrt(numeric_limits<int>::max())) << endl;
        return 1;
    }
    
    // Check the max number of iterations
    if (max_iteration > (numeric_limits<int32_t>::max() / 2)) {
        cout << "Max iterations can not exceed " << numeric_limits<int32_t>::max() / 2 << endl;
        return 1;
    }

    // Get the low and high colors
    RGB high_RGB = RGB(high_color);
    RGB low_RGB = RGB(low_color);
    if (high_RGB.isGray() && low_RGB.isGray()) USE_COLOR = false;
    if (use_rainbow) USE_COLOR = true;

    // Compute limits of image
    const double_type x_min = x - radius;
    const double_type x_max = x + radius;
    const double_type y_min = y - radius;
    const double_type y_max = y + radius;

    // Get the number of CPU cores.
    unsigned int num_cpus = thread::hardware_concurrency();
    if (num_cpus == 0) {
        cout << "Could not determine number of CPUs.  Using 4.\n";
        num_cpus = 4;
    } else {
        cout << "Number of CPU cores: " << num_cpus << endl;
    }

    if (cell_size <= 0) {
        cell_size = static_cast<int>(image_size / std::ceil(std::sqrt(2 * num_cpus)));
        if (cell_size <= 1) cell_size = 1;
        else if (cell_size < 4) cell_size = 2;
        else if (cell_size < 8) cell_size = 4;
        else if (cell_size < 16) cell_size = 8;
        else if (cell_size < 32) cell_size = 16;
        else if (cell_size < 64) cell_size = 32;
        else if (cell_size < 128) cell_size = 64;
        else if (cell_size < 256) cell_size = 128;
        else cell_size = 256;
    }
    image_size = ((image_size + cell_size - 1) / cell_size) * cell_size;

   // Output settings
    cout << "Size of floating point: " << sizeof(double_type) << endl;
    cout << "Floating point decimal digits: " << DOUBLE_DIGITS << endl;
    cout << "Floating point precision: " << PRECISION << " bits" << endl;
    cout << "Maximum Iterations: " << max_iteration << endl;
    cout << (USE_COLOR ? "Using color" : "Using grayscale") << endl;
    if (use_rainbow) {
        cout << "Rainbow colors" << endl;
    } else {
        cout << "High escape color: " << high_color << endl;
        cout << "Low escape color: " << low_color << endl;
    }
    cout << "Image Bounds: " << setprecision(DOUBLE_DIGITS) << x_min << ", " << x_max << ", " << y_min << ", " << y_max << endl;
    cout << "Image Size: " << image_size << " x " << image_size << endl;
    cout << "Cell Size: " << cell_size << " x " << cell_size << endl;
    cout << "Cycle Detection: " << (use_cycle_detection ? "enabled" : "disabled") << endl;
    
    // Do a simple check to see if our doubles have enough precision
    if ((x_max - x_min) / image_size < ULP(x) ||
        (y_max - y_min) / image_size < ULP(y))
        cout << "Doubles may not have enough precision!" << endl;

    // Allocate an array for the row pointers.
    image = make_unique<unique_ptr<int32_t[]>[]>(image_size);

    for (int i = 0; i < image_size; ++i)
        image[i] = make_unique<int32_t[]>(image_size);

    vector<thread> threads;
    threads.reserve(num_cpus);
    // Create the threads
    for (int i = 0; i < num_cpus; ++i) {
        threads.emplace_back(compute_mandelbrot, cell_size, x_min, x_max, y_min,
            y_max, max_iteration, image_size, use_cycle_detection);
    }
    // Wait for all threads to complete
    for (auto& t : threads) {
        t.join();
    }

    cout << endl;
    cout << "Number of escaped pixels: " << num_escaped << endl;
    if (use_cycle_detection)
        cout << "Number of cycles detected: " << num_cycles << endl;
    cout << "Number of max iteration pixels: " << num_max_iter << endl;

    // Generate the output file
    output_file(filename, max_iteration, image_size, low_RGB, high_RGB, use_rainbow);
    return 0;
}

Extending the Reach of our Program

This works great up to a point. If we try to zoom in with a magnification greater than about \(10^{308}\) our algorithm starts to fail due to the limits of double precision floats.

Complete mandelbrot_threaded_color_mp_fast_ext.cpp
/**
* @file mandelbrot_threaded_color_mp_fast_ext.cpp
* @brief Program to compute the Mandelbrot Set.
* 
* This program can compute images of the Mandelbrot Set in its
* entirety or close ups of boundary areas.  It will use as many
* CPUs as are available to speed up the rendering.  This program
* will use extended precision floating point.  The Gnu Multi-Precision
* library is required if hardware extended precision FP is not
* available.  The algorithm used is based on perturbation theory
* so that most computations can be done with doubles.
*
* It also allows two RGB colors to be specified to define the
* coloring of the set based on the escape value of a pixel.
* Colors can be specified using #xxx or #xxxxxx hexadecimal.
*
* Requires:
* Linux<br>
* get cli11.hpp at https://github.com/CLIUtils/CLI11/releases
* get mpreal.h at https://github.com/advanpix/mpreal.git
* sudo apt install libgmp-dev<br>
* sudo apt install libmpfr-dev<br>
*
* macOS<br>
* brew install cli11<br>
* brew install gmp<br>
* brew install mpfr<br>
* get mpreal.h at https://github.com/advanpix/mpreal.git
*
* Build:
* Linux<br>
* put CLI.hpp in /usr/local/include <br>
* put mpreal.h in /usr/local/include <br>
*
* macOS<br>
* put mpreal.h in /opt/homebrew/include

* Add to .zshrc<br>
* export CPPFLAGS="-I/opt/homebrew/include"<br>
* export LDFLAGS="-L/opt/homebrew/lib"
*
* Compile with<br>
* Linux<br>
* If 128-bit doubles are available on your platform <br>
* g++ --std=gnu++20 -o mandelbrot_threaded_color_mp_fast_ext mandelbrot_threaded_color_mp_fast_ext.cpp -lquadmath

* to use GMP<br>
* g++ --std=gnu++20 -o mandelbrot_threaded_color_mp_fast_ext mandelbrot_threaded_color_mp_fast_ext.cpp -lmpfr -lgmp

* macOS<br>
* g++ $CPPFLAGS $LDFLAGS --std=gnu++20 -o mandelbrot_threaded_color_mp_fast_ext mandelbrot_threaded_color_mp_fast_ext.cpp -lmpfr -lgmp
*
* Execute:
* mandelbrot_threaded_color_mp_fast_ext real imaginary radius [--filename=basename]
*     [--image_size=pixels] [--max_iterations=count]
*     [--low_color=#xxx] [--high_color=#xxx] [--rainbow]
*     [--decimal_digits=n] [--cycle_detection]
*
* Test multi-precision with command <br>
* ./mandelbrot_threaded_color_mp_fast 0.3663629834227643413289327308 0.59153377326144522757749349375 1e-27 -m 100000 -d 30
*
* @author Richard Lesh
* @date 2025-03-12
* @version 1.0
*/

#include <bit>
#include <CLI/CLI.hpp>
#include <cassert>
#include <complex>
#include <fstream>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>
#include "BigExpFloat.hpp"
#include "Color.hpp"

#define MAX_PIXEL_LEVELS 256            // 256 for 8-bit
bool USE_COLOR = true;                  // True to allow colors to be used
bool USE_RAINBOW = false;               // True to use rainbow colors
#define USE_EXTENDED_FP                 // Define to use extended precision floating point
#define FORCE_MPREAL                    // Force use of GMP MPREALs

#if MAX_PIXEL_LEVELS > 256
    #define PIXEL_TYPE uint16_t
#else
    #define PIXEL_TYPE uint8_t
#endif

int DIGITS = 16;                                    // Decimal digits of accuracy
#ifdef USE_EXTENDED_FP
    #if defined __FLOAT128__ && !defined FORCE_MPREAL
        typedef __float128 double_type;
    #elif defined(__SIZEOF_LONG_DOUBLE__) && (__SIZEOF_LONG_DOUBLE__ > 8) && !defined FORCE_MPREAL
        #define LONG_DOUBLE
        typedef long double double_type;
    #else
        #define HAS_MPREAL
        #include "mpreal.h"                         // MPFR C++ Wrapper
        using namespace mpfr;
        #define PRECISION int(ceil(DIGITS * 3.3))   // Precision in bits (~3.3 bits per decimal digit)
        #define DOUBLE_DIGITS DIGITS
        typedef mpreal double_type;
        inline mpreal ULP(const mpreal &x) {
            mpreal next = nextabove(x);             // Smallest representable number greater than x
            mpreal ulp = next - x;
            return ulp;
        }
    #endif
#else
    typedef double double_type;
#endif
#ifndef HAS_MPREAL
    #define DOUBLE_DIGITS numeric_limits<double_type>::digits10
    #define PRECISION numeric_limits<double_type>::digits
/**
 * @brief Computes the Unit in the Last Place (ULP) of a double-precision value.
 *
 * This function calculates the distance between the given value @p x and the next
 * representable double greater than @p x in the direction of positive infinity.
 * The result gives a measure of the precision at the value @p x.
 *
 * @param x The input double value.
 * @return The ULP of @p x (i.e., the difference to the next larger representable double).
 *
 * @note If @p x is NaN or infinity, the result may be undefined.
 */
    inline double_type ULP(double_type x) {
        double_type next = nextafter(x, numeric_limits<double_type>::infinity());
        double_type ulp = next - x;
        return ulp;
    }
#endif

using namespace std;

typedef BigExpFloat<double> BigFloat;

typedef Color<PIXEL_TYPE> RGB;

constexpr std::string::size_type operator"" _sz(unsigned long long n) {
    return static_cast<size_t>(n);
}

// Array used to store the image in a thread safe way as an array of arrays
// Each element of the array will be a pointer to an array containing the
// pixels for a row of the image.  Rows will be computed by threads.
unique_ptr<unique_ptr<int32_t[]>[]> image = nullptr;
// Atomic int used to assign tasks (row to compute) to the threads
atomic_int counter{0};

/**
 * @brief Computes the squared magnitude (norm) of a complex number without
 * taking the square root.
 *
 * This function returns the sum of the squares of the real and imaginary parts
 * of the complex number @p z.  It avoids the overhead of computing a square
 * root, making it useful for performance-sensitive contexts like Mandelbrot
 * set calculations where only relative magnitude comparisons are needed.
 *
 * @tparam T The numeric type (e.g., float, double, mpreal).
 * @param z The complex number whose squared magnitude is to be computed.
 * @return The squared magnitude of @p z, equivalent to |z|^2.
 *
 * @see norm may be used as a standard alternative, but may perform
 * additional checks.
 */
template<typename T>
inline T norm_squared(const complex<T>& z) {
    return z.real() * z.real() + z.imag() * z.imag();
}

// Number of pixels that escape
atomic_int num_escaped{0};
// Number of cycles detected
atomic_int num_cycles{0};
// Number of pixels that reach max. iterations
atomic_int num_max_iter{0};

/**
* @brief Compute the number of iterations to escape.
*
* Function computes the number of iterations of the formula
* z(n+1) = z(n)^2 + c that can be executed before the value of |z|
* exceeds the escape limit radius of 2.0.
*
* @param c Complex point used to perform the calculation.
* @param max_iteration Number of iterations, if exceeded, when we will declare
*                      as not escaping and therefore in the Mandelbrot set.
* @return vector of complex points representing the orbit of c.
*/

vector<complex<double_type>>
compute_escape(const complex<double_type> &c, int max_iteration) {
    int iteration = 0;                  // Number of iterations before escaping
    complex<double_type> z{0., 0.};
    vector<complex<double_type>> reference;
    reference.push_back(z);

    for (int i = 0; i < max_iteration && norm_squared(z) < 1e20; ++i) {
        z *= z;
        z += c;
        reference.emplace_back(z.real(), z.imag());
        ++iteration;
    }

    return reference;
}

// Convert double to its bitwise representation as int64_t
inline int64_t toULP(double x) {
    auto i = std::bit_cast<int64_t>(x);
    // adjust negative ones so that the resulting int64_t sort the same as the doubles.
    return i < 0 ? numeric_limits<int64_t>::min() - i : i;
}

struct QuantizedComplex {
    int64_t ulp_real;
    int64_t ulp_imag;
    int32_t ulp_real_exp;
    int32_t ulp_imag_exp;
    int64_t d_ulp_real;
    int64_t d_ulp_imag;
    int32_t d_real_exp;
    int32_t d_imag_exp;

    QuantizedComplex(const complex<BigFloat> &x, const complex<BigFloat> &dx, int ulp_bin_shift = 2) {
        // Quantize by grouping into bins of ulp_bin ULPs
        ulp_real = toULP(x.real().significand()) >> ulp_bin_shift;
        ulp_imag = toULP(x.imag().significand()) >> ulp_bin_shift;
        d_ulp_real = toULP(dx.real().significand()) >> ulp_bin_shift;
        d_ulp_imag = toULP(dx.imag().significand()) >> ulp_bin_shift;
        ulp_real_exp = x.real().exponent();
        ulp_imag_exp = x.imag().exponent();
        d_real_exp = dx.real().exponent();
        d_imag_exp = dx.imag().exponent();
    }

    bool operator==(const QuantizedComplex& other) const {
        return ulp_real == other.ulp_real && ulp_imag == other.ulp_imag &&
            d_ulp_real == other.d_ulp_real && d_ulp_imag == other.d_ulp_imag &&
            ulp_real_exp == other.ulp_real_exp && ulp_imag_exp == other.ulp_imag_exp &&
            d_real_exp == other.d_real_exp && d_imag_exp == other.d_imag_exp;
    }

    static bool ulp_diff(const BigFloat &a, const BigFloat &b, int max_ulps) {
        return a.exponent() == b.exponent() &&
            std::abs(toULP(a.significand()) - toULP(b.significand())) <= max_ulps;
    }
};

// Hash based on quantized ULPs
template <>
struct std::hash<QuantizedComplex> {
    size_t operator()(const QuantizedComplex& q) const noexcept {
        return q.ulp_real ^ q.ulp_imag ^ q.d_ulp_real ^ q.d_ulp_imag;
    }
};

/**
* @brief Compute a portion of the Mandelbrot Set.
* 
* Function computes a portion of the Mandelbrot Set row by row.
* The atomic integer counter is used to get the row to compute.
* The row is stored in the image array.
*
* @param cell_size size of the computation cell in pixels
* @param x_min Minimum real axis bound to plot.
* @param x_max Maximum real axis bound to plot.
* @param y_min Minimum imaginary axis bound to plot.
* @param y_max Maximum imaginary axis bound to plot.
* @param max_iteration Limit to the number of iterations we will compute per point.
* @param image_size Size, in pixels, of the image to generate.
* @param use_cycle_detection true to use cycle detection algorithm.  Impacts
*                           performance negatively if no cycles present.
*/

void compute_mandelbrot(int cell_size, const double_type& x_min, const double_type& x_max, const double_type& y_min,
    const double_type& y_max, int max_iteration, int image_size, bool use_cycle_detection = false) {
#ifdef HAS_MPREAL
    mpreal::set_default_prec(PRECISION);      // Set default precision in bits for each thread
#endif
    int num_cells = image_size / cell_size;
    double_type width = x_max - x_min;
    double_type height = y_max - y_min;
    int modulus = ceil(num_cells * num_cells / 40.);
    int cycle_detection_limit = 10000;
    double_type cell_width = (x_max - x_min) / num_cells;
    double_type cell_height = (y_max - y_min) / num_cells;
    double_type cell_x_min, cell_x_max, cell_y_min, cell_y_max;
    double_type x0, y0;
    complex<double_type> center, point, dc_full, ref;
    complex<BigFloat> ref_z, dc, dz;
    while (true) {
        int cell_num = counter.fetch_add(1);
        if (cell_num < num_cells * num_cells) {
            int i = cell_num % num_cells;
            int j = cell_num / num_cells;
            cell_x_min = x_min + i * cell_width;
            cell_x_max = x_min + i * cell_width + cell_width;
            cell_y_min = y_max - j * cell_width - cell_width;
            cell_y_max = y_max - j * cell_width;
            center.real((cell_x_max + cell_x_min) / 2.);
            center.imag((cell_y_max + cell_y_min) / 2.);
            vector<complex<double_type>> reference = compute_escape(center, max_iteration);
            vector<double_type>::size_type max_ref_iteration = reference.size() - 1;
            vector<complex<BigFloat>> reference_as_double;
            for (vector<complex<double_type>>::size_type c = 0; c <= max_ref_iteration; ++c) {
                ref = reference[c];
                reference_as_double.emplace_back(
                    static_cast<BigFloat>(ref.real()),
                    static_cast<BigFloat>(ref.imag()));
            }
            assert(max_ref_iteration > 1);
            bool enable_cycle_detection = false;
            for (int i0 = 0; i0 < cell_size; ++i0) {
                for (int j0 = 0; j0 < cell_size; ++j0) {
                    y0 = j0 + 0.5;
                    y0 *= cell_height;
                    y0 /= cell_size;
                    mpfr::negate(y0);
                    y0 += cell_y_max;
                    x0 = i0 + 0.5;
                    x0 *= cell_width;
                    x0 /= cell_size;
                    x0 += cell_x_min;
                    point.real(x0);
                    point.imag(y0);
                    dc_full = point - center;
                    dc.real(static_cast<BigFloat>(dc_full.real()));
                    dc.imag(static_cast<BigFloat>(dc_full.imag()));
                    int iteration = 0;
                    vector<complex<BigFloat>>::size_type ref_iteration = 0;
                    dz.real(0.);
                    dz.imag(0.);
                    ref_z = reference_as_double[ref_iteration];
                    std::unordered_set<QuantizedComplex> seen;
                    bool cycle_found = false;
                    BigFloat two(2.0);
                    BigFloat four(4.0);
                    while (iteration < max_iteration) {
                        dz = two * dz * ref_z + dz * dz + dc;
                        ++ref_iteration;
                        ref_z = reference_as_double[ref_iteration];
                        complex<BigFloat> z = ref_z + dz;
                        BigFloat norm = norm_squared(z);
                        if (norm > four) break;
                        if (enable_cycle_detection && iteration > cycle_detection_limit) {
                            QuantizedComplex qz(ref_z, dz);
                            if (seen.count(qz)) {
                                cycle_found = true;
                                break;
                            } else {
                                seen.insert(qz);
                            }
                        }
                        if (norm < norm_squared(dz) || ref_iteration == max_ref_iteration) {
                            dz = z;
                            ref_iteration = 0;
                            ref_z = reference_as_double[ref_iteration];
                        }
                        ++iteration;
                    }
                    if (cycle_found) {
                        ++num_cycles;
                        cycle_detection_limit = static_cast<int>(0.9 * iteration);
                       iteration = 0;
                    } else if (iteration == max_iteration) {
                        ++num_max_iter;
                        enable_cycle_detection = use_cycle_detection;
                        cycle_detection_limit = static_cast<int>(0.9 * max_iteration);
                    } else {
                        ++num_escaped;
                        if (cycle_detection_limit < iteration) {
                            cycle_detection_limit = static_cast<int>(1.1 * iteration);
                        }
                    }
                    image[j * cell_size + j0][i * cell_size + i0] = iteration == max_iteration ? 0 : iteration;
                }
            }
            if (cell_num % modulus == 0) cout << '.' << flush;
        } else {
            break;
        }
    }
}

/**
* @brief Write out a PGM/PPM file.
* 
* Function writes out image data to a PGM or PPM format file.  Can be an 
* 8-bit/16-bit PGM or an 8-bit/16-bit PPM.  This function normalizes the
* level values using a histogram.
* 
* @param filename Pathspec to the file that will be created.
* @param max_iteration Limit to the number of iterations we used to compute each point.
* @param image_size Size, in pixels, of the image to save.
* @param color1 First color to use for near 1 iteration before escaping.
* @param color2 Second color to use for near max_iterations iterations before escaping.
* @param use_rainbow true to use rainbow coloring.
*/
void output_file(const string &filename, int max_iteration, int image_size, RGB color1, RGB color2, bool use_rainbow) {
    string ext = ".pgm";
    if (USE_COLOR) ext = ".ppm";

    cout << endl << "Creating " << (filename + ext) << "..." << endl;
    ofstream out_file((filename + ext).c_str(), ios::binary);
    if (!out_file) {                    // Check if file opened successfully
        cerr << "Error opening file " << (filename + ext) << " for writing!" << endl;
        exit(1);
    }
    if (USE_COLOR) {
        // File header is P6 for binary pix map, x_size y_size, max pixel value per RGB component
        out_file << "P6\n" << image_size << " " << image_size << "\n" << (MAX_PIXEL_LEVELS - 1) << "\n";
    } else {
        // File header is P5 for binary gray map, x_size y_size, max pixel value
        out_file << "P5\n" << image_size << " " << image_size << "\n" << (MAX_PIXEL_LEVELS - 1) << "\n";
    }

    // Find the maximum iteration count
    int max = 0;
    for (int j = 0; j < image_size; ++j) {
        for (int i = 0; i < image_size; ++i) {
            if (image[j][i] > max) max = image[j][i];
        }
    }
    max_iteration = max + 1;
    // Compute bucket size for histogram
    int max_iteration_bits = static_cast<int>(ceil(log(max_iteration)/log(2.0)));
    int bucket_shift_bits = max_iteration_bits > 20 ? max_iteration_bits - 20 : 0;
    vector<int>::size_type histogram_size = (max_iteration >> bucket_shift_bits) + 1;
    // Compute histogram
    vector<int> histogram(histogram_size, 0);
    int totalCount = 0;
    for (int j = 0; j < image_size; ++j) {
        for (int i = 0; i < image_size; ++i) {
            if (image[j][i] > 0) {
                ++histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)];
                ++totalCount;
            }
        }
    }
    // Compute cumulative histogram
    int count = 0;
    for (vector<int>::size_type i = 1; i < histogram_size; ++i) {
        count += histogram[i];
        histogram[i] = count;
    }
    // Normalize using histogram
    for (int j = 0; j < image_size; ++j) {
        for (int i = 0; i < image_size; ++i) {
            if (USE_COLOR) {
                RGB rgb(0,0,0);
                if (image[j][i] != 0) {
                    if (use_rainbow)
                        rgb = level_2_RGB<PIXEL_TYPE>(
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1));
                    else
                        rgb = blend_colors(
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1), color1, color2);
                }
                if constexpr (sizeof(PIXEL_TYPE) == 2) {
                    out_file.write(reinterpret_cast<const char*>(&rgb.R) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.R), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.G) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.G), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.B) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.B), 1);
                } else {
                    out_file.write(reinterpret_cast<const char*>(&rgb.R), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.G), 1);
                    out_file.write(reinterpret_cast<const char*>(&rgb.B), 1);
                }
            } else {
                if constexpr (sizeof(PIXEL_TYPE) == 2) {
                    PIXEL_TYPE level = 0;
                    if (image[j][i] != 0)
                        level = static_cast<uint8_t>(color1.R +
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1) * (color2.R - color1.R + 1));
                    out_file.write(reinterpret_cast<const char*>(&level) + 1, 1);
                    out_file.write(reinterpret_cast<const char*>(&level), 1);
                } else {
                    PIXEL_TYPE level = 0;
                    if (image[j][i] != 0)
                        level = static_cast<uint8_t>(color1.R +
                            histogram[vector<int>::size_type(image[j][i] >> bucket_shift_bits)] /
                            static_cast<double>(totalCount + 1) * (color2.R - color1.R + 1));
                    out_file.write(reinterpret_cast<const char*>(&level), 1);
                }
            }
        }
    }
    out_file.close();
}

/**
* @brief Main function.
* 
* The entry point of the program.
*
* Syntax: mandelbrot_threaded_color_mp_fast_ext real imaginary radius [--filename=basename]
*     [--image_size=n] [--max_iterations=n] [--high_color=#xxx] [--low_color=#xxx]
*     [--rainbow] [--decimal_digits=n] [--cycle_detection]
*
* real & imaginary specify the coordinates of the Mandelbrot Set to display at
* the center of the image.
*
* radius is the radius in Mandelbrot Set coordinates out to the extreme edges
* of the image.
*
* max_iterations is the maximum number of iterations per pixel allowed to
* determine if it is in the set.  The max_iterations will generally need to get
* larger as radius gets smaller.
*
* image_size is the size in pixels for the width and height of the square image.
*
* high_color and low_color are hexadecimal color values.
*
* filename is the basename of the image file.  It defaults to "image".
*
* decimal_digits sets the accuracy of floating point computations when using GMP.
*
* @return int Returns 0 on successful execution.
*/
int main(int argc, char** argv) {
    CLI::App app{"Program to generate Mandelbrot Set images."};
    app.set_help_flag("", "");
    argv = app.ensure_utf8(argv);
    // use CLI11 to process command line options.
    vector<string> coords;
    app.add_option("coords", coords, "Location to plot (x y radius)")
        ->expected(3);

    // Optional arguments
    string filename = "image";
    int max_iteration = 10000;
    int image_size = 512;
    string high_color = "#FFF";
    string low_color = "#000";
    bool use_rainbow = false;
    int decimal_digits = 16;
    int cell_size = 0;
    bool use_cycle_detection = false;
    bool show_help = false;

    app.add_flag("--help", show_help, "Show help");
    app.add_option("-f,--filename", filename, "Base filename (no extension)");
    app.add_option("-m,--max_iterations", max_iteration, "Max. iterations");
    app.add_option("-s,--image_size", image_size, "Image size in pixels");
    app.add_option("-h,--high_color", high_color, "High color hex code (#xxx or #xxxxxx)");
    app.add_option("-l,--low_color", low_color, "Low color hex code (#xxx or #xxxxxx)");
    app.add_flag("-r,--rainbow", use_rainbow, "Use rainbow colors");
    app.add_option("-d,--decimal_digits", decimal_digits, "Decimal digits of accuracy to use");
    app.add_option("-c,--cell_size", cell_size, "Size in pixels of computation cell");
    app.add_flag("-C,--cycle_detection", use_cycle_detection, "Use cycle detection algorithm");

    CLI11_PARSE(app, argc, argv);

    if (show_help) {
        cout << app.help() << endl;
        return 0;
    }

    // Get the three required coordinates to plot
    double_type x, y, radius;
    if (coords.size() == 3) {
#ifdef HAS_MPREAL
        DIGITS = decimal_digits;
        mpreal::set_default_prec(PRECISION);    // Set default precision in bits
        x = double_type(coords[vector<string>::size_type(0)]);
        y = double_type(coords[vector<string>::size_type(1)]);
        radius = double_type(coords[vector<string>::size_type(2)]);
#elif defined __FLOAT128__
        x = strtoflt128(coords[0], nullptr);
        y = strtoflt128(coords[1], nullptr);
        radius = strtoflt128(coords[2], nullptr);
#elif defined LONG_DOUBLE
        x = stold(coords[0]);
        y = stold(coords[1]);
        radius = stold(coords[2]);
#else
        x = stod(coords[0]);
        y = stod(coords[1]);
        radius = stod(coords[2]);
#endif
    } else {
        cout << "Arguments x, y and radius required but not found!" << endl;
        return 1;
    }

    // Check the image size
    if (image_size > sqrt(numeric_limits<int>::max())) {
        cout << "Image size can not exceed " << floor(sqrt(numeric_limits<int>::max())) << endl;
        return 1;
    }
    
    // Check the max number of iterations
    if (max_iteration > (numeric_limits<int32_t>::max() / 2)) {
        cout << "Max iterations can not exceed " << numeric_limits<int32_t>::max() / 2 << endl;
        return 1;
    }

    // Get the low and high colors
    RGB high_RGB = RGB(high_color);
    RGB low_RGB = RGB(low_color);
    if (high_RGB.isGray() && low_RGB.isGray()) USE_COLOR = false;
    if (use_rainbow) USE_COLOR = true;

    // Compute limits of image
    const double_type x_min = x - radius;
    const double_type x_max = x + radius;
    const double_type y_min = y - radius;
    const double_type y_max = y + radius;

    // Get the number of CPU cores.
    unsigned int num_cpus = thread::hardware_concurrency();
    if (num_cpus == 0) {
        cout << "Could not determine number of CPUs.  Using 4.\n";
        num_cpus = 4;
    } else {
        cout << "Number of CPU cores: " << num_cpus << endl;
    }

    if (cell_size <= 0) {
        cell_size = static_cast<int>(image_size / std::ceil(std::sqrt(2 * num_cpus)));
        if (cell_size <= 1) cell_size = 1;
        else if (cell_size < 4) cell_size = 2;
        else if (cell_size < 8) cell_size = 4;
        else if (cell_size < 16) cell_size = 8;
        else if (cell_size < 32) cell_size = 16;
        else if (cell_size < 64) cell_size = 32;
        else if (cell_size < 128) cell_size = 64;
        else if (cell_size < 256) cell_size = 128;
        else cell_size = 256;
    }
    image_size = ((image_size + cell_size - 1) / cell_size) * cell_size;

    // Output settings
    cout << "Size of floating point: " << sizeof(double_type) << endl;
    cout << "Floating point decimal digits: " << DOUBLE_DIGITS << endl;
    cout << "Floating point precision: " << PRECISION << " bits" << endl;
    cout << "Maximum Iterations: " << max_iteration << endl;
    cout << (USE_COLOR ? "Using color" : "Using grayscale") << endl;
    if (use_rainbow) {
        cout << "Rainbow colors" << endl;
    } else {
        cout << "High escape color: " << high_color << endl;
        cout << "Low escape color: " << low_color << endl;
    }
    cout << "Image Bounds: " << setprecision(DOUBLE_DIGITS) << x_min << ", " << x_max << ", " << y_min << ", " << y_max << endl;
    cout << "Image Size: " << image_size << " x " << image_size << endl;
    cout << "Cell Size: " << cell_size << " x " << cell_size << endl;
    cout << "Cycle Detection: " << (use_cycle_detection ? "enabled" : "disabled") << endl;
    
    // Do a simple check to see if our doubles have enough precision
    if ((x_max - x_min) / image_size < ULP(x) ||
        (y_max - y_min) / image_size < ULP(y))
        cout << "Doubles may not have enough precision!" << endl;

    // Allocate an array for the row pointers.
    image = make_unique<unique_ptr<int32_t[]>[]>(image_size);

    for (int i = 0; i < image_size; ++i)
        image[i] = make_unique<int32_t[]>(image_size);

    vector<thread> threads;
    threads.reserve(num_cpus);
    // Create the threads
    for (int i = 0; i < num_cpus; ++i) {
        threads.emplace_back(compute_mandelbrot, cell_size, x_min, x_max, y_min,
            y_max, max_iteration, image_size, use_cycle_detection);
    }
    // Wait for all threads to complete
    for (auto& t : threads) {
        t.join();
    }

    cout << endl;
    cout << "Number of escaped pixels: " << num_escaped << endl;
    if (use_cycle_detection)
        cout << "Number of cycles detected: " << num_cycles << endl;
    cout << "Number of max iteration pixels: " << num_max_iter << endl;

    // Generate the output file
    output_file(filename, max_iteration, image_size, low_RGB, high_RGB, use_rainbow);
    return 0;
}