# HW Acceleration using SIMD assembly instructions

The test app provides two types of tests: a [`functionality test`](#functionality-test) and a [`benchmark test`](#benchmark-test). Both are provided for each function written in assembly (typically one per assembly file). Both use a hard copy of the LVGL blending API, which serves as the ANSI C implementation of the LVGL blending functions. The hard copy is located in the [`lv_blend`](main/lv_blend/) folder.

The assembly source files can be found in the [`lvgl_port`](../../src/lvgl9/simd/) component. The header file with the assembly function prototypes is passed to LVGL through the Kconfig option `LV_DRAW_SW_ASM_CUSTOM_INCLUDE` and can be found in [`lvgl_port/include`](../../include/esp_lvgl_port_lv_blend.h).

## Benchmark results for LV Fill functions (memset)

| Color format | Matrix size | Memory alignment |  ASM version   | ANSI C version |
| :----------- | :---------- | :--------------- | :------------- | :------------- |
| ARGB8888     | 128x128     |     16 byte      |     0.327      |     1.600      |
|              | 127x127     |      1 byte      |     0.488      |     1.597      |
| RGB565       | 128x128     |     16 byte      |     0.196      |     1.146      |
|              | 127x127     |      1 byte      |     0.497      |     1.124      |
| RGB888       | 128x128     |     16 byte      |     0.608      |     4.062      |
|              | 127x127     |      1 byte      |     0.818      |     3.969      |
* The data was obtained by running the [benchmark tests](#benchmark-test) on a 128x128, 16-byte aligned matrix (ideal case) and a 127x127, 1-byte aligned matrix (worst case)
* The values are CPU cycles per sample needed for a simple fill of the matrix on esp32s3

## Benchmark results for LV Image functions (memcpy)

| Color format | Matrix size | Memory alignment |  ASM version   | ANSI C version |
| :----------- | :---------- | :--------------- | :------------- | :------------- |
| RGB565       | 128x128     |     16 byte      |     0.352      |     3.437      |
|              | 127x128     |      1 byte      |     0.866      |     5.978      |
* The data was obtained by running the [benchmark tests](#benchmark-test) on a 128x128, 16-byte aligned matrix (ideal case) and a 127x128, 1-byte aligned matrix (worst case)
* The values are CPU cycles per sample needed for a memory copy between two matrices on esp32s3

## Functionality test
* Tests whether the HW-accelerated assembly version of an LVGL function produces the same results as the ANSI version
* Top-level flow of the functionality test:
    * generate a test matrix from the test parameters (matrix width, matrix height, memory alignment, ...)
    * run the ANSI version of the DUT function with the generated input parameters
    * run the assembly version of the DUT function with the same input parameters
    * compare the results produced by the ANSI and the assembly DUTs
    * the results must be identical
    * repeat all the steps for a set of different input parameters, covering different matrix heights, widths, ...
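
The flow above can be sketched on a host machine as follows. This is only a minimal illustration, not the test app's actual code: `fill_ref`, `fill_dut`, `fill_matches` and `run_fill_sweep` are hypothetical names, and `fill_dut` is a plain C stand-in for the assembly DUT, which is assumed to process several pixels per step and then handle the remaining tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reference ("ANSI") fill: one pixel at a time, honoring the stride. */
static void fill_ref(uint32_t *dst, int w, int h, int stride, uint32_t color)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            dst[(size_t)y * stride + x] = color;
        }
    }
}

/* Hypothetical stand-in for the assembly DUT: 4 pixels per step plus a tail. */
static void fill_dut(uint32_t *dst, int w, int h, int stride, uint32_t color)
{
    for (int y = 0; y < h; y++) {
        uint32_t *row = dst + (size_t)y * stride;
        int x = 0;
        for (; x + 4 <= w; x += 4) {
            row[x] = row[x + 1] = row[x + 2] = row[x + 3] = color;
        }
        for (; x < w; x++) {
            row[x] = color;
        }
    }
}

/* Run both versions on identically initialized buffers and compare every
 * byte, so stray writes into the stride padding are detected as well. */
static bool fill_matches(int w, int h, int stride, uint32_t color)
{
    assert(w > 0 && h > 0 && stride >= w);
    size_t bytes = (size_t)h * stride * sizeof(uint32_t);
    uint32_t *ref = malloc(bytes);
    uint32_t *dut = malloc(bytes);
    bool same = false;
    if (ref && dut) {
        memset(ref, 0xA5, bytes);
        memset(dut, 0xA5, bytes);
        fill_ref(ref, w, h, stride, color);
        fill_dut(dut, w, h, stride, color);
        same = memcmp(ref, dut, bytes) == 0;
    }
    free(ref);
    free(dut);
    return same;
}

/* Sweep widths, heights and stride paddings. Returns the number of
 * mismatching combinations and stores the number of tested ones. */
int run_fill_sweep(int max_w, int max_h, int max_pad, int *combinations)
{
    int failures = 0;
    *combinations = 0;
    for (int w = 1; w <= max_w; w++) {
        for (int h = 1; h <= max_h; h++) {
            for (int pad = 0; pad <= max_pad; pad++) {
                if (!fill_matches(w, h, w + pad, 0xFF336699u)) {
                    failures++;
                }
                (*combinations)++;
            }
        }
    }
    return failures;
}
```

Calling `run_fill_sweep(9, 5, 2, &n)` exercises 135 combinations in this sketch. Pre-filling both buffers with the same pattern and comparing the whole allocation is a deliberate choice here, since it also catches a DUT that writes past the requested width.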

## Benchmark test
* Tests whether the HW-accelerated assembly version of an LVGL function provides a performance increase over the ANSI version
* Top-level flow of the benchmark test:
    * generate a test matrix from the test parameters (matrix width, matrix height, memory alignment, ...)
    * run the ANSI version of the DUT function with the generated input parameters multiple times (1000 times, for example) while counting CPU cycles
    * run the assembly version of the DUT function with the same input parameters multiple times (1000 times, for example) while counting CPU cycles
    * compare the cycle counts measured for the ANSI and the assembly DUTs
    * the assembly version of the DUT function must be faster than the ANSI version
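
A minimal host-side sketch of the measurement loop is shown below. It is an illustration under stated assumptions rather than the test app's implementation: the names `fill_fn_t`, `fill_plain`, `bench_avg_ticks` and `dut_is_faster` are hypothetical, and the standard `clock()` function stands in for the CPU cycle counter that would be read on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Common signature for the functions being measured (hypothetical). */
typedef void (*fill_fn_t)(uint32_t *dst, size_t count, uint32_t color);

/* Plain C fill used as the function under test in this sketch. */
void fill_plain(uint32_t *dst, size_t count, uint32_t color)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = color;
    }
}

/* Average cost of one call, in clock() ticks, over `runs` repetitions.
 * On the target, a CPU cycle counter would replace clock(). */
double bench_avg_ticks(fill_fn_t fn, uint32_t *buf, size_t count, int runs)
{
    assert(runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++) {
        /* Vary the color so that each run writes different data. */
        fn(buf, count, 0xFF000000u | (uint32_t)i);
    }
    clock_t end = clock();
    return (double)(end - start) / runs;
}

/* Benchmark verdict: the DUT passes when it is faster than the reference. */
int dut_is_faster(double ref_ticks, double dut_ticks)
{
    return dut_ticks < ref_ticks;
}
```

Averaging over many runs smooths out one-off effects such as cold caches. The same harness would be called once with the ANSI function and once with the assembly function, and the two averages passed to `dut_is_faster`.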

## Run the test app

The test app is intended to be used only with esp32 and esp32s3.

    idf.py build

## Example output

```
I (302) main_task: Started on CPU0
I (322) main_task: Calling app_main()
______  _____ ______   _               _   
|  _  \/  ___|| ___ \ | |             | |  
| | | |\ `--. | |_/ / | |_   ___  ___ | |_ 
| | | | `--. \|  __/  | __| / _ \/ __|| __|
| |/ / /\__/ /| |     | |_ |  __/\__ \| |_ 
|___/  \____/ \_|      \__| \___||___/ \__|


Press ENTER to see the list of tests.


Here's the test menu, pick your combo:
(1)	"Test fill functionality ARGB8888" [fill][functionality][ARGB8888]
(2)	"Test fill functionality RGB565" [fill][functionality][RGB565]
(3)	"LV Fill benchmark ARGB8888" [fill][benchmark][ARGB8888]
(4)	"LV Fill benchmark RGB565" [fill][benchmark][RGB565]
(5)	"LV Image functionality RGB565 blend to RGB565" [image][functionality][RGB565]
(6)	"LV Image benchmark RGB565 blend to RGB565" [image][benchmark][RGB565]

Enter test for running.
```

### Example of a functionality test run

```
Running Test fill functionality ARGB8888...
I (81512) LV Fill Functionality: running test for ARGB8888 color format
I (84732) LV Fill Functionality: test combinations: 31824

MALLOC_CAP_8BIT usage: Free memory delta: 0 Leak threshold: -800 
MALLOC_CAP_32BIT usage: Free memory delta: 0 Leak threshold: -800 
./main/test_lv_fill_functionality.c:102:Test fill functionality ARGB8888:PASS
Test ran in 3242ms
```
The test reports a simple PASS/FAIL result after comparing the results of the two DUTs.
It also reports how many combinations of input parameters the functionality test ran with, `31824` in this case.

### Example of a benchmark test run

```
Running LV Fill benchmark ARGB8888...
I (163492) LV Fill Benchmark: running test for ARGB8888 color format
I (163522) LV Fill Benchmark:  ASM ideal case: 5363.123 cycles for 128x128 matrix, 0.327 cycles per sample
I (163572) LV Fill Benchmark:  ASM corner case: 7868.724 cycles for 127x127 matrix, 0.488 cycles per sample

I (163732) LV Fill Benchmark:  ANSI ideal case: 26219.137 cycles for 128x128 matrix, 1.600 cycles per sample
I (163902) LV Fill Benchmark:  ANSI corner case: 25762.178 cycles for 127x127 matrix, 1.597 cycles per sample

MALLOC_CAP_8BIT usage: Free memory delta: -220 Leak threshold: -800 
MALLOC_CAP_8BIT potential leak: Before 393820 bytes free, After 393600 bytes free (delta 220)
MALLOC_CAP_32BIT usage: Free memory delta: -220 Leak threshold: -800 
MALLOC_CAP_32BIT potential leak: Before 393820 bytes free, After 393600 bytes free (delta 220)
./main/test_lv_fill_benchmark.c:69:LV Fill benchmark ARGB8888:PASS
Test ran in 458ms
```

The test provides several pieces of information:
* Total number of CPU cycles for the whole DUT function
    * `5363.123` cycles for the assembly DUT function
    * `26219.137` cycles for the ANSI DUT function
* Number of CPU cycles per sample, which is the total number of CPU cycles divided by the test matrix area
    * `0.327` cycles per sample for the assembly DUT
    * `1.600` cycles per sample for the ANSI DUT
    * In this case, the assembly implementation is around 4.9 times faster than the ANSI implementation.
* Range of CPU cycles (best-case and corner-case scenarios) within which the DUT functions are expected to fit
    * The execution time of these functions depends heavily on the input parameters, so boundary scenarios for the input parameters must be defined
    * An example of such boundaries is shown in the table below
    * The benchmark boundaries help to set performance expectations for real scenarios
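
The per-sample figures follow directly from the logged totals. The short sketch below reproduces that arithmetic; `cycles_per_sample` and `speedup` are hypothetical helper names introduced only for this illustration.

```c
/* Total cycles of one DUT call divided by the matrix area (width x height). */
double cycles_per_sample(double total_cycles, int width, int height)
{
    return total_cycles / ((double)width * (double)height);
}

/* How many times faster the DUT is than the reference implementation. */
double speedup(double ref_cycles, double dut_cycles)
{
    return ref_cycles / dut_cycles;
}
```

With the values from the log above, `cycles_per_sample(5363.123, 128, 128)` is about `0.327`, `cycles_per_sample(26219.137, 128, 128)` is about `1.600`, and `speedup(26219.137, 5363.123)` is about `4.9`.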

Example of best-case and corner-case input parameters for the benchmark test, for the `ARGB8888` color format:

| Test matrix params | Memory alignment | Width          | Height         | Stride         |
| :----------------- | :--------------- | :------------- | :------------- | :------------- |
| Best case          | 16-byte aligned  | Multiple of 8  | Multiple of 8  | Multiple of 8  |
| Corner case        | 1-byte aligned   | Not power of 2 | Not power of 2 | Not power of 2 |
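
One possible way to obtain both memory alignments from the table is to allocate a 16-byte aligned block and, for the corner case, start the test matrix one byte into it. The sketch below assumes a C11 host toolchain with `aligned_alloc`; `alloc_test_buf` is a hypothetical helper and not necessarily how the test app allocates its buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns a pointer that is `misalign` bytes past a 16-byte boundary.
 * The block to pass to free() is stored in *base_out. */
uint8_t *alloc_test_buf(size_t size, size_t misalign, void **base_out)
{
    assert(misalign < 16);
    /* aligned_alloc() expects the size to be a multiple of the alignment. */
    size_t total = (size + misalign + 15) & ~(size_t)15;
    uint8_t *base = aligned_alloc(16, total);
    *base_out = base;
    return base ? base + misalign : NULL;
}
```

For example, `alloc_test_buf(128 * 128 * 4, 0, &base)` would give a best-case ARGB8888 buffer, while `alloc_test_buf(127 * 127 * 4, 1, &base)` would give a 1-byte aligned corner-case buffer. In both cases `base`, not the returned pointer, is what gets released with `free()`.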