This fix prevents a build failure caused by the assembler not accepting the
PLD instruction when compiling for an armv4 CPU with the relevant
-mcpu/-march options set in CFLAGS.
The NEON unit has fast access to the L1/L2 caches; even a simple
copy of memory buffers using NEON provides a more than 1.5x
performance improvement on ARM Cortex-A8.
This function is needed to improve the performance of the xfce4 terminal when
using bitmap fonts and running on a 16bpp desktop. Some other applications
may potentially benefit too.
After applying this patch, top functions from Xorg process in
oprofile log change from
samples  %        image name             symbol name
13296    29.1528  libpixman-1.so.0.17.1  combine_over_u
6452     14.1466  libpixman-1.so.0.17.1  fetch_scanline_r5g6b5
5516     12.0944  libpixman-1.so.0.17.1  fetch_scanline_a1
2273      4.9838  libpixman-1.so.0.17.1  store_scanline_r5g6b5
1741      3.8173  libpixman-1.so.0.17.1  fast_composite_add_1000_1000
1718      3.7669  libc-2.9.so            memcpy
to
samples  %        image name             symbol name
5594     14.7033  libpixman-1.so.0.17.1  fast_composite_over_n_1_0565
4323     11.3626  libc-2.9.so            memcpy
3695      9.7119  libpixman-1.so.0.17.1  fast_composite_add_1000_1000
when scrolling text in terminal (reading man page).
This is a similar change to the top/bottom one, but in this case the
rounding is simpler because it always rounds down.
Based on a patch by M Joonas Pihlaja.
The rule for trap rasterization is that coordinates are rounded
towards the north-west.
The pixman_sample_ceil() function is used to compute the first
(top-most) sample row included in the trap, so when the input
coordinate is already exactly on a sample row, no rounding should take
place.
On the other hand, pixman_sample_floor() is used to compute the final
(bottom-most) sample row, so if the input is precisely on a sample
row, it needs to be rounded down to the previous row.
This commit fixes the rounding computation. The idea of the
computation is like this:
A floor operation that rounds exact matches down: first subtract
pixman_fixed_e to make sure an input already on a sample row gets rounded
down. Then find out how many whole small steps lie between that value and
the first fraction, and add that many small steps to the first fraction.
The ceil operation first adds small_step, then runs the floor; since the
floor moves exact matches down by one step, the added step cancels out and
exact matches are not rounded off.
Based on a patch by M Joonas Pihlaja.
The sampling grid is slightly skewed in the antialiased case. Consider
the case where we have n = 8 bits of alpha.
The small step is
small_step = fixed_1 / 15 = 65536 / 15 = 4369
The first fraction is then
frac_first = (small_step / 2) = (65536 / 15) / 2 = 2184
and the last fraction becomes
frac_last
= frac_first + (15 - 1) * small_step = 2184 + 14 * 4369 = 63350
which means the size of the last bit of the pixel is
65536 - 63350 = 2186
which is 2 bigger than the first fraction. This is not the end of the
world, but it would be more correct to have 2185 and 2185, and we can
accomplish that simply by making the first fraction half the *big*
step instead of half the small step.
If we ever move to coordinates with 8 fractional bits, the
corresponding values become 8 and 10 out of 256, where 9 and 9 would
be better.
Similarly in the X direction.
Instead introduce two new fake formats
PIXMAN_pixbuf
PIXMAN_rpixbuf
and compute whether the source and mask have them in
find_fast_path(). This led to some duplicate entries in the fast path
tables that could then be removed.
This flag was used to indicate that the mask was solid while still
allowing a specific format to be required. However, there is not
actually any need for this because the fast paths all used
_pixman_image_get_solid() which already allowed arbitrary formats.
The one thing that had to be dealt with was component alpha. In
addition to interpreting the presence of the NEED_COMPONENT_ALPHA
flag, we now also interpret the *absence* of this flag as a
requirement that the mask does *not* have component alpha.
Siarhei Siamashka pointed out that the first version of this commit
had a bug, in which a NEED_SOLID_MASK was accidentally not turned into
a PIXMAN_solid in the ARM NEON implementation.
When the destination buffer is either a8r8g8b8 or x8r8g8b8, we can use
it directly instead of fetching into a temporary buffer. When the
format is x8r8g8b8, we require the operator to not make use of
destination alpha, but when it is a8r8g8b8, there are no restrictions.
This is approximately a 5% speedup on the poppler cairo benchmark:
[ # ]  backend  test     min(s)  median(s)  stddev.  count
Before:
[ 0]   image    poppler   6.661      6.709    0.59%    6/6
After:
[ 0]   image    poppler   6.307      6.320    0.12%    5/6
This is a small speedup on the swfdec-youtube benchmark:
Before:
[ 0]  image  swfdec-youtube  5.789  5.806  0.20%  6/6
After:
[ 0]  image  swfdec-youtube  5.489  5.524  0.27%  6/6
I.e., approximately 5% faster.
The GNU assembler and its macro preprocessor are now used to generate
NEON-optimized functions from a common template. This automatically
takes care of nuisances like ensuring optimal alignment, dealing with
leading/trailing pixels, and doing prefetch.
Implementations for a lot of compositing functions are also added,
but not enabled.