The NEON unit has fast access to the L1/L2 caches, and even a simple
copy of memory buffers using NEON provides more than a 1.5x
performance improvement on ARM Cortex-A8.
This function is needed to improve the performance of the xfce4 terminal
when using bitmap fonts and running with a 16bpp desktop. Some other
applications may benefit too.
After applying this patch, the top functions of the Xorg process in
the oprofile log change from
samples  %        image name             symbol name
13296    29.1528  libpixman-1.so.0.17.1  combine_over_u
6452     14.1466  libpixman-1.so.0.17.1  fetch_scanline_r5g6b5
5516     12.0944  libpixman-1.so.0.17.1  fetch_scanline_a1
2273     4.9838   libpixman-1.so.0.17.1  store_scanline_r5g6b5
1741     3.8173   libpixman-1.so.0.17.1  fast_composite_add_1000_1000
1718     3.7669   libc-2.9.so            memcpy
to
samples  %        image name             symbol name
5594     14.7033  libpixman-1.so.0.17.1  fast_composite_over_n_1_0565
4323     11.3626  libc-2.9.so            memcpy
3695     9.7119   libpixman-1.so.0.17.1  fast_composite_add_1000_1000
when scrolling text in the terminal (reading a man page).
This change is similar to the top/bottom one, but in this case the
rounding is simpler because it always rounds down.
Based on a patch by M Joonas Pihlaja.
The rule for trap rasterization is that coordinates are rounded
towards north-west.
The pixman_sample_ceil() function is used to compute the first
(top-most) sample row included in the trap, so when the input
coordinate is already exactly on a sample row, no rounding should take
place.
On the other hand, pixman_sample_floor() is used to compute the final
(bottom-most) sample row, so if the input is precisely on a sample
row, it needs to be rounded down to the previous row.
This commit fixes the rounding computation. The idea of the
computation is like this:
Floor operation that rounds exact matches down: First subtract
pixman_fixed_e to make sure input already on a sample row gets rounded
down. Then find out how many small steps are between the input and the
first fraction. Then add those small steps to the first fraction.
The ceil operation first adds (small_step - pixman_fixed_e), then runs a
floor that keeps exact matches. This ensures that an input already on a
sample row is not rounded up to the next row.
Based on a patch by M Joonas Pihlaja.
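The floor and ceil rounding described above can be sketched in a few
lines of C. This is only an illustration under stated assumptions, not
the code from the patch: the names fixed_t, FIXED_1, FIXED_E, div_floor,
grid_* and sketch_sample_* are hypothetical, the grid uses half a small
step as its first fraction, and overflow near the ends of the fixed
point range is not handled.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;               /* 16.16 fixed point (illustrative alias) */
#define FIXED_1 ((fixed_t) 0x10000)    /* 1.0 */
#define FIXED_E ((fixed_t) 1)          /* smallest representable increment */

/* Integer division that rounds towards negative infinity (requires b > 0). */
static int32_t div_floor(int32_t a, int32_t b)
{
    int32_t q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

/* Assumed grid: n sample rows per pixel at first, first + step, ... */
static fixed_t grid_step(int n)  { return FIXED_1 / n; }
static fixed_t grid_first(int n) { return grid_step(n) / 2; }
static fixed_t grid_last(int n)  { return grid_first(n) + (n - 1) * grid_step(n); }

/* Largest sample row strictly less than y: exact matches are rounded down. */
static fixed_t sketch_sample_floor(fixed_t y, int n)
{
    assert(n > 0);
    fixed_t i = y & ~(FIXED_1 - 1);    /* whole-pixel part */
    fixed_t f = y & (FIXED_1 - 1);     /* fractional part, 0..65535 */
    fixed_t step = grid_step(n), first = grid_first(n);

    /* Subtract FIXED_E first so an exact match lands in the previous step. */
    f = div_floor(f - FIXED_E - first, step) * step + first;
    if (f < first) {                   /* before the first row of this pixel */
        f = grid_last(n);
        i -= FIXED_1;
    }
    return i + f;
}

/* Smallest sample row greater than or equal to y: exact matches stay put. */
static fixed_t sketch_sample_ceil(fixed_t y, int n)
{
    assert(n > 0);
    fixed_t i = y & ~(FIXED_1 - 1);
    fixed_t f = y & (FIXED_1 - 1);
    fixed_t step = grid_step(n), first = grid_first(n);

    /* Add (step - FIXED_E), then do a plain floor onto the grid. */
    f = div_floor(f - first + (step - FIXED_E), step) * step + first;
    if (f > grid_last(n)) {            /* past the last row of this pixel */
        f = first;
        i += FIXED_1;
    }
    return i + f;
}
```

With 15 rows per pixel, an input of 3 * FIXED_1 + 2184 sits exactly on
the first row of pixel 3: the ceil sketch returns it unchanged, while
the floor sketch returns the last row of pixel 2.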
The sampling grid is slightly skewed in the antialiased case. Consider
the case where we have n = 8 bits of alpha.
The small step is
small_step = fixed_1 / 15 = 65536 / 15 = 4369
The first fraction is then
frac_first = (small_step / 2) = 4369 / 2 = 2184
and the last fraction becomes
frac_last
= frac_first + (15 - 1) * small_step = 2184 + 14 * 4369 = 63350
which means the size of the last bit of the pixel is
65536 - 63350 = 2186
which is 2 bigger than the first fraction. This is not the end of the
world, but it would be more correct to have 2185 and 2185, and we can
accomplish that simply by making the first fraction half the *big*
step instead of half the small step.
If we ever move to coordinates with 8 fractional bits, the
corresponding values become 8 and 10 out of 256, where 9 and 9 would
be better.
Similarly in the X direction.
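The arithmetic above can be checked with a small C sketch. It assumes
that the big step is whatever remains of the pixel after (n - 1) small
steps. The type sample_grid_t and the function make_grid are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t small_step;   /* one / n, rounded down */
    int32_t frac_first;   /* offset of the first sample in the pixel */
    int32_t frac_last;    /* offset of the last sample in the pixel */
    int32_t tail;         /* distance from the last sample to the pixel end */
} sample_grid_t;

/*
 * one:          fixed point value of 1.0 (65536 for 16 fractional bits)
 * n:            number of samples per pixel (15 in the example above)
 * use_big_step: 0 -> first fraction is half the small step
 *               1 -> first fraction is half the big step
 */
static sample_grid_t make_grid(int32_t one, int n, int use_big_step)
{
    assert(n > 0 && one >= n);
    sample_grid_t g;
    g.small_step = one / n;
    /* Assumption: the big step absorbs the rounding error of one / n. */
    int32_t big_step = one - (n - 1) * g.small_step;
    g.frac_first = (use_big_step ? big_step : g.small_step) / 2;
    g.frac_last = g.frac_first + (n - 1) * g.small_step;
    g.tail = one - g.frac_last;
    return g;
}
```

For one = 65536 and n = 15 this gives 2184 and 2186 with the small step
and 2185 and 2185 with the big step. For one = 256 it gives 8 and 10
versus 9 and 9.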
Instead introduce two new fake formats
PIXMAN_pixbuf
PIXMAN_rpixbuf
and compute whether the source and mask have them in
find_fast_path(). This led to some duplicate entries in the fast path
tables that could then be removed.
This flag was used to indicate that the mask was solid while still
allowing a specific format to be required. However, there is not
actually any need for this because the fast paths all used
_pixman_image_get_solid() which already allowed arbitrary formats.
The one thing that had to be dealt with was component alpha. In
addition to interpreting the presence of the NEED_COMPONENT_ALPHA
flag, we now also interpret the *absence* of this flag as a
requirement that the mask does *not* have component alpha.
Siarhei Siamashka pointed out that the first version of this commit
had a bug, in which a NEED_SOLID_MASK was accidentally not turned into
a PIXMAN_solid in the ARM NEON implementation.
When the destination buffer is either a8r8g8b8 or x8r8g8b8, we can use
it directly instead of fetching into a temporary buffer. When the
format is x8r8g8b8, we require the operator to not make use of
destination alpha, but when it is a8r8g8b8, there are no restrictions.
This is approximately a 5% speedup on the poppler cairo benchmark:
[ # ]  backend  test     min(s)  median(s)  stddev.  count
Before:
[ 0]   image    poppler  6.661   6.709      0.59%    6/6
After:
[ 0]   image    poppler  6.307   6.320      0.12%    5/6
This is a small speedup on the swfdec-youtube benchmark:
Before:
[ 0] image swfdec-youtube 5.789 5.806 0.20% 6/6
After:
[ 0] image swfdec-youtube 5.489 5.524 0.27% 6/6
I.e., approximately 5% faster.
The GNU assembler and its macro preprocessor are now used to generate
NEON optimized functions from a common template. This automatically
takes care of nuisances like ensuring optimal alignment, dealing with
leading/trailing pixels, doing prefetch, etc.
Implementations for a lot of compositing functions are also added,
but not enabled.
Instead of mucking around with CFLAGS in configure.ac, preventing
users from setting their own CFLAGS, just define
PIXMAN_USE_INTERNAL_API and PIXMAN_DISABLE_DEPRECATED in
pixman-private.h.
This adds a bilinear fetcher for the case where the image has a scaled
transformation, does not repeat, and the format is {ax}8r8g8b8.
Results for the swfdec-youtube benchmark:
Before:
[ # ]  backend  test            min(s)  median(s)  stddev.  count
[ 0]   image    swfdec-youtube  7.841   7.915      0.72%    6/6
After:
[ # ]  backend  test            min(s)  median(s)  stddev.  count
[ 0]   image    swfdec-youtube  6.677   6.780      0.94%    6/6
These results were measured on a faster machine than the ones in the
previous commit, so the numbers are not comparable.
Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
Speed up bilinear interpolation by processing more than one component
at a time on 64 bit architectures, and by precomputing the dist{ixiy}
products on 32 bit architectures.
Previously bilinear interpolation for one pixel would take 24
multiplications. With this improvement it takes 12 on 64 bit, and 20
on 32 bit.
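The idea of handling more than one component per multiplication can be
sketched as follows. This is an illustration under assumptions, not the
code from the patch: the names bilinear_packed, bilinear_per_channel,
spread_ab and spread_rg are hypothetical, pixels are taken to be 32 bit
values with four 8 bit channels, and distx/disty are taken to be 8 bit
weights in 0..255. Each 64 bit word carries two channels spaced far
enough apart that the 16 bit weight products cannot carry into each
other.

```c
#include <assert.h>
#include <stdint.h>

/* Channels at bits 24-31 and 0-7, left in place inside a 64 bit word. */
static uint64_t spread_ab(uint32_t p)
{
    return (uint64_t)(p & 0xff0000ffu);
}

/* Channel at bits 16-23 moved up to bits 32-39; bits 8-15 left in place. */
static uint64_t spread_rg(uint32_t p)
{
    uint64_t p64 = p;
    return ((p64 << 16) & 0x000000ff00000000ull) | (p64 & 0x0000ff00ull);
}

/* Two channels per multiplication: 4 weight products + 2 * 4 = 12 multiplies. */
static uint32_t bilinear_packed(uint32_t tl, uint32_t tr,
                                uint32_t bl, uint32_t br,
                                int distx, int disty)
{
    assert(distx >= 0 && distx < 256 && disty >= 0 && disty < 256);
    uint64_t w_br = (uint64_t)(distx * disty);
    uint64_t w_tr = (uint64_t)(distx * (256 - disty));
    uint64_t w_bl = (uint64_t)((256 - distx) * disty);
    uint64_t w_tl = (uint64_t)((256 - distx) * (256 - disty));
    uint64_t f, r;

    /* The weights sum to 65536, so each result sits 16 bits above its input. */
    f = spread_ab(tl) * w_tl + spread_ab(tr) * w_tr +
        spread_ab(bl) * w_bl + spread_ab(br) * w_br;
    r = f & 0x0000ff0000ff0000ull;

    f = spread_rg(tl) * w_tl + spread_rg(tr) * w_tr +
        spread_rg(bl) * w_bl + spread_rg(br) * w_br;
    r |= ((f >> 16) & 0x000000ff00000000ull) | (f & 0xff000000ull);

    return (uint32_t)(r >> 16);
}

/* Reference: one channel at a time, 4 weight products + 16 = 20 multiplies. */
static uint32_t bilinear_per_channel(uint32_t tl, uint32_t tr,
                                     uint32_t bl, uint32_t br,
                                     int distx, int disty)
{
    assert(distx >= 0 && distx < 256 && disty >= 0 && disty < 256);
    uint32_t w_br = (uint32_t)(distx * disty);
    uint32_t w_tr = (uint32_t)(distx * (256 - disty));
    uint32_t w_bl = (uint32_t)((256 - distx) * disty);
    uint32_t w_tl = (uint32_t)((256 - distx) * (256 - disty));
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((tl >> shift) & 0xff) * w_tl + ((tr >> shift) & 0xff) * w_tr +
                     ((bl >> shift) & 0xff) * w_bl + ((br >> shift) & 0xff) * w_br;
        result |= ((c >> 16) & 0xff) << shift;
    }
    return result;
}
```

Both functions should return the same pixel for the same inputs. The
packed variant simply trades sixteen 32 bit channel multiplications for
eight 64 bit ones.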
This is a small but consistent speedup on the swfdec-youtube
benchmark:
[ # ]  backend  test            min(s)  median(s)  stddev.  count
Before:
[ 0]   image    swfdec-youtube  18.010  18.020     0.09%    4/5
After:
[ 0]   image    swfdec-youtube  17.488  17.584     0.22%    5/6
Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
On Wed, 2009-10-21 at 13:36 +1000, Peter Hutterer wrote:
> On Tue, Oct 20, 2009 at 08:23:55PM -0700, Jeremy Huddleston wrote:
> > I noticed an INSTALL file in xlsclients and libXvMC today, and it
> > was quite annoying to work around since 'autoreconf -fvi' replaces
> > it and git wants to commit it. Should these files even be in git?
> > Can I nuke them for the betterment of humanity and since they get
> > created by autoreconf anyways?
>
> See https://bugs.freedesktop.org/show_bug.cgi?id=24206
As an interim measure, replace AM_INIT_AUTOMAKE([dist-bzip2]) with
AM_INIT_AUTOMAKE([foreign dist-bzip2]). This will prevent the generation
of the INSTALL file. It is also part of the 24206 solution.
Signed-off-by: Jeremy Huddleston <jeremyhu@freedesktop.org>