1850 Commits

Author SHA1 Message Date
Siarhei Siamashka
af7a69d90e ARM: added flags parameter to some asm fast path wrapper macros
Not all types of operations can be skipped when having transparent
solid source or transparent solid mask. Add an extra flags parameter
for providing this information to the wrappers.
2010-12-03 15:38:00 +02:00
Siarhei Siamashka
f6843e3797 ARM: added 'neon_composite_add_8888_n_8888' fast path 2010-12-03 15:37:54 +02:00
Siarhei Siamashka
b066b520df ARM: added 'neon_composite_add_n_8_8888' fast path 2010-12-03 15:37:49 +02:00
Siarhei Siamashka
1fba779036 ARM: better NEON instructions scheduling for add_8888_8888_8888
Provides a minor performance improvement by using pipelining and hiding
instruction latencies. Also do not clobber d0-d3 registers (source
image pixels) while doing calculations in order to allow the use of
the same macro for add_n_8_8888 fast path later.

Benchmark from ARM Cortex-A8 @500MHz:

== before ==

  add_8888_8888_8888 = L1:  95.94  L2:  42.27  M: 25.60 (121.09%)
                       HT:  14.54  VT:  13.13  R: 12.77  RT:  4.49 (48Kops/s)
     add_8888_8_8888 = L1: 104.51  L2:  57.81  M: 36.06 (106.62%)
                       HT:  19.24  VT:  16.45  R: 14.71  RT:  4.80 (51Kops/s)

== after ==

  add_8888_8888_8888 = L1: 106.66  L2:  47.82  M: 27.32 (129.30%)
                       HT:  15.44  VT:  13.96  R: 12.86  RT:  4.48 (48Kops/s)
     add_8888_8_8888 = L1: 107.72  L2:  61.02  M: 38.26 (113.16%)
                       HT:  19.48  VT:  16.72  R: 14.82  RT:  4.80 (51Kops/s)
2010-12-03 15:37:44 +02:00
Siarhei Siamashka
c3f48b6aa2 ARM: added 'neon_composite_add_8888_8_8888' fast path 2010-12-03 15:37:40 +02:00
Siarhei Siamashka
6d2f7f981b ARM: added 'neon_composite_over_0565_n_0565' fast path 2010-12-03 15:37:23 +02:00
Siarhei Siamashka
3990931bf6 ARM: reuse common NEON code for over_{n_8|8888_n|8888_8}_0565
Renamed supplementary macros from 'over_n_8_0565' to 'over_8888_8_0565',
because they can actually support all variants of this operation:
over_8888_8_0565/over_n_8_0565/over_8888_n_0565.

Also 'over_8888_8_0565' now uses more optimized common code instead of its
own variant, improving performance a bit. Even though this operation is
still memory bandwidth limited, scaled variants of these fast paths may
put more stress on CPU later.

Benchmarked on ARM Cortex-A8 @500MHz:

== before ==

    over_8888_8_0565 =  L1:  67.10  L2:  53.82  M: 44.70 (105.17%)
                        HT:  18.73  VT:  16.91  R: 14.25  RT:  4.80 (52Kops/s)

== after ==

    over_8888_8_0565 =  L1:  77.83  L2:  58.14  M: 44.82 (105.52%)
                        HT:  20.58  VT:  17.44  R: 15.05  RT:  4.88 (52Kops/s)
2010-12-03 15:37:19 +02:00
Siarhei Siamashka
a7c36681c0 ARM: added 'neon_composite_over_8888_n_0565' fast path 2010-12-03 15:37:15 +02:00
Siarhei Siamashka
e6814837a6 ARM: better NEON instructions scheduling for over_n_8_0565
Code rearranged to get better instruction scheduling for ARM Cortex-A8/A9.
Now it is ~30% faster for the pixel data in L1 cache and makes better use
of memory bandwidth when running at lower clock frequencies (e.g. 500MHz).
Also register d24 (pixels from the mask image) is now not clobbered by
supplementary macros, which allows reusing them for the other variants
of compositing operations later.

Benchmark from ARM Cortex-A8 @500MHz:

== before ==

    over_n_8_0565 =  L1:  63.90  L2:  63.15  M: 60.97 ( 73.53%)
                     HT:  28.89  VT:  24.14  R: 21.33  RT:  6.78 (  67Kops/s)

== after ==

    over_n_8_0565 =  L1:  82.64  L2:  75.19  M: 71.52 ( 84.14%)
                     HT:  30.49  VT:  25.56  R: 22.36  RT:  6.89 (  68Kops/s)
2010-12-03 15:37:11 +02:00
Siarhei Siamashka
3be86a92cc ARM: introduced 'fetch_mask_pixblock' macro to simplify code
This macro hides the implementation details of pixel fetching
for the mask image just like 'fetch_src_pixblock' does for the
source image. This provides more possibilities for reusing the
same code blocks in different compositing functions.

This patch does not introduce any functional changes and the
resulting code in the compiled object file is exactly the same.
2010-12-03 15:37:06 +02:00
Siarhei Siamashka
98d08b37f1 ARM: added 'neon_composite_over_n_8_8' fast path 2010-12-03 15:37:01 +02:00
Siarhei Siamashka
4b5b5a2a83 C fast path for a1 fill operation
Can be used as one of the solutions to fix bug
https://bugs.freedesktop.org/show_bug.cgi?id=31604
2010-11-23 00:54:19 +02:00
Alan Coopersmith
654961efe4 Sun's copyrights belong to Oracle now
Signed-off-by: Alan Coopersmith <alan.coopersmith@oracle.com>
2010-11-21 11:42:22 -08:00
Cyril Brulebois
e7ee43c39d Fix argument quoting for AC_INIT.
One gets rid of this accordingly:
| autoreconf -vfi
| autoreconf: Entering directory `.'
| autoreconf: configure.ac: not using Gettext
| autoreconf: running: aclocal --force
| configure.ac:61: warning: AC_INIT: not a literal: "pixman@lists.freedesktop.org"
| autoreconf: configure.ac: tracing
| configure.ac:61: warning: AC_INIT: not a literal: "pixman@lists.freedesktop.org"

Signed-off-by: Cyril Brulebois <kibi@debian.org>
2010-11-19 13:57:47 -05:00
Cyril Brulebois
149ed6b1f0 Upload to experimental. 2010-11-17 15:56:52 +01:00
Cyril Brulebois
865e06cab0 Update debian/copyright from upstream's COPYING. 2010-11-17 15:28:15 +01:00
Cyril Brulebois
868ed1e2a0 Update changelogs. 2010-11-17 15:27:13 +01:00
Cyril Brulebois
bed147b523 Merge branch 'upstream-experimental' into debian-experimental 2010-11-17 15:25:39 +01:00
Søren Sandmann Pedersen
c59db8af66 Post-release version bump to 0.21.3 2010-11-16 17:14:47 -05:00
Søren Sandmann Pedersen
4646c23858 Pre-release version bump 2010-11-16 16:43:26 -05:00
Søren Sandmann Pedersen
536cf4dd3b Generate {a,x}8r8g8b8, a8, 565 fetchers for nearest/affine images
There are versions for all combinations of x8r8g8b8/a8r8g8b8 and
pad/reflect/none/normal repeat modes. The bulk of each function is an
inline function that takes a format and a repeat mode as parameters.
2010-11-16 16:41:42 -05:00
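The pattern described in 536cf4dd3b can be sketched roughly as follows. This is only an illustration, not pixman's actual code: `repeat_mode_t`, `map_coord`, `fetch_nearest` and `MAKE_FETCHER` are hypothetical names, the repeat modes are named after the four in pixman's public API (none, normal, pad, reflect), and only 32-bit formats are shown. Because the format flag and repeat mode are compile-time constants at each instantiation, a compiler can specialize the inline body per fetcher.

```c
#include <stdint.h>

/* Hypothetical repeat modes, mirroring none/normal/pad/reflect. */
typedef enum { REPEAT_NONE, REPEAT_NORMAL, REPEAT_PAD, REPEAT_REFLECT } repeat_mode_t;

/* Map a coordinate into [0, size), or return -1 when it falls outside
 * the image and the repeat mode is "none". */
static inline int
map_coord (int x, int size, repeat_mode_t mode)
{
    switch (mode)
    {
    case REPEAT_NORMAL:
        x %= size;
        return x < 0 ? x + size : x;
    case REPEAT_PAD:
        return x < 0 ? 0 : (x >= size ? size - 1 : x);
    case REPEAT_REFLECT:
        x %= 2 * size;
        if (x < 0)
            x += 2 * size;
        return x >= size ? 2 * size - 1 - x : x;
    default: /* REPEAT_NONE */
        return (x < 0 || x >= size) ? -1 : x;
    }
}

/* The shared inline body: takes the repeat mode and an "has alpha"
 * flag (a8r8g8b8 vs x8r8g8b8) as parameters. */
static inline uint32_t
fetch_nearest (const uint32_t *row, int width, int x,
               repeat_mode_t mode, int has_alpha)
{
    int i = map_coord (x, width, mode);

    if (i < 0)
        return 0; /* transparent outside the image */
    return has_alpha ? row[i] : (row[i] | 0xff000000u);
}

/* Stamp out one specialized fetcher per (format, repeat mode) pair. */
#define MAKE_FETCHER(name, mode, has_alpha)                       \
    static uint32_t                                               \
    name (const uint32_t *row, int width, int x)                  \
    {                                                             \
        return fetch_nearest (row, width, x, mode, has_alpha);    \
    }

MAKE_FETCHER (fetch_pad_a8r8g8b8, REPEAT_PAD, 1)
MAKE_FETCHER (fetch_normal_x8r8g8b8, REPEAT_NORMAL, 0)
MAKE_FETCHER (fetch_none_a8r8g8b8, REPEAT_NONE, 1)
MAKE_FETCHER (fetch_reflect_a8r8g8b8, REPEAT_REFLECT, 1)
```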
Andrea Canciani
da0176e853 Improve conical gradients opacity check
Conical gradients are completely opaque if all of their stops are
opaque and the repeat mode is not 'none'.
2010-11-12 17:13:30 +01:00
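The rule stated in da0176e853 could be expressed along these lines. This is a sketch under assumptions: `stop_t`, `grad_repeat_t` and `conical_is_opaque` are hypothetical names, and alpha is assumed to be a 16-bit channel where 0xffff means fully opaque.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only. */
typedef enum { GRAD_REPEAT_NONE, GRAD_REPEAT_NORMAL,
               GRAD_REPEAT_PAD, GRAD_REPEAT_REFLECT } grad_repeat_t;

typedef struct
{
    double   offset; /* position of the stop in [0, 1] */
    uint16_t alpha;  /* 0xffff is assumed to mean fully opaque */
} stop_t;

/* Opaque only if the repeat mode is not "none" and every stop is opaque. */
static bool
conical_is_opaque (const stop_t *stops, size_t n_stops, grad_repeat_t repeat)
{
    size_t i;

    if (repeat == GRAD_REPEAT_NONE || n_stops == 0)
        return false;

    for (i = 0; i < n_stops; i++)
    {
        if (stops[i].alpha != 0xffff)
            return false;
    }
    return true;
}
```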
Andrea Canciani
151f2554fc Fix opacity check
Radial gradients are "conical", thus they can have some non-opaque
parts even if all of their stops are completely opaque.

To guarantee that a radial gradient is actually opaque, it needs to
also have one of the two circles containing the other one. In this
case when extrapolating, the whole plane is completely covered (as
explained in the comment in pixman-radial-gradient.c).
2010-11-12 17:13:30 +01:00
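The extra condition from 151f2554fc, that one circle must contain the other, can be checked without a square root. The sketch below uses hypothetical names (`circle_t`, `circle_contains`, `radial_may_be_opaque`) and assumes the stop check has already been done elsewhere; it is not the code from pixman-radial-gradient.c.

```c
#include <stdbool.h>

/* Hypothetical circle description: center and radius. */
typedef struct
{
    double x, y, r;
} circle_t;

/* 'outer' contains 'inner' when distance(centers) + inner.r <= outer.r,
 * i.e. distance^2 <= (outer.r - inner.r)^2 with outer.r >= inner.r. */
static bool
circle_contains (circle_t outer, circle_t inner)
{
    double dx = inner.x - outer.x;
    double dy = inner.y - outer.y;
    double dr = outer.r - inner.r;

    return dr >= 0.0 && dx * dx + dy * dy <= dr * dr;
}

/* Opaque stops alone are not enough: one circle must contain the other
 * so that extrapolation covers the whole plane. */
static bool
radial_may_be_opaque (bool all_stops_opaque, circle_t c1, circle_t c2)
{
    return all_stops_opaque &&
           (circle_contains (c1, c2) || circle_contains (c2, c1));
}
```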
Andrea Canciani
19ed415b74 Remove unused stop_range field 2010-11-12 17:13:30 +01:00
Siarhei Siamashka
d8fe87a626 ARM: optimization for scaled src_0565_0565 with nearest filter
The performance improvement is only in the ballpark of 5% when
compared against C code built with a reasonably good compiler
(gcc 4.5.1). But gcc 4.4 produces approximately 30% slower code
here, so assembly optimization makes sense to avoid dependency
on the compiler quality and/or optimization options.

Benchmark from ARM11:
    == before ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=34.86 MPix/s

    == after ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=36.62 MPix/s

Benchmark from ARM Cortex-A8:
    == before ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=89.55 MPix/s

    == after ==
    op=1, src_fmt=10020565, dst_fmt=10020565, speed=94.91 MPix/s
2010-11-10 17:26:49 +02:00
Siarhei Siamashka
b8007d0423 ARM: NEON optimization for scaled src_0565_8888 with nearest filter
Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=1, src_fmt=10020565, dst_fmt=20028888, speed=8.99 MPix/s

    == after ==
    op=1, src_fmt=10020565, dst_fmt=20028888, speed=76.98 MPix/s

    == unscaled ==
    op=1, src_fmt=10020565, dst_fmt=20028888, speed=137.78 MPix/s
2010-11-10 17:26:42 +02:00
Siarhei Siamashka
2e855a2b4a ARM: NEON optimization for scaled src_8888_0565 with nearest filter
Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=1, src_fmt=20028888, dst_fmt=10020565, speed=42.51 MPix/s

    == after ==
    op=1, src_fmt=20028888, dst_fmt=10020565, speed=55.61 MPix/s

    == unscaled ==
    op=1, src_fmt=20028888, dst_fmt=10020565, speed=117.99 MPix/s
2010-11-10 17:26:28 +02:00
Siarhei Siamashka
4a09e472b8 ARM: NEON optimization for scaled over_8888_0565 with nearest filter
Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=3, src_fmt=20028888, dst_fmt=10020565, speed=10.29 MPix/s

    == after ==
    op=3, src_fmt=20028888, dst_fmt=10020565, speed=36.36 MPix/s

    == unscaled ==
    op=3, src_fmt=20028888, dst_fmt=10020565, speed=79.40 MPix/s
2010-11-10 17:26:23 +02:00
Siarhei Siamashka
67a4991f33 ARM: NEON optimization for scaled over_8888_8888 with nearest filter
Benchmark from ARM Cortex-A8 @720MHz:
    == before ==
    op=3, src_fmt=20028888, dst_fmt=20028888, speed=12.73 MPix/s

    == after ==
    op=3, src_fmt=20028888, dst_fmt=20028888, speed=28.75 MPix/s

    == unscaled ==
    op=3, src_fmt=20028888, dst_fmt=20028888, speed=53.03 MPix/s
2010-11-10 17:26:17 +02:00
Siarhei Siamashka
0b56244ac8 ARM: performance tuning of NEON nearest scaled pixel fetcher
Interleaving the use of NEON registers helps to avoid some stalls
in NEON pipeline and provides a small performance improvement.
2010-11-10 17:26:10 +02:00
Siarhei Siamashka
6e76af0d4b ARM: macro template in C code to simplify using scaled fast paths
This template can be used to instantiate scaled fast path functions
by providing main loop code and calling NEON assembly optimized
scanline processing functions from it. Another macro can be used
to simplify adding entries to fast path tables.
2010-11-10 17:25:56 +02:00
Siarhei Siamashka
88014a0e6f ARM: nearest scaling support for NEON scanline compositing functions
Now it is possible to generate scanline processing functions
for the case when the source image is scaled with NEAREST filter.

Only 16bpp and 32bpp pixel formats are supported for now. But the
others can also be added later when needed. All the existing NEON
fast path functions should be quite easy to reuse for implementing
fast paths which can work with scaled source images.
2010-11-10 17:25:39 +02:00
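For reference, a plain C equivalent of a nearest-scaled scanline for a 32bpp source might look like the sketch below. It assumes 16.16 fixed-point source coordinates and that the caller guarantees every sampled index is inside the source row; the function name and signature are hypothetical and are not taken from the NEON code.

```c
#include <stdint.h>

/* Write 'w' destination pixels, sampling the source row at the 16.16
 * fixed-point positions vx, vx + unit_x, vx + 2 * unit_x, ...
 * The caller must guarantee that every (vx >> 16) is a valid index. */
static void
scaled_nearest_scanline_8888 (uint32_t       *dst,
                              const uint32_t *src,
                              int             w,
                              int32_t         vx,
                              int32_t         unit_x)
{
    while (w-- > 0)
    {
        *dst++ = src[vx >> 16]; /* integer part selects the source pixel */
        vx += unit_x;           /* advance by the scale step */
    }
}
```

With `unit_x = 0x8000` (half a source pixel per destination pixel), each source pixel is written twice, which corresponds to 2x upscaling.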
Siarhei Siamashka
324712e48c ARM: NEON: source image pixel fetcher can be overridden now
Added a special macro 'pixld_src' which is now responsible for fetching
pixels from the source image. Right now it just passes all its arguments
directly to 'pixld' macro, but it can be used in the future to provide
a special pixel fetcher for implementing nearest scaling.

The 'pixld_src' macro has a lot of arguments which define its behavior. But
for each particular fast path implementation, we already know the NEON
register allocation and how many pixels are processed in a single block.
That's why a higher level macro 'fetch_src_pixblock' is also introduced
(it's easier to use because it has no arguments) and used everywhere
in 'pixman-arm-neon-asm.S' instead of VLD instructions.

This patch does not introduce any functional changes and the resulting code
in the compiled object file is exactly the same.
2010-11-10 17:25:33 +02:00
Siarhei Siamashka
cb3f183025 ARM: fix 'vld1.8'->'vld1.32' typo in add_8888_8888 NEON fast path
This was mostly harmless and had no effect on little endian systems.
But wrong vector element size is at least inconsistent and also
can theoretically cause problems on big endian ARM systems.
2010-11-10 17:25:26 +02:00
Cyril Brulebois
85950507f1 Upload to experimental. 2010-11-06 10:01:02 +01:00
Cyril Brulebois
23b9668233 Update changelogs. 2010-11-06 09:58:54 +01:00
Cyril Brulebois
7374af53e1 Merge commit 'pixman-0.20.0' into debian-experimental 2010-11-06 09:58:20 +01:00
Siarhei Siamashka
fed4a2fde5 Do CPU features detection from 'constructor' function when compiled with gcc
GCC has supported the 'constructor' attribute since version 2.7. It allows
defining a constructor function for library initialization. This eliminates
an extra branch for each composite operation and also helps to avoid
complaints from race condition detection tools like helgrind.

Other compilers may or may not support this attribute properly.
Ideally, a compiler should fail to compile code with an unknown
attribute, so the configure check should do the right job. But in
practice problems are certainly possible. Fortunately such problems
should be quite easy to find because a NULL pointer dereference should
happen almost immediately if the constructor fails to run.

clang 2.7:
  supports __attribute__((constructor)) properly and pretends to be gcc

tcc 0.9.25:
  ignores __attribute__((constructor)), but does not pretend to be gcc
2010-11-05 16:02:28 +02:00
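A minimal sketch of the idea in fed4a2fde5, assuming a `__GNUC__` test stands in for the configure check mentioned in the message. All names here (`cpu_features`, `detect_cpu_features`, `ENSURE_CPU_FEATURES`) are hypothetical and the detection itself is a placeholder.

```c
/* Hypothetical feature mask; 0 means "not detected yet". */
static unsigned cpu_features;

/* Placeholder for the real detection logic. */
static unsigned
detect_cpu_features (void)
{
    return 0x1u;
}

#if defined(__GNUC__)
/* Runs once when the library or program is loaded, before main(). */
__attribute__((constructor))
static void
init_cpu_features (void)
{
    cpu_features = detect_cpu_features ();
}
/* No per-call check is needed in this configuration. */
#define ENSURE_CPU_FEATURES() ((void) 0)
#else
/* Fallback: lazy detection, which costs a branch on every call. */
#define ENSURE_CPU_FEATURES()                          \
    do {                                               \
        if (!cpu_features)                             \
            cpu_features = detect_cpu_features ();     \
    } while (0)
#endif

static unsigned
get_cpu_features (void)
{
    ENSURE_CPU_FEATURES ();
    return cpu_features;
}
```

The lazy fallback is also where a race detector would flag unsynchronized writes from several threads, which the constructor variant sidesteps by initializing before any thread can call in.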
Søren Sandmann Pedersen
99699771cd Delete the source_image_t struct.
It serves no purpose anymore now that the source_class_t field is gone.
2010-11-04 21:03:38 -04:00
Søren Sandmann Pedersen
f405b40798 [mmx] Mark some of the output variables as early-clobber.
GCC assumes that input variables in inline assembly are fully consumed
before any output variable is written. This means it may allocate the
variables in the same register unless the output variables are marked
as early-clobber.

From Jeremy Huddleston:

    I noticed a problem building pixman with clang and reported it to
    the clang developers.  They responded back with a comment about
    the inline asm in pixman-mmx.c and suggested a fix:

    """
    Incidentally, Jeremy, in the asm that reads
    __asm__ (
    "movq %7, %0\n"
    "movq %7, %1\n"
    "movq %7, %2\n"
    "movq %7, %3\n"
    "movq %7, %4\n"
    "movq %7, %5\n"
    "movq %7, %6\n"
    : "=y" (v1), "=y" (v2), "=y" (v3),
      "=y" (v4), "=y" (v5), "=y" (v6), "=y" (v7)
    : "y" (vfill));

    all the output operands except the last one should be marked as
    earlyclobber ("=&y"). This is working by accident with gcc.
    """

Cc: jeremyhu@apple.com
Reviewed-by: Matt Turner <mattst88@gmail.com>
2010-11-04 21:03:38 -04:00
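The early-clobber constraint from f405b40798 can be illustrated with a smaller example. This sketch makes several assumptions: it uses general-purpose registers ("r") on x86-64 rather than the MMX "y" constraint from pixman-mmx.c, the function name `fill_two` is hypothetical, and other targets fall back to plain C.

```c
#include <stdint.h>

/* Copy 'fill' into two outputs. */
static void
fill_two (uint64_t fill, uint64_t *a, uint64_t *b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t v1, v2;

    /* v1 is written before the input is read for the last time, so it
     * is marked early-clobber ("=&r") to keep the compiler from placing
     * it in the same register as 'fill'. v2 is written by the final
     * instruction, after the last read, so plain "=r" is enough. */
    __asm__ ("movq %2, %0\n\t"
             "movq %2, %1\n\t"
             : "=&r" (v1), "=r" (v2)
             : "r" (fill));
    *a = v1;
    *b = v2;
#else
    /* Portable fallback for other compilers and architectures. */
    *a = fill;
    *b = fill;
#endif
}
```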
Søren Sandmann Pedersen
9c19a85b00 Remove workaround for a bug in the 1.6 X server.
There used to be a bug in the X server where it would rely on
out-of-bounds accesses when it was asked to composite with a
window as the source. It would create a pixman image pointing
to some bogus position in memory, but then set a clip region
to the position where the actual bits were.

Due to a bug in old versions of pixman, where it would not clip
against the image bounds when a clip region was set, this would
actually work. So when the pixman bug was fixed, a workaround was
added to allow certain out-of-bound accesses.

However, the 1.6 X server is so old now that we can remove this
workaround. This does mean that if you update pixman to 0.22 or later,
you will need to use a 1.7 X server or later.
2010-11-04 21:03:38 -04:00
Siarhei Siamashka
56748ea9a6 Fixed broken configure check for __thread support
Somehow the patch from [1] was not applied correctly, fixing that.

1. http://lists.cairographics.org/archives/cairo/2010-September/020826.html
2010-11-02 01:36:37 +02:00
Søren Sandmann Pedersen
ecc3612995 COPYING: Stop saying that a modification is currently under discussion.
Also put the copyright text into a C comment for easier cut and paste.
2010-11-01 18:04:31 -04:00
Søren Sandmann Pedersen
c993cd9614 Version bump 0.21.1.
The previous bump to 0.20.1 was a mistake; it belongs on the 0.20 branch.
2010-10-27 17:21:06 -04:00
Cyril Brulebois
2da37f260e Upload to experimental. 2010-10-27 23:14:13 +02:00
Søren Sandmann Pedersen
d890b684f6 Post-release version bump to 0.20.1 2010-10-27 16:58:29 -04:00
Cyril Brulebois
3a4ab94548 Add myself to Uploaders. 2010-10-27 22:57:19 +02:00
Cyril Brulebois
a74572e2e1 Enable the testsuite. 2010-10-27 22:56:49 +02:00
Søren Sandmann Pedersen
c5e048d46c Pre-release version bump to 0.20.0 2010-10-27 16:51:40 -04:00
Cyril Brulebois
990c9e2447 Pass --disable-gtk to ./configure
As of pixman-0.19.2-5-g5b99710, Gtk+ is auto-detected, make sure not to
pick it accidentally, by passing --disable-gtk. (That's only for test
purposes, but would require pixman-1 itself.)
2010-10-27 22:50:51 +02:00