Commit Graph

2611 Commits

Author SHA1 Message Date
Søren Sandmann Pedersen
4ac0a1d60f Move PowerPC specific CPU detection to its own file pixman-ppc.c 2012-07-07 01:09:23 -04:00
Søren Sandmann Pedersen
8590415f0e Move ARM specific CPU detection to a new file pixman-arm.c
Similar to the x86 commit, this moves the ARM specific CPU detection
to its own file which exports a pixman_arm_get_implementations()
function that is supposed to be a noop on non-ARM.
2012-07-07 01:09:22 -04:00
Søren Sandmann Pedersen
39ac18570a Move x86 specific CPU detection to a new file pixman-x86.c
Extract the x86 specific parts of pixman-cpu.c and put them in their
own file called pixman-x86.c which exports one function
pixman_x86_get_implementations() that creates the MMX and SSE2
implementations. This file is supposed to be compiled on all
architectures, but pixman_x86_get_implementations() should be a noop
on non-x86.
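Here is a minimal sketch, in plain C, of the "compiled everywhere, no-op elsewhere" pattern described above. The `impl_t` chain and `x86_get_implementations()` are hypothetical stand-ins, not pixman's types. For brevity, the x86 branch skips the runtime CPU detection that the real code performs.

```c
#include <stddef.h>

/* Hypothetical implementation chain: each layer falls back to the next. */
typedef struct impl
{
    const char  *name;
    struct impl *fallback;
} impl_t;

#if defined(__i386__) || defined(__x86_64__)

static impl_t sse2_impl = { "sse2", NULL };

/* On x86, layer a SIMD implementation on top of the one passed in.
 * Real code would first check the CPU features before doing this. */
static impl_t *
x86_get_implementations (impl_t *imp)
{
    sse2_impl.fallback = imp;
    return &sse2_impl;
}

#else

/* Same entry point on every other architecture, but it does nothing. */
static impl_t *
x86_get_implementations (impl_t *imp)
{
    return imp;
}

#endif
```

The point of this design is that a caller can invoke each architecture's function unconditionally, with no #ifdef at the call site.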
2012-07-06 23:53:19 -04:00
Søren Sandmann Pedersen
1a3b7614a9 pixman-cpu.c: Rename disabled to _pixman_disabled() and export it 2012-07-06 23:52:14 -04:00
Sebastian Bauer
d4aa82fb91 Qualify the static variables in pixman_f_transform_invert() with the const keyword.
Their contents are not overwritten.
2012-07-06 23:50:21 -04:00
Søren Sandmann Pedersen
f9c91ee2f2 Use a compile-time constant for the "K" constraint in the MMX detection.
When compiling with -O0, gcc doesn't understand that in

     signed char x = 0;

     ...

     asm ("..."
          :
          : "K" (x));

x is constant. Fix this by using an immediate constant instead of a
variable.
2012-07-02 18:21:21 -04:00
Søren Sandmann Pedersen
cd7ecf548a In fast_composite_tiled_repeat() don't clone images with a palette
In fast_composite_tiled_repeat() if the source image is less than a
certain constant width, a clone is created which is then
pre-repeated. However, the source image's palette, if it has one, is
not cloned, so for indexed images, the pre-repeating would crash.

Fix this by not doing any pre-repeating for images with a palette set.
2012-07-02 18:21:21 -04:00
Søren Sandmann Pedersen
7b20ad39f7 test: Make stress-test more likely to actually composite something
stress-test currently almost never composites anything because the clip
rectangles and transformations are such that either
_pixman_compute_composite_region32() or analyze_extent() will return
FALSE.

Fix this by:

- making log_rand() return smaller numbers so that the clip rectangles
  are more likely to be within the destination image

- adding rand_x() and rand_y() functions that pick positions within an
  image and using them for positioning alpha maps and source/mask
  positions.

- making it less likely that clip regions are used in general

These changes make the test take longer, so speed it up a little by
making most images smaller and by reducing the maximum convolution
filter from 17x19 to 3x4.

With these changes, stress-test reveals a crash in iteration 0xd39
where fast_composite_tiled_repeat() creates an indexed image without a
palette.
2012-07-02 18:21:21 -04:00
Matt Turner
4cdf8e9f3a sse2: add missing ABGR entries for bilinear src_8888_8888 2012-07-01 16:35:46 -04:00
Matt Turner
ef99f9e972 loongson: optimize _mm_set_pi* functions with shuffle instructions 2012-07-01 16:34:45 -04:00
Matt Turner
9aa8e3a260 mmx: optimize bilinear function when using 7-bit precision
Loongson:
image             firefox-fishtank 1037.738 1040.218   0.19%    3/3
image             firefox-fishtank 1056.611 1057.581   0.20%    3/3

ARM/iwMMXt:
image             firefox-fishtank 1487.282 1492.640   0.17%    3/3
image             firefox-fishtank 1363.913 1364.366   0.11%    3/3
2012-07-01 16:34:21 -04:00
Matt Turner
1ad6ae6ee8 mmx: add scaled bilinear over_8888_8_8888
Loongson:
image             firefox-fishtank 1665.163 1670.370   0.17%    3/3
image             firefox-fishtank 1037.738 1040.218   0.19%    3/3

ARM/iwMMXt:
image             firefox-fishtank 2042.723 2045.308   0.10%    3/3
image             firefox-fishtank 1487.282 1492.640   0.17%    3/3
2012-07-01 16:34:14 -04:00
Matt Turner
c43de364cb mmx: add scaled bilinear over_8888_8888
Loongson:
image         firefox-planet-gnome  157.012  158.087   0.30%    6/6
image         firefox-planet-gnome  156.617  157.109   0.15%    5/6

ARM/iwMMXt:
image         firefox-planet-gnome  148.086  149.339   0.76%    6/6
image         firefox-planet-gnome  144.939  146.123   0.61%    6/6
2012-07-01 16:33:19 -04:00
Matt Turner
9209cd746b mmx: add scaled bilinear src_8888_8888
Loongson:
image         firefox-planet-gnome  170.025  170.229   0.09%    3/4
image         firefox-planet-gnome  157.012  158.087   0.30%    6/6

ARM/iwMMXt:
image         firefox-planet-gnome  164.192  164.875   0.34%    3/4
image         firefox-planet-gnome  148.086  149.339   0.76%    6/6
2012-07-01 16:33:08 -04:00
Matt Turner
51f27d7364 mmx: Use expand_alpha instead of mask/shift 2012-07-01 16:25:30 -04:00
Siarhei Siamashka
b0855f095a Change default bilinear interpolation precision to 7 bits
This improves performance for the current SSE2 code. Further
reduction to 4 bits may be considered later if it proves
to allow additional speedup.
2012-07-01 23:00:34 +03:00
Siarhei Siamashka
c430b1dba7 sse2: _mm_madd_epi16 for faster bilinear scaling with 7-bit precision
Reducing interpolation precision allows the use of PMADDWD instruction.
This makes bilinear scaling much faster (on Intel Core i7):

8-bit: image             firefox-fishtank   57.584   58.349   0.74%    3/3
7-bit: image             firefox-fishtank   51.139   51.229   0.30%    3/3

8-bit: src_8888_8888 =  L1: 228.71  L2: 226.52  M:224.82 ( 14.95%)  HT:183.22  VT:154.02  R:171.72  RT:109.36
7-bit: src_8888_8888 =  L1: 320.45  L2: 317.43  M:314.38 ( 20.77%)  HT:215.13  VT:177.35  R:204.46  RT:121.93
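A scalar model of PMADDWD helps show why reduced precision enables it. The instruction multiplies pairs of signed 16-bit values and adds adjacent products into a 32-bit lane.

This sketch rests on an assumption that the commit does not state: a bilinear filter first interpolates vertically, and the resulting pixel * weight terms must fit in signed 16-bit lanes before the horizontal PMADDWD step.
- With 7-bit weights, the worst case is 255 * 128 = 32640, which fits.
- With 8-bit weights, the worst case is 255 * 256 = 65280, which does not.

All function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of PMADDWD (_mm_madd_epi16):
 * two signed 16-bit products summed into a signed 32-bit result. */
static int32_t
madd_lane (int16_t a0, int16_t a1, int16_t b0, int16_t b1)
{
    return (int32_t) a0 * b0 + (int32_t) a1 * b1;
}

/* Does the largest pixel * weight term fit in a signed 16-bit lane? */
static int
vertical_term_fits_int16 (int bits)
{
    int32_t max_term = 255 * (1 << bits);
    return max_term <= INT16_MAX;
}

/* Horizontal step: left/right are vertically interpolated values,
 * distx is the horizontal weight in [0, 1 << bits). */
static int32_t
horizontal_step (int16_t left, int16_t right, int distx, int bits)
{
    return madd_lane (left, right,
                      (int16_t) ((1 << bits) - distx), (int16_t) distx);
}
```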
2012-07-01 22:40:23 +03:00
Siarhei Siamashka
ccd31896bc Bilinear interpolation precision is now configurable at compile time
Macro BILINEAR_INTERPOLATION_BITS in pixman-private.h selects
the number of fractional bits used for bilinear interpolation.

scaling-test and affine-test have checksums for 4-bit, 7-bit
and 8-bit configurations.
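As a rough illustration of what BILINEAR_INTERPOLATION_BITS controls, here is a scalar sketch for one 8-bit channel. The helper names are hypothetical, not pixman's. The sketch assumes pixman's 16.16 fixed-point coordinates, whose 16 fractional bits are truncated to the configured precision before being used as weights.

```c
#include <stdint.h>

/* Number of fractional bits kept for the weights (7 is the new default). */
#ifndef BILINEAR_INTERPOLATION_BITS
#define BILINEAR_INTERPOLATION_BITS 7
#endif
#define BILINEAR_RANGE (1 << BILINEAR_INTERPOLATION_BITS)

/* Reduce the fractional part of a 16.16 fixed-point coordinate
 * to a weight in [0, BILINEAR_RANGE). */
static int
fixed_to_weight (int32_t x)
{
    return (int) (((uint32_t) x >> (16 - BILINEAR_INTERPOLATION_BITS))
                  & (BILINEAR_RANGE - 1));
}

/* Interpolate one 8-bit channel from the four neighbouring texels.
 * The four weights sum to BILINEAR_RANGE^2, so a single shift by
 * 2 * BILINEAR_INTERPOLATION_BITS normalizes the result. */
static uint8_t
bilinear_channel (uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br,
                  int distx, int disty)
{
    uint32_t w_tl = (uint32_t) ((BILINEAR_RANGE - distx) * (BILINEAR_RANGE - disty));
    uint32_t w_tr = (uint32_t) (distx * (BILINEAR_RANGE - disty));
    uint32_t w_bl = (uint32_t) ((BILINEAR_RANGE - distx) * disty);
    uint32_t w_br = (uint32_t) (distx * disty);

    uint32_t sum = tl * w_tl + tr * w_tr + bl * w_bl + br * w_br;

    return (uint8_t) (sum >> (2 * BILINEAR_INTERPOLATION_BITS));
}
```

Changing the macro to 4 or 8 changes only the weight resolution. The weights always sum to (1 << BITS)^2, so the final shift keeps results in 0..255.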
2012-07-01 21:45:43 +03:00
Matt Turner
ad9f1d0201 Fix distcheck due to custom iwMMXt rules 2012-06-29 14:24:30 -04:00
Siarhei Siamashka
ff5d041b88 sse2: faster bilinear scaling (use _mm_loadl_epi64)
Using _mm_loadl_epi64() to load two pixels at once (pairs of top
and bottom pixels) is faster than loading each pixel separately
and combining them with _mm_set_epi32().

=== cairo-perf-trace ===

before: image             firefox-fishtank   66.912   66.931   0.13%    3/3
after:  image             firefox-fishtank   57.584   58.349   0.74%    3/3

=== lowlevel-blt-bench ===

before: src_8888_8888 =  L1: 181.10  L2: 179.14  M:178.08 ( 11.02%)  HT:153.22  VT:133.45  R:142.24  RT: 95.32
after:  src_8888_8888 =  L1: 228.68  L2: 225.75  M:223.98 ( 14.23%)  HT:185.32  VT:155.06  R:162.73  RT:102.52

This improvement was suggested by Matt Turner on IRC.
2012-06-29 03:29:32 +03:00
Siarhei Siamashka
fc162bad56 test: support nearest/bilinear scaling in lowlevel-blt-bench
The scale factor is chosen to be nearly 1x, so that the MPix/s results
can be directly compared with the results of non-scaled compositing
operations.
2012-06-29 03:24:29 +03:00
Siarhei Siamashka
387e9bcddb test: Fix for strict aliasing issue in 'get_random_seed'
Gets rid of a gcc warning when compiling with the -fstrict-aliasing option in CFLAGS.
2012-06-29 03:23:09 +03:00
Andrea Canciani
4cbeb0aedc build: Fix compilation on win32
When compiling using the win32 build system, config.h is neither
available nor needed.

Fixes:

pixman-glyph.c(26) : fatal error C1083: Cannot open include file:
'config.h': No such file or directory
2012-06-20 17:13:33 +02:00
Matt Turner
21077e1b83 sse2: add src_x888_0565
Port of 2ddd1c498b to SSE2.

Uses the pmadd technique described in
http://software.intel.com/sites/landingpage/legacy/mmx/MMX_App_24-16_Bit_Conversion.pdf

Works around the lack of a packusdw instruction (which only arrived
with SSE4.1) by first sign extending the values.

fast:	src_8888_0565 =  L1: 681.40  L2: 689.20  M: 644.76 ( 25.51%)  HT:404.42  VT:288.04  R:306.07  RT:150.80 (1619Kops/s)
mmx:	src_8888_0565 =  L1:2056.03  L2:1985.44  M:1574.91 ( 61.87%)  HT:533.10  VT:376.35  R:416.10  RT:178.79 (1833Kops/s)
sse2:	src_8888_0565 =  L1:3793.42  L2:3653.44  M:1878.83 ( 73.94%)  HT:535.03  VT:407.96  R:421.46  RT:163.31 (1727Kops/s)

and for reference, using packusdw
sse4:	src_8888_0565 =  L1:4396.18  L2:4229.25  M:1904.04 ( 75.18%)  HT:559.79  VT:427.96  R:440.06  RT:165.71 (1744Kops/s)

Notice that MMX is faster in the RT case because it operates on
8 bytes at a time, whereas the current SSE2 code operates on 16 bytes.
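For reference, the MMX and SSE2 paths vectorize a simple per-pixel operation: each channel is truncated to 5, 6 and 5 bits. The sketch below is a scalar version with hypothetical function names. It does not use the pmadd technique; it only shows the bit movements that technique has to reproduce.

```c
#include <stdint.h>

/* Convert one x8r8g8b8 pixel to r5g6b5 by keeping the top 5/6/5 bits
 * of each channel (plain truncation, no dithering). */
static uint16_t
convert_x888_to_0565 (uint32_t s)
{
    return (uint16_t) (((s >> 3) & 0x001f) |   /* blue:  bits 3..7   -> 0..4   */
                       ((s >> 5) & 0x07e0) |   /* green: bits 10..15 -> 5..10  */
                       ((s >> 8) & 0xf800));   /* red:   bits 19..23 -> 11..15 */
}

/* Convert a scanline one pixel at a time. */
static void
convert_scanline_x888_to_0565 (uint16_t *dst, const uint32_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = convert_x888_to_0565 (src[i]);
}
```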
2012-06-16 16:00:00 -04:00
Cyril Brulebois
3acc1ffc32 Upload to unstable. 2012-06-15 01:25:23 +02:00
Cyril Brulebois
1952e2a77b Document the cherry-pick, fixing FTBFS on *i386. 2012-06-15 01:20:14 +02:00
Matt Turner
1701defb49 mmx: add missing _mm_empty calls
Fixes spurious test failures on x86-32.
(cherry picked from commit da6193b1fc)
2012-06-15 01:19:04 +02:00
Cyril Brulebois
8940c5222e Upload to unstable. 2012-06-15 00:16:59 +02:00
Cyril Brulebois
0181d422ab Bump changelogs. 2012-06-15 00:15:43 +02:00
Cyril Brulebois
f53c40a739 Merge branch 'upstream-unstable' into debian-unstable 2012-06-15 00:15:23 +02:00
Matt Turner
7db07cb731 sse2: enable over_n_0565 for b5g6r5
Same as b950bb12 for MMX.
2012-06-13 19:32:21 -04:00
Matt Turner
45946c5fa1 .gitignore: add test/glyph-test 2012-06-13 19:32:21 -04:00
Søren Sandmann Pedersen
eadb442b5c test: Add missing break in stress-test.c
Found by Coverity:

https://bugzilla.redhat.com/show_bug.cgi?id=756069
2012-06-13 07:30:06 -04:00
Siarhei Siamashka
492dac7593 test: fix bisecting issue in fuzzer-find-diff.pl
Before bisecting to find the exact test which has failed, we
first need to make sure that the first test passes (the first
test is "good" and the whole range is "bad"). Otherwise, if
test 1 already fails right from the start, test 2 gets
incorrectly flagged as the problematic one.
2012-06-12 04:21:57 +03:00
Siarhei Siamashka
40a0d10eea test: OpenMP 2.5 requires signed loop iteration variables
Unsigned loop variables are only supported since version 3.0
of the OpenMP specification. Changing the loop variables to the
int32_t type fixes pixman build problems with the path64 compiler.
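Here is a minimal sketch of the kind of loop this affects. It assumes nothing about the actual test code. The `parallel for` loop variable is declared as a signed int32_t so that compilers implementing only OpenMP 2.5 accept it. `parallel_sum()` is a hypothetical example function.

```c
#include <stdint.h>

/* Sum a buffer in parallel. The loop variable is a signed int32_t:
 * OpenMP 2.5 only allows signed integer loop variables here, and
 * unsigned ones were added in OpenMP 3.0. Without -fopenmp the
 * pragma is ignored and the loop simply runs serially. */
static int64_t
parallel_sum (const int32_t *values, int32_t n)
{
    int64_t total = 0;
    int32_t i;

#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; i++)
        total += values[i];

    return total;
}
```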
2012-06-12 04:21:07 +03:00
Søren Sandmann Pedersen
619a60d201 test: Make glyph test pass on big endian
The destination buffer was initialized with random uint32_t values, so
it started out different on big endian vs. little endian. Fix that by
initializing the buffer with random uint8_t values instead.
2012-06-11 19:19:23 -04:00
Søren Sandmann Pedersen
f80e7ad3cb bits-image: Turn all the fetchers into iterator getters
Instead of caching these fetchers in the image structure and then
having the iterator getter call them from there, simply turn them
into iterator getters themselves.

This avoids an extra indirect function call and lets us get rid of the
get_scanline_32/64 fields in pixman_image_t.
2012-06-11 07:15:00 -04:00
Antti S. Lankila
fd175f9d02 Faster unorm_to_unorm for wide processing.
Optimizing the unorm_to_unorm functions allows a speedup from:

src_8888_2x10 =  L1:  62.08  L2:  60.73  M: 59.61 (  4.30%)  HT: 46.81
	VT: 42.17  R: 43.18  RT: 26.01 (325Kops/s)

to:

src_8888_2x10 =  L1:  76.94  L2:  78.43  M: 75.87 (  5.59%)  HT: 56.73
	VT: 52.39  R: 53.00  RT: 29.29 (363Kops/s)

on an i7 Q720-based laptop.

The key to the patch is the observation that unorm_to_unorm's work can
more easily be done with a simple multiplication and shift, when the
function is applied repeatedly and the parameters are not compile-time
constants. For instance, converting from 0xfe to 0xfefe (expanding
from 8 bits to 16 bits) can be done by calculating

c = c * 0x101

However, sometimes the result is not a neat replication of all the
bits. For instance, going from 10 bits to 16 bits can be done by
calculating

c = c * 0x401UL >> 4

where the intermediate result is a 20-bit-wide repetition of the 10-bit
pattern followed by shifting off the unnecessary lowest bits.

The patch adds the algorithm that calculates the factor and the
shift, and converts the code to use it.
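The factor-and-shift computation described above can be sketched as follows. This is a reconstruction from the description, not the patch's code. The `unorm_widen_t` struct and both function names are hypothetical.

The sketch stacks copies of the from_bits pattern until at least to_bits bits are covered, then shifts off the surplus low bits. This reproduces both examples in the message: factor 0x101 with shift 0, and factor 0x401 with shift 4.

```c
#include <stdint.h>

/* Precomputed multiplier and shift for widening from_bits -> to_bits. */
typedef struct
{
    uint32_t factor;
    int      shift;
} unorm_widen_t;

/* Requires 0 < from_bits < to_bits <= 16, so the product fits in 32 bits. */
static unorm_widen_t
unorm_widen_params (int from_bits, int to_bits)
{
    unorm_widen_t p = { 0, 0 };
    int total = 0;

    /* Place copies of the pattern side by side until to_bits are covered. */
    while (total < to_bits)
    {
        p.factor |= 1u << total;
        total += from_bits;
    }
    /* Drop the surplus low bits of the last, partial copy. */
    p.shift = total - to_bits;
    return p;
}

/* c must already be reduced to from_bits bits. */
static uint32_t
unorm_widen (uint32_t c, unorm_widen_t p)
{
    return c * p.factor >> p.shift;
}
```

The factor and shift depend only on the two bit widths. They can therefore be computed once per format and reused for every pixel, which matches the "applied repeatedly" case the message describes.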
2012-06-10 14:23:17 -04:00
Matt Turner
367b78fd5c configure.ac: add iwmmxt2 configure flag
The flag allows the user to select whether pixman-mmx.c is compiled with
-march=iwmmxt or -march=iwmmxt2.

gcc has scheduling support for the Marvell CPU in the XO 1.75 when
building with -march=iwmmxt2.
2012-06-09 16:57:16 -04:00
Matt Turner
31a6563ec5 autotools: use custom build rule to build iwMMXt code
gcc has no sane way of enabling iwmmxt code generation comparable to
-msse for SSE, so you have to use -march=iwmmxt{,2}. User CFLAGS are placed after
-march=iwmmxt and override the march value, so we have to use a custom
build rule to order the CFLAGS such that pixman-mmx.c will be built with
the necessary CFLAGS.
2012-06-09 16:57:16 -04:00
Søren Sandmann Pedersen
706bf8264c Speed up _pixman_image_get_solid() in common cases
Make _pixman_image_get_solid() faster by special-casing the common
cases where the image is SOLID or a repeating a8r8g8b8 image.

This optimization together with the previous one results in a small
but reproducible performance improvement on the xfce4-terminal-a1
cairo trace:

[ # ]  backend                         test   min(s) median(s) stddev. count
Before:
[  0]    image            xfce4-terminal-a1    1.221    1.239   1.21%  100/100
After:
[  0]    image            xfce4-terminal-a1    1.170    1.199   1.26%  100/100

Either optimization by itself is difficult to separate from noise.
2012-06-02 08:19:38 -04:00
Søren Sandmann Pedersen
934c9d8546 Speed up _pixman_composite_glyphs_no_mask()
Bypass much of the overhead of pixman_image_composite32() by only
computing the composite region once instead of once per glyph, and by
only looking up the composite function whenever the glyph format or
flags change.

As part of this, pixman_compute_composite_region32() was renamed
to _pixman_compute_composite_region32() and exported in
pixman-private.h.

I couldn't find a trace that would reliably demonstrate that this is
actually an improvement by itself (since _pixman_composite_glyphs_no_mask()
is called so rarely), but together with the following optimization for
solid sources, there is a small but reliable improvement to the
xfce4-terminal-a1 cairo trace.
2012-06-02 08:19:38 -04:00
Søren Sandmann Pedersen
a162189dc0 Speed up pixman_composite_glyphs()
When adding glyphs to the mask, bypass most of the overhead of
pixman_image_composite32() by:

- Only looking up the composite function when the glyph changes either
  format or flags.

- Only using a white source when the glyph format is different from
  the mask format.

- Simply intersecting the glyph rectangle with the destination
  rectangle instead of doing the full _pixman_compute_composite_region32().
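The last bullet replaces the full region computation with a plain rectangle intersection. Here is a minimal sketch, assuming half-open rectangles. `rect_t` and `rect_intersect()` are hypothetical names; pixman has its own box types.

```c
#include <stdint.h>

/* Hypothetical half-open rectangle: [x1, x2) x [y1, y2). */
typedef struct
{
    int32_t x1, y1, x2, y2;
} rect_t;

/* Intersect two rectangles. Returns 0 if the intersection is empty,
 * in which case the contents of *out are not meaningful. */
static int
rect_intersect (rect_t *out, const rect_t *a, const rect_t *b)
{
    out->x1 = a->x1 > b->x1 ? a->x1 : b->x1;
    out->y1 = a->y1 > b->y1 ? a->y1 : b->y1;
    out->x2 = a->x2 < b->x2 ? a->x2 : b->x2;
    out->y2 = a->y2 < b->y2 ? a->y2 : b->y2;

    return out->x1 < out->x2 && out->y1 < out->y2;
}
```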

Performance results:

[ # ]  backend                         test   min(s) median(s) stddev. count
Before:
[  0]    image            firefox-talos-gfx    6.570    6.577   0.13%    8/10
After:
[  0]    image            firefox-talos-gfx    4.272    4.289   0.28%   10/10

V2: Changes to deal with white sources
2012-06-02 08:19:30 -04:00
Søren Sandmann Pedersen
d9710442b4 test: Add glyph-test
This test exercises the new glyph cache and compositing API. Much of this
test is intended to make sure that clipping and alpha map handling
survive any optimizations that may be added to the glyph compositing.

V2: Evaluating lcg_rand_n() multiple times in an argument list led
    to results depending on the unspecified argument evaluation order.
2012-06-02 07:55:11 -04:00
Søren Sandmann Pedersen
dc92374727 Add support for alpha maps to compute_crc32_for_image().
When a destination image I has an alpha map A, the following rules apply:

   - If I has an alpha channel itself, the content of that channel is
     undefined

   - If A has RGB channels, the content of those channels is
     undefined.

Hence in order to compute the CRC32 for such an image, we have to mask
off the alpha channel of the image, and the RGB channels of the alpha
map.

V2: Shifting by 32 is undefined in C
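The masking rules above, together with the V2 note, can be sketched like this. The helper names are hypothetical, and so is describing the pixel layout as (alpha_bits, alpha_shift). The point is that the mask for a full-width channel is built without ever evaluating `1u << 32`.

```c
#include <stdint.h>

/* Mask with the low `bits` bits set. bits == 32 is special-cased
 * because 1u << 32 is undefined behavior in C. */
static uint32_t
low_bits_mask (int bits)
{
    return bits >= 32 ? 0xffffffffu : (1u << bits) - 1;
}

/* Alpha mask for a 32-bpp pixel whose alpha channel has `alpha_bits`
 * bits starting at bit `alpha_shift` (e.g. 8 and 24 for a8r8g8b8). */
static uint32_t
alpha_channel_mask (int alpha_bits, int alpha_shift)
{
    return alpha_bits == 0 ? 0 : low_bits_mask (alpha_bits) << alpha_shift;
}

/* Before checksumming, drop the undefined alpha of the image itself... */
static uint32_t
mask_image_pixel (uint32_t pixel, uint32_t alpha_mask)
{
    return pixel & ~alpha_mask;
}

/* ...and the undefined RGB channels of its alpha map. */
static uint32_t
mask_alpha_map_pixel (uint32_t pixel, uint32_t alpha_mask)
{
    return pixel & alpha_mask;
}
```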
2012-06-02 07:55:11 -04:00
Søren Sandmann Pedersen
43e029d525 Move CRC32 computation from blitters-test.c into utils.c
This way it can be used in other tests.
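Here is a minimal bitwise CRC-32 for readers unfamiliar with the checksum these tests rely on. The commit does not say which CRC-32 variant or implementation utils.c uses. This sketch assumes the common IEEE 802.3 variant (reflected polynomial 0xEDB88320), and `crc32_update()` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise, table-free CRC-32 (IEEE, reflected polynomial 0xEDB88320).
 * Pass 0 as crc for the first buffer, and the previous result to
 * continue the checksum over further buffers. */
static uint32_t
crc32_update (uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    crc = ~crc;
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```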
2012-06-02 07:55:11 -04:00
Søren Sandmann Pedersen
fce31a5ef8 Add pixman_glyph_cache_t API
This new API allows entire glyph strings to be composited in one go
which reduces overhead compared to multiple calls to
pixman_image_composite32().

The pixman_glyph_cache_t is a hash table that maps two keys (a "font"
and a "glyph" key, but they are just keys; there is no distinction
between them as far as pixman is concerned) to a glyph. Glyphs in the
cache can be composited through two new entry points
pixman_composite_glyphs() and
pixman_composite_glyphs_no_mask().

A glyph cache may only be inserted into when it is "frozen", which is
achieved by calling pixman_glyph_cache_freeze(). When
pixman_glyph_cache_thaw() is later called, if the cache has become too
crowded, some glyphs (currently the least-recently-used) will
automatically be evicted. This means that a user must ensure that all
the required glyphs are present in the cache before compositing a
string. The intended way to use the cache is like this:

        pixman_glyph_t glyphs[MAX_GLYPHS];

        pixman_glyph_cache_freeze (cache);

        for (i = 0; i < n_glyphs; ++i)
        {
            const void *g;

            /* Rasterize and insert only the glyphs that are not cached yet */
            if (!(g = pixman_glyph_cache_lookup (cache, font_key, glyph_key)))
            {
                img = <rasterize glyph as a pixman_image_t>;

                g = pixman_glyph_cache_insert (cache, font_key, glyph_key,
                                               glyph_origin_x, glyph_origin_y,
                                               img);

                if (!g)
                {
                    /* Clean up out-of-memory condition */
                    goto oom;
                }
            }

            /* Record every glyph, whether it was found or just inserted */
            glyphs[i].pos_x = glyph_x_pos;
            glyphs[i].pos_y = glyph_y_pos;
            glyphs[i].glyph = g;
        }

        pixman_composite_glyphs (op, src, dest, ..., cache, n_glyphs, glyphs);

        pixman_glyph_cache_thaw (cache);

V2:
- Move glyphs to front of the MRU list when they are used. Pointed
  out by Behdad Esfahbod.
- Composite glyphs with (white IN glyph) ADD mask in order to support
  mixed a8 and a8r8g8b8 glyphs. Also pointed out by Behdad.
- Add pixman_glyph_get_mask_format
2012-06-02 07:55:11 -04:00
Søren Sandmann Pedersen
a3ae88b71b Add doubly linked lists
This commit adds some new inline functions to maintain a doubly linked
list.

The way to use them is to embed a pixman_link_t into the structures
that should be linked, and use a pixman_list_t as the head of the
list.

The new functions are

    pixman_list_init (pixman_list_t *list);
    pixman_list_prepend (pixman_list_t *list, pixman_link_t *link);
    pixman_list_move_to_front (pixman_list_t *list, pixman_link_t *link);

There is also a new macro:

    CONTAINER_OF(type, member, data);

that can be used to get from a pointer to a member to the containing
structure.

V2: Use the C89 macro offsetof() instead of rolling our own -
suggested by Alan Coopersmith.
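Here is a minimal sketch of how these functions and CONTAINER_OF() can fit together, using a circular list with a sentinel node. This is not pixman's actual implementation. The sentinel representation and the `list_unlink()` helper are assumptions made for the example.

```c
#include <stddef.h>

typedef struct pixman_link_t pixman_link_t;
struct pixman_link_t
{
    pixman_link_t *next;
    pixman_link_t *prev;
};

/* Circular list with a sentinel node: an empty list points at itself. */
typedef struct
{
    pixman_link_t sentinel;
} pixman_list_t;

/* Get from a pointer to `member` back to the enclosing `type`. */
#define CONTAINER_OF(type, member, data) \
    ((type *) (((char *) (data)) - offsetof (type, member)))

static void
pixman_list_init (pixman_list_t *list)
{
    list->sentinel.next = &list->sentinel;
    list->sentinel.prev = &list->sentinel;
}

/* Hypothetical helper: detach a link from whatever list it is in. */
static void
list_unlink (pixman_link_t *link)
{
    link->prev->next = link->next;
    link->next->prev = link->prev;
}

static void
pixman_list_prepend (pixman_list_t *list, pixman_link_t *link)
{
    link->next = list->sentinel.next;
    link->prev = &list->sentinel;
    list->sentinel.next->prev = link;
    list->sentinel.next = link;
}

static void
pixman_list_move_to_front (pixman_list_t *list, pixman_link_t *link)
{
    list_unlink (link);
    pixman_list_prepend (list, link);
}
```

With this layout, a cache that moves an entry to the front on every use always finds its least-recently-used entry at `list.sentinel.prev`.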
2012-06-02 07:54:48 -04:00
Søren Sandmann Pedersen
c2230fe2af Make use of image flags in mmx and sse2 iterators
Now that we have the full image flags available, the SSE2 and MMX
iterators can simply check against SAMPLES_COVER_CLIP_NEAREST (which
is computed in pixman_image_composite32()) instead of comparing all
the x/y/width/height parameters.
2012-05-30 04:42:29 -04:00
Søren Sandmann Pedersen
c1065a9cb4 Pass the full image flags to iterators
When pixman_image_composite32() is called some flags are computed that
indicate various things about the composite operation that can't be
deduced from the image flags themselves. These additional flags are
not currently available to iterators. All they can do is read the
image flags in image->common.flags.

Fix that by passing the info->{src, mask, dest}_flags on to the
iterator initialization and store the flags in the iter struct as
"image_flags". At the same time rename the *iterator* flags variable
to "iter_flags" to avoid confusion.
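The split between the two kinds of flags might look like the sketch below. The struct, the helper and the bit value of SAMPLES_COVER_CLIP_NEAREST are hypothetical. Only the field names image_flags and iter_flags come from the commit message.

```c
#include <stdint.h>

/* Hypothetical bit value: pixman defines its own set of image flags. */
#define SAMPLES_COVER_CLIP_NEAREST (1u << 0)

/* Hypothetical iterator state with the two kinds of flags kept apart. */
typedef struct
{
    uint32_t image_flags; /* flags computed for this image by the composite call */
    uint32_t iter_flags;  /* flags describing how the iterator itself is used */
} iter_sketch_t;

/* One bit test replaces comparing x/y/width/height against the image. */
static int
iter_samples_covered (const iter_sketch_t *iter)
{
    return (iter->image_flags & SAMPLES_COVER_CLIP_NEAREST) != 0;
}
```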
2012-05-30 04:34:29 -04:00