1850 Commits

Author SHA1 Message Date
Søren Sandmann Pedersen
e75e6a4ef5 ARM: Add 'neon_composite_over_n_8888_0565_ca' fast path
This improves the performance of the firefox-talos-gfx benchmark with
the image16 backend. Benchmark on an 800 MHz ARM Cortex A8:

Before:

[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]  image16            firefox-talos-gfx  121.773  122.218   0.15%    6/6

After:

[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]  image16            firefox-talos-gfx   85.247   85.563   0.22%    6/6

V2: Slightly better instruction scheduling based on comments from Taekyun Kim.
V3: Eliminate all stalls from the inner loop. Also based on comments from Taekyun Kim.
2011-04-18 16:25:36 -04:00
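As a quick check on the firefox-talos-gfx figures in the entry above, the before/after times can be turned into a speedup factor and a percentage of time saved. This is a minimal sketch; the function names are illustrative and not part of pixman.

```c
/* Ratio of old to new running time; values above 1.0 mean "faster". */
double speedup_factor(double before_s, double after_s)
{
    return before_s / after_s;
}

/* Percentage of the original running time that was eliminated. */
double percent_time_saved(double before_s, double after_s)
{
    return 100.0 * (before_s - after_s) / before_s;
}
```

With the median times quoted above (122.218 s before, 85.563 s after), this works out to a speedup of roughly 1.43x, or about 30% less time.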
Gilles Espinasse
1670b95214 Fix OpenMP not supported case
PIXMAN_LINK_WITH_ENV did not fail unless -Wall -Werror is used.
So even when the compiler did not support OpenMP, USE_OPENMP was defined.
Fix that by running the second OpenMP test only when the first AC_OPENMP test finds OpenMP supported.

configure was tested in these cases:
gcc without libgomp support, no openmp option, --enable-openmp and --disable-openmp
gcc with libgomp support, no openmp option, --enable-openmp and --disable-openmp

Not tested with autoconf versions that do not know OpenMP (<2.62)

Warn when --enable-openmp is requested but no support is found

Signed-off-by: Gilles Espinasse <g.esp@free.fr>
2011-04-18 16:13:58 -04:00
Gilles Espinasse
b9e8f7fb74 Fix missing AC_MSG_RESULT value from Werror test
Use the correct variable name

Signed-off-by: Gilles Espinasse <g.esp@free.fr>
2011-04-18 16:13:58 -04:00
Siarhei Siamashka
caae4e82ff ARM: pipelined NEON implementation of bilinear scaled 'src_8888_0565'
Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=10020565, speed=33.59 MPix/s
  after:  op=1, src=20028888, dst=10020565, speed=46.25 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=10020565, speed=63.86 MPix/s
  after:  op=1, src=20028888, dst=10020565, speed=84.22 MPix/s
2011-04-11 10:48:35 +03:00
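The src= and dst= values in these microbenchmark lines look like pixman format codes printed in hexadecimal (for example, 20028888 for a8r8g8b8 and 10020565 for r5g6b5). The sketch below decodes such a code, assuming it follows the bit layout of pixman's format macros from that period: bits per pixel in the top byte, then a type byte, then four 4-bit channel widths. The helper names are hypothetical and are not pixman API.

```c
#include <stdint.h>

/* Hypothetical helpers for the assumed layout 0xBBTTARGB:
 * BB = bits per pixel, TT = format type,
 * A/R/G/B = bit widths of the alpha, red, green and blue channels. */
unsigned fmt_bpp(uint32_t f)   { return f >> 24; }
unsigned fmt_type(uint32_t f)  { return (f >> 16) & 0xff; }
unsigned fmt_alpha(uint32_t f) { return (f >> 12) & 0x0f; }
unsigned fmt_red(uint32_t f)   { return (f >> 8) & 0x0f; }
unsigned fmt_green(uint32_t f) { return (f >> 4) & 0x0f; }
unsigned fmt_blue(uint32_t f)  { return f & 0x0f; }
```

Under that assumption, 0x20028888 reads as a 32 bpp format with 8 bits per channel, and 0x10020565 as a 16 bpp format with no alpha and 5/6/5 bits for red, green, and blue.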
Siarhei Siamashka
d080d59b80 ARM: pipelined NEON implementation of bilinear scaled 'src_8888_8888'
Performance of the inner loop when working with the data in L1 cache:
    ARM Cortex-A8: 41 cycles per 4 pixels (no stalls and partial dual issue)
    ARM Cortex-A9: 48 cycles per 4 pixels (no stalls)

It might still be possible to improve performance further on ARM Cortex-A8
with better use of dual issue.

Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=40.38 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=48.47 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=79.68 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=93.11 MPix/s
2011-04-11 10:48:30 +03:00
Siarhei Siamashka
b496a8b279 ARM: support different levels of loop unrolling in bilinear scaler
An extra 'flag' parameter is now supported in the bilinear scanline scaling
function generation macro. It can be used to enable unrolling to 4 or 8 pixels per
loop iteration and to provide save/restore code for the d8-d15
registers.
2011-04-11 10:48:24 +03:00
Siarhei Siamashka
34ca9cf03f ARM: use less ARM instructions in NEON bilinear scaling code
This reduces code size and also puts less pressure on the
instruction decoder.
2011-04-11 10:48:14 +03:00
Siarhei Siamashka
0f7be9f72e ARM: support for software pipelining in bilinear macros
Now it's possible to override the main loop of bilinear scaling code
with optimized pipelined implementation.
2011-04-11 10:48:10 +03:00
Siarhei Siamashka
9638af9583 ARM: use aligned memory writes in NEON bilinear scaling code 2011-04-11 10:48:05 +03:00
Siarhei Siamashka
8bba3a0e1e ARM: tweaked horizontal weights update in NEON bilinear scaling code
Moving the horizontal interpolation weight update instructions from the
beginning of the loop to its end hides some pipeline stalls and
improves performance.
2011-04-11 10:48:01 +03:00
Cyril Brulebois
eade7b4dbd Upload to unstable. 2011-04-10 23:08:45 +02:00
Søren Sandmann Pedersen
a215322267 ARM: Tiny improvement in over_n_8888_8888_ca_process_pixblock_head
Instead of two

	mvn d24, d24
	mvn d25, d25

use just one

	mvn q12, q12

Also move another vmvn instruction into the created pipeline bubble,
as pointed out by Siarhei.
2011-04-06 23:03:19 -04:00
Søren Sandmann Pedersen
44f99735d9 Makefile.am: Put development releases in "snapshots" directory
Up until now, all pixman releases, both snapshots and stable releases, were
uploaded to the "releases" directory on www.cairographics.org, but
it's better to put development snapshots in the "snapshots" directory.

This patch changes Makefile.am to do that.
2011-04-06 23:03:10 -04:00
Steve Langasek
c6ce22e73a build for multiarch 2011-03-26 00:30:06 -07:00
Søren Sandmann Pedersen
ad3cbfb073 test: Fix infinite loop in composite
When run in PIXMAN_RANDOMIZE_TESTS mode, this test would go into an
infinite loop because the loop started at 'seed' but the stop
condition was still N_TESTS.
2011-03-22 13:43:29 -04:00
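The loop-bound problem described in the entry above can be sketched as follows. The actual loop in composite.c is not shown in this log, so both functions are assumed shapes for illustration only; N_TESTS is given a small hypothetical value, and the `cap` parameter exists only so the stale-bound version can be demonstrated without hanging.

```c
#include <stdint.h>

#define N_TESTS 8u /* hypothetical small value for illustration */

/* Assumed shape of the problem: the loop starts at 'seed' but still stops
 * at the fixed value N_TESTS. 'cap' bounds the demonstration. */
unsigned count_tests_stale_bound(uint32_t seed, unsigned cap)
{
    unsigned count = 0;
    uint32_t i;

    for (i = seed; i != N_TESTS && count < cap; i++)
        count++;
    return count;
}

/* One way to fix it: move the stop condition together with the start,
 * so exactly N_TESTS iterations run for any seed (unsigned wraparound
 * keeps this correct even near UINT32_MAX). */
unsigned count_tests_fixed(uint32_t seed)
{
    unsigned count = 0;
    uint32_t i;

    for (i = seed; i != seed + N_TESTS; i++)
        count++;
    return count;
}
```

In this sketch, a seed beyond N_TESTS makes the stale-bound loop run until it hits the cap, while the fixed loop always runs N_TESTS times.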
Alexandros Frantzis
b514e63cfc Add support for the r8g8b8a8 and r8g8b8x8 formats to the tests. 2011-03-22 13:43:29 -04:00
Alexandros Frantzis
f05a90e5f8 Add simple support for the r8g8b8a8 and r8g8b8x8 formats.
This format is particularly useful on big-endian architectures, where RGBA in
memory/file order corresponds to r8g8b8a8 as an uint32_t. This is important
because RGBA is in some cases the only available choice (for example as a pixel
format in OpenGL ES 2.0).
2011-03-22 13:43:29 -04:00
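To illustrate the point about memory order in the entry above: if the bytes R, G, B, A are stored in that order and read as a big-endian 32-bit word, red lands in the most significant byte, which is the r8g8b8a8 layout. This is a minimal sketch with hypothetical helper names; it is not code from pixman.

```c
#include <stdint.h>

/* Packs four channels into an r8g8b8a8 word: red in the top byte. */
uint32_t pack_r8g8b8a8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)r << 24) | ((uint32_t)g << 16) |
           ((uint32_t)b << 8) | (uint32_t)a;
}

/* Stores R, G, B, A in memory order, then reads the four bytes the way
 * a big-endian CPU would load a uint32_t from that address. */
uint32_t rgba_bytes_as_be32(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    const uint8_t mem[4] = { r, g, b, a };

    return ((uint32_t)mem[0] << 24) | ((uint32_t)mem[1] << 16) |
           ((uint32_t)mem[2] << 8) | (uint32_t)mem[3];
}
```

Both functions return the same value for the same inputs, which is why RGBA byte streams need no per-pixel shuffling on big-endian machines when the image format is r8g8b8a8.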
Søren Sandmann Pedersen
7eb0abb5e8 test: Randomize some tests if PIXMAN_RANDOMIZE_TESTS is set
This patch makes it so that composite and stress-test will start from a
random seed if the PIXMAN_RANDOMIZE_TESTS environment variable is
set. Running the test suite in this mode is useful to get more test
coverage.

Also, in stress-test.c make it so that setting the initial seed causes
threads to be turned off. This makes it much easier to see when
something fails.
2011-03-19 08:51:35 -04:00
Søren Sandmann Pedersen
6b27768d81 Simplify the prototype for iterator initializers.
All of the information previously passed to the iterator initializers
is now available in the iterator itself, so there is no need to pass
it as arguments anymore.
2011-03-18 16:23:10 -04:00
Søren Sandmann Pedersen
74d0f44b6d Fill out parts of iters in _pixman_implementation_{src,dest}_iter_init()
This makes _pixman_implementation_{src,dest}_iter_init() responsible
for filling parts of the information in the iterators. Specifically,
the information passed as arguments is stored in the iterator.

Also add a height field to pixman_iter_t.
2011-03-18 16:23:10 -04:00
Søren Sandmann Pedersen
be4eaa0e4f In delegate_{src,dest}_iter_init() call delegate directly.
There is no reason to go through
_pixman_implementation_{src,dest}_iter_init(), especially since
_pixman_implementation_src_iter_init() is doing various other checks
that only need to be done once.

Also call delegate->src_iter_init() directly in pixman-sse2.c
2011-03-18 16:23:10 -04:00
Siarhei Siamashka
70a923882c ARM: a bit faster NEON bilinear scaling for r5g6b5 source images
Instruction scheduling is improved in the code responsible for fetching r5g6b5
pixels and converting them to the intermediate x8r8g8b8 color format used in
the interpolation part of the code. Many NEON stalls still remain,
which can be resolved later by the use of pipelining.

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
          op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=41.35 MPix/s
          op=1, src=10020565, dst=20020888, speed=49.16 MPix/s
2011-03-12 21:30:22 +02:00
Siarhei Siamashka
fe99673719 ARM: NEON optimization for bilinear scaled 'src_0565_0565'
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=3.30 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
2011-03-12 21:30:18 +02:00
Siarhei Siamashka
29003c3bef ARM: NEON optimization for bilinear scaled 'src_0565_x888'
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=20020888, speed=3.39 MPix/s
  after:  op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
2011-03-12 21:30:13 +02:00
Siarhei Siamashka
2ee27e7d79 ARM: NEON optimization for bilinear scaled 'src_8888_0565'
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=10020565, speed=6.56 MPix/s
  after:  op=1, src=20028888, dst=10020565, speed=61.65 MPix/s
2011-03-12 21:30:09 +02:00
Siarhei Siamashka
11a0c5badb ARM: use common macro template for bilinear scaled 'src_8888_8888'
This is a cleanup of old and now duplicated code. The performance improvement
comes mostly from enabling software prefetch, but instruction
scheduling is also slightly better.

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=53.24 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=74.36 MPix/s
2011-03-12 21:30:05 +02:00
Siarhei Siamashka
34098dba67 ARM: NEON: common macro template for bilinear scanline scalers
This allows generating bilinear scanline scaling functions targeting
various source and destination color formats. Right now a8r8g8b8/x8r8g8b8
and r5g6b5 color formats are supported. More formats can be added if needed.
2011-03-12 21:30:00 +02:00
Siarhei Siamashka
66f4ee1b3b ARM: new bilinear fast path template macro in 'pixman-arm-common.h'
It can be reused in different ARM NEON bilinear scaling fast path functions.
2011-03-12 21:29:56 +02:00
Siarhei Siamashka
5921c17639 ARM: assembly optimized nearest scaled 'src_8888_8888'
Benchmark on ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=44.36 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=39.79 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=102.36 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=163.12 MPix/s
2011-03-12 21:26:05 +02:00
Siarhei Siamashka
f3e17872f5 ARM: common macro for nearest scaling fast paths
The code of nearest scaled 'src_0565_0565' function was generalized
and moved to a common macro, so that it can be reused for other
fast paths.
2011-03-12 21:24:40 +02:00
Siarhei Siamashka
bb3d1b67fd ARM: use prefetch in nearest scaled 'src_0565_0565'
Benchmark on ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=75.02 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=73.63 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=176.12 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=267.50 MPix/s
2011-03-12 21:23:54 +02:00
Cyril Brulebois
3503f7956f Upload to experimental. 2011-03-09 04:08:04 +01:00
Cyril Brulebois
19f2d3d9c1 Bump Standards-Version to 3.9.1 (no changes needed). 2011-03-09 04:07:54 +01:00
Cyril Brulebois
bec6320b0e Add a quilt series placeholder file. 2011-03-09 04:04:13 +01:00
Cyril Brulebois
43375c5d66 Switch to dh. 2011-03-09 03:55:08 +01:00
Cyril Brulebois
d3975d7ff9 Update Uploaders list. Thanks, David! 2011-03-09 03:42:00 +01:00
Cyril Brulebois
b03a2e477b Remove libpixman1-dev from Conflicts, last seen in etch! 2011-03-09 03:41:05 +01:00
Cyril Brulebois
61363cc614 Wrap Build-Depends. 2011-03-09 03:40:06 +01:00
Cyril Brulebois
b98292b4d5 Bump shlibs accordingly. 2011-03-09 03:39:07 +01:00
Cyril Brulebois
1e6491fdde Update symbols file with new symbols. 2011-03-09 03:38:42 +01:00
Cyril Brulebois
1d60bb92f7 Bump changelogs. 2011-03-09 03:21:07 +01:00
Cyril Brulebois
a0ab0aecb2 Merge branch 'upstream-experimental' into debian-experimental 2011-03-09 03:20:36 +01:00
Søren Sandmann Pedersen
84e361c8e3 test: Do endian swapping of the source and destination images.
Otherwise the test fails on big endian. Fix for bug 34767, reported by
Siarhei Siamashka.
2011-03-07 14:08:00 -05:00
Søren Sandmann Pedersen
84f3c5a71a test: In image_endian_swap() use pixman_image_get_format() to get the bpp.
There is no reason to pass in the bpp as an argument; it can be gotten
directly from the image.
2011-03-07 14:07:44 -05:00
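The idea in the entry above, deriving the bits per pixel from the image format rather than passing it separately, can be sketched like this. The real image_endian_swap() is not shown in this log; the sketch assumes the bpp sits in the top byte of the format code (which pixman exposes through its PIXMAN_FORMAT_BPP macro) and handles only whole-byte swaps for 32 and 16 bpp. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: bits per pixel is stored in the top byte of the format code. */
unsigned bpp_from_format(uint32_t format)
{
    return format >> 24;
}

/* Reverses the byte order of one pixel, choosing the width from the format. */
uint32_t swap_pixel_bytes(uint32_t pixel, uint32_t format)
{
    switch (bpp_from_format(format)) {
    case 32:
        return ((pixel & 0x000000ffu) << 24) | ((pixel & 0x0000ff00u) << 8) |
               ((pixel & 0x00ff0000u) >> 8) | ((pixel & 0xff000000u) >> 24);
    case 16:
        return ((pixel & 0x00ffu) << 8) | ((pixel & 0xff00u) >> 8);
    default:
        /* 8 bpp and other depths are left unchanged in this sketch. */
        return pixel;
    }
}
```

Because the caller passes only the pixel and the format, the bpp cannot get out of sync with the image it describes.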
Siarhei Siamashka
17feaa9c50 ARM: NEON optimization for bilinear scaled 'src_8888_8888'
Initial NEON optimization for bilinear scaling. It can probably be
improved further.

Benchmark on ARM Cortex-A8:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=6.70 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=44.27 MPix/s
2011-02-28 15:47:58 +02:00
Siarhei Siamashka
350029396d SSE2 optimization for bilinear scaled 'src_8888_8888'
A primitive naive implementation of bilinear scaling using SSE2 intrinsics,
which only handles one pixel at a time. It is approximately 2x faster than
pixman general compositing path. Single pass processing without intermediate
temporary buffer contributes to ~15% and loop unrolling contributes to ~20%
of this speedup.

Benchmark on Intel Core i7 (x86-64):
 Using cairo-perf-trace:
  before: image        firefox-planet-gnome   12.566   12.610   0.23%    6/6
  after:  image        firefox-planet-gnome   10.961   11.013   0.19%    5/6

 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=70.48 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=165.38 MPix/s
2011-02-28 15:47:52 +02:00
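For readers unfamiliar with the operation being optimized in the entry above, here is a scalar, one-channel sketch of bilinear interpolation. It is not the SSE2 code from the commit; it assumes 8-bit fractional weights (0..255) so that the four weights sum to 65536, and the function name is hypothetical.

```c
#include <stdint.h>

/* Interpolates one 8-bit channel between four neighbouring samples:
 * tl/tr are the top-left/top-right pixels, bl/br the bottom ones.
 * distx and disty are the fractional offsets in the range 0..255. */
uint8_t bilinear_channel(uint8_t tl, uint8_t tr, uint8_t bl, uint8_t br,
                         unsigned distx, unsigned disty)
{
    uint32_t w_tl = (256u - distx) * (256u - disty);
    uint32_t w_tr = distx * (256u - disty);
    uint32_t w_bl = (256u - distx) * disty;
    uint32_t w_br = distx * disty;

    /* The weights sum to 256 * 256 = 65536, so shift right by 16. */
    return (uint8_t)((tl * w_tl + tr * w_tr + bl * w_bl + br * w_br) >> 16);
}
```

A full pixel is handled by applying the same weights to each of its channels, which is what makes the operation a good fit for SIMD.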
Siarhei Siamashka
0df43b8ae5 test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
Individual correctness check for the new bilinear scaling related
supplementary function. This test program uses a bit wider range
of input arguments, not covered by other tests.
2011-02-28 15:29:23 +02:00
Siarhei Siamashka
d506bf68fd Main loop template for fast single pass bilinear scaling
Can be used for implementing SIMD optimized fast path
functions which work with bilinear scaled source images.

Similar to the template for nearest scaling main loop, the
following types of mask are supported:
1. no mask
2. non-scaled a8 mask with SAMPLES_COVER_CLIP flag
3. solid mask

PAD repeat is fully supported. NONE repeat is partially
supported (right now only works if source image has alpha
channel or when alpha channel of the source image does not
have any effect on the compositing operation).
2011-02-28 15:29:16 +02:00
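The difference between the PAD and NONE repeat modes mentioned in the entry above can be sketched for a single scanline. With PAD, coordinates outside the image are clamped to the nearest edge pixel; with NONE, they read as transparent (zero). The helper names are hypothetical and this is not the template code from the commit.

```c
#include <stdint.h>

/* PAD repeat: clamp the coordinate into [0, width - 1]. */
int pad_coord(int x, int width)
{
    if (x < 0)
        return 0;
    if (x >= width)
        return width - 1;
    return x;
}

/* Fetch with PAD repeat: out-of-range reads return the edge pixel. */
uint32_t fetch_pad(const uint32_t *row, int width, int x)
{
    return row[pad_coord(x, width)];
}

/* Fetch with NONE repeat: out-of-range reads return transparent black. */
uint32_t fetch_none(const uint32_t *row, int width, int x)
{
    return (x < 0 || x >= width) ? 0u : row[x];
}
```

This also hints at why NONE repeat is harder to support fully: the zero pixels outside the image only behave correctly when the source alpha channel is taken into account.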
Andrea Canciani
9ebde285fa test: Silence MSVC warnings
MSVC does not notice non-returning functions (abort() / assert(0))
and warns about paths which end with them in non-void functions:

c:\cygwin\home\ranma42\code\fdo\pixman\test\fetch-test.c(114) :
warning C4715: 'reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\stress-test.c(133) :
warning C4715: 'real_reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\composite.c(431) :
warning C4715: 'calc_op' : not all control paths return a value

These warnings can be silenced by adding a return after the
termination call.
2011-02-28 10:38:02 +01:00
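The pattern described in the entry above, adding a return after a terminating call so that every control path visibly returns a value, looks roughly like this. The function is a made-up example, not one of the three functions named in the warnings.

```c
#include <stdlib.h>

/* Hypothetical example: maps a bit depth to a byte count. */
int bytes_per_pixel(int bpp)
{
    switch (bpp) {
    case 8:
        return 1;
    case 16:
        return 2;
    case 32:
        return 4;
    }

    /* Unsupported depth: terminate. A compiler that does not treat
     * abort() as non-returning would warn here without the return. */
    abort();
    return 0; /* never reached; present only to silence the warning */
}
```

The extra return changes nothing at run time, since abort() never returns.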
Andrea Canciani
8868778ea1 Do not include unused headers
pixman-combine32.h is included without being used both in
pixman-image.c and in pixman-general.c.
2011-02-28 10:38:02 +01:00