Today, designers of high-performance
embedded products face the challenge of choosing the
best combination of hardware and supporting software
application tools. After the hardware architecture
is selected, care must be taken to evaluate and select
software development tools that squeeze the highest
performance out of the architecture. After all, what
good is a hardware architecture feature if the software
doesn’t take advantage of it?
IBM engineers recently concluded a study to help
customers face this challenge for the IBM® PowerPC®
405 and 440 processors. Several issues were addressed:
- Are there useful, honest benchmarks to evaluate
hardware/software combinations?
- Can I control how my C/C++ compiler exploits
the special features of the PowerPC architecture?
- How fast can I expect my application to run compared
to the theoretical maximum?
Useful, honest performance benchmarks
The Embedded Microprocessor Benchmark Consortium
(EEMBC) develops and certifies meaningful performance
benchmarks for embedded processors and compilers.
The EEMBC benchmarks are composed of dozens of algorithms
organized into benchmark suites targeting telecommunications,
networking, automotive and industrial, consumer, and
office equipment products, as shown in Table 1. Although
there is no better benchmark for a customer application
than the application itself, the EEMBC benchmarks’
real-world nature offers a clear improvement over
older synthetic benchmarks.
The scores
IBM tested various compilers with its PowerPC
4xx products on all five EEMBC suites and chose the
compiler that produced the best scores. IBM then certified
the scores at EEMBC Certification Labs (ECL) and published
the scores on the public EEMBC Web site, www.eembc.org.
“EEMBC benchmarks are based on real-world code
that indicates how our PowerPC 405GPr and 440GP processors
work in our customers' applications,” said Kalpesh
Gala, PowerPC strategic marketing manager at IBM Microelectronics.
“With Green Hills Software's compilers, our
PowerPC 440GP processor exceeded all other System-on-Chip
processors on four of the five EEMBC benchmark suites.”
| Table 1. Compiler mix for EEMBC certification |
| IBM PowerPC processor | Telecom | Office automation | Consumer | Automotive/industrial | Networking |
| 405GPr – 266 MHz | Green Hills | Green Hills | Green Hills | Green Hills | Green Hills |
| 405GPr – 400 MHz | Green Hills | Green Hills | Green Hills | Green Hills | Green Hills |
| 440GP | Green Hills | Green Hills | GNU | Green Hills | Green Hills |
Note: Table 1 shows the IBM PowerPC architecture
and compiler configurations that produced the highest
scores on the EEMBC Web site.
Scores on the EEMBC Web site show that Green Hills
Software’s compilers outperformed other compilers
by as much as 20% on the PowerPC 440GP (see Table 2).
That is a significant margin in a realm where even
single-digit improvements are often considered a
breakthrough for a customer’s product design.
| Table 2. Certified EEMBC scores for PowerPC 440GP |
| IBM PowerPC 440GP - 500 MHz | Green Hills Software MULTI 3.6.1 | Next-best score | Green Hills Software faster by: |
| Telecom Telemark | 11.4 | 9.5 (WRS-Diab 4.4a) | 20% |
| Automotive/industrial Automark | 264.2 | 222.1 (MetaWare HighC/C++ 4.5b) | 19% |
| Consumer Consumermark | 42.5 | 48.0 (GNU GCC 3.04) | -13% |
| Networking Netmark | 9.4 | 8.8 (GNU GCC 3.04) | 7% |
| Office automation Oamark | 511.1 | 501.3 (Green Hills Software 3.5) | 2% |
| Note: Certified EEMBC scores are public and enable customers to compare compiler performance. |
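The "faster by" column in Table 2 can be checked with a one-line calculation. The sketch below assumes the margin is computed relative to the other compiler's score; the article does not state its formula, and the helper name `margin_percent` is hypothetical. Under that assumption the Telemark, Automark, Netmark, and Oamark rows round to the published 20%, 19%, 7%, and 2%. The Consumermark row comes out near -11.5%; the published -13% matches dividing by the Green Hills score instead.

```c
/* Relative margin of score `ours` over score `other`, in percent,
 * using `other` as the base. The choice of base is an assumption;
 * the article does not say how its percentages were derived. */
double margin_percent(double ours, double other)
{
    return (ours - other) / other * 100.0;
}
```

For example, `margin_percent(11.4, 9.5)` gives the 20% Telemark margin, and a negative result means the other compiler scored higher.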
The technology behind the scores
Key compiler optimizations that produced IBM’s
EEMBC scores fall into two categories:
- IBM PowerPC architecture-specific optimizations
- General optimizations
The Green Hills Software compilers are one component
of the MULTI integrated development environment, shown
in Figure 1. Programmers can invoke the compiler within
MULTI or from makefiles.
Figure 1. MULTI Integrated Development Environment
Multiply-accumulate instructions
One of the keys to the high EEMBC scores was the compiler’s
ability to effectively use the
multiply-accumulate (MAC) instruction set extensions
on the PowerPC 405 and 440 processors. The 16-bit
multiply instructions included in the MAC extensions
offer shorter latencies and higher throughputs than
the 32-bit multiply instructions of the standard PowerPC
architecture. Unfortunately, the C programming language
does not lend itself to arithmetic on values narrower
than type "int": many operations promote their operands
to larger sizes, and programmers often do not select the
smallest data types for their variables. The compiler,
however, overcame both of these obstacles. It automatically
identified and used the smallest possible data sizes
for these operations and maximized use of the MAC instructions.
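As an illustration of the kind of source code this applies to, here is a minimal C sketch of a 16-bit dot product. The function name `dot16` is hypothetical, and whether a given compiler actually emits a MAC instruction for this loop depends on the compiler and its options; the sketch only shows how narrow operand types make the 16x16-bit multiply visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit sample buffers with a 32-bit accumulator.
 * C promotes both int16_t operands to int before multiplying, but their
 * values still fit in 16 bits, so each step is a 16x16 -> 32-bit
 * multiply followed by an add: the pattern a MAC instruction covers.
 * The caller must keep the running sum within the int32_t range. */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Had the buffers been declared as `int`, the compiler would have to prove on its own that the values fit in 16 bits before it could substitute the narrower multiply.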
Divides and multiplies can often be performed using
sequences of shifts, adds and subtracts, which can
be faster than using actual divide and multiply instructions.
The Green Hills Software compiler chooses an approach
based on factors such as:
- Latency and throughput of the particular
divide and multiply instructions
- Heuristics tuned by a database of performance
characteristics from other real applications run
on the particular PowerPC processor
Alternatively, the user can override the compiler’s
choice with compiler flags.
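The shift-and-add rewrites mentioned above can be written out by hand to show what the compiler is choosing between. This is a generic sketch, not Green Hills output; the function names are hypothetical. It uses unsigned operands because signed division by a power of two needs an extra rounding correction that a plain shift does not provide.

```c
#include <stdint.h>

/* x * 10 rewritten as 8x + 2x: two shifts and one add. */
uint32_t mul10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* x * 7 rewritten as 8x - x: one shift and one subtract. */
uint32_t mul7_shift_sub(uint32_t x)
{
    return (x << 3) - x;
}

/* Unsigned x / 8 rewritten as a right shift by 3. */
uint32_t div8_shift(uint32_t x)
{
    return x >> 3;
}
```

Whether the shift sequence beats a real multiply or divide depends on the instruction latencies of the target processor, which is why the choice is left to the compiler's heuristics or to an explicit flag.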
Pipeline scheduling
The high scores were also a result of the compiler
effectively scheduling instructions for the PowerPC
architecture pipelines. The PowerPC 405 uses a
single-issue, five-stage pipeline. The PowerPC 440
processor uses superscalar, out-of-order execution
with a seven-stage pipeline consisting of three parallel
pipelines. In addition, it contains two integer units
and one load/store unit. The challenge facing the
compiler is to schedule instructions, often with
creative reordering, to make the most productive use
of the PowerPC 405 and 440 processors’ execution
units.
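To make the scheduling problem concrete, the following C sketch splits a summation into two independent dependency chains, which gives a scheduler adds it can place in different integer units. It is an illustration under assumptions: the function name is hypothetical, and a good optimizing compiler may perform this transformation itself. Unsigned arithmetic is used so that reordering the additions gives exactly the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array using two accumulators. The adds into `even` and `odd`
 * do not depend on each other, so they can be issued side by side
 * instead of waiting on a single running total. */
uint32_t sum_two_chains(const uint32_t *v, size_t n)
{
    uint32_t even = 0;
    uint32_t odd = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        even += v[i];
        odd += v[i + 1];
    }
    if (i < n) {
        even += v[i]; /* leftover element when n is odd */
    }
    return even + odd;
}
```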
Register allocation
The compiler’s advanced register allocation
logic was also a positive factor. Register allocation
tries to keep local variables in registers instead
of in memory, because accessing them in memory requires
extra load and store instructions. Several optimization
techniques, including register coalescing, loop optimization
analysis, and analysis of constant data and variable
lifetimes, minimize the number of accesses to variables
in memory.
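The cost of memory-resident variables can be seen in a small C sketch. Both functions below compute the same sum; the names are hypothetical. In the first, the running total lives behind a pointer that might alias the input array, so a compiler may have to load and store it on every iteration. In the second, the total is a local whose address is never taken, which makes it an easy candidate for a register.

```c
#include <stddef.h>
#include <stdint.h>

/* Accumulates through *total. Unless the compiler can prove that
 * `total` does not point into `v`, it may reload and store *total
 * on each pass through the loop. */
void sum_via_memory(const uint32_t *v, size_t n, uint32_t *total)
{
    *total = 0;
    for (size_t i = 0; i < n; i++) {
        *total += v[i];
    }
}

/* Accumulates in a local variable and writes memory once at the end. */
void sum_via_local(const uint32_t *v, size_t n, uint32_t *total)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += v[i];
    }
    *total = acc;
}
```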
Many other advanced compiler optimizations were required
to achieve the high scores, including:
- Intermodule inlining
- Loop unrolling
- Constant folding
- Register coalescing
- Loop rotation
- Static address elimination
- Common subexpression elimination
- Dead code elimination
- Constant propagation
- Strength reduction
- Loop invariant removal
- Tail recursion elimination
- Peephole optimization
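Several of the optimizations listed above can be illustrated by writing a loop twice: once as a programmer might write it, and once with the transformations applied by hand. This is a generic sketch with hypothetical function names, not a description of what any particular compiler emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: the expression (gain * 4 + bias) appears
 * twice and is recomputed on every iteration. */
void scale_naive(uint32_t *out, const uint32_t *in, size_t n,
                 uint32_t gain, uint32_t bias)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * (gain * 4 + bias) + (gain * 4 + bias) / (2 * 8);
    }
}

/* Same result with three optimizations applied by hand. */
void scale_optimized(uint32_t *out, const uint32_t *in, size_t n,
                     uint32_t gain, uint32_t bias)
{
    /* Common subexpression elimination: compute gain * 4 + bias once.
     * Loop invariant removal: compute it before the loop, not inside. */
    const uint32_t k = gain * 4 + bias;
    /* Constant folding: 2 * 8 becomes 16 at compile time. */
    const uint32_t offset = k / 16;

    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * k + offset;
    }
}
```

An optimizing compiler would be expected to turn the first form into something close to the second, so source code can stay readable without giving up speed.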
Investment in compiler technology continues at Green
Hills Software. The return on that investment is evident.
Early tests on Green Hills’ next-generation
PowerPC compiler, planned for release later this year,
show it beating its predecessor on these same EEMBC
benchmarks.
Conclusion
Many attributes of a compiler and hardware architecture
must be considered when evaluating software development
tools for a high-performance System-on-Chip (SOC)
design based on IBM PowerPC technology. The EEMBC
benchmarks are one valuable evaluation tool: they enabled
IBM engineers to compare compiler performance on real-world
code and publish winning, certified performance scores.
Green Hills Software’s out-of-the-box compilers
contain optimization technology that enables the user
to unlock the highest performance and smallest code
size for applications based on the IBM PowerPC 4xx
family.