Taalverwerking practical assignment
Lab assistant: Simon Polstra
Introduction
For this practical assignment you need to develop a peephole optimizer
for
SimpleScalar DLX assembly code. This peephole optimizer
reads assembly code generated by a gcc cross compiler and outputs new
assembly in which redundant instructions have been removed. Examples of
possibilities for peephole optimization can be found below. Given
a set of small benchmark programs, you should measure the speedups
obtained
by your peephole optimizer. Finally, a report should be written in
which
your peephole optimizer (its design, its structure, etc.) is described
as
well as the experiments performed on the set of benchmarks.
Peephole optimization
By looking at the assembly code the gcc cross compiler generates in the
case no optimization flags are specified, one can see that the compiler
will
sometimes generate quite a few redundant instructions. A peephole
optimizer
analyses the assembly code by means of a "sliding window" (often the
size
of a basic block or smaller) with which it tries to find certain
(small) instruction
sequences that can be replaced with more efficient (sequences of)
instructions.
Within the analyzed instruction window, such rewriting of instructions
is
repeated until there are no changes anymore. Below, we give several
simple
examples of instruction sequences which can be replaced:
|
Original sequence
|
Replacement
|
| mov $regA,$regB //with $regA == $regB |
--- //remove it |
mov $regA,$regB
instr $regA, $regA,... //where
instruction "instr" writes
$regA |
instr $regA, $regB,... |
instr
$regA,...
// instruction "instr"
writes $regA
mov $4,
$regA
// $4 is one of the argument
regs in SimpleScalar
jal
XXX
// function call |
instr $4,...
jal XXX |
sw $regA,XXX
ld $regA,XXX |
sw $regA, XXX |
| shift $regA,$regA,0 //shift =
sll,sla,srl or sra |
--- //remove it |
add $regA,$regA,X //where X is a
constant
lw ...,0($regA) |
lw ...,X($regA) |
beq
...,$Lx
j $Ly
$Lx: ... |
bne ...,$Ly
$Lx: |
Of course, there are a lot more possibilities. Moreover, the
examples
given in the table above are instances of classes of more generic
instruction
rewrite rules. To make your instruction rewrite rules (specifying which
instruction sequence can be replaced by a new sequence under what
circumstances) more generic, you probably should include support for
dataflow analysis on basic blocks. More specifically, you should be
able to tell whether or not a register is dead (i.e., its current value
is not used anymore) after a certain position in the basic block.
Where to find the tools?
In /home/spolstra/vertalerbouw/ you can find all needed material for
this assignment. It contains a number of subdirectories/files:
- bin/ : contains the gcc cross compiler (xgcc) and
the
SimpleScalar out-of-order simulator (sim-outorder).
- lib/ : library needed by xgcc
- ssbig-na-sstrix/ : binaries and libraries needed by
xgcc/SimpleScalar
- include/: include files for xgcc
- benchmarks.tar : a set of small C files for evaluating your
peephole optimizer (copy to and untar in your working directory)
The gcc cross compiler
The gcc cross compiler (xgcc) can compile C code into binaries for the
SimpleScalar processor simulator. By simply executing:
xgcc -o foo foo.c
the binary foo is generated for foo.c
To generate assembly code that will serve as input for your peephole
optimizer, compile your .c file as follows:
xgcc -S foo.c
This will generate foo.s, which contains the assembly.
Note: for this assignment, to generate assembly for
your
peephole optimizer you should not specify any optimization
flags to
the compiler. Otherwise, your peephole optimizer will have a hard
time
to find any optimizations at all!
After you processed an assembly file with your peephole optimizer,
you use xgcc again to generate the binary which can be simulated using
the SimpleScalar tool set:
xgcc -o newfoo newfoo.s (where newfoo.s
is
the peephole optimized version of foo.s)
The SimpleScalar "sim-outorder" simulator
The "sim-outorder" simulator is the most detailed processor simulator
provided by the SimpleScalar tool set. It is based on the MIPS-IV
instruction set architecture and models a very modern superscalar
microprocessor which has 10 functional units (pipelines) and which
aggressively tries to exploit instruction-level parallelism (to
keep its functional units busy) by, for example, executing instructions
out of order. In other words, it realistically
models the processor in the latest workstations/PCs. In this
assignment,
we will use the default setup of the sim-outorder simulator (which can
be
reconfigured to model other flavors of microprocessors). This means
that
you simulate a binary by simply executing:
sim-outorder
newfoo
(where newfoo is the binary to simulate)
After the simulation has finished, a lot of statistics
is generated (outputted to stderr). From these statistics, one number
is of importance
to you:
sim_cycle
........... # total simulation time in cycles
The number shown before the "#" indicates the number of cycles
needed
to execute this binary. Hence, it is your ultimate goal to reduce
this number for each of the given benchmark programs. And, of course,
this should be done while keeping the functionality of the programs
intact!
Modeling as detailed as sim-outorder does, has its disadvantages: it
is compute-intensive and therefore slow! So, you might need
to
downscale some benchmarks (reducing the number of iterations they
execute)
for testing purposes. Remember, however, to use the original benchmarks
for
your final experiments of which the results also need to be reported.
You can find a detailed description of the sim-outorder simulator
and
a specification of the instruction-set it simulates in
this document . Note: the assembly code generated by the cross
compiler also contains so-called pseudo-operations (e.g., the li -- load immediate --
and mov instructions).
The assembler translates these pseudo-ops into equivalent sequences of
actual machine instructions.
The benchmark suite
The benchmark suite consists of six relatively simple programs. After
simulating, you will find the program output in the files *.output and
the simulator output (e.g. statistics) in *.cycles
pi.c
Use: pi <nr_of_iterations>
For this assignment, simulate using "sim-outorder -redir:prog pi.output
-redir:sim pi.cycles pi 25000"
Program calculates pi.
acron.c
Simulate using "sim-outorder -redir:prog acron.output -redir:sim
acron.cycles acron"
Program generates acronyms for a set of words. It outputs the
generated acronyms to stdout.
dhrystone.c
Simulate using "sim-outorder -redir:prog dhrystone.output -redir:sim
dhrystone.cycles dhrystone"
The well-known dhrystone benchmark which was used "ages ago" to
measure (integer) performance of control-based programs.
whet.c
Simulate using "sim-outorder -redir:prog whet.output -redir:sim
whet.cycles whet"
The well-known whetstone benchmark which was used "ages ago" to
measure floating point performance.
slalom.c
Simulate using "sim-outorder -redir:prog slalom.output -redir:sim
slalom.cycles slalom < slalom.input".
Note: this benchmark uses the slalom.input and geom files.
Slalom stands for Scalable Language-independent Ames Laboratory
One-minute Measurement which is a benchmark for measuring MFLOPS.
clinpack.c
Simulate using "sim-outorder -redir:prog clinpack.output -redir:sim
clinpack.cycles clinpack"
The well-known linpack benchmark which performs all kinds of matrix
operations (e.g. solving linear equations). Note: the "sim_cycle"
number the sim-outorder simulator returns for this benchmark may show
minor differences for different runs.
Requirements for this assignment
As was explained already, you should develop a peephole optimizer that
operates on assembly programs targeted for the SimpleScalar processor
architecture. Your peephole optimizer should be an executable
which reads SimpleScalar
assembly code and outputs peephole optimized assembly code. You are
entirely
free in which language you program your peephole optimizer, as long
as
it runs on a Linux machine! So, no
Windows stuff.
Your work will be rated according to several important requirements:
- The peephole optimizer that is submitted functions correctly in
the sense that it executes without crashing and it generates assembly
code that is functionally equivalent to the original assembly code.
- The source code of your peephole optimizer (which is, of course,
also handed in) is clean and well-commented.
- Your work, including the experiments performed on the benchmark
suite, is well-documented. That is, you need to write a small report
(< 6 pages) on the design & implementation of your optimizer and
discussing its
results for the benchmarks. An important topic in such a report
is,
of course, which types of peephole optimizations your optimizer
implements.
- Finally, the obtained speedups of your peephole optimizer are
taken into account.
The work is performed in teams of X students, with X to be defined
(this will be done during the first lecture). In
the end, you will
get a team grading in the sense that you get a number of points
(dependent
on the number of students in the team) which can be evenly divided over
the team members or which can be divided with a +1/-1 difference if you
think that certain team members earn less/more points for their
contribution. For example, if a team of 3 students gets 21 points, then
they may decide that each of them gets a 7 or that two members get a
7.5 and one a 6.
It would be nice when each team uses a nick-name so that we can
bring
in a "competition element": who is going to develop the most aggressive
peephole optimizer? During the 6 weeks you can spend on this
assignment, we hope
to regularly publish the achievements of your and other teams on the
ranking page (with eternal glory for the final winner :) Whenever
your team has new or improved optimization numbers, you can email them
together with the binary of your peephole optimizer to me after which I
will update the ranking page. So, the ranking page also
publishes intermediate results of teams in order to "stimulate" the
competition
element.
Regarding the division of work within a team, one could recognize
several (possible) tasks for implementing a peephole optimizer:
- a parser for SimpleScalar assembly code. This could be
done
with, for example, the lex and yacc tools (see the Dragon book,
and
"man lex" / "man yacc")
- dividing the parsed assembly into basic blocks
- performing peephole optimizations on basic blocks (Hint:
peephole optimizations regarding jump/branch instructions usually need
to be performed on the global assembly code, so before the partitioning
into basic blocks)
- to allow for more aggressive optimizations, liveness analysis
(is a register dead after a certain point?) may be required within a
basic block
Grading the work
The grading of this assignment will be done as follows:
- Implementation
- Instruction parsing (10%)
- Division of instructions into basic blocks (10%)
- Global branch/jump optimizations (10%)
- "Standard" peephole optimization on basic blocks (10%)
- "Advanced" peephole optimizations using dataflow/liveness
analysis (15%)
- Comments in code (10%)
- Report (20%)
- Results
- 5% (very small speedups) - 15% (extremely competitive speedups)
Reference results
Of course, I also built a peephole optimizer, nick-named
"Hyper-de-peep". Below, the reduction of #cycles and number of removed
(redundant) instructions is given for the benchmark set.
Hyper-de-peep is a moderately to fairly aggressive peephole optimizer.
|
Benchmark
|
Number of analyzed instructions
|
Number of instructions removed
|
Reduction of # cycles (%)
(100* ((cycles_old - cycles_new) / cycles_old))
|
|
pi
|
78
|
2 (5)
|
0 (0.4)
|
|
acron
|
307
|
32 (37)
|
2.6 (4.2)
|
|
dhrystone
|
659
|
50 (54)
|
5.7 (4.3!)
|
|
whet
|
855
|
53 (65)
|
1.0 (1.4)
|
|
slalom
|
3755
|
310 (371)
|
0.5 (-0.1!)
|
|
clinpack
|
3262
|
239 (272)
|
~ 8.6 (8.9)
|
The numbers between parenthesis refer to Hyper-de-peep running in a
special "aggressive mode" in which certain optimistic assumptions are
made on the liveness of registers when the liveness cannot be
determined by analyzing the basic block in question. As you can see,
removing a large number of
instructions does not necessarily imply a large speedup. Reasons for
this
are that the removed instructions might not be part of the critical
code
(e.g., inner loops) and that the SimpleScalar processor architecture is
able
to resolve a lot of code inefficiencies at hardware level.
Good luck with the assignment!
(c) 2011, Andy Pimentel