Taalverwerking practical assignment

Lab assistant: Simon Polstra

Introduction

For this practical assignment you need to develop a peephole optimizer for SimpleScalar DLX assembly code.   This peephole optimizer reads assembly code generated by a gcc cross compiler and outputs new assembly in which redundant instructions have been removed. Examples of possibilities for peephole optimization can be found below.  Given a set of small benchmark programs, you should measure the speedups obtained by your peephole optimizer. Finally, a report should be written in which your peephole optimizer (its design, its structure, etc.) is described as well as the experiments performed on the set of benchmarks.

Peephole optimization

By looking at the assembly code the gcc cross compiler generates in the case no optimization flags are specified, one can see that the compiler will sometimes generate quite a few redundant instructions. A peephole optimizer analyses the assembly code by means of a "sliding window" (often the size of a basic block or smaller)  with which it tries to find certain (small) instruction sequences that can be replaced with more efficient (sequences of) instructions. Within the analyzed instruction window, such rewriting of instructions is repeated until there are no changes anymore. Below, we give several simple examples of instruction sequences which can be replaced:
 
Original sequence
Replacement
mov $regA,$regB     //with $regA == $regB ---  //remove it
mov $regA,$regB
instr $regA, $regA,...   //where instruction "instr" writes $regA
instr $regA, $regB,...
instr $regA,...               // instruction "instr" writes $regA
mov $4, $regA              // $4 is one of the argument regs in SimpleScalar
jal  XXX                        // function call
instr $4,...
jal XXX
sw $regA,XXX
ld $regA,XXX
sw $regA, XXX
shift $regA,$regA,0     //shift = sll,sla,srl or sra ---  //remove it
add $regA,$regA,X     //where X is a constant
lw ...,0($regA)
lw ...,X($regA)
          beq ...,$Lx
          j $Ly
$Lx:   ...
         bne ...,$Ly
$Lx:

Of course, there are a lot more possibilities. Moreover, the examples given in the table above are instances of classes of more generic instruction rewrite rules. To make your instruction rewrite rules (specifying which instruction sequence can be replaced by a new sequence under what circumstances) more generic, you probably should include support for dataflow analysis on basic blocks. More specifically, you should be able to tell whether or not a register is dead (i.e., its current value is not used anymore) after a certain position in the basic block.

Where to find the tools?

In /home/spolstra/vertalerbouw/ you can find all needed material for this assignment. It contains a number of subdirectories/files:

The gcc cross compiler

The gcc cross compiler (xgcc) can compile C code into binaries for the SimpleScalar processor simulator. By simply executing:
             xgcc -o foo foo.c
 the binary foo is generated for  foo.c

To generate assembly code that will serve as input for your peephole optimizer, compile your .c file as follows:
             xgcc -S foo.c
This will generate foo.s, which contains the assembly.

Note:  for this assignment, to generate assembly for your peephole optimizer you should not specify any optimization flags to the compiler.  Otherwise, your peephole optimizer will have a hard time to find any optimizations at all!

After you processed an assembly file with your peephole optimizer, you use xgcc again to generate the binary which can be simulated using the SimpleScalar tool set:
             xgcc -o newfoo newfoo.s      (where newfoo.s is the peephole optimized version of foo.s)

The SimpleScalar "sim-outorder" simulator

The "sim-outorder" simulator is the most detailed processor simulator provided by the SimpleScalar tool set.  It is based on the MIPS-IV instruction set architecture and models a very modern superscalar microprocessor which has 10 functional units (pipelines) and which aggressively  tries to exploit instruction-level parallelism (to keep its functional units busy) by, for example, executing instructions out of order. In other words, it realistically models the processor in the latest workstations/PCs.  In this assignment, we will use the default setup of the sim-outorder simulator (which can be reconfigured to model other flavors of microprocessors). This means that you simulate a binary by simply executing:

        sim-outorder  newfoo           (where newfoo is the binary to simulate)

After the simulation has finished,  a lot of statistics is generated (outputted to stderr). From these statistics, one number is of importance to you:

         sim_cycle                    ........... # total simulation time in cycles

The number shown before the "#" indicates the number of cycles needed to execute this binary.  Hence, it is your ultimate goal to reduce this number for each of the given benchmark programs. And, of course, this should be done while keeping the functionality of the programs intact!

Modeling as detailed as sim-outorder does, has its disadvantages: it is compute-intensive and therefore slow!  So, you might need to downscale some benchmarks (reducing the number of iterations they execute) for testing purposes. Remember, however, to use the original benchmarks for your final experiments of which the results also need to be reported.

You can find a detailed description of the sim-outorder simulator and a specification of the instruction-set it simulates in this document . Note: the assembly code generated by the cross compiler also contains so-called pseudo-operations (e.g., the li -- load immediate -- and mov instructions).  The assembler translates these pseudo-ops into equivalent sequences of actual machine instructions.

The benchmark suite

The benchmark suite consists of six relatively simple programs. After simulating, you will find the program output in the files *.output and the simulator output (e.g. statistics) in *.cycles

pi.c

Use:   pi  <nr_of_iterations>
For this assignment, simulate using "sim-outorder -redir:prog pi.output -redir:sim pi.cycles pi 25000"

Program calculates pi.

acron.c

Simulate using "sim-outorder -redir:prog acron.output -redir:sim acron.cycles acron"

Program generates acronyms for a set of words. It outputs the generated acronyms to stdout.

dhrystone.c

Simulate using "sim-outorder -redir:prog dhrystone.output -redir:sim dhrystone.cycles dhrystone"

The well-known dhrystone benchmark which was used "ages ago" to measure (integer) performance of control-based programs.

whet.c

Simulate using "sim-outorder -redir:prog whet.output -redir:sim whet.cycles whet"

The well-known whetstone benchmark which was used "ages ago" to measure floating point performance.

slalom.c

Simulate using "sim-outorder -redir:prog slalom.output -redir:sim slalom.cycles slalom < slalom.input".
Note: this benchmark uses the slalom.input and geom files.

Slalom stands for Scalable Language-independent Ames Laboratory One-minute Measurement which is a benchmark for measuring MFLOPS.

clinpack.c

Simulate using "sim-outorder -redir:prog clinpack.output -redir:sim clinpack.cycles clinpack"

The well-known linpack benchmark which performs all kinds of matrix operations (e.g. solving linear equations). Note: the "sim_cycle" number the sim-outorder simulator returns for this benchmark may show minor differences for different runs.
 

Requirements for this assignment

As was explained already, you should develop a peephole optimizer that operates on assembly programs targeted for the SimpleScalar processor architecture.  Your peephole optimizer should be an executable which reads SimpleScalar assembly code and outputs peephole optimized assembly code. You are entirely free in which language you program your peephole optimizer, as long as it runs on a Linux machine! So, no Windows stuff.

Your work will be rated according to several important requirements:

The work is performed in teams of X students, with X to be defined (this will be done during the first lecture). In the end, you will get a team grading in the sense that you get a number of points (dependent on the number of students in the team) which can be evenly divided over the team members or which can be divided with a +1/-1 difference if you think that certain team members earn less/more points for their contribution. For example, if a team of 3 students gets 21 points, then they may decide that each of them gets a 7 or that two members get a 7.5 and one a 6.

It would be nice when each team uses a nick-name so that we can bring in a "competition element": who is going to develop the most aggressive peephole optimizer?  During the 6 weeks you can spend on this assignment, we hope to regularly publish the achievements of your and other teams on the ranking page (with eternal glory for the final winner :) Whenever your team has new or improved optimization numbers, you can email them together with the binary of your peephole optimizer to me after which I will update the ranking page. So, the ranking page also publishes intermediate results of teams in order to "stimulate" the competition element.

Regarding the division of work within a team, one could recognize several (possible) tasks for implementing a peephole optimizer:

Grading the work

The grading of this assignment will be done as follows:

Reference results

Of course, I also built a peephole optimizer, nick-named "Hyper-de-peep". Below, the reduction of #cycles and number of removed (redundant) instructions is given for the benchmark set.  Hyper-de-peep is a moderately to fairly aggressive peephole optimizer.
 
 
Benchmark
Number of analyzed instructions
Number of instructions removed
Reduction of # cycles (%)
(100* ((cycles_old - cycles_new) / cycles_old))
pi
78
2 (5)
0 (0.4)
acron
307
32 (37)
2.6 (4.2)
dhrystone
659
50 (54)
5.7 (4.3!)
whet
855
53 (65)
1.0 (1.4)
slalom
3755
310 (371)
0.5 (-0.1!)
clinpack
3262
239 (272)
  ~ 8.6 (8.9)

The numbers between parenthesis refer to Hyper-de-peep running in a special "aggressive mode" in which certain optimistic assumptions are made on the liveness of registers when the liveness cannot be determined by analyzing the basic block in question. As you can see, removing a large number of instructions does not necessarily imply a large speedup. Reasons for this are that the removed instructions might not be part of the critical code (e.g., inner loops) and that the SimpleScalar processor architecture is able to resolve a lot of code inefficiencies at hardware level.

Good luck with the assignment!

 (c) 2011, Andy Pimentel