Saturday, December 8, 2007

Assignment#4
Problem Statement: http://www.cs.utexas.edu/users/hds/cs352/assignments/sep24/sep24.pdf
Sample Solution: http://docs.google.com/Doc?id=dgt9wvxh_54c7ggxk


Points Distribution(Question:Points)--> 1:10 2:20 3:20

1. a) jr rs
Jump register is implemented as PC = R[rs]
rs is Instruction[25-21] in Fig 5.17 and is used to select a register; its value appears on the Read data 1 output, i.e., R[rs]. To load this value into PC, we add a multiplexer between the Read data 1 output and the input of PC.

Extra Controls
A control signal for the new multiplexer is added in the control unit. It is asserted when a jr opcode is decoded; all other signals are don't-cares.

b) lui rt, imm

Load upper immediate is implemented as
R[rt] := (imm << 16); PC = PC + 4;
The immediate (Instruction[15-0]) is shifted left by 16 and channeled through a new multiplexer to the Write data input of the register file. A control signal for the new multiplexer is added in the control unit.

Extra Controls
The new multiplexer's control is asserted (1).
rt is Instruction[20-16] so RegDst is 0.
Since lui writes to register, RegWrite is 1.
branch is 0, since we want to increment PC
MemRead and MemWrite are 0, since we don't read or write memory.
We don't care about ALU and other controls.
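As a quick sanity check (not part of the original solution), here is a toy Python model of the two new datapath behaviors; the function names, the register dictionary, and the sample values are illustrative only:

```python
# Sketch of the two datapath changes: the new mux routes R[rs] into PC for
# jr, and the shifted immediate into the register file's Write data for lui.

def step_jr(pc, regs, rs):
    """jr rs: PC <= R[rs]."""
    return regs[rs]

def step_lui(pc, regs, rt, imm):
    """lui rt, imm: R[rt] <= imm << 16; PC <= PC + 4."""
    regs[rt] = (imm << 16) & 0xFFFFFFFF   # shifted imm goes to Write data
    return (pc + 4) & 0xFFFFFFFF

regs = {31: 0x00400020}
print(hex(step_jr(0, regs, 31)))          # jr $31 -> 0x400020
regs = {8: 0}
print(hex(step_lui(0x00400000, regs, 8, 0x1234)), hex(regs[8]))
```

The masking with 0xFFFFFFFF just keeps the toy registers at 32 bits.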

2)
Assume that we have a multi-cycle MIPS processor as described in figure 5.28.
Give a complete description of how the following instructions execute. Your description must
give the settings of all of the control signals and the contents of each of the affected
registers for each cycle:
a. sub $t3, $t1, $t2 equivalently sub $11, $9, $10
b. lw $s0, 4($t4) equivalently lw $16, 0x0004($12)
Assume that the following registers contain:
$t1 = 0x00000010, $t2 = 0x00000002, $t3 = 0x00000000, $t4 = 0x10001000
mem[0x10001000] = 0xfffffffd,
mem[0x10001004] = 0x00000789,
mem[0x10001008] = 0x00001111

Ans)

The control signals are as in Fig 5.38, with a correction in state 4: RegDst = 0, MemtoReg = 1. The effects on the registers are as follows:

a) sub $t3, $t1, $t2 equivalently sub $11, $9, $10
(1) Instruction Fetch
IR <= Memory[PC]
PC <= PC + 4

IR = 0x012A5822 (opcode 0, rs = 9, rt = 10, rd = 11, funct 0x22)

(2) Instruction Decode and Register Fetch
A <= Reg[IR[25:21]]; (A:=R[rs])
B <= Reg[IR[20:16]]; (B:=R[rt])

A = 0x00000010, B = 0x00000002

(3) Execution
ALUOut <= A op B; (ALUOut=A-B)

ALUOut = 0x0000000E

(4) R-type Instruction Completion
Reg[IR[15:11]] <= ALUOut; (R[rd]=ALUOut)

$t3 = 0x0000000E

b) lw $s0, 4($t4) equivalently lw $16, 0x0004($12)
(1) Instruction Fetch
IR <= Memory[PC]
PC <= PC + 4

IR = 0x8D900004

(2) Instruction Decode and Register Fetch
A <= Reg[IR[25:21]]; (A:=R[rs])

A = 0x10001000

(3) Memory Address Computation
ALUOut <= A + sign-extend (IR[15-0]);

ALUOut = 0x10001004

(4) Memory Access
MDR <= Memory [ALUOut];

MDR = 0x00000789

(5) Memory Read Completion
Reg[IR[20:16]] <= MDR;

$s0 = 0x00000789
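The encodings and per-cycle register effects above can be checked with a short Python sketch (not part of the original solution; the field layouts follow the standard MIPS R-type/I-type formats):

```python
# Encode the two instructions and replay the register effects traced above.

def r_type(rs, rt, rd, funct):
    return (rs << 21) | (rt << 16) | (rd << 11) | funct

def i_type(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

ir_sub = r_type(9, 10, 11, 0x22)   # sub $11, $9, $10
ir_lw = i_type(0x23, 12, 16, 4)    # lw $16, 4($12)
print(hex(ir_sub), hex(ir_lw))     # 0x12a5822 0x8d900004

reg = {9: 0x10, 10: 0x02, 11: 0x0, 12: 0x10001000}
mem = {0x10001000: 0xFFFFFFFD, 0x10001004: 0x789, 0x10001008: 0x1111}

# sub: A <= R[rs]; B <= R[rt]; ALUOut <= A - B; R[rd] <= ALUOut
reg[11] = (reg[9] - reg[10]) & 0xFFFFFFFF
# lw: ALUOut <= A + sign-extended imm; MDR <= Mem[ALUOut]; R[rt] <= MDR
reg[16] = mem[reg[12] + 4]
print(hex(reg[11]), hex(reg[16]))  # 0xe 0x789
```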

3.
http://spreadsheets.google.com/pub?key=p7FynqhFWcPPy8TPnE0f0Yg


a. CPI = 4.19

b. Instruction count = time * clocks per sec / clocks per Instr
= 63,007,159,905

c. New CPI = 4.56

d. Execution Time = Instr. Count * clocks per Instr. / clocks per sec
= 95.77088305 sec

Tuesday, December 4, 2007

Assignment#11
Problem: http://www.cs.utexas.edu/users/hds/cs352/assignments/nov26/nov26.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_88hh77pv

Scoring key: [-2 for each incorrect part]

1.

a. Speedup = 1/(.99/100 + 0.01) = 50.2513
b. Speedup = 1 / ( fraction(enhanced)/speedup(enhanced) + (1-fraction(enhanced)) )
After the algebra:
fraction(enhanced) = (1-1/Speedup)/(1-1/speedup(enhanced))
= 0.997475
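As a sanity check (not part of the original solution), the speedup formula and the rearranged fraction formula can be round-tripped in a few lines of Python:

```python
# Amdahl's law and its inversion, as used in parts a and b above.

def speedup(f, s):
    """Overall speedup when fraction f of the work is sped up by factor s."""
    return 1.0 / (f / s + (1.0 - f))

def fraction_needed(overall, s):
    """Rearranged formula: fraction that must be enhanced."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)

# Part a: 99% of the work sped up 100x.
print(round(speedup(0.99, 100), 4))                        # 50.2513
# Round trip: inverting the speedup recovers the fraction.
print(round(fraction_needed(speedup(0.99, 100), 100), 4))  # 0.99
```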

2.

a. Remote memory access time = 200 nsec
cycles per sec = 2 * 10^9

cycles per remote memory access
= cycles per sec * remote memory access time
= 2 * 10^9 * 200 * 10^-9
= 400

b. CPI
= base CPI + cycles per remote access * fraction of instructions with a remote access (0.2%)
= 0.5 + 400 * 0.002
= 0.5 + 0.8
= 1.3

c. Number of Instructions, I
= clock rate * CPU time / CPI
= 2 * 10^9 * 3600 /0.5
= 144 * 10^11

Time taken for remote access
= Number of remote access instructions * cycles per remote access / clock rate
= 0.002 * 144 * 10^11 * 400 / (2 * 10^9)
= 5760 sec

Net total time
= Time taken for remote access + Time taken for local accesses
= 5760 + 3600
= 9360 sec

d. Effective number of instructions
= 96% of 144 * 10^11 / 32 + 4% of 144 * 10^11
= 7% of 144 * 10^11

Expected Execution time
= Effective number of instructions * effective CPI / clock rate
= 7% of 144 * 10^11 * 1.3 / 2 * 10^9
= 655.2 sec

Alternatively,

By Amdahl's Law:

Speedup
= 1 / (0.04 + 0.96/32)
= 14.2857

Effective execution time
= Net total time from 2.c / Speedup
= 9360 * (0.04 + 0.96/32)
= 655.2 sec
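The chain of calculations in parts a through d can be reproduced in a short Python sketch (not part of the original solution; it uses a 0.2% remote-access fraction, which is what the 0.8-cycle and 5760-sec figures imply):

```python
# Parts a-d: remote-access cycles, effective CPI, total time, and the
# Amdahl's-law speedup for the 32-way parallel portion.

clock = 2e9                        # 2 GHz
cpr = clock * 200e-9               # a: cycles per remote access = 400
cpi = 0.5 + cpr * 0.002            # b: effective CPI = 1.3
instrs = clock * 3600 / 0.5        # c: instructions in the 1-hour local run
remote_time = 0.002 * instrs * cpr / clock
total = 3600 + remote_time         # c: 9360 sec
speedup = 1 / (0.04 + 0.96 / 32)   # d: 96% parallelized 32 ways
print(cpr, cpi, remote_time, total, round(total / speedup, 1))
```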

3.

Every request message causes a reply.
a. rate of each link = 1000/0.000600 = 1,666,666.67 bytes/sec

b. msgs/sec/node = 4 x 100 x 2 = 800 msgs/sec; demand = 800 x 1000 =
800,000 bytes/sec

c. load at switch = 2000 x 800,000 = 1.6 GBytes/sec, so no, the switch cannot
handle the load

d. switch can handle 1,000,000,000/(2000 x 4) = 125,000 bytes/sec/core

e. 125,000/1000 = 125 msgs/sec/core

f. not enough information
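The request/reply arithmetic above fits in a few lines of Python (not part of the original solution; variable names are illustrative):

```python
# Each request generates a reply, so every message is counted twice.

link_rate = 1000 / 600e-6              # a: bytes/sec per link
msgs_per_node = 4 * 100 * 2            # b: 4 cores x 100 msg/s, x2 for replies
demand = msgs_per_node * 1000          # b: bytes/sec per node
load = 2000 * demand                   # c: total offered load at the switch
ok = load <= 1_000_000_000             # c: vs. 1 GB/sec switch capacity
per_core = 1_000_000_000 / (2000 * 4)  # d: fair share in bytes/sec/core
print(link_rate, demand, load, ok, per_core, per_core / 1000)
```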


Errata and explanations:
--------------------------

2.c corrected as:
What is the CPU time of the program on a single processor with the remote accesses as specified?

3.f is not possible to solve with given data.

Common rule of thumb
---------------------
1.a: The G in Gbit/sec is 10^9, not 2^30. Here is the rule of thumb: prefixes in serial transfer rates are decimal (G = 10^9), while prefixes in storage capacities and chunk reads or writes are binary (G = 2^30). The same goes for K and M. That is, when 1 MB is copied from disk to memory, M means 2^20; a transfer rate of 1 MB per sec means 10^6 bytes per sec.
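A two-line illustration of the rule of thumb (the constants are just the decimal and binary interpretations of "M"):

```python
# Decimal prefix for the rate, binary prefix for the chunk.
RATE_M, CHUNK_M = 10**6, 2**20

transfer_rate = 1 * RATE_M        # a "1 MB/sec" link: 1,000,000 bytes/sec
file_size = 1 * CHUNK_M           # a "1 MB" file: 1,048,576 bytes
print(file_size / transfer_rate)  # copying takes slightly over one second
```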


Emails:
--------

Upon further reflection (and a suggestion by Dr. Boral), I believe that
problem 2.c should read

c. What is the CPU time of the program on a single processor but with
the remote accesses as specified?



In question 3.f, I ask for the CPU utilization; CPU utilization is the
percentage of the elapsed
time that the CPU is busy (not idle). The idea here is that using the
info in question 3.b, you
should assume that program is using the CPU all of the time and that if
it generates 100 messages/sec,
then this would be 0.010 second of CPU time per message. Now, if the
program sends a message
after 0.010 seconds of cpu time, then there is an idle period (while the
message is being sent and then
received). So the elapsed time is the cpu time + the idle time and the
cpu utilization is

(cpu time/message)/(cpu time/message + idle time/message)

My older wrong answer:
------------------------
3.
a.
1 Gbit/sec link
= 125*10^6 bytes per sec

b. Bandwidth demand by single node
= 4 cores per node * 100 messages per sec per core * 1000 bytes per message
= 400,000 bytes per sec per node

c. Load by 2000 nodes on switch
= 2000 nodes * Load per node
= 2000 nodes * 400,000 bytes per sec per node
= 8 * 10^8 bytes per sec
< 1 Gbyte per sec

Therefore, yes, the switch can handle the load imposed by the 2000 nodes.

d. Maximum off-node message rate that the switch will allow
= total capacity of switch in bits per sec / Load in bits per message
= (8 * 10^9 bits per sec) / (100 * 8 bits per message)
= 10^6 messages per sec allowed

Total cores
= 200 nodes * 4 cores per node
= 8000

Maximum off-node message rate PER CORE that the switch will allow
= Maximum off-node message rate that the switch will allow / Total cores
= 10^6 messages per sec allowed / 8000
= 125

e. Using data from 3.b: each process (per core) generates 100 messages per sec. As per 3.c this doesn't saturate the bus.
Thus 100 messages per sec is the achieved message rate per core on this system.

f. (Insufficient data, please see errata at the end of this document)

Assignment#10
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/nov12/nov12.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_84fzwg52

1. Textbook: 8.18

Max sustainable simultaneous disk transfers
= Max sustainable bandwidth / Disk Transfer Rate
= Max sustainable bandwidth / 50 MB per sec (from page 570)

Now we need to calculate max sustainable bandwidth.

For 4 words:

We calculate the bus bandwidth for 4 words.
Bandwidth
= Size of 4 words / Time to transfer 4 words
= 4 * 32 / Time to Transfer 4 words

Time to transfer 4 words
= actual transfer time + memory access time
= (Total cycles required to transfer 4 words / clock speed) + memory access time.

Memory access time = 200ns (characteristic 4)

Total cycles = 2 (characteristic 2: 1 per 64 bit data transfer) + 1 (characteristic 2: to send address) + 2 (characteristic 3: between each bus operation) = 5

Therefore,
Time to transfer 4 words
= 5 cycles / 200MHz + 200 ns = 225 ns.

Bandwidth
= Size of 4 words / Time to transfer 4 words
= 4 * 32 / Time to Transfer 4 words
= 4 * 32 bits / 225 ns
= 4 * 4000 / 225 MB per sec
= 71.11 MB per sec

Similarly,
For 16 words:

We calculate the bus bandwidth for 16 words.

Bandwidth
= Size of 16 words / Time to transfer 16 words
= 16 * 32 / Time to Transfer 16 words

Time to transfer 16 words
= Total cycles required to transfer 16 words / clock speed + memory access time.

Memory access time = 200ns + 20 * 3 = 260 ns(characteristic 4: 3 additional set of four words)

Total cycles = 8 (characteristic 2: 1 per 64 bit data transfer) + 1 (characteristic 2: to send address) + 2 (characteristic 3: between each bus operation) = 11
Therefore,
Time to transfer 16 words
= 11 cycles / 200MHz + 260 ns = 55 ns + 260 ns = 315 ns.

Bandwidth
= Size of 16 words / Time to transfer 16 words
= 16 * 32 / Time to Transfer 16 words
= 16 * 32 bits / 315 ns
= 16 * 4000 / 315 MB per sec
= 203.1746 MB/sec

For 4-word block transfers, the bus bandwidth is 71.11 MB/sec. For 16-
word block transfers, the bus bandwidth was 203.175 MB/sec. The disk drive has a
transfer rate of 50 MB/sec. Thus for 4-word blocks we could sustain 71/50 = 1
simultaneous disk transfers, and for 16-word blocks we could sustain 203/50 = 4
simultaneous disk transfers. The number of simultaneous disk transfers is inherently
an integer and we want the sustainable value. Thus, we take the floor of the
quotient of bus bandwidth divided by disk transfer rate.
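The two bandwidth computations and the sustainable-transfer floor above can be checked with a small Python sketch (not part of the original solution; the cycle counts and access times come from the problem's characteristics):

```python
# Bus bandwidth for a block transfer, then floor(bandwidth / 50 MB/sec).

CLOCK = 200e6                    # 200 MHz bus

def bandwidth(words, cycles, access_ns):
    """Bytes/sec for one block transfer of `words` 32-bit words."""
    t = cycles / CLOCK + access_ns * 1e-9
    return words * 4 / t

bw4 = bandwidth(4, 5, 200)       # 5 bus cycles, 200 ns memory access
bw16 = bandwidth(16, 11, 260)    # 11 bus cycles, 260 ns memory access
print(round(bw4 / 1e6, 2), round(bw16 / 1e6, 2))  # 71.11 203.17
print(int(bw4 / 50e6), int(bw16 / 50e6))          # 1 4 simultaneous disks
```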

2. Textbook: 8.43
Redo the example on page 601, but instead assume that the reads are 8KB
reads. You can assume that the reads are always
to an idle disk, if one is available.

(We need to do exactly as per the text, just instead of 64KB reads we have 8KB reads)

Maximum I/O rate of CPU
= Instructions per sec / Instructions per I/O
= 3*10^9 / (200,000 + 100,000)
= 10,000 I/O per sec

Each I/O transfers 8KB, so

Maximum I/O rate of bus
= Bus Bandwidth / Bytes per I/O
= 1000*10^6 / 8,000
= 125,000 I/Os per sec

The CPU is the bottleneck and therefore, the maximum sustainable I/O rate is 10,000 I/Os per sec

Now, we need to calculate the maximum number of disks possible while within the maximum sustainable I/O rate bound
Max disks
<= Max sustainable I/O rate / I/O rate of disk
= Max sustainable I/O rate * Time per I/O at disk

Time per I/O at disk
= (Seek + rotational) + Transfer
= (6 ms) + 8 KB / 75 MB per sec
= 6.1066 ms

Therefore,
Max disks
<= Max sustainable I/O rate * Time per I/O at disk
= 10,000 I/Os per sec * 6.1066 ms
= 61.066

Max disks = 61

To compute the number of SCSI buses, we need to check the average transfer rate per disk to see if we can saturate the bus, which is given by

Transfer rate
= Transfer Size / Transfer Time
= 8 KB / 6.1066 ms
= 1.31 MB/sec

The SCSI Controller has max sustainable rate of 320 MB/sec and therefore can very easily handle 7 disks each of transfer rate 1.31 MB/sec.

Therefore,

Number of SCSI Controllers required
>= Number of disks required / number of disks per controller
= 61 / 7
= 8.714


Number of SCSI Controllers required = 9
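The bottleneck analysis above can be reproduced in Python (not part of the original solution; constants are taken from the problem as restated):

```python
import math

# 8 KB reads: find the bottleneck, then size disks and SCSI controllers.

cpu_rate = 3e9 / (200_000 + 100_000)    # 10,000 I/O per sec
bus_rate = 1000e6 / 8000                # 125,000 I/O per sec
max_rate = min(cpu_rate, bus_rate)      # CPU is the bottleneck

io_time = 0.006 + 8e3 / 75e6            # seek+rotation plus transfer, sec
max_disks = math.floor(max_rate * io_time)
controllers = math.ceil(max_disks / 7)  # up to 7 disks per SCSI chain
print(max_rate, max_disks, controllers)
```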


3.

a. Maximum operations per second that the disk can sustain
= 1/average delay
= 1/ ( 0.0075 sec per operation)
= 133.33 operation per sec

b. Number of disk units required
>= arrival rate of operations / maximum operations per second
= 3 * 550 operations per sec / 133.33
= 12.375

Therefore, 13 units will be required

c. With a cache of hit rate 40% the number of disk units required
= Number of disk units required for misses which are 60% of total accesses
>= miss_rate * Number of disk units required without cache
= 0.6 * Number of disk units required without cache
= 0.6 * 12.375
= 7.425

Therefore, 8 units will be required

d. We have,
Number of disk units required with cache >= miss_rate * (Number of disk units required without cache)
=> 3 >= miss_rate * (Number of disk units required without cache)
=> 3 >= (1-hit_rate) * (Answer of 3.b)
=> hit_rate >= (1- (3/12.375))
= 0.75757

Therefore, required hit rate is 76%
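Parts a through d above fit in a few lines of Python (not part of the original solution):

```python
import math

ops_max = 1 / 0.0075             # a: 133.33 operations/sec per disk unit
arrivals = 3 * 550               # offered load, operations per sec
disks = arrivals / ops_max       # b: 12.375 -> 13 units
disks_cached = 0.6 * disks       # c: with a 40% hit rate -> 7.425 -> 8 units
hit = 1 - 3 / disks              # d: hit rate needed to get down to 3 units
print(math.ceil(disks), math.ceil(disks_cached), round(hit, 4))
```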

Assignment#3
Fall 2007 Schwetman
CS352 – Assignment #3
Sept. 17, 2007
Weight: 50 points
Due date: Monday, Sept. 24, 2007 (beginning of class)
1. For each of the following C statements, give a sequence of MIPS assembly
statements that compute the same result. Assume that all of the memory variables
are located in global memory and can be accessed using the “la” instruction
(pseudo-op).
a. i = i + 5;
la $t0, i
lw $t1, 0($t0)
addi $t1, $t1, 5
sw $t1, 0($t0)

b. y = (x – 1) + (z + 10);

la $t1, x
la $t2, z
lw $t4, 0($t1)
addi $t4, $t4, -1
lw $t5, 0($t2)
addi $t5, $t5, 10
add $t3, $t4, $t5
la $t0, y
sw $t3, 0($t0)


c. if(x < 100) y = 5;

la $t0, x
lw $t2, 0($t0)
slti $t1, $t2, 100
beq $t1, $zero, end
la $t0, y
li $t1, 5
sw $t1, 0($t0)
end:

d. y = func(a, b); // do not write the code for func()
la $t0, a
la $t1, b
lw $a0, 0($t0)
lw $a1, 0($t1)
jal func
la $t0, y
sw $v0, 0($t0)

e. swap(&a, &b); // do not write the code for swap()
la $t0, a
la $t1, b
jal swap

2. Textbook, problems for Chapter 4 (starting on page 272)
4.1

For P1, M2 is 4/3 (2 sec/1.5 sec) times as fast as M1. For P2, M1 is 2 times as
fast (10 sec/5 sec) as M2.

4.2

We know the number of instructions and the total time to execute the program.
The execution rate for each machine is simply the ratio of the two values.
Thus, the instructions per second for P1 on M1 is
(5×10^9 instructions/2 seconds)
= 2.5×10^9 IPS,
and the instructions for P1 on M2 is
(6×10^9 instructions/1.5 seconds)
= 4×10^9IPS.


4.3

M2 runs 4/3 as fast as M1, but it costs 8/5 as much. As 8/5 is more than 4/3,
M1 has the better value.

4.17

cpuTime = InstructionCount * CPI / clocksPerSec
CPI = cpuTime * clocksPerSec/InstructionCount

CPI on M1 = 2 * 4/5 = 1.6
CPI on M2 = 1.5 *6/6 = 1.5

4.7

(a) expected CPU time = InstructionCount *CPI / ClocksPerSec = 1.2 sec
(b) 1.2/3 = 40%

4.11

Program P running on machine M takes (10^9 cycles/second) * 10 seconds = 10^10 cycles. P′ takes (10^9 cycles/second) * 9 seconds = 9×10^9 cycles, which leaves 10^9 fewer cycles after the optimization. Every time we replace a mult with two adds, we save 4 - 2 * 1 = 2 cycles per replacement. Thus, there must have been 10^9 cycles / (2 cycles/replacement) = 5×10^8 replacements to turn P into P′.
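The cycle-count argument in 4.11, as arithmetic:

```python
# Cycles before and after the optimization, and replacements implied.

cycles_p = 1e9 * 10           # P:  10^10 cycles
cycles_p2 = 1e9 * 9           # P': 9 x 10^9 cycles
saved_per_swap = 4 - 2 * 1    # mult (4 cycles) -> two adds (2 cycles)
swaps = (cycles_p - cycles_p2) / saved_per_swap
print(swaps)                  # 5e8 replacements
```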

3. We collect data for a program executing on a processor, as follows:
Group    CPI  Rel. freq.
Load      5   0.24
Store     4   0.12
ALU       4   0.55
Br/jmp    3   0.04
Other     4   0.05
Assume that the number of instructions executed is 12 x 10^9, and that the clock rate
is 1.7 GHz.
a. What is the average CPI for this program on this processor?

Average CPI = sumproduct(cpi, freq) / sum(freq) = 5(0.24) + 4(0.12) + 4(0.55) + 3(0.04) + 4(0.05) = 4.20

b. How much CPU time will the program require?

CPU time = Instructions * CPI/clock rate = 12 * 4.20/1.7 = 29.6471 sec
[Rest of the calculations are similar to slides of Sep 17 class.]

c. A simple optimizer will reduce the number of loads by 11% and the number of
ALU instructions by 15%. Give the average CPI and the number of instructions
executed for the optimized program.
Avg. cpi = 4.1948
New number of instructions executed = 1.0693 * (10^10)
d. What is the execution time for the optimized program?
new time = 26.3857 sec
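The table arithmetic for parts a through d can be reproduced in Python (not part of the original solution; the spreadsheet's 26.3857 figure differs in the last digit due to intermediate rounding):

```python
# Weighted CPI before and after the optimizer trims loads and ALU ops.

cpi = {'Load': 5, 'Store': 4, 'ALU': 4, 'Br/jmp': 3, 'Other': 4}
freq = {'Load': 0.24, 'Store': 0.12, 'ALU': 0.55, 'Br/jmp': 0.04, 'Other': 0.05}
N, clock = 12e9, 1.7e9

avg = sum(cpi[g] * freq[g] for g in cpi)         # a: 4.20
print(round(avg, 2), round(N * avg / clock, 4))  # b: 29.6471 sec

# c: 11% fewer loads, 15% fewer ALU instructions
freq2 = dict(freq, Load=0.24 * 0.89, ALU=0.55 * 0.85)
scale = sum(freq2.values())                      # surviving fraction: 0.8911
n2 = N * scale                                   # c: new instruction count
avg2 = sum(cpi[g] * freq2[g] for g in cpi) / scale
print(round(avg2, 4), n2, n2 * avg2 / clock)     # d: new execution time
```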
Assignment#9
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/nov07/nov7.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_83fb2cxr

[Score Distribution 1:10, 2:10, 3:10]

1. Textbook: 8.1, 8.3

8.1

Maximum transactions per second supported

For System A:
Upper bound by processor
= 500 million instructions per second / (10,000 instructions per operation * 5 operations per transaction)
= 10,000 transactions per second

Upper bound by I/O
= 1500 operations per sec / 5 operations per transaction
= 300 transactions per sec

Net upper bound
= 300 transactions per sec

For System B:
Upper bound by processor (same processor as system A)
= 10,000 transactions per second

Upper bound by I/O
= 1000 I/O operations per sec / 5 operations per transaction
= 200 transactions per sec

Net upper bound
= 200 transactions per sec

8.3

Number of files that can be sent
= Total energy available / energy consumed per transfer including browsing

Total energy available
= 100k joules

[Assuming both M to be the same (either 10^6 or 2^20)]
Energy consumed per transfer
= browsing energy + transfer energy
= 350 J + 40 W * 40 MB*8 bits per B / 5 Mbits per sec
= 350 + 2560 J
= 2910 J

Number of files that can be sent
= Total energy available / energy consumed per transfer including browsing
= floor (100,000 / 2910)
= 34
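The energy arithmetic, assuming decimal prefixes for both MB and Mbit as noted above:

```python
import math

battery = 100_000               # joules available
browse = 350                    # joules spent finding each file
send = 40 * (40e6 * 8) / 5e6    # 40 W for the 64-sec transfer
per_file = browse + send        # 2910 J
print(send, per_file, math.floor(battery / per_file))  # 2560.0 2910.0 34
```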

2. We have a disk unit with the following properties:
· rotational delay: 15,000 rpm
· controller overhead: 0.5 msec
· transfer rate: 25 Mbytes/sec
· average seek time: 7.5 msec
a. What is the average time to read a 512 byte sector?
b. What is the time to read 4 consecutive sectors on a track in one operation?
c. What is the maximum rate of disk operations (each operation reads/writes 512
bytes) for this disk unit on this system?
d. If the disk unit operates at this maximum rate, what is the data transfer rate,
assuming that 512 bytes are transferred per operation?

a. Average delay = average seek time + rotational latency + transfer time + controller overhead
= 7.5 msec + 0.5 rotation/(15,000 rotations per min / 60 seconds per min) sec + 512 bytes / 25 Mbytes per sec + 0.5 msec
= 7.5 + 2.0 + 0.01953 + 0.5 msec
= 10.01953 msec

b. For 4 consecutive sectors we need not seek disk heads again. Similarly, rotational delay and controller overhead remain the same. The only delay that changes is transfer time which is now four times the previous. Thus, the average delay is:
= 7.5 + 2.0 + 4* 0.01953 + 0.5
= 10.078 msec

c. (Assuming that the disk operations are not necessarily consecutive.)
If we read or write 512 bytes without stopping, each operation will take on an average 10.01953 msec.
Maximum rate of disk operations = 1/average delay = 1 operation/10.01953 msec = 99.805 operations per sec

d. If the disk operates at this maximum rate then the data transfer rate is given by:
Data transfer rate = bytes transferred per operation * operations per sec = 512 bytes per operation * 99.805 operations per sec = 51,100.2 bytes per sec = 51.1 kbytes per sec (49.9 KB per sec with binary K)
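Parts a through d above, as a Python sketch (not part of the original solution; the transfer rate is interpreted with binary M, i.e., 25 Mbytes/sec = 25 * 2^20 bytes/sec, which matches the 0.01953-ms figure):

```python
# Average delay for one sector, four sectors, and the resulting max rate.

seek, overhead = 7.5e-3, 0.5e-3
rotate = 0.5 / (15000 / 60)               # half a rotation at 15,000 rpm
xfer = 512 / (25 * 2**20)                 # one 512-byte sector

t1 = seek + rotate + xfer + overhead      # a
t4 = seek + rotate + 4 * xfer + overhead  # b: only transfer time grows
rate = 1 / t1                             # c: operations per sec
print(round(t1 * 1e3, 5), round(t4 * 1e3, 3),
      round(rate, 3), round(512 * rate, 1))
```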


3. We receive the following sequence of bytes (8 bits plus 1 bit parity); odd parity is
used. Indicate which bytes are correctly received and which are in error:
a. 0x110100010
b. 0x110100000
c. 0x111100001
d. 0x101100011
e. 0x101000011


a. 110100010
four 1s in total (even), violating odd parity. Error
b. 110100000
three 1s (odd). Correct
c. 111100001
five 1s (odd). Correct
d. 101100011
five 1s (odd). Correct
e. 101000011
four 1s (even). Error

You might have interpreted the numbers as hex 0x110100010 etc. which is also good enough if you have dealt with odd parity correctly.
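The check is mechanical: under odd parity, the whole 9-bit codeword (8 data bits plus parity) must contain an odd number of 1s.

```python
# Odd-parity check over the whole received codeword.

def odd_parity_ok(word):
    return word.count('1') % 2 == 1

for w in ['110100010', '110100000', '111100001', '101100011', '101000011']:
    print(w, 'correct' if odd_parity_ok(w) else 'Error')
```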
Assignment#7
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/oct22/asg7.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_78gqkgj6

Distribution: 1:10 2:15 3:15 4:10
1.

7.9)
For each address x, its location in the cache is x modulo 16. Going through the addresses, the only hit is the second reference to 11; all other references miss. In particular, the second reference to 3 is not a hit because 19 (19 mod 16 = 3) had overwritten 3 in the cache.

The final content of cache is:

Block   Address
0       48
1       -
2       2
3       3
4       4
5       21
6       6
7       -
8       -
9       -
10      -
11      11
12      -
13      13
14      -
15      -


7.10)

Similar to the above question; however, this time we have 4 cache sets of size 4 words each. Notice that we have fewer overwrites and more hits.
2 - miss, 3 - miss, ... (all misses), ... 11 - hit, 3 - hit, ... (all misses), ... 11 - hit.
Final cache contents:

Set 0: 16, 64, 48, 4
Set 1: 21, 13
Set 2: 2, 22, 6
Set 3: 3, 11, 19, 27
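The direct-mapped case (7.9) can be replayed in a few lines of Python. The reference stream below is an assumption taken from the textbook exercise, but it reproduces every fact stated above (exactly one hit, on the second 11, and the listed final contents):

```python
# Direct-mapped cache of 16 one-word blocks; block index is addr mod 16.

refs = [2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, 11]
cache, hits = {}, 0
for a in refs:
    idx = a % 16
    if cache.get(idx) == a:
        hits += 1          # only the second reference to 11 hits
    cache[idx] = a
print(hits, sorted(cache.items()))
```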


2.
a.
The total cache is 1 MB (2^20 bytes) and each cache line holds 16 bytes (2^4), so 4 bits go to the byte offset. Number of cache lines = total cache / line size = 2^16, so we need 16 index bits of the 32-bit address. Each cache line holds 16 bytes, i.e., four 32-bit words. The remaining 12 bits form the tag.


31          20 19           4 3               0
-----------------------------------------------
| tag (12)   | index (16)   | byte_offset (4) |
-----------------------------------------------

b.
i. 0x40010200

0100 0000 0000 |0001 0000 0010 0000 |0000
tag | index |byte_offset
0x400 0x1020 0
MISS

ii. 0x40010204

0100 0000 0000 |0001 0000 0010 0000 |0100
tag | index |byte_offset
0x400 0x1020 4
MISS

iii. 0x40010228

0100 0000 0000 |0001 0000 0010 0010 |1000
tag | index |byte_offset
0x400 0x1022 8
MISS

iv. 0x4001021c

0100 0000 0000 |0001 0000 0010 0001 |1100
tag | index |byte_offset
0x400 0x1021 c
MISS

v. 0x5001020c

0101 0000 0000 |0001 0000 0010 0000 |1100
tag | index |byte_offset
0x500 0x1020 c
MISS

vi. 0x50010210

0101 0000 0000 |0001 0000 0010 0001 |0000
tag | index |byte_offset
0x500 0x1021 0
MISS

vii. 0x50010214

0101 0000 0000 |0001 0000 0010 0001 |0100
tag | index |byte_offset
0x500 0x1021 4
MISS

viii. 0x50010228

0101 0000 0000 |0001 0000 0010 0010 |1000
tag | index |byte_offset
0x500 0x1022 8
MISS
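The address decompositions in questions 2 and 3 can be checked mechanically (not part of the original solution):

```python
# Split a 32-bit address into tag / index / byte offset.

def split(addr, tag_bits=12, index_bits=16, off_bits=4):
    off = addr & ((1 << off_bits) - 1)
    index = (addr >> off_bits) & ((1 << index_bits) - 1)
    tag = (addr >> (off_bits + index_bits)) & ((1 << tag_bits) - 1)
    return tag, index, off

# Direct-mapped layout from question 2: 12-bit tag, 16-bit index.
print([hex(x) for x in split(0x40010200)])  # ['0x400', '0x1020', '0x0']
print([hex(x) for x in split(0x5001020c)])  # ['0x500', '0x1020', '0xc']
# 4-way layout from question 3: 14-bit tag, 14-bit index.
print([hex(x) for x in split(0x50010228, 14, 14)])  # ['0x1400', '0x1022', '0x8']
```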


3.
a)
Each doubling of associativity halves the number of sets, decreasing the number of index bits by one and increasing the number of tag bits by one. Since the cache is now organized as a 4-way set associative cache, we need 14 (i.e., 16 - 2) bits for the set index and 14 (i.e., 12 + 2) bits for the tag.

31          18 17           4 3               0
-----------------------------------------------
| tag (14)   | index (14)   | byte_offset (4) |
-----------------------------------------------

b)
i. 0x40010200

0100 0000 0000 00|01 0000 0010 0000|0000
tag | index |byte_offset
0x1000 0x1020 0
MISS

ii. 0x40010204

0100 0000 0000 00|01 0000 0010 0000 |0100
tag | index |byte_offset
0x1000 0x1020 4
MISS

iii. 0x40010228

0100 0000 0000 00|01 0000 0010 0010 |1000
tag | index |byte_offset
0x1000 0x1022 8
MISS

iv. 0x4001021c

0100 0000 0000 00|01 0000 0010 0001 |1100
tag | index |byte_offset
0x1000 0x1021 c
MISS

v. 0x5001020c

0101 0000 0000 00|01 0000 0010 0000 |1100
tag | index |byte_offset
0x1400 0x1020 c
MISS

vi. 0x50010210

0101 0000 0000 00|01 0000 0010 0001 |0000
tag | index |byte_offset
0x1400 0x1021 0
MISS

vii. 0x50010214

0101 0000 0000 00|01 0000 0010 0001 |0100
tag | index |byte_offset
0x1400 0x1021 4
MISS

viii. 0x50010228

0101 0000 0000 00|01 0000 0010 0010 |1000
tag | index |byte_offset
0x1400 0x1022 8
MISS


4.
7.14)
The miss penalty is the time to transfer one block from main memory to the
cache. Assume that it takes 1 clock cycle to send the address to the main memory.
a. Configuration (a) requires 16 main memory accesses to retrieve a cache
block, and words of the block are transferred 1 at a time.
Miss penalty = 1 + 16 x 10 + 16 x 1 = 177 clock cycles.
b. Configuration (b) requires 4 main memory accesses to retrieve a cache block,
and words of the block are transferred 4 at a time.
Miss penalty = 1 + 4 x 10 + 4 x 1 = 45 clock cycles.
c. Configuration (c) requires 4 main memory accesses to retrieve a cache
block, and words of the block are transferred 1 at a time.
Miss penalty = 1 + 4 x 10 + 16 x 1 = 57 clock cycles.


7.15)

a) narrow memory 177 cycles
b) wide memory 45 cycles
c) interleaved memory 57

http://spreadsheets.google.com/pub?key=p7FynqhFWcPPSHRcJFDD7cQ

cycles  CPI = 1.2 + .005 * cycles  Ratio vs. wide memory
177     2.085                      1.4632 (2.085/1.425)
45      1.425                      1.0000
57      1.485                      1.0421 (1.485/1.425)

Therefore, with wide memory the processor is 1.46 times faster than with narrow memory and 1.04 times faster than with interleaved memory. The gap between wide and narrow memory is much larger than the gap between wide and interleaved memory.
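The miss penalties from 7.14 and the CPIs from 7.15 can be combined in one short sketch (not part of the original solution):

```python
# Miss penalty = address cycle + memory accesses + word-transfer cycles,
# then CPI = 1.2 + 0.005 * penalty.

def penalty(accesses, cycles_per_access, transfer_cycles):
    return 1 + accesses * cycles_per_access + transfer_cycles

narrow = penalty(16, 10, 16)      # 177 cycles
wide = penalty(4, 10, 4)          # 45 cycles
interleaved = penalty(4, 10, 16)  # 57 cycles

for p in (narrow, wide, interleaved):
    print(p, round(1.2 + 0.005 * p, 3))
print(round((1.2 + 0.005 * narrow) / (1.2 + 0.005 * wide), 3))  # 1.463
```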

Assignment#6
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/oct15/asg6.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_66hgj34g


6.14)
IF/ID: PC+4 = 104, rs = 2, rt = 1, rd =9, immed = 01001 00000 100010 (rd, shamt, funct)
ID/EX: PC+4 = 100, Reg A = 14, Reg B = 13, rs = 4, rt = 3, rd =8, sgnExtImmed = 0000 0000 0000 0000 01000 00000 100100 (rd(8), shamt(0), funct(24hex) of 'and' left shifted by 16)
EX/ME: ALUout = 15 (12 OR 11), dst = 7
ME/WB: mdr = 1052 (from lw), ALUout = 25 (13+12), dst = 6



6.17)
On its 2nd cycle an instruction reads from registers when it decodes (Instruction Decode/ID)
On its 5th cycle an instruction does a Write Back (WB)
Now,
At the end of the first cycle, instruction 1 is fetched.
At the end of the second cycle, instruction 1 reads registers.
At the end of the third cycle, instruction 2 reads registers.
At the end of the fourth cycle, instruction 3 reads registers.
At the end of the fifth cycle, instruction 4 reads registers, and instruction 1 writes
registers.
Therefore, at the end of the fifth cycle of execution, registers $6 and $1 are being
read (for the 4th instruction, add $7, $6, $1) and register $2 will be written (for the 1st instruction, add $2, $3, $1).

6.18 The forwarding unit is seeing if it needs to forward. It is looking at the
instructions in the fourth and fifth stages and checking to see whether they intend
to write to the register file and whether the register written is being used as an ALU
input. The instructions in 3rd, 4th and 5th stage at end of 5th cycle are 3rd, 2nd and 1st instructions:
add $5, $3, $7
sub $4, $3, $5
add $2, $3, $1
Thus, it is comparing 3 = 4? 3 = 2? 7 = 4? 7 = 2?

6.19 The hazard detection unit is checking to see whether the instruction in the
ALU stage is an lw instruction and whether the instruction in the ID stage is reading
the register that the lw will be writing. If it is, it needs to stall. If there is an lw
instruction, it checks to see whether the destination is register 6 or 1 (the registers
being read in the fourth instruction add $7, $6, $1).

6.36 Prediction accuracy = 100% * PredictRight/TotalBranches
a. Branch 1: prediction: T-T-T, right: 3, wrong: 0
Branch 2: prediction: T-T-T-T, right: 0, wrong: 4
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Branch 5: prediction: T-T-T-T-T-T-T, right: 5, wrong: 2
Total: right: 15, wrong: 10
Accuracy = 100% * 15/25 = 60%
b. Branch 1: prediction: N-N-N, right: 0, wrong: 3
Branch 2: prediction: N-N-N-N, right: 4, wrong: 0
Branch 3: prediction: N-N-N-N-N-N, right: 3, wrong: 3
Branch 4: prediction: N-N-N-N-N, right: 1, wrong: 4
Branch 5: prediction: N-N-N-N-N-N-N, right: 2, wrong: 5
Total: right: 10, wrong: 15
Accuracy = 100% * 10/25 = 40%
c. Branch 1: prediction: T-T-T, right: 3, wrong: 0
Branch 2: prediction: T-N-N-N, right: 3, wrong: 1
Branch 3: prediction: T-T-N-T-N-T, right: 1, wrong: 5
Branch 4: prediction: T-T-T-T-N, right: 3, wrong: 2
Branch 5: prediction: T-T-T-N-T-T-N, right: 3, wrong: 4
Total: right: 13, wrong: 12
Accuracy = 100% * 13/25 = 52%
d. Branch 1: prediction: T-T-T, right: 3, wrong: 0
Branch 2: prediction: T-N-N-N, right: 3, wrong: 1
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Branch 5: prediction: T-T-T-T-T-T-T, right: 5, wrong: 2
Total: right: 18, wrong: 7
Accuracy = 100% * 18/25 = 72%
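The predictors used in parts a through d can be sketched in Python. The outcome sequence below is hypothetical (the per-branch outcomes come from the problem statement, not this document), but the alternating pattern happens to be consistent with Branch 3's counts above: 1 right for the 1-bit predictor (part c) and 3 right for the 2-bit predictor (part d):

```python
# One-bit (predict last outcome) and two-bit saturating-counter predictors.

def predict_1bit(outcomes, start='T'):
    state, right = start, 0
    for o in outcomes:
        right += (state == o)
        state = o
    return right

def predict_2bit(outcomes, state=3):  # 3 = strongly taken
    right = 0
    for o in outcomes:
        right += ((state >= 2) == (o == 'T'))
        state = min(3, state + 1) if o == 'T' else max(0, state - 1)
    return right

seq = list('TNTNTN')   # hypothetical Branch-3-like alternating pattern
print(predict_1bit(seq), predict_2bit(seq))  # 1 3
```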

6.39 Rearrange the instruction sequence such that the instruction reading a value
produced by a load instruction is right after the load. In this way, there will be a
stall after the load since the load value is not available till after its MEM stage.
lw $2, 100($6)

add $4, $2, $3
lw $3, 200($7)
add $6, $3, $5
sub $8, $4, $6
lw $7, 300($8)
beq $7, $8, Loop


Assignment#5
(Diagrams are courtesy Preston E. Lee, Jr.)

Problem Statement: http://www.cs.utexas.edu/users/hds/cs352/assignments/oct08/asg5.pdf
Sample Solution: http://docs.google.com/Doc?id=dgt9wvxh_63dszvfp


Points Distribution(Question:Points)--> 1:15 (3,4,4,4) 2:20 (5,5,10) 3:15

Q.1)
6.1)
a. It will not affect because the slowest instruction still takes 200 ps.
b. It will affect, because now the slowest instruction takes 250 ps.

6.2)
a. 100 microseconds
b. 20 times
c. both

6.3)
The memory stage of lw forwards to the execute-stage input of the next instruction. For the remaining instructions in the pipeline, the output of the execute stage is forwarded to the execute-stage input of the next instruction.

6.4)

The last add instruction will cause a one-cycle stall so that the output of the memory stage of the lw can be sent to the input of the execute stage of the last add. This is because the last add needs $6, which depends on the result of the lw. The remaining dependencies, all caused by the dependence on $3 produced by the first add, can be resolved by forwarding.

Q.2)
6.7 & 6.8)
Very similar to the original figures, IM (Instruction Fetch in instruction memory) of sw starts one cycle after IM of add.

6.9)
Drawn similar to the given diagrams, take care that instructions are different now and half shade the pipeline registers: for read shade the right half and for write shade the left half. Also notice that in the sixth and final cycle sw does not write back and actually nothing happens.






Q.3)

Two forwards: 0x1004 in cycle 3 from EX/ME.alu to ID/EX.A, and 0x20 in cycle 5 from ME/WB.mdr to ID/EX.B.
One stall: a noop is inserted into IF/ID in cycle 3 so that the result of lw can be forwarded (the ME/WB.mdr to ID/EX.B forward in cycle 5) to the input of the last add instruction.

Final values:
$8 = 0x1004
$9 = 0x20
$18 = 0x28
