Some pranks I really appreciate, but most are just not that interesting. Saying something untrue and then being thrilled that others believed it is lame and ordinary. For example, telling someone "hey, there's a lizard behind you", watching her turn to look, and then shouting "April Fool! April Fool!" is very kiddish. There is no originality in it, and no tension around whether the claim is true or untrue. I am not boasting, but back when I was an inactive and somewhat fat and lazy kid, such a joke was too mundane and boring for me to fall for. In high school I would not even bother turning to see a lizard; it was too much physical effort for too little payoff. At worst the lizard would jump at me, and even that would cost me less effort than turning my head to look at it on the wall. I would call such April Fool pranks first order.
I like to play pranks myself, but of a different order. A second-order April Fool prank is stating a truth that no one knows about, creating immense confusion among the listeners. A third-order prank is making a statement that is actually false, but delivering it as if it were a second-order prank, a supposed truth said to fool the lesser people. For example, if you are not actually married and you announce "I got married, guys", people assume you are telling an unbelievable truth to make fools of them, and in assuming that, they actually get fooled. In the same way you can define an Nth-order April Fool: one where the value of N is itself the confusion. No one really knows it (except for a few chosen ones, to keep the game on), and everyone else is just guessing ..
Is the statement true? Is it false? Is it actually false but said in a way that fools people into thinking a true statement is being made to appear false? Is it actually true but said in a way that makes people believe it is an April Fool? Confusion .. that is April Fool for me. Confusion is the very essence of a good prank, and I try to make the most of it, sometimes hurting others too, but that's for some other day!!
Friday, April 1, 2011
Saturday, December 8, 2007
Assignment#4
Problem Statement: http://www.cs.utexas.edu/users/hds/cs352/assignments/sep24/sep24.pdf
Sample Solution: http://docs.google.com/Doc?id=dgt9wvxh_54c7ggxk
Points Distribution (Question: Points) --> 1:10, 2:20, 3:20
1. a) jr rs
Jump register is implemented as PC = R[rs]
rs is Instruction[25-21] in Fig 5.17, which selects a register. The data read appears on Read data 1, which is R[rs]. To load this value into PC, we add a multiplexer between the output of Read data 1 and the input of PC.
Extra Controls
A new control for the new multiplexer is added to the control unit. When a jr instruction is decoded, this control is asserted. We do not care about the other signals.
b) lui rt, imm
Load upper immediate is implemented as
R[rt] := (imm << 16); PC = PC + 4;
The immediate (Instruction[15-0]) is shifted left by 16 bits and routed through a new multiplexer to the Write data input of the register file. A control for this multiplexer is added to the control unit.
Extra Controls
The new control is asserted (1).
rt is Instruction[20-16], so RegDst is 0.
Since lui writes to a register, RegWrite is 1.
Branch is 0, since we want PC to simply increment.
MemRead and MemWrite are 0, since we don't read or write memory.
We don't care about the ALU and other controls.
2)
Assume that we have a multi-cycle MIPS processor as described in figure 5.28.
Give a complete description of how the following instructions are executed. Your
description must give the settings of all of the control signals and the contents of
each of the affected registers for each cycle:
a. sub $t3, $t1, $t2 equivalently sub $11, $9, $10
b. lw $s0, 4($t4) equivalently lw $16, 0x0004($12)
Assume that the following registers contain:
$t1 = 0x00000010, $t2 = 0x00000002, $t3 = 0x00000000, $t4 = 0x10001000
mem[0x10001000] = 0xfffffffd,
mem[0x10001004] = 0x00000789,
mem[0x10001008] = 0x00001111
Ans)
The control signals are as in Fig 5.38, with a correction in state 4: RegDst = 0, MemtoReg = 1. The effects on the registers are as follows:
a) sub $t3, $t1, $t2 equivalently sub $11, $9, $10
(1) Instruction Fetch
IR <= Memory[PC]
PC <= PC + 4
IR = 0x012A5822
(2) Instruction Decode and Register Fetch
A <= Reg[IR[25:21]]; (A:=R[rs])
B <= Reg[IR[20:16]]; (B:=R[rt])
A = 0x00000010, B = 0x00000002
(3) Execution
ALUOut <= A op B; (ALUOut=A-B)
ALUOut = 0x0000000E
(4) R-type Instruction Completion
Reg[IR[15:11]] <= ALUOut; (R[rd]=ALUOut)
$t3 = 0x0000000E
b) lw $s0, 4($t4) equivalently lw $16, 0x0004($12)
(1) Instruction Fetch
IR <= Memory[PC]
PC <= PC + 4
IR = 0x8D900004
(2) Instruction Decode and Register Fetch
A <= Reg[IR[25:21]]; (A:=R[rs])
A = 0x10001000
(3) Memory Address Computation
ALUOut <= A + sign-extend (IR[15-0]);
ALUOut = 0x10001004
(4) Memory Access
MDR <= Memory [ALUOut];
MDR = 0x00000789
(5) Memory Read Completion
Reg[IR[20:16]] <= MDR;
$s0 = 0x00000789
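The two IR values above can be sanity-checked by packing the MIPS instruction fields in Python. This is a sketch; the helper names are my own, not from the assignment.

```python
# Encode the two instructions traced above and check their IR values:
# sub $11, $9, $10 (R-type) and lw $16, 4($12) (I-type).

def encode_r(rs, rt, rd, funct, opcode=0, shamt=0):
    """Pack an R-type MIPS instruction into a 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(opcode, rs, rt, imm):
    """Pack an I-type MIPS instruction into a 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

SUB_FUNCT = 0x22   # funct code for sub
LW_OPCODE = 0x23   # opcode for lw

ir_sub = encode_r(rs=9, rt=10, rd=11, funct=SUB_FUNCT)   # sub $11, $9, $10
ir_lw = encode_i(LW_OPCODE, rs=12, rt=16, imm=4)         # lw $16, 4($12)

print(hex(ir_sub))  # 0x12a5822, i.e. 0x012A5822
print(hex(ir_lw))   # 0x8d900004
```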
3.
http://spreadsheets.google.com/pub?key=p7FynqhFWcPPy8TPnE0f0Yg
a. CPI = 4.19
b. Instruction count = time * clocks per sec / clocks per instr = 63,007,159,905
c. New CPI = 4.56
d. Execution time = instr. count * clocks per instr / clocks per sec = 95.77088305 sec
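These spreadsheet entries are just the standard performance identities. As a sketch, working backwards from the values shown, the numbers are consistent with a 3 GHz clock and an 88-second baseline run; neither is stated here, so treat both as inferred assumptions.

```python
# Assumed givens (inferred by working backwards from the spreadsheet values;
# the original problem statement is not reproduced here).
clock_hz = 3e9    # assumed 3 GHz clock
time_sec = 88.0   # assumed baseline execution time
cpi = 4.19        # part a
new_cpi = 4.56    # part c

# b. Instruction count = time * clocks per sec / clocks per instruction
instr_count = time_sec * clock_hz / cpi

# d. New execution time = instruction count * new CPI / clocks per sec
new_time = instr_count * new_cpi / clock_hz

print(round(instr_count))   # ~63,007,159,905
print(round(new_time, 4))
```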
Assignment#11
Problem: http://www.cs.utexas.edu/users/hds/cs352/assignments/nov26/nov26.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_88hh77pv
Scoring key: [-2 for each incorrect part]
1.
a. Speedup = 1/(.99/100 + 0.01) = 50.2513
b. Speedup = 1 / ( fraction(enhanced)/speedup(enhanced) + (1-fraction(enhanced)) )
After the algebra:
fraction(enhanced) = (1-1/Speedup)/(1-1/speedup(enhanced))
= 0.997475
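A quick numeric check of this inversion. The target overall speedup of 80 and the enhancement factor of 100 are inferred from the 0.997475 result, since the problem statement is not reproduced here.

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of the work is enhanced by factor s."""
    return 1.0 / (f / s + (1.0 - f))

def fraction_needed(target, s):
    """Invert Amdahl's Law: fraction that must be enhanced by factor s
    to reach the target overall speedup."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)

# Part a: 99% of the work sped up 100x
print(amdahl_speedup(0.99, 100))       # ~50.2513

# Part b: inferred target speedup of 80 with the same 100x enhancement
f = fraction_needed(80, 100)
print(f)                               # ~0.997475
print(amdahl_speedup(f, 100))          # ~80, closing the round trip
```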
2.
a. Remote memory access time = 200 nsec
cycles per sec = 2 * 10^9
cycles per remote memory access
= cycles per sec * remote memory access time
= 2 * 10^9 * 200 * 10^-9
= 400
b. CPI
= base CPI + cycles per remote access * fraction of instructions involving remote memory access
= 0.5 + 400 * 0.002
= 0.5 + 0.8
= 1.3
c. Number of Instructions, I
= clock rate * CPU time / CPI
= 2 * 10^9 * 3600 /0.5
= 144 * 10^11
Time taken for remote access
= Number of remote access instructions * CPI of remote access instructions / clock rate
= 0.002 * 144 * 10^11 * 400 / (2 * 10^9)
= 5760 sec
Net total time
= Time taken for remote access + Time taken for local accesses
= 5760 + 3600
= 9360 sec
d. Effective number of instructions
= 96% of 144 * 10^11 / 32 + 4% of 144 * 10^11
= 7% of 144 * 10^11
Expected Execution time
= Effective number of instructions * effective CPI / clock rate
= 7% of 144 * 10^11 * 1.3 / 2 * 10^9
= 655.2 sec
Alternatively,
By Amdahl's Law:
Speedup
= 1 / (0.04 + 0.96/32)
= 14.2857
Effective execution time
= Net total time from 2.c / Speedup
= 9360 * (0.04 + 0.96/32)
= 655.2 sec
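The whole chain for question 2 can be sketched in Python. Note the remote-access fraction used is 0.2%, which is what the 5760-second figure implies; variable names are mine.

```python
clock = 2e9           # 2 GHz
remote_ns = 200e-9    # remote access latency
base_cpi = 0.5
remote_frac = 0.002   # 0.2% of instructions access remote memory
local_time = 3600.0   # one hour of local CPU time

# a. cycles per remote access
remote_cycles = clock * remote_ns                          # 400

# b. effective CPI
cpi = base_cpi + remote_cycles * remote_frac               # 1.3

# c. instruction count, remote time, total time
instr = clock * local_time / base_cpi                      # 1.44e13
remote_time = remote_frac * instr * remote_cycles / clock  # 5760 s
total = local_time + remote_time                           # 9360 s

# d. Amdahl: 96% of the work parallelized 32 ways
speedup = 1 / (0.04 + 0.96 / 32)
print(total / speedup)                                     # 655.2 s
```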
3.
Every request message causes a reply.
a. rate of each link = 1000/0.000600 = 1,666,666.67 bytes/sec
b. msgs/sec/node = 4 x 100 x 2 = 800 msgs/sec; demand = 800 x 1000 = 800,000 bytes/sec
c. load at switch = 2000 x 800,000 = 1.6 GBytes/sec - so no, the switch cannot handle the load
d. switch can handle 1,000,000,000/(2000 x 4) = 125,000 bytes/sec/core
e. 125,000/1000 = 125 msgs/sec/core
f. not enough information
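A sketch of the arithmetic behind this answer (variable names are mine; the request-plus-reply doubling is the key step):

```python
bytes_per_msg = 1000
latency = 600e-6            # 600 microseconds per message
cores_per_node, nodes = 4, 2000
msgs_per_core = 100         # each core issues 100 requests/sec
switch_capacity = 1e9       # 1 GByte/sec switch

# a. per-link rate
link_rate = bytes_per_msg / latency               # ~1.67e6 bytes/sec

# b. each request also triggers a reply, doubling the traffic
node_msgs = cores_per_node * msgs_per_core * 2    # 800 msgs/sec/node
node_demand = node_msgs * bytes_per_msg           # 800,000 bytes/sec

# c. aggregate load vs. switch capacity
load = nodes * node_demand                        # 1.6e9 bytes/sec
print(load > switch_capacity)                     # True: the switch saturates

# d/e. fair share per core under the capacity limit
per_core_bytes = switch_capacity / (nodes * cores_per_node)   # 125,000
per_core_msgs = per_core_bytes / bytes_per_msg                # 125
print(per_core_msgs)
```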
Errata and explanations:
--------------------------
2.c corrected as:
What is the CPU time of the program on a single processor with the remote accesses as specified?
3.f is not possible to solve with given data.
Common rule of thumb
---------------------
1.a G in Gbit/sec is 10^9, not 2^30. Here is the rule of thumb: in a serial transfer rate, G means 10^9; in a block read or write of storage, it means 2^30. The same goes for K and M. That is, when 1 MB is copied from disk to memory, MB means 2^20 bytes, while a transfer rate of 1 MB per sec means 10^6 bytes per sec.
Emails:
--------
Upon further reflection (and a suggestion by Dr. Boral), I believe that problem 2.c should read:
c. What is the CPU time of the program on a single processor but with the remote accesses as specified?
In question 3.f, I ask for the CPU utilization; CPU utilization is the percentage of the elapsed time that the CPU is busy (not idle). The idea here is that, using the info in question 3.b, you should assume that the program is using the CPU all of the time and that if it generates 100 messages/sec, then this would be 0.010 second of CPU time per message. Now, if the program sends a message after 0.010 seconds of CPU time, then there is an idle period (while the message is being sent and then received). So the elapsed time is the CPU time + the idle time, and the CPU utilization is
(cpu time/message)/(cpu time/message + idle time/message)
My older wrong answer:
------------------------
3.
a.
1 Gbit/sec link
= 125*10^6 bytes per sec
b. Bandwidth demand by single node
= 4 cores per node * 100 messages per sec per core * 1000 bytes per message
= 400,000 bytes per sec per node
c. Load by 2000 nodes on switch
= 2000 nodes * Load per node
= 2000 nodes * 400,000 bytes per sec per node
= 8 * 10^8 bytes per sec
< 1 Gbyte per sec
Therefore, yes, the switch can handle the load imposed by the 2000 nodes.
d. Maximum off-node message rate that the switch will allow
= total capacity of switch in bits per sec / Load in bits per message
= (8 * 10^9 bits per sec) / (1000 * 8 bits per message)
= 10^6 messages per sec allowed
Total cores
= 2000 nodes * 4 cores per node
= 8000
Maximum off-node message rate PER CORE that the switch will allow
= Maximum off-node message rate that the switch will allow / Total cores
= 10^6 messages per sec allowed / 8000
= 125
e. Using data from 3.b: each process (per core) generates 100 messages per sec. As per 3.c this doesn't saturate the bus.
Thus 100 messages per sec is the achieved message rate per core on this system.
f. (Insufficient data, please see errata at the end of this document)
Assignment#10
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/nov12/nov12.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_84fzwg52
1. Textbook: 8.18
Max sustainable simultaneous disk transfers
= Max sustainable bandwidth / Disk Transfer Rate
= Max sustainable bandwidth / 50 MB per sec (from page 570)
Now we need to calculate max sustainable bandwidth.
For 4 words:
We calculate the bus bandwidth for 4 words.
Bandwidth
= Size of 4 words / Time to transfer 4 words
= 4 * 32 / Time to Transfer 4 words
Time to transfer 4 words
= actual transfer time + memory access time
= (Total cycles required to transfer 4 words / clock speed) + memory access time.
Memory access time = 200ns (characteristic 4)
Total cycles = 2 (characteristic 2: 1 per 64 bit data transfer) + 1 (characteristic 2: to send address) + 2 (characteristic 3: between each bus operation) = 5
Therefore,
Time to transfer 4 words
= 5 cycles / 200MHz + 200 ns = 225 ns.
Bandwidth
= Size of 4 words / Time to transfer 4 words
= 4 * 32 / Time to Transfer 4 words
= 4 * 32 bits / 225 ns
= 16 bytes / 225 ns
= 71.11 MB per sec
Similarly,
For 16 words:
We calculate the bus bandwidth for 16 words.
Bandwidth
= Size of 16 words / Time to transfer 16 words
= 16 * 32 / Time to Transfer 16 words
Time to transfer 16 words
= Total cycles required to transfer 16 words / clock speed + memory access time.
Memory access time = 200 ns + 20 * 3 = 260 ns (characteristic 4: 3 additional sets of four words at 20 ns each)
Total cycles = 8 (characteristic 2: 1 per 64 bit data transfer) + 1 (characteristic 2: to send address) + 2 (characteristic 3: between each bus operation) = 11
Therefore,
Time to transfer 16 words
= 11 cycles / 200MHz + 260 ns = 315 ns.
Bandwidth
= Size of 16 words / Time to transfer 16 words
= 16 * 32 / Time to Transfer 16 words
= 16 * 32 bits / 315 ns
= 64 bytes / 315 ns
= 203.1746 MB/sec
For 4-word block transfers, the bus bandwidth is 71.11 MB/sec. For 16-word block transfers, the bus bandwidth is 203.175 MB/sec. The disk drive has a transfer rate of 50 MB/sec. Thus for 4-word blocks we could sustain 71/50 = 1 simultaneous disk transfer, and for 16-word blocks we could sustain 203/50 = 4 simultaneous disk transfers. The number of simultaneous disk transfers is inherently an integer and we want the sustainable value, so we take the floor of the quotient of bus bandwidth divided by disk transfer rate.
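The bandwidth arithmetic above can be sketched as follows. The cycle counts and the extra 60 ns for the 16-word case follow the characteristics cited in the text; the function name is mine.

```python
clock_mhz = 200          # bus clock
mem_access_ns = 200      # base memory access time
word_bytes = 4
disk_rate = 50e6         # 50 MB/sec per disk

def bus_bandwidth(words, cycles, extra_mem_ns=0):
    """Bytes/sec for one block transfer: bus cycle time plus memory access time."""
    time_ns = cycles / clock_mhz * 1000 + mem_access_ns + extra_mem_ns
    return words * word_bytes / (time_ns * 1e-9), time_ns

bw4, t4 = bus_bandwidth(4, cycles=5)                        # 2 data + 1 addr + 2 idle
bw16, t16 = bus_bandwidth(16, cycles=11, extra_mem_ns=60)   # 3 extra 20 ns waits

print(t4, bw4 / 1e6)    # 225 ns, ~71.11 MB/sec
print(t16, bw16 / 1e6)  # 315 ns, ~203.17 MB/sec

# Sustainable simultaneous transfers: floor of bandwidth / disk rate
print(int(bw4 // disk_rate), int(bw16 // disk_rate))   # 1 4
```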
2. Textbook: 8.43
Redo the example on page 601, but instead assume that the reads are 8KB reads. You can assume that the reads are always to an idle disk, if one is available.
(We need to do exactly as per the text, just instead of 64KB reads we have 8KB reads)
Maximum I/O rate of CPU
= Instructions per sec / Instructions per I/O
= 3*10^9 / (200,000 + 100,000)
= 10,000 I/O per sec
Each I/O transfers 8KB, so
Maximum I/O rate of bus
= Bus Bandwidth / Bytes per I/O
= 1000*10^6 / 8,000
= 125,000 I/Os per sec
The CPU is the bottleneck and therefore, the maximum sustainable I/O rate is 10,000 I/Os per sec
Now, we need to calculate the maximum number of disks possible while within the maximum sustainable I/O rate bound
Max disks
<= Max sustainable I/O rate / I/O rate of disk
= Max sustainable I/O rate * Time per I/O at disk
Time per I/O at disk
= (Seek + rotational) + Transfer
= (6 ms) + 8 KB / 75 MB per sec
= 6.1066 ms
Therefore,
Max disks
<= Max sustainable I/O rate * Time per I/O at disk
= 10,000 I/Os per sec * 6.1066 ms
= 61.066
Max disks = 61
To compute the number of SCSI buses, we need to check the average transfer rate per disk to see if we can saturate the bus, which is given by
Transfer rate
= Transfer Size / Transfer Time
= 8 KB / 6.1066 ms
= 1.31 MB/sec
The SCSI Controller has max sustainable rate of 320 MB/sec and therefore can very easily handle 7 disks each of transfer rate 1.31 MB/sec.
Therefore,
Number of SCSI Controllers required
>= Number of disks required / number of disks per controller
= 61 / 7
= 8.714
Number of SCSI Controllers required = 9
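The sizing chain for question 2 as a Python sketch (variable names are mine):

```python
import math

ips = 3e9                          # instructions per second
instr_per_io = 200_000 + 100_000   # OS + application instructions per I/O
bus_bw = 1000e6                    # bytes/sec
io_bytes = 8_000                   # 8 KB reads
seek_rot_ms = 6.0                  # seek + rotational delay
disk_rate = 75e6                   # disk transfer rate, bytes/sec
disks_per_scsi = 7

cpu_io_rate = ips / instr_per_io           # 10,000 I/Os per sec
bus_io_rate = bus_bw / io_bytes            # 125,000 I/Os per sec
sustained = min(cpu_io_rate, bus_io_rate)  # CPU is the bottleneck

io_time_s = seek_rot_ms / 1000 + io_bytes / disk_rate    # ~6.1066 ms
max_disks = math.floor(sustained * io_time_s)            # 61
controllers = math.ceil(max_disks / disks_per_scsi)      # 9
print(max_disks, controllers)
```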
3.
a. Maximum operations per second that the disk can sustain
= 1/average delay
= 1/ ( 0.0075 sec per operation)
= 133.33 operations per sec
b. Number of disk units required
>= arrival rate of operations / maximum operations per second
= 3 * 550 operations per sec / 133.33
= 12.375
Therefore, 13 units will be required
c. With a cache of hit rate 40% the number of disk units required
= Number of disk units required for misses which are 60% of total accesses
>= miss_rate * Number of disk units required without cache
= 0.6 * Number of disk units required without cache
= 0.6 * 12.375
= 7.425
Therefore, 8 units will be required
d. We have,
Number of disk units required with cache >= miss_rate * (Number of disk units required without cache)
=> 3 >= miss_rate * (Number of disk units required without cache)
=> 3 >= (1-hit_rate) * (Answer of 3.b)
=> hit_rate >= (1- (3/12.375))
= 0.75757
Therefore, required hit rate is 76%
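Question 3's arithmetic as a short sketch:

```python
import math

service_time = 0.0075    # sec per disk operation
arrival = 3 * 550        # offered operations/sec

max_ops = 1 / service_time            # a. ~133.33 ops/sec per disk
disks_needed = arrival / max_ops      # b. 12.375 -> 13 disks
disks_cached = 0.6 * disks_needed     # c. 40% hit rate -> 7.425 -> 8 disks
hit_needed = 1 - 3 / disks_needed     # d. hit rate needed to get by with 3 disks

print(math.ceil(disks_needed), math.ceil(disks_cached), round(hit_needed, 4))
```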
Assignment#3
Fall 2007 Schwetman
CS352 – Assignment #3
Sept. 17, 2007
Weight: 50 points
Due date: Monday, Sept. 24, 2007 (beginning of class)
1. For each of the following C statements, give a sequence of MIPS assembly
statements that compute the same result. Assume that all of the memory variables
are located in global memory and can be accessed using the “la” instruction
(pseudo-op).
a. i = i + 5;
la $t0, i
lw $t1, 0($t0)
addi $t1, $t1, 5
sw $t1, 0($t0)
b. y = (x – 1) + (z + 10);
la $t1, x
la $t2, z
lw $t4, 0($t1)
addi $t4, $t4, -1
lw $t5, 0($t2)
addi $t5, $t5, 10
add $t3, $t4, $t5
la $t0, y
sw $t3, 0($t0)
c. if(x < 100) y = 5;
la $t0, x
lw $t2, 0($t0)
slti $t1, $t2, 100
beq $t1, $zero, end
la $t0, y
li $t1, 5
sw $t1, 0($t0)
end:
d. y = func(a, b); // do not write the code for func()
la $t0, a
la $t1, b
lw $a0, 0($t0)
lw $a1, 0($t1)
jal func
la $t0, y
sw $v0, 0($t0)
e. swap(&a, &b); // do not write the code for swap()
la $a0, a
la $a1, b
jal swap
2. Textbook, problems for Chapter 4 (starting on page 272)
4.1
For P1, M2 is 4/3 (2 sec/1.5 sec) times as fast as M1. For P2, M1 is 2 times as
fast (10 sec/5 sec) as M2.
4.2
We know the number of instructions and the total time to execute the program.
The execution rate for each machine is simply the ratio of the two values.
Thus, the instructions per second for P1 on M1 is
(5×10^9 instructions/2 seconds)
= 2.5×10^9 IPS,
and the instructions for P1 on M2 is
(6×10^9 instructions/1.5 seconds)
= 4×10^9IPS.
4.3
M2 runs 4/3 as fast as M1, but it costs 8/5 as much. As 8/5 is more than 4/3,
M1 has the better value.
4.17
cpuTime = InstructionCount * CPI / clocksPerSec
CPI = cpuTime * clocksPerSec/InstructionCount
CPI on M1 = 2 * 4/5 = 1.6
CPI on M2 = 1.5 *6/6 = 1.5
4.7
(a) expected CPU time = InstructionCount *CPI / ClocksPerSec = 1.2 sec
(b) 1.2/3 = 40%
4.11
Program P running on machine M takes (10^9 cycles/second) * 10 seconds = 10^10 cycles. P′ takes (10^9 cycles/second) * 9 seconds = 9×10^9 cycles, i.e. 10^9 fewer cycles after the optimization. Every time we replace a mult with two adds, we save 4 – 2 * 1 = 2 cycles per replacement. Thus, there must have been 10^9 cycles / (2 cycles/replacement) = 5×10^8 replacements to make P into P′.
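The cycle-counting argument for 4.11 as a sketch:

```python
clock = 1e9                  # cycles/sec on machine M
cycles_before = clock * 10   # P runs in 10 sec
cycles_after = clock * 9     # P' runs in 9 sec

saved = cycles_before - cycles_after   # 1e9 cycles saved
saved_per_swap = 4 - 2 * 1             # mult (4 cycles) -> two adds (1 cycle each)
replacements = saved / saved_per_swap
print(replacements)                    # 5e8
```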
3. We collect data for a program executing on a processor, as follows:
Group Cpi Rel. freq.
Load 5 0.24
Store 4 0.12
ALU 4 0.55
Br/jmp 3 0.04
Other 4 0.05
Assume that the number of instructions executed is 12 x 10^9, and that the clock rate
is 1.7 GHz.
a. What is the average CPI for this program on this processor?
Average CPI = sumproduct(cpi, freq) = 5(0.24) + 4(0.12) + 4(0.55) + 3(0.04) + 4(0.05) = 4.20
b. How much CPU time will the program require?
CPU time = Instructions * CPI/clock rate = 12 * 4.20/1.7 = 29.6471 sec
[Rest of the calculations are similar to slides of Sep 17 class.]
c. A simple optimizer will reduce the number of loads by 11% and the number of
ALU instructions by 15%. Give the average CPI and the number of instructions
executed for the optimized program.
Avg. cpi = 4.1948
New number of instructions executed = 1.0693 * (10^10)
d. What is the execution time for the optimized program?
new time = 26.3857 sec
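Question 3 end to end as a sketch (the group table is copied from above; variable names are mine):

```python
groups = {               # CPI and relative frequency per instruction group
    "load":  (5, 0.24),
    "store": (4, 0.12),
    "alu":   (4, 0.55),
    "br":    (3, 0.04),
    "other": (4, 0.05),
}
instr = 12e9
clock = 1.7e9

# a. average CPI
cpi = sum(c * f for c, f in groups.values())             # 4.20

# b. CPU time
time_b = instr * cpi / clock                             # ~29.6471 sec

# c. optimizer removes 11% of loads and 15% of ALU instructions
removed = 0.11 * groups["load"][1] + 0.15 * groups["alu"][1]
new_instr = instr * (1 - removed)                        # ~1.0693e10
new_work = cpi - (5 * 0.11 * 0.24 + 4 * 0.15 * 0.55)     # cycles left per old instr
new_cpi = new_work / (1 - removed)                       # ~4.1948

# d. new execution time
time_d = new_instr * new_cpi / clock                     # ~26.39 sec
print(round(cpi, 2), round(time_b, 4), round(new_cpi, 4), round(time_d, 4))
```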
Fall 2007 Schwetman
CS352 – Assignment #3
Sept. 17, 2007
Weight: 50 points
Due date: Monday, Sept. 24, 2007 (beginning of class)
1. For each of the following C statements, give a sequence of MIPS assembly
statements that compute the same result. Assume that all of the memory variables
are located in global memory and can be accessed using the “la” instruction
(pseudo-op).
a. i = i + 5;
la $t0, i
lw $t1, 0($t0)
addi $t1, $t1, 5
sw $t1, 0($t0)
b. y = (x – 1) + (z + 10);
la $t1, x
la $t2, z
lw $t4, 0($t1)
addi $t4, $t4, -1
lw $t5, 0($t2)
addi $t5, $t5, 10
add $t3, $t4, $t5
sw $t3, 0($t0)
c. if(x < 100) y = 5;
la $t0, x
lw $t2, 0($t0)
slti $t1, $t2, 100
beq $t1, $zero, end
la $t0, y
ldi $t1, 5
sw $t1, 0($t0)
end:
d. y = func(a, b); // do not write the code for func()
la $t0, a
la $t1, b
lw $a0, 0($t0)
lw $a1, 0($t1)
jal func
la $t0, y
sw v0, 0($t0)
e. swap(&a, &b); // do not write the code for swap()
la $t0, a
la $t1, b
jal swap
2. Textbook, problems for Chapter 4 (starting on page 272)
4.1
For P1, M2 is 4/3 (2 sec/1.5 sec) times as fast as M1. For P2, M1 is 2 times as
fast (10 sec/5 sec) as M2.
4.2
We know the number of instructions and the total time to execute the program.
The execution rate for each machine is simply the ratio of the two values.
Thus, the instructions per second for P1 on M1 is
(5×10^9 instructions/2 seconds)
= 2.5×10^9 IPS,
and the instructions for P1 on M2 is
(6×10^9 instructions/1.5 seconds)
= 4×10^9IPS.
2.5 * pow(10,9)
4*pow(10,9)
4.3
M2 runs 4/3 as fast as M1, but it costs 8/5 as much. As 8/5 is more than 4/3,
M1 has the better value.
4.17
cpuTime = InstructionCount * CPI / clocksPerSec
CPI = cpuTime * clocksPerSec/InstructionCount
CPI on M1 = 2 * 4/5 = 1.6
CPI on M2 = 1.5 *6/6 = 1.5
4.7
(a) expected CPU time = InstructionCount × CPI / ClockRate = 1.2 sec
(b) 1.2 sec / 3 sec = 40% of the available CPU time
4.11
Program P running on machine M takes (10^9 cycles/second) × 10 seconds = 10^10 cycles. P′ takes (10^9 cycles/second) × 9 seconds = 9×10^9 cycles, i.e. 10^9 fewer cycles after the optimization. Each time we replace a mult with two adds, we save 4 − 2 × 1 = 2 cycles. Thus, there must have been 10^9 cycles / (2 cycles/replacement) = 5×10^8 replacements to make P into P′.
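The 4.11 arithmetic above can be checked with a short sketch (the clock rate and run times are the ones stated in the solution):

```python
# Verify the 4.11 reasoning: cycles before/after the optimization and
# the implied number of mult -> two-adds replacements.
clock_hz = 10**9           # 1 GHz clock

cycles_P  = clock_hz * 10  # P runs in 10 s
cycles_P2 = clock_hz * 9   # P' runs in 9 s
saved = cycles_P - cycles_P2          # 10^9 cycles saved

# Each replacement swaps one 4-cycle mult for two 1-cycle adds.
saved_per_replacement = 4 - 2 * 1     # 2 cycles
replacements = saved // saved_per_replacement

print(saved)          # 1000000000
print(replacements)   # 500000000
```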
3. We collect data for a program executing on a processor, as follows:
| Group  | CPI | Rel. freq. |
| Load   | 5   | 0.24 |
| Store  | 4   | 0.12 |
| ALU    | 4   | 0.55 |
| Br/jmp | 3   | 0.04 |
| Other  | 4   | 0.05 |
Assume that the number of instructions executed is 12 × 10^9, and that the clock rate
is 1.7 GHz.
a. What is the average CPI for this program on this processor?
Average CPI = sumproduct(CPI, freq) = 5×0.24 + 4×0.12 + 4×0.55 + 3×0.04 + 4×0.05 = 4.20
b. How much CPU time will the program require?
CPU time = Instructions × CPI / clock rate = (12×10^9 × 4.20) / (1.7×10^9) = 29.6471 sec
[Rest of the calculations are similar to slides of Sep 17 class.]
c. A simple optimizer will reduce the number of loads by 11% and the number of
ALU instructions by 15%. Give the average CPI and the number of instructions
executed for the optimized program.
Avg. CPI = (5×0.2136 + 4×0.12 + 4×0.4675 + 3×0.04 + 4×0.05) / 0.8911 = 4.1948
(where 0.2136 = 0.24 × 0.89, 0.4675 = 0.55 × 0.85, and 0.8911 is the new total relative frequency)
New number of instructions executed = 12×10^9 × 0.8911 = 1.0693 × 10^10
d. What is the execution time for the optimized program?
new time = 26.3857 sec
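The whole of problem 3 can be reproduced with a short sketch (the slight difference in the last digit of the optimized time comes from intermediate rounding in the hand calculation):

```python
# Problem 3: average CPI, CPU time, and the effect of the optimizer
# (loads -11%, ALU instructions -15%).
cpi  = {'load': 5, 'store': 4, 'alu': 4, 'br': 3, 'other': 4}
freq = {'load': 0.24, 'store': 0.12, 'alu': 0.55, 'br': 0.04, 'other': 0.05}

instructions = 12e9
clock_hz = 1.7e9

avg_cpi = sum(cpi[g] * freq[g] for g in cpi)   # 4.20
cpu_time = instructions * avg_cpi / clock_hz   # ~29.65 s

# Optimizer: scale the load and ALU frequencies, then renormalize.
new_freq = dict(freq)
new_freq['load'] *= 1 - 0.11
new_freq['alu']  *= 1 - 0.15
remaining = sum(new_freq.values())             # 0.8911
new_instructions = instructions * remaining    # ~1.0693e10
new_avg_cpi = sum(cpi[g] * new_freq[g] for g in cpi) / remaining  # ~4.1948
new_time = new_instructions * new_avg_cpi / clock_hz              # ~26.39 s

print(round(avg_cpi, 2), round(cpu_time, 4))
print(round(new_avg_cpi, 4), round(new_time, 4))
```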
Assignment#9
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/nov07/nov7.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_83fb2cxr
[Score Distribution 1:10, 2:10, 3:10]
1. Textbook: 8.1, 8.3
8.1
Maximum transactions per second supported
For System A:
Upper bound by processor
= 500 million instructions per second / (10,000 instructions per operation * 5 operations per transaction)
= 10,000 transactions per second
Upper bound by I/O
= 1500 operations per sec / 5 operations per transaction
= 300 transactions per sec
Net upper bound
= 300 transactions per sec
For System B:
Upper bound by processor (same processor as system A)
= 10,000 transactions per second
Upper bound by I/O
= 1000 I/O operations per sec / 5 operations per transaction
= 200 transactions per sec
Net upper bound
= 200 transactions per sec
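The 8.1 calculation amounts to taking the smaller of the processor bound and the I/O bound; a minimal sketch using the figures stated above:

```python
# The sustained transaction rate is limited by the slower of the
# processor and the I/O subsystem.
def max_tps(instr_per_sec, instr_per_op, io_ops_per_sec, ops_per_txn):
    cpu_bound = instr_per_sec / (instr_per_op * ops_per_txn)
    io_bound = io_ops_per_sec / ops_per_txn
    return min(cpu_bound, io_bound)

print(max_tps(500e6, 10_000, 1500, 5))  # System A -> 300.0
print(max_tps(500e6, 10_000, 1000, 5))  # System B -> 200.0
```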
8.3
Number of files that can be sent
= Total energy available / energy consumed per transfer including browsing
Total energy available
= 100k joules
[Assuming both M prefixes (in MB and Mbits) mean the same thing — either 10^6 or 2^20 — so the choice cancels in the ratio]
Energy consumed per transfer
= browsing energy + transfer energy
= 350 J + 40 W * 40 MB*8 bits per B / 5 Mbits per sec
= 350 + 2560 J
= 2910 J
Number of files that can be sent
= Total energy available / energy consumed per transfer including browsing
= floor(100,000 / 2910) (only complete transfers fit in the energy budget)
= 34
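The 8.3 energy budget can be checked with a short sketch (assuming the decimal interpretation, 1 MB = 10^6 bytes and 1 Mbit = 10^6 bits, per the note above):

```python
import math

# Energy budget for file transfers: 100 kJ battery, 350 J of browsing
# per file, 40 W radio sending a 40 MB file at 5 Mbit/s.
battery_j = 100_000
browse_j = 350
transfer_s = 40e6 * 8 / 5e6          # 64 s on the air per file
transfer_j = 40 * transfer_s         # 2560 J of radio energy
per_file_j = browse_j + transfer_j   # 2910 J per file

files = math.floor(battery_j / per_file_j)  # only complete transfers count
print(files)  # 34
```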
2. We have a disk unit with the following properties:
· rotational speed: 15,000 rpm
· controller overhead: 0.5 msec
· transfer rate: 25 Mbytes/sec
· average seek time: 7.5 msec
a. What is the average time to read a 512 byte sector?
b. What is the time to read 4 consecutive sectors on a track in one operation?
c. What is the maximum rate of disk operations (each operation reads/writes 512
bytes) for this disk unit on this system?
d. If the disk unit operates at this maximum rate, what is the data transfer rate,
assuming that 512 bytes are transferred per operation?
a. Average delay = average seek time + rotational latency + transfer time + controller overhead
= 7.5 msec + 0.5 rotation/(15,000 rotations per min / 60 seconds per min) sec + 512 bytes / 25 Mbytes per sec + 0.5 msec
= 7.5 + 2.0 + 0.01953 + 0.5 msec
= 10.01953 msec
b. For 4 consecutive sectors we do not need another seek, and the rotational delay and controller overhead are unchanged. The only term that changes is the transfer time, which is now four times larger. Thus, the average delay is:
= 7.5 + 2.0 + 4* 0.01953 + 0.5
= 10.078 msec
c. (Assuming that the disk operations are not necessarily consecutive.)
If we read or write 512 bytes without stopping, each operation will take on an average 10.01953 msec.
Maximum rate of disk operations = 1/average delay = 1 operation/10.01953 msec = 99.805 operations per sec
d. If the disk operates at this maximum rate then the data transfer rate is given by:
Data transfer rate = bytes transferred per operation * operations per sec = 512 bytes per operation * 99.805 operations per sec = 51,100.2 bytes per sec = 51.1 kbytes/sec (decimal), i.e. about 49.9 KB/sec if 1 KB = 1024 bytes
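The disk timing above can be reproduced with a short sketch (assuming 1 Mbyte = 2^20 bytes for the transfer rate, which is what produces the 0.01953 ms figure):

```python
# Disk timing for problem 2: seek + rotational latency + transfer +
# controller overhead, then the maximum operation and data rates.
rpm = 15_000
seek_ms = 7.5
controller_ms = 0.5
rate_bps = 25 * 2**20            # 25 Mbytes/s, binary interpretation
sector = 512

rotation_ms = 0.5 * 60_000 / rpm          # half a revolution = 2.0 ms
transfer_ms = sector / rate_bps * 1000    # ~0.01953 ms per sector

avg_ms = seek_ms + rotation_ms + transfer_ms + controller_ms        # (a)
four_sector_ms = seek_ms + rotation_ms + 4 * transfer_ms + controller_ms  # (b)

ops_per_sec = 1000 / avg_ms               # (c) ~99.805
bytes_per_sec = sector * ops_per_sec      # (d) ~51,100

print(round(avg_ms, 5), round(four_sector_ms, 3))
print(round(ops_per_sec, 3), round(bytes_per_sec, 1))
```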
3. We receive the following sequence of bytes (8 bits plus 1 bit parity); odd parity is
used. Indicate which bytes are correctly received and which are in error:
a. 0x110100010
b. 0x110100000
c. 0x111100001
d. 0x101100011
e. 0x101000011
a. 110100010
number of 1s = 4 (even); odd parity requires an odd total. Error
b. 110100000
number of 1s = 3 (odd). Correct
c. 111100001
number of 1s = 5 (odd). Correct
d. 101100011
number of 1s = 5 (odd). Correct
e. 101000011
number of 1s = 4 (even). Error
The problem statement writes the values with a 0x prefix; interpreting them as hexadecimal is also acceptable, provided odd parity was applied correctly.
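The parity check reduces to one rule: with odd parity, all nine received bits (8 data + 1 parity) must contain an odd number of 1s. A minimal sketch:

```python
# Odd-parity check over the full received word (data bits plus parity bit).
def odd_parity_ok(bits: str) -> bool:
    return bits.count('1') % 2 == 1

received = ['110100010', '110100000', '111100001', '101100011', '101000011']
for b in received:
    print(b, 'correct' if odd_parity_ok(b) else 'error')
```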
Assignment#7
Problem: http://www.cs.utexas.edu/~hds/cs352/assignments/oct22/asg7.pdf
Solution: http://docs.google.com/Doc?id=dgt9wvxh_78gqkgj6
Distribution: 1:10 2:15 3:15 4:10
1.
7.9)
For each address x, its location in cache is x modulo 16. After going through all addresses we only have second 11 as hit, rest are all misses. Notice that second 3 is not a hit because 19 overwrote 3 in cache since 19%16 = 3.
The final content of the cache is:
| Cache (Block) | Address |
| 0 | 48 |
| 1 | |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 21 |
| 6 | 6 |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | 11 |
| 12 | |
| 13 | 13 |
| 14 | |
| 15 | |
7.10)
Similar to the above question; however, this time we have 4 cache sets of 4 words each. Notice that we have fewer overwrites and more hits.
2 - miss, 3 - miss, ... (all misses), ... 11 - hit, 3 - hit, ... (all misses), ... 11 - hit.
Final cache contents:
| Cache Set | | | | |
| 0 | 16 | 64 | 48 | 4 |
| 1 | 21 | 13 | | |
| 2 | 2 | 22 | 6 | |
| 3 | 3 | 11 | 19 | 27 |
2.
a.
The total cache is 1 MB (2^20 bytes) and each cache line holds 16 bytes (2^4), so 4 bits go to the byte offset. Number of cache lines = total cache / line size = 2^16, so we need 16 index bits of the 32-bit address. Each cache line holds 16 bytes, i.e. four 32-bit words. The remaining 12 bits form the tag.
31 20/19 4/3 0
-----------------------------------
tag (12)| index (16)| byte_offset (4)
-----------------------------------
b.
i. 0x40010200
0100 0000 0000 |0001 0000 0010 0000 |0000
tag | index |byte_offset
0x400 0x1020 0
MISS
ii. 0x40010204
0100 0000 0000 |0001 0000 0010 0000 |0100
tag | index |byte_offset
0x400 0x1020 4
MISS
iii. 0x40010228
0100 0000 0000 |0001 0000 0010 0010 |1000
tag | index |byte_offset
0x400 0x1022 8
MISS
iv. 0x4001021c
0100 0000 0000 |0001 0000 0010 0001 |1100
tag | index |byte_offset
0x400 0x1021 c
MISS
v. 0x5001020c
0101 0000 0000 |0001 0000 0010 0000 |1100
tag | index |byte_offset
0x500 0x1020 c
MISS
vi. 0x50010210
0101 0000 0000 |0001 0000 0010 0001 |0000
tag | index |byte_offset
0x500 0x1021 0
MISS
vii. 0x50010214
0101 0000 0000 |0001 0000 0010 0001 |0100
tag | index |byte_offset
0x500 0x1021 4
MISS
viii. 0x50010228
0101 0000 0000 |0001 0000 0010 0010 |1000
tag | index |byte_offset
0x500 0x1022 8
MISS
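The splits above follow mechanically from the field widths; a small helper, parameterized so the same code covers this direct-mapped split (12/16/4) and the 4-way split in question 3 (14/14/4):

```python
# Split a 32-bit address into (tag, index, byte offset) given field widths.
def split(addr, tag_bits, index_bits, offset_bits):
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return hex(tag), hex(index), hex(offset)

print(split(0x40010200, 12, 16, 4))  # direct-mapped -> ('0x400', '0x1020', '0x0')
print(split(0x5001020c, 12, 16, 4))  # -> ('0x500', '0x1020', '0xc')
print(split(0x40010200, 14, 14, 4))  # 4-way -> ('0x1000', '0x1020', '0x0')
```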
3.
a)
We know that each doubling of associativity halves the number of sets, which removes one bit from the cache index and adds one bit to the tag. Since the cache is now organized as a 4-way set-associative cache, we need 14 (i.e. 16 − 2) bits for the set index and 14 (i.e. 12 + 2) bits for the tag.
31 18/17 4/3 0
------------------------------------------------------
tag (14) | index (14) | byte_offset (4)
------------------------------------------------------
b)
i. 0x40010200
0100 0000 0000 00|01 0000 0010 0000|0000
tag | index |byte_offset
0x1000 0x1020 0
MISS
ii. 0x40010204
0100 0000 0000 00|01 0000 0010 0000 |0100
tag | index |byte_offset
0x1000 0x1020 4
MISS
iii. 0x40010228
0100 0000 0000 00|01 0000 0010 0010 |1000
tag | index |byte_offset
0x1000 0x1022 8
MISS
iv. 0x4001021c
0100 0000 0000 00|01 0000 0010 0001 |1100
tag | index |byte_offset
0x1000 0x1021 c
MISS
v. 0x5001020c
0101 0000 0000 00|01 0000 0010 0000 |1100
tag | index |byte_offset
0x1400 0x1020 c
MISS
vi. 0x50010210
0101 0000 0000 00|01 0000 0010 0001 |0000
tag | index |byte_offset
0x1400 0x1021 0
MISS
vii. 0x50010214
0101 0000 0000 00|01 0000 0010 0001 |0100
tag | index |byte_offset
0x1400 0x1021 4
MISS
viii. 0x50010228
0101 0000 0000 00|01 0000 0010 0010 |1000
tag | index |byte_offset
0x1400 0x1022 8
MISS
4.
7.14)
The miss penalty is the time to transfer one block from main memory to the
cache. Assume that it takes 1 clock cycle to send the address to the main memory.
a. Configuration (a) requires 16 main memory accesses to retrieve a cache
block, and words of the block are transferred 1 at a time.
Miss penalty = 1 + 16 x 10 + 16 x 1 = 177 clock cycles.
b. Configuration (b) requires 4 main memory accesses to retrieve a cache block
and words of the block are transferred 4 at a time.
Miss penalty = 1 + 4 x 10 + 4 x 1 = 45 clock cycles.
c. Configuration (c) requires 4 main memory accesses to retrieve a cache
block, and words of the block are transferred 1 at a time.
Miss penalty = 1 + 4 x 10 + 16 x 1 = 57 clock cycles.
7.15)
a) narrow memory: 177 cycles
b) wide memory: 45 cycles
c) interleaved memory: 57 cycles
http://spreadsheets.google.com/pub?key=p7FynqhFWcPPSHRcJFDD7cQ
| Miss penalty (cycles) | CPI = 1.2 + 0.005 × cycles | Ratio vs. wide memory |
| 177 | 2.085 | 1.4632 (2.085/1.425) |
| 45  | 1.425 | 1.0000 |
| 57  | 1.485 | 1.0421 (1.485/1.425) |
Therefore, the processor is 1.46 times faster with wide memory than with narrow memory, and 1.04 times faster with wide memory than with interleaved memory. The gap between wide and narrow memory is much larger than the gap between wide and interleaved memory.
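The 7.14/7.15 comparison can be checked end to end with a short sketch (each miss costs 1 cycle to send the address, 10 cycles per main-memory access, and 1 cycle per bus transfer, as stated above):

```python
# Miss penalty for each memory organization, then the CPI model
# CPI = 1.2 + 0.005 * miss_penalty from the spreadsheet above.
def miss_penalty(accesses, cycles_per_access, transfers, cycles_per_transfer=1):
    return 1 + accesses * cycles_per_access + transfers * cycles_per_transfer

narrow      = miss_penalty(16, 10, 16)  # 16 accesses, 1 word per transfer
wide        = miss_penalty(4, 10, 4)    # 4 accesses, 4 words per transfer
interleaved = miss_penalty(4, 10, 16)   # 4 overlapped accesses, 1 word per transfer

cpi = {name: 1.2 + 0.005 * p
       for name, p in [('narrow', narrow), ('wide', wide), ('interleaved', interleaved)]}

print(narrow, wide, interleaved)                    # 177 45 57
print(round(cpi['narrow'] / cpi['wide'], 4))        # ~1.4632
print(round(cpi['interleaved'] / cpi['wide'], 4))   # ~1.0421
```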