GCT CS
Final
quiz sample 1
CPU性能公式,简单计算 L01-ch1.pdf: 计算CPI: Page 33
| Op | Freqency | Cycle count |
|---|---|---|
| ALU op | 43% | 1 |
| Loads | 21% | 1 |
| Stores | 12% | 2 |
| Branches | 24% | 2 |
if store can do in 1 cycle but slow down clock 15%
awk "BEGIN {print 0.43 + 0.21 + 0.12*2 + 0.24*2 }" = 1.36
awk "BEGIN {print 0.43 + 0.21 + 0.12 + 0.24*2 }" = 1.24
t/p = P * CPI * T old = 1.36 new = 1.24 * 1.5 = 1.86
Don't make the change!
quiz sample 2
amdhal 加速比 spdup page 55- page56
加速比 = 1 / (未提升部分 + (提升部分/提升比例))
95%的部分得到提升,提升10%的话,加速比是1.094 5%的部分得到提升,提升10倍,加速比只有1.047
quiz sample 3
降低失效率的方法之一,编译器优化:
数组合并,循环交换,循环融合,分块
- Cache优化中的数组合并
/* Before: 2 sequential arrays */ int val[SIZE]; int key[SIZE];
/* After: 1 array of stuctures */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key; improve spatial locality 减少了key和value的冲突失效,提高了空间局部性
- Cache优化中的循环融合
/* Before */ for (i = 0; i < N; i = i+1) for (j = 0; j < N; j = j+1) a[i][j] = 1/b[i][j] * c[i][j]; for (i = 0; i < N; i = i+1) for (j = 0; j < N; j = j+1) d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
2 misses per access to a & c vs. one miss per access; improve spatial locality 2次miss vs 一次miss, 提高空间局部性
Quiz
- 什么是计算机体系结构?体系结构的分层模型? 简述Amdahl定律?
- 简述R-R、R-M、M-M型指令及其特点。
- 对于运算x=a-b+c,分别写出基于堆栈、累加器、Load/Store、R_M(1,3)、M_M(3,3)型ISA的伪代码指令序列。
- 结合CPU性能公式,分析CISC和 RISC指令集的功能设计。
- 编译技术对计算机体系结构有哪些方面的影响?
- CISC架构的主要缺陷?它为何能在计算机发展历程上长期占有重要地位?
- 何谓结构相关?一般有哪些解决方法? 何谓数据相关?一般有哪些解决方法?
- 何谓控制相关?一般有哪些解决方法?
- 什么是指令的动态调度和静态调度?各自有什么优缺点?
- 实现多指令流出,目前主要采用哪些技术?各自的技术特点是什么?
- 多指令流出处理器的性能受到哪些方面的限制?
- 什么是存储层次?哪些应用适合用单级存储结构?哪些应用适合使用多级存储结构?
- 引入cache的理论依据是什么?使用cache有什么优缺点?
- Cache的二种写策略及其特点?
- 简述cache失效的3C分类。
- 降低Cache的失效率主要有哪些方法?
- 降低Cache的失效开销主要有哪些方法?
- 降低Cache的命中时间主要有哪些方法?
- 对于一定容量的Cache,当块大小有小变大时,失效率为什么先下降,后又上升?对于16KB的Cache,合适的块大小一般是多少?
- 简述Victim Cache的原理?
- 对于磁盘存储介质,衡量记录密度的指标是什么?它又取决于哪二个参数指标?
- PCI、USB、FireWire、MMU、MPU、MIPS、CPI、IPL?
- 什么是RAID?简述RAID3、RAID5的主要技术特征。
- 简述向量机的部件组成,向量处理器的特点?对以下于运算, 试写出其对应的标量指令序列和向量指令序列。
For (I=0; I<100; I++) Q[I]=A[I]*B[I]+D[I]*(E[I]-5)
- 什么是SIMD?简述MIMD并行机的二种组成结构。
- 针对计算机体系结构的软件优化技术一般应考虑哪些方面?
Answers
什么是计算机体系结构?体系结构的分层模型? 简述Amdahl定律?
体系结构
In its broadest definition, computer architecture is the design of the
abstraction layers that allow us to implement information processing
applications efficiently using available manufacturing technologies.
计算机体系结构是对我们如何利用现有的制造技术来实现信息处理应用的抽象层的设计
分层模型
- Application 应用
- Algorithm 算法
- Programming Language 编程语言
- Operating System/Virtual Machinese 操作系统/虚拟机
- Instruction Set Architecture (ISA) 指令集
- Microarchitecture 微架构
- Gates/Register - Transfer Level (RLT) 寄存器,逻辑门,传输层
- Circuits 电路
- Devices 设备
- Physics 物理层
Amdhal 定律
⇒
加速比 = 1 / (未提升部分 + (提升部分/提升比例))
简述R-R、R-M、M-M型指令及其特点
- R-R
The basic operation in the processor is a register-to-register
instruction of the following format: This operation takes two source registers and , performs a dyadic operations on them and stores the resulting value in . As usual may be replaced by a small immediate constant.
- R-M
- Any operation can access memory
- First operand is loaded from the memory into a register
- Operation is performed on the register and the second operand (from the memory)
- Both operands and location of result are explicit
- Result is written into a register
- Result has to be explicitly stored back into memory
- M-M
- Operation is performed on the memory locations
- Result is written into the memory
对于运算x=a-b+c,分别写出基于堆栈、累加器、Load/Store、R_M(1,3)、M_M(3,3)型ISA的伪代码指令序列。
- Stack Based
push a
push b sub push c add pop x
- Accumulator
load a
sub b add c store x
- LOAD/STORE
load R1, a
load R2, b sub R3, R1, R2 load R4, c sub R5, R3, R4 store R5, x
- R-M
load R1, a
sub R3, R1, b add R4, R3, c store R4, x
- M-M
sub t, a, b
add x, t, c
结合CPU性能公式,分析CISC和 RISC指令集的功能设计
* CPU 性能公式:
From CISC to RISC
- Use fast RAM to build fast instruction cache of
user-visible instructions, not fixed hardware microroutines
- Can change contents of fast instruction memory to fit what
application needs right now
- Use simple ISA to enable hardwired pipelined
implementation
- Most compiled code only used a few of the available CISC
instructions
- Simpler encoding allowed pipelined implementations
- Further benefit with integration
- In early ‘80s, could finally fit 32-bit datapath + small caches
on a single chip
- No chip crossings in common case allows faster operation
* RISC use more instructions, thus use more memory and storage. * CISC make instruction complicatred and take more CPU cycle to execute.
CISC架构的主要缺陷?它为何能在计算机发展历程上长期占有重要地位?
* When memory is expensive and CPU clock is slow, CISC rules. But now things changed.
编译技术对计算机体系结构有哪些方面的影响?
- Schedules to maximize parallel execution 预计划增加执行并行度
- Guarantees intra-instruction parallelism 保证指令并行
- Schedules to avoid data hazards (no interlocks) 避免数据相关
- Typically separates operations with explicit NOPs 把操作和NOP隔离
何谓结构相关?一般有哪些解决方法?
- An instruction in the pipeline may need a resource being used by another instruction in the pipeline
流水线内一个指令可能需要其他指令使用中的资源
- Resolving Structural Hazards
- Structural hazards occurs when two instruction need same hardware resource at same time
- Can resolve in hardware by stalling newer instruction till older instruction finished with resource
新指令等待旧指令
- A structural hazard can always be avoided by adding more hardware to design
- E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory
增加硬件
- Our 5-stage pipe has no structural hazards by design
- Thanks to MIPS ISA, which was designed for pipelining
更好的设计
何谓数据相关?一般有哪些解决方法? 何谓控制相关?一般有哪些解决方法?
- An instruction may depend on something produced by an earlier instruction
指令依赖之前指令的生成物
- Dependence may be for a data value ⇒ data hazard
如果依赖的是数据,数据相关
- Dependence may be for the next instruction’s address ⇒ control hazard (branches, exceptions)
依赖的是地址,控制相关
- Resolving Data Hazards 解决数据相关
- Strategy 1:
- Wait for the result to be available by freezing earlier pipeline stages ⇒ interlocks
等待结果,冻结先前指令的状态
- Strategy 2:
- Route data as soon as possible after it is calculated to the earlier pipeline stage ⇒ bypass
尽早计算获取旁路,绕开
- Strategy 3:
- Speculate on the dependence. Two cases:
对依赖进行推测, 正确的话什么都不做,错误的话终止,并重启
- Guessed correctly ⇒ do nothing
- Guessed incorrectly ⇒ kill and restart
- Resolving control Hazards 解决控制相关
- Branches/Jumps 分支和跳转
- Exceptions/Interrupts 例外和中断
什么是指令的动态调度和静态调度?各自有什么优缺点?
TODO ??????
* 静态:
* Stall:直到分支方向确定
* 预测分支失败:直接执行后继指令,如果分支实际情况为分支成功,则撤销流水线中的指令对流水线状态的更新.
DLX分支指令平均47%为分支失败, 由于PC+4已经计算出来,因此可以用它来取下一条指令
* 预测分支成功: 平均53% DLX 分支为分支成功,但分支目标地址在ID段才能计算出目标地址
* 延迟转移:选择指令来填充延迟槽
- 动态
- 记分牌scoreboard:
- 没有定向数据通路
- 指令窗口较小,仅局限于基本块内的调度
- 功能部件数较少,容易产生结构相关
- 结构冲突时不能发射
- Tomasulo Algorithm: 性能受限于Common Data Bus
3、预测分支成功:平均53% DLX 分支为分支成功,但分支目标地址在ID段才能计算出目标地址 4、延迟转移:选择指令来填充延迟槽
实现多指令流出,目前主要采用哪些技术?各自的技术特点是什么?
To make CPI < 1: * Superscalar: 每个时钟周期发出的指令数目不定(1-8条) 由编译器或者硬件完成调度 IBM Power, SUN Ultra SPARC, DEC Alpha, HP 8000 * Very Long Instruction Words:: 每个时钟周期流出指令数目固定(4-16) 编译器调度,有多个操作合并成一条指令 DSP, 多媒体应用 1999/2000 HP与Intel达成协议共同研究VLIW
多指令流出处理器的性能受到哪些方面的限制?
TODO ???????
什么是存储层次?哪些应用适合用单级存储结构?哪些应用适合使用多级存储结构?
通过优化存储系统的组织来使得针对典型的应用平均访存时间最短 基本解决方法:多级层次结构 CPU-M1-M2 ...... Mn 存储系统速度接近速度最快M1,容量和价格接近最大最便宜的Mn。
引入cache的理论依据是什么?使用cache有什么优缺点?
CPU速度远远超过RAM读写速度 Processor-Memory Performance Gap “Tax”. Cache是CPU和主存之间的一个高速,小容量的存储器
Cache 减小CPU和RAM的Gap Cache 价格昂贵,容量小
Cache的二种写策略及其特点
- Write-Through vs Write-Back
- Write-Through
- 总是可以丢弃cache的数据 -- 内存里有最新数据
- Cache control bit 只有一个valid bit
- Memory或者Processor总有最新数据;Cache管理简单
- Write-Back
- 不能直接丢弃cache数据 -- 可能需要写入内存
- Cache control bit 有valid bit和dirty bit
- 应为数据经常被重写,带宽占用低得多;对于高延迟的Memory有更好的容忍度
简述cache失效的3C分类
- Compulsory 强制失效
- Capacity 容量失效
- Conflict 冲突失效
降低Cache的失效率主要有哪些方法?
- Larger cache 更大的Cache
- Reduce Misses via Larger Block Size 更大的Block Size
- Reduce Misses via Higher Associativity 更大的相关性
- Reducing Misses via Victim Cache 引入Victim Cache
- Reducing Misses via Pseudo-Associativity 伪相关性
- Reducing Misses by HW Prefetching Instr, Data 硬件预取指令和市局
- Reducing Misses by SW Prefetching Data 软件预取数据
- Reducing Misses by Compiler Optimizations 编译器优化
降低Cache的失效开销主要有哪些方法?
* Read priority over write on miss 读操作比写操作在失效上更优先 * Early Restart and Critical Word First on miss 及早重启及重要词先行 * Non-blocking Caches (Hit under Miss, Miss under Miss) 不阻塞的Cache, * Second Level Cache 二级Cache
降低Cache的命中时间主要有哪些方法?
* Small & Simple Caches 小而简单的Cache * Avoiding Address Translation 避免地址翻译 * Pipelining Caches 管道化Cache
对于一定容量的Cache,当块大小有小变大时,失效率为什么先下降,后又上升?对于16KB的Cache,合适的块大小一般是多少?
因为随着Cache块大小的增大,强制失效会减少,而冲突失效会增加16KB的Cache, 合适的块大小一般是64byte
简述Victim Cache的原理?
How to combine fast hit time of direct mapped yet still avoid conflict misses?
- Add buffer to place data discarded from cache
安排buffer存放Cache丢弃的数据
- Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
- Used in Alpha, HP machines
对于磁盘存储介质,衡量记录密度的指标是什么?它又取决于哪二个参数指标?
Areal Density = BPI x TPI 区域密度 = 单位英寸的Bits X 单位英寸的道数
- Metric is Bits Per Square Inch
BPI * TPI
- BPI: Bits Per Inch, Bits recorded along a track
- TPI: Tracks Per Inch, Number of tracks per surface
什么是RAID?简述RAID3、RAID5的主要技术特征。
- RAID: Redundant Arrays of (Inexpensive) Disks
- RAID3: Parity Disk. 校验位
- Parity computed horizontally
- Logically a single high data bw disk
- P contains sum of other disks per stripe mod 2 (“parity”)
- If disk fails, subtract P from sum of other disks to find missing information
- Sum computed across recovery group to protect against hard disk failures, stored in P disk
- Logically, a single high capacity, high transfer rate disk: good for large transfers
- Wider arrays reduce capacity costs, but decreases availability
- 33% capacity cost for parity in this configuration
- RAID5: High I/O Rate Interleaved Parity 交错校验
A RAID 5 uses block-level striping with parity data distributed across ALL member disks
Small write Algorithm: 1 Logical Write = 2 Physical Reads + 2 Physical Writes * Interleaved parity blocks * Independent reads and writes * Logical write = 2 reads + 2 writes
简述向量机的部件组成,向量处理器的特点?对以下于运算, 试写出其对应的标量指令序列和向量指令序列。
For (I=0; I<100; I++) Q[I]=A[I]*B[I]+D[I]*(E[I]-5)
Properties of Vector Processors:
- Single vector instruction implies lots of work (loop)
- Fewer instruction fetches
- Each result independent of previous result
- Multiple operations can be executed in parallel
- Simpler design, high clock rate
- Compiler (programmer) ensures no dependencies
- Reduces branches and branch problems in pipelines
- Vector instructions access memory with known pattern
- Effective prefetching
- Amortize memory latency of over large number of elements
- Can exploit a high bandwidth memory system
- No (data) caches required!
Components of a Vector Processor:
- Scalar CPU: registers, datapaths, instruction fetch logic
- Vector register
- Fixed length memory bank holding a single vector
- Typically 8-32 vector registers, each holding 1 to 8 Kbits
- Has at least 2 read and 1 write ports
- MM: Can be viewed as array of 64b, 32b, 16b, or 8b elements
- Vector functional units (FUs)
- Fully pipelined, start new operation every clock
- Typically 2 to 8 FUs: integer and FP
- Multiple datapaths (pipelines) used for each unit to process multiple elements per cycle
- Vector load-store units (LSUs)
- Fully pipelined unit to load or store a vector
- Multiple elements fetched/stored per cycle
- May have multiple LSUs
- Cross-bar to connect FUs , LSUs, registers
对以下于运算, 试写出其对应的标量指令序列和向量指令序列
For (I=0; I<100; I++) Q[I]=A[I]*B[I]+D[I]*(E[I]-5)
Scalar:
LD R0, 5
ADDI R1, Ri, #800
loop: ld R5, 0(Re)
subd R5, R5, R0
ld R6, 0(Rd)
multid R6, R5, R6
ld R7, 0(Ra)
ld R8, 0(Rb)
multid R7, R7, R8
addd R6, R6, R7
ld 0(Rq), R6
addi Rq, Rq, #8
addi Ri, Ra, #8
addi Ri, Rb, #8
addi Ri, Rd, #8
addi Ri, Re, #8
sub R99, R1, Ri
bnz R99, loop
Vector:
ld r0, 5 vld V1, Ra vld V2, Rb vld V3, Rd vld V4, Re vsub.sv V5, V4, R0 vmul.vv V6, V5, V3 vmul.vv V7, V1, V2 vadd.vv V8, V6, V7 vst Rq, V8
什么是SIMD?简述MIMD并行机的二种组成结构。
- SISD: Single Instruction, Single Data
- conventional uniprocessor
- SIMD: Single Instruction, Multiple Data
- one instruction stream, multiple data paths
- distributed memory SIMD (MPP, DAP, CM-1&2, Maspar)
- shared memory SIMD (STARAN, vector computers)
- MIMD: Multiple Instruction, Multiple Data
- message passing machines (Transputers, nCube, CM-5)
- non-cache-coherent shared memory machines (BBN Butterfly, T3D)
- cache-coherent shared memory machines (Sequent, Sun Starfire, SGI Origin)
简述MIMD并行机的二种组成结构:
- 分布存储器并行处理机
- 共享存储器并行处理机
- Symmetric Multiprocessor
- Multiple processors in box with shared memory communication
- Current MultiCore chips like this
- Every processor runs copy of OS
- Non-uniform shared-memory with separate I/O through host
- Multiple processors
- Each with local memory
- general scalable network
- Extremely light “OS” on node provides simple services
- Scheduling/synchronization
- Network-accessible host for I/O
- Cluster
- Many independent machine connected with general network
- Communication through messages
PCI、USB、FireWire、MMU、MPU、MIPS、CPI、IPL?
- PCI: 外围组建互联,一种用来连接计算机和硬件设备的计算机总线
Peripheral Component Interconnect, a computer bus for attaching hardware devices in a computer.
- USB: 连接计算机系统与外部设备的一个串口总线标准,也是一种输入输出接口技术规范
- FireWire:The IEEE 1394 interface is a serial bus interface standard for high-speed communications and isochronous real-time data transfer
IEEE 1394,別名火线(FireWire)接口,是由苹果公司领导的开发联盟开发的一种高速传送接口,IEEE 1394是由蘋果電腦所創,其他製造商也已獲得授權生產
- MMU: 内存管理单元(英语:memory management unit,缩写为MMU),有时称作分页内存管理单元(英语:paged memory management unit,缩写为PMMU)。它是一种负责处理中央处理器(CPU)的内存访问请求的计算机硬件
- MPU: 电子计算机的主要设备之一。其功能主要是解释计算机指令以及处理计算机软件中的数据。
- MIPS: Million Instructions per second (IPS)单字长定点指令平均执行速度, 是一種計算電腦中央處理器速度的記量單位. 是一種採取精簡指令集(RISC)的處理器架構,1981年出現,由MIPS科技公司開發並授權
- CPI: 一条指令运行需要的时钟周期
In computer architecture, cycles per instruction (aka clock cycles per instruction, clocks per instruction, or CPI) is a term used to describe one aspect of a processor's performance: the number of clock cycles that happen when an instruction is being executed
- IPL: 中断优先级, 是当前系统的中断状态的一部分,表示了当前将会被接受的中断请求
The interrupt priority level (IPL) is a part of the current system interrupt state, which indicates the interrupt requests that will currently be accepted.
- ILP :Instruction-Level-Parallelism 指令层并行. 计算机的并行级是在指令层并行,也就是说计算机在同一时间可以执行两条以上的指令.
针对计算机体系结构的软件优化技术一般应考虑哪些方面?
编译器优化:
- Loop Unrolling 循环展开
- Static Branch Prediction 静态分支预测
- Detecting and Enhancing Loop Level Parallelism 检测和加强循环层次并行度
- 代码转换,消除依赖性
- 软件流水线
- 内存结构优化, Cache优化
- 减少Cache Miss: 数组合并;循环顺序调整;循环融合;数据分块
软件应该贴合计算机体系结构,要贴合局部性原理。 要根据软件所运行的硬件和系统平台来进行优化,比如大型机和PC机的区别,超线程CPU和双核CPU的区别、WINDOWS和UNIX的区别。 软件的优化包括了对软件体系结构、数据结构、算法、代码等几个方面的优化,这几个方面都