Wednesday, January 6, 2016

FPGA Speed Architecture Notes (Timing)

Reference Book : Advanced FPGA Design by Steve Kilts
This post only records the parts that matter to me; for the details, please see the reference book. It really is a great book!

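Three main factors determine the speed of an FPGA design: throughput, latency, and timing.
Throughput: the amount of data processed per second (bits per second)
Latency: the time between a data input and the corresponding processed output (time or clock cycles)
Timing: the logic delay between sequential elements (clock period or frequency). If a design does not "meet timing", its critical path delay is larger than the clock period.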


The largest delay between any two sequential elements in the system determines its max speed.

Tclk-q is time from clock arrival until data arrives at Q;
Tlogic is propagation delay through logic between flip-flops;
Trouting is routing delay between flip-flops;
Tsetup is minimum time data must arrive at D before the next rising edge of clock (setup time);
Tskew is propagation delay of clock between the launch flip-flop and the capture flip-flop.
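
The five terms above combine into the maximum clock frequency. The relation below is the one given in the reference book; the numbers in the worked example are made-up values chosen only to show the arithmetic.

```latex
F_{max} = \frac{1}{T_{clk\text{-}q} + T_{logic} + T_{routing} + T_{setup} - T_{skew}}

% Worked example with assumed delays (not from the book):
% T_{clk-q} = 0.5 ns, T_{logic} = 3.0 ns, T_{routing} = 1.5 ns,
% T_{setup} = 0.3 ns, T_{skew} = 0.1 ns
F_{max} = \frac{1}{(0.5 + 3.0 + 1.5 + 0.3 - 0.1)\,\text{ns}}
        = \frac{1}{5.2\,\text{ns}} \approx 192.3\,\text{MHz}
```

Of these terms, Tlogic and Trouting are the ones the coding style can influence most directly, which is what the five techniques below target.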

Five techniques can be used to raise the max speed: Add Register Layers, Parallel Structures, Flatten Logic Structures, Register Balancing, and Reorder Paths.


Add Register Layers
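Split the critical path into several shorter paths by inserting registers, but first confirm that the added clock cycles of latency do not violate the design specification or break functionality.
For example, Y <= A*X + B*X1 + C*X2; is split into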
prod1 <= A * X;
prod2 <= B * X1;
prod3 <= C * X2;
Y <= prod1 + prod2 + prod3;
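
To show the fragment above in context, here is a minimal sketch of a complete module. The module name, port names, and bit widths are my own assumptions and are not taken from the book; the delayed samples X1 and X2 are assumed to come from a simple shift register.

```verilog
// Sketch only: names and widths are assumptions, not the book's code.
module fir_pipelined(
    input             clk,
    input      [7:0]  A, B, C,   // coefficients
    input      [7:0]  X,         // input sample
    output reg [17:0] Y);

reg [7:0]  X1, X2;               // delayed input samples
reg [15:0] prod1, prod2, prod3;  // the added register layer

always @(posedge clk) begin
    X1    <= X;
    X2    <= X1;
    // Stage 1: each multiplier now ends at its own register
    prod1 <= A * X;
    prod2 <= B * X1;
    prod3 <= C * X2;
    // Stage 2: only the adders remain between registers
    Y     <= prod1 + prod2 + prod3;
end
endmodule
```

The longest register-to-register path is now either one multiplier or the adder chain, instead of a multiplier followed by the adders, at the price of one extra clock cycle of latency.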


Parallel Structures
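This technique takes logic that is evaluated serially and evaluates it in parallel. For example, the operand of an 8-bit multiplier can be split into two 4-bit halves that are processed at the same time, with the results combined afterward.

The example is fairly large; see page 9 of the book. This technique can reduce path delay.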
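
As a much smaller illustration than the book's example, the sketch below splits both operands of an 8x8 multiply into 4-bit halves, which gives four 4x4 partial products that can be evaluated side by side. The module and signal names are hypothetical, and the block is purely combinational; whether it actually beats a single multiplier depends on the target device and tools.

```verilog
// Sketch only: module and signal names are hypothetical, not from the book.
module mult8_split(
    input  [7:0]  X, Y,
    output [15:0] P);

// Split each operand into a high and a low 4-bit half
wire [3:0] XH = X[7:4];
wire [3:0] XL = X[3:0];
wire [3:0] YH = Y[7:4];
wire [3:0] YL = Y[3:0];

// Four independent 4x4 partial products, evaluated side by side
wire [7:0] hh = XH * YH;
wire [7:0] hl = XH * YL;
wire [7:0] lh = XL * YH;
wire [7:0] ll = XL * YL;

// Recombine: X*Y = hh*256 + (hl + lh)*16 + ll
// Each term is zero-extended to 16 bits before shifting so no carry is lost
assign P = ({8'b0, hh} << 8)
         + ({8'b0, hl} << 4)
         + ({8'b0, lh} << 4)
         +  {8'b0, ll};
endmodule
```

Registering the four partial products before the final additions would combine this with the Add Register Layers technique.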


Flatten Logic Structures
This technique is similar to Parallel Structures, but it applies to priority encoding, as in the example below. Synthesis and layout tools are smart enough to duplicate logic to reduce fanout, but they are not smart enough to break up logic structures that are coded in a serial fashion.


// reference from : Advanced FPGA Design by Steve Kilts

module regwrite(
 input [3:0] ctrl,
 input clk,in,
 output reg [3:0] rout);
 
always@(posedge clk)
 if(ctrl[0]) rout[0] <= in;
 else if(ctrl[1]) rout[1] <= in;
 else if(ctrl[2]) rout[2] <= in;
 else if(ctrl[3]) rout[3] <= in;

endmodule



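Written this way, the tool infers priority logic: each bit's write enable is gated by all of the higher-priority control bits, which adds path delay.
If the ctrl bits are already mutually exclusive (for example, strobes from an address decoder, as the book assumes), that priority is unnecessary, so the author recommends breaking the if/else chain into independent if statements. Each register bit then depends only on its own control bit, which reduces path delay. Note that the two versions behave differently if more than one ctrl bit is asserted at the same time.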


// reference from : Advanced FPGA Design by Steve Kilts
module regwrite(
 input [3:0] ctrl,
 input clk,in,
 output reg [3:0] rout);
 
always@(posedge clk)
begin
 // independent ifs: each bit looks only at its own control signal
 if(ctrl[0]) rout[0] <= in;
 if(ctrl[1]) rout[1] <= in;
 if(ctrl[2]) rout[2] <= in;
 if(ctrl[3]) rout[3] <= in;
end

endmodule


Register Balancing
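This technique evens out the logic between registers to shrink the largest register-to-register delay.
For example, written as below, the critical path is Sum <= rA + rB + rC; with two adders in a row, while the paths into rA, rB, and rC contain no logic at all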
rA <= A;
rB <= B;
rC <= C;
Sum <= rA + rB + rC;
Rewriting it in the balanced form below, with one adder per stage, shortens the critical path delay
rABSum <= A + B;
rC <= C;
Sum <= rABSum + rC;
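
For reference, here is the balanced version as a complete module. This is a minimal sketch: the module name and the bit widths (widened so that no carry is dropped) are my assumptions, not the book's.

```verilog
// Sketch only: module name and widths are assumptions.
module adder_balanced(
    input            clk,
    input      [7:0] A, B, C,
    output reg [9:0] Sum);

reg [8:0] rABSum;  // first adder moved in front of the register stage
reg [7:0] rC;      // C is only delayed so it stays aligned with rABSum

always @(posedge clk) begin
    rABSum <= A + B;        // stage 1: one adder
    rC     <= C;
    Sum    <= rABSum + rC;  // stage 2: one adder
end
endmodule
```

The latency is the same two clock cycles as the unbalanced version; only the placement of the first adder relative to the registers has changed.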


Reorder Paths
If several paths combine with the critical path, they can be reordered so that the critical path sits closest to the destination register.
Take the following example


// reference from : Advanced FPGA Design by Steve Kilts

module randomlogic(
 input [7:0] A,B,C,
 input clk,
 input Cond1, Cond2,
 output reg [7:0] Out);
 
always@(posedge clk)
begin
 if(Cond1)
  Out <= A;
 else if(Cond2 && (C < 8))
  Out <= B;
 else
  Out <= C;
end 
endmodule

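Assume the path from C to Out is the critical path: C has to pass through two gates before it reaches the mux (the if/else).
Rearranging the code as below removes one level; the figure on page 15 of the book makes this clearer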


// reference from : Advanced FPGA Design by Steve Kilts
module randomlogic(
 input [7:0] A,B,C,
 input clk,
 input Cond1, Cond2,
 output reg [7:0] Out);
 
// Cond2 is only relevant when Cond1 is not set
wire CondB = (Cond2 & !Cond1);

always@(posedge clk)
begin
 if(CondB && (C < 8))
  Out <= B;
 else if(Cond1)
  Out <= A;
 else
  Out <= C;
end 
endmodule


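The conclusion: put the more complex comparison first so that it lands closest to the destination register, which reduces the critical path delay.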

FPGA Speed Architecture Notes (Latency)



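In contrast to throughput, the key to low latency is to reduce pipelining. The book continues with its power3 example for comparison.
A low-latency design relies on parallelism, removal of pipelining, and logical shortcuts, but this can reduce throughput and max clock speed.

The example below is the low-latency version. In a pipelined design, each stage has to wait for the previous stage's result to be registered; this version drops the pipeline and uses combinational expressions, so no stage waits on a clock edge for the previous one.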


// reference from : Advanced FPGA Design by Steve Kilts

module power3(
	input [7:0] X,
	output [7:0] XPower);
	
reg [7:0] XPower1, XPower2;
reg [7:0] X1,X2;

assign XPower = XPower2 * X2;

// combinational blocks use blocking assignments
always@(*)
begin 
	X1		= X;
	XPower1	= X;
end
	
always@(*)
begin 
	X2		= X1;
	XPower2	= XPower1*X1;
end

endmodule

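Throughput: 8/1 (8 bits per clock, assuming new data arrives every clock)
Latency: between one and two multiplier delays, 0 clock cycles
Timing: two multiplier delays in the critical path

Conclusion: to lower latency, remove pipeline registers, but this increases the combinational delay between registers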




FPGA Speed Architecture Notes (Throughput)


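The key to high throughput is pipelining. The book illustrates this with a for loop.
C (example from the reference book)
XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

This is usually written as the RTL below. It follows the book, except that I changed the assign slightly and added begin/end.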


// reference from : Advanced FPGA Design by Steve Kilts
module power3(
	input [7:0] X,
	input clk,start,
	output reg [7:0] XPower,
	output finished);
	
reg [7:0] ncount;

// finished is high once the down-counter reaches zero
assign finished = (ncount == 0)? 1:0;

always@(posedge clk)
begin 
	if(start)
	begin 
		XPower <= X;
		ncount <= 2;
	end 
	else if(!finished)
	begin 
		ncount <= ncount - 1;
		XPower <= XPower*X;
	end 
end
endmodule



For the example above:
Throughput: 8/3 (8 bits, and each result takes 3 clock cycles)
Latency: 3 clocks
Timing: one multiplier delay

Written as a pipeline instead:


// reference from : Advanced FPGA Design by Steve Kilts

module power3(
	input [7:0] X,
	input clk,
	output reg [7:0] XPower);
	
reg [7:0] XPower1, XPower2;
reg [7:0] X1,X2;


always@(posedge clk)
begin 
	// Pipeline stage 1
	X1		<= X;
	XPower1	<= X;
	// Pipeline stage 2
	X2		<= X1;
	XPower2	<= XPower1*X1;
	// Pipeline stage 3
	XPower	<= XPower2*X2;
end
	
endmodule

Throughput: 8/1, a result comes out on every clock cycle
Latency: 3 clocks
Timing: one multiplier delay
Pipelining has a cost, of course: it needs additional registers and one more multiplier
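
To put the two throughput figures side by side, here is the arithmetic for an assumed clock of 100 MHz. The clock rate is a made-up value, not from the book, and both designs are assumed to reach the same clock because each has one multiplier in its critical path.

```latex
% Assumption: f_{clk} = 100 MHz for both designs
\text{Iterative: } \frac{8\ \text{bits}}{3\ \text{cycles}} \times 100 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} \approx 266.7\ \text{Mbit/s}

\text{Pipelined: } \frac{8\ \text{bits}}{1\ \text{cycle}} \times 100 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} = 800\ \text{Mbit/s}

\text{Latency (both): } 3\ \text{cycles} \times 10\ \text{ns} = 30\ \text{ns}
```

Under these assumptions the pipeline triples the throughput while the latency stays the same.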

Conclusion: to raise a design's throughput, unroll the loop into a pipeline, at the cost of more area