
IEICE TRANS. ELECTRON., VOL.E83–C, NO.2 FEBRUARY 2000, p. 195

A Low-Voltage 42.4 G-BPS Single-Ended Read-Modify-Write Bus and Programmable Page-Size on a 3D Frame-Buffer

Kazunari INOUE†a), Member, Hideaki ABE†, Kaori MORI†, and Shuji FUKAGAWA†, Nonmembers

PAPER Special Issue on Low-Power High-Speed CMOS LSI Technologies

SUMMARY Various kinds of high-bandwidth architecture using embedded DRAM technology have been presented previously. In most cases, they use a wide bus and/or a fast bus speed, both of which carry a penalty in die area and in power consumption. The proposed single-ended read-modify-write bus doubles the bandwidth while keeping the same bus size and the same bus speed. The data-bus comprises a 1k-bit read-bus and a 1k-bit write-bus that work concurrently, with a signal amplitude of 0 V to 1 V; hence, the measured power consumption is only 0.3 W at a frequency of 166 MHz. A programmable page-size reduces the page-miss rate and efficiently improves the bandwidth, by an amount comparable to the wide-bus and fast-speed approaches. All the proposed features are implemented on a 3D frame-buffer to achieve a bandwidth of 42.4 G-BPS.
keywords:

Manuscript received June 30, 1999.
Manuscript revised October 12, 1999.
† The authors are with System-LSI division AS-memory group, Mitsubishi Electric Corporation, Itami-shi, 6-81 Japan.
a) E-mail: inoue.kazunari@lsi.melco.co.jp

1. Introduction

Embedded DRAM (eRAM) technology has been reported to offer high bandwidth and other advantages for certain applications [1]–[6]. In particular, the frame-buffer is one of the most noticeable eRAM applications, because commodity memory is too deep and too narrow in bandwidth for graphics applications. In most cases, previous designs have relied on a wide bus and/or a fast bus speed, both of which carry a penalty in die area and power. Even if a huge number of data-bus lines is placed on the DRAM array to save die area [3], the power consumption can cause noise problems. A single-ended read-modify-write bus doubles the bandwidth while keeping the same bus size and bus speed. In addition, the data-bus swings only from 0 V to 1 V; hence, the low-voltage operation efficiently eliminates the noise problem and also lowers the power consumption.

Another approach to obtaining high bandwidth is to reduce the page-miss penalty that is inherent to DRAM. eRAM technology is also useful here (e.g., having an additional data-latch that communicates with the data-bus in place of the sense-amp to hide the page-miss [4]; building a cascaded multi-bank architecture and having a unique access-control circuitry for relieving the page-miss penalty [5], [6]). eRAM technology reduces page misses and brings the average bandwidth closer to the peak bandwidth. In this paper we propose a programmable page-size that adjusts to fit the different kinds of data-access sequence that generally occur in graphics applications.

When data access is sequenced from right to left, a horizontal page-size is applied. When data access is sequenced as a tile (i.e., four pixels from right to left first, then shifting to the bottom and repeating four pixels from right to left), a rectangular page-size is applied to reduce the page-miss rate.

In the overall graphics system, the frame-buffer interfaces with a rendering processor and a RAMDAC. The bandwidth required of the frame-buffer is the combination of the rendering pixel-rate and the video pixel-rate.

The required bandwidth is discussed in Sect. 2. Sections 3, 4, and 5 introduce the proposed high-bandwidth architecture, which minimizes the power consumption and the silicon penalty: Sect. 3 gives its overview, Sect. 4 explains the external and the internal I/O-bus, and Sect. 5 explains the proposed data transfer buffer and low-voltage bus. The programmable page-size technique for page-miss reduction is discussed in Sect. 6.

2. Band-Gap between Frame-Buffer and Rendering/Video

This section shows the computed bandwidth between the rendering processor, the video output, and the frame-buffer.

Figure 1 is an example of a 3D graphics system. The general rendering primitive in 3D applications is a triangle.

First, the CPU passes the vertex pixel data to the rendering processor, which computes the inner pixel data of the triangle. This computation is done in the setup unit and the scanner of the rendering processor. Second, the pixel-pipeline performs many rendering procedures, such as texture mapping. Some pixel procedures are done by hardware and some by software. Finally, the rendering processor generates the result-pixel data and provides them to the frame-buffer. A frame-buffer is a memory LSI that holds the pixel data displayed on the screen. The output of the frame-buffer is the video data, which contains A-color and B-color, named after the double buffering used in animation. The video data is coupled to the RAMDAC and follows the horizontal scan-line seen on the screen.

Fig. 1 A graphics system consisting of the rendering processor, frame-buffer, and RAMDAC.

2.1 Rendering Pixel Rate and Frame-Buffer Bandwidth

A rendering processor consists of multiple pixel-pipelines. The bandwidth that the frame-buffer devotes to the rendering pixel-pipelines is

$$\mathrm{BW}_{\text{render}} = \underbrace{2}_{\text{Source}+\text{Destination}} \times \mathrm{BPP} \times N \times F \tag{1}$$

Note: BPP = bits per pixel; N × F = number of pixel-pipelines × operating frequency.

Here, the pixel-pipeline uses only new data (= Source) in 2D, while 3D applications need both new data and old data (= Destination). Not all 3D procedures use destination data, but some of them certainly need both source and destination, as described in Fig. 1. That is why 3D rendering is called "read-modify-write."
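Eq. (1) can be evaluated directly. The sketch below is a minimal Python version; the function and parameter names are our own, and G is taken as 10^9. With two pipelines at 150 MHz and 64 BPP it returns 4.8 GByte/s, the value given in Eq. (2).

```python
def render_bandwidth_bytes(bpp: int, pipelines: int, freq_hz: float,
                           read_modify_write: bool = True) -> float:
    """Frame-buffer bandwidth (bytes/s) consumed by the rendering pipelines, Eq. (1).

    bpp: bits per pixel transferred (e.g. 32-bit color + 32-bit depth = 64).
    read_modify_write: 3D rendering moves both source and destination (factor 2);
    2D rendering that only uses new data corresponds to a factor of 1.
    """
    transfers_per_pixel = 2 if read_modify_write else 1
    bits_per_second = transfers_per_pixel * bpp * pipelines * freq_hz
    return bits_per_second / 8  # bits -> bytes


if __name__ == "__main__":
    bw = render_bandwidth_bytes(bpp=64, pipelines=2, freq_hz=150e6)
    print(f"BW[render] = {bw / 1e9:.1f} GByte/s")
```

The 2D case (`read_modify_write=False`) halves the result, which is the difference between 2D and 3D traffic described above.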

Figure 2 shows the historical trend of BW[render]. It follows a roughly square-law curve because both BPP (bits per pixel) and N*F (pixel-pipelines) grow. Assuming an application with two pixel-pipelines working in parallel at 150 MHz, with 32-bit color and 32-bit depth for a total of 64 BPP, the bandwidth that the frame-buffer must provide to the rendering processor is

$$\mathrm{BW}_{\text{render}} = 4.8\ \text{G-Byte/sec} \tag{2}$$

Fig. 2 The historical trend of the rendering BW.

Fig. 3 The historical trend of the video BW.

2.2 Video Pixel Rate and the Frame-Buffer Bandwidth

A RAMDAC comprises a palette-RAM and a DAC (D/A converter) to produce a visual display from the video pixel-data. The bandwidth that the frame-buffer must provide to a RAMDAC is

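$$\mathrm{BW}_{\text{video}} = \underbrace{2}_{\text{A/B buffer}} \times \mathrm{BPP} \times \text{screen-size} \times 1.4 \times \text{frame-rate} \tag{3}$$

Note: BPP = bits per pixel for the color channel.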

Double-buffering requires a doubled data rate for the frame-buffer, because the frame-buffer does not distinguish whether the A-buffer or the B-buffer is shown on the screen; both buffers' data need to be transferred to the RAMDAC. Also, the video pixel-rate is proportional to the screen size and to the frame rate. Figure 3 shows the historical trend of the video pixel-rate. In today's applications, using 32-bit color, a 1280*1024 screen size, and a frame rate of 60 fps, the bandwidth is

$$\mathrm{BW}_{\text{video}} = 881\ \text{M-Byte/sec} \tag{4}$$
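Eq. (3) and the sum in Eq. (5) can be checked the same way. In this sketch the names are ours, M and G are taken as 10^6 and 10^9, and the 1.4 factor is kept as a plain parameter because the text does not say what it accounts for.

```python
def video_bandwidth_bytes(bpp: int, width: int, height: int,
                          frame_rate_hz: float, factor: float = 1.4) -> float:
    """Frame-buffer bandwidth (bytes/s) consumed by video refresh, Eq. (3).

    The leading 2 reflects double buffering: both A-buffer and B-buffer data
    are sent to the RAMDAC. `factor` is the 1.4 multiplier of Eq. (3).
    """
    bits_per_second = 2 * bpp * width * height * factor * frame_rate_hz
    return bits_per_second / 8  # bits -> bytes


if __name__ == "__main__":
    video = video_bandwidth_bytes(bpp=32, width=1280, height=1024, frame_rate_hz=60)
    render = 4.8e9  # BW[render] from Eq. (2), bytes/s
    print(f"BW[video] = {video / 1e6:.0f} MByte/s")
    print(f"BW[frame-buffer] = {(render + video) / 1e9:.1f} GByte/s")
```

This prints 881 MByte/s and 5.7 GByte/s, matching Eqs. (4) and (6).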

Lastly, the total bandwidth that a frame-buffer needs to offer is as follows.

$$\mathrm{BW}_{\text{frame-buffer}} = \mathrm{BW}_{\text{render}} + \mathrm{BW}_{\text{video}} \tag{5}$$

$$\mathrm{BW}_{\text{frame-buffer}} = 5.7\ \text{G-Byte/sec} \tag{6}$$

Apart from high-end workstations (WS), there is no frame-buffer solution that meets the above bandwidth. When 3D animation is used in today's PC applications, the color mode is sometimes compromised to 16-bit color, and at other times the video frame-rate is reduced to 30 fps or less, because of the lack of frame-buffer bandwidth.

3. High Bandwidth Approach

Fig. 4 Block diagram of the reported frame-buffer.

Fig. 5 The proposed graphics procedure.

Figure 4 shows the block diagram of an experimental embedded 3D frame-buffer. DRAM is used as the frame memory and SRAM as a first-level cache. A 1k-bit read-bus and a 1k-bit write-bus, which work concurrently, are placed on the DRAM array. Implementing a 2k-bit bus on a DRAM array is close to the maximum number of bits allowable by the design rule [3].

A programmable page-size control is proposed in order to reduce the page-miss rate; the page-size control is also effective in improving the average bandwidth. The video-port provides the serial-access data for the screen, and JTAG is the serial port for testing.

4. Bandwidth on I/O-Bus and the Frame-Buffer

No matter how far embedded DRAM technology enables a wider bus and a faster speed, it is more difficult for an external I/O-bus to catch up with such a high bandwidth. Therefore, we first propose a pixel-rate improvement on the external I/O-bus, and then explain a new internal I/O-bus architecture.

4.1 BW on the External I/O-Bus

The current graphics system contains three kinds of LSI: the rendering processor, the frame-buffer, and the RAMDAC. The rendering processor generates the result-pixel by synthesizing the source and the destination. The source comes from the CPU and the destination comes from the frame-buffer, so both the destination and the result-pixel go back and forth along the external I/O-bus between the rendering processor and the frame-buffer. Therefore, the actual rendering pixel-rate decreases to 1/2 of the bandwidth on the external I/O-bus. When the pixel-rate is increased 2X, the external I/O-bus needs to be 4X as fast. In addition, either a 4X bus size or a 4X operating speed takes 4X the power, so it is quite difficult to double the rendering pixel-rate on the external I/O-bus. The RAMDAC generates the visual pixel-data by multiplexing the A-buffer, the B-buffer, and sometimes an overlay. These video pixel-data always appear on the external I/O-bus; thus the actual pixel-rate for video is 1/2 or less of the bandwidth on the external I/O-bus. Again, it is not easy to double the video pixel-rate.

Figure 5 shows the proposed graphics procedure. The indicated data-path improves the actual pixel-rate on the external I/O-bus. Graphics hardware implemented on the frame-buffer reduces the task executed on the external I/O-bus. Because the S/W/Z-test unit and the blending unit are placed in the frame-buffer, the result-pixel is also generated inside the frame-buffer. Hence, only pure source-pixels appear on the external I/O-bus between the rendering processor and the frame-buffer, and the rendering pixel-rate is exactly the same as the bandwidth on the external I/O-bus [7]. The video-output data from the serial port in Fig. 4 contains the A-buffer and the B-buffer. An implemented W-LUT indicates whether the A-buffer or the B-buffer is requested by each window on the screen. Therefore, the video pixel-data coming from the frame-buffer is visual display data, and the video pixel-rate is also exactly the same as the bandwidth on the external I/O-bus.
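As a rough illustration of the argument above, the sketch counts how many pixel transfers cross the external I/O-bus per useful pixel in the conventional and the proposed partitioning. The transfer counts follow the text (destination plus result-pixel versus source only; A-buffer plus B-buffer data versus W-LUT-selected data); the 300 Mpixel/s bus capacity and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalBusTraffic:
    """Pixel transfers crossing the external I/O-bus per useful pixel."""
    render: int  # transfers per rendered pixel
    video: int   # transfers per displayed pixel


# Conventional: destination read + result write per rendered pixel, and
# A-buffer + B-buffer data per displayed pixel (overlay ignored here).
CONVENTIONAL = ExternalBusTraffic(render=2, video=2)
# Proposed: Z-test/blending inside the frame-buffer, so only the source pixel
# crosses; the on-chip W-LUT selects A or B, so only visible data leaves.
PROPOSED = ExternalBusTraffic(render=1, video=1)


def pixel_rate(bus_pixels_per_s: float, transfers_per_pixel: int) -> float:
    """Useful pixel-rate on a bus that can move bus_pixels_per_s pixels."""
    return bus_pixels_per_s / transfers_per_pixel


if __name__ == "__main__":
    bus = 300e6  # hypothetical external I/O-bus capacity, pixels/s
    for name, traffic in (("conventional", CONVENTIONAL), ("proposed", PROPOSED)):
        r = pixel_rate(bus, traffic.render) / 1e6
        v = pixel_rate(bus, traffic.video) / 1e6
        print(f"{name:12s}: rendering {r:.0f} Mpixel/s, video {v:.0f} Mpixel/s")
```

With the same bus, the proposed partitioning doubles both useful pixel-rates without changing the data rate on the bus, which is the point made in the next paragraph.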

Implementing the graphics hardware on the frame-buffer doubles the pixel-rate while keeping the same data rate and the same power consumption on the external I/O-bus.

4.2 BW on the Internal I/O-Bus

The bandwidth on the internal I/O-bus is calculated as follows.

$$\mathrm{BW}_{\text{frame-buffer}} = [\text{bus size}] \times [\text{bus speed}] \,/\, [\text{page-miss overhead}] \tag{7}$$

Here, [bus size]*[bus speed] is merely the peak bandwidth. Because the rendering process and the RAMDAC handle consecutive pixel-data, the peak bandwidth does not apply to the frame-buffer. The key to graphics performance at the system level is the average bandwidth, so it is more effective to work on page-miss reduction and video overhead than to build a wider and faster bus. In general, bus size and bus speed depend on the process parameters: the design rule for metal line/space determines the maximum bus size, and the transistor performance determines the bus speed.
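The numerator of Eq. (7) can be evaluated for the experimental chip. This sketch assumes 1k = 1024 bits, counts both the read-bus and the write-bus because they run concurrently, and uses the 166 MHz clock from the Summary; the function names are ours. It gives about 42.5 GByte/s, close to the reported 42.4 G-BPS, although the paper does not spell out how that figure was derived. Reading the page-miss term of Eq. (7) as a divisor of at least 1 is our assumption, not a definition from the paper.

```python
def peak_bandwidth_bytes(bus_bits: int, bus_hz: float) -> float:
    """Peak bandwidth, the [bus size] * [bus speed] numerator of Eq. (7), bytes/s."""
    return bus_bits * bus_hz / 8


def effective_bandwidth_bytes(bus_bits: int, bus_hz: float, overhead: float) -> float:
    """Eq. (7) read as peak / overhead, with overhead >= 1 (1.0 = no page misses)."""
    if overhead < 1.0:
        raise ValueError("overhead factor must be >= 1")
    return peak_bandwidth_bytes(bus_bits, bus_hz) / overhead


if __name__ == "__main__":
    bus_bits = 1024 + 1024  # 1k-bit read-bus + 1k-bit write-bus, used concurrently
    peak = peak_bandwidth_bytes(bus_bits, 166e6)
    print(f"peak = {peak / 1e9:.1f} GByte/s")
    # Hypothetical 1.25x page-miss overhead, only to show the division in Eq. (7).
    eff = effective_bandwidth_bytes(bus_bits, 166e6, 1.25)
    print(f"with a 1.25x overhead = {eff / 1e9:.1f} GByte/s")
```

The gap between the two printed values is why the text focuses on the average rather than the peak bandwidth.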

It is obvious that the rendering controller, the frame-buffer, and the RAMDAC all have to use the latest process technology. Otherwise, if the frame-buffer focuses only on bus size and bus speed, it will never catch up with the rendering and video performance, as the historical trend shows.

This paper discusses the effective pixel-rate and proposes two architectures: one is a low-voltage single-ended read-modify-write bus and the other is a programmable page-size.

5. Single-Ended Read-Modify-Write Bus and DTB (Data Transfer Buffer)

The differential I/O-bus is widely used in DRAM because the sense-amp is not capable of driving the global bus. In a read operation, a slight voltage difference appears on the differential I/O-bus and is amplified again at the other end of the I/O-bus. In a write operation, the differential I/O-bus is fully driven by the write-driver, and data is easily written into the sense-amp. Most of the DRAM speed bottleneck lies in the path from the sense-amp to the end of the I/O-bus, because it behaves like an analog operation on a slight signal.

Fig. 6 Conventional separated I/O-bus architecture.

Because the differential I/O-bus carries either the destination or the result-pixel in the frame-buffer, the actual pixel-rate also drops to 1/2 of the internal I/O-bus bandwidth. An easier method to acquire more bandwidth is to increase the bus size, but it is limited by design rules such as the minimum metal line width and spacing. The other method to improve the pixel-rate is a separated I/O-bus, that is, a read-modify-write bus in which the destination and the result-pixel are transferred concurrently. It makes the pixel-rate twice that of the differential I/O-bus. However, the read-modify-write bus needs 2X the metal lines, one for reading and the other for writing, which may reduce the possible bus size. Figure 6 shows an example of a conventional read/write separated I/O-bus architecture. In this figure there are two kinds of bus lines: GBR (global bus for read) and GBW (global bus for write). There are also two kinds of CSL (column select line): R-CSL (read column select line) and W-CSL (write column select line), which operate for read and write simultaneously. Because the maximum number of metal lines placed on the memory array depends on the process parameters, it is not easy to have double GBs and double CSLs.
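The 2X pixel-rate of a read-modify-write bus can be seen in an idealized cycle count. This sketch assumes one pixel per bus cycle, a modify step that fits between the read and the write, and no bank or page conflicts; the function names are ours. A shared bus spends two cycles per pixel, whereas separate read and write buses overlap the destination read of one pixel with the result write of an earlier one.

```python
def shared_bus_cycles(n_pixels: int) -> int:
    """Single (differential) I/O-bus: the destination read and the result write
    take turns on the same lines, so each pixel costs two bus cycles."""
    return 2 * n_pixels


def separated_bus_cycles(n_pixels: int) -> int:
    """Separate read-bus and write-bus: each cycle the read-bus fetches the next
    destination while the write-bus stores the result of a pixel read in an
    earlier cycle (the modify step is assumed to fit within one cycle)."""
    reads = writes = cycles = 0
    while writes < n_pixels:
        cycles += 1
        can_write = writes < reads  # only pixels read in earlier cycles
        if reads < n_pixels:
            reads += 1
        if can_write:
            writes += 1
    return cycles


if __name__ == "__main__":
    for n in (1, 16, 1024):
        s, r = shared_bus_cycles(n), separated_bus_cycles(n)
        print(f"{n:5d} pixels: shared {s:5d} cycles, "
              f"separated {r:5d} cycles, speed-up {s / r:.2f}x")
```

For 1024 pixels this gives 2048 versus 1025 cycles, approaching the 2X pixel-rate; the single extra cycle is the pipeline fill. The model says nothing about the metal-line and noise costs discussed next.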

Figure 7 shows another example of the read/write separated I/O-bus architecture. Here R-CSL and W-CSL run horizontally, and it is better to double the GB size vertically. The silicon-area penalty for implementing double CSLs and Y-gates is serious.

A considerable disadvantage of the separated I/O-bus architecture is the noise problem. As the write-bus switches fast, from 0 V to the full VDD level, it disturbs the weak signals on the read-bus during concurrent operation. This is worst at tRCD = min., because the sense-amp has not completed the bit-line amplification. A longer tRCD must be allowed when the write-bus switches in a different direction from the read-bus. A shielding power-line between the write-bus and the read-bus is one solution to the coupling-noise problem; however, it takes extra metal lines and reduces the bus size.

The reported low-voltage single-ended read-modify-write bus provides double bandwidth without increasing the metal lines. The sense-amp is not connected to the read/write bus directly; instead, a DTB (Data Transfer Buffer) is placed between the sense-amp and the read/write bus. A DRAM array has sense-amps on both sides, and each sense-amp has a DTB and a column select, as shown in Fig. 8. Figure 9 shows the DTB structure. Ten CSLs are connected to the local bus in a page. Because the DTB performs either a read operation or a write operation, it is not necessary to organize a separated local I/O-bus and separated CSLs.

Fig. 7 Another example of the separated I/O-bus architecture.

In other words, the data-path from the sense-amp to the DTB remains within the differential architecture to save silicon area, while the other data-path, between the DTB and GBR/GBW, is modified to a single-ended structure to obtain twice the bandwidth. In this experiment, the DTB and the column switch take less than 15 µm of height on the silicon, which is not a significant die penalty compared to the conventional separated I/O-bus architecture. In a read operation, the DTB converts the slight signal on the local I/O-bus into a MOS-level signal on the GBR. The load on the sense-amp is quite small, because it drives only the local bus, which involves ten CSLs. An additional amplifier, such as a current-mirror type, is not necessary because the GBR carries a MOS-level signal. In a write operation, the DTB drives the local I/O-bus according to the GBW. In this experiment, the GBW used a differential bus because a write-mask function is adopted. It is also easy to convert other architectures into the single-ended architecture.

A further objective of the DTB implementation is to reduce the power and to eliminate the noise problem. The DTB and GBR/GBW use 1 V for their power supply, which comes from an internal voltage down-converter (VDC), whereas the sense-amp and the local bus work at the 2 V VDD level. Figure 10 shows the waveforms of GBR and GBW. As for GBR/GBW speed, a 1 V–0 V swing is sufficient to operate at 166 MHz, because the analog part of the DRAM operation ends at the DTB. GBR/GBW serve as a simple data-path carrying MOS-level signals, which has no equalize-time requirement.

The low-voltage bus efficiently removes the coupling-noise problem in concurrent operation. Even if coupling noise occurs on the GBR, it never reaches the sense-amp, because GBR/GBW are not connected to the sense-amp.

Another way to reduce the noise problem is to use separate power-lines: the 2 V power-line for the sense-amps and cells, and the 1 V power-line for the DTB and data-bus. Therefore, column access using a big bus-size never affects the row access. In the conventional architecture, because of the unified power-line, the sensing speed (= row access time) is sometimes degraded by column access operating in other banks; obviously, a wider bus-size and a faster bus-speed make it worse. The experimental chip has a 1k-bit read-bus and a 1k-bit write-bus, each working concurrently at 166 MHz, but they do not affect the row-access speed. Figure 11 shows the measured IDD current of row access and column access. The IDD of the GBR is almost the same as that of row access, even though 1k bits switch at the same time. Also, as column access uses 1 V and row access uses 2 V for the power supply, the power consumption of the 2k-bit bus is less than 0.3 W. According to current SDRAM (x16) characteristic data, the power consumption in column access is even larger than that in row access.

Fig. 8 Single-ended RMW bus and DTB.

Fig. 9 DTB structure.

Fig. 10 GBW/GBR waveform.

6. Page-Miss Rate Reduction

6.1 Page-Miss and Video-Overhead

In 3D applications, the rendering primitives are triangles. There are three types of triangle arrangement, strip, fan, and independent, as shown in Fig. 12. In any case, a rectangular page-size is better than a horizontal shape, since it includes more triangles in a page. For a general data-access sequence of rendering a tiled image, that is, 4 pixels left to right first and then scanning to the bottom, a rectangular page-size reduces the rendering page-miss rate. On the other hand, the scan-line always runs from left to right, so the horizontal page-size is better than the rectangular shape for reducing the video overhead.

Fig. 11 Measured IDD current.

Fig. 12 Primitive triangles in 3D.
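The trade-off above can be illustrated by counting row activations (page misses) for the two access sequences under the two page shapes. This is a minimal sketch, not the authors' simulator: it assumes a single open page (no multi-bank reuse), a hypothetical 64×64-pixel region, and our reading of the tiled sequence as 4-pixel-wide strips walked from top to bottom; all names are ours.

```python
from typing import Iterable, Iterator, Tuple

Pixel = Tuple[int, int]


def scanline_order(width: int, height: int) -> Iterator[Pixel]:
    """Video refresh: each scan-line left to right, lines top to bottom."""
    for y in range(height):
        for x in range(width):
            yield x, y


def tile_order(width: int, height: int, tile_w: int = 4) -> Iterator[Pixel]:
    """Rendering: tile_w pixels left to right, then down to the bottom of the
    region, then the next tile_w-wide strip (our reading of the tiled sequence)."""
    for x0 in range(0, width, tile_w):
        for y in range(height):
            for x in range(x0, min(x0 + tile_w, width)):
                yield x, y


def row_activations(pixels: Iterable[Pixel], page_w: int, page_h: int) -> int:
    """Count row accesses: one whenever a pixel lies outside the open page.
    Assumes a single open page (no multi-bank reuse)."""
    count, open_page = 0, None
    for x, y in pixels:
        page = (x // page_w, y // page_h)
        if page != open_page:
            count += 1
            open_page = page
    return count


if __name__ == "__main__":
    W = H = 64  # hypothetical 64x64-pixel region of the screen
    shapes = {"horizontal 128*2": (128, 2), "rectangular 16*16": (16, 16)}
    for name, (pw, ph) in shapes.items():
        render = row_activations(tile_order(W, H), pw, ph)
        video = row_activations(scanline_order(W, H), pw, ph)
        print(f"{name:18s} rendering (tile): {render:4d}  video (scan-line): {video:4d}")
```

For this region the horizontal 128*2 page needs 512 row activations for tile-order rendering but only 32 for scan-line video, while the rectangular 16*16 page needs 64 and 256 respectively: each shape wins for one access pattern and loses for the other. These counts do not reproduce Table 1, which is based on triangles.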

The page-size is therefore likely to be a compromise between the rendering page-miss rate and the video overhead. Table 1 shows the rendering performance limit when a commodity 128-bit-bus SDRAM is used. Here, two page-sizes, 128*2 pixels for the horizontal case and 16*16 pixels for the rectangular case, are taken as parameters in this study. The page-miss rate means the average row-access count when a triangle is drawn. For example, an average of 5.56 row accesses occurs when the 128*2-pixel page-size is used, and an average of 1.86 row accesses occurs when the 16*16-pixel page-size is used. These numbers are calculated by a simple simulation. A rectangular page-size is better at reducing the rendering page-miss rate.

A page-miss cost is [page-miss rate]*[row-access time], in which [tRCD/tRP: 2/2 = @24 ns] is used for the first row access and [tRRD: 2 = @12 ns] is used from the second row access onward, owing to the multi-bank architecture. A block-size means the pixel cache-size in a column access. In Table 1, BPP = 64 bit is used, so the 128-bit bus can process two pixels (Px*Py = 1*2 pixels) in parallel. A block-miss rate means the average column-access count when a 25-pixel triangle is drawn, and a block-miss cost means [block-miss rate]*[column-access time]. Here, [read-modify-write = @12 ns] is used for the column access. Video-overhead means the bandwidth lost to video refresh. A screen size of 1280*1024@60 fps, that is, a video-output rate of 9 ns/pixel, is used. Taking the case of the 128*2-pixel page-size as an example, the video overhead is calculated as follows.

Table 1 SDRAM performance limit.

Table 2 4*4-pixel column access performance limit.

$$[\text{video overhead}] = \frac{[\text{access time of a scan-line in a page}]}{[\text{video pixel-rate in a page}]} = \frac{24\ \text{ns} + 6\ \text{ns} \times 128\ \text{pixel} / 2\ \text{pixel}}{9\ \text{ns} \times 128\ \text{pixel}} = 35.5\% \tag{8}$$

Assuming that row access and column access operate independently, the rendering cost is calculated as follows.

$$[\text{total triangle rendering cost}] = \frac{\max(\text{page-miss cost},\ \text{block-miss cost})}{1 - \text{video overhead}} \tag{9}$$

When the 128*2-pixel page-size is used,

$$[\text{25-pixel triangle rendering cost}] = \frac{182.4\ \text{ns}}{1 - 35.5\%} \tag{10}$$

$$[\text{25-pixel triangle rendering cost}] = 282.8\ \text{ns} \tag{11}$$

$$[\text{25-pixel triangle rendering rate}] = \frac{1}{[\text{rendering cost}]} = 3.53\ \text{M-triangle/sec} \tag{12}$$

In this example, regardless of the rendering page-size, the bottleneck is always the column access. The horizontal page-size is better at reducing the video overhead and obtaining a higher rendering rate. The other example, in Table 2, is based on the embedded frame-buffer in this paper. The block-size increases to 4×4 pixels, since [BPP*4*4 pixels = 1k-bit bus]. The block-miss cost equation does not need the factor of 2, because the read-modify-write bus can transfer the destination and the result-pixel at the same time. Also, this experimental frame-buffer has a video-port, which does not need column access for video refresh. As a result, the performance limit of the frame-buffer, which provides 4*4-pixel column access, is as follows when the 32×8-pixel page-size is used.

$$[\text{FRB: 25-pixel triangle rendering cost}] = 44.9\ \text{ns} \tag{13}$$

6.2 Programmable Page-Size

The last high-bandwidth technique presented in this paper is a programmable page-size. It offers two different page-sizes: a rectangular page-size for the rendering process and a horizontal page-size for video refresh. Figure 13 illustrates the programmable page-size and the word-line decoding. The row-decoder consists of a main row-decoder (MRD) for the primary select and a sub row-decoder (SRD) for the secondary select. The word-line is generated by ANDing the outputs of the MRD and the SRD. An MRD output is placed every four word-lines, so that the SRD decodes one of four by the internal signals X3...0, as shown in Fig. 14.

In the video mode, the same Xn (n = 3...0) is activated in all SRDs. In Fig. 13, word-line a0 in sub-array-A, b0 in sub-array-B, c0 in sub-array-C, and d0 in sub-array-D are selected to form the horizontal page. On the other hand, in the rendering mode, a different Xn is activated in each SRD: word-line a0 in sub-array-A, a1 in sub-array-B, a2 in sub-array-C, and a3 in sub-array-D are selected to form the rectangular page. Figure 14 shows the SRD circuitry and the decoding of X3...0.

Fig. 13 A programmable page-size.

Fig. 14 SRD structure.

Using the programmable page-size, with 128*2 pixels for the video page and 32*8 pixels for the rendering page, the final rendering rate of this experimental frame-buffer is updated as follows.

$$[\text{rendering page-miss rate}] = 2.11 \tag{14}$$

$$[\text{rendering page-miss cost}] = 37.4\ \text{ns/triangle} \tag{15}$$

This is because of the effect of the 32*8-pixel rectangular page-size for rendering.

$$[\text{video overhead}] = 4.2\% \tag{16}$$

This is because of the effect of the 128*2-pixel horizontal page-size for video.

$$[\text{final 25-pixel triangle rendering cost}] = \frac{\max(\text{page-miss cost},\ \text{block-miss cost})}{1 - \text{video overhead}} = 39.0\ \text{ns} \tag{17}$$

$$[\text{final 25-pixel triangle rendering rate}] = \frac{1}{[\text{rendering cost}]} = 25.6\ \text{M-triangle/sec} \tag{18}$$
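The cost model of Eqs. (8), (9), (12), and (17)–(18) can be re-evaluated in a few lines. This is a sketch with our own function names; where the text does not give one of the two terms inside max(·), a 0.0 placeholder stands in for it, as the comments note. Evaluated exactly, Eq. (8) gives about 35.4%; the text rounds this to 35.5% and uses that value in Eq. (10), so the sketch does the same.

```python
def video_overhead(row_ns: float, col_ns: float, page_w: int,
                   pixels_per_col: int, video_ns_per_pixel: float) -> float:
    """Eq. (8): time to fetch one page-wide piece of a scan-line (one row access
    plus page_w / pixels_per_col column accesses) over the time to display it."""
    fetch_ns = row_ns + col_ns * page_w / pixels_per_col
    return fetch_ns / (video_ns_per_pixel * page_w)


def triangle_cost_ns(page_miss_ns: float, block_miss_ns: float,
                     overhead: float) -> float:
    """Eq. (9): row and column accesses run independently, so the slower one
    sets the cost; video refresh takes away a fraction `overhead`."""
    return max(page_miss_ns, block_miss_ns) / (1.0 - overhead)


def triangle_rate_per_s(cost_ns: float) -> float:
    """Eqs. (12) and (18): triangles per second for a per-triangle cost in ns."""
    return 1e9 / cost_ns


if __name__ == "__main__":
    ov = video_overhead(row_ns=24, col_ns=6, page_w=128, pixels_per_col=2,
                        video_ns_per_pixel=9)
    print(f"Eq. (8) video overhead: {ov:.1%}")

    # Eqs. (10)-(11): the 182.4 ns block-miss cost dominates; the page-miss cost
    # from Table 1 is not reproduced, so 0.0 is only a placeholder.
    sdram = triangle_cost_ns(0.0, 182.4, 0.355)
    print(f"SDRAM, 128*2 page: {sdram:.1f} ns/triangle")

    # Eq. (17): 37.4 ns page-miss cost (Eq. 15), 4.2% video overhead (Eq. 16);
    # the block-miss cost is not given in the text, so 0.0 is a placeholder.
    final = triangle_cost_ns(37.4, 0.0, 0.042)
    print(f"final: {final:.1f} ns/triangle, "
          f"{triangle_rate_per_s(final) / 1e6:.1f} M-triangle/s")
    print(f"gain over the 44.9 ns of Eq. (13): {44.9 / final - 1:.0%}")
```

This prints a 35.4% video overhead, 282.8 ns for the SDRAM case, 39.0 ns and 25.6 M-triangle/s for the final design, and a 15% gain, in line with Eqs. (11), (17), (18) and the text that follows.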

The resulting contribution of the programmable page-size is a 15% higher rendering rate than that described in Sect. 6.1.

7. Experimental Result

A 3D frame-buffer using a 0.25 µm CMOS embedded DRAM process technology has been developed. The chip photo is shown in Fig. 15. The DRAM is divided into eight memory mats, and SRAM is placed at the top (or bottom) of each memory mat. The data-bus, comprising a 1k-bit read-bus and a 1k-bit write-bus plus an additional 40-bit bus for a particular frame, is placed on the DRAM array without any increase in die area: the word-line (MRD) uses the first metal layer, the data-bus uses the second metal layer, and the power-line uses the third metal layer. The memory control circuitry and the pixel processor are located between the right and left memory mats. All inputs are registered, and the input registers are located at the center of the chip to balance the skew from the pins to the input registers. The implemented graphics hardware reduces the task on the external I/O-bus and doubles the pixel-rate. The reported embedded 3D frame-buffer realized a bandwidth of 42.4 G-BPS and a rendering rate of 25.6 M-triangle/sec. A summary of the characteristic data is shown in Table 3. The power consumption when the 2k-bit bus operates is 0.3 W, the same as that of row access.

Fig. 15 Chip photo.

Table 3 Characteristics of the embedded frame-buffer.

8. Conclusion

A low-voltage single-ended read-modify-write bus architecture has been presented. It doubles the pixel-rate through concurrent read/write bus operation, while keeping the same bus-size and the same bus speed thanks to the single-ended structure. Also, the implemented DTB converts the read-modify-write bus operation into MOS-level signals with an amplitude of 0 V–1 V. Both architectures contribute to reducing the power consumption and eliminating the noise problem in a high-bandwidth eRAM.

A programmable page-size is also presented to reduce the page-miss rate; it efficiently improves the average bandwidth, as the wide-bus and fast-speed approaches do. It provides programmable control of two different kinds of page, one horizontal and the other rectangular, to reduce the page-miss rate in 3D frame-buffer applications.

References

[1] A. Yamazaki, T. Yamagata, Y. Arita, M. Taniguchi, and M. Yamada, “Large scale embedded DRAM technology,” IEICE Trans. Electron., vol.E81-C, no.5, pp.750–757, May 1998.

[2] T. Tsuruda, M. Kobayashi, M. Tsukude, T. Yamagata, K. Arimoto, and M. Yamada, “High speed high-bandwidth design methodologies for on-chip DRAM core multimedia system LSI's,” IEEE J. Solid-State Circuits, vol.32, pp.477–482, March 1997.

[3] R. Torrance, I. Mes, B. Hold, D. Jones, J. Crepeau, P. DeMone, D. MacDonald, C. O'Connell, P. Gillingham, R. White, S. Duggins, and D. Fielder, “A 33GB/s 13.4Mb integrated graphics accelerator and frame buffer,” ISSCC98, Dig. of Tech., p.340, 1998.

[4] S. Miyano, K. Numata, K. Sato, T. Yabe, M. Wada, R. Haga, M. Enkaku, M. Shiochi, Y. Kawashima, M. Iwase, M. Ohgata, J. Kumagai, T. Yoshida, M. Sakurai, S. Kaki, N. Yanagiya, H. Shinya, T. Furuyama, P. Hansen, M. Hannah, .Nagy, A. Nagarajan, and .Rungsea, “A 1.6GB/s data-transfer-rate 8Mb embedded DRAM,” ISSCC95, Dig. of Tech., p.300, 1995.

[5] T. Watanabe, R. Fujita, K. Yanagisawa, H. Tanaka, K. Ayukawa, M. Soga, Y. Tanaka, Y. Sugie, and Y. Nakagome, “A modular architecture for a 6.4-Gbyte/s, 8-Mbit media chip,” 1996 Symposium on VLSI Circuits, Dig. of Tech., p.42.

[6] K. Ayukawa, T. Watanabe, and S. Narita, “An access-sequence control scheme to enhance random-access performance of embedded DRAMs,” 1997 Symposium on VLSI Circuits, Dig. of Tech., p.59.

[7] K. Inoue, H. Nakamura, and H. Kawai, “A 10Mb frame buffer memory with Z-compare and A-blend unit,” IEEE J. Solid-State Circuits, vol.30, no.12, pp.1563–1568, Dec. 1995.


Kazunari Inoue was born in 1961. He received his B.S. degree in electrical engineering from Yokohama National University, Japan, in 1984. In 1984, he joined Mitsubishi Electric Corporation, Hyogo, Japan. Since 1984, he has been engaged in the development of semiconductor memories for graphics applications.

Hideaki Abe was born in 1966. He received the B.S. degree in electronic engineering from Kyoto Institute of Technology, Japan, in 1990. In 1990, he joined Mitsubishi Electric Corporation, Hyogo, Japan. Since 1990, he has been engaged in the development of semiconductor memories for graphics applications.

Kaori Mori was born in 1969. She received the B.S. degree in physics from Kyoto Institute of Technology, Japan, in 1992. In 1992, she joined Mitsubishi Electric Corporation, Hyogo, Japan. Since 1992, she has been engaged in the development of semiconductor memories for graphics applications.

Shuji Fukagawa was born in 1972. He received the B.S. and M.S. degrees in electronic engineering from Himeji Institute of Technology, Japan, in 1997. In 1997, he joined Mitsubishi Electric Corporation, Hyogo, Japan. Since 1997, he has been engaged in the development of semiconductor memories for graphics applications.
