
A Memory-Efficient Huffman Adaptive Coding Algorithm for Very Large Sets of Symbols

Steven Pigeon, Yoshua Bengio

{pigeon,bengioy}@iro.umontreal.ca

Département d'Informatique et de Recherche opérationnelle (DIRO), Université de Montréal

6/19/97

Note

The algorithm described in this paper as algorithm M is hereby released to the public domain by its authors. Anyone can use this algorithm free of royalties provided that any product, commercial or not, includes a notice mentioning the use of this algorithm, the name of the algorithm and the names of the authors.

Abstract

The problem of computing minimum-redundancy codes as we observe symbols one by one has received a lot of attention. However, existing algorithms implicitly assume either that we have a small alphabet (quite typically 256 symbols) or that we have an arbitrary amount of memory at our disposal for the creation of the tree. In real-life applications one may need to encode symbols coming from a much larger alphabet, e.g. when coding integers. We then have to deal not with hundreds of symbols but possibly with millions of symbols. While other algorithms use a number of nodes proportional to the number of observed symbols, we here propose one that uses a number of nodes proportional to the number of frequency classes, which is, quite interestingly, always smaller than or equal to the number of observed symbols.

1. Introduction and motivation.

Of all compression algorithms, it is certainly Huffman's algorithm [1] for generating minimum-redundancy codes that is the most popular. The general problem consists in assigning to each symbol s of a certain set S a binary code with an integer length that is as close as possible to ⌈−log p(s)⌉, where p(s) is an estimate of the probability of symbol s, as information theory would dictate. For the set S, the optimal average code length is given by the entropy H(S) = −Σ_{s∈S} p(s) log p(s). Huffman's algorithm, which we won't cover in great detail here, generates a set of codes C (also called a codebook) such that each s in S has its code c in C. These codes are prefix-free and instantaneously decodable; that is, no code is the prefix of some other code, and one can know when the last bit of a code has been read and immediately recognize the code.

Huffman's algorithm generates a set of codes for which the average code length is bounded by the interval [H(S), H(S) + p + 0.086), where p is the largest probability of all symbols s in S [6, p. 107]. While being rather efficient, this algorithm has several practical disadvantages.

The first is that it is a static code. Huffman's algorithm precomputes C after having obtained an estimate of p(s) for all s in S, and then the compression proper uses the same codebook C for all the data, which can lead to compression inefficiencies if the data in the sequence are not i.i.d., that is, differ substantially from p(s). One must not forget that we are using p(s) instead of a more appropriate p_t(s), which depends on t, the time or position in the sequentially processed data. If the codebook were adaptive, one might capture some information leading to a better approximation of p_t(s), which in turn leads to possibly better compression.

The second major disadvantage is that the algorithm must see all the data before one can actually perform the compression. Since we are trying to estimate p(s) for a data set composed of symbols out of S, we must scan all the data in order to get a good estimate. Often, it is quite impractical to do so. One might not have the space needed to store all the data before starting compression or, as is often the case in communication, "all the data" might be an elusive concept. If we estimate p(s) from a small sample of the data, we expose ourselves to the possibility that this estimate is a bad approximation of the real p(s).

The memory space used by Huffman's algorithm (whether the implementation is memory-efficient or not) is essentially O(n), where n is the number of symbols in S, which can lead to problems when n is large. In textbooks, one never finds an example of codebooks generated by Huffman's algorithm with more than 256 symbols. In this context, the worst-case codebook consists of assigning a code to each of the 256 possible bytes. In the real world, however, one might rather need a codebook for a set of several thousand, or even several million symbols, like all the m-bit-long integers. In certain special cases, one can use an off-the-shelf code like the so-called Golomb codes, Rice codes or others [6,7], but it is in fact rare that the distribution function exactly matches a geometric distribution function. Golomb codes are disastrous if used for a distribution that is not geometric! In general, in fact, the distribution is at best incompletely known, so it might be hard to derive a standard parametric code for it. And, of course, one must generally reject the trivial solution of building a lookup table of several million entries in memory.

Lastly, for the compressed data to be understandable by the decoder, the decoder must know the codebook C. Either one transmits the codebook itself or the function p(s). In both cases the expense of doing so when the codebook is large might be so high as to completely lose the gain made in compression. If S has 256 symbols, as is often the case, then the cost of transmitting the codebook remains modest compared to the cost of transmitting the compressed data, which can still be quite large. While there is a certain loss of compression due to the transmission of the codebook, it might be relatively insignificant and thus still give a satisfying result. But when one is dealing with a very large set of symbols, the codebook must also be very large. Even if a very clever way of encoding the codebook is used, it might remain so costly to transmit that the very idea of compressing this data using a Huffman-like algorithm must be abandoned.

In this technical report, to address the problems of adaptivity, memory management and codebook transmission, we propose a new algorithm for adaptive Huffman coding. The algorithm allows for very good adaptivity, efficient memory usage and efficient decoding. We first review Huffman's algorithm and the related adaptive algorithms from Faller, Gallager, Knuth and Vitter. We then present our algorithm, which consists in three conceptual parts: set representation, set migration and tree rebalancing. We finally discuss various initialization schemes and compare our results on the Calgary Corpus against static Huffman coding (the ordinary Huffman algorithm).

2. Huffman's Algorithm for Minimum-Redundancy Codes.

Huffman's algorithm is already described in detail in many textbooks, like the excellent [6]. Huffman's procedure, quite simple, to construct the codebook for a set S and a probability function p(s) is shown in Algorithm 1 and fig. 1. The objective is to construct a tree such that the path from a leaf that contains the symbol s to the root has a length as close as possible to ⌈−log p(s)⌉. To each leaf is associated a weight, which is basically the number of times the symbol s has been seen in the data (or p(s); both can be used interchangeably). Huffman's algorithm's strategy is to build a tree with weights that are as balanced as possible.

What is unusual in Huffman's algorithm is that instead of building this tree from the root down to the leaves (which would then be the Shannon-Fano algorithm! [6]) it is built from the leaves up to the root. The first step is to build a list which contains only leaves. Each leaf has an associated symbol s and a weight, p(s). We pick, out of that list, the two elements with the smallest weights. We create a new node such that its children are the two picks, and its weight is the sum of its children's weights. We put the new node back in the list. We repeat this procedure until only one node is left, which becomes the root of the tree. Codes are then assigned easily: each time we go left in the tree, we assign a 1, and when we go right, we assign a 0, as in fig. 1. The code for a leaf is the path one must follow to reach it from the root. There are other algorithms that first compute the lengths of the codes before actually assigning bits to the codes. Such algorithms [8,9] are quite fast and can work in situ, without the expense of the extra memory needed to maintain the tree structure. One can show that the average code length is within [H(S), H(S) + p + 0.086) of the optimal average code length, where p is the probability of the most probable symbol in S.

The Shannon-Fano algorithm proceeds, as we said, in the opposite direction from Huffman's. In the Shannon-Fano algorithm, we first construct a list L sorted in reverse order, from the most to the least probable symbol. All symbols, at that time, have an empty code. We then cut the list in half, but in such a way that the top half has the same total weight as the bottom half (as best as we can; there is no warranty that one can always cut the total weight of the list exactly into two halves!). To all the codes of the symbols of the top half of the list, we append 1, and to all the codes in the lower half we append 0. We repeat this process recursively on the top half of the list and on the bottom half, until we get a list with only one element, to which nothing is appended. The Shannon-Fano algorithm is however not quite as good as Huffman's: its average code length is provably within [H(S), H(S)+1) of the optimal entropy H(S).
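The recursive splitting described above can be sketched as follows (a minimal illustration, not the authors' code; the `shannon_fano` helper and its list-of-pairs interface are our own choices):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, weight) pairs, sorted from most to least
    probable. Returns a dict mapping each symbol to its code string."""
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(w for _, w in group)
        run, cut, best = 0, 0, None
        # find the cut that best balances the total weights of the halves
        for i in range(1, len(group)):
            run += group[i - 1][1]
            diff = abs(2 * run - total)
            if best is None or diff < best:
                best, cut = diff, i
        for s, _ in group[:cut]:
            codes[s] += "1"      # top half gets a 1
        for s, _ in group[cut:]:
            codes[s] += "0"      # bottom half gets a 0
        split(group[:cut])
        split(group[cut:])

    split(symbols)
    return codes
```

On the weights of fig. 1 (a:6, b:5, c:4, d:2, e:1) this happens to reproduce the codes of fig. 1 exactly: a=11, b=10, c=01, d=001, e=000.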

For each symbol s, create a leaf of weight p(s). Put all leaves in the list L.
While there are at least 2 elements in L
{
    Pick A and B out of L such that their weights are the smallest;
    Create a new node p such that
    {
        p's weight = A's weight + B's weight;
        Make A and B p's children;
    }
    Put p in the list L;
}

Algorithm 1. Huffman's Algorithm.
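Algorithm 1 translates almost line for line into a priority-queue implementation (a sketch under our own naming, not the paper's code; we compute only the code lengths, which determine the codebook up to the 0/1 labeling of branches):

```python
import heapq
from itertools import count

def huffman_lengths(weights):
    """weights: dict mapping symbol -> weight. Returns dict symbol -> code length."""
    tiebreak = count()  # avoids comparing the sets when weights are equal
    # the list L of Algorithm 1, kept as a heap of (weight, id, set of symbols)
    heap = [(w, next(tiebreak), {s}) for s, w in weights.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in weights}
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)  # the two smallest weights...
        wb, _, b = heapq.heappop(heap)
        for s in a | b:                 # ...go under a new parent node,
            depth[s] += 1               # one level further from the root
        heapq.heappush(heap, (wa + wb, next(tiebreak), a | b))
    return depth
```

For fig. 1's weights {a:6, b:5, c:4, d:2, e:1} this yields lengths 2, 2, 2, 3, 3, matching the codes a=11, b=10, c=01, d=001, e=000.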

[Figure: successive merge steps building the tree from the leaves a:6, b:5, c:4, d:2, e:1; the two smallest weights are merged at each step (2+1=3, 3+4=7, 5+6=11, 7+11=18), and the final codes are a=11, b=10, c=01, d=001, e=000.]

Fig. 1. Example of construction of a Huffman tree.

3. Adaptive Huffman algorithms: Algorithms FGK and Λ.

Faller, Gallager and Knuth independently proposed an algorithm for creating adaptive Huffman codes [2,3,4] that we will designate as the FGK algorithm. The FGK algorithm needs some secondary structures to manage the internal nodes of the tree. Not unlike AVL trees, the FGK algorithm seeks to balance the tree not by the depths of its nodes but by their weights. In that way, heavy leaves (that is, frequent leaves) are near the root and infrequent leaves very far from it. The algorithm also makes sure that leaves are as close to ⌈−log p(s)⌉ nodes away from the root as possible. However, this algorithm can be quite bad in some cases, as Vitter [5] shows. The algorithm can encode the data with as much as (2H+1)t bits, where H is not H(S) but the average code length obtained by the static Huffman algorithm, and t is the total number of symbols (the length of the data sequence). Its update procedure is shown in Algorithm 2. Nodes are numbered breadth-first from left to right, starting at the root. The FGK algorithm also uses a special zero node, which is used to introduce the symbols that have not been seen so far in the sequence. An update is always made in O(−log p(s)) operations.

Vitter, in [5], also proposes an algorithm for adaptive Huffman codes. This algorithm, algorithm Λ, to follow Vitter's own denomination, offers a better bound on compression performance than the FGK algorithm. Vitter shows that his algorithm is guaranteed to encode a data sequence within [H(S), H(S)+1+p+0.086) of the entropy for a sufficiently long sequence. In this technical report, we only present an overview of the algorithm, but the reader may consult [5] and [10] for details and a Pascal-language implementation. It also updates in O(−log p(s)) in any case, but the bound on compression is clearly better and thus justifies the added complexity of the algorithm (which is not too great, either).

These two algorithms, however, do not provide any solution for the case where the number of distinct symbols is really large. In each case, the number of nodes is still 2|S|+1, which is prohibitively high (especially when one considers all the pointers, weights, and other side information needed to construct and maintain the tree) in the case that is of interest to us, that is, sets with potentially millions of symbols.

procedure Update_FGK(a)
{
    q = leaf associated to a;
    If (q is a zero node) && (k

procedure Update_Λ(a);
{
    LeafToInc = 0;
    q = leaf associated to a;
    If (q is a zero node) and (k

4. Proposed Algorithm: Algorithm M.

We will now present an algorithm that is well suited for large sets of symbols. These naturally arise in a variety of contexts, such as in the deflate compression algorithm and in JPEG. In the deflate compression algorithm, lengths and positions of matches made by a basic LZ77 compressor over a window 32K symbols long are encoded using a variable-length coding scheme, thus achieving better compression. In the JPEG still-image compression standard, the image is first decomposed by a discrete cosine transform and the coefficients are then quantized. Once quantization is done, a scheme not unlike the one in fig. 2 [11] is used to represent the possible values of the coefficients. There are however fewer possible values in JPEG, but still a few thousand. There are plenty of other situations where a large number of symbols is needed.

The algorithm we now propose uses leaves that represent sets of symbols rather than individual symbols, and uses only two operations to remain as close as possible to the optimal: set migration and rebalancing. The basic idea is to put in the same leaf all the symbols that have the exact same probability. Set migration happens when a symbol moves from one set to another set. This is when a symbol, seen m times (and thus belonging to the set of all symbols seen m times), is seen one more time and is moved to the set of symbols seen m+1 times. If the set of symbols seen m+1 times does not exist, we create it as the sibling of the set of symbols seen m times¹. If the set of symbols seen m times is now empty, we destroy it. Since rebalancing is only needed when a new set (leaf) is created or deleted, we often avoid it: as more and more sets are created, it becomes more and more likely that the right set already exists. Rebalancing, when needed, will be performed in at most O(−log p(s)) operations. Let us now present this new algorithm in detail.

¹ If, at the moment of the insertion of node z, node x already has a sibling, we create a new parent node t in place of node x, such that its children are node x and node z.

Let us say that the symbols represent the set of the integers from 0 to B. The minimal initial configuration of the tree is a single leaf (which is also the root) containing the set {0,...,B}, to which a zero frequency is assigned. It can be zero since we don't work directly with the probability, but with the number of times a symbol has been seen. At that point, the codes are simply the binary representations of the numbers 0 to B. When the first symbol is seen, it gets a count of 1. Since we don't have a set of symbols seen exactly 1 time, we create it. That gives us a tree that has a root node and two leaves. The codes now have a one-bit prefix, followed by either no bits, if it is the (only) symbol in the new set, or ⌈log |{0,1,3,...,B}|⌉ bits (for any other). As other symbols are seen for the first time, the set of all symbols seen once grows and the set of all possible but yet unobserved symbols shrinks. When we encounter a symbol a second time, we create the set of symbols seen exactly 2 times, and so on. This situation is shown in fig. 3. Note that the numbers written on nodes are their weights (a parent's weight is the sum of its two children's weights, and a leaf's weight is simply the number of times the symbols in its associated set have been seen so far, times the number of symbols in its set). If a region of the tree is too much out of balance, we rebalance it, starting at the position where the new set was inserted.

Codes            Representable values
1                0
10 + 1 bit       -1, +1
1100 + 2 bits    -3, -2, +2, +3
1101 + 3 bits    -7, -6, -5, -4, +4, +5, +6, +7
11100 + 5 bits   -15, -14, ..., -9, -8, +8, +9, ..., +14, +15
11101 + 6 bits   -47, ..., -16, +16, ..., +47
etc.             etc.

Fig. 2. An example of code for a large set and a hypothetical distribution.

[Figure: four tree states. Initially, a single leaf {0,...,B} (codes: log(B) bits). After observing 2: leaves {2} (code 0) and {0,1,3,...,B} (codes: 1 + log(B) bits). After observing some others: leaves {2,4,5,6,11} (codes: 0 + 3 bits) and {0,1,3,7,...,10,12,...,B} (codes: 1 + log(B) bits). After observing 6 a second time: leaves {6} (code 00), {2,4,5,11} (codes: 01 + 2 bits) and {0,1,3,7,...,10,12,...,B} (codes: 1 + log(B) bits).]

Fig. 3. Creation of new sets. Assigned codes are to the right of the sets. The value of x depends on the type of prior one wants to use for the probability distribution (e.g., x = 1 is a Laplacian prior).

The big question is, of course: when do we know it is time to rebalance the tree? If we look at fig. 4, we see that after the insertion of the new set {7}, an internal node has count 5 while its sibling has only a count of 1. We would like this internal node to be nearer to the root by at least one step, since it is intuitively clear that this node (and all its children) is far more frequent than its sibling. The rule to shift up a node is:

If (the node's weight > its sibling's weight + 1) and (the node's weight > its uncle's weight), then shift up that node.

The shifting up itself is exactly as in a standard, off-the-shelf AVL tree. Fig. 5 shows how the shifting up of a subtree works. If for a particular node the rule is verified, a shift up occurs. There might be more than one shift up. We repeat until the rule ceases to apply or until we reach the root. And, of course, we update weights up to the root. The algorithm, which we call M, as in "migration", is listed in pseudocode as Algorithm 4.

[Figure: after an update, the subtree holding {2,4,5,11} (weight 4) and {6} (weight 1) totals 5 while its sibling has weight 1; the weights are first updated along the path to the root ("Updating a subtree"), the rule detects the imbalance ("Needs rebalancing!"), and the heavy subtree is shifted up ("Done!").]

Fig. 4. Rebalancing the tree.

[Figure: subtree A is exchanged with its uncle subtree C, then a rotation places A one level closer to the root.]

Fig. 5. Shifting up subtree A. (a) Exchanging subtree with uncle subtree, (b) rotation and (c) final state.

Procedure Update_M(a)
{
    q, p : pointers to leaves;
    p = find(a);                   // leaf/set containing a
    q = find(p's frequency + 1);   // where to migrate?
    if (q != nil)                  // somewhere to migrate?
    {
        remove a from p's set;
        p's weight = p's weight - p's frequency;
        add a to q's set;
        q's weight = q's weight + (p's frequency + 1);
        ShiftUp(q);
        if (p's set = ∅) remove p from the tree;
        else ShiftUp(p's sibling);
    }
    else
    {
        create a new node t;
        t's left child is p;
        t's right child is a new node n;
        n's set = {a};
        n's weight = p's frequency + 1;
        n's frequency = p's frequency + 1;
        replace the old p in the tree by t;
        remove a from p's set;
        p's weight = p's weight - p's frequency;
        if (p's set = ∅) remove p from the tree;
        else ShiftUp(p's sibling);
        ShiftUp(t);
    }
}

Algorithm 4. Algorithm M.

Procedure ShiftUp(t)
{
    while (t is not the root)
    {
        t's weight = t's right child's weight + t's left child's weight;
        if ((t's weight > t's sibling's weight + 1) && (t's weight > t's uncle's weight))
        {
            q = parent of parent of t;
            exchange t with t's uncle;
            exchange q's right and left children;
            update t's former parent's weight;
        }
        t = t's parent;
    }
}

Algorithm 5. ShiftUp algorithm.
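Stripped of the tree itself, the set-migration bookkeeping of Update_M can be sketched as follows (an illustration of the frequency-class idea only, not the authors' implementation; the `classes` dictionary plays the role of the leaves):

```python
from collections import defaultdict

def count_frequency_classes(sequence):
    """Track, as algorithm M's leaves would, the partition of observed
    symbols into classes 'seen exactly m times'. Returns the final
    mapping m -> set of symbols seen m times."""
    seen = defaultdict(int)     # symbol -> number of occurrences so far
    classes = defaultdict(set)  # m -> set of symbols seen exactly m times
    for a in sequence:
        m = seen[a]
        if m > 0:
            classes[m].discard(a)   # set migration: a leaves class m...
            if not classes[m]:
                del classes[m]      # an emptied source set is destroyed
        classes[m + 1].add(a)       # ...and joins class m+1 (created on demand)
        seen[a] = m + 1
    return classes
```

On "abracadabra", the five distinct symbols fall into only three classes ({c,d} seen once, {b,r} twice, {a} five times): the tree needs one leaf per class, never more than one per distinct symbol.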

4.1. Asymptotic performance of Algorithm M.

The code generated by algorithm M has an average length within [H(S), H(S)+2) given a sufficiently long sequence of symbols. Let us explain how one might derive this result. First, all codes have two parts: a prefix and a suffix. The prefix identifies in which subset K ⊆ S the symbol s is. The suffix identifies the symbol within the subset. If we assign an integer number of bits to both prefix and suffix, the length of the code is, for each symbol s in a subset K ⊆ S, given by

l(s) = l(K) + l(s|K) = −log p(K) − log p(s|K) = −log p(s)

which we approximate in the algorithm by

l̃(s) = l̃(K) + l̃(s|K) = Prefix(K) + ⌈log |K|⌉

where Prefix(K) is the length of the prefix for the set K. This is a very good approximation of l(s) when all the symbols in the subset K have roughly the same probability and Prefix(K) is near the optimal prefix length.

First, we consider the suffixes. All symbols in a set K are equiprobable by definition². One observes that in that case −log p(s|K) is exactly log |K|. When we assign a natural code to every symbol in K, this code will be of length ⌈log |K|⌉, which is never more than one bit too long compared to the optimal code. A worst case for a suffix is when the number of symbols in this set is 2^n + 1, for some n; in that case we will waste about 1 bit. The best case is when the size of the set is exactly 2^m, for some m, for which we will use exactly m bits.
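The claim about suffix lengths is easy to check numerically (a quick illustration, not from the paper; `suffix_waste` is our helper name):

```python
from math import ceil, log2

def suffix_waste(n):
    """Bits wasted by a natural ceil(log2 n)-bit code for a set of n
    equiprobable symbols, compared to the optimal log2 n bits."""
    return ceil(log2(n)) - log2(n)

# best case: |K| = 2^m wastes nothing
assert suffix_waste(1024) == 0.0
# worst case: |K| = 2^n + 1 wastes almost a full bit
assert suffix_waste(1025) > 0.99
```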

The prefixes are Shannon-Fano codes for the sets K_i (remember that the leaves of the tree are sets and not individual symbols, and that all symbols in a set have the same probability). Since the shift-up algorithm tries to produce equally weighted subtrees, it is essentially a classical Shannon-Fano algorithm applied bottom-up instead of top-down. The Shannon-Fano algorithm is not optimal (as Huffman's algorithm is) but it produces codes within [H(S), H(S)+1) of the entropy of the set S = {K_1, K_2, ..., K_m} [6].

² However, one may want to put in the set K_i all symbols of approximately the same probability. In that case, the bounds change because the rounding of the codes will be different. The code will not degrade if the optimal code of the least probable symbol of a set is less than one bit shorter than the optimal code of the most probable. That is, one must satisfy the constraint ⌈−log min{p(s|K_i)}⌉ ≤ 1 + ⌈−log max{p(s|K_i)}⌉.

Combining the two results leads to the bound of [H(S), H(S)+2) on the average code length. That makes algorithm M very interesting, especially when we consider that we don't create as many nodes as the two other adaptive algorithms. In section 5, Results, we compare the number of nodes created.

4.2. Initialisation

One way to initialize the algorithm is with a single node whose set contains all possible symbols (of frequency 1³, of weight B). But one may also choose to start with several subsets, each having different priors. If we take back the example in fig. 2, we could build with a priori knowledge, as in fig. 6, an initial solution already close to what we want, with correct initial sets and weights. That would give better compression right from the start, since we would need much less adaptation. That does not imply, however, that we always have to transmit the codebook or the tree prior to decompression. In various schemes the initial configuration of the tree may be standardized and would therefore be implicit. In our hypothetical example of fig. 2, we can decide that the tree is always the same when encoding or decoding starts.

[Figure: a tree whose leaves are the sets {all other values}, {-47,...,-16,16,...,47}, {-15,...,-8,8,...,15}, {-7,...,-4,4,...,7}, {-3,-2,+2,+3}, {-1,+1} and {0}.]

Fig. 6. A possible different initial configuration for the tree.

4.3. Fast Decoding

Yet another advantage of algorithm M is that it may provide faster decoding than the other algorithms. When we decode the compressed data with the other algorithms, we read the bits one by one as we go down the tree until a leaf is reached. In algorithm M, however, we don't have to read all the bits one by one: once we reach a leaf, we know how many bits there are still to be read, and that allows us to use a single operation to get a bit string which is readily converted into an integer, which in turn is converted into the correct symbol. On most general processors, the same time is needed to read a single bit or to extract a bit string, since it is generally the same operations but with different mask values. So the speedup is proportional to the number of bits read simultaneously. This technique could lead to very efficient decoders.

³ That is, we assume that the distribution is uniform, i.e. a Laplacian prior.
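The single-operation suffix read can be sketched as follows (an illustration under our own naming; a real decoder would shift and mask a word-sized window rather than a whole buffer):

```python
def read_bits(buf, pos, k):
    """Extract k bits of buf starting at bit position pos (MSB first),
    returning (value, new_position). On a real machine this is one
    shift-and-mask; here we emulate it on a Python integer."""
    total = len(buf) * 8
    window = int.from_bytes(buf, "big")
    value = (window >> (total - pos - k)) & ((1 << k) - 1)
    return value, pos + k

# once the prefix has identified a leaf whose set has 8 elements,
# the 3-bit suffix is fetched in a single operation:
value, pos = read_bits(b"\xb4", 1, 3)   # bits 011 of 1011_0100 -> value 3
```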

4.4. Efficiency Considerations and Drawbacks

The major drawback of algorithm M is the representation used for the sets. When the sets are very sparse and irregular (and no really convenient representation exists, as is the case with, for example, a bunch of prime numbers) they can use a lot of memory if one does not use a good representation. One must use a sophisticated representation for the sets if we want to manage memory efficiently and do fast operations on the sets.

Ideally, we would like to have O(1) operations on a set (adding and removing elements in constant time), but it will be quite good if it is in O(log n) for a set of n elements. While "anti-AVL" trees are used in algorithm M for the codes themselves, AVL trees might be a good choice for the internal representation of the sets. However, AVL trees eat a lot of memory, in addition to carrying a big O(log n) constant, and the total number of nodes in the trees would be the same as in any other Huffman-like tree algorithm: 2|S|-1. One may use an AVL tree with intervals as leaves, or even a bit-mapped boolean array if the number of different symbols is not too great (for 64K symbols, 8K bytes would be needed for such a device).

One alternative representation is a sparse skip list of intervals and integers. We use interval nodes when there are enough successive integers (like {1,2,3,4,5,6}) and single integers when there are not (like {3,5,7,11,13}). The representation we used had quite efficient memory usage. For example, when 3 was added to the set {1,2,4,5} (two intervals, {1,...,2} and {4,...,5}), the resulting set was represented by only one interval, {1,...,5}, which needs only two ints rather than five. Savings are greater as the width of the intervals grows. Needless to say, great care must be exercised when one chooses the representation of the sets.
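The interval coalescing described above can be sketched like this (a minimal model of the idea; the `IntervalSet` name and the list-based storage are ours, not the skip list actually used):

```python
class IntervalSet:
    """Set of integers stored as sorted, disjoint, inclusive intervals."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)

    def add(self, x):
        # insert x as a degenerate interval, then coalesce neighbours
        self.intervals.append((x, x))
        self.intervals.sort()
        merged = [self.intervals[0]]
        for lo, hi in self.intervals[1:]:
            mlo, mhi = merged[-1]
            if lo <= mhi + 1:                 # touching or overlapping
                merged[-1] = (mlo, max(mhi, hi))
            else:
                merged.append((lo, hi))
        self.intervals = merged
```

For the paper's example, adding 3 to {1,2,4,5} collapses the two intervals (1,2) and (4,5) into the single interval (1,5): two stored integers instead of five.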

5. Experimental Results.

We finally present our results and compare algorithm M against Huffman's algorithm. Table 1 summarizes the results on the Calgary corpus. The Huffman column of Table 1 is the number of bits per symbol obtained when we use a two-pass static Huffman code (that is, we read the entire file to gather information on the p(s) and then generate the codebook). In that column, we omitted the cost of transmitting the dictionary, as is often done. In the other column, namely Huffman*, we took into account the cost of sending the codebook, which is always assumed to be 1K, since with Huffman and M we assumed a priori that each file had 256 different symbols. For algorithm Λ, we find a column named algorithm Λ*. This is Paul Howard's version of algorithm Λ (see [5] and Acknowledgments). In this program the actual encoding is helped by an arithmetic coder. In the columns of algorithm M, we can also see the number of bits per symbol obtained, and we see that they are in general very close to the optimal Huffman codes, and in most cases smaller than Huffman*. In the Nodes column we see how many nodes were in the tree when compression was completed. The Migs % column counts, in percent, how many updates were solved using only set migration. The ShiftUps column counts the number of times a node was shifted up.

For all the experiments, the initial configuration shown in fig. 7 was used, since most of the Calgary Corpus files are text files of some kind, with the exception of Geo, Obj1, Obj2 and Pic, which are binary files. Here again we stress the importance the initial configuration has for our algorithm. If we examine Table 1, we see that most of the updates are not solved by set migration alone; these updates consist in adding and removing nodes from the tree. This situation arises when simple migration fails, that is, when the destination set does not exist or the source set has only one element (and therefore needs to be destroyed). We also see that shifting up in rebalancing is a relatively rare event, which is good even though it is not a costly operation.

With Huffman's algorithm and 256 symbols, one always gets 511 nodes (since in a binary tree we have 2|S|-1 nodes, counting leaves, for a set S), while this number varies with algorithm M depending on how many sets are created during compression/decompression. One can see that, on average, many fewer nodes are created. Instead of 511 nodes, an average of 192 is created for the files of the Calgary Corpus. This is only about 37.6% of the number of nodes created by the other algorithms. When we use 16 bits per symbol (obtained by the concatenation of two adjacent bytes), we find that algorithm M creates an average of 360.6 nodes instead of an average of 3856.6 with the other algorithms. This is about 9.3%: ten times fewer. Of course, there exist degenerate cases where the same number of nodes will be created, but never more.

File    Length   Huffman*  Huffman  Algo Λ*  Algo M  Nodes  Migs (%)  ShiftUps
Bib     111261   5.30      5.23     5.24     5.33    161    3.92      2374
Book1   768771   4.57      4.56     n/a      4.61    153    0.62      2494
Book2   610856   4.83      4.82     n/a      4.91    191    0.94      3570
Geo     102400   5.75      5.67     n/a      5.82    369    25.60     5068
News    377109   5.25      5.23     5.23     5.31    197    2.00      4651
Obj1    21504    6.35      5.97     n/a      6.19    225    35.88     1091
Obj2    246814   6.32      6.29     n/a      6.40    449    12.04     11802
Paper1  53161    5.17      5.02     5.03     5.12    171    5.86      1408
Paper2  82199    4.73      4.63     4.65     4.73    155    3.12      1152
Paper3  46526    4.87      4.69     4.71     4.86    147    5.22      978
Paper4  13286    5.35      4.73     4.80     4.98    109    11.97     523
Paper5  11954    5.65      4.97     5.05     5.20    131    16.48     665
Paper6  38105    5.25      5.04     5.06     5.15    161    8.15      1357
Pic     513216   1.68      1.66     1.66     1.68    169    0.76      1957
ProgC   39611    5.44      5.23     5.25     5.38    177    11.31     1984
ProgL   71646    4.91      4.80     4.81     4.92    147    4.32      1442
ProgP   49379    5.07      4.90     4.91     5.00    159    7.13      1902
Trans   93695    5.66      5.57     5.43     5.69    189    6.06      2926

μ:               5.12      4.95     4.75     5.07

Table 1. Performance of Algorithm M (bits/symbol). The Huffman* column represents the average code length if the cost of transmission of the codebook is included, while the Huffman column only takes into account the codes themselves. Algorithm Λ* is Vitter's algorithm Λ plus arithmetic coding. Algorithm M does not transmit the codebook; its bits/symbol figures are really the average code lengths of Algorithm M. The number of nodes with static Huffman (Huffman, Huffman*) is always 511, as 256 symbols were assumed for each file. Entries marked n/a correspond to unavailable data.

[Figure: a two-leaf initial tree: one leaf holds {32,...,127}, its sibling holds {0,...,31,128,...,255}.]

Fig. 7. Initial configuration for the tests.

In Table 2, we present the results again, but with 16-bit symbols. We supposed that each file of the Calgary Corpus was composed of 16-bit-long symbols. We compare Vitter's algorithm Λ, static Huffman (both with and without the cost of transmission of the dictionary) and algorithm M.

One may notice that Vitter's algorithm Λ is almost always the best when we consider 8 bits per symbol (it wins over the two other algorithms 13 times out of 18) but that the situation is reversed in favor of algorithm M when we consider 16 bits per symbol. In the latter situation, algorithm M wins 13 times out of 18. The average number of bits output by each algorithm is also a good indicator. When the symbols have 8 bits, we observe that Huffman* generates codes that have an average of 5.12 bits/symbol, algorithm Λ 4.75 bits/symbol and algorithm M 5.07 bits/symbol, which leaves algorithm Λ as a clear winner. However, again, the situation is reversed when we consider 16-bit symbols. Huffman* has an average of 9.17 bits/symbol, algorithm Λ 8.97 bits/symbol, and algorithm M generates codes of average length 8.67 bits/symbol. Here again, algorithm M wins over the two others.

The prototype program was written in C++ and runs on a 120 MHz 486 PC under Windows NT and on a number of other machines, like a Solaris SPARC 20, an UltraSPARC and an SGI Challenge. On the PC, the program can encode about 70,000 symbols per second when leaves are individual symbols and about 15,000 symbols/second when they are very sparse sets.

In conclusion, algorithm M is a good way to perform adaptive Huffman coding, especially when a very large number of different symbols is needed and we don't have that much memory. We also showed that the compression performance is very close to the optimal static Huffman code and that the bounds [H(S), H(S)+2) guarantee that the code remains close enough to optimality. Furthermore, this algorithm is free to use, since it is released to the public domain.

                          Huffman                                       Algo M
File    Length  Symbols   Nodes   Huffman*  Huffman  Algo Λ*  Algo M   Nodes
Bib     111261  1323      2645    8.96      8.58     10.48    8.98     423
Book1   768771  1633      3265    8.21      8.14     n/a      8.35     873
Book2   610856  2739      5477    8.70      8.56     n/a      8.81     835
Geo     102400  2042      4083    9.86      9.22     n/a      9.74     281
News    377109  3686      7371    9.61      9.30     9.62     9.66     713
Obj1    21504   3064      6127    13.72     9.17     n/a      9.65     97
Obj2    246814  6170      12339   9.72      8.93     n/a      9.40     483
Paper1  53161   1353      2705    9.45      8.64     9.47     9.13     281
Paper2  82199   1121      2241    8.57      8.13     8.57     8.48     369
Paper3  46526   1011      2021    8.93      8.23     8.94     8.68     281
Paper4  13286   705       1409    9.83      8.13     9.84     8.81     131
Paper5  11954   812       1623    10.60     8.43     10.60    9.13     113
Paper6  38105   1218      2435    9.63      8.61     9.66     9.14     231
Pic     513216  2321      4641    2.53      2.39     3.74     2.47     273
ProgC   39611   1443      2885    9.97      8.80     9.97     9.37     221
ProgL   71646   1032      2063    8.46      8.00     8.48     8.37     315
ProgP   49379   1254      2507    8.86      8.05     8.89     8.56     223
Trans   93695   1791      3581    9.52      8.91     9.34     9.39     347

μ:                                9.17      8.23     8.97     8.67

Table 2. Comparison of the algorithms with a larger number of symbols (16-bit symbols, bits/symbol). Entries marked n/a correspond to unavailable data.

Acknowledgements

Special thanks to J. S. Vitter (jsv@cs.duke.edu), who graciously gave me a working implementation of his algorithm in Pascal. We would also like to thank Paul G. Howard of AT&T Research (pgh@research.att.com) for his C implementation of algorithm Λ, which we called Λ*, since in his version Vitter's algorithm is followed by arithmetic coding (J. S. Vitter was Paul G. Howard's Ph.D. advisor).

References

[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., v40 (1952), p. 1098-1101.

[2] R. G. Gallager, "Variations on a theme by Huffman", IEEE Trans. on Information Theory, IT-24, v6 (Nov. 1978), p. 668-674.

[3] N. Faller, "An adaptive system for data compression", Records of the 7th Asilomar Conference on Circuits, Systems & Computers, 1973, p. 393-397.

[4] D. E. Knuth, "Dynamic Huffman coding", Journal of Algorithms, v6 (1983), p. 163-180.

[5] J. S. Vitter, "Design and analysis of dynamic Huffman codes", Journal of the ACM, v34 #4 (October 1987), p. 823-843.

[6] T. C. Bell, J. G. Cleary, I. H. Witten, Text Compression, Prentice-Hall, 1990 (QA76.9 T48B45).

[7] G. Seroussi, M. J. Weinberger, "On adaptive strategies for an extended family of Golomb-type codes", in DCC 1997, Snowbird, IEEE Computer Society Press.

[8] Alistair Moffat, J. Katajainen, "In-place calculation of minimum-redundancy codes", in Proceedings of the Workshop on Algorithms and Data Structures, Kingston University, 1995, LNCS 955, Springer-Verlag.

[9] Alistair Moffat, Andrew Turpin, "On the implementation of minimum-redundancy prefix codes (extended abstract)", in DCC 1996, p. 171-179.

[10] J. S. Vitter, "Dynamic Huffman coding", manuscript from the Internet with a Pascal source.

[11] Steven Pigeon, "A fast image compression method based on the fast Hartley transform", AT&T Research, Speech and Image Processing Lab 6 Technical Report (number ???).

[12] Douglas W. Jones, "Application of splay trees to data compression", CACM, v31 #8 (August 1988), p. 998-1007.
