lzma-specification.txt

mozilla-central/other-licenses/7zstub/src/DOC/lzma-specification.txt

Enable keyboard shortcuts

Source code

File a bug in Firefox :: Installer

Revision control

Copy as Markdown

Other Tools

LZMA specification (DRAFT version)↩

----------------------------------↩

↩

Author: Igor Pavlov↩

Date: 2015-06-14↩

↩

This specification defines the format of LZMA compressed data and lzma file format.↩

↩

Notation ↩

--------↩

↩

We use the syntax of C++ programming language.↩

We use the following types in C++ code:↩

  unsigned - unsigned integer, at least 16 bits in size↩

  int      - signed integer, at least 16 bits in size↩

  UInt64   - 64-bit unsigned integer↩

  UInt32   - 32-bit unsigned integer↩

  UInt16   - 16-bit unsigned integer↩

  Byte     - 8-bit unsigned integer↩

  bool     - boolean type with two possible values: false, true↩

↩

↩

lzma file format↩

================↩

↩

The lzma file contains the raw LZMA stream and the header with related properties.↩

↩

The files in that format use ".lzma" extension.↩

↩

The lzma file format layout:↩

↩

Offset Size Description↩

↩

  0     1   LZMA model properties (lc, lp, pb) in encoded form↩

  1     4   Dictionary size (32-bit unsigned integer, little-endian)↩

  5     8   Uncompressed size (64-bit unsigned integer, little-endian)↩

 13         Compressed data (LZMA stream)↩

↩

LZMA properties:↩

↩

    name  Range          Description↩

↩

      lc  [0, 8]         the number of "literal context" bits↩

      lp  [0, 4]         the number of "literal pos" bits↩

      pb  [0, 4]         the number of "pos" bits↩

dictSize  [0, 2^32 - 1]  the dictionary size ↩

↩

The following code encodes LZMA properties:↩

↩

void EncodeProperties(Byte *properties)↩

{↩

  properties[0] = (Byte)((pb * 5 + lp) * 9 + lc);↩

  Set_UInt32_LittleEndian(properties + 1, dictSize);↩

}↩

↩

If the value of dictionary size in properties is smaller than (1 << 12),↩

the LZMA decoder must set the dictionary size variable to (1 << 12).↩

↩

#define LZMA_DIC_MIN (1 << 12)↩

↩

  unsigned lc, pb, lp;↩

  UInt32 dictSize;↩

  UInt32 dictSizeInProperties;↩

↩

  void DecodeProperties(const Byte *properties)↩

  {↩

    unsigned d = properties[0];↩

    if (d >= (9 * 5 * 5))↩

      throw "Incorrect LZMA properties";↩

    lc = d % 9;↩

    d /= 9;↩

    pb = d / 5;↩

    lp = d % 5;↩

    dictSizeInProperties = 0;↩

    for (int i = 0; i < 4; i++)↩

      dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i);↩

    dictSize = dictSizeInProperties;↩

    if (dictSize < LZMA_DIC_MIN)↩

      dictSize = LZMA_DIC_MIN;↩

  }↩

↩

If "Uncompressed size" field contains ones in all 64 bits, it means that↩

uncompressed size is unknown and there is the "end marker" in stream,↩

that indicates the end of decoding point.↩

In opposite case, if the value from "Uncompressed size" field is not↩

equal to ((2^64) - 1), the LZMA stream decoding must be finished after↩

specified number of bytes (Uncompressed size) is decoded. And if there ↩

is the "end marker", the LZMA decoder must read that marker also.↩

↩

↩

The new scheme to encode LZMA properties↩

----------------------------------------↩

↩

If LZMA compression is used for some another format, it's recommended to↩

use a new improved scheme to encode LZMA properties. That new scheme was↩

used in xz format that uses the LZMA2 compression algorithm.↩

The LZMA2 is a new compression algorithm that is based on the LZMA algorithm.↩

↩

The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports↩

only reduced set of dictionary sizes:↩

  (2 << 11), (3 << 11),↩

  (2 << 12), (3 << 12),↩

  ...↩

  (2 << 30), (3 << 30),↩

  (2 << 31) - 1↩

↩

The dictionary size can be extracted from encoded value with the following code:↩

↩

  dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));↩

↩

Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of ↩

"lc" and "lp" properties:↩

↩

  if (lc + lp > 4)↩

    throw "Unsupported properties: (lc + lp) > 4";↩

↩

There are some advantages for LZMA decoder with such (lc + lp) value↩

limitation. It reduces the maximum size of tables allocated by decoder.↩

And it reduces the complexity of initialization procedure, that can be ↩

important to keep high speed of decoding of big number of small LZMA streams.↩

↩

It's recommended to use that limitation (lc + lp <= 4) for any new format↩

that uses LZMA compression. Note that the combinations of "lc" and "lp" ↩

parameters, where (lc + lp > 4), can provide significant improvement in ↩

compression ratio only in some rare cases.↩

↩

The LZMA properties can be encoded into two bytes in new scheme:↩

↩

Offset Size Description↩

↩

  0     1   The dictionary size encoded with LZMA2 scheme↩

  1     1   LZMA model properties (lc, lp, pb) in encoded form↩

↩

↩

The RAM usage ↩

=============↩

↩

The RAM usage for LZMA decoder is determined by the following parts:↩

↩

1) The Sliding Window (from 4 KiB to 4 GiB).↩

2) The probability model counter arrays (arrays of 16-bit variables).↩

3) Some additional state variables (about 10 variables of 32-bit integers).↩

↩

↩

The RAM usage for Sliding Window↩

--------------------------------↩

↩

There are two main scenarios of decoding:↩

↩

1) The decoding of full stream to one RAM buffer.↩

↩

  If we decode full LZMA stream to one output buffer in RAM, the decoder ↩

  can use that output buffer as sliding window. So the decoder doesn't ↩

  need additional buffer allocated for sliding window.↩

↩

2) The decoding to some external storage.↩

↩

  If we decode LZMA stream to external storage, the decoder must allocate↩

  the buffer for sliding window. The size of that buffer must be equal ↩

  or larger than the value of dictionary size from properties of LZMA stream.↩

↩

In this specification we describe the code for decoding to some external↩

storage. The optimized version of code for decoding of full stream to one↩

output RAM buffer can require some minor changes in code.↩

↩

↩

The RAM usage for the probability model counters↩

------------------------------------------------↩

↩

The size of the probability model counter arrays is calculated with the ↩

following formula:↩

↩

size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc))↩

↩

Each probability model counter is 11-bit unsigned integer.↩

If we use 16-bit integer variables (2-byte integers) for these probability ↩

model counters, the RAM usage required by probability model counter arrays ↩

can be estimated with the following formula:↩

↩

  RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc))↩

↩

For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is↩

↩

  RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB↩

↩

The maximum RAM state usage is required for decoding the stream with lp = 4 ↩

and lc = 8:↩

↩

  RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB↩

↩

If the decoder uses LZMA2's limited property condition ↩

(lc + lp <= 4), the RAM usage will be not larger than↩

↩

  RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB↩

↩

↩

The RAM usage for encoder↩

-------------------------↩

↩

There are many variants for LZMA encoding code.↩

These variants have different values for memory consumption.↩

Note that memory consumption for LZMA Encoder can not be ↩

smaller than memory consumption of LZMA Decoder for same stream.↩

↩

The RAM usage required by modern effective implementation of ↩

LZMA Encoder can be estimated with the following formula:↩

↩

  Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize.↩

↩

But there are some modes of the encoder that require less memory.↩

↩

↩

LZMA Decoding↩

=============↩

↩

The LZMA compression algorithm uses LZ-based compression with Sliding Window↩

and Range Encoding as entropy coding method.↩

↩

↩

Sliding Window↩

--------------↩

↩

LZMA uses Sliding Window compression similar to LZ77 algorithm.↩

↩

LZMA stream must be decoded to the sequence that consists↩

of MATCHES and LITERALS:↩

↩

  - a LITERAL is a 8-bit character (one byte).↩

    The decoder just puts that LITERAL to the uncompressed stream.↩

↩

  - a MATCH is a pair of two numbers (DISTANCE-LENGTH pair).↩

    The decoder takes one byte exactly "DISTANCE" characters behind↩

    current position in the uncompressed stream and puts it to ↩

    uncompressed stream. The decoder must repeat it "LENGTH" times.↩

↩

The "DISTANCE" can not be larger than dictionary size.↩

And the "DISTANCE" can not be larger than the number of bytes in↩

the uncompressed stream that were decoded before that match.↩

↩

In this specification we use cyclic buffer to implement Sliding Window↩

for LZMA decoder:↩

↩

class COutWindow↩

{↩

  Byte *Buf;↩

  UInt32 Pos;↩

  UInt32 Size;↩

  bool IsFull;↩

↩

public:↩

  unsigned TotalPos;↩

  COutStream OutStream;↩

↩

  COutWindow(): Buf(NULL) {}↩

  ~COutWindow() { delete []Buf; }↩

↩

  void Create(UInt32 dictSize)↩

  {↩

    Buf = new Byte[dictSize];↩

    Pos = 0;↩

    Size = dictSize;↩

    IsFull = false;↩

    TotalPos = 0;↩

  }↩

↩

  void PutByte(Byte b)↩

  {↩

    TotalPos++;↩

    Buf[Pos++] = b;↩

    if (Pos == Size)↩

    {↩

      Pos = 0;↩

      IsFull = true;↩

    }↩

    OutStream.WriteByte(b);↩

  }↩

↩

  Byte GetByte(UInt32 dist) const↩

  {↩

    return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos];↩

  }↩

↩

  void CopyMatch(UInt32 dist, unsigned len)↩

  {↩

    for (; len > 0; len--)↩

      PutByte(GetByte(dist));↩

  }↩

↩

  bool CheckDistance(UInt32 dist) const↩

  {↩

    return dist <= Pos || IsFull;↩

  }↩

↩

  bool IsEmpty() const↩

  {↩

    return Pos == 0 && !IsFull;↩

  }↩

};↩

↩

↩

In another implementation it's possible to use one buffer that contains ↩

Sliding Window and the whole data stream after uncompressing.↩

↩

↩

Range Decoder↩

-------------↩

↩

LZMA algorithm uses Range Encoding (1) as entropy coding method.↩

↩

LZMA stream contains just one very big number in big-endian encoding.↩

LZMA decoder uses the Range Decoder to extract a sequence of binary↩

symbols from that big number.↩

↩

The state of the Range Decoder:↩

↩

struct CRangeDecoder↩

{↩

  UInt32 Range; ↩

  UInt32 Code;↩

  InputStream *InStream;↩

↩

  bool Corrupted;↩

}↩

↩

The notes about UInt32 type for the "Range" and "Code" variables:↩

↩

  It's possible to use 64-bit (unsigned or signed) integer type↩

  for the "Range" and the "Code" variables instead of 32-bit unsigned,↩

  but some additional code must be used to truncate the values to ↩

  low 32-bits after some operations.↩

↩

  If the programming language does not support 32-bit unsigned integer type ↩

  (like in case of JAVA language), it's possible to use 32-bit signed integer, ↩

  but some code must be changed. For example, it's required to change the code↩

  that uses comparison operations for UInt32 variables in this specification.↩

↩

The Range Decoder can be in some states that can be treated as ↩

"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted":↩

↩

  (Corrupted == false), if the Range Decoder has not detected any corruption.↩

  (Corrupted == true), if the Range Decoder has detected some corruption.↩

↩

The reference LZMA Decoder ignores the value of the "Corrupted" variable.↩

So it continues to decode the stream, even if the corruption can be detected↩

in the Range Decoder. To provide the full compatibility with output of the ↩

reference LZMA Decoder, another LZMA Decoder implementations must also ↩

ignore the value of the "Corrupted" variable.↩

↩

The LZMA Encoder is required to create only such LZMA streams, that will not ↩

lead the Range Decoder to states, where the "Corrupted" variable is set to true.↩

↩

The Range Decoder reads first 5 bytes from input stream to initialize↩

the state:↩

↩

bool CRangeDecoder::Init()↩

{↩

  Corrupted = false;↩

  Range = 0xFFFFFFFF;↩

  Code = 0;↩

↩

  Byte b = InStream->ReadByte();↩

↩

  for (int i = 0; i < 4; i++)↩

    Code = (Code << 8) | InStream->ReadByte();↩

↩

  if (b != 0 || Code == Range)↩

    Corrupted = true;↩

  return b == 0;↩

}↩

↩

The LZMA Encoder always writes ZERO in initial byte of compressed stream.↩

That scheme allows to simplify the code of the Range Encoder in the ↩

LZMA Encoder. If initial byte is not equal to ZERO, the LZMA Decoder must↩

stop decoding and report error.↩

↩

After the last bit of data was decoded by Range Decoder, the value of the↩

"Code" variable must be equal to 0. The LZMA Decoder must check it by ↩

calling the IsFinishedOK() function:↩

↩

  bool IsFinishedOK() const { return Code == 0; }↩

↩

If there is corruption in data stream, there is big probability that↩

the "Code" value will be not equal to 0 in the Finish() function. So that↩

check in the IsFinishedOK() function provides very good feature for ↩

corruption detection.↩

↩

The value of the "Range" variable before each bit decoding can not be smaller ↩

than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in ↩

described range.↩

↩

#define kTopValue ((UInt32)1 << 24)↩

↩

void CRangeDecoder::Normalize()↩

{↩

  if (Range < kTopValue)↩

  {↩

    Range <<= 8;↩

    Code = (Code << 8) | InStream->ReadByte();↩

  }↩

}↩

↩

Notes: if the size of the "Code" variable is larger than 32 bits, it's↩

required to keep only low 32 bits of the "Code" variable after the change↩

in Normalize() function.↩

↩

If the LZMA Stream is not corrupted, the value of the "Code" variable is↩

always smaller than value of the "Range" variable.↩

But the Range Decoder ignores some types of corruptions, so the value of↩

the "Code" variable can be equal or larger than value of the "Range" variable↩

for some "Corrupted" archives.↩

↩

↩

LZMA uses Range Encoding only with binary symbols of two types:↩

  1) binary symbols with fixed and equal probabilities (direct bits)↩

  2) binary symbols with predicted probabilities↩

↩

The DecodeDirectBits() function decodes the sequence of direct bits:↩

↩

UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits)↩

{↩

  UInt32 res = 0;↩

  do↩

  {↩

    Range >>= 1;↩

    Code -= Range;↩

    UInt32 t = 0 - ((UInt32)Code >> 31);↩

    Code += Range & t;↩

↩

    if (Code == Range)↩

      Corrupted = true;↩

↩

    Normalize();↩

    res <<= 1;↩

    res += t + 1;↩

  }↩

  while (--numBits);↩

  return res;↩

}↩

↩

↩

The Bit Decoding with Probability Model↩

---------------------------------------↩

↩

The task of Bit Probability Model is to estimate probabilities of binary↩

symbols. And then it provides the Range Decoder with that information.↩

The better prediction provides better compression ratio.↩

The Bit Probability Model uses statistical data of previous decoded↩

symbols.↩

↩

That estimated probability is presented as 11-bit unsigned integer value↩

that represents the probability of symbol "0".↩

↩

#define kNumBitModelTotalBits 11↩

↩

Mathematical probabilities can be presented with the following formulas:↩

     probability(symbol_0) = prob / 2048.↩

     probability(symbol_1) =  1 - Probability(symbol_0) =  ↩

                           =  1 - prob / 2048 =  ↩

                           =  (2048 - prob) / 2048↩

where the "prob" variable contains 11-bit integer probability counter.↩

↩

It's recommended to use 16-bit unsigned integer type, to store these 11-bit↩

probability values:↩

↩

typedef UInt16 CProb;↩

↩

Each probability value must be initialized with value ((1 << 11) / 2),↩

that represents the state, where probabilities of symbols 0 and 1 ↩

are equal to 0.5:↩

↩

#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2)↩

↩

The INIT_PROBS macro is used to initialize the array of CProb variables:↩

↩

#define INIT_PROBS(p) \↩

 { for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }↩

↩

↩

The DecodeBit() function decodes one bit.↩

The LZMA decoder provides the pointer to CProb variable that contains ↩

information about estimated probability for symbol 0 and the Range Decoder ↩

updates that CProb variable after decoding. The Range Decoder increases ↩

estimated probability of the symbol that was decoded:↩

↩

#define kNumMoveBits 5↩

↩

unsigned CRangeDecoder::DecodeBit(CProb *prob)↩

{↩

  unsigned v = *prob;↩

  UInt32 bound = (Range >> kNumBitModelTotalBits) * v;↩

  unsigned symbol;↩

  if (Code < bound)↩

  {↩

    v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits;↩

    Range = bound;↩

    symbol = 0;↩

  }↩

  else↩

  {↩

    v -= v >> kNumMoveBits;↩

    Code -= bound;↩

    Range -= bound;↩

    symbol = 1;↩

  }↩

  *prob = (CProb)v;↩

  Normalize();↩

  return symbol;↩

}↩

↩

↩

The Binary Tree of bit model counters↩

-------------------------------------↩

↩

LZMA uses a tree of Bit model variables to decode symbol that needs↩

several bits for storing. There are two versions of such trees in LZMA:↩

  1) the tree that decodes bits from high bit to low bit (the normal scheme).↩

  2) the tree that decodes bits from low bit to high bit (the reverse scheme).↩

↩

Each binary tree structure supports different size of decoded symbol↩

(the size of binary sequence that contains value of symbol).↩

If that size of decoded symbol is "NumBits" bits, the tree structure ↩

uses the array of (2 << NumBits) counters of CProb type. ↩

But only ((2 << NumBits) - 1) items are used by encoder and decoder.↩

The first item (the item with index equal to 0) in array is unused.↩

That scheme with unused array's item allows to simplify the code.↩

↩

unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc)↩

{↩

  unsigned m = 1;↩

  unsigned symbol = 0;↩

  for (unsigned i = 0; i < numBits; i++)↩

  {↩

    unsigned bit = rc->DecodeBit(&probs[m]);↩

    m <<= 1;↩

    m += bit;↩

    symbol |= (bit << i);↩

  }↩

  return symbol;↩

}↩

↩

template <unsigned NumBits>↩

class CBitTreeDecoder↩

{↩

  CProb Probs[(unsigned)1 << NumBits];↩

↩

public:↩

↩

  void Init()↩

  {↩

    INIT_PROBS(Probs);↩

  }↩

↩

  unsigned Decode(CRangeDecoder *rc)↩

  {↩

    unsigned m = 1;↩

    for (unsigned i = 0; i < NumBits; i++)↩

      m = (m << 1) + rc->DecodeBit(&Probs[m]);↩

    return m - ((unsigned)1 << NumBits);↩

  }↩

↩

  unsigned ReverseDecode(CRangeDecoder *rc)↩

  {↩

    return BitTreeReverseDecode(Probs, NumBits, rc);↩

  }↩

};↩

↩

↩

LZ part of LZMA ↩

---------------↩

↩

LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.↩

↩

↩

The Literal Decoding↩

--------------------↩

↩

The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where ↩

each table contains 0x300 CProb values:↩

↩

  CProb *LitProbs;↩

↩

  void CreateLiterals()↩

  {↩

    LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];↩

  }↩

↩

  void InitLiterals()↩

  {↩

    UInt32 num = (UInt32)0x300 << (lc + lp);↩

    for (UInt32 i = 0; i < num; i++)↩

      LitProbs[i] = PROB_INIT_VAL;↩

  }↩

↩

To select the table for decoding it uses the context that consists of↩

(lc) high bits from previous literal and (lp) low bits from value that↩

represents current position in outputStream.↩

↩

If (State > 7), the Literal Decoder also uses "matchByte" that represents ↩

the byte in OutputStream at position the is the DISTANCE bytes before ↩

current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair↩

of latest decoded match.↩

↩

The following code decodes one literal and puts it to Sliding Window buffer:↩

↩

  void DecodeLiteral(unsigned state, UInt32 rep0)↩

  {↩

    unsigned prevByte = 0;↩

    if (!OutWindow.IsEmpty())↩

      prevByte = OutWindow.GetByte(1);↩

↩

    unsigned symbol = 1;↩

    unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc));↩

    CProb *probs = &LitProbs[(UInt32)0x300 * litState];↩

↩

    if (state >= 7)↩

    {↩

      unsigned matchByte = OutWindow.GetByte(rep0 + 1);↩

      do↩

      {↩

        unsigned matchBit = (matchByte >> 7) & 1;↩

        matchByte <<= 1;↩

        unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);↩

        symbol = (symbol << 1) | bit;↩

        if (matchBit != bit)↩

          break;↩

      }↩

      while (symbol < 0x100);↩

    }↩

    while (symbol < 0x100)↩

      symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);↩

    OutWindow.PutByte((Byte)(symbol - 0x100));↩

  }↩

↩

↩

The match length decoding↩

-------------------------↩

↩

The match length decoder returns normalized (zero-based value) ↩

length of match. That value can be converted to real length of the match ↩

with the following code:↩

↩

#define kMatchMinLen 2↩

↩

    matchLen = len + kMatchMinLen;↩

↩

The match length decoder can return the values from 0 to 271.↩

And the corresponded real match length values can be in the range ↩

from 2 to 273.↩

↩

The following scheme is used for the match length encoding:↩

↩

  Binary encoding    Binary Tree structure    Zero-based match length ↩

  sequence                                    (binary + decimal):↩

↩

  0 xxx              LowCoder[posState]       xxx↩

  1 0 yyy            MidCoder[posState]       yyy + 8↩

  1 1 zzzzzzzz       HighCoder                zzzzzzzz + 16↩

↩

LZMA uses bit model variable "Choice" to decode the first selection bit.↩

↩

If the first selection bit is equal to 0, the decoder uses binary tree ↩

  LowCoder[posState] to decode 3-bit zero-based match length (xxx).↩

↩

If the first selection bit is equal to 1, the decoder uses bit model ↩

  variable "Choice2" to decode the second selection bit.↩

↩

  If the second selection bit is equal to 0, the decoder uses binary tree ↩

    MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match↩

    length is equal to (yyy + 8).↩

↩

  If the second selection bit is equal to 1, the decoder uses binary tree ↩

    HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based ↩

    match length is equal to (zzzzzzzz + 16).↩

↩

LZMA uses "posState" value as context to select the binary tree ↩

from LowCoder and MidCoder binary tree arrays:↩

↩

    unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);↩

↩

The full code of the length decoder:↩

↩

class CLenDecoder↩

{↩

  CProb Choice;↩

  CProb Choice2;↩

  CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];↩

  CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];↩

  CBitTreeDecoder<8> HighCoder;↩

↩

public:↩

↩

  void Init()↩

  {↩

    Choice = PROB_INIT_VAL;↩

    Choice2 = PROB_INIT_VAL;↩

    HighCoder.Init();↩

    for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)↩

    {↩

      LowCoder[i].Init();↩

      MidCoder[i].Init();↩

    }↩

  }↩

↩

  unsigned Decode(CRangeDecoder *rc, unsigned posState)↩

  {↩

    if (rc->DecodeBit(&Choice) == 0)↩

      return LowCoder[posState].Decode(rc);↩

    if (rc->DecodeBit(&Choice2) == 0)↩

      return 8 + MidCoder[posState].Decode(rc);↩

    return 16 + HighCoder.Decode(rc);↩

  }↩

};↩

↩

The LZMA decoder uses two instances of CLenDecoder class.↩

The first instance is for the matches of "Simple Match" type,↩

and the second instance is for the matches of "Rep Match" type:↩

↩

  CLenDecoder LenDecoder;↩

  CLenDecoder RepLenDecoder;↩

↩

↩

The match distance decoding↩

---------------------------↩

↩

LZMA supports dictionary sizes up to 4 GiB minus 1.↩

The value of match distance (decoded by distance decoder) can be ↩

from 1 to 2^32. But the distance value that is equal to 2^32 is used to↩

indicate the "End of stream" marker. So real largest match distance ↩

that is used for LZ-window match is (2^32 - 1).↩

↩

LZMA uses normalized match length (zero-based length) ↩

to calculate the context state "lenState" do decode the distance value:↩

↩

#define kNumLenToPosStates 4↩

↩

    unsigned lenState = len;↩

    if (lenState > kNumLenToPosStates - 1)↩

      lenState = kNumLenToPosStates - 1;↩

↩

The distance decoder returns the "dist" value that is zero-based value ↩

of match distance. The real match distance can be calculated with the↩

following code:↩

↩

  matchDistance = dist + 1; ↩

↩

The state of the distance decoder and the initialization code: ↩

↩

  #define kEndPosModelIndex 14↩

  #define kNumFullDistances (1 << (kEndPosModelIndex >> 1))↩

  #define kNumAlignBits 4↩

↩

  CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];↩

  CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];↩

  CBitTreeDecoder<kNumAlignBits> AlignDecoder;↩

↩

  void InitDist()↩

  {↩

    for (unsigned i = 0; i < kNumLenToPosStates; i++)↩

      PosSlotDecoder[i].Init();↩

    AlignDecoder.Init();↩

    INIT_PROBS(PosDecoders);↩

  }↩

↩

At first stage the distance decoder decodes 6-bit "posSlot" value with bit↩

tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different ↩

"posSlot" values.↩

↩

    unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);↩

↩

The encoding scheme for distance value is shown in the following table:↩

↩

posSlot (decimal) /↩

      zero-based distance (binary)↩

 0    0↩

 1    1↩

 2    10↩

 3    11↩

↩

 4    10 x↩

 5    11 x↩

 6    10 xx↩

 7    11 xx↩

 8    10 xxx↩

 9    11 xxx↩

10    10 xxxx↩

11    11 xxxx↩

12    10 xxxxx↩

13    11 xxxxx↩

↩

14    10 yy zzzz↩

15    11 yy zzzz↩

16    10 yyy zzzz↩

17    11 yyy zzzz↩

...↩

62    10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz↩

63    11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz↩

↩

where ↩

  "x ... x" means the sequence of binary symbols encoded with binary tree and ↩

      "Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13.↩

  "y" means direct bit encoded with range coder.↩

  "zzzz" means the sequence of four binary symbols encoded with binary↩

      tree with "Reverse" scheme, where one common binary tree "AlignDecoder"↩

      is used for all posSlot values.↩

↩

If (posSlot < 4), the "dist" value is equal to posSlot value.↩

↩

If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of↩

  the high bits of "dist" value and the number of the low bits.↩

↩

  If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders.↩

    (one separated bit tree decoder per one posSlot value) and "Reverse" scheme.↩

    In this implementation we use one CProb array "PosDecoders" that contains ↩

    all CProb variables for all these bit decoders.↩

↩

  if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct ↩

    bits from RangeDecoder and the low 4 bits are decoded with a bit tree ↩

    decoder "AlignDecoder" with "Reverse" scheme.↩

↩

The code to decode zero-based match distance:↩

↩

  unsigned DecodeDistance(unsigned len)↩

  {↩

    unsigned lenState = len;↩

    if (lenState > kNumLenToPosStates - 1)↩

      lenState = kNumLenToPosStates - 1;↩

↩

    unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);↩

    if (posSlot < 4)↩

      return posSlot;↩

↩

    unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1);↩

    UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits);↩

    if (posSlot < kEndPosModelIndex)↩

      dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec);↩

    else↩

    {↩

      dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits;↩

      dist += AlignDecoder.ReverseDecode(&RangeDec);↩

    }↩

    return dist;↩

  }↩

↩

↩

↩

LZMA Decoding modes↩

-------------------↩

↩

There are 2 types of LZMA streams:↩

↩

1) The stream with "End of stream" marker.↩

2) The stream without "End of stream" marker.↩

↩

And the LZMA Decoder supports 3 modes of decoding:↩

↩

1) The unpack size is undefined. The LZMA decoder stops decoding after ↩

   getting "End of stream" marker. ↩

   The input variables for that case:↩

↩

      markerIsMandatory = true↩

      unpackSizeDefined = false↩

      unpackSize contains any value↩

↩

2) The unpack size is defined and LZMA decoder supports both variants, ↩

   where the stream can contain "End of stream" marker or the stream is↩

   finished without "End of stream" marker. The LZMA decoder must detect ↩

   any of these situations.↩

   The input variables for that case:↩

↩

      markerIsMandatory = false↩

      unpackSizeDefined = true↩

      unpackSize contains unpack size↩

↩

3) The unpack size is defined and the LZMA stream must contain ↩

   "End of stream" marker↩

   The input variables for that case:↩

↩

      markerIsMandatory = true↩

      unpackSizeDefined = true↩

      unpackSize contains unpack size↩

↩

↩

The main loop of decoder↩

------------------------↩

↩

The main loop of LZMA decoder:↩

↩

Initialize the LZMA state.↩

loop↩

{↩

  // begin of loop↩

  Check "end of stream" conditions.↩

  Decode Type of MATCH / LITERAL. ↩

    If it's LITERAL, decode LITERAL value and put the LITERAL to Window.↩

    If it's MATCH, decode the length of match and the match distance. ↩

        Check error conditions, check end of stream conditions and copy↩

        the sequence of match bytes from sliding window to current position↩

        in window.↩

  Go to begin of loop↩

}↩

↩

The reference implementation of LZMA decoder uses "unpackSize" variable↩

to keep the number of remaining bytes in output stream. So it reduces ↩

"unpackSize" value after each decoded LITERAL or MATCH.↩

↩

The following code contains the "end of stream" condition check at the start↩

of the loop:↩

↩

    if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory)↩

      if (RangeDec.IsFinishedOK())↩

        return LZMA_RES_FINISHED_WITHOUT_MARKER;↩

↩

LZMA uses three types of matches:↩

↩

1) "Simple Match" -     the match with distance value encoded with bit models.↩

↩

2) "Rep Match" -        the match that uses the distance from distance↩

                        history table.↩

↩

3) "Short Rep Match" -  the match of single byte length, that uses the latest ↩

                        distance from distance history table.↩

↩

The LZMA decoder keeps the history of latest 4 match distances that were used ↩

by decoder. That set of 4 variables contains zero-based match distances and ↩

these variables are initialized with zero values:↩

↩

  UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0;↩

↩

The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:↩

↩

#define kNumStates 12↩

#define kNumPosBitsMax 4↩

↩

  CProb IsMatch[kNumStates << kNumPosBitsMax];↩

  CProb IsRep[kNumStates];↩

  CProb IsRepG0[kNumStates];↩

  CProb IsRepG1[kNumStates];↩

  CProb IsRepG2[kNumStates];↩

  CProb IsRep0Long[kNumStates << kNumPosBitsMax];↩

↩

The decoder uses "state" variable value to select exact variable ↩

from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays.↩

The "state" variable can get the value from 0 to 11.↩

Initial value for "state" variable is zero:↩

↩

  unsigned state = 0;↩

↩

The "state" variable is updated after each LITERAL or MATCH with one of the↩

following functions:↩

↩

unsigned UpdateState_Literal(unsigned state)↩

{↩

  if (state < 4) return 0;↩

  else if (state < 10) return state - 3;↩

  else return state - 6;↩

}↩

unsigned UpdateState_Match   (unsigned state) { return state < 7 ? 7 : 10; }↩

unsigned UpdateState_Rep     (unsigned state) { return state < 7 ? 8 : 11; }↩

unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }↩

↩

The decoder calculates "state2" variable value to select exact variable from ↩

"IsMatch" and "IsRep0Long" arrays:↩

↩

unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);↩

unsigned state2 = (state << kNumPosBitsMax) + posState;↩

↩

The decoder uses the following code flow scheme to select exact ↩

type of LITERAL or MATCH:↩

↩

IsMatch[state2] decode↩

  0 - the Literal↩

  1 - the Match↩

    IsRep[state] decode↩

      0 - Simple Match↩

      1 - Rep Match↩

        IsRepG0[state] decode↩

          0 - the distance is rep0↩

            IsRep0Long[state2] decode↩

              0 - Short Rep Match↩

              1 - Rep Match 0↩

          1 - ↩

            IsRepG1[state] decode↩

              0 - Rep Match 1↩

              1 - ↩

                IsRepG2[state] decode↩

                  0 - Rep Match 2↩

                  1 - Rep Match 3↩

↩

↩

LITERAL symbol↩

--------------↩

If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type.↩

↩

At first the LZMA decoder must check that it doesn't exceed ↩

specified uncompressed size:↩

↩

      if (unpackSizeDefined && unpackSize == 0)↩

        return LZMA_RES_ERROR;↩

↩

Then it decodes literal value and puts it to sliding window:↩

↩

      DecodeLiteral(state, rep0);↩

↩

Then the decoder must update the "state" value and "unpackSize" value;↩

↩

      state = UpdateState_Literal(state);↩

      unpackSize--;↩

↩

Then the decoder must go to the begin of main loop to decode next Match or Literal.↩

↩

↩

Simple Match↩

------------↩

↩

If the value "1" was decoded with IsMatch[state2] decoding,↩

we have the "Simple Match" type.↩

↩

The distance history table is updated with the following scheme:↩

↩

      rep3 = rep2;↩

      rep2 = rep1;↩

      rep1 = rep0;↩

↩

The zero-based length is decoded with "LenDecoder":↩

↩

      len = LenDecoder.Decode(&RangeDec, posState);↩

↩

The state is update with UpdateState_Match function:↩

↩

      state = UpdateState_Match(state);↩

↩

and the new "rep0" value is decoded with DecodeDistance:↩

↩

      rep0 = DecodeDistance(len);↩

↩

That "rep0" will be used as zero-based distance for current match.↩

↩

If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have ↩

"End of stream" marker, so we can stop decoding and check finishing ↩

condition in Range Decoder:↩

↩

      if (rep0 == 0xFFFFFFFF)↩

        return RangeDec.IsFinishedOK() ?↩

            LZMA_RES_FINISHED_WITH_MARKER :↩

            LZMA_RES_ERROR;↩

↩

If uncompressed size is defined, LZMA decoder must check that it doesn't ↩

exceed that specified uncompressed size:↩

↩

      if (unpackSizeDefined && unpackSize == 0)↩

        return LZMA_RES_ERROR;↩

↩

Also the decoder must check that "rep0" value is not larger than dictionary size↩

and is not larger than the number of already decoded bytes:↩

↩

      if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0))↩

        return LZMA_RES_ERROR;↩

↩

Then the decoder must copy match bytes as described in ↩

"The match symbols copying" section.↩

↩

↩

Rep Match↩

---------↩

↩

If the LZMA decoder has decoded the value "1" with IsRep[state] variable,↩

we have "Rep Match" type.↩

↩

At first the LZMA decoder must check that it doesn't exceed ↩

specified uncompressed size:↩

↩

      if (unpackSizeDefined && unpackSize == 0)↩

        return LZMA_RES_ERROR;↩

↩

Also the decoder must return error, if the LZ window is empty:↩

↩

      if (OutWindow.IsEmpty())↩

        return LZMA_RES_ERROR;↩

↩

If the match type is "Rep Match", the decoder uses one of the 4 variables of↩

distance history table to get the value of distance for current match.↩

And there are 4 corresponding ways of decoding flow. ↩

↩

The decoder updates the distance history with the following scheme ↩

depending from type of match:↩

↩

- "Rep Match 0" or "Short Rep Match":↩

      ; LZMA doesn't update the distance history    ↩

↩

- "Rep Match 1":↩

      UInt32 dist = rep1;↩

      rep1 = rep0;↩

      rep0 = dist;↩

↩

- "Rep Match 2":↩

      UInt32 dist = rep2;↩

      rep2 = rep1;↩

      rep1 = rep0;↩

      rep0 = dist;↩

↩

- "Rep Match 3":↩

      UInt32 dist = rep3;↩

      rep3 = rep2;↩

      rep2 = rep1;↩

      rep1 = rep0;↩

      rep0 = dist;↩

↩

Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long",↩

"IsRepG1", "IsRepG2".↩

↩

If the subtype is "Short Rep Match", the decoder updates the state, puts ↩

the one byte from window to current position in window and goes to next ↩

MATCH/LITERAL symbol (the begin of main loop):↩

↩

          state = UpdateState_ShortRep(state);↩

          OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));↩

          unpackSize--;↩

          continue;↩

↩

In other cases (Rep Match 0/1/2/3), it decodes the zero-based ↩

length of match with "RepLenDecoder" decoder:↩

↩

      len = RepLenDecoder.Decode(&RangeDec, posState);↩

↩

Then it updates the state:↩

↩

      state = UpdateState_Rep(state);↩

↩

Then the decoder must copy match bytes as described in ↩

"The Match symbols copying" section.↩

↩

↩

The match symbols copying↩

-------------------------↩

↩

If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must↩

copy the sequence of bytes with calculated match distance and match length.↩

If uncompressed size is defined, LZMA decoder must check that it doesn't ↩

exceed that specified uncompressed size:↩

↩

    len += kMatchMinLen;↩

    bool isError = false;↩

    if (unpackSizeDefined && unpackSize < len)↩

    {↩

      len = (unsigned)unpackSize;↩

      isError = true;↩

    }↩

    OutWindow.CopyMatch(rep0 + 1, len);↩

    unpackSize -= len;↩

    if (isError)↩

      return LZMA_RES_ERROR;↩

↩

Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL.↩

↩

↩

↩

NOTES↩

-----↩

↩

This specification doesn't describe the variant of decoder implementation ↩

that supports partial decoding. Such partial decoding case can require some ↩

changes in "end of stream" condition checks code. Also such code ↩

can use additional status codes, returned by decoder.↩

↩

This specification uses C++ code with templates to simplify describing.↩

The optimized version of LZMA decoder doesn't need templates.↩

Such optimized version can use just two arrays of CProb variables:↩

  1) The dynamic array of CProb variables allocated for the Literal Decoder.↩

  2) The one common array that contains all other CProb variables.↩

↩

↩

References:      ↩

↩

1. G. N. N. Martin, Range encoding: an algorithm for removing redundancy ↩

   from a digitized message, Video & Data Recording Conference, ↩

   Southampton, UK, July 24-27, 1979.↩