Struct encoding_rs::Decoder
pub struct Decoder { /* fields omitted */ }
A converter that decodes a byte stream into Unicode according to a character encoding in a streaming (incremental) manner.
The various decode_* methods take an input buffer (src) and an output buffer (dst), both of which are caller-allocated. There are variants for both UTF-8 and UTF-16 output buffers.
A decode_* method decodes bytes from src into Unicode characters stored into dst until one of the following three things happens:
- A malformed byte sequence is encountered (*_without_replacement variants only).
- The output buffer has been filled so near capacity that the decoder cannot be sure that processing an additional byte of input wouldn’t cause so much output that the output buffer would overflow.
- All the input bytes have been processed.
The decode_* method then returns a tuple consisting of a status indicating which one of the three reasons to return happened, how many input bytes were read, how many output code units (u8 when decoding into UTF-8 and u16 when decoding to UTF-16) were written (except when decoding into String, whose length change indicates this), and, in the case of the variants performing replacement, a boolean indicating whether an error was replaced with the REPLACEMENT CHARACTER during the call.
The number of bytes “written” is what’s logically written. Garbage may be written in the output buffer beyond the point logically written to. Therefore, if you wish to decode into an &mut str, you should use the methods that take an &mut str argument instead of the ones that take an &mut [u8] argument. The former take care of overwriting the trailing garbage to ensure the UTF-8 validity of the &mut str as a whole, but the latter don’t.
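The difference between the logically written prefix and the rest of the buffer can be illustrated without the library. The sketch below does not call encoding_rs: it simulates a call that reported `written` bytes into a reused buffer that still holds stale, non-UTF-8 bytes. The helper name `written_prefix` and the buffer contents are made up for illustration.

```rust
use std::str;

/// Hypothetical helper (not part of encoding_rs): interpret only the
/// logically written prefix of an output buffer as UTF-8 text.
fn written_prefix(dst: &[u8], written: usize) -> Result<&str, str::Utf8Error> {
    str::from_utf8(&dst[..written])
}

fn main() {
    // Pretend this is a reused output buffer. 0xFF never occurs in valid UTF-8,
    // so it stands in for "garbage" beyond the logical end of the output.
    let mut dst = [0xFFu8; 8];

    // Simulate a decode call that logically wrote "hé" (3 bytes in UTF-8).
    let text = "hé".as_bytes();
    dst[..text.len()].copy_from_slice(text);
    let written = text.len();

    // The prefix up to `written` is valid text...
    assert_eq!(written_prefix(&dst, written), Ok("hé"));
    // ...but the buffer as a whole is not valid UTF-8.
    assert!(str::from_utf8(&dst).is_err());

    println!("{}", written_prefix(&dst, written).unwrap());
}
```

With an &mut [u8] output buffer, slicing to the reported count as above is the caller's job; the &mut str variants exist so that the whole buffer stays valid without that step.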
In the case of the *_without_replacement variants, the status is a DecoderResult enumeration (possibilities Malformed, OutputFull and InputEmpty corresponding to the three cases listed above).
In the case of methods whose name does not end with *_without_replacement, malformed sequences are automatically replaced with the REPLACEMENT CHARACTER and errors do not cause the methods to return early.
When decoding to UTF-8, the output buffer must have at least 4 bytes of space. When decoding to UTF-16, the output buffer must have at least two UTF-16 code units (u16) of space.
When decoding to UTF-8 without replacement, the methods are guaranteed not to return indicating that more output space is needed if the length of the output buffer is at least the length returned by max_utf8_buffer_length_without_replacement(). When decoding to UTF-8 with replacement, the same guarantee holds if the output buffer is at least as long as the value returned by max_utf8_buffer_length(). When decoding to UTF-16 with or without replacement, it holds if the output buffer is at least as long as the value returned by max_utf16_buffer_length().
The output written into dst is guaranteed to be valid UTF-8 or UTF-16, and the output after each decode_* call is guaranteed to consist of complete characters. (I.e. the code unit sequence for the last character is guaranteed not to be split across output buffers.)
The boolean argument last indicates that the end of the stream is reached when all the bytes in src have been consumed.
A Decoder object can be used to incrementally decode a byte stream.
During the processing of a single stream, the caller must call decode_* zero or more times with last set to false and then call decode_* at least once with last set to true. If decode_* returns InputEmpty, the processing of the stream has ended. Otherwise, the caller must call decode_* again with last set to true (or treat a Malformed result as a fatal error).
Once the stream has ended, the Decoder object must not be used anymore. That is, you need to create another one to process another stream.
When the decoder returns OutputFull or the decoder returns Malformed and the caller does not wish to treat it as a fatal error, the input buffer src may not have been completely consumed. In that case, the caller must pass the unconsumed contents of src to decode_* again upon the next call.
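The calling convention described above (call with last set to false for every chunk but the final one, re-pass unconsumed input after OutputFull, stop on InputEmpty) can be sketched with a stand-in decoder. The code below does not use encoding_rs: `ToyDecoder`, `ToyResult` and `decode_stream` are hypothetical names, and the toy only handles ISO-8859-1 input, which has no malformed sequences, so there is no Malformed case to handle.

```rust
/// Mirrors two of the statuses described above. The toy encoding has no
/// malformed sequences, so there is no `Malformed` variant here.
#[derive(Debug, PartialEq)]
enum ToyResult {
    InputEmpty,
    OutputFull,
}

/// Hypothetical stand-in decoder: ISO-8859-1 bytes to UTF-8.
/// NOT the encoding_rs type; it only imitates the calling convention.
struct ToyDecoder;

impl ToyDecoder {
    /// Returns (status, bytes read from `src`, bytes written to `dst`).
    /// `_last` mirrors the call shape; this stateless toy has nothing to
    /// flush at the end of the stream.
    fn decode_to_utf8(
        &mut self,
        src: &[u8],
        dst: &mut [u8],
        _last: bool,
    ) -> (ToyResult, usize, usize) {
        let (mut read, mut written) = (0usize, 0usize);
        for &b in src {
            // Stop early unless the worst case (2 output bytes) is sure to fit.
            if dst.len() - written < 2 {
                return (ToyResult::OutputFull, read, written);
            }
            let n = char::from(b).encode_utf8(&mut dst[written..]).len();
            written += n;
            read += 1;
        }
        (ToyResult::InputEmpty, read, written)
    }
}

/// Drives the toy decoder over `chunks` with a fixed-size output buffer.
fn decode_stream(chunks: &[&[u8]], out_buf_len: usize) -> String {
    // A buffer smaller than one character of output would loop forever.
    assert!(out_buf_len >= 2);
    let mut decoder = ToyDecoder;
    let mut dst = vec![0u8; out_buf_len];
    let mut out = String::new();
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i + 1 == chunks.len();
        let mut src: &[u8] = *chunk;
        loop {
            let (result, read, written) = decoder.decode_to_utf8(src, &mut dst, last);
            // Only the logically written prefix is meaningful.
            out.push_str(std::str::from_utf8(&dst[..written]).unwrap());
            // Keep the unconsumed tail for the next call.
            src = &src[read..];
            match result {
                ToyResult::InputEmpty => break,
                ToyResult::OutputFull => continue,
            }
        }
    }
    out
}

fn main() {
    // "café" in ISO-8859-1, split into two chunks.
    let chunks: [&[u8]; 2] = [b"ca", b"f\xE9"];
    let text = decode_stream(&chunks, 4);
    assert_eq!(text, "café");
    println!("{}", text);
}
```

With a real Decoder, the same loop shape applies. A without-replacement variant would add a third match arm for Malformed, where the caller either stops or records the error and continues with the unconsumed input.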
Infinite loops
When converting with a fixed-size output buffer whose size is too small to accommodate one character or (when applicable) one numeric character reference of output, an infinite loop ensues. When converting with a fixed-size output buffer, it generally makes sense to make the buffer fairly large (e.g. a couple of kilobytes).
Implementations
The Encoding this Decoder is for.
BOM sniffing can change the return value of this method during the life of the decoder.
Available via the C wrapper.
Query the worst-case UTF-8 output size with replacement.
Returns the size of the output buffer in UTF-8 code units (u8) that will not overflow given the current state of the decoder and byte_length number of additional input bytes when decoding with errors handled by outputting a REPLACEMENT CHARACTER for each malformed sequence, or None if usize would overflow.
Available via the C wrapper.
Query the worst-case UTF-8 output size without replacement.
Returns the size of the output buffer in UTF-8 code units (u8) that will not overflow given the current state of the decoder and byte_length number of additional input bytes when decoding without replacement error handling, or None if usize would overflow.
Note that this value may be too small for the _with_replacement case. Use max_utf8_buffer_length() for that case.
Available via the C wrapper.
Incrementally decode a byte stream into UTF-8 with malformed sequences replaced with the REPLACEMENT CHARACTER.
See the documentation of the struct for documentation for decode_* methods collectively.
Available via the C wrapper.
Incrementally decode a byte stream into UTF-8 with malformed sequences replaced with the REPLACEMENT CHARACTER with type system signaling of UTF-8 validity.
This method calls decode_to_utf8 and then zeroes out up to three bytes that aren’t logically part of the write in order to retain the UTF-8 validity even for the unwritten part of the buffer.
See the documentation of the struct for documentation for decode_* methods collectively.
Available to Rust only.
Incrementally decode a byte stream into UTF-8 with malformed sequences replaced with the REPLACEMENT CHARACTER using a String receiver.
Like the others, this method follows the logic that the output buffer is caller-allocated. This method treats the capacity of the String as the output limit. That is, this method guarantees not to cause a reallocation of the backing buffer of String.
The return value is a tuple that contains the CoderResult, the number of bytes read and a boolean indicating whether replacements were done. The number of bytes written is signaled via the length of the String changing.
See the documentation of the struct for documentation for decode_* methods collectively.
Available to Rust only.
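The capacity-as-limit convention described above for the String receiver can be modelled with the standard library alone. The sketch below is not the library's implementation: `push_within_capacity` is a hypothetical helper that appends only as many whole characters as fit in the spare capacity, so the amount written is read off the change in length and the allocation never grows.

```rust
/// Hypothetical helper (not part of encoding_rs): append as many whole
/// characters of `text` as fit in the spare capacity of `dst`, never growing
/// its allocation. Returns the number of bytes of `text` consumed.
fn push_within_capacity(dst: &mut String, text: &str) -> usize {
    let mut consumed = 0;
    for ch in text.chars() {
        if dst.capacity() - dst.len() < ch.len_utf8() {
            break; // no room left: the analogue of an "output full" status
        }
        dst.push(ch);
        consumed += ch.len_utf8();
    }
    consumed
}

fn main() {
    // The caller decides the output limit by choosing the capacity up front.
    let mut dst = String::with_capacity(4);
    let capacity_before = dst.capacity();
    let len_before = dst.len();

    let consumed = push_within_capacity(&mut dst, "héllo wörld, a longer text than fits");

    // The amount written is the change in length, not a separate return value.
    let written = dst.len() - len_before;
    assert_eq!(written, consumed);
    // The backing buffer was not reallocated.
    assert_eq!(dst.capacity(), capacity_before);

    println!("appended {} bytes: {:?}", written, dst);
}
```

A caller who wants more output per call would reserve additional capacity on the String before calling, rather than expecting the method to grow it.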
Incrementally decode a byte stream into UTF-8 without replacement.
See the documentation of the struct for documentation for decode_* methods collectively.
Available via the C wrapper.
Incrementally decode a byte stream into UTF-8 with type system signaling of UTF-8 validity.
This method calls decode_to_utf8_without_replacement and then zeroes out up to three bytes that aren’t logically part of the write in order to retain the UTF-8 validity even for the unwritten part of the buffer.
See the documentation of the struct for documentation for decode_* methods collectively.
Available to Rust only.
Incrementally decode a byte stream into UTF-8 using a String receiver.
Like the others, this method follows the logic that the output buffer is caller-allocated. This method treats the capacity of the String as the output limit. That is, this method guarantees not to cause a reallocation of the backing buffer of String.
The return value is a pair that contains the DecoderResult and the number of bytes read. The number of bytes written is signaled via the length of the String changing.
See the documentation of the struct for documentation for decode_* methods collectively.
Available to Rust only.
Query the worst-case UTF-16 output size (with or without replacement).
Returns the size of the output buffer in UTF-16 code units (u16) that will not overflow given the current state of the decoder and byte_length number of additional input bytes, or None if usize would overflow.
Since the REPLACEMENT CHARACTER fits into one UTF-16 code unit, the return value of this method applies also in the _without_replacement case.
Available via the C wrapper.
Incrementally decode a byte stream into UTF-16 with malformed sequences replaced with the REPLACEMENT CHARACTER.
See the documentation of the struct for documentation for decode_* methods collectively.
Available via the C wrapper.
Incrementally decode a byte stream into UTF-16 without replacement.
See the documentation of the struct for documentation for decode_* methods collectively.
Available via the C wrapper.
Checks for compatibility with storing Unicode scalar values as unsigned bytes taking into account the state of the decoder.
Returns None if the decoder is not in a neutral state, including waiting for the BOM, or if the encoding is never Latin1-byte-compatible.
Otherwise returns the index of the first byte whose unsigned value doesn’t directly correspond to the decoded Unicode scalar value, or the length of the input if all bytes in the input decode directly to scalar values corresponding to the unsigned byte values.
Does not change the state of the decoder.
Do not use this unless you are supporting SpiderMonkey/V8-style string storage optimizations.
Available via the C wrapper.
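The "index of the first byte whose unsigned value doesn’t directly correspond to the decoded Unicode scalar value" rule can be modelled as follows. This is a sketch of the rule only, not the library's code: `byte_compatible_up_to` and `toy_decode` are hypothetical, the toy single-byte encoding is invented for the example, and the None cases (decoder not in a neutral state, encoding never compatible) are not modelled.

```rust
/// Hypothetical model of the rule: the index of the first byte whose value
/// differs from the scalar value it decodes to, or the input length if every
/// byte decodes to the scalar value equal to its own unsigned value.
fn byte_compatible_up_to(bytes: &[u8], decode_byte: impl Fn(u8) -> char) -> usize {
    bytes
        .iter()
        .position(|&b| u32::from(decode_byte(b)) != u32::from(b))
        .unwrap_or(bytes.len())
}

/// Toy single-byte encoding, made up for this example: identical to
/// ISO-8859-1 except that 0x80 decodes to U+20AC (EURO SIGN).
fn toy_decode(b: u8) -> char {
    if b == 0x80 {
        '\u{20AC}'
    } else {
        char::from(b)
    }
}

fn main() {
    // Every byte maps to the scalar value with the same number: whole input.
    assert_eq!(byte_compatible_up_to(b"caf\xE9", toy_decode), 4);
    // 0x80 decodes to U+20AC, so compatibility ends at index 2.
    assert_eq!(byte_compatible_up_to(b"5 \x80 each", toy_decode), 2);
}
```

A caller using such a result would store the compatible prefix byte-for-byte and fall back to a regular decode_* call for the remainder.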